What comes to mind when you think of a lake? Serene blue waters? A large but limited reservoir? Or perhaps a stagnant mass of water that does not move?
Now, what would you think of if we asked you to think of data as a lake?
In 2011, James Dixon, the CTO at software vendor Pentaho, coined the term ‘data lake.’ The term is now floating around the big data space, courtesy of various vendors, and variants like ‘business data lake’ and ‘enterprise data lake’ have already entered the marketplace.
So, what is a data lake?
Well, the data lake is the idea that all enterprise data is stored in Hadoop, from where all business applications can access and use it. In this view, it replaces the need for data warehouses, data marts, and, gradually, all operational databases.
Is that a good thing?
A data lake is basically a storage ground for all sorts of data – structured, semi-structured, or unstructured – with no rules about how that data is governed, defined, or secured. It aims to solve two problems: first, the existence of several independent collections of data, or information silos; and second, the need of big data initiatives for unrestricted data. In the first case, collecting all data in one unmanaged lake increases and varies sharing while reducing license and server costs. In the second, big data initiatives need access to data that is not bounded by predefined algorithms or structures if they are to produce worthwhile insights. A data lake answers a big data query from all of its data, and that is where buried insights, the main reason for involving big data, come from.
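To make the ‘store everything, decide the structure later’ idea concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions: the in-memory list standing in for the lake, the `ingest` and `read_customers` helpers, and the sample records are all hypothetical, and a real data lake would sit on something like Hadoop rather than a Python list.

```python
import csv
import io
import json

# Hypothetical stand-in for a data lake: raw payloads kept exactly as received,
# tagged only with the format they arrived in.
lake = []


def ingest(fmt, payload):
    """Store the payload untouched; no schema or rule is enforced on write."""
    lake.append({"format": fmt, "payload": payload})


def read_customers():
    """Apply structure only at read time, for the formats this query understands."""
    names = []
    for item in lake:
        if item["format"] == "json":
            record = json.loads(item["payload"])
            if "customer" in record:
                names.append(record["customer"])
        elif item["format"] == "csv":
            for row in csv.DictReader(io.StringIO(item["payload"])):
                if "customer" in row:
                    names.append(row["customer"])
        # Free text stays in the lake, but this query has no way to interpret it.
    return names


# Structured, semi-structured, and unstructured data all land in the same place.
ingest("json", '{"customer": "Acme", "order_total": 120.5}')
ingest("csv", "customer,region\nGlobex,EMEA\n")
ingest("text", "Call notes: Initech asked about pricing.")

print(read_customers())
```

Notice that nothing stops the free-text note from being stored, yet the query can only use what it knows how to interpret. Every new question needs its own reading logic, because the lake itself defines nothing.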
But is the ‘sans constraint’ model worth the hype?
The verdict is not yet in. However, experts have voiced their opinions.
Barry Devlin, an expert consultant and lecturer on business insight and data warehousing and the author of Data Warehouse: From Architecture to Implementation, has some issues with the term itself. He suspects that the idea of the data lake came up because ‘it was to contrast with the highly structured, well-organized image we have of a data warehouse.’ According to him, what will work is a combination of highly structured and agile data environments that are each optimized for particular needs but are also interlinked through assimilation processes or metadata.
On the other hand, Andrew White, VP and distinguished analyst at Gartner, says, “The need for increased agility and accessibility for data analysis is the primary driver for data lakes.” However, he maintains that although data lakes can be of much value to different parts of an organization, the idea of managing data across the whole enterprise has yet to come true. Data lakes benefit IT in the short term, because IT does not have to worry about how the data will be used and can simply dump everything in the lake. But, according to White, the absence of information governance will eventually leave the contents vague and irrelevant.
Well, a lot also depends on what the organization wants. If independent analysis is preferred, data lakes fit the bill nicely. However, formal analysis that generates value-based skills requires something more than the waves of a data lake.
What do you think?
Share your thoughts with us through the comments section.