We all love our jargon and buzzwords, don’t we? Of late, I get pulled into a lot of discussions on “data lakes”. People are either huge fans of the data lake approach, or they are super negative. And that is what makes these discussions a lot of fun.
I did a Google search on what exactly a data lake is and got this as the number one hit: “A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.” If we stick to this definition (which I like), it also makes it easy to understand the tradeoffs of this approach.
The major benefit of this approach is that you reduce the upfront costs associated with transformation. Instead of ETL (extract-transform-load), we just extract and load, and leave the transform for some other time. On the bright side, with a data lake we can cross our hearts and honestly say “no more silos of data” and “one repository to rule them all”. This has its advantages: better deals on HW and SW costs, and potentially some breathing space to stagger the analytics projects that make use of this data.
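To make the difference concrete, here is a minimal sketch of the two approaches. Everything in it is made up for illustration: the sample records, the field names (`cust`, `amt`, `ts`) and the function names are hypothetical and not taken from any particular product.

```python
import json

# Hypothetical raw records from a source system (illustrative data only).
raw_records = [
    '{"cust": "A17", "amt": "120.50", "ts": "2015-03-01"}',
    '{"cust": "B02", "amt": "75", "ts": "2015-03-02"}',
]


def transform(line):
    """Parse one raw record into a typed structure with meaningful names."""
    rec = json.loads(line)
    return {
        "customer_id": rec["cust"],
        "amount": float(rec["amt"]),
        "date": rec["ts"],
    }


def etl(lines):
    """ETL: pay the transformation cost before anything is stored."""
    return [transform(line) for line in lines]


def extract_and_load(lines):
    """Data lake style: store the records untouched, in their native format."""
    return list(lines)


def transform_on_read(lake):
    """The deferred 'T': it still has to happen before any analysis."""
    return [transform(line) for line in lake]


warehouse = etl(raw_records)
lake = extract_and_load(raw_records)
```

Loading the lake is trivially cheap, but `transform_on_read` still needs someone who knows what `cust` and `amt` mean. The work has moved, not disappeared.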
The whole reason for dumping all data into a data lake is to analyze it at some point. For analysis, we need to know the meaning of the data that we are trying to crunch. By the time we get around to that analysis, in all probability we cannot make sense of what the data is anymore. Even in the tightly controlled world of data warehouses, we spend a lot of time on rework when a new data source is introduced. We run to the people who manage the source systems and try to establish rules and governance so that there are fewer errors in steady state and data can be meaningfully combined.
So while data lakes definitely serve a good purpose, there is an element of “kicking the can down the road” that often comes with them. A data lake that is meant to prevent data silos could just end up enabling silos at scale when there is no governance around it. This is the part of the message that marketing peeps usually don’t say out loud. Perhaps they don’t say it because once governance is established, data lakes might look very similar to the now uncool data warehouses, maybe not physically, but certainly logically.
Data lakes are quite useful as long as the expectations are correctly set. I would not shy away from them. Just keep the expectations real on the trade-offs. I was joking with someone that all the fans of data lakes should be made to work on a data curation project for some time on a mandatory basis. That will give them a much more balanced perspective!
This is my second attempt at writing this blog. My (much longer) first attempt last night had to be deleted 🙂
11 thoughts on “Before you take a dip in a data lake”
Vijay, love these blogs… very simple yet profound. In my humble opinion, I would lean towards a data lake and store all kinds of data that various business transactions and other sources generate, even if at this time I am kicking the can down the road. With analytics expanding its horizons manyfold every moment, data that does not make sense to me today may be a very valuable element tomorrow. I would not want to regret it, or spend more time and money at a later date to collect, extract or recreate that data. Also, with storage becoming cheaper and more efficient, storing extra bytes does not really break the bank. However, it is imperative that this stored data is properly cataloged and inventoried for future use by different individuals.
It looks more like a safety/backup initiative for analysis than a serious purpose, unless reasonable use cases are defined and expectation frameworks are set upfront.
Adaptive warehouses are available to re-transform the data for current business needs.
Data quality issues, when not understood and fixed early, may prove detrimental to future usage of the data.
With people churn, who would know what is in the data lake and how to leverage most of the data?
Vijay: Good overview of data lakes. However, it shrinks their intended purpose to a single thread. A data lake does not just store all data, but should also be able to run analytics without moving data, and should be self-referential. I agree it was jargon, but I think it makes sense as it enables the full analytics life cycle by bringing data, analytics and applications into a single ecosystem.
If we store data without transformations, then of course analytics without moving it is really hard.
Vijay, great perspective on Enterprise Data Lakes. I have been in the thick of these discussions myself: ETL vs ELT et al. A lot of platform vendors selling the Enterprise Data Lake concept are telling business users that they don’t need to depend on IT to get access to insights from their (own) data. But we, who have done IT for a long time, know that data without context does not serve a useful purpose. Somebody (whether it is IT in the LOB, centralized IT, a business analyst or, doubtfully, a business user) has to do the data mapping from the source system. So I agree with you that it is like “kicking the can down the road”. Nevertheless, the data is put in a central location, and due to advancements in technology (Hadoop, NoSQL, in-memory/on-chip computing etc.), the ELT part probably offers faster time-to-value compared to old legacy data warehouses and data marts. Happy to discuss this with you more when you can talk live.
Let’s chat about this the next time we connect. Very curious to hear your thoughts, Vineet.
Good one. New jargon every few months.
IT has always been a jargon-friendly ecosystem 🙂
What a brilliant excuse for selling enormous systems 😉
Do we really need an excuse? 🙂