We all love our jargon and buzzwords, don’t we? Of late, I get pulled into a lot of discussions on “data lakes”. People are either huge fans of the data lake approach, or they are super negative. And that is what makes these discussions a lot of fun.
I did a Google search on what exactly a data lake is and got this as the number one hit: “A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.” If we stick to this definition (which I like), it also makes it easy to understand the tradeoffs of this approach.
The major benefit of this approach is that you reduce the upfront costs associated with transformation. Instead of ETL (extract-transform-load), we just extract and load, and leave the transform for some other time. On the bright side, with a data lake we can cross our hearts and honestly say “no more silos of data” and “one repository to rule them all”. This has its advantages: better deals on hardware and software costs, and potentially some breathing space to stagger the analytics projects that make use of this data.
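To make the ETL versus extract-and-load difference concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the record layout, the field names and the functions `transform`, `etl_load` and `el_load` are my assumptions, not anything from a real product.

```python
import json

# Hypothetical raw records, as they might arrive from a source system.
RAW_RECORDS = [
    '{"cust": "A17", "amt": "19.99", "ts": "2015-03-01"}',
    '{"cust": "B02", "amt": "5.00", "ts": "2015-03-02"}',
]


def transform(raw_line):
    """Parse one raw record into a typed, renamed row (the 'T' in ETL)."""
    rec = json.loads(raw_line)
    return {
        "customer_id": rec["cust"],
        "amount": float(rec["amt"]),
        "date": rec["ts"],
    }


def etl_load(raw_lines):
    """Warehouse style: pay the transformation cost up front, store clean rows."""
    return [transform(line) for line in raw_lines]


def el_load(raw_lines):
    """Lake style: store the data untouched, in its native format."""
    return list(raw_lines)


warehouse = etl_load(RAW_RECORDS)
lake = el_load(RAW_RECORDS)

# The deferred transform still has to run before anyone can analyze the lake.
total_from_warehouse = sum(row["amount"] for row in warehouse)
total_from_lake = sum(transform(line)["amount"] for line in lake)
```

The load into the lake is trivially cheap, but `transform` does not go away. It just moves to the moment someone wants an answer, and by then somebody still has to know that `amt` is an amount and `cust` is a customer.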
The whole reason for dumping all data into a data lake is to analyze it at some point. For analysis, we need to know the meaning of the data that we are trying to crunch. By the time we get around to that analysis, in all probability we can no longer make sense of what the data is. Even in the tightly controlled world of data warehouses, we spend a lot of time on rework when a new data source is introduced. We run to the people who manage the source systems and try to establish rules and governance, so that there are fewer errors in steady state and data can be meaningfully combined.
So while data lakes definitely serve a good purpose, they often come with an element of “kicking the can down the road”. A data lake that is meant to prevent data silos could just end up enabling silos at scale when there is no governance around it. This is the part of the message that marketing peeps usually don’t say out loud. Perhaps they don’t say it because once governance is established, data lakes might look very similar to the now uncool data warehouses, maybe not physically, but certainly logically.
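As a rough illustration of what even a little governance could look like, here is a sketch of a lake that refuses to accept a dataset unless someone says what it is and who owns it. The class `GovernedLake`, the required fields and the dataset names are all hypothetical. It is a toy under my own assumptions, not how any particular data lake product works.

```python
# Hypothetical minimum metadata every dataset must carry before it is accepted.
REQUIRED_METADATA = ("owner", "description", "source_system")


class GovernedLake:
    """Toy in-memory 'lake' that stores raw data alongside its metadata."""

    def __init__(self):
        self.datasets = {}

    def load(self, name, raw_data, metadata):
        # Reject the load if any required field is absent or empty.
        missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
        if missing:
            raise ValueError(
                f"dataset {name!r} is missing metadata: {', '.join(missing)}"
            )
        # The data itself is stored as-is; only the description is enforced.
        self.datasets[name] = {"data": raw_data, "metadata": dict(metadata)}


lake = GovernedLake()
lake.load(
    "orders_raw",
    b"raw bytes in native format",
    {
        "owner": "sales-ops",
        "description": "Daily order export",
        "source_system": "order-db",
    },
)

try:
    # No one knows what this is or who owns it, so it is turned away.
    lake.load("mystery_dump", b"???", {})
    rejected = False
except ValueError:
    rejected = True
```

Notice that the data is still stored raw. But the moment you insist on owners, descriptions and source systems, you are already part of the way toward the kind of agreed definitions a warehouse forces on you.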
Data lakes are quite useful as long as expectations are set correctly. I would not shy away from them. Just keep the expectations real on the tradeoffs. I was joking with someone that all the fans of data lakes should be made to work on a data curation project for some time on a mandatory basis. That will give them a much more balanced perspective!
This is my second attempt at writing this post. My (much longer) first attempt last night had to be deleted 🙂