It rained heavily in Chandler yesterday, and our front yard looked a whole lot nicer than its usual dusty self. That somehow made me think of the new look and feel of the world of data I live in as a professional.
For those of us who grew up implementing big data warehousing projects, it should not really be a shock to look back and realize that most DW projects started because the business had an analytics problem, but in the end 90% of the effort was spent on the plumbing – the management of data (ETL, data modeling and so on) – and only 10% on actual analytics (or even just basic reporting in many cases).
This is true not just in design and build – it’s the case with supporting and maintaining the data warehouse too. Companies have spent countless dollars on DW implementations and no one is truly happy about it. Yet no one I know has any plans to fully replace their DW implementations either (and keeping them is, of course, the right thing to do).
Along came “big data”, promising to make life better for everyone and setting very high expectations. The vast majority of customer executives I speak to think of big data as the answer to their analytics problems. Even among the CIO community, very few realize that most of the conversation they have heard is about the data management aspects (the 3V model of volume, velocity and variety is familiar to everyone, and it’s about data management, not analytics). So in the past few years, I have seen several of my clients jump into big data initiatives to accelerate the realization of their analytics needs.
The fall from grace is rather rapid – mostly because of unrealistic expectations. To begin with, the minimum requirement for big data projects in many cases is to meet the SLAs of the existing data warehouses and data marts. It doesn’t take too long to realize that ain’t gonna happen.
Then comes the déjà vu realization that big data projects also need most of their time spent on ETL, just like data warehouses did in the past. This usually leads to a quick reduction in the scope of the projects – typically by eliminating the sources of data that are more complex or less clean – and of course this means analytics is compromised too.
Finally, the reality of “data lakes need a lot of curation” kicks in. No company has enough manpower to curate all the data that it needs for analysis. And at some point, the data lake just becomes a data dump, with the idea that “curation can wait while we figure out what we need to analyze”. That is rarely practical – data scientists won’t always know the context of the data unless an expert curated it beforehand. And the world doesn’t have enough data scientists today to have them spend most of their time on data cleansing.
Until AI/cognitive capabilities take the stress of curation away, I think analytics will continue to get shortchanged, and the promise of big data (and specifically data lakes) delivering powerful analytics for business users will not exactly work as advertised.
It’s not all gloomy though. Customers who start small with well-defined analytics requirements have already started realizing benefits from their big data investments. They don’t take a “build it, and they will come” approach. They just build intelligently as requirements come up and plan to have more comprehensive solutions down the road. They value business flexibility and agility over technical elegance. Many of them have taken the time to formulate a strategy and a roadmap for what they want to do – leading with analytics that satisfy specific business requirements and working back to data management, not the other way around.
Of course we need both – but it’s time we put the horse (analytics) in front of the cart (data management).
10 thoughts on “Big Data: it’s about time we stopped putting the cart before the horse”
One quick question – why isn’t replacing the DWH (e.g. SAP BW) with big data looked at as an option?
For several reasons. For example, a DWH can be highly optimized for answering the same question repeatedly. And data – say, customer master – can be governed more comprehensively in a rigid environment like a DW.
I should have said “99% based on services” as Cloudera and MapR have proprietary components which require licensing. The same reasoning still applies, though.
A particular problem I have with how big data is approached today is that the Hadoop monetization model is 100% based on services, and hence it’s unlikely that we’ll see tools that make the “curation” easier, as those thousands of man-days of services are exactly what the distro companies are aiming for. Hadoop will get better and more efficient by the day, but it won’t necessarily (and probably won’t at all) get easier to use and more user-friendly.
Don’t get me wrong, I love open source and it’s a fact that distributed environments are the future of data management. However, until a new financial model is proposed, I don’t see the leap you’re talking about happening. Most likely, for it to happen, another company will need to create a layer of simple, user-friendly management & modeling tools on top of Hadoop or another distributed engine altogether. If this company is successful in creating this abstraction layer, it’ll be less and less mandatory for the underlying layer to be Hadoop, as I’m sure most companies will stay faithful to it only until they realize how complex it is to develop & maintain stuff on it.
It’s more or less what BI & DWs did with databases a decade or 15 years ago. With distributed engines, however, it’s a bit more complicated, since partitioning adds another degree of freedom to the equation. But a solution is still possible nevertheless.
CDH makes money on SW. So does MapR. Horton is the only one that is purely support and services.
Thank you. Please keep writing more and more such blogs. Of late, we are missing them 😦 Whenever possible, please talk sometime about cognitive analytics.
Not inspired enough to write these days – but I do have some thoughts on cognitive that I will jot down soon.
Great point. Always need to find the right balance between vision and practical approach.
yes – always critical to strike that balance
Great point – especially in cloud SaaS solutions.