Big Data: Switching Between Top Dog And Fire Hydrant


Most of you know I am a huge fan of big data. In my mind, this is the new top dog in the enterprise software world. Companies can leap into a totally new world of insights, and in all probability big data will revolutionize how the business world makes use of information. I have not met a customer this year who does not have plans for some big data initiative. And no wonder: the vendor side of the world is all excited too. It is heartening for me to see both customers and vendors excited at the same time. That should curtail the hype cycle for big data.

Big data is capable of changing everything for the better: our notions of what a platform does and what an application does, and how rapidly data science advances in the academic world. The technology and the theory of big data are changing fast, both on the open source side and on the commercial vendor side. A variety of options are now available to customers for all aspects of big data. And while choice is good, it also tends to increase confusion. It is important to make sure that the bets on big data are made with eyes open, and not based on hype.

Big data has real benefits, and while the technology is evolving, there is already plenty available to make good use of for many use cases. Ergo, you can use it today. CFOs will always have budget to spend on making more money or reducing cost. But they won’t be happy to write a check with neither possibility in sight. So let’s try to keep big data in its top dog persona, and try to avoid the evil twin, the fire hydrant.

There are two general ways to use big data:

1. Find answers to questions you already have by sifting through a lot of data.

2. Keep looking at a lot of data, and see if you can spot something – without having a very specific question to ask upfront. Once you spot something, you start asking questions like in option 1.

Option 1 is mostly just an extension of traditional BI – but with more data, coming in at a higher speed and probably of more types than we have dealt with before. 

As I think through all those past BI projects (and the associated blood, sweat and tears), I can say with some certainty that most of the data pumped into data warehouses was never used. The vast majority of customers got most of their answers from roughly 30% or less of the data available to them. Of course it can be argued that they probably did not even know that the other 70% existed. But these are big companies with excellent top lines and bottom lines, and they believe they have a mature BI platform. So let’s say 50% of the data is useful instead of the 30% I estimated. Yet if you ask for the requirements for the next project at these customers, I will bet you dollars to doughnuts they will ask for everything, just like they did in the past. The fact that they don’t use half their data is of no consequence to them. That is how the BI world rolls. Now with big data, there is a chance that more of the previously unused data will be used to enhance the quality of insights.

Any big data solution built for a customer who follows the option 1 route needs very cheap storage, or a way to store only the useful information, or a BI layer that can sift through a lot of unnecessary data quickly to find answers. This also means BI practitioners need to do even more due diligence in figuring out requirements so that wastage is kept to a minimum. Easier said than done.

In our familiar world of data warehouses, there is some data duplication across layers for various reasons (performance, transformations and so on). When you think of big data, don’t assume redundant data just goes away magically. On the contrary, many big data solutions (including Hadoop) need redundant copies of data. Storage is getting cheaper, but you will need a LOT of it if you keep agreeing to user requirements like before. There are also trade-offs between using a SAN and using several machines networked together with their own (cheap) disks.
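To put a rough number on how redundant copies multiply the storage bill, here is a back-of-the-envelope sketch in Python. The function name and the 25% headroom figure are my own assumptions for illustration; the default of three copies matches the HDFS default replication factor, but your cluster may well be configured differently.

```python
# Rough storage estimate when every block is kept in several copies.
# All figures are illustrative assumptions, not measurements.

def raw_storage_tb(logical_tb: float, replication_factor: int = 3,
                   headroom: float = 0.25) -> float:
    """Return the raw disk (in TB) needed to hold `logical_tb` of data.

    replication_factor: copies kept of each block (3 is the HDFS default).
    headroom: extra fraction reserved for temp files and growth (assumed).
    """
    return logical_tb * replication_factor * (1 + headroom)

# 100 TB of "everything the users asked for" versus 50 TB of data
# that actually gets used.
print(raw_storage_tb(100))
print(raw_storage_tb(50))
```

Under these assumptions, every terabyte you agree to keep costs close to four terabytes of raw disk, which is why trimming the requirements pays off quickly.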

What about performance? You can of course get excellent performance, but that depends on the question. If a system is to use massively parallel processing, the question you ask has to be one where the data can be split into many chunks, all the chunks can be looked at in parallel, and the results can then be added up.
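The split, process in parallel, then add up pattern can be sketched in a few lines of Python. This is a toy illustration only: the sales list and the function names are made up, and the worker threads merely stand in for the separate machines a real MPP system would use (threads in standard Python will not speed up this kind of CPU work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split(records, n_chunks):
    """Cut the record list into at most n_chunks roughly equal slices."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division, min 1
    return [records[i:i + size] for i in range(0, len(records), size)]

def count_chunk(chunk):
    """Per-chunk work: count records per key, independent of other chunks."""
    return Counter(chunk)

def parallel_count(records, n_chunks=4):
    chunks = split(records, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total += partial  # the "add up the results" step
    return total

# Hypothetical day of sales, one record per item sold.
sales = ["vanilla", "chocolate", "vanilla", "mango", "chocolate", "vanilla"]
print(parallel_count(sales, n_chunks=3))
```

The important property is that `count_chunk` never needs to see another chunk, and that partial counts can be added in any order. Questions without that property do not fit this mold as neatly.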

If you ask “How many ice creams of each flavor did we sell today?” and you have a million transactions, you can easily do that chunking and aggregation. However, not all questions can be answered that way. What if I ask “What is the correlation between the brand of ice cream sold and the rise and fall of local daytime temperature?” This question is harder to split into many chunks because there are more variables. So while it can be computed, it is a fairly serial process. Now of course you can try some other way of solving this, such as looking at a smaller set of data (whether you cook one pot of rice or one barrel, you check if it is done using just a spoonful in both cases). Or, if you knew upfront that this question would be asked, you can use the old data warehousing technique of precalculating data in useful forms and waiting for the question to be asked, and so on.
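The spoonful-of-rice idea can be tried out with a small Python sketch. Everything here is an assumption for illustration: the data is synthetic, and I have simplified the question to units sold versus temperature rather than brand versus temperature. It shows the sampling approach only, not how any particular product does it.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic stand-in for real data: daytime temperature and units sold.
temps = [rng.uniform(15, 35) for _ in range(100_000)]
units = [2 * t + rng.gauss(0, 10) for t in temps]

# The "spoonful": a random sample of 2,000 rows out of 100,000.
picked = rng.sample(range(len(temps)), 2_000)
sample_temps = [temps[i] for i in picked]
sample_units = [units[i] for i in picked]

full_r = pearson(temps, units)
sample_r = pearson(sample_temps, sample_units)
print(f"full data: {full_r:.3f}, sample: {sample_r:.3f}")
```

With a properly random sample, the estimate from 2% of the rows should land close to the answer from the full set. The catch is the word "random": a spoonful taken only from the top of the pot tells you much less.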

Essentially, depending on the question you ask, you might need a combination of big data solutions (say, Hana and Hadoop) to get a good answer in an acceptable time. You can reuse and build on a lot of skills you already have in the shop today. But walk into option 1 knowing all the trade-offs you have to live with. And I haven’t even scratched the surface of all the things you need to consider.

What about option 2? No predefined questions: you look at the data and see if there is anything useful there. The good news is that the technology is already there to do this. The bad news is that you need a lot of hardware, consulting and so on to get it done. And there is one more thing to keep an eye on: in the wrong hands, it is fairly easy to bark up the wrong tree when you are dealing with this kind of big data. False positives and false negatives galore. You might be chasing solutions to non-existent problems. An interesting side question while we are at it: have you ever run into a data scientist who said he/she has enough data? I have not. They all would like even more data. I am told the satisfied ones exist somewhere 🙂
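To see how easily false positives turn up when you go looking without a question, here is a small Python experiment. It is purely a sketch under my own assumptions: all the "metrics" are random noise with no relationship to the target, and the 0.44 cutoff is only roughly the conventional 5% significance threshold for 20 data points.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(7)  # fixed seed so the run is repeatable
n_days = 20             # observations per metric
n_metrics = 1000        # unrelated metrics we go fishing through
threshold = 0.44        # roughly the 5% two-sided cutoff for 20 points

# The thing we "want to explain", also pure noise.
target = [rng.gauss(0, 1) for _ in range(n_days)]

hits = 0
for _ in range(n_metrics):
    metric = [rng.gauss(0, 1) for _ in range(n_days)]
    if abs(pearson(metric, target)) > threshold:
        hits += 1

print(f"{hits} of {n_metrics} unrelated metrics look correlated with the target")
```

None of these metrics has anything to do with the target, yet some of them will clear the bar by chance alone. The more columns you scan, the more of these wrong trees you will find to bark up.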

What about disaster recovery? Better start getting used to recovering a few petabytes of data. High availability is probably not a big issue, since it is part and parcel of the big data design in most cases. And of course option 2 has to deal with all the issues of option 1, except that you might not know upfront the extent of the stuff you have to deal with.

Big data will be a fun ride – but keep your seat belts fastened low and tight across your lap for any little bumps along the way.

 
