Are you ready for big data getting bigger, faster, and never getting deleted?


It is a rather scary thought, even for a guy like me who loves playing with data for fun over the weekend, after pretty much doing the same at work during the week 🙂

The quantity of data generated today is huge, and it is being created faster than you can think about it. My smartphone alone probably generates more data than all the emails I have ever written in my life to date. The more data there is, the more options there are to play with it and see if anything useful turns up.

However, at some point one starts to question whether it is worth hoarding the data indefinitely.

I have yet to meet a customer who says any of their data is useless. People (including me) hold on to emails from 15 years ago that they will never read again, with the intention that some day there “might” be a need for them. Since storage became cheap, this has become even easier. I have emails in my archive about a camera I bought in 2000 and lost in 2005. I have no reason to keep that email – yet I do. Why? Mostly because deleting that one email does not add any value – my Yahoo email service doesn’t charge me to hold it, and on first thought, the time taken to find and delete those useless emails outweighs the benefit. I am only one person – and I probably have several GB of such information that I have no use for. Multiply this by the billions of people hoarding information, and a few exabytes of useless data are lying around in servers across the world, probably with redundant copies.

Now of course there is the question of whether this data is useful in aggregate. If Yahoo can do text analytics across all the emails it holds – maybe insights like “18% of Indian men lose a camera every 5 years” might turn up. That might not be directly useful information for Yahoo – but it might be something Yahoo can sell to a camera manufacturer or an insurance company. Of course privacy laws and so on will need to be addressed – and a good number of insights might just get killed strictly for legal reasons.

Then there is the idea of fast and furious data. In theory, a retailer can analyze your buying pattern, your geolocation and your social media exhaust, and alert you to pick up dog food on your way back home from work while you are driving, maybe with a map sent to your car’s navigation system. This might freak out some of us due to privacy concerns – but might be useful for others on an opt-in basis. In this example – not only is every transaction useful for predicting what could happen next, you can also aggregate it over years and make even more people freak out by outthinking them 🙂 . There might not be a hard distinction between signal and noise any more – what is noise in one scenario could be a useful signal in another.

Therein lies the opportunity – and the big cost that comes with it. There is some way of making use of any and all data – which means nothing will ever get deleted going forward. And that comes at a big cost – more storage, more compute, more power, more cooling, more people to manage all the systems – essentially more of everything. As long as the quality of insights leads to value that is greater than the cost of “more of everything”, life is good. But at some point, the cost will need to be curtailed. How do we do that?

At a high level, there are two things to keep in mind.

For starters, begin your big data adventures without trying to boil the ocean. Sure, every bit of data you have can be used somehow – but just because it can be used does not mean you have to use it today. Having an upfront strategy for making use of more and more data over time is the sensible thing to do. If you don’t like this idea, walk over to your BI team and ask them about the history of BI projects that happened without a BI strategy in place. I will bet on you being convinced. If your BI team does not scare you sufficiently, just ping me and I will put you in touch with people who have those scars to show.

Second – choose your big data partners carefully AND for the long term. It is a nascent area – and rapidly evolving. Big data is not something you buy over the counter today – it is a combination of a lot of different things. Neither customers nor vendors will have everything right upfront. So partner with firms you can trust not to drop the ball when the going gets tough. Think through everything – what questions need real-time responses, and what can wait? Do you want to host the data in house, or will you offload it to someone else with adequate protections? Do you like high-end servers with good DR, or would you keep commodity hardware with redundant data? Which existing solutions will you rip and replace, and which ones will you integrate? How does this affect your BI strategy? How much do you want to invest in additional data quality? And a lot more. You don’t need everything answered upfront – but you need to keep tweaking your strategy as your needs and the vendor capabilities evolve.

There is also a chance that you don’t really need all the data to do what you want your big data project to do. Better use of statistical principles can often give good-enough insights from a smaller sample, as the sketch below illustrates. And as soon as you know what question you are trying to answer – you can start eliminating unnecessary data from your system (or at least stop keeping redundant copies of things you probably don’t need).
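
To make the sampling point concrete, here is a minimal Python sketch (all numbers and data here are made up for illustration, not from any real project) comparing an aggregate computed over a full dataset against the same aggregate from a small random sample:

```python
import random

# Hypothetical "full" dataset: one million order values (all numbers made up).
random.seed(42)
full_data = [random.gauss(100.0, 25.0) for _ in range(1_000_000)]

# The "boil the ocean" answer: compute the average over everything.
full_mean = sum(full_data) / len(full_data)

# The sampling answer: a 1% simple random sample of the same data.
sample = random.sample(full_data, k=10_000)
sample_mean = sum(sample) / len(sample)

print(f"Full-data mean: {full_mean:.2f}")
print(f"Sample mean   : {sample_mean:.2f} (from 1% of the rows)")
```

For a simple aggregate like this, the sample estimate typically lands within a fraction of a percent of the full-data answer. The caveat, of course, is that the sample has to be genuinely random with respect to the question being asked.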

Or just maybe – instead of investing in “platforms” and doing your own thing from scratch – maybe you want your vendors to build big data applications for you, which you can then tailor for yourself more closely along the way.

Fun times ahead!

Operatics – Thorsten Franz’s Next Adventure


First off – read this post, please:

http://scn.sap.com/community/hana-in-memory/blog/2013/07/31/betting-my-livelihood-on-the-oltp-olap-convergence-meet-operatics-thorsten-s-hana-startup

I could not be happier for my dear friend, Thorsten Franz. If you don’t know Thorsten – reach out to him on Twitter (@thorstenster) or Facebook https://www.facebook.com/operatics . Trust me, it will be worth your time to know him – and you can thank me later 🙂

Thorsten is an SAP Mentor – but I think of him as a mentor of SAP Mentors. He epitomizes everything that program stands for. He is one of those people who can blend deep technical expertise effortlessly with his love for writing and teaching. A couple of years ago, he gave me a thorough one-hour training on how to use River – done on a piece of paper and his laptop, when we were both at SAP TechEd Madrid. In that one hour, I learned more than I could have learned by myself in a month or more. He understood not just how it worked – but also how it was designed, what it could not do then, and where he expected it to go in the future.

One of my biggest regrets in my professional life is that I could not find a way to work with him on the same team. I tried many different times – but it just did not work out. But now, with his new startup, I think we will finally have a chance to collaborate closely. I know for sure that I will learn a lot from that.

What is also pretty impressive to me personally is that he walked away from a cushy big-company job to try his hand at something he is passionate about. This is not a guy who is known as a daredevil risk taker. I have talked to Thorsten quite a bit and have taken his advice on many topics – I have never heard him say anything without careful consideration and deep thought. So when he chose to bet his future on Hana – you can imagine how thrilled I was.

Thorsten is living proof that the vision Hasso and Vishal have laid out for SAP Hana is realistic. He literally bet his livelihood on the convergence of OLTP and OLAP. How cool is that? And having followed his very successful weight loss campaign on social media – I know he can meet any goal he sets for himself.

Thorsten, I wish you nothing but the BEST in both personal and professional life. It is an honor to have you as a friend. I am sure Operatics will scale great heights, and I look forward to celebrating your success – with a bottle of single malt, of course.

Big Data: Switching Between Top Dog And Fire Hydrant


Most of you know I am a huge fan of big data. In my mind – this is the new top dog in the enterprise software world. Companies can leap into a totally new world of insights, and in all probability it will revolutionize how the business world makes use of information. I have not met a customer this year who does not have plans for some big data initiative. And no wonder – the vendor side of the world is all excited too. It is heartening for me to see both customers and vendors excited at the same time – that should curtail the hype cycle for big data.

Big data is capable of changing everything for the better – notions of what a platform does, what an application does – and of rapidly advancing data science in the academic world. The technology and the theory of big data are fast changing, on both the open source side and the commercial vendor side. There are a variety of options available to customers now for all aspects of big data. And while choice is good – it also tends to increase confusion. It is important to make sure that bets on big data are made with eyes open, and not based on hype.

Big data has real benefits – and while the technology is evolving, there is already plenty available to make good use of for many use cases – ergo, you can use it today. CFOs will always have budget to spend on making more money – or on reducing cost. But they won’t be happy to write a check that offers neither possibility. So let’s try to keep big data in its top dog persona, and try to avoid the evil twin, the fire hydrant.

There are two general ways to use big data:

1. Find answers to questions you already have by sifting through a lot of data.

2. Keep looking at a lot of data, and see if you can spot something – without having a very specific question to ask upfront. Once you spot something, you start asking questions like in option 1.

Option 1 is mostly just an extension of traditional BI – but with more data, coming in at higher speed, and probably of more types than we have dealt with before.

As I think through all those past BI projects (and the associated blood, sweat and tears) – I can say with some certainty that most of the data pumped into data warehouses was never used. The vast majority of customers got most of their answers from less than 30% or so of the data available to them. Of course, it can be argued that they probably did not even know the other 70% existed. But these are big companies with excellent top lines and bottom lines, and they believe they have a mature BI platform. So let’s say 50% of the data is useful instead of the 30% I estimated. Yet, if you ask for the requirements for the next project at these customers – I will bet you dollars to doughnuts they will ask for everything, just like they did in the past. The fact that they don’t use half their data is of no consequence to them. That is how the BI world rolls. Now with big data – there is a chance that more of the previously unused data will be used to enhance the quality of insights.

Any big data solution built for a customer who follows the option 1 route needs very cheap storage, or a way to store only the useful information, or a BI solution that can sift through a lot of unnecessary data quickly to find answers. This also means BI practitioners need even more due diligence in figuring out requirements so that wastage is kept to a minimum. Easier said than done.

In our familiar world of data warehouses – there is some data duplication across layers for various reasons (performance, transformations and so on). When you think of big data – don’t assume redundant data just goes away magically. On the contrary, many big data solutions (including Hadoop) need redundant copies of data. Storage is getting cheaper – but you will need a LOT of it if you keep agreeing to user requirements like before, as the back-of-envelope sketch below shows. There are also trade-offs between using a SAN versus several machines networked together with their own (cheap) disks.
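
Here is a quick back-of-envelope sketch of how raw capacity balloons with replication. The 3x replication factor matches the HDFS default (dfs.replication); the other numbers are assumptions for illustration only:

```python
# Back-of-envelope raw capacity estimate for a replicated big data store.
# 3x replication matches the HDFS default (dfs.replication);
# the other numbers are assumptions, not from any real deployment.
logical_data_tb = 500          # data you actually intend to keep (assumed)
replication_factor = 3         # redundant copies kept by the cluster
working_space_overhead = 0.25  # assumed scratch space for jobs, temp files

raw_tb = logical_data_tb * replication_factor * (1 + working_space_overhead)
print(f"{logical_data_tb} TB of logical data -> ~{raw_tb:.0f} TB of raw disk")
# prints: 500 TB of logical data -> ~1875 TB of raw disk
```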

What about performance? You can of course get excellent performance – but that depends on the question. For a system to use massively parallel processing, the question you ask should allow the data to be split into many chunks, all the chunks to be processed in parallel, and the results to be added up at the end.

If you ask “how many ice creams of each flavor did we sell today?” – and you have a million transactions, you can easily do that chunking and aggregation. However, not all questions can be answered that way. What if I ask “what is the correlation between the brand of ice cream sold and the rise and fall of local daytime temperature?”. This question is hard to split into many chunks because there are more variables. So while it can be computed, it is a fairly serial process. Now of course you can try some other way of solving this, like looking at a smaller set of data (whether you cook one pot of rice or one barrel, you check if it is done in both cases using just a spoonful). Or, if you knew upfront that this question would be asked, you could use the old data warehousing technique of pre-calculating data in useful forms and wait for the question to be asked, and so on.
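
To make the chunk-process-and-add-up pattern concrete, here is a minimal sketch of the ice cream question in Python. The transactions are fabricated; a real system like Hadoop would apply the same map-and-merge idea across machines rather than local processes:

```python
from collections import Counter
from multiprocessing import Pool
import random

def count_flavors(chunk):
    """Map step: count sales per flavor within one chunk of transactions."""
    return Counter(chunk)

if __name__ == "__main__":
    # A million made-up sale records, one flavor name per transaction.
    flavors = ["vanilla", "chocolate", "mango", "pistachio"]
    sales = [random.choice(flavors) for _ in range(1_000_000)]

    # Split the data into chunks that can be processed independently.
    n_chunks = 8
    size = len(sales) // n_chunks
    chunks = [sales[i * size:(i + 1) * size] for i in range(n_chunks)]

    # Process all chunks in parallel, then add up the partial results.
    with Pool(processes=n_chunks) as pool:
        partial_counts = pool.map(count_flavors, chunks)
    total = sum(partial_counts, Counter())

    print(total.most_common())
```

The count splits cleanly because the partial results merge with simple addition. Questions whose intermediate state cannot be merged that cheaply lose most of the parallel speedup – which is the trade-off this paragraph is describing.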

Essentially – depending on the question you ask, you might need a combination of big data solutions (say, Hana and Hadoop) to get a good answer in an acceptable time. You can reuse and build on a lot of skills you already have in the shop today. But walk into option 1 knowing all the trade-offs you have to live with. And I haven’t even skimmed the surface of all the things you need to consider.

What about option 2? No predefined questions – you look at the data and see if there is anything useful there. The good news is that the technology to do this already exists. The bad news is that you need a lot of hardware, consulting, etc. to get it done. Well, there is one more thing to keep an eye on – in the wrong hands, it is fairly easy to bark up the wrong tree when you are dealing with this kind of big data. False positives and false negatives galore. You might be chasing solutions to non-existent problems. An interesting side question while we are at it – have you ever run into a data scientist who said he/she has enough data? I have not – they would all like even more data. I am told they exist somewhere 🙂

What about disaster recovery? Better start getting used to recovering a few petabytes of data. High availability is probably not a big issue, since it is kind of part and parcel of big data design in most cases. And of course, option 2 has to deal with all the issues of option 1 – just that you might not know upfront the extent of the stuff you have to deal with.

Big data will be a fun ride – but keep your seat belts fastened low and tight across your lap for any little bumps along the way.