It is a rather scary thought, even for a guy like me who loves playing with data for fun over the weekend, after pretty much doing the same at work during the week 🙂
The quantity of data that gets generated today is huge, and it is getting created faster than you can think about it. Just my smartphone alone probably generates more data than all the emails I have ever written in my life to date. The more data you have – the more options to play with it and see if there is anything useful.
However, at some point – one would start to question if it is worth hoarding the data indefinitely.
I have yet to see a customer who said any data is useless. People (including me) hold on to emails from 15 years ago that they will never read again, with the intention that some day there “might” be a need for them. Since storage became cheap, this has become even easier. I have emails in my archive about a camera I bought in 2000 and lost in 2005. I have no reason to keep that email – yet I do. Why? Mostly because deleting that one email does not add any value – my Yahoo email service doesn’t charge me to hold it, and the time taken to find and delete those useless emails seems, at first glance, larger than the benefit. I am only one person – and I probably have several GB of such information that I have no use for. If I multiply this by the billions of people hoarding information – that is a few exabytes of useless data lying around in servers around the world, probably with redundant copies.
Now of course there is the question of whether this data is useful in aggregate. If Yahoo can do text analytics across all the emails it holds – maybe insights like “18% of Indian men lose a camera every 5 years” might turn up. That might not be useful information directly for Yahoo – but it might be something Yahoo can sell to a camera manufacturer or an insurance company. Of course privacy laws and so on will need to be addressed – and a good number of insights might just get killed strictly for legal reasons.
Then there is the idea of fast and furious data. In theory, a retailer can analyze your buying pattern, your geolocation and your social media exhaust and alert you to pick up dog food on your way back home from work while you are driving, maybe with a map sent to your car’s navigation system. This might freak out some of us due to privacy concerns – but it might be useful for others on an opt-in basis. In this example – not only is every transaction useful to predict what could happen next, you can also aggregate it over years and make even more people freak out by outthinking them 🙂 . There might not be a hard distinction between signal and noise any more – what is noise in one scenario could be a useful signal in another.
Therein lies the opportunity – and the big cost that comes with it. There is some way of making use of any and all data – which means nothing will ever get deleted going forward. And it comes at a big cost – more storage, more compute, more power, more cooling, more people to manage all the systems – essentially more of everything. As long as the quality of insights leads to value that is greater than the cost of “more of everything”, life is good. But at some point – the cost will need to be curtailed. How do we do that?
At a high level – there are two things to keep in mind.
For starters, starting out on big data adventures without boiling the ocean will be a good thing. Sure, every bit of data you have can be used somehow – but just because it can be used does not mean you have to do it today. Having an upfront strategy for making use of more and more data over time is the sensible thing to do. If you don’t like this idea, walk over to your BI team and ask them about the history of BI projects that happened without a BI strategy in place. I will bet on you being convinced. If your BI team does not scare you sufficiently, just ping me and I will put you in touch with people who have those scars to show.
Second – choose your big data partners carefully AND for the long term. It is a nascent area – and rapidly evolving. Big data is not something you buy over the counter today – it is a combination of a lot of different things. Neither customers nor vendors will have everything right upfront. So partner with firms you can trust not to drop the ball when the going gets tough. Think through everything – what questions need real-time responses, and what can wait? Do you want to host the data in house, or will you offload it to someone else with adequate protections? Do you like high-end servers with good DR, or would you keep commodity hardware with redundant data? Which existing solutions will you rip and replace, and which ones will you integrate? How does this affect your BI strategy? How much do you want to invest in additional data quality? And a lot more. You don’t need everything answered upfront – but you need to keep tweaking your strategy as your needs and the vendor capabilities evolve.
There is also a chance that you don’t really need all the data to do what you want your big data project to do. Better use of statistical principles can give good enough insights in many cases from a smaller sample. And as soon as you know what question you are trying to answer – you can start eliminating unnecessary data from your system (or at least stop keeping redundant copies of things you probably don’t need).
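To make the sampling point concrete, here is a minimal sketch in Python. The scenario is entirely hypothetical (simulated “basket sizes” standing in for a big transaction history): it shows that a 1% random sample can estimate the population average to within a fraction of a dollar, which is often all the precision a business question needs.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Simulated "full" dataset: one million basket sizes in dollars
# (hypothetical stand-in for years of transaction history).
population = [random.gauss(50, 15) for _ in range(1_000_000)]

# A 1% simple random sample instead of scanning everything.
sample = random.sample(population, k=10_000)

full_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# Standard error of the sample mean gives a rough 95% confidence interval.
se = statistics.stdev(sample) / len(sample) ** 0.5
low, high = sample_mean - 1.96 * se, sample_mean + 1.96 * se

print(f"full mean:   {full_mean:.2f}")
print(f"sample mean: {sample_mean:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The interesting part is the standard-error line: the uncertainty shrinks with the square root of the sample size, not with the size of the full dataset – which is exactly why you can often answer the question without hoarding everything.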
Or just maybe – instead of investing in “platforms” and doing your own thing from scratch – maybe you want your vendors to build big data applications for you, which you can then tailor for yourself more closely along the way.
Fun times ahead!