Defining Hadoop with a straight face

Unlike "data lake", I don't think of Hadoop as a buzzword at all. It is a real thing – a real project you can touch and feel. Also, I love elephants (the one in the picture is the eighty-year-old Guruvayoor Padmanabhan, owned by a famous temple in Kerala) – which might explain why I have a particular soft spot for Hadoop 🙂 . Hadoop is perhaps not yet an elephant of this size, but it is not the little baby elephant shown in the marketing pictures either.


By now, it is stable enough even for an obsolete programmer like me to play with it with minimum fuss. In my day job, I don't program any more (sadly – but probably a good thing for my team, and perhaps for the rest of humanity). But hardly a week passes without my needing to explain Hadoop to someone. Even when I was between jobs and enjoying a nice vacation, I was pulled into "help me understand Hadoop" discussions.

The primary problem for me personally is that the folks I usually have to explain Hadoop to are not always conversant with open source. I mostly get to talk to smart enterprise technologists who have very little idea of open source per se. The vast majority of them equate open source with free software. And it is not as if open source is an all-encompassing way to explain Hadoop (hello MapR). And if I somehow manage to get these folks to understand open source and a few licensing models, I lose them again on the idea of what a distribution is and what differs between the distros. I think it's mostly a mental block: there are commercial vendors who make money off software that is supposedly free – and they all do it in different ways. Can't blame them – I had those questions too for a long time.

Once you get past open source, then there is a question of what constitutes Hadoop.

No one in my circles apparently cares about Hadoop Common, so I no longer utter a word about it. And the few times I have mentioned it, mostly to friends who come from coding backgrounds, I have had to deal with "why would anyone use Java for building this?" 🙂 . Hence no "Common" – nothing but grief from that conversation.

Then I talk about HDFS and MapReduce. Enter more rat holes – why does the world need yet another file system, why not use GPFS? I have heard of Spark – so when do we use MapReduce, and why not just use Spark all the time? If I get a breather, I also get to mention YARN, Arun's explanation of the data-center operating system, and that there is a MapReduce 2. People get why YARN is a good idea almost instantaneously. Everyone appreciates the vision of pluggability – but invariably someone will ask about its compatibility with stuff that came before YARN.
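For the friends who ask what MapReduce actually is, the model itself fits in a few lines of plain Python. This is a toy sketch of the map/shuffle/reduce phases – nowhere near the real Hadoop API or its distribution across nodes, but enough to show why word count became the canonical demo:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the elephant in the room", "the yellow elephant"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result["the"])       # 3
print(result["elephant"])  # 2
```

The point of the real framework is that the map and reduce steps run in parallel across the cluster, with the shuffle moving data between machines – the programming model above stays the same.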

This is usually where I start on the idea of different kinds of nodes, the mechanics of replication, why 3 is the default value, and why the whole thing is built for commodity servers daisy-chained together. You would think this is the easy part – but it is not. We are dealing with people who have spent their entire careers working on high-end servers with all kinds of resiliency. It is really hard for them to visualize the world Hadoop is built for. Usual starting questions include "if I just repurpose all my high-end servers, can I avoid replicating thrice?" or "I already have tooling that takes care of HA and DR in all my data centers. Surely there are APIs I can use to connect my existing tools to this Hadoop thing?". This is a game of how many times you can say "it depends" in one conversation without taking a breath.
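One thing that does land with this crowd is putting numbers on the three-copies default, since it is really just a storage multiplier. A back-of-envelope sketch (the 25% operational headroom figure is my own assumption for illustration, not anything Hadoop-specific):

```python
def raw_storage_needed(logical_tb, replication=3, headroom=0.25):
    # Raw disk required = logical data x number of replicas,
    # plus some free-space headroom for temp data and rebalancing
    return logical_tb * replication * (1 + headroom)

# 100 TB of actual data with default 3x replication
print(raw_storage_needed(100))                              # 375.0 TB of raw disk
# What the "can I avoid replicating thrice?" crowd is hoping for
print(raw_storage_needed(100, replication=2, headroom=0))   # 200.0 TB
```

The follow-up conversation about why cheap disks times three can still beat one expensive resilient array is where the "it depends" counter really starts running.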

Just when I am ready to wind down the conversation, an impatient listener will ask "can you fill me in on Hive, HBase etc.?". Sure, why not – so I explain how there are a bunch of other projects that play alongside Hadoop. "Are they all Java?" – well, they are not ALL Java, but ideally you should not have to worry, given they all have interfaces you can use relatively easily. I can see the relief!

HBase seems to be the trigger for starting on the NoSQL part of this odyssey, particularly because my friends know I spent some time working at MongoDB. Sadly – as in REALLY sadly – 90% of my conversations include convincing people that MongoDB is not Hadoop. And at a minimum, at that point I have to touch on Cassandra to explain that there are more NoSQL options out there. Invariably this opens up the question "does MongoDB work with Hadoop?" – thankfully it does, and I explain the connector. It also usually leads my friends from ops backgrounds to sigh aloud, "can't we just use a general-purpose database that does all these things?". I no longer fight them on this topic – mostly for lack of energy.

"So you have explained how data sits in Hadoop, but you have not explained how I put it in there or how I get these awesome insights from that data." Ah, I forgot all about that. So I go on to explain even more projects (at some point it is overwhelming to remember all the names and the right spellings) – and also manage to get in a plug that a lot of closed-source tools can access Hadoop. This is usually the point at which my ops friends give up on me and say "good Lord, it is complicated to manage all this" and my dev friends get all excited that there are so many cool new toys to play with.

Usually the final question is on security. On the bright side, Ranger and Sentry are both easy names to remember. On the not-so-bright side, I don't want (or even know well enough) to explain why two major distros have two different approaches. And this usually leads to other examples, like Impala, to show that not all distros share the same implementation philosophy. I should use something other than Impala as my example in future – since that goes into "how many possible ways can you do SQL on Hadoop?". My usual temptation is to say "more than enough ways for you to need a Costco-size jar of aspirin" 🙂

I have not counted, but there are certainly more than 30 projects that can somehow be called a part of Hadoop. I have personally played with less than half of them. And every day more projects start, and it is getting hard to keep up. No wonder my friends/customers/random people all get stressed out trying to understand Hadoop and how to make use of it. And yet – everyone is excited about their Hadoop journey, despite all its twists and turns. Marketers like my friend Ingrid, who recently joined Hortonworks as CMO, should have a fun time articulating a message that makes it all much simpler to understand.

Alright, so who wants to ask me about Hadoop next ?


Big data opportunities and challenges are getting a bit clearer

From time to time, I take a few days off work to reflect on things I don't get to think about in "regulation time". It's a bit of spring cleaning for my mind.


I am in the middle of one such break today. Other than sleeping a lot and recovering from India's loss to Australia last week in the cricket World Cup, I have been busy reading, listening to Ilayaraja songs non-stop, installing a new patio door, following the progress of my dog, who is on a dog-show circuit in the Midwest with his handler (probably the closest thing to good training for me and my wife for when our kiddo leaves for college), and catching up with friends/mentors/old customers/schoolmates.

Last night, I finally put an end to my month-long misery of not being able to crack the 2048 game (it is a super addictive game – my advice is not to start on it unless you don't mind spending every spare minute on your phone playing it, and it is a big culprit on the battery-drain front). The first thing I did after getting the 2048 tile was take a screenshot to show my daughter, who had challenged me to do it; the next thing I did was delete the game from my phone. All of today, I have been fighting the withdrawal. As of 5 PM PST, I can report that I have resisted the temptation to download the game and play it all over again 🙂

Spending time talking shop with all the peeps I managed to get hold of these last few days, one thing hit me immediately. Customers and vendors who have started on their big data journey in the last year or two have a new appreciation for the opportunities and challenges in front of them. The opportunity part is pretty straightforward – customers are recognizing that some of the hype around big data is justified, and that real, verifiable customer stories are now available. Of course, they also know the koolaid firehose is still running at full blast 🙂

Here are some recurring themes on the challenges.

1. Talent shortage

Vendors need technical pre-sales people and developers the most. Customers need developers and ops people, either in house or from consulting companies. And such people are apparently in the unicorn category. And even when these people are available, employers just don't know how to evaluate their skills.

Another issue customers seem to run into is breadth vs depth. They can usually find an expert in one technology for the right money. But a project typically needs more than one new technology – maybe Hive, MongoDB and, say, Elasticsearch. People who can integrate all of them in real life are rarer than rainbow-colored unicorns.

2. How exactly does open source work ?

The people who understand the nuances of open source are overwhelmingly on the vendor side of the house. This includes legal experts. Some customers are also finding that their trusted buyer's agents are not yet smart on open source models. There is a silver lining though – subscription models are better understood than they were a year or two ago.

3. Procurement cannot figure out what motivates sales people any more.

This one made me smile quite a bit. A good part of my grey hair can be attributed directly to wrestling with procurement folks over the years. Here is how one guy explained it to me: "It was pretty simple in the past – the larger the check I could write, the more benefits I could extract from the salesman. It no longer seems to be the case across the board. Sales reps selling BI and big data things to me all seem to have incentives that are rather unique. Some don't even want big checks anymore. Some like cloud, and some others talk me out of it. I feel like I need to take classes on dealing with them."

And an IT director buddy – someone who has planned and executed hundreds of millions of dollars' worth of projects in his career – told me, "I have a hard time with financial models for projects now, given the mix of perpetual and subscription models for all the different software I need. I can barely understand all the pricing and terms nuances, let alone explain the full picture to the controllers and other stakeholders." The impact is a weird situation – he takes more time planning a project than actually executing it, and he hates it.

4. Development is not the big worry anymore – maintenance is 

They unanimously agree that these new technologies reduce development time significantly and give great flexibility to make changes relatively quickly. However, they all have the same worry about maintenance – especially my friends who work in consulting/outsourcing companies. These new technologies all have different security models, different ways to back up and restore, and different ways to provision new instances. Each one is individually built to be maintainable and scalable – their worry is how to do all of them together under tight SLAs.

5. Minimal vertical messaging 

I never thought I would hear customers ask for more marketing – but that did happen!  What is the world coming to ? 🙂

These folks have all heard it loud and clear that data is big and bad these days, and that these new technologies can help them tame the badass big data beast. But they are looking for specific examples of how it helps customers in their industry. On the bright side, most of them are not hesitant to try proofs of concept for new use cases.

I did not offer any solutions to these challenges – my intention was just to listen and get a feel for where we are headed at a big-picture level. But now that I have thought about it a little, I have some rough initial thoughts on things that could make life easier on this front. When these thoughts are a little better formed, I will make an attempt to scribble them down and share.

I am very curious to hear from all of you on whether these themes are showing up in your big data journey. Let me know !

Are you ready for big data getting bigger and faster and never getting deleted ?

It is a rather scary thought, even for a guy like me who loves playing with data for fun over the weekend, after pretty much doing the same at work during the week 🙂

The quantity of data generated today is huge, and it is being created faster than you can think about it. My smartphone alone probably generates more data than all the emails I have ever written in my life. The more the data, the more options to play with it and see if there is anything useful.

However, at some point one starts to question whether it is worth hoarding the data indefinitely.

I am yet to see a customer who said any data is useless. People (including me) hold on to emails from 15 years ago that they will never read again, with the intention that some day there "might" be a need for them. Since storage became cheap, this has become even easier. I have emails in my archive about a camera I bought in 2000 and lost in 2005. I have no reason to keep that email – yet I do. Why? Mostly because deleting that one email does not add any value – my Yahoo email service doesn't charge me to hold it, and the time taken to find and delete those useless emails is larger than the benefit, at least on first thought. I am only one person – and I probably have several GB of such information that I have no use for. Multiply this by the billions of people hoarding information, and that is a few exabytes of useless data lying around on servers around the world, probably with redundant copies.
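The exabyte claim is easy to sanity-check with arithmetic. Every number below is my own guess for illustration – the point is only that even modest per-person hoarding multiplies out to staggering totals:

```python
# Back-of-envelope: never-to-be-read-again data worldwide
# (all inputs are rough guesses, not measured figures)
people = 2e9         # people with sizable email/file archives
useless_gb = 5       # GB of never-read-again data per person
copies = 3           # redundant copies providers keep for durability

total_gb = people * useless_gb * copies
total_exabytes = total_gb / 1e9   # 1 EB = 1e9 GB
print(total_exabytes)  # 30.0 exabytes, give or take an order of magnitude
```

Change any input by a factor of a few and you still land in exabyte territory – which is the whole point.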

Now of course there is the question of whether this data is useful in aggregate. If Yahoo could do text analytics across all the emails it holds, maybe insights like "18% of Indian men lose cameras every 5 years" might turn up. That might not be useful information for Yahoo directly – but it might be something Yahoo could sell to a camera manufacturer or an insurance company. Of course, privacy laws and so on will need to be addressed – and a good number of insights might get killed strictly for legal reasons.

Then there is the idea of fast and furious data. In theory, a retailer can analyze your buying pattern, your geolocation and your social media exhaust, and alert you to pick up dog food on your way home from work while you are driving, maybe with a map sent to your car's navigation system. This might freak out some of us due to privacy concerns – but it might be useful for others on an opt-in basis. In this example, not only is every transaction useful for predicting what could happen next, you can also aggregate it over years and make even more people freak out by out-thinking them 🙂 . There might not be a hard distinction between signal and noise any more – what is noise in one scenario could be a useful signal in another.

Therein lies the opportunity, and the big cost that comes with it. There is some way of making use of any and all data – which means nothing will ever get deleted going forward. And that comes at a big cost – more storage, more compute, more power, more cooling, more people to manage all the systems – essentially more of everything. As long as the quality of insights leads to value greater than the cost of "more of everything", life is good. But at some point, the cost will need to be curtailed. How do we do that?

At a high level, there are two things to keep in mind.

For starters, begin your big data adventures without boiling the ocean. Sure, every bit of data you have can be used somehow – but just because it can be used does not mean you have to do it today. Having an up-front strategy for making use of more and more data over time is the sensible thing to do. If you don't like this idea, walk over to your BI team and ask them about the history of BI projects that happened without a BI strategy in place. I will bet on you being convinced. If your BI team does not scare you sufficiently, just ping me and I will put you in touch with people who have the scars to show.

Second – choose your big data partners carefully AND for the long term. It is a nascent area – and rapidly evolving. Big data is not something you buy over the counter today – it is a combination of a lot of different things. Neither customers nor vendors will have everything right upfront. So partner with firms you can trust not to drop the ball when the going gets tough. Think through everything – what questions need real-time responses, and what can wait? Do you want to host the data in house, or will you offload it to someone else with adequate protections? Do you like high-end servers with good DR, or would you keep commodity hardware with redundant data? Which existing solutions will you rip and replace, and which ones will you integrate? How does this affect your BI strategy? How much do you want to invest in additional data quality? And a lot more. You don't need everything answered upfront – but you need to keep tweaking your strategy as your needs and the vendor capabilities evolve.

There is also a chance that you don't really need all the data to do what you want your big data project to do. Better use of statistical principles can give good-enough insights from a smaller sample in many cases. And as soon as you know what question you are trying to answer, you can start eliminating unnecessary data from your system (or at least stop keeping redundant copies of things you probably don't need).
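The sampling point is easy to demonstrate. A quick sketch with made-up order values: a 1% random sample of a million records recovers the average almost exactly, which is often all a business question needs:

```python
import random

random.seed(42)
# A million hypothetical order values (mean ~50, sd ~15)
population = [random.gauss(50, 15) for _ in range(1_000_000)]
full_mean = sum(population) / len(population)

# A 1% random sample is enough to estimate the mean closely
sample = random.sample(population, 10_000)
sample_mean = sum(sample) / len(sample)

# The two typically agree to within a fraction of a percent
print(full_mean, sample_mean)
```

Not every question survives sampling – rare-event hunting is a different story – but "what is our average basket size?" certainly does.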

Or just maybe – instead of investing in "platforms" and doing your own thing from scratch – you want your vendors to build big data applications for you, which you can then tailor more closely for yourself along the way.

Fun times ahead !

Big Data : Switching Between Top Dog And Fire Hydrant

Most of you know I am a huge fan of big data. In my mind, this is the new top dog in the enterprise software world. Companies can leap into a totally new world of insights, and in all probability it will revolutionize how the business world makes use of information. I have not met a customer this year who does not have plans for some big data initiative. And no wonder – the vendor side of the world is all excited too. It is heartening for me to see both customers and vendors excited at the same time – that should curtail the hype cycle for big data.

Big data is capable of changing everything for the better – notions of what a platform does, what an application does – and of rapidly advancing data science in the academic world. The technology and theory of big data are fast changing, both on the open source side and on the commercial vendor side. There are a variety of options available to customers now for all aspects of big data. And while choice is good, it also tends to increase confusion. It is important to make sure that bets on big data are made with eyes open, and not based on hype.

Big data has real benefits – and while the technology is evolving, there is already plenty available to make good use of for many use cases – ergo, you can use it today. CFOs will always have budget to spend on making more money – or on reducing cost. But they won't be happy to write a check that offers neither possibility. So let's try to keep big data in its top dog persona, and avoid its evil twin, the fire hydrant.

There are 2 general ways to use big data

1. Find answers to questions you already have by sifting through a lot of data

2. Keep looking at a lot of data, and see if you can spot something – without having a very specific question to ask upfront. Once you spot something, you start asking questions like in option 1.

Option 1 is mostly just an extension of traditional BI – but with more data, coming in at higher speed, and probably of more types than we have dealt with before.

As I think through all those past BI projects (and the associated blood, sweat and tears), I can say with some certainty that most of the data pumped into data warehouses was never used. The vast majority of customers got most of their answers from less than 30% or so of the data available to them. Of course, it can be argued that they probably did not even know the other 70% existed. But these are big companies with excellent top lines and bottom lines, and they believe they have mature BI platforms. So let's say 50% of the data is useful instead of the 30% I estimated. Yet, if you ask for the requirements for the next project at these customers, I will bet you dollars to doughnuts they will ask for everything, just like they did in the past. The fact that they don't use half their data is of no consequence to them. That is how the BI world rolls. Now with big data, there is a chance that more of the previously unused data will be used to enhance the quality of insights.

Any big data solution built for a customer who follows the option 1 route needs very cheap storage, or a way to store only useful information, or a BI solution that can sift through a lot of unnecessary data quickly to find answers. This also means BI practitioners need even more due diligence in figuring out requirements, so that wastage is kept to a minimum. Easier said than done.

In our familiar world of data warehouses, there is some data duplication across layers for various reasons (performance, transformations and so on). When you think of big data, don't assume redundant data just goes away magically. On the contrary, many big data solutions (including Hadoop) need redundant copies of data. Storage is getting cheaper – but you will need a LOT of it if you keep agreeing to user requirements like before. There are also trade-offs between using a SAN vs several machines networked together with their own (cheap) disks.

What about performance? You can of course get excellent performance – but it depends on the question. For a system to use massively parallel processing, the question you ask should allow the data to be split into many chunks, each chunk examined in parallel, and the results then added up.

If you ask "how many ice creams of each flavor did we sell today?" and you have a million transactions, you can easily do that chunking and aggregation. However, not all questions can be answered that way. What if I ask "what is the correlation between the brand of ice cream sold and the rise and fall of local daytime temperature?". This question is harder to split into chunks because there are more variables. So while it can be computed, it is a fairly serial process. Of course, you can try some other way of solving it by looking at a smaller set of data (whether you cook one pot of rice or one barrel, you check if it is done using just a spoonful in both cases). Or, if you knew upfront that this question would be asked, you could use the old data warehousing technique of pre-calculating data in useful forms and waiting for the question to be asked, and so on.
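The ice cream question chunks nicely precisely because counting is associative – partial counts from any subset of the data can be merged in any order. A tiny sketch of that property (the sales data is made up for illustration):

```python
from collections import Counter

# A day of "transactions": one flavor per sale (hypothetical data)
sales = ["vanilla", "chocolate", "vanilla", "mango", "chocolate", "vanilla"]

def count_chunk(chunk):
    # Each worker counts its own chunk independently -- the parallel part
    return Counter(chunk)

# Split into chunks; on a real cluster each chunk lives on a different node
chunks = [sales[i:i + 2] for i in range(0, len(sales), 2)]

# Partial results can be computed in any order, on any machine...
partials = [count_chunk(c) for c in chunks]

# ...because the merge step is just addition, which is associative
total = sum(partials, Counter())
print(total["vanilla"])    # 3
print(total["chocolate"])  # 2
```

The correlation question resists this treatment because its intermediate state couples all the variables together, so the partial results are no longer cheap to merge.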

Essentially, depending on the question you ask, you might need a combination of big data solutions (say, Hana and Hadoop) to get a good answer in an acceptable time. You can reuse and build on a lot of skills you already have in the shop today. But walk into option 1 knowing all the trade-offs you have to live with. And I haven't even skimmed the surface of all the things you need to consider.

What about option 2? No predefined questions – you just look at the data and see if there is anything useful there. The good news is that the technology to do this already exists. The bad news is that you need a lot of hardware, consulting etc. to get it done. Well, there is one more thing to keep an eye on – in the wrong hands, it is fairly easy to bark up the wrong tree when dealing with this kind of big data. False positives and false negatives galore. You might be chasing solutions to non-existent problems. An interesting side question while we are at it – have you ever run into a data scientist who said he/she has enough data? I have not – they all would like even more data. I am told they exist somewhere 🙂

What about disaster recovery? Better start getting used to recovering a few petabytes of data. High availability is probably not a big issue, since it is part and parcel of big data design in most cases. And of course, option 2 has to deal with all the issues of option 1 – just that you might not know the extent of what you have to deal with upfront.

Big data will be a fun ride – but keep your seat belts fastened low and tight across your lap for any little bumps along the way.


Big Data Solutions – Do Questions Matter ?

I have Ray Wang to thank for this post. Of late, I have had a serious case of writer's block. I just can't find a topic interesting enough to write about, and consequently have become a rather irregular blogger – at least compared to last year. Anyway – back to the topic of this post.

Ray tweeted this a few minutes ago:

A lot of BI blueprinting sessions from my consulting career flashed through my mind when I saw that. A key principle of good BI system design is finding out upfront most of the questions a user would ask the system, and then designing a solution around them. Unfortunately, this is both a blessing and a curse – while we can really optimize for fast and accurate responses to predefined questions, it also curtails our ability to change our minds and ask different questions. More experienced BI experts will second-guess other questions that users "may" ask and leave some room in the design to cater for that, but it is clearly not a scalable way to do things.

Somehow, users were also trained along the way to accept some lack of flexibility in BI systems. While the complaints never went away fully, most users now think it is normal for the BI team to ask for some time to change the data models, create new reports and so on. It is a sort of "marriage of convenience", if you will – with trade-offs understood by both sides.

So when we let go of "ordinary" data and embrace "big" data, what should change? I think we should use the big data momentum to make BI systems more intelligent than the rudimentary things they are capable of today. And this intelligence should come with some business savvy. In other words, both the "B" and the "I" of BI need some serious tweaking.

In my opinion, what should change right away is the expectation that business users state most of their potential questions upfront at design time. Or, more clearly, that expectation should be significantly lowered, and business users should be allowed to ask more ad-hoc questions than they have so far. Of course, we can never guarantee full flexibility – so some subjectivity is necessary in where we draw the line. Just that the line should be drawn much farther out than it is today.

Accuracy for ad-hoc questions is not enough – the results should also come back in a predictable and short time frame. Ideally, all questions should come back with answers (or a heads-up to the user that this is going to take longer) within a predefined time frame (say 3 to 5 seconds or less).
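The "answer fast or warn the user" behavior is a simple timeout pattern. A sketch of the idea using Python's standard thread pool – the query function, the message text and the time budget are all hypothetical stand-ins, not any real BI product's API:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def run_query(seconds):
    # Stand-in for an ad-hoc BI query of unknown cost
    time.sleep(seconds)
    return "42 widgets"

def ask(seconds, budget=0.1):
    # Answer within the time budget, or give the user a heads-up
    # (a real system would keep a long-lived pool instead of one per call)
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(run_query, seconds)
        try:
            return future.result(timeout=budget)
        except TimeoutError:
            return "Still working on it -- this one will take longer"

print(ask(0.01))  # fast query comes back with the answer
print(ask(0.5))   # slow query triggers the heads-up instead
```

A production version would also let the slow query keep running and notify the user when it completes, rather than just apologizing.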

Then there is the question of how users ask these questions. SQL or NoSQL – query languages do not provide democratic access to data. People should be allowed to ask questions in English (or whatever language they use for business). Some training might be needed for the system and for the users to understand the restrictions – but no user should be constrained by the need to know how things work behind the scenes. A minority of people should have the skills to educate the computer – the rest of us should not be burdened with that. Instead, the computer should be smart enough to answer whatever questions users ask.

There are very seldom exact answers to questions in business (or life) – even an apparently simple question like "what is my margin in North America?" is ambiguous. Most clients I have had attach many different meanings to "margin", "North America" and "my" within their organization. In real life, if these questions are asked of a human analyst, she will ask follow-up questions to clarify, and then provide an answer with the necessary caveats. Why can't systems do that? Wouldn't the life of users be vastly improved if systems answered problems like humans do, in a way humans understand? Of course, with more speed than humans 🙂

Big data or otherwise, there is always an issue of trust in the data from the user's perspective. Most analysts spend nearly as much time explaining how they arrived at their results as they take compiling and analyzing the data. The system goes through all the computation anyway – even today, in the non-big-data world. Why can't our BI systems explain to the user how they arrived at the result, all the way from source to target or backwards? Wouldn't that increase productivity a lot?

When users ask questions, they usually also combine the answers with external data (Google, spreadsheets etc.) before they take a final decision. Would it be possible for a BI system to present some useful contextual data alongside the answers, drawn from the internet and intranet, and allow the user to choose/combine what he needs?

And one last thing – if the system is intelligent enough to find answers, why can't it have the smarts to also figure out the best possible presentation of the results? Today, we mostly have to predefine what the output looks like visually. Why put that load on users? Can't systems be smart enough to look at the question and the answers and figure out the best way to represent them to the user? This is not a "big data" problem – this should have been the case all along, but somehow never quite happened in a mainstream way.

This is by no means an exhaustive list – I left out plenty of things like collaboration, predictive responses, closed-loop BI and so on. I didn't do so because they are unimportant, but only because of the boredom factor. These things are already happening to some extent, and hopefully will catch on more as time progresses.

So there you have it – it's my birthday wishlist. And thanks again, Ray, for that much-needed spark to blog again 🙂

Strategy Based On Data – Will What Worked For The Campaign Work For Governance Too?

So President Obama won a second term. Congratulations to POTUS and all my friends who supported him. And I do feel for the several friends who are genuinely disappointed that Romney did not win. I didn't think either candidate had it in them to help this country prosper. The ideal candidate in my mind needed more Obama-like stances on social issues, and a mix of Obama and Romney stances on economic issues. But that is not going to happen any time soon, I guess.

What fascinated me most in this election was the use of quantitative methods in the Obama campaign. At a high level, there are 3 things campaigns need to know:

1. Who to target to secure enough votes?

2. Who to target to get funding?

3. What to tell these 2 groups, and through which media?

And one way or another, the Obama campaign seems to have succeeded on all three, and hence Obama gets to be President for 4 more years. So far so good – but that is not the part that fascinates me.

Now, with the elections behind them, will the administration still invest in big data analytics to govern better? That is the possibility that fascinates me. I sure hope the answer is YES (not YES WE CAN, but YES WE WILL 🙂 )

Now, the administration does not create laws – Congress does. And given that control of the House and Senate is not changing, according to the latest results, the chance of a different type of lawmaking is pretty low if nothing changes. But things could change – and if all arms of government can use a little more of decisions based on data, and a little less of “political gut” based decisions, maybe this country can start prospering again. The type of analytics needed for policy making is not exactly the same as for the narrow objectives of campaigning – but I would argue that similar principles apply. And just like in enterprise software projects, I suspect the hard part is less the tech work and more the change management.

Second presidential terms are all about leaving a legacy, I am told – I would love to remember Obama as the president who used data effectively to win 2 elections, and then used data effectively to govern this great country. Probably wishful thinking on my part – but hey, I make a living doing BI work for my clients, AND I am jet lagged as I type this at 2 AM 🙂

I am also rather curious now about whether Indian politicians will use big data analytics in their campaigns. If they do, India might see some interesting times.

Big data – because it “can” deliver you huge value doesn’t mean it “will”

Big data is the talk of the town in social media, and has picked up some interest amongst customers too. I had a series of big data conversations this week with customers, colleagues and friends, and thought I would share some here. As always – these are just my personal ramblings, not my employer’s views.

In social media, “big” usually means close to petabytes, or at least several tens of TB, rushing at you from all over the place. At customer sites, the expectation seems much more modest – 50 to 100 TB is considered excessively big data even for some very large customers I know.

The cost of big data is bigger on all fronts compared to the status quo volumes (and velocity, and all the other factors) of data in most shops. Storage is cheaper than it was a few years ago, but it is not free – and when you talk about petabytes, you need a LOT of it. And then there are the multiples needed for HA/DR/archiving etc. All of this needs more data center space, cooling, power and so on.
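Those multiples add up faster than people expect. As a back-of-envelope sketch (the multipliers here are illustrative assumptions, not vendor numbers – though 3x is the typical default replication in HDFS-style storage):

```python
# Rough storage footprint estimate. All multipliers are assumptions
# for illustration: 3 replicas for HA, one full DR copy, plus archive.
def total_storage_tb(raw_tb, replication=3, dr_copies=1, archive_factor=0.5):
    """Total footprint = replicated primary + DR copies + archive space."""
    primary = raw_tb * replication              # e.g. 3x replication for HA
    dr = raw_tb * replication * dr_copies       # a replicated DR site
    archive = raw_tb * archive_factor           # cold/archive tier
    return primary + dr + archive

print(total_storage_tb(100))  # 100 TB of raw data -> prints 650.0 (TB)
```

So a "modest" 100 TB of raw data can easily mean well over half a petabyte of disks to buy, power and cool – before you have analyzed a single byte.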

What about the quality of data? As we know, poor quality is a big problem in all kinds of data related work. Quality becomes a bigger problem as volume and speed increase. Existing tools may be stretched to deal with that kind of data. But even assuming tools can somehow cope, there is still the question of the human effort to fix data. A lot of data projects fail to deliver value because no one on the business side owns the data. Big data will most probably make this problem worse, unless software improves by leaps and bounds in short order to make data quality a non-issue. How many of us will hold our breath on that?

What about security? Even with just 2 TB of structured data, there are companies who struggle to make sure everything is secure, everyone is kept honest, and all the legal compliance is ensured. I have seen the amount of trouble they go through when the status quo changes (like an M&A, or even when a small CRM system is introduced). Most of them are not equipped to deal with more data unless they beef up with more sophisticated governance, and probably more staff.

Some companies love BYOD and others do not. The ones who do not frequently worry about support cost and security. Imagine the effort they will have to go through if BYOD does happen in their companies, and they have to protect much more data than they are used to.

We are right now in the middle of a small POC for a customer – and the data in the data warehouse is minuscule compared to what “Big Data” can be. We are talking about only something like 150 to 200 million rows per cube. The data comes back at lightning speed from the database to the app server. But the user did not see this speed from his iPad connected from a Starbucks WiFi via VPN. He did see some improvement, but not enough for a big WOW. And every drill down needs a roundtrip that chokes up the network yet again. Essentially, the bottleneck moved from the DB/app server side to the network/client side. These networks will need serious upgrades in capacity to cope with big data. And the mobile software should be smart enough to use the processing power and memory of the device to minimize the use of bandwidth when it is not required. Carriers will probably need big upgrades too, and if big data catches on, we should start seeing different types of data plans from them, different from the rates we see now when we buy tablets and smart phones.
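One way client software can spare the network is to remember drill-down results it has already fetched. A minimal sketch of the idea – the class and function names here are hypothetical, not from any actual BI client:

```python
# Hypothetical sketch of client-side caching for drill-down results,
# so repeated drill-downs are served locally instead of triggering
# another roundtrip over a slow VPN/WiFi link.
class DrillDownCache:
    def __init__(self):
        self._cache = {}   # drill path (tuple) -> result rows
        self.hits = 0      # served from local memory
        self.misses = 0    # required a network roundtrip

    def get(self, drill_path, fetch_fn):
        key = tuple(drill_path)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = fetch_fn(drill_path)   # the expensive network call
        self._cache[key] = result
        return result

# Illustrative stand-in for the real server call
def fake_server_fetch(path):
    return ["row for " + "/".join(path)]

cache = DrillDownCache()
cache.get(["2012", "Q4"], fake_server_fetch)   # first time: roundtrip
cache.get(["2012", "Q4"], fake_server_fetch)   # second time: local, no network
```

Obviously a real client also has to worry about cache invalidation and memory limits on the device, but even this much trades cheap local memory for scarce bandwidth – exactly the direction the mobile software needs to go.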

Then there is the cost of licensing – and the licensing models are still evolving. If licenses are tied to the quantity of data that is processed or stored, that adds up quickly. And even with sophisticated software, you need smart data analysts who can make use of it to generate value. These analysts – or architects, scientists, artists, artisans, or whatever it is they are called this week – don’t come in big numbers, and they won’t be cheap either. And long term, I am not sure this is given enough importance in universities.

The other side of the equation – the more important side – is the value that big data delivers. There is definitely value in big data – significant value, for sure. But it is not value that gets delivered overnight, and it takes significant investment before you reap the benefits. And this value will not be spread evenly across industries, or even across companies within an industry. So it is a decision that needs to be taken carefully. Given the cost, the insights from big data have to be not just “big” but “BIIIIGGGG” for the investment to be worthwhile. And just because it “can” deliver value does not mean it “will” – it is no secret that several companies could not even make good use of the much smaller quantities of structured data readily available to them all these years.

Several CXOs I have spoken to are willing to dip their toes despite the cost. And they are all trying to find out where they can gain competitive advantage by jumping in. Several are interested in a cloud offering for big data – mostly from a cost point of view. This is an area where SIs and SW vendors and analysts et al need to do a better job, in my opinion. There seems to be – in my limited visibility – a serious shortage of specific use cases to help companies make a business decision. There are a few – in healthcare, for example – where compelling arguments were made, and customers and vendors are partnering effectively. Given the investment needed for big data, evolutionary change might not make it look appealing to buyers. It needs to be revolutionary. And as my ex-manager used to tell me – almost every project that pays for itself will get funded, irrespective of the economy.

PS: If big data catches on big time, then we can seriously expect a boom in tech stocks across the board, since so many companies will benefit on the vendor side. The economy – at least in the history books – will probably thank big data for the good it did 🙂