It does not matter what vendors say about big data – at the end of the day, customers need to adopt it and expand their use of it for big data to live up to its promise. If they don’t adopt, big data will go the same way as many other passing fads that had the potential to change the world, but never quite did.
Big data is just data – we should not EVER forget that! And customers have had varying degrees of success with managing and using data in the past. So everything that applies to data, like ingestion, acceleration, quality and analytics, also applies to big data. What is different is just the degree of magnitude, complexity and predictability.
I like to think of big data the way I think of crude oil.
[image courtesy of SAP]
Crude is very valuable – but not in its native form. A lot of specialized processing, storage and so on is needed before it shows up at the gas station where we empty our wallets to fill the tanks of our cars. We don’t need to worry about what happens to the oil before it reaches the gas station – we just need it in a form that we can use.
Big data should also be thought about along these lines. It may be huge, fast and furious – but a user does not have to care about that. Management of big data should not be a headache for the user.
So clearly, we need two things to make big data click.
1. We need a platform that can do the heavy lifting and shifting and all that.
2. We need applications that users can make use of without ever knowing a thing about the platform behind them.
Let's talk about platforms first. The 3V (or 4V, or 16V) model needs to be kept in mind when we think of a platform. Heavy lifting is a given, and it needs to be done without breaking the bank. I strongly believe that the success of any platform is defined by the number of apps built on it, and the number of developers making a living out of it. If that ecosystem stickiness does not happen – the rest does not matter, and such a platform should not continue to exist.
1. We need to get a lot of data into the platform
The price of storage is generally coming down – and not all data is needed all the time. So the smart thing to do is to put data in a storage tier where the price-performance tradeoff works for you. If you don’t need the data in sub-second time, you probably don’t need to store it in the most expensive storage tier. Platforms should be able to intelligently figure out where data should reside, with the idea that a human administrator can tweak it as needed.
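To make the tiering idea concrete, here is a minimal sketch in Python. The tier names, latency figures and relative costs are made up for illustration and do not describe any real product. The `override` argument stands in for the administrator's tweak.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    latency_ms: float   # assumed typical read latency
    cost_per_tb: float  # assumed relative cost, arbitrary units


# Hypothetical tiers, ordered from most to least expensive.
TIERS = [
    Tier("in_memory", 1, 100.0),
    Tier("ssd", 50, 20.0),
    Tier("disk", 2_000, 5.0),
    Tier("archive", 3_600_000, 1.0),
]


def choose_tier(max_latency_ms: float, override: Optional[str] = None) -> Tier:
    """Pick the cheapest tier that still meets the latency requirement."""
    if override is not None:
        # A human administrator can pin data to a specific tier.
        return next(t for t in TIERS if t.name == override)
    candidates = [t for t in TIERS if t.latency_ms <= max_latency_ms]
    if not candidates:
        # Nothing is fast enough, so fall back to the fastest tier available.
        return min(TIERS, key=lambda t: t.latency_ms)
    return min(candidates, key=lambda t: t.cost_per_tb)
```

With these assumed numbers, data that must come back within 100 ms lands on the SSD tier rather than in memory, because SSD is the cheapest tier that is still fast enough.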
2. Platform should have data quality and governance abilities
Most data warehouses have at least 3X or 4X duplication of data – and this applies to both transactional and master data. This might be ok when we are talking about a few TB of data. But when it is in the tens of petabytes, this is a serious issue to be dealt with. Big data will magnify data quality issues if they are not taken care of adequately.
3. Platform should have the ability to ingest data at various speeds
When data is coming in fast and furious – the platform should be able to deal with it. Some of the high speed data might need immediate action – and this needs to be treated differently from other data that comes in at the same speed, but doesn't need to be acted upon in real time. For example, stock market data in many cases loses relevance if you don’t act on it right away, but social media data can probably wait a little bit before someone makes sense of it and reacts. Another use case might need some social media data to be responded to immediately as well. So the platform should be able to handle all such use cases. Also, not all data needs to be ACID. Eventual consistency is probably ok for the vast majority of data.
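As a rough illustration of how duplication could be measured, the Python sketch below computes a duplication factor as total records divided by distinct records. It only catches exact duplicates (same fields, same values), which is an assumption made to keep the example short; real data quality work also has to deal with near-duplicates. The function names are hypothetical.

```python
import hashlib
import json
from typing import Iterable


def record_fingerprint(record: dict) -> str:
    """Hash a record so that key order does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def duplication_factor(records: Iterable[dict]) -> float:
    """Return total records / distinct records (0.0 for empty input)."""
    total = 0
    seen = set()
    for record in records:
        total += 1
        seen.add(record_fingerprint(record))
    return total / len(seen) if seen else 0.0
```

A factor of 3.0 would correspond to the "3X duplication" described above.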
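The split between "act now" and "act later" can be sketched as a small router in Python. Everything here is an assumption for illustration: the source names, the `urgent` flag that lets a use case escalate a single event, and the in-process queue standing in for whatever batch store a real platform would use.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Event:
    source: str
    payload: dict
    urgent: bool = False  # lets a use case escalate an individual event


# Hypothetical default policy: sources that always need immediate action.
REALTIME_SOURCES = {"stock_tick"}


class Ingestor:
    def __init__(self, on_realtime: Callable[[Event], None]) -> None:
        self.on_realtime = on_realtime
        self.batch_queue: Deque[Event] = deque()

    def ingest(self, event: Event) -> None:
        """Handle urgent events immediately and queue everything else."""
        if event.urgent or event.source in REALTIME_SOURCES:
            self.on_realtime(event)
        else:
            self.batch_queue.append(event)

    def drain_batch(self) -> List[Event]:
        """Return and clear the events waiting for later processing."""
        drained = list(self.batch_queue)
        self.batch_queue.clear()
        return drained
```

Both kinds of data arrive through the same `ingest` call at the same speed; only the handling differs, which is the point made above.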
4. Platform should support different analytics requirements
Speed of response, type of analysis, degree of precision and so on differ in each big data use case for analytics. A platform needs to be able to deal with all of these differences.
5. Platform should be able to evolve as technology improves
As better techniques and technology come up, the platform should be able to make use of them without disrupting users. This is especially true for big data given the speed at which innovation is happening in hardware, software and academia. It is a non-trivial challenge – and the primary reason I believe that big data and cloud need to converge quickly.
6. Platform should have a mix of commodity resiliency and enterprise resiliency
Some parts of the data need high availability and disaster recovery (say, billing data), but some others might not need it (say, click streams). So the platform should be able to provide appropriate resiliency according to the use case. HA and DR are not the whole story – similar principles apply to security, encryption and so on.
7. Platforms should allow both read and write operations in an optimized fashion
When people think of big data – they mostly think of the read part, as in analytics. While this is close to reality, we should not forget that analysis is useful only when we can act on it. And acting on it usually needs the platform to do some writes as well. This should be accomplished without forcing the user to jump from one application to another.
8. Platforms should enable ease of building applications and extensions
All platforms should have this developer friendliness in mind – but when it comes to a big data platform, it is not just technology friendliness that will cut it. These platforms also need to be data scientist friendly. While there is some overlap between technology developers and data scientists today – for the most part, these are distinct skills now and will take time to converge.
Of course it is not an exhaustive list, but hopefully I have hit most of the important aspects. So, let's move on to applications.
Applications are the make or break of adoption. Applications are what users touch and feel and relate to. And hence, for big data to catch on – we do need to shield the majority of users from the complexity of the platform side.
1. The characteristics that make a good app do not change just because the app is built on a big data platform.
2. Apps should aim to provide precision and context – not one or the other. For example, you need to know exactly how much is to be collected from a customer for a sale. But this needs to be put in the context of other useful information like historical payment behavior, other large deals pending with the customer, social sentiment about the customer and so on.
3. Apps should be extensible as the business environment evolves, just as the platform should evolve when technology changes. This is also the main reason why big data needs both platform and applications and not one or the other.
4. Apps should be easy to deploy and consume. If big data eventually does not catch on – my bet will be on deployment difficulties as the root cause. And of course it is yet another reason why I like the idea of big data and cloud converging.
OK, so that was way more than what I wanted to blog. But two back-to-back meetings got cancelled and I just took the liberty to make full use of that 🙂