Big data and systems that need to go around in circles


I had a very interesting conversation with my seatmate on the flight to SFO early today. He works as a manager at an airline and rose through the ranks. He is currently exploring ways of “learning faster”, since his team is having a hard time with baggage handling and they want it fixed yesterday.

He had a big data book in his hand, which is why I started the conversation with him, and I am glad I did.

Here is his situation in a nutshell. He can get pretty much any report he wants on any data that affects his work. It is not real time, and that is fine by him. He needs printed reports that his team can carry around in their pockets and check periodically. Essentially, real time is not an issue and neither is mobility.

However, when a problem occurs, he has to trust his memory to recall when it last happened and what he did to solve it. None of his reports give him that information.

In his words, this makes him go in circles while his information systems go in a straight line. He wants a solution that will go in circles with him, in perfect sync. Why circles? Because, he says, almost every problem he encounters is a repeat of one that has been solved before, in good ways and bad. He has a very busy job and doesn’t have the time to sit down and document solutions in a Word document. His question is: “Why can’t the system figure out that it is a repeated problem and tell me everything that has been done before, and what worked and what did not?”

I asked him how he ended up with the big data book. He said his wife is a college librarian, and she told him that “big data” is the fancy technology that solves all kinds of complex problems, and found him this book. He also said it is the first time in more than a decade that he is reading a book other than an airline manual.

It was an eye-opening conversation for me, to say the least. I wonder how many such employees and managers exist in the world today, whose IT systems cannot help them solve real-life problems the way they want to. This is not just a big data problem; the whole IT industry needs a reality check.

At the very least, it is a good proof point that yesterday’s solutions won’t solve today’s and tomorrow’s problems. It is time for us to stop thinking in abstract ways with a lot of jargon thrown in, and just go help solve real problems.

Big Data: Platform vs Applications


It does not matter what vendors say about big data. At the end of the day, customers need to adopt it and expand their use of it for big data to live up to its promise. If they don’t adopt it, big data will go the same way as many other passing fads that had the potential to change the world but never quite did.

Big data is just data, and we should not EVER forget that! Customers have had varying degrees of success with managing and using data in the past. So everything that applies to data, like ingestion, acceleration, quality, analytics etc., also applies to big data. What is different is just the degree of magnitude, complexity and predictability.

I like to think of big data the way I think of crude oil.

[image courtesy of SAP]

Crude is very valuable, but not in its native form. A lot of specialized processing, storage etc. is needed before it shows up at the gas station where we empty our wallets to fill the tanks of our cars. We don’t need to worry about what happens to the oil until it reaches the gas station; we just need it in a form we can use.

Big data should also be thought about along these lines. It may be huge, fast and furious, but a user does not have to care about that. Managing big data should not be a headache for the user.

So clearly, we need two things to make big data click.

1. We need a platform that can do the heavy lifting and shifting and all that.

2. We need applications that users can make use of without ever knowing a thing about the platform behind them.

Let’s talk about platforms first. The 3V (or 4V or 16V) model needs to be kept in mind when we think of a platform. Heavy lifting is a given, and it needs to be done without breaking the bank. I strongly believe that the success of any platform is defined by the number of apps built on it and the number of developers making a living out of it. If that ecosystem stickiness does not happen, the rest does not matter, and such a platform should not continue to exist.

1. We need to get a lot of data into the platform

The price of storage is generally coming down, and not all data is needed all the time. So the smart thing to do is to put data in a storage tier where the price-performance tradeoff is acceptable to you. If you don’t need the data in sub-second time, you probably don’t need to store it in the most expensive storage tier. Platforms should be able to intelligently figure out where data should reside, with the idea that a human administrator can tweak it as needed.

2. Platform should have data quality and governance abilities

Most data warehouses have at least 3X or 4X duplication of data, and this applies to both transactional and master data. This might be OK when we are talking about a few TB of data. But when it is in the tens of petabytes, it is a serious issue that must be dealt with. Big data will magnify data quality issues if they are not taken care of adequately.

3. Platform should have the ability to ingest data at various speeds

When data is coming in fast and furious, the platform should be able to deal with it. Some of the high-speed data might need immediate action, and this needs to be treated differently from other data that arrives at the same speed but doesn’t need to be acted upon in real time. For example, stock market data in many cases loses relevance if you don’t act on it right away, but social media data can probably wait a little before someone makes sense of it and reacts. Another use case might need some social media data to be responded to immediately as well. So the platform should be able to handle all such use cases. Also, not all data needs to be ACID. Eventual consistency is probably OK for the vast majority of data.

4. Platform should support different analytics requirements

Speed of response, type of analysis, degree of precision etc. differ in each big data use case for analytics. A platform needs to be able to deal with all of these.

5. Platform should be able to evolve as technology improves

As better techniques, technology etc. come up, the platform should be able to make use of them without disrupting users. This is especially true for big data, given the speed at which innovation is happening in hardware, software and academia. It is a non-trivial challenge, and the primary reason I believe that big data and cloud need to converge quickly.

6. Platform should have a mix of commodity resiliency and enterprise resiliency

Some data needs high availability and disaster recovery (say, billing data), but other data might not (say, click streams). So the platform should be able to provide the appropriate resiliency for each use case. HA and DR are not enough; similar principles apply to security, encryption etc.

7. Platforms should allow both read and write operations in an optimized fashion

When people think of big data, they mostly think of the read part, as in analytics. While this is close to reality, we should not forget that analysis is useful only when we can act on it. And acting on it usually needs the platform to do some writes as well. This should be accomplished without forcing the user to jump from one application to another.

8. Platforms should enable ease of building applications and extensions

All platforms should have this developer friendliness in mind, but when it comes to a big data platform, it is not just technology friendliness that will cut it. These platforms also need to be data scientist friendly. While there is some overlap between technology developers and data scientists today, for the most part these are distinct skills now and will take time to converge.
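
To make points 1 and 3 in the list above a little more concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions: the tier names, the latency numbers, and the names Event, choose_tier and route are all made up for this example and do not come from any real product. The idea is simply that placement follows from how fast you need the data back, and that urgency is decided per use case rather than per data source.

```python
from dataclasses import dataclass

# Hypothetical storage tiers, ordered from most to least expensive.
# The read latencies (in seconds) are placeholders, not measurements.
TIERS = [
    ("memory", 0.001),
    ("ssd", 0.1),
    ("disk", 5.0),
    ("archive", 3600.0),
]


def choose_tier(acceptable_latency_seconds: float) -> str:
    """Pick the cheapest tier that still meets the acceptable read latency."""
    chosen = TIERS[0][0]  # if nothing is fast enough, fall back to the fastest tier
    for name, tier_latency in TIERS:
        if tier_latency <= acceptable_latency_seconds:
            chosen = name  # later entries are cheaper, so keep the last match
    return chosen


@dataclass
class Event:
    source: str            # e.g. "stock_tick" or "social_media"
    act_immediately: bool  # set by the use case, not by the source


def route(event: Event) -> str:
    """Send urgent events down a streaming path and everything else to batch."""
    return "stream" if event.act_immediately else "batch"


if __name__ == "__main__":
    print(choose_tier(0.5))                      # sub-second is needed
    print(choose_tier(86400.0))                  # a day is fine
    print(route(Event("stock_tick", True)))      # act right away
    print(route(Event("social_media", False)))   # can wait a little
    print(route(Event("social_media", True)))    # same source, urgent use case
```

A real platform would of course learn these thresholds and let an administrator override them; the hard-coded table just stands in for that.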

Of course it is not an exhaustive list, but hopefully I have hit most of the important aspects. So, let’s move on to applications.

Applications make or break adoption. Applications are what users touch and feel and relate to. Hence, for big data to catch on, we need to shield the majority of users from the complexity of the platform side.

1. What makes a good app does not change just because the app is built on a big data platform.

2. Apps should aim to provide precision and context, not one or the other. For example, you need to know exactly how much is to be collected from a customer for a sale. But this needs to be put in the context of other useful information like historical payment behavior, other large deals pending with the customer, social sentiment about the customer and so on.

3. Apps should be extensible as the business environment evolves, just as the platform should evolve when technology changes. This is also the main reason why big data needs both platform and applications, and not one or the other.

4. Apps should be easy to deploy and consume. If big data eventually does not catch on – my bet will be on deployment difficulties as the root cause. And of course it is yet another reason why I like the idea of big data and cloud converging.

OK, so that was way more than I wanted to blog. But two back-to-back meetings got cancelled and I just took the liberty of making full use of that 🙂

What on earth do you mean by CONTEXT?


I have to thank Frank Scavo for making me think harder about what context means. Several people I know and I use the term liberally, and perhaps not very consistently.

Here is my hypothesis:

The answer to every question has a core (which has great precision) and a context (less precise, but without it the core cannot be meaningfully interpreted).

1. Additional questions may be needed to get context

If all I ask you on the phone is “should I turn right or left to reach your office”, you will probably ask me something in return, like “are you coming from the north or the south?”. Right or left is a precise answer, but what is on my right might be on your left. Without this extra information, you cannot help me with a precise answer.

2. You can infer all or part of the context from historical information

Maybe you know from your morning commute that I could not possibly be driving from the south side on that street, given that side of the road is blocked for construction. So you can give me a precise left or right answer without asking me anything further.

3. Context can change with time

Perhaps turning right gives the shortest distance to your office, yet you might ask me to turn left since you know the rush hour traffic going on now will slow me down. If I had asked you two hours later, you could have given me the exact opposite answer and still been correct.

4. Multiple things together might be needed to provide context

It is very seldom that one extra bit of information is all you need to make a determination. If it was raining when I called you during rush hour, you might have asked me to take a left turn so that I would get covered parking and a shuttle ride to your office. On a sunny day, you could have pointed me to an open lot from where I could have walked a short distance to reach you.

5. Context is progressively determined

As the number of influencing factors increases, you have to determine trade-offs progressively to arrive at a useful context. You might know exactly all the right questions to ask to give me the best answer, but if you were pressed for time, you could have given me an answer without considering the entire context. It would have been precise, but probably of limited use to me.

6. Context is user dependent

If I reached your assistant instead of you, she would probably need a whole different context before she could tell me which way to turn. She might never have taken the route you take to work, and hence might not have seen that the southbound traffic is closed off. She might not have realized it is raining outside, given she was in meetings all day.

If I am your vendor and you know I am coming there to make a pitch that you have limited interest in, you probably won’t think through all the contextual information. If I am your customer, maybe you will go out of your way to tell me not just to turn right, but also that the particular turn comes 100 yards after the big grocery store I will find on my right.

7. More information does not always lead to better context

If I overloaded you with information, you probably could not figure out all the trade-offs in the few seconds you have before responding. Your best answer might not be optimal. And if you take very long to respond, I might pass the place to make the turn and then have to track back, making it needlessly harder for both of us.

8. Context may be more useful than precision

Instead of giving me a precise left or right answer, you might tell me to park in front of the big train station and wait for your company shuttle to pick me up. That was not the precise answer to my question, but it was still more useful to me.

This was just a simple question with only two possibilities as precise answers. Think of a question in a business scenario. “How are our top customers doing?” is a common question that you can hear at a company. However, you can’t answer that question in any meaningful way without plenty of context.

The eventual precise answer is “good” or “bad”. What makes the question difficult is that it could mean a lot of different things.
1. What is a top customer? Most volume? Most sales? Most profit? Longest history with the company? Most visible in the industry? Largest market cap?
2. Who is asking? The CMO and the CFO might not have the same idea of what makes a top customer.
3. How many of your customers should you consider as top customers?
And so on.
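
To show how much the answer moves with the context, here is a tiny Python sketch. Everything in it is an assumption made for illustration: the customer records and their figures are invented, and the idea that a CMO ranks by sales while a CFO ranks by profit is just one plausible reading of the questions above, not a rule.

```python
# Invented customer records; the numbers mean nothing beyond this example.
CUSTOMERS = [
    {"name": "A", "sales": 900, "profit": 50, "years": 2},
    {"name": "B", "sales": 400, "profit": 120, "years": 10},
    {"name": "C", "sales": 700, "profit": 90, "years": 5},
]

# Assumed mapping from who is asking to what "top" means for them.
DEFINITIONS = {
    "CMO": lambda customer: customer["sales"],
    "CFO": lambda customer: customer["profit"],
}


def top_customers(asker: str, how_many: int) -> list[str]:
    """Rank customers by the asker's definition of "top" and keep the first few."""
    score = DEFINITIONS[asker]
    ranked = sorted(CUSTOMERS, key=score, reverse=True)
    return [customer["name"] for customer in ranked[:how_many]]


if __name__ == "__main__":
    print(top_customers("CMO", 2))  # ranked by sales
    print(top_customers("CFO", 2))  # ranked by profit
```

The same data and the same question give two different lists of “top customers”, and only then can anyone start deciding whether they are doing “good” or “bad”.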

Information systems in the majority of companies do not have the ability to collect the context of a question. Hence they cannot reliably give useful answers without a human user doing most of the thinking and combining various “precise” answers to arrive at a “useful” answer.

That is a long winded way of saying “context is what makes precision useful”.

OK, I am done. Let me know if this makes any sense at all and, more importantly, whether it resonates with your idea of what context means.