There are really no data scientists in the wild ! 


There are statisticians, there are mathematicians, there are engineers, there are machine learning programmers, and there are many other types of experts out there – but there are really no data scientists out there in the wild! What exists are data science teams, and many are generally awesome. That is my conclusion after trying really hard to become a data scientist myself over the last few months. I am not giving up quite yet – but I am at a stage where I need to express my opinion on the matter, for what it's worth 🙂


I thought I had good odds to be a decent data scientist, at least on paper. I think I am a good programmer, and while I still think in C when I am coding, I can work in R and Python without sweating it too much. I am an engineer and have a degree in business – and I was convinced that I have more than an elementary capability in math and stats. I can do most data engineering work to get bad data into a shape that a machine can crunch. I spent a lot of time in BI – which made me believe I can visualize data really well. And so on. Yet I didn’t become a data scientist despite my honest efforts to become one, and I think I now know why.

Between my engineering/business background and a couple of decades in consulting, three things come naturally to me when I am faced with solving problems:

1. The classic MECE approach

2. Thinking about it from the client view and working back to what I can do

3. Trying to get to a solution from first principles so that I trust the output

On the flip side, when I cannot do a good job on any of these three things, I get extremely frustrated. And in this effort to become a data scientist, I stumbled on all three. I am also close to questioning the idea of calling this domain data science. It has more of an art feel to it – it's like a halfway point between an architect and an engineer, a bit weird. This could be an emotional response, so I am not going to make a fuss about it in this post.

As I played with it for a while – I understood that a few things need to come together for data science to work effectively for my clients, not necessarily in the linear fashion I call them out below.

  1. Define a problem in a way that it can be solved – some kind of designer/consultant type skill which I am generally good at, I thought. Turns out you just keep redefining the problem as you learn more.
  2. Create an abstraction – what programmers call “logic” or “algorithm” , and what math geeks call “model” . This needs a lot of “rinse and repeat” as I figured. I could have saved a lot of trouble if I started plotting data in simple dimensions first – a lesson I won’t ever forget.
  3. Find, clean and assemble data to feed into the model – the data engineering skill, and it becomes a challenge when data is really big. Analyzing data makes you wonder about your sampling strategy throughout. There are always gaps and it will make you say “maybe” or “it depends” as the answer to most questions.
  4. Figure out your model is crap, and explore alternatives endlessly. I realized I had forgotten how common substitutions worked in integral calculus and it took a lot of googling to get me comfortable on a first principles basis that what I am doing was sensible math. On the bright side my linear algebra skills are still sharp – but clearly that is not enough.
  5. Figure out what is worse – false negatives or false positives – and have a strategy to keep both down and to explain the few you will always get. This needs extremely good domain/industry knowledge, and the kind of assumptions you make can look comical when you run them by a real expert.
  6. Finally – you figure out a half decent solution, fully knowing you can’t be really sure. At this point – you need to figure out a way to tell the story, usually with visualization. Voila – your entire perspective on how to tell a story with data will change quickly. I always loved D3, but now we are soul mates.
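To make step 5 concrete, here is a minimal sketch of how the "which error is worse" decision can drive a model's cut-off. Everything in it is an assumption for illustration: the scores, the labels, the cost numbers and the function names are made up, and a real project would get the costs from a domain expert.

```python
def confusion_counts(scores, labels, threshold):
    """Count false positives and false negatives at a given score threshold."""
    # A score at or above the threshold is treated as a positive prediction.
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def cheapest_threshold(scores, labels, fp_cost, fn_cost):
    """Return the candidate threshold with the lowest total error cost."""
    candidates = sorted(set(scores))

    def total_cost(threshold):
        fp, fn = confusion_counts(scores, labels, threshold)
        return fp * fp_cost + fn * fn_cost

    return min(candidates, key=total_cost)


# Hypothetical model scores and true labels (1 = positive case).
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]

# If a missed positive is ten times worse than a false alarm, cut low.
print(cheapest_threshold(scores, labels, fp_cost=1, fn_cost=10))
# If a false alarm is ten times worse, cut high.
print(cheapest_threshold(scores, labels, fp_cost=10, fn_cost=1))
```

The same model gives two different "best" cut-offs depending only on the cost assumption, which is why this step cannot be done without the domain expert in the room.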

It is nearly impossible for one human being to be great at all these things – the best case is that you get to be really good at one or two, and have a solid appreciation of the rest. In other words, a bunch of experts in these areas can be brought together to form a great data science team. But it is just impossible for one person to have all these skills and hence be called a data scientist.

I also feel I should express my “amusement” about machine learning on two aspects before I end this rant.

  1. Depending on whose book you read, or who you talk to – you will think machine learning has two distinct flavors. One is a math flavor, and the other is a programming flavor. I have more developer friends than math geek friends – so I mostly got a math flavored “black box” answer every time I had that conversation. But the books I studied were mostly written by stats majors.
  2. The fact that a model is the right one does not mean that it performs well in production. You can sample ( I am staying away from my endless fights with bias, even for “simple” cases) and take smaller data sets to make your model work . But then you get the idea of running your logic against big hairy data – and suddenly you realize that your “black box” algorithms don’t all scale to work in parallel mode. I am now stuck in a debate with myself on whether a code rewrite , or a different math approach is better to crunch all the data I want.
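A small sketch of why the second point bites, under my own assumptions (toy data, made-up function names): some calculations can be split across chunks and merged afterwards, and some cannot. A mean merges cleanly from per-chunk sums and counts. A median does not merge from per-chunk medians, so it needs a different algorithm, not just more machines.

```python
from statistics import median


def partial_mean(chunk):
    """Summarize one chunk as (sum, count) so chunks can be merged later."""
    return (sum(chunk), len(chunk))


def merge_means(partials):
    """Combine per-chunk (sum, count) pairs into the overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = [1, 2, 3, 4, 100, 200, 300, 5, 6]
chunks = [data[0:3], data[3:6], data[6:9]]  # pretend each chunk lives on its own worker

# The mean survives the split: merging partial results gives the exact answer.
chunked_mean = merge_means([partial_mean(c) for c in chunks])
true_mean = sum(data) / len(data)

# The median does not: the median of the chunk medians is a different number.
median_of_medians = median(median(c) for c in chunks)
true_median = median(data)

print(chunked_mean, true_mean)
print(median_of_medians, true_median)
```

That is the "code rewrite or different math" debate in miniature: for the mean, a rewrite is enough; for the median, you have to change the approach (or accept an approximation).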

It's clear that stats majors and CS majors should really talk more and not let me be the one worrying about these kinds of problems. I am happy to buy the pizza and beer for you folks 🙂

PS : my dear friend Sameer who is the chief of Kahuna , showed this blog to his data science leader Andrew – and here is his  counterpoint . You should absolutely read this too – debates and strong opinions are good things !

When machines think on our behalf !


I don’t really think machines will displace humans in significant numbers for a long time – but I do think we have an interesting time ahead of us where we let machines think on our behalf quite a bit . 

Every company out there has rebranded itself as an AI company. The first generation of this falls broadly into two categories:

1. Telling an AI system what we want to do – order a coffee , close the curtain – pretty much call an available API to do something 

2. Use AI to learn and do something better – like switching from carpet-bombing marketing campaigns to better targeting

But this is just a temporary phase. Why would you want to ask your AI wizards to order coffee for your home – isn’t it better to let the machine reorder coffee when the supply drops to some level? Should it even ask you to confirm these kinds of routine activities, or just do it without asking? For about half the things I need routinely, I am totally cool with having a machine do it without asking me anything, especially about coffee. I get mad at myself when I forget to pick up coffee and I don’t have much left in the kitchen when I need it. I am sure I am not the only one who is ready to offload some routine activity to machines.

So this poses some interesting challenges, like: if my AI system is the one ordering groceries for me without my input, how do other coffee vendors ever get my business?

My wife already thinks I spend way too much on coffee. So she may be able to tell the AI system to limit my purchases to, say, $50 a month. So now my AI thingy needs to be coupon shopping and stuff to stay within budget – but that is easy, machines can do this math stuff better than us anyway.
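Here is a rough sketch of what that "math stuff" might look like. It is purely illustrative: the `Offer` type, the `pick_reorder` function, the vendor names and the prices are all hypothetical, and a real system would have far more to weigh than price.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    vendor: str
    price: float  # price per bag after coupons, in dollars


def pick_reorder(stock_bags, reorder_level, spent_this_month, monthly_budget, offers):
    """Return the cheapest affordable offer, or None if no reorder should happen."""
    if stock_bags > reorder_level:
        return None  # still enough coffee in the kitchen
    remaining = monthly_budget - spent_this_month
    affordable = [o for o in offers if o.price <= remaining]
    if not affordable:
        return None  # the budget cap wins, even over an empty shelf
    return min(affordable, key=lambda o: o.price)


offers = [Offer("Vendor A", 14.0), Offer("Vendor B", 11.5)]

# One bag left, $30 of a $50 budget already spent: reorder from the cheaper vendor.
choice = pick_reorder(stock_bags=1, reorder_level=1, spent_this_month=30.0,
                      monthly_budget=50.0, offers=offers)
print(choice)
```

Notice that the only way a vendor wins here is by showing up in `offers` with the lowest price, which is exactly why the vendors' attention will shift from me to whatever feeds that list.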

This makes me wonder: what is the future of marketing itself?

Simple – brands stop marketing to me, and instead they (as in their AI systems) will market to AI systems! And brands will do whatever they can to convince my AI system to feed me their coffee first, to increase the chance that my business becomes a reorder situation!

Well , guess what – this means we are in the “my AI is smarter than your AI” world at that point . The bright side – email spam reduces significantly for me as a human , and I have some more time on my hands . 

But this is not without its share of dilemmas too. For example, what if the AI provider for me and for the coffee company are one and the same, or if they are two companies that share my data? Am I going to be put in a situation where I am negotiating against myself? So we do need some clear guidelines on ethics, legality, security and even morality established before we get to that point.

We have a good grip on what happens when AI does smart stuff and humans are on the other side of it – like customer service, sales etc. But the thing that excites me the most is when both sides of a transaction are AI systems. I am betting it won’t take even 5 years for us to see this in the mainstream. Are you ready?

Future of Software Development 


There are so many angles to this topic – and this is my third attempt in three days to organize my thoughts around it . The trouble is not that I don’t have enough ideas – it is that most ideas seem to contradict each other when I read them again .  Let’s see if the third time truly is the charm 🙂


1. Everyone will NOT  be a (meaningful) programmer in future 

I know that is not a popular position to take today – but that is where my mind is at now . We will need to water down the definition of coding significantly to make “everyone is a coder” be a true statement . If I can change the tires and oil of a car  , and program the infotainment system – should I be called an automotive engineer ? That is “roughly” how “everyone will be a coder” sounds to me now . 

Don’t get me wrong – I do think that understanding how code works is important for everyone in the future. That doesn’t translate to everyone coding, at least in the traditional sense. Very complex applications exist today in MS Excel, created by business analysts who are not traditional programmers. If we change the definition of coding to include that kind of development, I can buy into “everyone will be a coder”. The closer statement – though less sexy – would be “everyone will be a designer or modeler”!

2. More code will be destroyed than created

The world around us is changing extremely fast, and that means we need lots of new kinds of applications. But the pace of change is such that no application can satisfy a customer for any length of time. Building better components and better orchestration mechanisms is the only way to deal with it. Neither concept is new – but the execution will kick into a higher gear. API designs will need a lot more flexibility than we are used to.

3. Performance will trump simplicity 

By simplicity, I mean what “humans” think of as “simple”, not machines. Code for tomorrow will be used more for machine to machine communication than for machine to human – by orders of magnitude. Creation of code itself might not need a lot of human help for that matter. And while maintainability and human readability are important today, they might get trumped by the need for extreme performance tomorrow. For example, if both ends of an interface are machines, why would they need to communicate in text and pay for the overhead of things like XML/JSON tags, where binary values get converted to text on one end and back to binary on the other?
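A quick sketch of that overhead, using only Python's standard `json` and `struct` modules. The sensor reading and the fixed binary layout are my own assumptions for illustration; real machine-to-machine formats make many more design choices (versioning, schemas, and so on).

```python
import json
import struct

# A hypothetical message one machine sends to another.
reading = {"sensor_id": 42, "temperature": 21.5, "humidity": 0.63}

# Human-friendly: field names and numbers are all spelled out as text.
text_payload = json.dumps(reading).encode("utf-8")

# Machine-friendly, assuming both ends agree on the layout up front:
# little-endian, one unsigned 32-bit integer followed by two 64-bit floats.
LAYOUT = "<Idd"
binary_payload = struct.pack(
    LAYOUT, reading["sensor_id"], reading["temperature"], reading["humidity"]
)

# The receiver gets the same values back without parsing any text.
decoded = struct.unpack(LAYOUT, binary_payload)

print(len(text_payload), "bytes as JSON")
print(len(binary_payload), "bytes as packed binary")
print(decoded)
```

The binary version is smaller and skips the text parsing, but nobody can read it without knowing `LAYOUT`. That is the trade: what is simple for the machine stops being simple for the human.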

4. You won’t need as much code in future 

A lot of code is written today because a human does all the thinking and tells computers what to do in very prescriptive ways, with conditions and loops and all that. When computers get to “general AI”, they will learn to think and adapt like humans – and won’t need step by step instructions to do what they do today. Less code will do a lot more than a lot of code does for us today. We may be decades away at most – we are not centuries away from that stage. Software will eat the world now, AI will eat software tomorrow 🙂

5. Software offshoring/outsourcing  will not be for development or maintenance – it will be for training 

It’s already possible for machines to learn from vast amounts of data. Some time in the far future, machines will self-train too. Till then – and that’s at least a decade or more – humans will need to train machines on data. And that will need to make use of local knowledge, labor arbitrage etc., and hence will be an ideal candidate for offshoring and outsourcing!

6. Community of developers will be the only thing that matters  

Well – that is already true, isn’t it? I can’t remember the last time I checked official documentation or real books to learn anything. I google or search on Stack Overflow to get most of what I need. I am not a full time developer – but looking at the communities that help me, I am sure full time developers do what I do, a lot more than I do 🙂. A better way of mining this treasure trove of information is the need of the hour to get significantly more engineering productivity.

7. More and more use of biological sensors 

Human bodies and brains are the ultimate computers, and we are some ways away from mimicking human thought. In the near future I expect simple sensors for all kinds of chemical and biological stuff (how cool would smell be as an input, like touch is today) that provide input to and also dictate how code should work. Text is fast becoming the most boring part of data anyway 🙂

8. We haven’t even scratched the surface of parallelism 

What we call massively parallel today will in all probability be amusing and funny to tomorrow’s programmers. The overhead of making parallelization work today is pretty high – and that will go away soon. A part of the problem is also that the majority of developers don’t think of parallelism when they design code. I guess the near term solution will be for more primitives in popular programming languages (like for data access) to have built-in parallelism. Note to self: get better at functional programming in 2017

9. Ethics and Privacy become core to all development 

A few things are happening together now

a) data is exploding and we are leaving our digital fingerprints everywhere

b) applications won’t stay around long enough to have ethics and privacy as a “future release” issue to be fixed

c) more and more software affects humans , but is controlled by machines with little human input 

d) access to information is (and will be ) universal – which means bad guys and good guys can both use it for what they want 

e) legal systems won’t ever catch up with the pace of tech innovation

And that means ethics, privacy etc. need to be core principles of tomorrow’s software. They cannot be “in pockets” as happens today. And the education on this topic needs to be pushed down to even the earliest levels of school.

9 is really not a conventional number of bullets for a list – but given there won’t be anything conventional about the future of software development , I think now would be a good time for me to stop this list . Feel free to add , edit and challenge in the comments – I look forward to it .

Happy 2017 everyone!