There are statisticians , there are mathematicians , there are engineers , there are machine learning programmers , and there are many other types of experts out there – but there are really no data scientists out there in the wild ! What exists are data science teams and many are generally awesome . That is my conclusion after trying really hard to become a data scientist myself over the last few months . I am not giving up quite yet – but I am at a stage where I need to express my opinion on the matter for what its worth 🙂
I thought I had good odds to be a decent data scientist, at least on paper. I think I am a good programmer and while I still think in C when I am coding , I can work on R and Python without sweating it too much . I am an engineer and have a degree in business – and I was convinced that I have more than an elementary capability in math and stats . I can do most data engineering work to get bad data into a shape that a machine can crunch. I spent a lot of time in BI – which made me believe I can visualize data really well. And so on . Yet , I didn’t become a data scientist despite my honest efforts to become one, and I think I now know why .
Between my engineering/business background a couple of decades in consulting – three things come naturally to me when I am faced with solving problems
1. The classic MECE approach
2. Thinking about it from the client view and working back to what I can do
3. Trying to get to a solution from first principles so that I trust the output
On the flip side, when I cannot do a good job on any of these three things, I get extremely frustrated. And in this effort to become a data scientist, I stumbled on all three. I also am close to questioning the idea of calling this domain as data science . It has more of an art feel to it – its like a half way point of an architect and an engineer, a bit weird. This could be an emotional response, so I am not going to make a fuss about it in this post.
As I played with it for a while – I understood that a few things need to come together for data science to work effectively for my clients, not necessarily in the linear fashion I call them out below.
- Define a problem in a way that it can be solved – some kind of designer/consultant type skill which I am generally good at, I thought. Turns out you just keep redefining the problem as you learn more.
- Create an abstraction – what programmers call “logic” or “algorithm” , and what math geeks call “model” . This needs a lot of “rinse and repeat” as I figured. I could have saved a lot of trouble if I started plotting data in simple dimensions first – a lesson I won’t ever forget.
- Find, clean and assemble data to feed into the model – the data engineering skill, and it becomes a challenge when data is really big. Analyzing data makes you wonder about your sampling strategy throughout. There are always gaps and it will make you say “maybe” or “it depends” as the answer to most questions.
- Figure out your model is crap, and explore alternatives endlessly. I realized I had forgotten how common substitutions worked in integral calculus and it took a lot of googling to get me comfortable on a first principles basis that what I am doing was sensible math. On the bright side my linear algebra skills are still sharp – but clearly that is not enough.
- Figure out what is worse – false negatives or false positives, and have a strategy to not have too many and how to explain the few you will always get. This needs extremely good domain/industry knowledge and the kind of assumptions you make can be comical when you run it by a real expert
- Finally – you figure out a half decent solution, fully knowing you can’t be really sure. At this point – you need to figure out a way to tell the story, usually with visualization. Voila – your entire perspective on how to tell a story with data will change quickly. I always loved D3, but now we are soul mates.
It is nearly impossible for one human being to be great at all these things – the best case is that you get to be really good at one or two, and have a solid appreciation of the rest. In other words – a bunch of such experts in these areas together can be brought together to form a great data science team. But it is just impossible to have one person have all these skills and hence be called a data scientist.
I also feel I should express my “amusement” about machine learning on two aspects before I end this rant.
- Depending on whose book you read, or who you talk to – you will think machine learning has two distinct flavors. One is a math flavor, and the other is a programming flavor. I have more developer friends than math geek friends – so I mostly got a math flavored “black box” answer every time I had that conversation. But the books I studied were mostly written by stats majors.
- The fact that a model is the right one does not mean that it performs well in production. You can sample ( I am staying away from my endless fights with bias, even for “simple” cases) and take smaller data sets to make your model work . But then you get the idea of running your logic against big hairy data – and suddenly you realize that your “black box” algorithms don’t all scale to work in parallel mode. I am now stuck in a debate with myself on whether a code rewrite , or a different math approach is better to crunch all the data I want.
Its clear that stats majors and CS majors should really talk more and not let me be the one worrying about these kinds of problems . I am happy to buy the pizza and beer for you folks 🙂
PS : my dear friend Sameer who is the chief of Kahuna , showed this blog to his data science leader Andrew – and here is his counterpoint . You should absolutely read this too – debates and strong opinions are good things !