Unlike data lake, I don’t think of Hadoop as a buzzword at all. It is a real thing – a real project you can touch and feel. Also, I love elephants (the one in the picture is the eighty-year-old Guruvayoor Padmanabhan, owned by a famous temple in Kerala) – which might explain why I have a particular soft spot for Hadoop 🙂. Hadoop is perhaps not yet ready to be an elephant this size, but it is not the little baby elephant shown in the marketing pictures either.
By now, it is stable enough for even an obsolete programmer like me to play with it with minimum fuss. In my day job, I don’t program any more (sadly – but probably a good thing for my team, and perhaps for the rest of humanity). But hardly a week passes without my needing to explain Hadoop to someone. Even when I was between jobs and enjoying a nice vacation, I was pulled into “help me understand Hadoop” discussions.
The primary problem for me personally is that the folks I usually have to explain Hadoop to are not always conversant with open source. I get to talk mostly to people who are smart enterprise technologists, but who have very little idea of open source per se. The vast majority of them equate open source with free software. And it is not as if open source is a good all-encompassing way to explain Hadoop (hello MapR). And if I somehow manage to get these folks to understand open source and a few licensing models, I lose them again on the idea of what a distribution is and what the differences between the distros are. I think it’s mostly a mental block that there are commercial vendors who make money off software that is supposedly free – and they all do it in different ways. Can’t blame them – I had those questions too for a long time.
Once you get past open source, then there is a question of what constitutes Hadoop.
No one in my circles apparently cares about Hadoop Common, so I no longer utter a word about it. The few times I have mentioned it, mostly to friends who come from coding backgrounds, I have had to deal with “why would anyone use Java for building this?” 🙂. Hence no “Common” – nothing but grief from that conversation.
Then I talk about HDFS and MapReduce. Enter more rat holes – why does the world need yet another file system – why not use GPFS? I have heard of Spark – so when do we use MapReduce and why not just use Spark all the time? If I get a breather, I also get to mention YARN and Arun’s explanation of the datacentre operating system, and that there is a MapReduce2. People get why YARN is a good idea almost instantaneously. Everyone appreciates the vision of pluggability – but invariably someone will ask about its compatibility with stuff that came before YARN.
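When someone asks what MapReduce actually does, a toy example sometimes helps me more than any slide. What follows is a rough sketch in plain Python, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in one process, whereas the real framework spreads the same three steps across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the list of counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big elephant", "Big elephant"]))
```

The point of the split is that each map call looks at one line and each reduce call looks at one key, so neither needs to know about the rest of the data. That is what lets the work be farmed out to many machines.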
This is usually where I start with the idea of different kinds of nodes, the mechanics of replication, why 3 is the default value, and why the whole thing is built for commodity servers daisy-chained together. You would think this is the easy part – but it is not. We are dealing with people who have spent their entire careers working on high-end servers that have all kinds of resiliency. It is a really hard thing for them to visualize the world Hadoop is built for. Usual starting questions include “if I just repurpose all my high-end servers, can I just avoid replicating thrice?” or “I already have tooling that takes care of HA and DR in all my data centers. Surely there are APIs that I can use to connect my existing tools to this Hadoop thing?”. This is a game of how many times you get to say “it depends” in any one conversation without taking a breath.
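For the “can I avoid replicating thrice?” crowd, a little back-of-the-envelope arithmetic tends to land better than theory. The sketch below is only that: hdfs_footprint is a hypothetical helper of mine, not any Hadoop API. It assumes the usual defaults of a replication factor of 3 and a 128 MB block size (the Hadoop 2.x defaults; your cluster may be configured differently), and it ignores rack placement and everything else that earns an “it depends”.

```python
import math


def hdfs_footprint(file_size_mb, replication=3, block_size_mb=128):
    """Rough storage arithmetic for one file under a given replication factor.

    Hypothetical helper for illustration only; it models nothing beyond
    block splitting and a flat number of copies per block.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        # Every block is stored `replication` times across the cluster.
        "block_replicas": blocks * replication,
        # Raw disk consumed is simply the logical size times the factor.
        "raw_storage_mb": file_size_mb * replication,
        # Copies of a block that can disappear before the block is lost.
        "copies_you_can_lose": replication - 1,
    }


if __name__ == "__main__":
    print(hdfs_footprint(1024))                 # default: three copies
    print(hdfs_footprint(1024, replication=1))  # the "high-end server" wish
```

With a single copy the raw storage drops to a third, but the number of copies you can afford to lose drops to zero, which is usually where the conversation about commodity hardware finally clicks.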
Just when I am ready to wind down the conversation – an impatient listener will ask “can you fill me in on Hive, HBase etc”. Sure, why not – so I explain how there are a bunch of other projects that play alongside Hadoop. “Are they all Java?” – well, they are not ALL Java, but ideally you should not have to worry, given they all have interfaces you can use relatively easily. I can see relief!
HBase seems to be a trigger for starting on the NoSQL part of this Odyssey. This is particularly so because my friends know I spent some time working at MongoDB. Sadly – as in REALLY sadly – 90% of my conversations include a part where I have to convince people that MongoDB is not Hadoop. And at a minimum at that point I have to touch on Cassandra to explain there are more NoSQL options out there. Invariably this opens up the question of “does MongoDB work with Hadoop” – thankfully it does and I explain the connector. It also usually leads my friends from an ops background to sigh aloud “can’t we just use a general purpose database that does all these things?”. I no longer fight them on this topic – mostly for lack of energy.
“So you have explained how data sits in Hadoop, but you have not explained how I can put it in there or how I get these awesome insights from that data”. Ah, I forgot all about that. So I go on to explain even more projects (at some point it is overwhelming to remember all the names and the right spellings) – and also manage to get in a plug that a lot of closed source tools can access Hadoop. This is usually the point at which my ops friends give up on me and say “good Lord, it is complicated to manage all this” and my dev friends get all excited that there are so many cool new toys to play with.
Usually the final question is on security. On the bright side, Ranger and Sentry are both easy names to remember. On the not too bright side – I don’t want to explain (or even know well enough to explain) why two major distros have two different approaches. And this usually leads to other examples like Impala to show that not all distros share the same implementation philosophy. I should use something other than Impala in future as my example – since that goes into “how many possible ways can you do SQL on Hadoop?”. My usual temptation is to say “more ways than enough for you to get a Costco size jar of aspirin” 🙂
I have not counted, but there are certainly more than 30 projects that can somehow be called a part of Hadoop. I have personally played with fewer than half of them. And every day more projects are starting and it is getting hard to keep up. No wonder my friends/customers/random people all get stressed out trying to understand Hadoop and how to make use of it. And yet – everyone is excited about their Hadoop journey, despite all its twists and turns. Marketers like my friend Ingrid, who recently joined Hortonworks as CMO, should have a fun time articulating a message that makes it much simpler to understand.
Alright, so who wants to ask me about Hadoop next?