GenAI will need a whole new look at Data Governance!


There are two areas that I think will be the “make or break” criteria for Generative AI:

1. MLOps and

2. Data governance

And between the two – I think data governance will be the one that gets enterprise attention first, and real quick. This is because I think the first hurdle will be to make sure enterprise users trust GenAI – and that’s a high bar in itself. I will park my thoughts on MLOps for now.

The size of the model is probably less important for enterprise uses – most tasks that AI can help with in an enterprise context are narrow in scope. This is generally a good thing. Big models are expensive to train, and at inference time an enterprise will probably never use everything a big model was built to do.

Even if we look at a complex end-to-end process in an enterprise context – it probably makes more sense to have a series of specific models that work together, instead of one big model that covers everything. We don’t need the model that answers questions on purchase orders to also write an essay on the meaning of life 🙂

I am well aware that talking about the cost of a new technology instead of its innovation goodness is uncool – but having lived my whole career in large Enterprise land, I am quite sure that if GenAI is to scale in adoption – it has to have a low cost base. Enterprises might even live with a lower quality of responses if the cost is right. I am only half kidding here 🙂

To make smaller models (which are cheaper) really useful – enterprises will need very high quality data to fine tune them with. For a narrow scope, enterprises generally will have data with enough tokens to make it useful (product manuals, customer complaints, procedures, laws, invoices etc.). The only question is whether such data is governed in some systematic way so that the information can be trusted to be of high quality.
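
To make that concrete, here is a minimal sketch of the kind of quality gate I have in mind before any of this data reaches fine tuning: deduplicate, drop fragments too short to carry real signal, and keep track of where each record came from. The field names and thresholds are purely illustrative, not a prescription.

```python
# A minimal sketch of a pre-fine-tuning quality gate: deduplicate records,
# drop fragments that are too short to be useful, and keep a note of where
# each record came from. Field names and thresholds are illustrative only.
import hashlib

def quality_gate(records, min_chars=200):
    """records: iterable of dicts like {"text": ..., "source": ...}."""
    seen_hashes = set()
    kept = []
    for rec in records:
        text = rec.get("text", "").strip()
        if len(text) < min_chars:          # too short to carry real signal
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:          # exact duplicate, skip it
            continue
        seen_hashes.add(digest)
        kept.append({"text": text, "source": rec.get("source", "unknown")})
    return kept

sample = [
    {"text": "Purchase orders above 10,000 EUR require two approvals. " * 10,
     "source": "procurement_manual_v3"},
    {"text": "short note", "source": "email"},
]
print(len(quality_gate(sample)))  # -> 1 after filtering and dedup
```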

Data quality is largely an unsolved problem even for the much simpler world of data warehouses, which have been around for decades now. It has almost never attracted enough budget and time in most companies. A big reason why data lakes didn’t yield the planned business value is also that people didn’t trust the data to be of high quality. We will see what fate awaits lakehouse approaches – but I am always optimistic. These things generally improve over time.

Size of the available data to train and fine tune might actually not be as big a problem as the quality of that data. More data that looks the same doesn’t really make the models that use it any better. After reading the Chinchilla paper, I am sure we will keep massively improving the ratio of training data to model size. DeepMind’s approach is radically more efficient than the original GPT-3 recipe, and it only took a couple of years to get there.
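
For a sense of the scale involved, here is a back-of-envelope sketch. The numbers are the rough ones I recall from the two papers (GPT-3 at around 175B parameters trained on roughly 300B tokens, and Chinchilla’s rule of thumb of roughly 20 training tokens per parameter), so treat them as approximations rather than exact figures.

```python
# Back-of-envelope comparison of training-token budgets, using rough numbers
# recalled from the papers: GPT-3 (~175B params, ~300B tokens) versus the
# Chinchilla rule of thumb of roughly 20 training tokens per parameter.
def chinchilla_optimal_tokens(params, tokens_per_param=20):
    """Approximate compute-optimal token count for a given parameter count."""
    return params * tokens_per_param

gpt3_params = 175e9
gpt3_tokens = 300e9
print(f"GPT-3 ratio: ~{gpt3_tokens / gpt3_params:.1f} tokens per parameter")

# A small 7B-parameter model trained Chinchilla-style:
small_params = 7e9
print(f"7B model wants ~{chinchilla_optimal_tokens(small_params) / 1e9:.0f}B tokens")
```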

There are two complementary approaches I can think of for how an enterprise will source data for fine tuning (assuming they start from a model that someone else spent money on training):

1. Establish a consistent data governance process and tooling, and use high quality, trusted data to fine tune the models, and/or

2. Depend on the LLM itself to create high quality data (self-instruct, using one LLM to create data for another, having human users curate LLM-generated data etc.) – like in a chatbot type use case where a human expert can correct an AI answer and let the system learn from it.
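
As a rough illustration of the second approach in its human-in-the-loop form, here is a sketch of the shape of that loop. The draft_answer() function is a hypothetical stand-in for whatever LLM is actually in use; the only point is that the expert-corrected pair gets captured as future fine tuning data.

```python
# Sketch of the "let a human expert correct the AI and learn from it" idea.
# draft_answer() is a hypothetical wrapper around whichever LLM is in use;
# the point is only the shape of the loop: generate, review, keep the
# corrected pair as a future fine-tuning example.
import json

def draft_answer(question: str) -> str:
    # Placeholder for a real model call (a hosted API, an in-house model, etc.).
    return f"[draft answer to: {question}]"

def curate(question: str, reviewed_answer: str, dataset_path: str = "curated.jsonl"):
    """Store the expert-approved (question, answer) pair for later fine tuning."""
    record = {"prompt": question, "completion": reviewed_answer}
    with open(dataset_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

question = "What is the approval threshold for purchase orders?"
draft = draft_answer(question)
# A human expert reviews and corrects the draft before it is stored.
expert_version = "Purchase orders above 10,000 EUR require two approvals."
curate(question, expert_version)
```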

Fine tuning is only one part of why I think data governance will get a lot of attention. There is an “everyday” need that will come up frequently when the model is used – people (users, auditors, regulators…) will all ask for proof of where the data behind a GenAI answer came from.
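
One way to make that question answerable is to log a provenance record alongside every generated answer. This is only a sketch, and the fields are my own illustration rather than any standard:

```python
# Illustrative provenance record attached to every generated answer, so that
# "where did this data come from?" has an auditable answer. The fields here
# are a sketch, not a standard.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    question: str
    answer: str
    source_document_ids: list      # documents retrieved / cited for the answer
    model_version: str             # which model (and fine tune) produced it
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = ProvenanceRecord(
    question="What is the approval threshold for purchase orders?",
    answer="Purchase orders above 10,000 EUR require two approvals.",
    source_document_ids=["procurement_manual_v3#section-4.2"],
    model_version="in-house-ft-2024-01",
)
print(asdict(record))
```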

GenAI has an additional headache beyond the data used for training and fine tuning – users might feed it inappropriate data! That’s another thing that needs to be governed – probably more heavily in regulated industries, and wherever IP, privacy etc. need to be kept in mind at every step.
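
A very rough sketch of what governing that input might look like: redact obvious personal identifiers before a prompt ever leaves the enterprise boundary. Real deployments would use proper PII detection and policy controls rather than a couple of regexes, so read this as the shape of the idea only.

```python
# A very rough sketch of governing what users feed the model: redact obvious
# personal identifiers before a prompt ever leaves the enterprise boundary.
# Real deployments would use proper PII detection, not two regexes.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD_LIKE = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # crude card/account pattern

def redact(prompt: str) -> str:
    prompt = EMAIL.sub("[EMAIL]", prompt)
    prompt = CARD_LIKE.sub("[NUMBER]", prompt)
    return prompt

user_prompt = "Summarise the complaint from jane.doe@example.com about card 4111 1111 1111 1111"
print(redact(user_prompt))
# -> "Summarise the complaint from [EMAIL] about card [NUMBER]"
```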

There are two things to think about carefully here – the process of data governance itself, and the tooling and automation of it. I am less worried about the tooling part in relative terms – I am just not sure yet if enterprises have thought through all these “fringe” aspects of GenAI compared to all the cool applications they are excited about. If they don’t find the time and budget to get it done right – it will be a lot of grief to deal with.

GenAI in the enterprise – nine themes that I have seen so far


Ever since ChatGPT became a thing – I haven’t had a week pass by without having GenAI conversations with clients. It’s truly been a fascinating time to be a technologist.

There have been 3 times in the past 25 years when I have seen this kind of massive interest in being a first mover:

1. When ERP helped consolidate applications

2. When Data warehousing became mainstream

3. When mobile and cloud converged

I work in Financial Services – which adds its own layer of flavour that makes all opportunities and challenges a bit more spicy 🙂

Here are nine broad themes that I have noticed so far from the conversations I have had with FS companies.

1. Risk mitigation vs First mover

FS companies pride themselves on being, first and foremost, the best risk managers (which is a very good thing for consumers). So “what can go wrong” has been front and center in GenAI plans. FS companies also know that their primary competitive advantage is in data, and they want to be the first to capitalize on it. This push/pull tension is common in how they operate even for mainstream innovation – but GenAI has taken over as the lead theme for now, with public cloud adoption perhaps a close second.

2. Privacy

All FS companies handle highly sensitive and personal data. There are tight restrictions on what can and cannot be done – and thankfully this industry thinks through this carefully. Between the legal and ethical issues at play – the risk of getting this wrong is apparent to everyone, and hence a lot of thought goes into mitigating it. How they solve it is not consistent across the industry – and a unified approach that is both efficient and effective is much needed. Otherwise a lot of GenAI innovation just won’t happen at scale.

3. Buy vs Build

The larger banks (of all kinds) have hired great tech talent, including in AI. While it is obviously great to have such people – it also means a lot of time and money is spent on building everything in-house. This is less common in insurance – but Banking and Capital Markets companies generally love to build more and buy less. I know companies that have tried and failed to build their own equivalent of commercial CRM systems. Open source software has made building systems much more feasible, and many times that is a good thing for the companies. But again – these debates do take away a lot of time from innovating at scale. You can’t extrapolate time and budget from POC projects to full enterprise implementations.

Buy is not an easy option either, given the tech is so new. Every large tech vendor has a platform offering, and evaluating them takes time and money. The usual checklists for build/rent/buy are not enough for emerging tech and need to be extended. But that extension needs a level of knowledge that most companies don’t have today.

4. Skills

To begin with – most companies don’t have enough people with solid knowledge of AI. GenAI has an even smaller talent pool. Upskilling is totally possible – but it takes a lot of time. I have lost count of how many hours I have spent in the last three months reading papers to get the basics right. I am grateful that my employer has a lot of experts in the field who can clarify concepts for me when I run into confusion, but that’s not a luxury every company has. It’s not just great AI talent that you need – you need all the usual things that go with it (architecture, engineering, UX…), which means you have to deprioritise other projects. That disruption is not pretty.

5. Intellectual property

One of the offshoots of GenAI is its use for developer productivity – code generation type use cases. Everyone – me especially – got very excited when we saw the possibilities for the first time. But that doesn’t naturally translate to the enterprise world – IP problems come into play very quickly. GenAI is only as good as the training set that was used in its creation. Have the solution providers done the work to make sure copyleft and copyright issues are addressed before a client generates code? Otherwise it’s a massive risk that the companies carry. I just used code generation as an example – it applies across the board for GenAI (well, for all AI really).

6. Environmental impact

Greenhouse gas emissions are something to think about upfront. GenAI is compute intensive to train given the size of the models – and while inferencing is not similarly intensive on a unit basis, a wide deployment will make sure the units add up. Also remember that GPUs consume more energy than CPUs. Between primary and secondary factors – the environmental impacts are something to be thought through before large scale work happens. Only a subset of companies seem to have made it a tier one criterion though, in my limited view.

7. MLOps

While most of the attraction of GenAI is in the actual “generative” aspects, enterprise attention is quite high on operations. There are big problems to tackle – how do you detect and prevent models from drifting? How do you prevent degradation from AI learning from synthetic data created by AI itself? What are the most trustworthy watermarking approaches? And so on. I think GenAI will be the shining moment for all the research going on in MLOps, which will help across the board.
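
To pick just the drift question, here is a minimal sketch of one everyday check: compare a reference window of some per-request metric against the most recent window and flag when the distributions diverge. The choice of metric and the threshold below are illustrative assumptions, not a recommendation.

```python
# Minimal sketch of one everyday MLOps check: compare a reference window of
# some per-request metric (e.g. output confidence scores) against the most
# recent window and flag drift with a two-sample KS test. The metric and the
# 0.01 threshold are illustrative choices only.
import numpy as np
from scipy.stats import ks_2samp

def drifted(reference: np.ndarray, recent: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the recent distribution looks different from the reference."""
    statistic, p_value = ks_2samp(reference, recent)
    return p_value < alpha

rng = np.random.default_rng(0)
reference_scores = rng.normal(loc=0.80, scale=0.05, size=2000)
recent_scores = rng.normal(loc=0.72, scale=0.08, size=2000)   # shifted on purpose
print(drifted(reference_scores, recent_scores))  # -> True for this synthetic shift
```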

An excellent side effect of this attention to ops is that it has highlighted the need for investment in foundational data management which often gets ignored in the enterprise world.

8. Quality control

Similar to the point on MLOps – companies will have to rethink how QA is done. Software is built in layers – LLMs can affect the quality of the layers above them that use them. There is a lot of work going on in academia and at all the big tech companies on improving the accuracy, consistency, performance etc. of LLMs. I have a strong feeling that these studies will result in fundamentally different approaches to GenAI. I will write another blog later to expand on my thinking – I am still organizing my thoughts on the matter.
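
One concrete shape this rethinking could take is treating the LLM-backed layer like any other dependency with a regression suite: a set of golden prompts and checks that must keep passing as the model or the prompt changes underneath. The answer() function below is a hypothetical stand-in for the system under test.

```python
# Sketch of treating QA for an LLM-backed layer like a regression test suite:
# a small set of golden prompts with checks that must keep passing as the
# underlying model or prompt changes. answer() is a hypothetical wrapper
# around whatever system is under test.
def answer(prompt: str) -> str:
    # Placeholder for the real call into the LLM-backed service.
    return "Purchase orders above 10,000 EUR require two approvals."

GOLDEN_CASES = [
    {"prompt": "What is the approval threshold for purchase orders?",
     "must_contain": ["10,000 EUR", "two approvals"]},
]

def run_regression_suite():
    failures = []
    for case in GOLDEN_CASES:
        output = answer(case["prompt"])
        missing = [s for s in case["must_contain"] if s not in output]
        if missing:
            failures.append((case["prompt"], missing))
    return failures

print(run_regression_suite())  # -> [] means the suite passes
```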

9. Trust

GenAI has rekindled this important topic and put some urgency around scaling it. It’s invariably the first question I hear in every meeting – “can we trust this thing?”. The question is simple – but answering it requires quite complex capabilities. We need to know how the AI arrived at a decision, what data was used to train it, what has changed over time in both the data and the model, and so on.

What do you know today that you wish you knew when you started your career? Nothing!


Varkala Beach – where I was asked this question by a bartender

This is perhaps the most common question I get asked by people in their early years in the workforce. I have asked some version of this to my seniors too when I was starting out.

Someone asked me this question this morning and I figured maybe it’s a good idea to put this into a post.

Fair warning – My answer might not be the most helpful 🙂

And that answer is: Nothing!

I of course learned a lot of valuable lessons along the way. But looking back – I highly doubt my younger self would have acknowledged or acted upon any of those things, because it would have been so counter intuitive, and I would have fought it with all my energy 🙂

That’s the honest truth in my case. I am not saying young people today can’t learn those things and put them to good use. But when I discuss these things with them – my default expectation is just that they listen to my words and think about them. I have no expectations that they will follow through. I actually think it’s better that each person collect information and then choose what to do themselves, rather than blindly assume that what worked for someone else will work for them too.

With all that said – here are a half dozen things that might have helped me if I had realized them sooner.

1. Make peace with the role of luck in life. There are many people I have worked with who had better skills than I did – but didn’t get the opportunities I got. There are also a few who got much better opportunities than I did who I didn’t think were smarter than me – and they advanced farther than I did. It took me some time to make peace with that – and I learned to not let it affect me.

2. Sucking up is not a sustainable career strategy. It does work from time to time – but eventually you get exposed for your lack of ability and will take the fall. Corollary – don’t assume that someone got their break by sucking up all the time. Most of the time we don’t have enough information to know why they succeeded and just use “they must have sucked up” as an excuse to not think of where we need to improve.

3. “I don’t have the time for that” is not a real excuse. We will always find time to do things we love. I loved my 8 hours of sleep until I realized how much I enjoyed training my dog. I didn’t feel bad when I went to bed an hour or two late, because I enjoyed spending that extra time with my dog. We also find ways to not do things we don’t find interesting – even if they are important, like working out or up-skilling at work.

4. If no one heard it – the tree didn’t fall in the forest. This was perhaps the hardest lesson for me to learn. I always believed that working hard and delivering results was all it took. People above you are busy and have a lot to think about. They won’t always know what you are good at or what you have accomplished if you don’t tell them. The nuance is that if you overdo the “look – I am awesome” thing, it will backfire badly too. Finding that right balance is key. Personally I still think understating works better than going overboard – but it all depends on the context.

5. There are many roads to success – but not a lot of roadmaps. It’s easy to think linearly and narrowly, and to get upset when we miss a near term milestone we had so carefully planned. I was quite convinced that the definition of success for me was progressing along a technical career path. I did that quite well – and then my mentors nudged me to try sales and management and so on. I didn’t have to sacrifice being an engineer to become a good GM. For the first ten years out of college – I was convinced that all sellers were liars and crooks. I didn’t realize how wrong I was till I took on a sales target and figured out you don’t have to lie to sell something. The problem with multiple roads is that you won’t realize it in time unless you have good mentors who can help you think through your options. One more thing along these lines – your definition of success will change over time, given most of us tend to underestimate what we can accomplish in the long term.

6. Qualify, compromise and replan as needed. Younger me had the complete opposite idea – I just wanted to double down on everything I set my mind to. I had a very rigid idea of what I wanted to accomplish, with specific timelines. I also got hugely frustrated when it didn’t pan out like I wanted. Eventually I learned that it’s a lot easier if I critically qualify everything before I decide to double down. I also learned that decisions are always based on the imperfect information available at a point in time – so make some principled compromises (which means deciding upfront what will make you give up) and be ready to pivot (or stop) when needed. Picking the right battles to fight is something I still haven’t quite mastered – but I am a lot better at it today than when I started.