GenAI will need a whole new look at Data Governance!


There are two areas that I think will be the “make or break” criteria for Generative AI:

1. MLOps and

2. Data governance

And between the two – I think data governance will be the one that gets enterprise attention first, and fast. This is because I think the first hurdle will be to make sure enterprise users trust GenAI – and that’s a high bar in itself. I will park my thoughts on MLOps for now.

The size of the model is probably less important for enterprise uses – most tasks that AI can help with in an enterprise context are narrow in scope. This is generally a good thing. Big models are expensive to train, and at inference time most of what they were built to do will probably never get used.

Even if we look at a complex end-to-end process in an enterprise context – it probably makes more sense to have a series of specific models that work together, instead of one big model that covers everything. We don’t need the model that answers questions on purchase orders to also write an essay on the meaning of life 🙂

I am well aware that talking about the cost of a new technology instead of its innovation goodness is uncool – but having lived my whole career in large Enterprise land, I am quite sure that if GenAI has to scale in adoption, it has to have a low cost base. Enterprises might even live with a lower quality of responses if the cost is right. I am only half kidding here 🙂

To make smaller models (which are cheaper) really useful, enterprises will need very high quality data to fine-tune them with. For narrow scopes, enterprises generally have data with enough tokens to make it useful (product manuals, customer complaints, procedures, laws, invoices, etc.). The only question is whether such data is governed in some systematic way so that the information can be trusted to be of high quality.

Data quality is largely an unsolved problem even in the much simpler world of data warehouses, which have been around for decades now. It has almost never attracted enough budget and time in most companies. A big reason why data lakes didn’t yield the planned business value is also that people didn’t trust the data to be of high quality. We will see what fate awaits lakehouse approaches – but I am always optimistic. These things generally improve over time.

The size of the available data to train and fine-tune with might actually not be as big a problem as its quality. More data that looks the same doesn’t do much to make the models trained on it any better. After reading the Chinchilla paper, I am sure we will keep massively improving the ratio of training data to model size. DeepMind’s approach is radically more efficient than the original GPT-3 paper’s, and it only took a couple of years to get there.
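To make the ratio concrete: the Chinchilla paper’s rule of thumb is roughly 20 training tokens per model parameter. A quick back-of-the-envelope sketch (the 20:1 ratio is the paper’s approximation, not an exact law):

```python
# Back-of-the-envelope sketch of the Chinchilla compute-optimal rule of thumb:
# roughly 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training token count for a model of n_params parameters."""
    return n_params * TOKENS_PER_PARAM

# A GPT-3 sized model (175B params) vs a small 7B model
for params in (175e9, 7e9):
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params -> ~{tokens / 1e9:.0f}B tokens")
```

For a narrow-scope enterprise model in the single-digit-billions of parameters, this suggests the data need is in the low hundreds of billions of tokens at most – which is why quality, not sheer volume, becomes the constraint.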

There are two complementary approaches I can think of for how an enterprise will source data for fine-tuning (assuming they start from a model that someone else spent money on training):

1. Establish a consistent data governance process and tooling, and use high quality trusted data to fine-tune the models, and/or

2. Depend on the LLM itself to create high quality data (self-instruct, using one LLM to create data for another, having human users curate LLM-generated data, etc. – like in a chatbot use case where a human expert can correct an AI answer and let it learn from that).
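The second approach – human experts curating LLM-generated data – can be sketched minimally. Everything here is illustrative: `generate` stands in for whatever LLM API an enterprise actually uses, and the point is the loop structure (generate, expert review, keep only approved pairs), not the names:

```python
# Sketch: curating LLM-generated fine-tuning data with a human in the loop.
# `generate` is a hypothetical stand-in for a real LLM API call.
from dataclasses import dataclass

@dataclass
class TrainingPair:
    prompt: str
    response: str
    approved: bool = False

def generate(prompt: str) -> str:
    # Hypothetical LLM call; in practice this hits whatever model you use.
    return f"draft answer for: {prompt}"

def curate(prompts, review):
    """Return only the pairs a human expert approved (possibly after correcting them)."""
    dataset = []
    for p in prompts:
        pair = TrainingPair(prompt=p, response=generate(p))
        corrected = review(pair)  # the expert may edit the pair or reject it (None)
        if corrected is not None:
            corrected.approved = True
            dataset.append(corrected)
    return dataset

# Example: an "expert" that rejects empty answers and passes the rest through.
approved = curate(["What is PO status 42?"],
                  lambda pair: pair if pair.response else None)
```

The design point is that nothing enters the fine-tuning dataset without an explicit human approval step – which is itself a governance control.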

Fine-tuning is only one part of why I think data governance will get a lot of attention. There is an “everyday” need that will come up frequently when the model is used – people (users, auditors, regulators…) will all ask for proof of where the data behind GenAI’s answers came from.
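One hedged sketch of what that proof might look like in practice: attach a provenance record to every answer, so it can be traced back to governed sources. All the field names here are illustrative, not any standard:

```python
# Sketch: attach provenance metadata to every GenAI answer so that users,
# auditors, and regulators can trace where the underlying data came from.
# Field names are illustrative only.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Provenance:
    source_ids: list   # e.g. document IDs from a governed data catalog
    data_owner: str    # the accountable owner per the governance process
    retrieved_at: str  # when the source data was pulled

@dataclass
class Answer:
    text: str
    provenance: Provenance

def answer_with_provenance(text: str, source_ids: list, owner: str) -> Answer:
    prov = Provenance(
        source_ids=source_ids,
        data_owner=owner,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
    )
    return Answer(text=text, provenance=prov)

a = answer_with_provenance("PO 42 was approved.", ["doc-123"], "procurement-team")
```

An auditor asking “where did this come from?” then gets a concrete record instead of a shrug.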

GenAI has an additional headache beyond what’s used for training and fine-tuning – users might feed it inappropriate data! That’s another thing that needs to be governed – probably more heavily in regulated industries, and wherever IP, privacy, etc. need to be kept in mind at every step.
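Governing user inputs usually means a screening gate in the request path before anything reaches the model. A real deployment would use a proper PII/DLP service; this regex sketch (with made-up patterns) just illustrates where the gate sits:

```python
# Sketch: screen user inputs before they reach the model. The two patterns
# below (US SSN-like numbers and email addresses) are illustrative only;
# a real system would use a dedicated PII/DLP service.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def screen_input(text: str) -> bool:
    """Return True if the input looks safe to send to the model."""
    return not any(p.search(text) for p in PII_PATTERNS)

screen_input("Summarise this purchase order")       # safe
screen_input("My SSN is 123-45-6789, please help")  # blocked
```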

There are two things to think about carefully here – the process of data governance itself, and the tooling and automation of it. I am less worried about the tooling part in relative terms – I am just not sure yet if enterprises have thought through all these “fringe” aspects of GenAI compared to all the cool applications they are excited about. If they don’t find the time and budget to get it done right, there will be a lot of grief to deal with.

Published by Vijay Vijayasankar

Son/Husband/Dad/Dog Lover/Engineer. Follow me on twitter @vijayasankarv. These blogs are all my personal views – and not in any way related to my employer or past employers

5 thoughts on “GenAI will need a whole new look at Data Governance!”

  1. All good points – success with data will be the key leverage.

    Retrieval-Augmented Generation (RAG) may be more prevalent than fine-tuning LLMs in most corporate cases.


  2. One issue you didn’t explore in detail (but I guess it falls under governance) is the suitability/fit of the training data. We have all seen cases where the senior business people tell us how they THINK the business runs, and then the implementation team finds out that things out in the office or plant are very different.

    Another issue is where the training data comes from, and how reliable or suitable it is. For example, an AI trained on DB2 12 documentation needs to be retrained/replaced when you upgrade to DB2 13. Or consider a legal AI – apparently at least one version of Gmail’s “smart compose” feature was trained on a collection of Enron emails!


  3. Data governance is a big challenge. Even using OpenRouter’s early access to paid LLMs is not up to expectations for SMEs. For getting information, legal/financial advice, and handling different scenarios, differences in LLM data and authentication are still challenging. Results also differ between access via OpenRouter and direct access to the LLM. Over time, data governance will be the most challenging and important factor.

