Bringing GenAI from Promise to Reality: Navigating the Journey of Implementation

Log in to watch

Las Vegas 2023

Bringing GenAI from Promise to Reality: Navigating the Journey of Implementation

At DOES Amsterdam, I unveiled the boundless potential of GenAI and the dreams it inspires. However, as OpenAI continues to introduce new features, many of us are grappling with the challenge of integrating GenAI into our own organizations. In a landscape characterized by constant change, product and engineering teams embark on an exploratory journey:

- What is the art of prompt engineering?

- How do we choose the optimal model and provider?

- What's the process for integrating GenAI with our own data and APIs?

- How can we seamlessly incorporate this into our delivery pipeline?

- What implications arise for development, security, and operations?

- What are the financial considerations?

- How can we build and maintain user trust?

- Where does GenAI fit within the organizational structure?

Drawing from real-world experiences in the trenches, I will guide you through these critical aspects, providing insights and strategies to navigate this evolving landscape.In the spirit of collaboration and knowledge sharing, I am actively seeking stories and experiences from others who have traversed this path. Whether you are just beginning or have advanced expertise, I invite you to engage in a dialogue. Let's exchange insights, learn from each other, and collectively contribute to documenting this transformative journey. Join me in this conversation, and let's empower each other. #askforhelp

Chapters

Full transcript

The complete talk, organized by section.

Patrick Debois

All right, jumping right into it. It all starts with a prompt.

You all will know by now ChatGPT. The prompt is the thing you type in. A prompt is what you type, it goes into a model, and the model is the smart thing that is kind of like your auto-completion. It gives you the result, which is a completion. This is the simplest query.

You can get a little bit more fancy when you enter that kind of query into ChatGPT by saying, "Imagine you're an expert." It sets the stage for the system to answer, and it gives you better answers. This is one trick.

You can give examples in your same query: "Do something like that. Do something like that." And then the LLM, which we're going to talk about, is actually giving you an answer.

Another way to improve these things, and then you can get more fancy, is when you're saying, "I need to solve this problem. Let's break it down into steps." This is a pattern called chain of thought. Like, I'd say, "Solve this first thing, then move to the next." Then going on, tree of thought is saying, "I need an answer. Give me three answers," and then kind of go follow the chain of thought and then kind of do a summary at the end.

So why am I explaining this? Because that's part of the whole prompt engineering. But it doesn't stop there.

If you're integrating this, you want to have some structure. You want to say, "Return this answer as a JSON to me," and give it a schema to say, "Well, first name is supposed to be like this. Last name is supposed to be like that." So you're putting even more parts into your query. It's still all text. I haven't done any coding. It's still just text.

And sometimes we have to ask, "Are you sure? Is this correct? If not, try again." So you see, this is still kind of a whole train of thought there.

So what we've learned is that changing the way you ask the LLM, or the generative AI, massively improves the things. So this is what people call prompt engineering. It's a field on its own, but it's definitely worth the effort. And that's why companies like Salesforce, they partner up with MIT to kind of come up with new strategies, because this is the way things improve.

Again, you can do this all by yourself, but when you're implementing generative AI, you want to look at these patterns, and they're all in the research.

The other thing we learned is that having one call to the model is not enough. You'll probably do dozens to kind of get a well answer. That's where it is really weird to kind of make it work. You have to put in a lot of effort to make it predictable, correct, the way you want it to serve.

So that's one thing we learned: it's never that simple, that one-shot query.

Moving on to the next part. So prompt is the text you write. Models is where you send the text to, works. One of the problems you might hit when you're on these systems is that they have a limit of the amount of text you can send to that. ChatGPT, 32K. Claude has 100K. You can probably start sending a book to it, but it is a limit of things you could set. So that's why often you have to go into smaller chunks.

And the interesting part is that even though you give it a massive list of text, the LLMs tend to forget what you send in the middle. It's because it's trained on material that we give them, and usually we have a strong opening, a strong ending. The rest in the middle is kind of that. So that's why they call it "lost in the middle."

And we know time to first byte in the browser, or kind of that. We have that with a model as well. Some models are slower to react. They take more time, and some are faster. But latency, therefore, if you want to have this real-time experience, is a very important thing.

And I told you it never is that one prompt. So the latency is multiplied across multiple prompts, and that's why it's so important that you pick a model that has a very good latency in this. Depending on your use case, if you're doing it in batch offline, you don't care. If you want to do this real-time, it is very important.

The other thing, if you select a model to implement this, is that you want to look at the amount of parameters. It's basically one giant matrix of numbers where it tries to answer your question, your prompt to completion. And it's doing a lot of math using that. So the bigger dimensions, that's kind of how you look at the number of parameters of a model.

But some models have been trained specifically to stay within a certain region. So some have been overtrained, and some have been undertrained. And it will result in you having to require to use more memory, being faster at responding. So you have to look at what is the model that I want to use there.

And having a higher number of parameters doesn't always mean it's going to be better. It's not a thing we found out. You would always think, like the higher number, like I said, the context, make it bigger; number of parameters, make it bigger. But it's not always like that. There's a thin line of kind of improving those things.

So you know now a little bit on what parameters to look at and why, but now you're off to: what model, what vendor am I looking for? Some vendors will sell you a foundational model like ChatGPT or AWS, here Titan, and others will have integrated partners. So in this case, Bedrock is providing foundational models and also the ones from partners. So you start seeing this as a marketplace of models that you pick and choose based on what you need.

There's an equivalent marketplace on open source, or open models, which is called Hugging Face, where you can find public models as well. So ones are more proprietary, ones are more open, and it depends on your choice on what you want to do, whether you're going with a vendor or an open source model, much like you want to do with open source, yes or no.

And of course, just picking a model based on the parameters, it doesn't make sense. You have to pick the model that fits your use case. So is it translation? Is it summarization? And you kind of have to navigate the whole zoo. But this becomes, depending on your use case, to pick the right model.

So this whole selection process you have to go through before you kind of have the right model.

And then obviously pricing. So this is not anymore, unless you run the model yourself, this is GPU pricing. But they're charging by the token. And token, think of it like pieces of text you're sending. Not only the input you're going to be charged, it's also the number of output, the text outputs. And so depending on how big your input or output, this is going to influence how you have the pricing.

And then you can start comparing vendors based on your use case that you need. But the pricing is important, because it can really be expensive if you have a high context window or a smaller one. So you need to navigate what is important for you at the time.

And even Microsoft themselves, they're looking at a way to reduce the cost of not always calling OpenAI, because internally they're also feeling that pain of cost.

There's new hardware coming out. So whether you are not running your own model on a GPU, the vendors are looking for ways to reduce the costs and have more specialized hardware. So it's another option that you can take, depending on your use case for training, for fine-tuning your model.

So our experience is OpenAI is still top-notch. Competitors are really coming up fast, and the open models are also getting up faster. So this has really been an explosion and making this more. But obviously, it depends on whether you buy a service, run it yourself, based on the pricing, whether you go with the GPU or a SaaS supplier.

And the biggest worry a lot of people have is: "Yeah, but when I send it to a SaaS provider, is it going to train on the data I send it to that?" So that's, for a lot of people, the first decider: will it keep my data private, yes or no?

Speaking of data, the third piece in here. Up until now, I told you we can get smarter at sending the prompts and the pieces, but there's another mechanism or a pattern in prompt engineering. It's called Retrieval-Augmented Generation, fancy name for saying, when I'm asking the model a certain question, I'm going to include pieces of information that I think are relevant to answer the questions.

So, "Given the following pieces, data X, data Y, can you provide me an answer?" So that means I'm sending solutions or information to the model and giving answers back.

So what does it solve? The models you can have from vendors, their information has a cutoff date. They've been trained, and then has no certain information. But the other thing it solves is that I can do PII masking. I put information in there, I can mask it, and then I can get it back so it doesn't leak any information.

So this pattern is kind of like a smarter way of doing prompt engineering with your own data and giving you the results. This is the pattern that a lot of companies will try to do to find the balance between the privacy or not.

For dealing with your own information, you probably have to extract a lot of text from your PDFs, your Word docs, or anything there. So you kind of have to have a way of dealing and extracting all that information before you can start sending the chunks in.

And while companies have certain structured information in a database, a keyword, and even like the search, you might have to look at cutting pieces of that information in a way that it's easy to retrieve the relevant pieces that go into the prompt for that. There's a technique called embeddings, which kind of shows the similarity between pieces of information. And when you ask a question, it finds the similarity between the answer, and that's the pieces it actually puts in the prompt.

To avoid always sending this information over the wire in your prompt, remember the size limit, we can start fine-tuning the model. Fine-tuning the model, think of it like patching a model and putting a layer on top of it with your own data. And that is a lot cheaper than training a whole model. It can be done in a couple of hours, but there's a little bit more risk on hallucination or giving false answers. But it is a way to overcome you always having to send all your data to the model as well.

So what we learned is this cutting of data, of Word documents or structured data, that strategy, how you do this, is really important. Because if you cut things middle sentence, it will have a bad similarity. And so that's a thing to consider in chunking these things up.

And what we found in our product is that we wanted to replace the search functionality, but users have been trained over the years just to put keywords in. So we had a hard time convincing them to actually start asking questions. So you'll have that resistance of: is it keyword-based? Is it search-based?

So eventually, like this study also shows, don't dismiss your old search and start mixing, showing the results with keyword-based search and based on more a question, and show both of the answers so users can select what they need.

Soon, you don't have to think about this. It will just be a service on AWS called Knowledge Bases, and we just send you the data. So another thing we learned: this is moving so fast that whatever we're doing ourselves probably will end up in a functionality from a vendor. Same thing with fine-tuning. It's a very hard thing to do. Now it's just a commodity service, all done.

Fourth piece: we got the data, we got the model, and we get the prompt, and then we can hook this up to APIs.

The APIs are a way of saying, remember that we stuffed the data pieces in, like this is the relevant information. You can say the same thing about, "Well, here's a bunch of tools." One is very good at calculating things, so you could hook that into some piece that just calculates. And then the other thing is very good at searching things. That's what you go to hook up to your search functionality.

And so based on the answer of the prompt, it will tell you, do I need a tool? Do I not need a tool? And based off that completion, you decide to use that information to do a strategy.

So agents typically will do a strategy to do a goal and go to a certain thing, but it's less driven off just the question and answer. So you see, it's going around until it finds a solution based on the tools you provide. It's a little bit more off in the future, but even Amazon, again, they have these agents coming to their foundational models and hooking it up together.

Last piece: middleware. It ties everything together. So it takes the prompt, takes the API, the model, and the data, and it just turns this into reusable code components. So instead of you trying to write all the APIs directly to the models, get the data, they all have these pieces built in. So solutions here would be LangChain or LlamaIndex. They glue all these things together.

There's another way of thinking about, instead of having to code, is just use a no-code thing. I can wire these things up. We find this really useful about prototyping, not having to write the code first, and then we embedded that into code.

And it even comes in the spirit of declarative, like the third way: writing the code, the no-code, and then making it declarative, is that you can specify what the model needs to talk to and how it talks to. You save that as JSON. Think of it as the YAML that we're used to, defining how these things work together. And it just spins up the chain of doing the calls.

And a model file, which is an interesting one. It's mirrored after a Dockerfile to spin this up. And the whole training is another example of doing this through YAML.

So we found this easier than just having to write all the code and change this over again, is that you start doing this declaratively.

So they break. They're so fast. Somebody said the rhythm was rewritten every week or something. It's something you need to look into, but it is still better. All the abstractions are there, and everybody's contributing to keeping it more stable.

So this was about getting the functionality out the door. We have a great app, we have the middleware, everything's running, but now we have to make sure that we can keep an eye on how this is working.

And it's not MLOps, because that is getting the model into shape. It's not AIOps, it's not about the trends in your monitoring. This is a whole new thing.

Most of the vendors, they come from observability, where they say, "Well, we're going to look at your cost. We're going to look at your latency." Those are technical metrics, so they're the easiest ones. And you can have these metrics, all the pieces of your stack, but still there's a piece missing that you want to have observability on.

And yes, we can get it into your OpenTelemetry in a good way, but that's not the point. And yes, we can have tracing, that's not the point. It's important that we can trace the prompts and the whole chain. But what is really important are the data metrics.

I want to have the quality of the answers checked. And how do I check this? I can't write a more explicit test for this, so I have to ask another model to see: is this text readable? Is this relevant? Is there any security issues? And maybe the sentiment of the answer, right?

So these are metrics that, when you're doing GenAI, are not the traditional technical metrics, but the ones you want to look out for. And so the dashboard is not just about the latency. It's about the quality of the data that goes in and out. And that's what you start worrying about, whether that's really working.

And some use a model. Some use another way. They just ask an LLM, "Imagine you're a security researcher. That's like a system prompt. What do you think about this answer?" And they use LLMs to do the evaluation of other LLMs. There's some inception happening.

So the whole thing is there's a whole new chain of things that we need to take care of: the security of the data, what goes in, the model, the quality. So there's a new way of operationalizing things.

But what's also important is that we capture the feedback. I don't know. There's no alarm, kind of the feedback of, did people actually like this answer? Was it good? So we're tracking that as well.

And we need to version the prompt, because we're updating this constantly while we're changing our information. And we can do A/B testing of prompts, a whole new set of things that we're not used in our infrastructure, but it's testing our data as well.

And I said it before: we can't do this manually anymore. We have to ask help from automation to triage those answers. So again, inception, LLMs helping LLMs in there. And so this becomes a whole test control, and this is kind of where it gets fuzzy.

Things break all the time, but we need to get better at training and testing all the data, as I've shown with the metrics in observability. But we also want to create a test set, how we do this during our pipeline of deploying these kinds of apps.

And we'll deal with uncertainty, and we'll test it in prod, right? That's really weird for us, because we've been so focused on being able to do these things in the test environments.

And another thing we learned is that we want to expose the data as an API, because that saves us a lot of hassle of cobbling these things together.

You're going to get access to GPUs. That's another challenge you want to get to, but that's an easy one to solve in a cloud environment, right?

So this whole delivery pipeline is taking different forms: the quality checks, the evaluations that we need to do. But it fits right into the pipeline. We just have better test environments. We have more GPUs needed, and have the models come from there.

But you want to earn consumer trust, right? Because you've done the observability, you get it into production. You want to earn that trust. And one of the ways you can earn the trust is by showing your AI company principles, because that shows customers how you deal with these issues. Are you training it or not?

And obviously there's legislation that will force you to do certain levels and make certain precautions. You can read about this later on the slides.

And you want to check as well whether your AI, you're saying, is it trusting customers? So you have to be aware, like when it takes decisions for your customers, that you want to have it controlled or not controlled.

And from a security point of view, you want to treat your data as dependencies and use the existing security controls and fill in that space, a registry, much like we do with all our data, all our artifacts, and data source tracking as well.

And we can even win the hearts by improving our UI and really showing off that when the user is getting an AI answer, that it is shown, and users are actually incentivized to give feedback. And that feedback can be just by putting an icon on there, by showing and actually not hiding that you're using AI, in a good way.

And so I found, in the discussions with customers, they want to opt out on AI, want to opt in, but there's a cost to doing this as well.

Lastly, beyond the marketing budget, these things cost a lot of money. It's not just your platform that you want to bring. It's your use case. It might blow up your budget, and there's an innovation tax. I've already shown that you're reinventing; new product comes along, new product, and you have to take that into account when you're calculating the cost of these projects.

And you might have to adapt your pricing strategy to actually have you afford using AI and bring that to the customer as a cost as well. This is just going to go up in your spend. Start budgeting for this. It's the end of the year, so it's a good point to do it right now.

And it will put a collaboration test, because the testers, the marketing, they all have to work together on this part. And really, it's also making sure that we get all the data from all the pieces in the environment or in the corporate. So that collaboration is really helpful.

And if you don't have your DevOps straight, that's going to cause you a lot of hassle.

And what I found is that just data science, I call it shift right. They're moving into production. We first had an engineer join their team, then they took over, and then slowly we put the data in the engineering team, and now we're moving this into the platform team as well.

And yeah, this is just crazy how we live right now. This is my life. Two hundred lines of code, half English. I don't know what's going to happen, and I'm just looking for help.

Basically, if you're on this journey, beginning, starting, advanced, I'd love to hear from you. That's my ask for help.

And thank you for listening to me.