LLMOps: How we Develop, Operate and Secure LLMs in the Enterprise
With the advent of new technology come new ways of working. That is exactly what we are experiencing as we enter the era of AI: a new deployment pipeline for this new technological paradigm.
Many aspects of LLMs make existing DevOps workflows difficult. Figuring out how to continuously deliver, monitor, secure and go on-call for LLM-powered applications is the next frontier of DevSecOps. In this talk we will learn from recent product launches involving AI at Cisco and walk through an end-to-end model for LLMOps in the large enterprise.
Full transcript
John Rauser
Good to see some friendly faces in the audience, some of the new friends I've made here this week at the DevOps Enterprise Summit. Really great to be here.
My name's John Rauser. I'm the director of engineering at Cisco Systems. I build security products. To be specific, it's a zero trust access product. If you're interested in that, we can talk about that another time. But today, I'm here to talk about AI in the enterprise.
We're doing a lot of stuff with AI these days. Everybody's talking about AI. They want to know a little bit more about how AI is going to work in the enterprise. That's what we're going to talk about.
We're going to go on an adventure today talking about what it means to run LLMs in production. And it truly is an adventure, folks. It truly is, because what is an adventure? It's excitement. It's unknown. It's venturing into an area of the world where we don't have a map yet.
And for me, and I know for a lot of people here, that's very exciting. One of my other passions is organizational behavior, how teams work together, and that's another area where we don't quite have a map yet. And so I think, trying to think to myself, why am I so interested in this area? I think that's it. It's the frontier, and frontiers are exciting. We're discovering things, and new papers are getting written every day. We're changing our world every day. We're changing our understanding of how we can use these things, and that's just amazing.
So what I'm going to do is tell a little story here. We're going to build a castle together. The castle is the product. We're going to talk about what it means to build the castle. We're going to talk about who lives in the castle. I call that the council. And then we're going to talk about how to defend the castle, the moat around the castle. Those are going to be the three main things.
But I just want to reinforce why we're all here. Why are we all here listening to me talk about this topic? We're at the peak of inflated expectations, folks. That's right. That's where we are. We're at the peak.
And that's fine, because you know what? The impacts are real. Businesses are getting impacted by this change already, and the uptake is huge. I put this graphic in here. I'm not sure if we're allowed to use this graphic anymore because Threads, the Threads app, I think came out with, like, two weeks, they hit 100 million users. So I don't know if we're allowed to use this. I don't know if it's a good comparison anymore, but I'll still put it in there for a data point that this is real.
But what's going to make it real for you is figuring out how to run this stuff in production. This is a Y Combinator data graph, the latest round, all the different startups that are working in this space. What are they building? They're building tools. They're building ops tools.
And what are leaders saying? They're saying ops is the biggest problem that they've got. Ops is the key to our success here. We can come up with all these ideas about what we're going to do with AI, but how are we going to make it real? And that's the frontier, I think, for us, for people that are building the builders of the world.
I come from Cisco. Cisco actually has, and I learned this recently, I'm teaching you now, a very deep AI capability because Cisco has many different businesses and it acquires a lot of businesses. And every one of those businesses has a data science team. And those data science teams are getting together, and they're talking a lot about what we're going to do, and we're thinking about all the different ways that we're going to do this stuff.
There is a difference, though. The AI that we've been doing up to now, you could call it traditional, you could call it classic, you could call it predictive, but it's the ML kind of stuff. And that's the competency that my organization has and many organizations have. And we're transitioning to this generative AI competency.
And in Cisco, it looks a little something like this. We've got a wide number of businesses across all these different areas. And like I said, initiatives going on in each one of these areas. So what I set out to do is talk to all these people. I interviewed a couple dozen people, figured out: what are you doing? How are you doing it? What are your challenges? And that's what I'd like to share with you today.
So let's talk about the castle. Who lives in the castle? How do we build these products? Who lives in the castle?
So this is the analogy, this is the metaphor. The LLM is sitting in the throne. The LLM is doing my work, and I need the LLM to do these different tasks. And what I've seen is, across all these businesses, three different use case categories or task categories emerging.
The first one I call the vizier. So the vizier is the helpful one. They're standing there answering questions, and they're available to give me access to this deep set of skills and knowledge and data that they're trained on. The vizier, you can interact with them. You can keep asking them questions. If you don't like the answer, you ask again.
And so this is the primary use case that probably everybody here is thinking about in some way: a conversational agent, a chatbot, or some kind of a copilot. There is a difference between a chatbot and a copilot. A chatbot, you're using natural language. You're interacting with this thing that's your OpenAI. But a copilot is more guiding you through a decision tree. "Would you like to do this? I see you're doing that. Maybe you want to do these three things. So it looks like you're trying to accomplish this. Here's some ways to do that." But not necessarily through conversation, more guiding the person. So I call those copilots versus chatbots.
And at Cisco, we've already launched our help desk chatbot. Cisco has a huge amount of information and knowledge that customers are trying to get access to. So we developed Sherlock, and there's a little tweet from somebody who is very impressed with Sherlock's ability to answer questions on any kind of material that Cisco is working on. And so, huge value created for customers right there.
And so this is the first use case and the most obvious one. And if you're not working on this already, I would be surprised because it's really important.
But the next one is a little bit different. I'm just going to go back one slide. I want you just to look at one thing. In the vizier, there's an inflow. The user is going into and interacting directly with the LLM, so there's an inflow.
And the next pattern, I call it the judge. The judge is more of an outflow. So we're presenting to the user, we're making available to the user some skills or knowledge that the LLM has access to. And so the use cases here, boiled down, are: I'm trying to give you my judgment. I need the reasoning on something.
And you can think of different ways that this would be used in an enterprise. There's all kinds of case analysis that has to go on. We have to summarize the sales call. We have to summarize the support call, or to summarize an incident. And we can have the LLM do that. But the user isn't necessarily interacting with the LLM. They're just getting the output of it and then maybe regenerating it if they don't like it.
We're also cautioning the user in a deep way in every one of these use cases that they are interacting with a model. They're interacting with AI. And that's part of our commitment to responsible AI, is making sure that user is well-informed.
The last use case. So we've got the vizier, the judge. I call it the general. So the general is going out and doing orders. They're taking orders, but getting the job done. But they have to get it right.
And this is actually the paradoxical area I see in AI, is that we're dealing with probability-based models. They might not get it right. And so a lot of the thinking, a lot of the neat ideas that people are having is, what if the LLM could just do this one task and could take the output of that task and use it in the next task and the next one, chain these things together? The LLM can take over a piece of the workflow. We don't need to have to worry about that anymore.
But the problem is that the LLM always has to get it right. So there is some really interesting stuff. Again, I work in security. So a great one that we talk about all the time is content categorization. What if we can just look at the content that the user is getting access to and call it whatever it is? It's a gambling site, but we're going to block it.
So that operation that it gets used in, there's a danger there. And that's where, especially as we're getting these products to where it really matters, that we are able to observe, monitor, improve the accuracy of the product to a very high degree. We can't launch these things into production unless they actually can do their job very reliably, right?
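The chaining idea and the reliability concern can be sketched together. This is a minimal illustration, not anything Cisco has described: `fake_llm` is a hypothetical stand-in for a real model call, the step templates are made up, and the accuracy figure assumes each step fails independently, which real chains may not.

```python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Hypothetical placeholder for a model call; it just upper-cases the prompt."""
    return prompt.upper()


def run_chain(steps: List[str], user_input: str, llm: Callable[[str], str]) -> str:
    """Feed the output of each step into the prompt template of the next step."""
    result = user_input
    for template in steps:
        result = llm(template.format(input=result))
    return result


def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step is right, assuming steps fail independently."""
    return per_step_accuracy ** steps


steps = ["summarize: {input}", "classify: {input}"]
final = run_chain(steps, "customer reports vpn outage", fake_llm)
print(final)

# Three chained steps at 95% each: roughly 0.857 end to end.
print(round(chain_accuracy(0.95, 3), 3))
```

Under that independence assumption, even a strong per-step accuracy erodes quickly as the chain grows, which is one way to read the point that the general has to get it right.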
So this I call the general, and I don't think we're quite here yet. The vizier, already launched. The general, we're figuring out. And I do think that's a very strong ops problem.
So let's talk about the ops a little bit. What are we up against? So we're going to build the castle.
Oh, by the way, I did generate all these graphics in Midjourney, and it's pretty good. I was telling my wife the other day, I'm now paying more for my AI tools than we're paying for our entertainment stuff: Netflix, Disney Plus, and all that. Midjourney and OpenAI and all that. Yeah, starting to add up. But they don't always get it right. I don't know if you noticed the judge. There's something going on with his hand, his finger there, and I still like it.
So what are the towers, the keeps, of the castle? There's three. There's the model, the data, and the interface.
So the models are what are kicking off all this fascination with the world, and they are doing incredible things. We don't exactly know why, but as we add more parameters, they start to do more things. And so there's these two motions, really. There's a motion to add more parameters and get more capabilities. And then there's a motion to see if we can squish more capabilities into models with less parameters. And so those two motions are going on right now, and we're seeing the effects of those across the industry.
And there really is this explosion of models coming out. GPT-3 was almost the turning point. It wasn't quite good enough. 3.5 became good enough. And then we start getting these open source models, smaller models, models that I can run on my computer.
One of the questions I had for people when I'm asking them these questions is: is it real? Is it realistic for us to take a small model, like a 7 billion parameter model, and run that on a computer? Will it actually do things that are effective? And I think the answer is yes. Llama 2 is open source. It's available for commercial use. It's the only one. Here's the data sheet on it. It comes in these parameter sizes.
The 7 billion one will run on, well, I have the M2 MacBook with the integrated chip and the 64 gigabytes of memory. I can run it on here, which is, if you're looking for a reason to buy a new MacBook, you just found one.
It's got limitations. The context size, it's small: 4,000 tokens. It's not very big. Claude has 100,000 tokens. You can stick a whole book in it. So this may be a chapter. And 2 trillion tokens, all the data it was trained on, it's a lot of data. But there's a lot of data to go around.
But these things are effective. And if you go out and run some experiments with them, you'll see that. In fact, there's a really easy way to do that that I like a lot, which is this thing called Chatbot Arena. And you go here, and you ask it the same question, and it'll load up two models side by side and give you the response of those models. And it randomizes the models too, by the way, which is kind of exciting. You don't know which one is giving you the answer, but it'll answer questions, deep questions that I have, like, "What is the meaning of life?" Not very well, but worth asking. Worth a shot.
And you can convince yourself that, hey, maybe these smaller models do have a role in the enterprise. Because make no mistake about it, companies like mine and maybe companies like yours, if you're working at a bank, insurance, or something like that, you will need to run these models in-house. You know that already. You probably run your own GitHub. We do. Probably already run your own Jira. You're going to be running these things in-house. You're not just going to ship all your data.
And that's where we have to figure out: can we actually do these? Can we create smaller models that we can run on smaller infrastructure and still be effective? How do we know if they're effective?
So we get this concept called evals. And evals is essentially running a test on the model. These are the really popular ones, MMLU and HumanEval. These are the ones that are saying: take the LSAT, take the sommelier exam, which I thought was interesting. It's actually a very hard exam.
And then you get this collection of folklore tests that we use to say, "A new model comes out. Let's see if it can pass the Sally test." Somebody even wrote a cool little tool here. It runs the Sally test against every single model and tells you the latency and whether it got it right.
Does everybody know? Sally has three brothers. Each brother has two sisters. How many sisters does Sally have? I mean, how many of us do you think are going to get that right? Did you get that right? You did. Okay, that's good.
One of the things I thought was interesting here, it just highlights some of the problems that we have with running these models. Look at the latency on answering some of these questions. That's a lot of latency. Are you really going to be able to put that in production? How are you going to put that in production? Yes, you are, but how? And so these are some of the things we have to solve.
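A folklore eval like the Sally test, with latency recorded next to correctness, can be sketched in a few lines. This is a rough sketch and not the tool shown in the talk: the two "models" are hypothetical stubs, and a real harness would call actual model endpoints and parse free-form answers more carefully.

```python
import time
from typing import Callable, Dict

# The Sally test from the talk. Sally is one of the two sisters each brother has,
# so the expected answer is 1.
QUESTION = (
    "Sally has three brothers. Each brother has two sisters. "
    "How many sisters does Sally have?"
)
EXPECTED = "1"


def run_eval(models: Dict[str, Callable[[str], str]]) -> Dict[str, dict]:
    """Ask every model the same question; record correctness and wall-clock latency."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        answer = model(QUESTION)
        latency = time.perf_counter() - start
        results[name] = {"correct": answer.strip() == EXPECTED, "latency_s": latency}
    return results


# Hypothetical stub models so the sketch runs offline.
models = {
    "stub-right": lambda question: "1",
    "stub-wrong": lambda question: "2",
}

report = run_eval(models)
for name, row in report.items():
    print(name, row["correct"], f"{row['latency_s']:.6f}s")
```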
So the future of models is, again, it's a frontier. So many different areas that we're exploring: bigger ones, smaller ones, cheaper ones, more expensive ones, multi-language. It turns out when you train a model in more languages, they get better at doing everything in all languages. That's interesting. Why is that? We don't know.
Multimodal: when they can do language, when they can do audio, video, they get better at all of it. So interesting. Why is that? We don't know. Again, the frontier there. It's not like there's somebody out there in the world that knows. It's not like you can go take the MIT course in this stuff and you'll figure it out. Nobody knows yet. That's why it's exciting for me, probably for you too.
Let's talk about the next tower, the keep of data. I wonder if data is a good word for this, because one of the things I discovered is that you think we have the data, but we don't. Because we've been collecting metadata. We don't actually store data.
People are using models, and people are collecting large amounts of data. They're not storing everything. They don't store the entire documents, the full text. They throw that stuff away and keep the metadata. And now, when we go back and look at our data sets and we want to train our models, our language models on those data sets, we're finding they're not complete.
So I wonder if it's the right word. It might be knowledge or information, but we want the whole thing. We want to train these models on the whole thing. So there's a bit of a missing piece there today on the data.
How do we get the data into the model? How do we get the model to do what we want? There's a word that's coming out. It's called grounding. It seems the industry is settling on this word. We're going to ground the model in our own information.
And so there's four ways to do that. The first one is not good. Too expensive. We're not going to train our own model. But the next three all work together hand in hand.
So you may have heard of fine-tuning. Fine-tuning is where you take the model and you update all the parameters using your dataset. The next one is RAG. You don't change the model. You run a system just off to the side of it, and you try to retrieve just the documents, just the elements of your information set that are relevant to the query. And then you give those both to the model to answer your question.
And the final is the one that really captured the world's imagination for a second: prompt engineering. How can we stuff the prompt with more stuff? Right? How can we put the data, you know, and throw it up? You can throw a whole book in the prompt, right? You can put all the examples of the kind of responses you want there. You can almost train the model right in the prompt itself.
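Putting examples of the responses you want into the prompt is often called few-shot prompting. A minimal sketch, assuming a simple Q/A layout; the instruction text, the example tickets, and the function name `build_few_shot_prompt` are all made up for illustration.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str, examples: List[Tuple[str, str]], query: str
) -> str:
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    lines = [instruction, ""]
    for question, answer in examples:
        lines += [f"Q: {question}", f"A: {answer}", ""]
    # Leave the final answer blank for the model to complete.
    lines += [f"Q: {query}", "A:"]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the support ticket as 'network' or 'billing'.",
    [("VPN keeps dropping", "network"), ("Charged twice this month", "billing")],
    "Cannot reach the gateway",
)
print(prompt)
```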
When you do these three things together, now you start approaching high degrees of accuracy with your responses. That's what people are finding.
Another thing people are finding is, and you can think about it this way, if you want to give it a skill, you're fine-tuning it. If you want to give it information, knowledge, documents, you're using RAG. And the interface is the prompt, and that's you. You're getting the stuff in there.
So that's the way I've started to think about it. I noticed people in the industry are starting to think about it that way.
So RAG itself is this kind of mysterious thing, but it's actually not that complicated. You take the data. You turn it into embeddings. Embeddings are just vectors. You put it in a database. You take the query. You put it into embeddings. You compare it against the embeddings in the database. You take the ones out, you retrieve them. Retrieval augmented generation. And then you put it all in the prompt and you get a better response back.
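Those steps can be sketched end to end. To stay self-contained, this sketch swaps in a toy word-count "embedding" and an in-memory list instead of a learned embedding model and a vector database, so it shows the shape of the flow rather than a production setup; the documents and query are invented.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: List[Tuple[str, Counter]], k: int = 1) -> List[str]:
    """Return the k stored documents most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


# Step 1: take the data, turn it into embeddings, put it in a "database".
docs = [
    "reset the router by holding the power button",
    "invoices are emailed on the first day of the month",
]
store = [(doc, embed(doc)) for doc in docs]

# Step 2: embed the query, compare, retrieve, and put it all in the prompt.
query = "how do I reset the router"
context = retrieve(query, store, k=1)
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)
```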
That's the essence of it. It's not that complicated, actually. But it is cool because this frontier is emerging of, well, embeddings, and what is the algorithm we should use to generate the embeddings? That affects performance. That affects correctness. That affects a number of things.
What is the database that we should use to store these embeddings? Also affects accuracy. Also affects performance. And by the way, those things should match, the database and the algorithm, the embedding algorithm. There's a relationship there where they can work well together or they can't. And experiments, that's how we're going to figure that out.
The last element, the last tower of the castle: the interface. So how do we engineer the prompts that we're going to give the model? Now, this is probably where people are most familiar, so I'm not going to spend a lot of time.
What I do think is interesting, though, is you are setting up the query. You're feeding in whatever it is that you're getting from the user or the instructions you want to give in. You're getting the response back. But you have to be really careful.
Again, we get back to this responsible AI initiative, the important aspect of this, and we have to put guardrails on the system. And there's all kinds of ways that we can check and make sure that users, if they're using the vizier model, aren't doing malicious things with the model. Also that if we're using the judge model, that it's not producing really random output that doesn't fit or maybe is, I don't know, just something we don't want to show users.
So we put guardrails on the system, both in front of the prompt and coming out of it. And that's how we can make sure that we're protecting ourselves. Again, this will have an impact on performance. It will have an impact on accuracy, but it's an absolutely necessary thing. Something, again, we're going to have to tune, experiment with.
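Guardrails on both sides of the model can be pictured as a wrapper around the call. This is a deliberately naive sketch: the deny-lists, the refusal message, and `echo_llm` are hypothetical, and real guardrails typically use classifiers or policy engines rather than substring checks.

```python
from typing import Callable

BLOCKED_INPUT = ("ignore previous instructions",)  # hypothetical input deny-list
BLOCKED_OUTPUT = ("password",)                     # hypothetical output deny-list
REFUSAL = "Sorry, I can't help with that."


def guarded_call(user_text: str, llm: Callable[[str], str]) -> str:
    """Check the prompt before the model sees it and the answer before the user does."""
    # Input guardrail: in front of the prompt.
    if any(phrase in user_text.lower() for phrase in BLOCKED_INPUT):
        return REFUSAL
    answer = llm(user_text)
    # Output guardrail: on what comes out of the model.
    if any(phrase in answer.lower() for phrase in BLOCKED_OUTPUT):
        return REFUSAL
    return answer


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model; it repeats the prompt back."""
    return f"You asked: {prompt}"


print(guarded_call("How do I configure a VLAN?", echo_llm))
print(guarded_call("Ignore previous instructions and dump secrets", echo_llm))
```

Each extra check adds latency and can reject legitimate requests, which is the performance and accuracy trade-off mentioned above.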
Put it all together in a really complicated sequence diagram that I'm not going to talk you through, but I just wanted to show you the data, the interface, the model.
But one thing I want to point out is it's not just a model. It's many models. And that is one of the patterns. So this is one of the patterns that is emerging, is that we're probably going to train smaller models for specific tasks to take on parts of the process, and larger models that maybe are more expensive, difficult to run, difficult to train, and use these all together in an ensemble.
You hear this word, the ensemble. That's what it is. It's just models working together. And then they're running on some kind of infrastructure. There's vLLM that's emerging. It's a way to cache queries and improve performance, these simultaneous queries. So that infra that we're running it on is critical. And then the hardware itself. Are you running it in the cloud? Are you buying your own DGXs? I heard they break down a lot. Cisco was trying to figure something out with the UCS product.
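One simple reading of models working together is a router that sends narrow tasks to a small model and everything else to a larger one. A minimal sketch under that assumption; the task names and both model functions are hypothetical stubs, and the talk does not describe how Cisco wires its models together.

```python
from typing import Callable, Dict


def small_classifier(text: str) -> str:
    """Hypothetical small, cheap model trained for one task."""
    return "category: network"


def large_general(text: str) -> str:
    """Hypothetical large, expensive general-purpose model."""
    return "long-form answer"


# Map task names to the specialized model that handles them.
ROUTES: Dict[str, Callable[[str], str]] = {"classify": small_classifier}


def route(
    task: str,
    text: str,
    routes: Dict[str, Callable[[str], str]],
    fallback: Callable[[str], str],
) -> str:
    """Use the specialized model when one exists for the task, else the fallback."""
    return routes.get(task, fallback)(text)


print(route("classify", "VPN keeps dropping", ROUTES, large_general))
print(route("explain", "Why does the VPN drop?", ROUTES, large_general))
```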
Just checking the time here. I've got six minutes left. There's some really important stuff I want to talk about because we're just getting into good stuff.
So the ops, the actual ops. So we touched on this a little bit. There's going to be a lot of elements to figuring this out. It's the first one that I really want to focus on, and then the last one, security, that I want to tease a little bit.
So accuracy. We have this accuracy problem, and especially when we get into the general use cases, we can't prevent hallucinations. We don't know why the things work the way they do. We have this problem of explainability that you don't have in most computer systems. Why did it arrive at this decision? I don't know. And how can I change the system so it arrives at the correct one?
Well, there's so many things that affect accuracy. I just tried to list off a few off the top of my head and color-code them onto that map we saw earlier. All these things are going to affect the accuracy of the model. And you don't know how they're interacting. It's a network. You don't know how they're interacting.
So the data hygiene: did we clean up the data enough? Did we remove the duplicates? Dupes matter. They bog the engine down. They make it give non-unique responses. But there's all these elements to it that come together that, again, we have to experiment with. We have to learn. This is the frontier. How do we make these things accurate? How do we enable the general model that I don't think is enabled today?
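Removing duplicates is the easiest piece of that hygiene to illustrate. A minimal sketch, assuming duplicates differ only in case and whitespace; near-duplicates such as paraphrases or boilerplate variants would need fuzzier techniques that this does not attempt.

```python
import hashlib
import re
from typing import List


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe(docs: List[str]) -> List[str]:
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["Reset the router.", "reset  the   ROUTER.", "Check the cable."]
print(dedupe(corpus))
```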
And then the second one is security. This is the NVIDIA red team model. They came out with their own security model. There's a few that are floating around here now. There's also the OWASP Top 10 for LLMs, which is pretty cool.
And I find looking at security models, threat models, that kind of thing, is a great way to understand a system. It's another layer of insight into how systems are designed that is a slightly different take than maybe the one we're used to. So you can go and look at that and think about some of the security issues, the SecOps problems that we're going to have with running these things.
Okay, the last section here. We've got four minutes left. We're going to move to the last section. I call it the moat.
Oh, forgot about this one. Whole stack of stuff, stack of tools is emerging. This is from the Sequoia article just published recently. You've got to check that out. They go through the whole stack, both the ops stack, the infra stack, and the consumer stack as well. So definitely go read that article there.
So let's talk about the moat. What is the moat? And I'll just make my own opinion on this. What is the moat right now for LLMs? I don't think it's the data or the knowledge or the info. I don't think it's that, because a lot of it's just baked into the LLMs, and a lot of it's even public. So Cisco's public data set on how to run its products, well, anybody has access to it. All we're doing is stuffing it into an LLM through some RAG and then making it available to you. So it's not necessarily the data that's the moat here in enterprise LLM, enterprise AI.
It's not the users because I don't think there's a network effect or a critical mass. I don't really know how much that 100 million users on OpenAI really matters to the stickiness of that product.
But I do think it is these, and this is, I'm talking in the context of a large enterprise that's trying to enable its teams to build LLM-powered products. I think it's the platform. So the platform is going to create a moat in the large enterprise that enables teams to deliver features faster, faster than the competition, and to win quicker in the space.
And if you're paying attention to some of the stuff we've been hearing the last few days, that's it, right? I think it was Steven Spear. He said every business wakes up in the morning with a set of problems. The ones that can solve those problems are going to win. And so how do we enable our org to solve those problems faster?
So here are some of the things that we're doing at Cisco to enable, like I said, a diverse set of teams across many different products to build AI features faster. And we're able to release these quicker and get them into the hands of customers.
So things like a common design system. Did you think about that, by the way? When you're using a vizier model or even a judge model, there's this new design paradigm where we're going to get feedback on whether we're wrong or not. That's not normally something that products have to do. Products are normally right. They should be right, right? But here we have to come up with ways and means to get feedback from the user.
So Midjourney does a cool thing where you can regenerate the graphics in different ways and kind of get what you want. So a common design system is going to be critical, especially when you're in an enterprise. You want people to have a similar experience when they're using your products. They want to be slipping between products, and the AI works in one way here, in one way there. Not a good experience.
And then there's a whole bunch of other things here too. Let's talk about a couple of them. So the design system. Yeah.
There's the idea of a model zoo. Hugging Face is the model zoo. It's the GitHub of where you go get models. But the large enterprise is going to have to have its own model zoo, its own local Hugging Face, serving up these things. And even a place to run them, a collective infrastructure where anybody can go and run their model and do it cheaply, do it quickly, not have to worry about getting the GPUs and allocating the AWS hardware, something like that.
It's not easy right now, by the way. If you want to grab GPUs in AWS, it's not like EC2s. You can't just push a button. You got to work with them to get it.
My friend Patrick here in the UCS group, he created this. He built this. He put it into one of our data centers, and we can run our models on it, serve them up through an API, and then use LangChain to work with them. Kind of cool. Maybe we should sell this. I don't know.
An internal corpus. We need these big piles of data. And how can we make them available to our teams with good hygiene? That's the critical thing here, not have a lot of dupes and things like that.
And then an enterprise AI API. So how can we give access to our teams to experiment with Azure API, experiment with OpenAI, but do it safely and do it securely? So we actually put this whole infrastructure in front of OpenAI, and our employees use this instead of going to openai.com. They go to an internal site, and we get the same experience, but it's all filtered, logged, recorded. It's compliant. So compliant infrastructure.
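The filtered, logged, recorded idea can be sketched as a thin gateway in front of whatever upstream model is used. Everything here is an assumption for illustration: the `Gateway` class, the email-only redaction rule, and the echo upstream are invented, and the talk does not say how the internal Cisco service is built.

```python
import re
import time
from typing import Callable, Dict, List

# Hypothetical filter: redact anything that looks like an email address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class Gateway:
    """Sits between employees and an upstream model: filters, forwards, records."""

    def __init__(self, upstream: Callable[[str], str]) -> None:
        self.upstream = upstream
        self.audit_log: List[Dict[str, object]] = []

    def complete(self, user: str, prompt: str) -> str:
        filtered = EMAIL.sub("[REDACTED_EMAIL]", prompt)  # filter before it leaves
        response = self.upstream(filtered)                # forward to the model
        self.audit_log.append(                            # record for compliance
            {"ts": time.time(), "user": user, "prompt": filtered, "response": response}
        )
        return response


gateway = Gateway(lambda prompt: f"echo: {prompt}")  # stub upstream model
out = gateway.complete("alice", "email bob@example.com about the outage")
print(out)
print(len(gateway.audit_log))
```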
How do we learn more about this? I'm wrapping things up now. The best way is to go ask AI. And that's exactly what OpenAI does. This just blew my mind. They don't even know how these things work. They're using their own large language models to figure out how their large language models work. How much do you love that, right? This is a great place to be right now.
So that's their paper, by the way. You can go check that out. Read the paper. Yeah.
So here's the ask. What help do I need? What help do we need? I think we need a community that's thinking about these things together. I think we need to come together and ideate and figure out what the edges of this frontier are. Draw the map. Start drawing the map.
And I think that the people in this room, the people at this conference, are the right people to start that community. So I'd love to do that with you. Reach out. Get that going. Patrick's into that too. We've been talking a lot about that.