Europe Virtual 2024

Decoupling from Foundational LLMs

A dialogue between Dr. Mik Kersten and Gene Kim.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

So to set up this next talk, let me talk about who it is and then what we're going to talk about.

The who is Dr. Kersten. He wrote the awesome book Project to Product, now six years ago, which is being used by so many organizations to change how they think about their software efforts. He's currently CTO of Planview. Over the years, he has taught me so much about architecture, which so much dictates how organizations are wired, and that is what the book I worked on with Dr. Steven Spear is all about.

As so often happens when I talk to Mik, he said something that totally shocked me. This is all about how we may inadvertently be coupling ourselves to LLMs and, more importantly, what we can do about it.

This will seem like the oddest way to set up a question, but let me set up the question by telling a story, Mik. I don't have your video yet.

The story is, I was telling him how over the last couple of years I've become very paranoid about shared services. I was part of the team that created the forum paper called "The Checkbox Project," which I think was just an incredible description of how small things require superhero efforts.

One of the people on the team, she's this incredible technology leader. She said, "We have an SAP role security team. They have an SLA of turning around changes in two weeks, but their average was around seven to eight weeks." They had an NPS score of negative 87, which I thought was pretty funny because before I heard that story, I thought NPS scores were from zero to 100, not negative 100 to 100.

Her countermeasure was, she said, "We love shared services, but not there." So what did they do about it? They took the group and broke them up and embedded the SAP role security engineers into the business units so that they could do the prioritization themselves, and the problem just disappeared.

So I shared that story with Mik, and we were laughing about it. Then you shared an example of how you're going in the opposite direction. You're centralizing certain components of your AI team because the situation you found yourself in is that it took months to switch from GPT-4 to Claude 3. My reaction is, what? How can that be? The API shape is almost identical. The switching costs should be minimal.

So can you introduce yourself, tell us the story, verify that the switching cost wasn't small, how that happened, and what you're doing about it?

Dr. Mik Kersten

Hello, Gene. Great to be here. Thanks for having me.

Yes, this has been just a fascinating journey. I think the entire DevOps and technology leadership community over the years has become better and better at understanding the importance of architecture. Especially over the last few years, we've seen more and more literature and contributions in terms of how architecture and team structures interact.

The interesting thing for me is, I thought I knew how to deal with these things. I thought I knew, for example, that "you build it, you run it" team structures are more effective and more cost effective for driving team autonomy in cloud than having things completely functionally distributed. Like you said, I avoided these shared services to help support that autonomy, to make sure value streams and nested value streams had as much autonomy as possible.

Then a year and a half ago, I found myself with something that we created with our AI and data science team at the time, which is a very successful demo of a copilot leveraging GPT-3.5 back then. It was time to scale this. As we scaled, of course, more and more people came onto this new product.

We realized this product was coupled in a more interesting way. These LLMs themselves are highly capable. They're this kind of external shared service that just tends to work. Like you said, it seems like the API shape looks similar. Back then, as we were getting our hands on Claude 2 as well for Planview, it looked like it was all kind of the same.

But we realized we had an interesting use case because the use case was really more around quantitative data and structured data, less just around images or more natural language data, because of the problem domain that this Planview Copilot existed in.

We were building up the teams, and I realized that some odd things were happening. Keep in mind, by the way, I had the privilege of reading Wiring the Winning Organization fairly early on, around the same time that we were all learning, well prior to learning about GPT-3.5. That book crystallized the core concepts I needed to understand the organizational structure and how to wire it.

So it made me hypersensitive to the signals that were being amplified or signals that were being suppressed, as well as the thing I'm used to being very sensitive to, which is the coupling and cohesion between these teams.

What was happening is that the prompt engineering parts, one of these really important new skills and super powerful new things that we can do with large language models, were getting sucked into the various products because this Planview Copilot needs to work in the context of a dozen different products that our company offers.

The product teams, let's say the team working on the roadmapping tool, were starting to write their own prompts. I realized something odd was happening because what we of course want to do is make sure we've got as much modularization as possible, to have option value for the changes we want to make in the future.

Our principal data scientist let me know fairly early on, as he was experimenting more. Of course, GPT-4 had been on the market 14 months and Claude 3 was coming, and as soon as we got our hands on it, we started experimenting with it. Lo and behold, the chain-of-thought process, how you get these LLMs to reason over data more iteratively, which is really what we rely on, turned out to be very different.

So the way that you feed it the prompt and the data for Claude 3 was very different than GPT-4. Yet what we'd done, avoiding centralization, was basically spreading out the knowledge of creating these agents and prompts into the various product teams.

Now fast forward, let's say six months. If we actually decide to make the change of LLMs, or something new came onto the market, someone else brought one at scale, it would be extremely difficult because every one of those teams would then be wired in with their prompts with this LLM. By having basically decentralized, we would have coupled across the organization, across the teams, into GPT-4, and I realized something was wrong.

Gene Kim

That is just fascinating. You've spent decades studying software architectures and what it takes to decouple, and to see the same phenomenon showing up in how we communicate and use LLMs is a little bit shocking to me.

Can you just confirm that this is definitely a form of coupling that increases the switching cost, that prevents, or at least makes it more difficult, to change? It increases the cost of change. Is that a correct interpretation?

Dr. Mik Kersten

That's exactly it. There are a couple really shocking things to me.

First of all, there's a very interesting thing with the gen AI work that's being done, which is that demos are very easy because you don't have to care as much about architecture. Everyone across the industry has made super cool demos that really leverage the power of these LLMs.

But to make something really unique and interesting and powerful for your customers, for your users, you actually need to do some fairly serious work with the data. What really makes them more interesting is the data and how you're curating and feeding the data to the LLM. Otherwise, you're just getting the out-of-the-box functionality.

To do that, you really need to have the right architecture, and of course agents that can act on that data and do interesting things in the context of your products and your applications. In the demos, no one really has to care much about the architecture.

But what happens very quickly, and even more quickly in my experience than with cloud: if we rewind back to our first experiences with cloud, we all became fairly sensitive, fairly quickly, to the cost of cloud, understanding that if we lifted and shifted (there's been amazing work and research in this community around that), things actually got more expensive in cloud, not less.

Well, after your very first gen AI demo, when you've built something on top of the APIs provided by GPT or by Claude, it becomes very clear very quickly that this thing is going to get extremely expensive if you don't actually think about how you're going to make those LLM prompts. All of a sudden, from the architect's perspective, the cost profile becomes really significant.

Then what we realize is, even more importantly, the costs of coupling in the wrong ways, of not properly linearizing the work and not properly modularizing what you're building, are just massive. What do you do? Who's responsible? You've got this coupling between the understanding of what it should do for a particular product domain, like roadmapping, let's say.

But then who's going to understand? Are you going to put a data scientist or four data scientists in every team? Where we're at today, Gene, is we've got over two dozen people working on the centralized core of this thing, because that's how big a problem it is.

It turns out the wiring between those people is critical. Who runs? Who operates? Who does LLMOps? Is it the product teams? Is it the AI team, the data scientists who live in the world of Python and run when they see Java because they don't like it? Or is it the copilot team, because of the product front end? Those questions turn out to be very difficult to answer.

Gene Kim

Let's put that in a box for a second. Can you connect the dots about what you did about it? You centralized that group because you wanted to make sure that the surface area of coupling is confined, and so that you could actually bring down switching costs.

Can you connect the dots of what you expect to have happen by centralizing that concern and how that will lower the cost of change and decouple yourself from a specific LLM provider?

Dr. Mik Kersten

Yeah. Let me tell you, because it sounds like this was all very thoughtful and planned out up front, so let me just correct that.

I said, "No way in hell the data science team is going to do 'you build it, you run it.' We do 'you build it, you run it' everywhere else except data science, because who wants data scientists on call, and data scientists don't want to be on call?" I said that some months ago.

Yesterday in our quarterly product review with our whole leadership team, our principal data scientist said, "We just shifted to 'you build it, you run it,' because that's how we're going to move fastest." We now actually have engineers on call, including data engineers and engineers supporting those services.

Some of those services, the RAG services, are really simple. You throw them into a Lambda and support them. Some of the actual agent services that are using various parts of various LLMs and other models are actually quite complex.

I think the key thing is to have a set of principles. This is where we apply the principles of Wiring the Winning Organization in our basically monthly discussions about this. We learned as we went.

Again, some of the architectures out there, like LangChain, are great for prototyping. It's just not enough in terms of helping us understand how to create the architecture for this thing. The architecture and team structure of course need to line up.

This is where I think we've got very good words for understanding software architecture. Now we've also got very good words and concepts and frameworks for understanding the wiring architecture, layer three. I encourage everyone to dig into this, of course, from Wiring the Winning Organization, this level-three architecture of how things wire up. We've had to change it.

What we realized is that we had to have conversations with the architects and team leaders, both from the AI and copilot team and the product teams, on effectively a monthly basis. Today we're actually running two different wirings from two different parts of the portfolio because we're going to see which one wins.

One thing that we did do, Gene, is it is already clear to us that we need to have basically tight cohesion in all the prompt engineering. They need to be in one repository, and we need low switching costs for LLMs because we can't predict exactly how LLMs will evolve.

We don't know what's coming in GPT-5. Now everyone's excited about Claude 3, both cost and performance wise and context-window wise. But what if people then get very excited about all sorts of new things we can do in GPT-5, or let's say GPT-next?

We knew we had this principle: we needed optionality and basically tight cohesion in all the prompt engineering. So we changed in a centralized way, because if they were decentralized, it would be too difficult.
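To make the principle concrete, here is a minimal sketch of what "all the prompt engineering in one repository" could look like in code. It is not Planview's implementation. The provider keys (`provider_a`, `provider_b`), the task name `roadmap_risk`, and both prompt layouts are hypothetical; the talk says only that the way prompt and data are fed differs between models, not how.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Renderer = Callable[[str, List[Row]], str]


def render_instructions_first(task: str, rows: List[Row]) -> str:
    """Hypothetical layout A: instructions first, then data as CSV-like lines."""
    header = ",".join(rows[0].keys()) if rows else ""
    lines = [",".join(str(v) for v in row.values()) for row in rows]
    data = "\n".join([header, *lines])
    return f"{task}\nThink step by step.\n\nDATA:\n{data}"


def render_data_first(task: str, rows: List[Row]) -> str:
    """Hypothetical layout B: tagged data block first, question last."""
    lines = [", ".join(f"{k}={v}" for k, v in row.items()) for row in rows]
    data = "\n".join(lines)
    return (
        f"<data>\n{data}\n</data>\n\n{task}\n"
        "Reason through the data before answering."
    )


# The single place that knows which provider needs which prompt shape.
RENDERERS: Dict[str, Renderer] = {
    "provider_a": render_instructions_first,
    "provider_b": render_data_first,
}

# Tasks are owned centrally; product teams refer to them by name only.
TASKS: Dict[str, str] = {
    "roadmap_risk": "Which roadmap items are most at risk of slipping?",
}


def build_prompt(task_name: str, rows: List[Row], provider: str) -> str:
    """Build the provider-specific prompt for a centrally owned task."""
    return RENDERERS[provider](TASKS[task_name], rows)


if __name__ == "__main__":
    rows = [{"item": "SSO", "days_late": 12}, {"item": "Export", "days_late": 0}]
    # A product team supplies only a task name and its own structured data.
    print(build_prompt("roadmap_risk", rows, provider="provider_b"))
```

In this sketch a product team never writes prompt text, so moving to a different model means adding one renderer and changing one provider key, rather than touching every team's code.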

Gene Kim

So interesting. Maybe just to really concretely land the point, Dr. Ethan Mollick talked yesterday about how the LLMs we're using today are the worst and the most expensive they'll ever be. We just don't know which technology is going to emerge as the best in the short term or medium term. This is why we must enable optionality and we must enable low switching cost. Am I capturing that correctly?

Dr. Mik Kersten

You are. This to me has been the fascinating thing. I think two key lessons, and of course we're still learning, so this will be an ongoing process.

One thing is clear: we've never seen the cost aspects of the architecture be this profound, where the product that you're building can be completely unviable if you're making excessive use of LLM calls in the wrong way.

We already use multiple levels of LLMs depending on what kind of prompt is coming in. Of course, the top one, the one that's looking at which agent to pass it to, has to be the most powerful, like Opus or GPT-4. But the architecture within that will actually determine the cost profile. If you don't get the architecture right, your cost profile will be too high.
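A back-of-the-envelope sketch shows why the routing architecture drives the cost profile. Every number here is an assumption made up for illustration: the two tiers, their per-token prices, the router's token budget, and the request volumes are not figures from the talk or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_1k_tokens: float  # hypothetical price, not a real rate card


LARGE = Tier("large", 0.03)
SMALL = Tier("small", 0.001)

# Assumed tokens the top-level router spends deciding which agent to call.
ROUTER_TOKENS = 200


def request_cost(agent_tier: Tier, agent_tokens: int) -> float:
    """Cost of one request: the router always runs on the large tier,
    then the chosen agent runs on whichever tier it was assigned."""
    router = ROUTER_TOKENS / 1000 * LARGE.usd_per_1k_tokens
    agent = agent_tokens / 1000 * agent_tier.usd_per_1k_tokens
    return router + agent


def monthly_cost(requests: int, agent_tokens: int, share_needing_large: float) -> float:
    """Total cost when only a share of requests is routed to a large-tier agent."""
    large_requests = requests * share_needing_large
    small_requests = requests - large_requests
    return (
        large_requests * request_cost(LARGE, agent_tokens)
        + small_requests * request_cost(SMALL, agent_tokens)
    )


if __name__ == "__main__":
    everything_large = monthly_cost(100_000, 2_000, share_needing_large=1.0)
    tiered = monthly_cost(100_000, 2_000, share_needing_large=0.1)
    print(f"all requests on large tier: ${everything_large:,.2f}")
    print(f"10% on large tier:          ${tiered:,.2f}")
```

Under these made-up prices, sending only one request in ten to the large tier costs roughly a fifth of running everything on it, even though the router itself still uses the most capable model.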

Of course, if you don't get the team wiring right, if you don't get the organizational wiring right, then you'll put yourself into the net. It was very clear that just applying the architecture and team structure principles that we thought were correct by decoupling and decentralizing, which by the way is just what we've done for our data, would have been wrong here.

We have a data mesh architecture to enable all of this, which is completely decentralizing the ownership and production of data catalogs to all the product teams. It's wonderful. It's exactly in line with your story on the horror of shared services. If we'd centralized and made a centralized data lake, we would not be where we are today. The decentralization was critical, even though of course everyone wants to centralize a bunch of the data governance and those sorts of things, just a common governance layer in decentralization.

So we were headed that way with the gen AI capabilities. If we'd done that, we'd be today locked into GPT-4, and it'd take us months to switch to, let's say, Claude 3. Right now we've already got Claude 3 up and running, and we can switch between the two of them because we centralized the thing where we wanted that cohesion and then the loose coupling to the external shared service, which is in this case this extremely expensive thing called the LLM.

Gene Kim

This is so awesome. The reason why I thought this talk was so important for this community is just because it's the same principles, the same sort of symptoms, in a place where you just wouldn't, or at least I didn't, expect it.

By the way, Matt McLarty and Stephen Fishman are talking later today about their amazing book Unbundling the Enterprise, which is all about preserving optionality in those conditions where we just can't predict the future, which very much seems to characterize the world of AI now.

So Mik, you said also, we have five minutes. You said something to me that struck me and took me aback. You and I talked about how wiring a technology organization isn't getting easier, especially as we're adding all these new functional specialties to our products and value streams. In fact, one time you said, "All I do these days is think about who should be doing what and who should be talking to who."

Can you confirm that you actually said that and I didn't make that up? If so, can you paint the picture of why that's so difficult and maybe some of the lessons you've learned?

Dr. Mik Kersten

Yeah, I may have said who should be talking to who, or I said who should be talking to who as well. Because that's the thing.

These things get such tight coupling, and that's okay for demos, that's okay for your proofs of concept. That's great. It's when you want to scale this thing that this becomes highly problematic, and you have to make the decision: are we embedding AI and data scientists in every team across the organization, or are we centralizing them?

For that, you have to have these guiding principles and the language and structures to talk about. Again, the language we use is from Wiring the Winning Organization.

Gene, to me right now, what I realized through this story is that we talk a lot in this community around tech debt and so on. The organizational debt that we would have created by decentralizing is completely shocking to me. The fact that we would have made it hard to make a switch where we could have had half the cost, let's say, or just much more capability, by not creating the right cohesion and minimizing coupling in the right spots.

Because you always have coupling. We have to make this very odd decision with a centralization. We now have an AI and data science team who has to operate the software. You build it, you own it, operate the software. But it was the right thing for this context. The key thing is we had the language.

Keep in mind, I'm talking about two dozen people, and this is this consequential. When you actually scale this, I was speaking a couple days ago on exactly this topic to the chief technology officer of one of the larger banks. He's dealing with the same problem as he's looking at how to roll out gen AI, how to deal with their data platforms, what to do with on-prem versus how much data moves into the cloud and support all of this and so on. Again, this problem is now orders of magnitude larger in terms of the number of people.

I think one of the reasons we're seeing some organizations (again, from the kind of studies we've seen in this community) be 100, 1,000, or, as we saw in the last talk, potentially 10,000 times more productive is because of this.

I think we attributed so much of it to bad tech debt. In this case, if we had gotten the wiring wrong, that would have created the tech debt. It's not because the teams were doing the wrong things. It's because as leaders, we would have put in place the wrong conditions for them to create the right architecture, the right platform and investment.

For me, this was a really profound shift. Leaders affect us directly every day by not allowing us to rewire monthly, which would really restructure the teams on a monthly basis, which you don't do as often. We try to do it quarterly. A lot of companies only do it annually.

By not putting in place the right conditions to rewire, we would have put the company in the tech debt end, and we'd have complained about the tech debt, not about the wiring.

Gene Kim

So good. Mik, I always learn so much every time I talk to you. Is there any help you're looking for these days that you would want people to reach out to you?

Dr. Mik Kersten

Yes. The big thing is just two things, and maybe they're not two little things.

One is best practices on structuring gen AI data science teams and then these copilot-agent product teams. What are you doing? Did you decentralize? Did you evolve it over time?

The second thing is the architectures. We've got the start of these things with RAG and so on, but scaling these architectures, I think there's not enough guidance around there yet. I know we've had to create our own, and we're just hungry for looking at what others have done. Of course, not just at the scale of Microsoft or OpenAI, but at the scale of product and enterprise organizations.

Gene Kim

Awesome. Thanks, Mik. Thank you so much. I'm so delighted you made time for this today. Looking forward to catching you soon.

Dr. Mik Kersten

Thanks so much for having me, Gene.

Gene Kim

See you, Mik.