Misadventures in Re-platforming: Lessons From Three Years of Trying
Kessel Run is a software factory in the United States Air Force, deploying warfighting workloads to a Tanzu Application Service (TAS, formerly Pivotal Cloud Foundry) platform.
Our developers have been clamoring for a more full-featured Kubernetes-backed platform to unlock a multitude of benefits around the developer experience: metrics, observability, portability, all those wonderful DevSecOps enablers the Kubernetes ecosystem can provide.
We've had quite a lot of success rolling out and stabilizing multiple flavors of Kubernetes, supporting business-essential and dev tool workloads in as many as four flavors of Kubernetes in production at a time.
And yet, despite years of effort by teams large and small, we're still on TAS for the warfighter.
I'll talk about what we have learned at each tech stack pivot, from both a business and a technical perspective. I'll also speak to the key aspects of a successful enterprise re-platforming effort: what we got right and (you guessed it) what we got very, very wrong.
I hope to give you the courage to make your next major tech stack choice, and to show you the key aspects you need to get right to stand by that choice all the way to production.
Chris Lauer
Morning, everybody.
This is just a repeat of Rosalyn's talk from this morning, but with a lot of failure. So let me go back here.
I was formerly Air Force. I just left two weeks ago, but I still wanted to come here and tell this story, because we tried to do platforming in the DoD. Platforms are really hard. DevSecOps platforms. So why not try to do it in the hardest place: U.S. Department of Defense, giant bureaucracy.
Spoiler alert: it's hard.
So the big picture: Air Force Kessel Run was founded to smuggle innovation in at the edge of the Department of Defense. So we're doing DevSecOps. We're doing Agile. We've got some interesting challenges. Our production is air-gapped, and we still have really high compliance requirements because you're fighting wars with this platform now.
And then they also scaled this up huge, right? Kessel Run started kind of small, a team or two, and now it's a thousand people, lots of teams, three platforms.
So what's a platform, just briefly as I define it? A user story format down there: as a developer, I want to deploy my application, and you can see my stupid application, so that the warfighter can use it. I wrote some code. It works. Maybe it's twelve-factor. I just want to run that somewhere.
And I'm taking this on with the product lens at Kessel Run. But what I'm really doing is abstracting infrastructure away from that software developer, right? So they don't need to worry about cloud, or on-prem, or air gap, or Google Cloud, or AWS, or whatever.
So for Kessel Run's platform, we have the security compliance, the access management for warfighters, and disaster recovery, because things tend to get blown up in wars.
All right. So we're going to start the story just a little bit before I got to Kessel Run. And the first year I like to call accelerated chaos.
I forgot to mention, this whole time we do have Pivotal Cloud Foundry as a platform. And now we're trying to bring the Kubernetes platform.
We have an early effort. We've got two airmen, and they like Kubernetes. They're nerdy. They've been doing some work on the weekends, and they just deploy Rancher Kubernetes Engine straight into the air gap, get up and running, prove that it can be done. And now they have a place to run the tools they need to do their job of deploying to the other platform: the secure artifacts, the scanning, all that cool stuff.
So it gives them the tools they needed. Their option was VM or deploy their own platform. So they deploy their own platform because it felt easier.
So that's huge, right? So we'll just build on this, right? We've already got it in the production air-gapped environment.
And through the talk I'm going to keep track of how many distributions of Kubernetes we have running in production. Production defined as: has some customer that if it went down, they would call me.
So in parallel, there's another effort. We're scaling up, right? We need to get a bunch of tooling in place for our CI/CD pipelines, our version control, our business management, product management, all that good stuff. So we've got to get all this commercial off-the-shelf software. And we're DoD, so of course we're going to self-host it because it's very sensitive.
All these things come with Helm charts. They want to run on Kubernetes, right? Some of these things you can do VMs. I just don't want to do VMs, right?
So we've got a few teams, and a decision's made somewhere: we're going to build our own sort of vanilla Kubernetes. We're not even going to use a distro. We're just going to stand it up with some Terraform and make it work. How hard can it be?
So they do it. We call it PCO, Platform Container Orchestration. And it's supporting these tools, these critical applications, mostly commercial off-the-shelf software at scale with thousands of users.
Spoiler alert: this is still running at the end of the talk, but without vendor or community support. And it doesn't have that governance and compliance kind of baked in, but it works. That's one of my kid's Lego creations. It also works.
All right, enter me. September 2020, I come in. I get the best seat in the house. They give me two jobs, and I'm coming in as a technical product manager. So I'm bringing a product lens and that developer understanding. Pretty cool.
They give me the small team that's already running this Tanzu Application Service, Pivotal Cloud Foundry, right? Pivotal got bought by VMware, and so now it's Tanzu Application Service. I might flip back and forth between those two in the talk. Sorry about that.
And we're just going to keep that alive with a small team and try not to distract the rest of the organization. And that gave me an opportunity to learn how to run a really good platform, because among the engineers we had former Pivotal folks who were really sharp and really cared about the developer experience and supporting the application developers in all these environments.
And this is running all the way across the air gap, right? So if you're a warfighter using an application from Kessel Run, you're on this Cloud Foundry. Max's whole talk was about software running on Pivotal Cloud Foundry.
And my other job is to kill that off.
We're going to get some Kubernetes, because Kubernetes is DevOps, DevSecOps, right? All the best tools. All the best nerdy folks want to be working on Kubernetes. Cloud Foundry is expensive, a lot of licensing costs, and its adoption rate is on the way down. So we want to get on Kubernetes.
I've got a blended team. I've got military, passionate military. I've got some awesome civilians, including myself, thank you. And then labor from a few different contracts. So we've got a nice blended team. If anything goes wrong with any of this part of the team, we're going to be okay.
All right, so now we get this new top-down idea. We're going to partner. We need to scale this Kubernetes thing. We need to do it for real. Let's take that energy we got at Kessel Run from pairing with Pivotal Labs, and let's just repeat that for Kubernetes.
So we're going to buy VMware Tanzu Kubernetes Grid, extending the relationship we have with VMware from PCF, which became TAS.
But when I arrived, there's no architectural decision record saying why we chose TKG. So I've got to solve this mystery. And I'm hearing from my team, "Maybe this isn't the right one." And other teams are like, "Why did we pick this one, and is this going to work?"
So I've got to work that back and figure that out. And the question I have in my mind as a product manager: how do I know when it's time to change this decision if I don't have that information? I could stay on the wrong track for a really long time.
So we put a lot of work into solving that mystery. But there's the TKG logo. They're pretty.
And then a few months into my time at Kessel Run, we were really struggling to make this work. Part of that is that misalignment from not knowing why it was picked. But part of it is also because there are opinions that TKG has about how you're going to deploy infrastructure, and there are opinions from KR about how you're going to deploy infrastructure, and those were not compatible.
So we're going to have to work through all that conflict to get it running on-prem. And we have it in the cloud. So it's running dev and staging and prod. If you're a developer, you can get your workload on the cloud side, but there's no way we're going to get this across the air gap. The clock is ticking. We don't want to buy these licenses at the end of the year.
So it's time to pivot.
At this point there's a little pilot project to try RKE2, and it's very successful. It looks like it's going to work for our use case. Engineers are much more excited about it. We spent a couple months on an architectural decision record and talked about why this one might be the right one. It's the Rancher Kubernetes Engine 2, also known as RKE Government, and it comes out of Rancher Federal. It's for gov. It's got the government mindset built in for compliance and governance.
And yeah, so now we got this buy-in. So now we can just execute.
So look, I got a new tracker here. RKE2. Four already. So nothing's been deprecated yet.
RKE1 is running across the air gap, and it's now in the pipeline. If you want to deploy something across the air gap, you're going to be using this somewhere in the pipeline. It's not hosting the software that the warfighters are using, though.
PCO is running all this massive commercial off-the-shelf software that, if it goes down, the developers are just sitting on their hands waiting for it to come back up.
And TKG is running the developer-focused clusters for application developers on cloud: dev, staging, and prod. And we switched dev to RKE2 as we mature.
So that happened really fast. And just a second to reflect: we could have built on the energy of RKE1. We could've tried to make PCO work on-prem or stuck with that strategy. We could have stuck with TKG, but we didn't do any of those things. And so now RKE2 is looking like the one we're best aligned on.
Fairly straightforward. We just have to kill off the other three and get everything on RKE2.
But some lessons about decision-making. If you're at the top of an organization and you're making a decision about a technology problem, decide in the open. There's no reason to hide your motives here. And there's going to be conflict, there's going to be politics, but it's a lot cheaper to go through that pain at the beginning.
And you want to document the process for the decision and the outcome of the decision so that everybody in the organization can see it. Work in the open, right? And be clear about which of the things that drove your decision might change and which definitely will not. That helps everyone down the line.
And for the bottom, me: just keep asking questions so you understand why. And that's a product manager thing, right? Like, why, why, why? Also a three-year-old thing. And then write that down for everyone so that everybody can see it, and you can share that around, and you can iterate on it. Version that why.
And if you build something useful, like RKE1 or PCO, for a specific use case, it'll stick, and you should be prepared for that. Not scared of it, just prepared for that, that you're now on the hook to support this. And you're going to have customers. If it goes down, they're going to be upset.
That's a kind of power, right? Something that works. But it's also a liability.
Okay. Year two: cool, slow order. We're just going to get on RKE2 by the end of the year. We got this.
All right, so we've got this migration deadline coming. Pivotal Cloud Foundry, Tanzu Application Service, and TKG, all these licenses are expiring in ten months. And the organization doesn't want to buy PCF licenses again. They're very expensive. Let's fast-track this migration over to RKE2.
And TKG licensing, that's just silly at this point. I will be personally offended if we're still buying TKG.
So we've got the six-step plan here. We're going to freeze the TAS platform, right? Just stabilize it, make it quiet. We're not going to be distracting devs or platform or leadership with changes on that platform. No new capability there. Also kind of kneecap the competition, if that's helpful.
Keep the two platforms separate. We don't want to get them all tangled up, right? Because then you've got security issues, and you've got different contracts on the hook, and different platforms.
And you want to upskill your application teams and partner with your early adopters. We did some Kubernetes office hours twice a week for an hour. Anybody could come by and talk Kubernetes, and those were great. We also brought people to KubeCon, and we did CNCF certifications.
And then just roadmap the death of TKG, PCO, and RKE1, and we'll just be on one. And rely on all the teams in the organization, because we're just building the orchestration here. We're just putting clusters out there. So we also need security. We need databases, all that stuff. So you've got to rely on those teams to build the rest of the platform.
And then make sure you mature your RKE2 clusters.
Okay, I made some mistakes on this slide, and they didn't line up quite right on the slide. But number six, that's the part I really messed up. Why is that number six? That should have been number one. We're deprecating things with clusters that aren't fully mature yet. You probably want to finish the platform before you start scheduling the demolition of the old platform.
And I get it, we're DoD, there's a deadline coming, there's money on the line. But that was a big mistake. That's the one I really learned from.
Freezing the old platform: that's where everyone's doing their work. All your applications at Kessel Run, right? That's where the warfighters are touching the applications. If there's stuff missing from there, you're really hurting your organization. They're spending a lot of time dealing with things that are missing.
So maybe don't freeze it if you haven't matured the new platform.
Keeping the two platforms separate: well, now you can't really do a strangler pattern. So we've got all these microservices and applications running, right? And leadership is now spending all their time figuring out who needs to go first, and in what order, and what the dependencies are. Whereas if we'd just said, "You know what? We're just going to blur the security boundary between these two systems. If you want to move over here and talk to that one, that's fine," then we could have done one application at a time.
And I bet you, if we'd done that, we would've had applications running in production on Kubernetes, because we could've done one, and then two, and then three.
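As an illustration of the one-application-at-a-time idea, here is a minimal sketch of strangler-style routing. It assumes some routing layer can send traffic to either platform per application; the class, the platform labels, and the app names are hypothetical and do not describe Kessel Run's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class StranglerRouter:
    """Decides, per application, which platform serves it.

    Hypothetical illustration: platform labels and app names are made up.
    """
    legacy: str = "tas"
    target: str = "rke2"
    migrated: set[str] = field(default_factory=set)

    def migrate(self, app: str) -> None:
        # Move a single application; everything else stays where it is.
        self.migrated.add(app)

    def rollback(self, app: str) -> None:
        # Send one application back without touching the others.
        self.migrated.discard(app)

    def platform_for(self, app: str) -> str:
        # Anything not explicitly migrated keeps running on the legacy platform.
        return self.target if app in self.migrated else self.legacy


router = StranglerRouter()
router.migrate("weather-service")  # hypothetical first mover
print(router.platform_for("weather-service"))
print(router.platform_for("mission-planner"))
```

The point of the design is that each migration is small and reversible, so there is no need to agree on a global ordering before the first application moves.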
All right, so mistakes.
And then we ran into some incredible government staffing problems. The Air Force was not too sure about remote work here, and so they had this complicated "Can we really hire remote or not?" thing that meant it took some of the people we interviewed for critical roles in building this platform seven months to get in.
And then, just as we're getting the inertia going and going towards this deadline on RKE2, this contract failed, and we had whole teams basically lose their capability overnight. And my team, I lost about half my folks.
But because we'd worked really well as a team, and we documented, and we mentored and paired, we were fine. We were able to keep all four of these cluster solutions running in production, and learn and iterate on them. But we weren't able to make progress on our roadmap.
So there's another two months gone. And now we're at two months to the deadline by the time we've got folks returned, spun up, and ready to go in some of these critical positions.
Okay, we're going to kill one. So it's cutting it really close, and I had to do it in the wrong order, and it was messy, and had a lot of downtime, but I killed TKG. So we got the staging and production clusters swapped over to RKE2.
And the metrics I use for this, as we're making slow, steady progress: what time do I make this swap? I've got the deadline there, so I need to hit the deadline, and I want to do it kind of as late as I can because I want to have as much maturity as I can.
So I'm looking at: are the clusters stable? Are they just staying up? What's our uptime? Are they compliant? Are we allowed to do this? Are they secure? Sure, that's important too. How much manual effort is there in the deployment of a cluster?
And at this point it takes about two hours to stand up a cluster. So not too bad. But mostly the fear factor. If I'm going to the team deploying clusters and I say, "Hey, can we deploy that change, or that change you have ready to go tomorrow?" do they freak out, or are they like, "Sure"? So trying to balance those needs.
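The cutover criteria above can be written down as an explicit checklist. This is a minimal sketch, assuming each criterion can be reduced to a number or a yes/no; the field names and the thresholds are assumptions for illustration, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class ClusterReadiness:
    uptime_pct: float      # stability: are the clusters staying up?
    compliant: bool        # are we allowed to do this?
    secure: bool
    standup_hours: float   # manual effort to stand up a cluster
    team_confident: bool   # the "fear factor": would the team ship a change tomorrow?


def cutover_blockers(
    r: ClusterReadiness,
    min_uptime_pct: float = 99.0,    # assumed threshold
    max_standup_hours: float = 4.0,  # assumed threshold
) -> list[str]:
    """Return the reasons not to cut over yet; an empty list means go."""
    blockers = []
    if r.uptime_pct < min_uptime_pct:
        blockers.append("uptime below target")
    if not r.compliant:
        blockers.append("not compliant")
    if not r.secure:
        blockers.append("not secure")
    if r.standup_hours > max_standup_hours:
        blockers.append("cluster stand-up too manual")
    if not r.team_confident:
        blockers.append("team not confident deploying changes")
    return blockers


# Hypothetical snapshot: a two-hour stand-up, but a nervous team.
snapshot = ClusterReadiness(
    uptime_pct=99.5, compliant=True, secure=True,
    standup_hours=2.0, team_confident=False,
)
print(cutover_blockers(snapshot))
```

Returning the list of blockers rather than a single pass/fail keeps the conversation about what to fix next, which matters when the swap has to happen as late as the deadline allows.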
And then there's no way we're getting off PCF. This is ridiculous, but we're totally not ready. So start buying licenses. And so we got three.
But as I was saying, our RKE2 clusters are doing all right. But they're not stable yet. We're not happy with them. And we're finding it's really hard to get to a root cause of outage on the clusters we built, which is a red flag.
And we've chosen a storage solution for parity reasons, because we've got this on-prem environment in air gap. We want to use the same storage solution there as we're doing in cloud. And because that meant we were storing all the data inside the cluster, when you roll all the nodes, you've got to move the data around. Long story short, it would take days for some of these clusters that had a lot of data on them to do an upgrade.
So that's a lot of toil.
So we made investments in these problems. We got all the logging and monitoring and metrics for the infrastructure off the cluster, because one of the problems we were having is the SRE team was building their tools on Kubernetes. It meant all the tools to analyze a cluster failure were inside the cluster. So the cluster failed, and you couldn't look. So we moved all that off.
We did our own SRE, like we probably should have done from the beginning, and added that alert for cluster down, because I didn't want to find out from my customers that the cluster was down anymore.
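A minimal sketch of a "cluster down" check that lives outside the clusters it watches. It assumes each cluster's API server health endpoint (Kubernetes exposes `/readyz` and `/livez`) is reachable from the monitoring host and presents a certificate that host trusts; the cluster names and URLs are hypothetical, and a real setup would hand the result to whatever alerting channel is in use.

```python
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, TLS error, non-2xx status.
        return False


def clusters_down(
    endpoints: dict[str, str],
    is_up: Callable[[str], bool] = probe,
) -> list[str]:
    """Return the names of clusters whose health endpoint did not respond."""
    return [name for name, url in endpoints.items() if not is_up(url)]


# Hypothetical endpoints; the probe is stubbed so the sketch runs offline.
endpoints = {
    "staging": "https://staging.example.invalid:6443/readyz",
    "prod": "https://prod.example.invalid:6443/readyz",
}
print(clusters_down(endpoints, is_up=lambda url: "staging" in url))
```

Because the check runs from outside, it still works when the cluster, and everything hosted on it, is the thing that failed.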
And then we broke this parity for our storage solution, because what you're trying to do as a platform is create parity for your developer customers. It's nice if you can have it for yourself, but the trade-off here for cycle time was just too high. And so that reduced the upgrade time from a couple days to a couple hours. So that's nice. Now you have a good cycle time.
So why did we wait so long on all these things? Well, the team's got to gel, I guess. And you've got to learn from your mistakes, and you have to see it a couple times. You've got to swing and miss a couple times before you can actually knock it out of the park.
But yeah, it took days to do some of these things, and I wish I'd done the observability way sooner.
But my team could do these because we had DevOps. We had access to the production environment. We could push our code whenever we wanted, really. I've got a product mindset. There's no approval board, right? My customers are my approval board.
So I've done a lot of the work there in messaging communication channels. And we've built all this skill because we've been running these four flavors of Kubernetes in production. I guess it's three at this point.
So we're able to respond to these things and do some really smart quality infrastructure work. And we've got clusters in production with customers, so that gives us a lot of cred.
And I also want to speak to the skill of the team here. We got hit with an executive priority that we needed to load test a cloud.gov-hosted website with chaos engineering tools, and we had about five days.
And because we had this maturity built in, my team stood up another cluster with some custom nodes and worked with all the other teams we had to work with. And we had these chaos engineering tools running with a hundred million hits per hour, testing this website that you've definitely heard of. And it's not a Defense Department website.
So this cluster capability in cloud was somewhat unique in government at this time. That's what I'm really trying to say.
We hadn't really given this to our customers, because we delivered the orchestration layer. The clusters are stable, but the rest of the platform isn't quite there yet. And we haven't really gotten the air gap solved.
There's no easily accessible parity infrastructure for this production air-gapped system. So I can't bring my whole team, which is now mostly remote. Some people might not have clearances or might have some other access issue. And then even if I could, we've got COVID. I can't put them all in a room together working, right?
We should have fixed this first. I should have died on this hill of building something that looks just like production somewhere we could all reach it, so we could bring all the brainpower from the team. I pushed hard, but I didn't die on the hill.
But these roles came in. So we got a director of product for platform, deputy for product. We've got all these technical product managers running these teams. Incredible folks, the most talented federal leadership I've ever worked with.
And now we've got a roadmap, and we've got hierarchical epics, we've got value stream mapping, dependency mapping. Clusters are stable. There's challenges across the air gap, still a problem. That's going to be the hard thing.
And then the documentation, the developer experience, it's terrible. But everybody here has said that, so I'm not too worried about that at this point.
I take on a new role to take that on. And yeah, so we all live happily ever after. What could possibly go wrong?
The murder plot twist.
So at this time, things kind of freeze. And that new role I just took on, we're not doing any of the platform stuff we're talking about because we have a major change in leadership.
Mean time to leadership change is not an official DORA metric, but maybe it should be, because you shouldn't schedule an effort that's bigger than your mean time to leadership change.
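The rule of thumb can be made concrete with a little arithmetic. This is a sketch only: the function names are hypothetical and the dates are invented for illustration, not Kessel Run's history.

```python
from datetime import date


def mean_days_between_changes(changes: list[date]) -> float:
    """Average gap, in days, between consecutive leadership changes."""
    ordered = sorted(changes)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two leadership changes to compute a gap")
    return sum(gaps) / len(gaps)


def fits_within_tenure(planned_days: int, changes: list[date]) -> bool:
    """True if the planned effort is no longer than the mean time to leadership change."""
    return planned_days <= mean_days_between_changes(changes)


# Invented dates, roughly one change per year.
changes = [date(2019, 1, 1), date(2020, 1, 1), date(2021, 1, 1)]
print(mean_days_between_changes(changes))
print(fits_within_tenure(300, changes))   # a ten-month effort
print(fits_within_tenure(1095, changes))  # a three-year effort
```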
So there's this huge change in strategy. And what can we say? We failed so far to deliver. So maybe they're not wrong. I don't know.
So they say, "We're going to buy a platform. We're not going to build it. This is ridiculous. We've got to stop doing this. We're not in that business. Cloud only. This on-prem thing is incredibly inconvenient." And it is.
And we're going back from product management to project management. And so I'm managing this pivot and trying to teach everyone how to do that. And we're getting rid of design roles. So this dev experience and documentation probably is not going to get better. And lots of Gantt charts.
And it's time to pivot again.
Just a note about handling differences with empathy. I've heard so many stories about leadership disagreements and trying to change that guy's mind and all that stuff, but leaders like this new leadership we had, they just want to feel safe. They didn't understand what we were doing, and it didn't look successful.
But you can build adapters for that, and you can keep working the way you know how, as long as they see what they expect to see. If they expect to see a Gantt chart, build a Gantt chart. That doesn't mean your team needs to see that same Gantt chart.
But I feel like we almost made it. This RKE2 roadmap was the success. We were now on a cluster thing that could work. But we had these delays from the pivots. We had the delays from staffing. We're trying to do something really hard that a lot of people are talking about at this conference, so I feel better about that.
That lack of emphasis on production parity for our team really slowed us down. And so did this big bang migration plan: we hadn't gotten even one thing across, because you were going to have to do it all at the same time, all or nothing. And when that meant we had nothing delivered, it was so easy to put it away.
And now we're moving to EKS.
So this should be an easy swap, right? Because we're running Kubernetes. We've already got three flavors. What's it matter? But because the change in how the organization is working happens at the same time, there's just too much going on, and we're really stuck.
So a year has gone by since this pivot started, and calling EKS a solution that's serving some sort of production workload is maybe being a bit generous.
And then I started to see a lot of leadership departures, because people came to do product-led work, and people came to build a platform, and it looks like we're not doing that anymore. And then more reorgs to respond to those.
So that was painful.
So RKE2: we were just months from killing off RKE1 and PCO, and instead it's frozen. It's canceled. So no further development there. That roadmap ends. That's the end of the story.
RKE1 is over there running across the air gap, caught the blame for a couple of flow interruptions out of this, whatever you want to call them. Just delete it. We'll go back to the VMs. That story. But it gets us back down to three.
So RKE2 is frozen. EKS isn't ready. This PCO work that I talked about at the beginning, this vanilla thing we did ourselves, it's working, that's really cool, but it has nowhere to go. So PCO is still supporting these really important, critical business tools and developer tools.
So after three years of trying, all of air-gapped production is still on TAS and Cloud Foundry. And yeah, that works. Cool. So congratulations to VMware and Pivotal Labs.
It is really hard to top a working platform. And that's what we had.
So help. I'm looking for who's transforming government HR. I'm looking for mentors as I try to get into higher roles in government, because I'm really curious about who hires the first really highly qualified technical person in a government agency.
And then also validation or dissent on getting observability really early, which everybody said, "Do that." So good.
All right, well, thank you all.