Steering Towards an Outcome - Sense & Respond
Max Reele from the Defense Innovation Unit is excited to present a case study on how DevOps played an integral role in the noncombatant evacuation of American citizens and refugees during the US withdrawal from Afghanistan over two years ago. The compelling story and examples will demonstrate how organizing for the desired outcome enabled DevOps teams to achieve optimal sense & respond performance to update IT systems at the speed of rapidly evolving humanitarian and military operations around the world.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
So I am so excited about the next person presenting. So many of you have heard or seen the amazing work of Kessel Run in the US Air Force and how they blazed the trail to modernize software practices within the US Department of Defense. In some ways they helped create scores of similar efforts, showing that there's a better way to build software for missions that matter.
So earlier this year I was doing an event with Lieutenant Colonel Max Reele, who was formerly their Deputy Commander and materiel leader. He spoke last year here at the conference, but he shared this amazing story that I hadn't heard before about their role in August 2021 — how Kessel Run helped enable the largest non-combatant evacuation operation in American history, helping evacuate over 124,000 civilians from Afghanistan. It's an astonishing case study full of learning for any technology leader.
So Kessel Run was founded from the Defense Innovation Unit, so it's only appropriate that Max now heads up the US Air Force projects for DIU, where they aim to leverage commercial innovation to meet needs in operational missions. Here's Max.
---
Max Reele
Thank you. Thank you, Gene.
That walk-up music has me feeling like I'm entering a presidential debate or something. That's a bit overwhelming. Thanks to the team for having me back this year — really proving that they believe in the paradigm of DevOps. You get to fail fast, you get to learn, you get to come back and try again after fumbling through a rushed presentation last year. Try and do a better job this year.
I saw on the scrolling Slack that this is the year of platform engineering, and I absolutely loved that — that's what came across. Because one of my favorite DOES memories from last year was getting to have a small group conversation, for which I'll paraphrase his statement so as not to give any endorsement to any specific product. But Paul Gaffney said, "We as a community haven't properly lamented the death of the easier-to-use, easier-to-operate paths that are out there." And I just felt like maybe this is the year. If we're scrolling it over there and talking about how important platform engineering is, maybe that's where we are.
So what we're here to talk about today is the case study that Gene introduced. I'm really excited to be able to tell this story. I was formerly the Deputy Commander of Kessel Run, where this case study took place, but today I represent the Defense Innovation Unit, known as DIU. We're kind of a VC that pulls commercial tech into the defense sector, leveraging that technology for defense priorities to close capability gaps in military operations.
You can see from the infographic we have locations in most of the tech hubs in the United States. We also have a lot of international partnerships. We'll run things like prize challenges with international partners and international innovation organizations, so that's pretty exciting for what we're looking at this year. We bring in a lot of first-time vendors — for the first time that they're doing business in the defense sector. So if any of you represent companies that are interested in dipping your toe into the defense sector, maybe click the link at the end of the presentation.
Additionally, we run a program where we have a lot of reservists. So if any of you have service in your history, or operate in a reserve capacity now, please look us up and see if it's a good fit into any of our portfolios. We have a fairly diverse portfolio across the tech sector: autonomy, AI/ML, wearables and human systems, energy and efficiencies, space systems — both launch and on-orbit — as well as some cyber IT work and cybersecurity.
But that's not what we're here to talk about today. Today we're here to present some information about how DevOps helped us achieve some operational outcomes in mission execution — at a time when mission execution was changing as fast as the operators could keep up with it. And when I say operators, I don't mean IT operators necessarily — although we just talked about that and I have much love for the IT ops side of the house. What I'm talking about is military operations, whether that's in the traditional sense of warfare, or military operations in more of a humanitarian aid or other non-combatant context. And that's what this is.
Many of you will recall pictures like this that came out just over two years ago, as a result of the US withdrawal from Afghanistan and the immediate surge by the Taliban to reinforce their influence and their governance of the nation. We had to do a non-combatant evacuation of our allies, which was called Operation Allies Refuge. And that's where most of these pictures come from, the ones that were really captivating us in the newsreels of the day.
This was not widely announced ahead of time, on purpose, so across military operations it was a change that everybody felt. It was a change to 20 years' worth of daily operations that they had been conducting in the Middle East, essentially since 9/11. We had been conducting more of an intelligence and surveillance operation across our air campaign in the Middle East for that entire time, and essentially overnight that completely changed into what became the largest non-combatant evacuation airlift in US history.
Over 124,000 individuals had to be evacuated out of Kabul and bedded down at the installations you see across this map here. That brings a lot of new problems and a lot of new operations to think about when you've been doing one thing for 20 years — flying about 150 flights a day in the region, mostly surveillance missions — and turning that into: we now need to evacuate over a hundred thousand humans and get them bedded down in these other locations, with many things to think about.
So there were two primary systems that Kessel Run was running at the time that needed to be used to support Operation Allies Refuge. First is CREDOS. Second is KYMERA. I'm not going to spell those out for you because we'll be running out of time pretty quickly. But to put it into scope: CREDOS does air campaign management at a regional level — think the size of the US, or a cluster of smaller countries in a different region of the world. Regional operational command and control of air campaign management. KYMERA is the zoomed-in version of that, down at like the city level — looking at an individual installation and the surrounding community around that installation. So when we're talking CREDOS, we're talking regional level, country size. When we're talking KYMERA, we're talking city scale, zoomed-in version on your map.
Let's talk about how those two programs had to adjust themselves with the changing mission operations.
So first off, with CREDOS. This was running what is called an air tasking cycle. For the Middle East at the time, they were using the CREDOS software to publish all of the flights that needed to come in and out of Afghanistan, in and out of the entire theater. Well, when they went from the flight load that was expected for that day to what actually happened, when they had to turn on the largest non-combatant evacuation airlift in US history, that decidedly choked the system.
So what you can see on the continuum across the bottom is a series of outage calls that were made as the information was coming in and the operational planners were trying to plan all of those flights and plan all those missions. And the system wasn't built for that. It was built for what had been the operations for the previous 20 years.
So what would you expect to happen for the people on the ground when their IT system starts crashing on them because it can't handle the load? All good military members are taught to adapt and immediately move on to a different tactic that's going to help them get the mission done without fail. So what would we do in this type of planning operation? You would immediately revert to the whiteboards: jump on a whiteboard, start writing down all the load-outs, get in an Excel spreadsheet, start typing it up. It would become a very manual process.
But because we had built so much user trust with this community and this specific command and control center for the Middle East, they trusted that we would attack this problem with them. While things were on fire in their command and control facility and they were trying to figure out how to adjust to the changing mission, they knew that we would be right there with them. In fact, we were — we had liaisons sitting right in the command and control center with them. And those liaisons were woken up, they drove immediately into the base, got on the floor with the combat plans division, and sat with them to start to see what was happening within the software applications and why it was crashing. They immediately called back stateside, got the dev team up. The dev team feels like they're a part of this mission. The dev team immediately surged into our office, got onto the classified area where they can see what's happening with these missions and why the system is choking.
You see the series of outage calls — it's all the standard backend IT ops reasons for why the system would be choking on this. So they clean bin files, they scale CPU, they double instances. They do all the things to keep kicking the can down the road to allow the system to continue to operate, even though it was experiencing extreme load based on what it was designed for. That means we also had to get our platform engineers involved, so they surged into their classified location where they're able to adjust the system as well.
So you have Dev and Ops working together with somebody that's sitting in the user community, all at the exact same time — so that within this 24-hour period they could surge forward and improve the usability of the system to the point where it's no longer experiencing any of the latency that was causing the mission operators to want to bail on using this software system.
So they overcome that and they get to the place where now the mission operations are flowing freely. Within 24 hours they were able to scale an entire new mission set out of this software suite. And at the same time — one of the proudest moments — they were able to actually push UI updates too. So while the platform team was doing everything they could to get the IT operations to be suitable for the new mission profile, the front-end developers are doing the same, along with the designers, and they're pushing a new UI. Because the combat plans division sitting in the Middle East — they were used to running intelligence and surveillance flights of about 150 a day, and now they're having to have their load-out look like a massive airlift mission. Naturally their UI needs to be a bit different. They need to see different data fields, they need to see a different number of flights that they can see on the observable screen and what they're publishing for their leadership.
So this was really a very successful win for showing how DevOps — with the power to be able to push directly into prod when they knew there was something that the users urgently needed — could keep up with the changing mission profile at the exact same time.
So that was at the operational level. That was at the regional air campaign management level.
We zoom in a little bit to KYMERA — again, installation-level management. If you remember back from the map, all of those 124,000-plus humans that had to be evacuated out of Kabul had to be bedded down at a dozen or so facilities, military installations mostly, where they were going to be bedded down until it was figured out what would be the next evolution of their livelihood.
So we're uprooting all of these people, we're bringing them in, bedding them down on a military installation. Military installations have the system KYMERA that they use to track what is happening on their installation for internal command and control, so that somebody like a base commander could look out there and see how many support personnel are on the base for medical, for fire prevention, for emergency response. That's what it was designed for. It wasn't designed to handle its share of over 124,000 refugees dropping in on these installations across the world.
So there were many new data fields and attributes that were necessary. So that they could still use this command and control system for what is now a brand new mission — introduced overnight to each one of the bases that's accepting these refugees and trying to take care of them in a very meaningful way.
Some of the examples of data fields that needed to be updated and attributes that needed to be updated: for one, it had a data field for beds — how many beds are on this installation? But the attributes associated with it were namely around hospitalization. We're talking firm beds in a hospital with all the hospital equipment around it, life support equipment. We're not talking the kind of beds that are now relevant to the bedding down of these refugees. We're talking cots and blankets. And so the attributes associated with that data field needed to be updated overnight, so that as these refugees started to land on any one installation, we would know: can we actually care for them there, or are we going to do something terribly inhumane and not even be able to care for them where we evacuated them to?
Another example is meals. Traditionally in a military installation, you want to know how much food you have. If you have to disconnect and go off the grid, you have to be able to feed the troops and keep them healthy so that you can continue to conduct operations. So you know how many meals and how much reserve you have on that base. We're talking military meals: give us a PB&J and some peanuts and call it good. But in this instance we're landing refugees, and there are cultural sensitivities that have to be considered. So those data elements needed attributes that would let us know that we're being very caring about the way that we accept the refugees: how many halal meals do we have on any one of these bases?
Additionally, there were basic life support requirements that needed to be considered — childcare, how much childcare activity down to the infant and baby level do we have on any one base, so that when we have refugees coming with small children, we know exactly which base to send them to because we know where that capacity is. Again, that's something that wasn't tracked at all — there was no real need for that when you're talking about military operations. But this is a totally new environment that changed overnight for the people running it.
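To make the bed, meal, and childcare examples above concrete, here is a minimal Python sketch of how a resource field can carry attributes and how those attributes answer the question "which base can care for this group?" It is an illustration only, not the actual KYMERA data model: the `Installation` class, the attribute keys (`"cot"`, `"halal"`, `"infant"`), the base names, the counts, and the three-meals-per-person assumption are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Installation:
    """Hypothetical installation record; each resource maps an attribute to a count."""
    name: str
    beds: dict[str, int] = field(default_factory=dict)             # e.g. "hospital", "cot"
    meals: dict[str, int] = field(default_factory=dict)            # e.g. "standard", "halal"
    childcare_slots: dict[str, int] = field(default_factory=dict)  # e.g. "infant"


def can_receive(base: Installation, people: int, infants: int = 0,
                halal_meals_per_person: int = 3) -> bool:
    """Return True if the base has enough cots, halal meals, and infant care.

    The per-person meal count is an assumed planning figure, not a real one.
    """
    enough_cots = base.beds.get("cot", 0) >= people
    enough_meals = base.meals.get("halal", 0) >= people * halal_meals_per_person
    enough_care = base.childcare_slots.get("infant", 0) >= infants
    return enough_cots and enough_meals and enough_care


def bases_for(bases: list[Installation], people: int, infants: int = 0) -> list[str]:
    """List the names of bases that can receive a group of the given size."""
    return [b.name for b in bases if can_receive(b, people, infants)]


# Made-up example data: Base A only tracks the original military attributes.
bases = [
    Installation("Base A", beds={"hospital": 40}, meals={"standard": 5000}),
    Installation(
        "Base B",
        beds={"hospital": 20, "cot": 500},
        meals={"standard": 3000, "halal": 2000},
        childcare_slots={"infant": 30},
    ),
]

print(bases_for(bases, people=400, infants=25))
```

With only the original attributes (hospital beds, standard meals), a record like Base A cannot answer the new question at all, which is why the attributes themselves had to change before leadership could route refugees to the right installation.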
It was really exciting that we were able to successfully pull this off. Not a 24-hour turn on this one, but as you can see from the timeline, still very respectable. Within four days of the initial request that came in to update the data fields and the attributes associated with the KYMERA system, we were able to work with our prime vendor partner executing Agile DevSecOps. They were able to push the updates to those data fields, load new attributes, adjust the UI — so that the base commanders and leadership could then see what was happening and have a scope of where they needed to send different refugees for different purposes.
Again, very successful use of DevOps in mission operations, keeping up with the changing operational environment at the same time that the users needed it. Previously, just speaking from experience, I think we would have scrapped the IT system, knowing full well it was going to take six months to update that thing for the mission we're doing now. Not six months: we were able to do it within a few days. Their IT system was able to keep up with their changing operations, so they could change their tactics and procedures knowing that their IT system was coming along with them. That is pretty incredible given the history of government software, which generally can't keep up with any changes, let alone something that happens overnight.
So what's the importance of these two different case studies? Let's put them in juxtaposition with each other. Both were executing Agile DevOps, which is incredible, because it shows that there's real mission success that can come from executing DevOps in a very healthy way. But it also shows that you can be really successful even though these two programs were run with very different management approaches and very different business operations approaches. There is no one right answer for the public sector on how you need to structure your contracts or structure your business operations to be able to run DevOps successfully.
Do you need organic dev teams in-house, or can you contract out with a vendor for your dev teams? Does it need to be a small tech business with high-speed engineers, or does it need to be a defense prime? Do we need to own our entire path to production in-house, or can we use enterprise IT that's outsourced somewhere? All of those things were different about both of these programs.
What's important is not those individual attributes. What's important is that we as leaders know how to lead appropriately to empower our DevOps teams — again, with a call-out to last year's presentation — to give those DevOps teams the privilege of accountability so that they can start to run DevOps the way that they know they need to for their users. And that's exactly what was the secret sauce to them being able to keep up with this changing mission operation.
Really an exciting time for these teams. You can imagine the amount of importance they felt around the work they were doing when they were able to successfully pull this off. And all that does is it breeds organizational optimism for what you're able to do when you know that you're executing DevOps healthily and your leadership lets you execute the way that you know you need to.
So very different business management approaches to both of these systems and programs — all able to be successful, as long as we think about DevOps in the appropriate context.
Alright, just a couple of user testimonials from our highest-level stakeholders. I shouldn't call them users — they're not hands-on-keyboard — but they are incredibly important. I'll just read you the top one because I think it's really important:
"Kessel Run's programmers more than kept pace, rapidly iterating critical applications in order to meet the evolving needs of the leaders, planners, and operators across the world in real time."
So even the highest-level stakeholders — who are very abstracted and separated from the actual combat plans division who are hands-on-keyboard planning military flights, separated by four echelons of hierarchical command and control from where the actual operator sits — even they get it. They get it that in real time, we did not expect our software system to be able to keep up with such a rapidly changing mission environment. But it did, and it was so impressive that they were willing to give us a testimonial about it with such high praise.
Alright, I'm going to transition a little bit into talking about why I think it was so important that our dev teams felt like they were a part of this mission.
As I mentioned, we had liaisons that were sitting in that Combined Air Operations Center, the command and control facility that was actually handling the military airlift at that time. Because we had liaisons embedded in the user's environment, they had genuine interest in the user's context, because they felt like they were in it. They didn't feel like they were a supporting entity to mission operations; they felt like they were part of the mission operations. That's a very big difference from when you call out to an enterprise IT cell that has never actually even seen where you work as the user.
And so I just think there's something really special about developing that user trust model — where you develop enough user trust that your dev teams feel like they are actually the brethren of the users. They feel like they understand their operational context, and they feel like they've become part of the operational mission of that user.
So that brings me to — in true DevOps Enterprise Summit fashion — a little bit of an ask and a little bit of a showcase of what I've really been wrestling with most recently. It's been about five years that I've been thinking about this in my head, trying to read on it and study on it. There's so much out there that has really crystallized this, but I think there's a way that we can evolve it a little bit here for the public sector.
You even see in the pictures — these are photos I've taken — our developer teams embedded with our users in their operational environment. The middle one is actually the team that did most of the surge forward for Operation Allies Refuge, so huge call-out to them and thanks to them for being so successful at their mission on that day. On the top, you see a PM who's sitting on a wing with an aircraft maintainer who's showing him how they do their actual maintenance job. So we're not just going out there and watching them — he literally got up on the aircraft and is sitting on the wing watching him do his job. He just wants to understand it that much more.
So that's what these pictures represent. And the reason I show them is because this user trust model concept that I'm really trying to wrestle with and evolve — I think there are three important tenets, but I think everyone in this room can really help me flesh that out quite a bit more as we put some deep thought into this to try and make our teams more successful.
I think consistency — establishing yourself in the user's environment — is absolutely paramount.
I think transparency is one that we probably don't talk about enough. How do the users on the end understand what your process is for prioritization? So if they ask for something and don't get it, they don't think that you ignored them — they understand the transparency in the way that you prioritize, and the process by which you push features out, so that they can know exactly where they fit in the queue.
And then frequency — the revisit rate on how often you're going to connect with your users, so that they can trust that you'll be back. So that when they have ideas, they can store them on their phone in their notes, knowing you'll be back and they can bring them to you. They don't feel like, "I have an idea — I either give up on it because they're not coming back, or I better funnel it in right away." They can actually evolve those thoughts knowing that you'll be back pretty soon.
So if you don't mind, please put some thought into that. Let's collaborate.
As we do in true DOES fashion, you can reach out to me on any of these profiles here — the DOES Slack or of course on LinkedIn. And if you have any interest as a small business in the tech sector, or as a reservist wanting to do personal time serving with the Defense Innovation Unit, please click that link on the bottom and you can find us there, and we'll find opportunities together.
I'll be running a Birds of a Feather at 3 o'clock for the public sector, so if anybody wants to have any more conversation about any of this, I hope to see you there. Other than that, thank you for your time today. It's great talking to you.