Employing DevSecOps for Air Force Cyberweapons
Many commercial companies conduct DevSecOps transformation to increase business value. The US Department of Defense currently seeks to implement Continuous Integration/Continuous Delivery in several areas to overcome previous multi-year program delays. Our corporation recently contracted to move a USAF cyber weapon system from a Waterfall model to DevSecOps. Our story involves moving multiple facilities, building a culture, developing relationships, and incorporating Agile practices despite compliance, technical, and communication challenges. While our journey continues, this session shares our experiences incorporating DevSecOps and an Agile workplace into a Department of Defense Cyber Weapon System.
Full transcript
Dr. Mark Peters
Good afternoon. I'm Dr. Mark Peters, and I'm a presenter here today with DevSecOps for the USAF Cyber Weapons System, Year One.
I'm excited to be here at the DevOps Enterprise Summit. This is my first year at the DevOps Enterprise Summit, and although I'd potentially rather be in Vegas, it's certainly nice to be presenting to all of you regardless of the situation.
With me today, or with me in the question and answer, will be Jeff Sorrell, who works business development for Technica, as well as Brian Butler, who's our system integration lead for Technica. Both of these people will help us advance the program and help answer any particular questions you may have.
As we started, I started out working with DevSecOps and Technica as their lead IA security engineer. So I got a lot of experience in looking at some of the problems we solved and some of those unique problems we faced in trying to take an Air Force cyber weapon system and integrate it into this blend of DevSecOps.
When we're looking at that, when we're looking at what we're doing, we have to know that the mission background is the US Air Force specializes in air power at the right place and at the right time. That's how we made our money. I did 22 years in the Air Force as an intelligence professional, and that's how we learned what we had to do.
Now, that culture that comes with that, delivering at the right place and the right time, comes with high levels of training and high levels of risk. So that culture influences even cyber tools.
I used to work with a lot of fighter squadrons. I worked with both the F-16s, which was the Falcons or the Vipers, and the F-15E, the Eagle or the Strike Eagle. When they had a flying culture, they used to talk about that their rules were written in blood, that the standards they had in place to prevent things all came out of having fatal accidents.
One good example is with the F-16s, which are single-engine aircraft. When they come forward to fight, they go into a one-circle fight. They each turn off like this, their cockpits face each other, and there's one circle. Well, in training, they were only allowed to do 180 degrees of the turn around that circle before they had to exit the circle and go off. That made sure that the circle stayed a constant distance. It didn't get smaller, and they didn't crash into each other and lose pilots, right? Because even though that might not happen, the chances of it happening were higher, especially if you allowed it to go more into that practice.
So they test these practices, they test, and they test again to prevent any error. However, this runs contrary to when we get into the culture of DevOps, and we try to move faster, and we try to make sure that we can move faster.
So what we wanted to do as we came here was try to figure out how we could accelerate that DevSecOps delivery for the weapon we were delivering, while we were overcoming those cultural challenges, while we were overcoming the waterfall mindset that said stop and check at every level, check at every facet, so that we made sure we went to the right place at the right time and did what we were supposed to.
Now, we start off, and one of the challenges we faced, if you look at the slide over on the right of the picture, or on my right as I talk about it, this is the simplified and easy version of the DoD acquisition strategy that defines what you have to do in order to have a project delivered.
Unfortunately, each of those little blocks, those little different-colored blocks, probably has anywhere from six months to a year's worth of effort involved in the whole process if you do it right. So a government waterfall program is a three- to five-year cycle to get anything out, no matter how simple, right? The simplest projects often take longer and can be more complicated.
I was once in line where they had to adapt a computer so they could focus in a pressurized environment, and that even was the three- to five-year cycle, even though we could buy it off the shelf. The admin and packaging of those components was more important than actually delivering a functional product at the other end. Getting all the right paperwork associated with it to make sure that something bad didn't happen was the goal and the reason we moved forward. And the reason for that goal was those rules written in blood, right? We had to do it the right way.
So we started our DevSecOps transformation. We see that the DoD, in some of the higher levels of the DoD, mandated the use of Scaled Agile. They liked the Scaled Agile Framework. They wanted everybody to use it, and we were going to be no exception.
So when we came onto the Scaled Agile, when we looked at some of the integrated teams, we started off and we said, "Hey, we're going to have a team for development. We're going to have multiple teams for development. We're going to have an operational setup," which from a program standpoint is more of your program management viewpoint. "We're going to have a security team, we're going to have a lab team, and we're going to have a couple other teams. And we're going to try to build those teams in together the best we can, to get to the right process for us."
Well, what do we see as our mission requirements? And I know I'm blocking that slide at the bottom. The part at the bottom actually says accelerated timelines.
There was no standard infrastructure. We took over the contract from someone previous. The previous contract had actually not been one contract but had been three contracts. They'd run one for sustainment, one for maintenance, and one for development. Instead, when Technica picked up this contract, we ran one centralized contract. Although we maintained subs, we did it all together so that everybody could talk together and work together.
But we didn't inherit any infrastructure, so we had to build our own infrastructure, meaning that we had no infrastructure, we had no dev environment. We had no place we could go to test things, and still, we stood up the infrastructure.
We didn't have any physical workplaces when we started. One of the things was we said we'd go and have a workspace, and we had it contracted out, but then the contract ownership of the realty changed, which meant our contract changed, which meant what we expected to be there on day one didn't wind up showing till about day 180. These things happen. They're part of the Agile process, right?
The previous contract hadn't had integrated security built in. It had different periods of it built in. They left us with a half-built model. They hadn't finished their release when they got out, and not only that, but the releases they did have had incomplete documentation. They didn't build it all the way through in what the government expected.
Now, I know Agile says that you should prefer functional software to paperwork, but at the same time, there's places where the paperwork is helpful, especially if you've got users scattered all over the world, like the Air Force, and different users have to do the same thing in a different manner.
So what did that requested feature look like? What were they talking to us about when they said a weapon system? Well, the weapon system we're talking about is CVA Hunter. The CVA Hunter service is kind of a blue force, kind of an off-net SOC, that it can come in and it can take a look at your functionality, at someone else's functionality, or the functionality for that mission owner, and then provide feedback and say where is it good, where is it bad, outside of your normal operations set, outside of the normal process where you might see problems happening.
So very good, very valuable cyber weapon system on the defensive side of things. Not on offensive, not getting out there into the enemy's network, but sneaking on our network, looking at traffic we have rights to, and trying to figure out how we can get the best results for the systems we have, how we can protect it the best way.
Well, our flight plan, if you want to call it a flight plan, was to close some of those software and documentation loopholes, get right up to speed on the processes, and get everything out there as fast as possible.
We thought we could innovate our way to successful delivery and do that through creating a winning culture in the DevOps mindset. We looked at the architecture. We built up the systems of the architecture to mirror to our integration, so that in our system lab, when we talked about it, we had not just virtual environments, but we had hardware environments.
So we used those hardware systems, and we tied them to our service desk. Service desk answering tier one, tier two issues. So our tier one is problems you can fix with a checklist. Tier two is configuration issues. Tier three is we need code to do it. We have to have code separately. While our development teams were right there in the code or in the process, so they could fix it if it ever came to a tier three.
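The three tiers described above can be sketched as a small routing rule. This is a minimal illustration, not the team's actual service desk tooling; the `Tier` enum, the `triage` function, and its two boolean inputs are hypothetical names chosen for the example.

```python
from enum import IntEnum


class Tier(IntEnum):
    CHECKLIST = 1      # tier one: fixable by following a checklist
    CONFIGURATION = 2  # tier two: a configuration issue
    CODE = 3           # tier three: needs a code change from a dev team


def triage(has_checklist_fix: bool, is_config_issue: bool) -> Tier:
    """Route a ticket to the lowest tier that can resolve it (assumed rule)."""
    if has_checklist_fix:
        return Tier.CHECKLIST
    if is_config_issue:
        return Tier.CONFIGURATION
    # Anything else is assumed to need code, so it goes to the dev teams.
    return Tier.CODE


if __name__ == "__main__":
    print(triage(has_checklist_fix=False, is_config_issue=True).name)
```

Keeping the development teams close to the service desk, as the talk describes, is what makes the tier three hand-off cheap in practice.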
But we had these blocks for that DANS program, the Defense [varied] Network Systems that we had. And one of the most valuable was what we talk about as being our lab or integration lab, where we got the capability for each developer to set up their own virtual environment where they could do their own testing, or their first levels of testing, before it came out to the next point that we needed.
We talked about our physical challenges. We actually went through three temporary sites. The picture you see is the building we eventually wound up in, that one that was supposed to be there on day one, but due to the realty issues, had to be delayed just a little bit.
When we got there, it was great, though. It was smooth. You'll see another picture later on. We got the videos up on the wall for the iDRAC to monitor the status of the system, monitor the dev environments.
We moved from having that dev-controlled environment individually to having a continuous infrastructure. So under the previous contract, even though they had their own environments, we hadn't inherited that. But the devs could go in and destroy the entire, not destroy, update, exchange, expand the entire overall system.
When we brought them in, we brought them all into that lab so that they stood up their own environment, and we provided the infrastructure. And it was that deployable infrastructure, infrastructure as code, whenever they wanted it to be able to move to the next step. And that really allowed us to accelerate our delivery.
Because with that, and here's a picture of the new site. You get the pictures up, and they've got those great Technica beams on them. But what they let us have is they let us have that iDRAC and that control over what the different integrations were using. So not only could we look at the continuous infrastructure they were deploying, we could manage those requirements and resources.
And there was one instance, there was probably a couple instances, where we noticed during the pandemic, developers were using it from home, and certain developers were using more of the system, more of that VPN pipeline, that security back and forth, than we may have actually needed to do the job because there was only one or two of them that was using that much. And they weren't using too much functionality. Well, they were moving the entire ISO back and forth every time they started up and every time they booted. And that was bad.
In the spirit of DevOps, once we figured out what that was and that it was hampering our flow, we gave them the feedback. We did some experimentation to make sure that we had the right answer, and we found ways to speed it up. And we only did that because we committed to that continuous infrastructure. We committed to that continuous way to move forward. Those were the lessons learned in the process.
And of course, that was learning that virtual and presence is not always equal to virtual presence. Sometimes you actually have to get face-to-face with an individual to get the results you want out of the DevSecOps program.
We did see some varied baselines. We saw some problems in running through with the government, not because of anyone, but because no one had run this fast before. When we took that first baseline, we took that half-done software. They'd been working on it for two years. We finished it off by the end of the first month and pushed it out the door.
After that, we were on about a three-month cycle. So every three months, not true Agile. Well, I say not true Agile. It was our Agile, so it doesn't matter if it was true DevSecOps or true Agile or our Agile, because we made it work. But every quarter, we'd be pushing something out the door.
We used a system of integrated testing, and you can see the system of integrated testing up on the slide, to get functional tests, dynamic tests, integration into the product, and check how those things worked. And along the way, as we were building them, we drafted the documentation.
Now, a lot of people hate documentation, but it's a necessary factor, especially when you distribute it. We drafted the documentation, we held it in a highly visible format, and we let the devs actually write their own documentation. And then we snapped the chalk line at the end, and we let the tech writers take it and make it more presentable to others.
But as we passed it around those teams, we made sure that when we were drafting the documentation, that every other person could do it. This is one of our key points of security, too.
Being a member of the security team, our security guys participated in the dev process, and we saw the results of these tests. And when the documentation was being drafted, we saw that, and we were extremely familiar with where the documentation was.
So when it comes time for an audit and our auditor back at the ATO facility comes back to me and says, "Hey, we're having an audit. We need to have two-factor. You guys have to have two-factor. You've skipped this requirement. Show me where you have two-factor." I said, "Not only do we have two-factor, I can point you where the instructions are for the two-factor. I can give you a copy of how I set it up, and I can even show you how I went into my instance and ran two-factor to show how it worked."
It was run all the way across the way, and it worked really smoothly to bring people down there and bring everybody together to create a process.
Those tech writers, instead of redoing things, helped us to spend our effort on doing where we had to refactor the code base and work on the code base. The developers could work on the dev parts and just pass enough information down for the tech writer to do their part. But those tech writers, just like the security guys, participated in the daily events, they participated in the daily stand-ups, and they understood how it worked.
We got the audit data from security that we needed as we were running our scans, and we merged it with the infrastructure. We kept track of a standard baseline, and then we did the scans as each item came out of the DevSecOps version. And we compared those, and then again, we could allow the developers to focus on the elements that needed their help so that they were only fixing things that needed to be changed instead of just working on a blanket set.
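The baseline comparison described above amounts to a set difference between two scans. The following is a minimal sketch under that assumption; the function name, the finding identifiers, and the idea that each finding has a stable string ID are all hypothetical and not taken from the talk.

```python
from __future__ import annotations


def compare_to_baseline(baseline: set[str], current: set[str]) -> dict[str, set[str]]:
    """Split scan findings into new, resolved, and unchanged relative to the baseline."""
    return {
        "new": current - baseline,        # introduced since the baseline; needs developer attention
        "resolved": baseline - current,   # present in the baseline, gone in the current scan
        "unchanged": baseline & current,  # already known and tracked
    }


if __name__ == "__main__":
    # Hypothetical finding identifiers, for illustration only.
    baseline_scan = {"FINDING-001", "FINDING-002", "FINDING-003"}
    release_scan = {"FINDING-002", "FINDING-003", "FINDING-004"}
    result = compare_to_baseline(baseline_scan, release_scan)
    print(sorted(result["new"]))
```

Only the `new` bucket has to go back to the developers, which matches the point that they fixed only what changed rather than working a blanket list.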
We could coordinate the resources between the devs because we had those VMs. So when you were having a problem, instead of having to drag somebody else in, they could do the virtual bit with the pandemic. They could log into your environment, they could see what you were working on, and see what the issues were, and move forward in their discussion.
So you say that's great and that gives you some of the functional aspects, but it doesn't get you the compliance. It doesn't get you the compliance aspects. It's true that when we inherited, we had a large compliance problem, and we didn't think we were going to hit Agile.
If you've heard about the authorization to operate in the government and an authorizing official who has to sign off on it, the previous CVH version had an ATO with conditions, which meant they had to fix some issues before they could get a full authorization.
Well, unfortunately, those several issues that we had to fix to get to a full ATO were about 1,260 when we took it over. And we closed more than 90% of those. I think when we actually turned in our ATO documentation, we had fewer than 40 of those open.
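As a quick check on those figures: the talk gives roughly 1,260 findings at takeover and fewer than 40 still open at submission. Treating both as exact numbers (an assumption, since both were stated approximately), the closure rate works out as follows; `closure_rate` is a hypothetical helper name.

```python
def closure_rate(total: int, still_open: int) -> float:
    """Fraction of findings closed, given the starting total and the number still open."""
    return (total - still_open) / total


if __name__ == "__main__":
    # 1,260 findings at takeover; 40 used as the upper bound on open items.
    rate = closure_rate(1260, 40)
    print(f"{rate:.1%}")  # prints 96.8%
```

Because 40 is an upper bound on the open items, about 96.8% is a lower bound on the share closed, consistent with the "more than 90%" stated in the talk.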
Now, granted, a lot of those had to deal with policy. A lot of those we had to rewrite policy. We had to get policy signed. We had to get policy approved. But the way we were able to do it was because of the integration with the teams, that integration of compliance and Agile.
Having the security folks there meant we knew exactly what was going on in those teams. We knew what we were working on. We knew what the features were. And we knew where our authorization to operate, our ATO, tied into those teams' practices, which really let us have some successes on integrating our compliance and Agile.
And it started with having those security folks down at the team level. We had four teams. We had four security guys. I was supposed to be a security lead. Even me, I went and sat one day in each team. And every couple months, we'd rotate through. Not only did we rotate through, but when we rotated or when we went to the teams, we came back to our meeting, which was later in the day. We talked about what we heard in the teams, and we discussed those team meetings among ourselves.
So we were functioning as a separate dev team, Agile team in the structure so that we could get all the information together. We used the structure we had on the VMs. We used the structure we had on DevNets to get to a continuous monitoring solution.
We were pushing out hardware for this system. It doesn't push it out on a cloud, but it pushes out hardware. So we could monitor how it went, and we could monitor how it went forward.
Now, at some point soon, it will be on a cloud system, and getting that cloud system is going to enable even more of a continuous monitoring. Because even though they're still going to deploy independent servers, they're still going to monitor when they come back. They're going to check in, they're going to look at that process, and that home system is going to say, "Hey, you're in this state. We can verify that you're in this state. That's a known good, and then we can update you and move on."
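The check-in described above, where a home system verifies that a deployed server is in a known-good state before updating it, could look something like the sketch below. Everything here is an assumption made for illustration: the talk does not say how state is represented or compared, so the digest over a configuration dictionary, the function names, and the "investigate" outcome for unknown states are all hypothetical.

```python
from __future__ import annotations

import hashlib


def state_digest(config: dict[str, str]) -> str:
    """Hash a configuration in a canonical (sorted) order so equal configs give equal digests."""
    canonical = "\n".join(f"{key}={value}" for key, value in sorted(config.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_in(reported: dict[str, str], known_good: set[str]) -> str:
    """Return 'update' if the reported state is a known good, else 'investigate' (assumed policy)."""
    return "update" if state_digest(reported) in known_good else "investigate"


if __name__ == "__main__":
    # Hypothetical configuration keys and values.
    approved = {state_digest({"release": "3.4", "two_factor": "enabled"})}
    print(check_in({"two_factor": "enabled", "release": "3.4"}, approved))
```

Sorting the keys before hashing means the order in which a server reports its settings does not affect the result.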
So as we get to the end of the process, we talk about the main things that we looked at, right? What are the lessons learned that I can give you, that you can take away, that you can move forward?
Well, we addressed debt in three different areas. We addressed cultural debt, technical debt, and process debt. When I say cultural debt, I talk about the shift between an Agile culture and a waterfall culture that we started with. We had to really make allowances for some of these leadership structures.
One of the things was that when we got to it, at the end of every session, at the end of every program increment, the program manager wanted to snap the chalk and have a release, and he wanted to have tests to go with it. We said, "You know what? We're not really sure that's Agile. We'd like to do the features. We'd like to deploy the features and work with the features, and the releases will come when the releases come." And he said, "Nope, you know what? We're going to do this."
We said, "All right. You're in charge. You're the government. You've contracted us. We're happy to do it your way. We'll use our DevSecOps teams. We'll integrate this." And every three months, we've been pushing out a feature, or we've been pushing out a new release.
So we started with a 3.2.1, and then we pushed a 3.3, and then a 3.4, and then a 3.5. And what actually happened was we were pushing it so fast that that administrative process the government had built up couldn't actually keep up with the speed at which we were doing releases. And this is only once a quarter, not once a day like we talk about the high end of the DORA metrics, right?
But it got us there. And we were able to show those tests and build those tests and the continuous monitoring to get us to that next step and get us to the level where we can move forward.
We talked about some of our technical debt. We ran some pipeline runners. We started off with Jenkins. Jenkins was what we inherited from the previous team. It was what they were familiar with.
In looking at how we wanted to get bigger, because we knew we were going to get bigger and faster, we were looking at whether we wanted to go to a CloudBees or GitLab. And our guys decided they'd rather be on the GitLab. Good things and bad things, right? You move. We said we're going to use GitLab.
One of the discussions I had with one of our lead developers, and it was great. Later on, he said, "Hey, we're going to GitLab." I said, "I haven't used a lot of GitLab. Let me go do some searches. I'm more familiar with the other one. I'm more familiar with Jenkins."
And I went and I looked at it and I talked to the GitLab guy and had those discussions and looked at all the security features we had available in GitLab. And I said, "You know what? This is great. There are so many security features in GitLab. We are going to tie these in. We are going to have monitoring across the pipeline, and life is going to get so much easier for security."
And I went back to our developer ops, said, "I'm excited. This is what I need. This is the version I want. Show me where to turn it on." He said, "Well, we just bought the basic version because we think we can do everything we need, and we need to limit the number of seats."
So we got better over time. We got better at it. But we used the GitLab and the GitLab instances to create transparency and dashboards. Again, our security folks were able to go to any of the tests in the pipeline and see them all the time. They could look at the security scans. They could see the static and dynamic scans happening as the different elements of code were being committed, both at the individual level and then at that feature level as it went through.
And it helped us solve our technical debt. It helped us figure out what it was, and not only what it was, but figure out where it was in the cycle and how to close it quickly.
We also talked about process debt, right? Our customer wanted things fast. Well, he thought he wanted things fast. But once we started delivering every three months, it turned out it was a little bit faster than he could handle, like I mentioned.
We needed to organize the teams. There were a lot of government folks. So instead of having the government folks be in charge, we did the Scaled Agile thing, and we rolled those government folks into the product owner, where they were accepting the products, but it didn't mean that they drove the structure of each of those individual teams.
And some of that took a little doing too, because that's really a cultural shift for a lot of those folks. If they've been in that government structure for a long time as a civilian, they're used to having that control over anybody you put them with. They're on the top of the team. They're not on the bottom of the team as they move through the levels. But we organized it, and we worked through it, and we made sure it worked.
And in organizing the teams to get to the processes, one of the other unique things we did when we came through, well, probably not unique for everybody, but unique for us at the time, was we drafted against those features and those program increments.
Each of the scrum masters knew what the technical capabilities of their team were, and they got together and they did a one-for-one draft, just like you would in fantasy football. They got the chance of when they were going to go first, second, or third, or fourth after we added the fourth team, and they picked the feature they wanted to work on, and then the next team picked, and then the next team picked, and then they were responsible for committing it.
Like Name That Tune. You say, "I can do that aspect in 35 story points." Team two says, "I can do it in 30 story points." Team two, build that feature. And it worked great for us, obviously by the production record.
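The bidding step in that example reduces to picking the team with the lowest story-point estimate. A minimal sketch, assuming each team submits one integer bid per feature; `award_feature` is a hypothetical name, and breaking ties in favor of the first bidder is an assumption the talk does not address.

```python
from __future__ import annotations


def award_feature(bids: dict[str, int]) -> str:
    """Return the team with the lowest story-point bid; ties go to the earliest bidder."""
    if not bids:
        raise ValueError("at least one bid is required")
    return min(bids, key=bids.get)


if __name__ == "__main__":
    # The numbers from the talk: 35 story points versus 30.
    print(award_feature({"Team one": 35, "Team two": 30}))  # prints Team two
```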
So what do we see overall? What's our taxiing to the hangar? What do we do with this weapon system?
Well, we did six releases in 20 months. Previously, they'd been looking at about 24 months per release. They did one release in 24 months, they did a second release, and then they turned it over to us, or an almost second release. And we got it up to six releases in 20 months.
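To put those numbers side by side, the short calculation below converts both into months per release. It uses only the figures from the talk (one release in 24 months before, six releases in 20 months after); the function name is hypothetical.

```python
def months_per_release(months: float, releases: int) -> float:
    """Average number of months between releases."""
    return months / releases


if __name__ == "__main__":
    before = months_per_release(24, 1)  # previous contract
    after = months_per_release(20, 6)   # first 20 months on the new contract
    print(f"before: {before:.1f} months per release")
    print(f"after: {after:.1f} months per release")
    print(f"speedup: {before / after:.1f}x")
```

That is roughly one release every 3.3 months, in line with the quarterly cycle described earlier, and about a sevenfold increase in release frequency.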
That's just outstanding speed. That's an outstanding timeline for it. And we did it through applying DevSecOps processes, starting with the basics and starting with the beginning of where we went through.
We were compliant with the standards along the way. I mentioned some of our individual security wins, some of the individual processes that we got to. We're talking about NIST 800-53, which if you haven't seen it, is a whole beast. Obviously, it's a whole NIST standard, but if there's 1,200 different instances that you have to address, that you have to close, that's a lot.
So we got a two-year ATO for the program, a two-year authorization to operate. This year was the first time in the eight-year program history that it had ever received a two-year ATO. The previous one had been the 18 months with conditions. We did get a six-month extension, in all fairness, because of the pandemic, so they gave us six extra months to operate and start closing some of those issues. But we still made it. We were compliant, no conditions this time.
We moved into that new building. We went through a couple of the WeWork-type situations, but we finally made it to that new building. We built those dedicated environments that we were going to use, those dedicated systems, that infrastructure as code. And then we increased to four dev teams. We started with three, we went to four, so we can have nonstop improvements, that we can keep adding features and keep building forward, even when the government isn't completely ready to get there.
So what are the final lessons? What do we take away that we work on next? Because the parts I expect you to take away are the people, process, and cultural debt, the kind of things you can do with your team.
Our team builds to increase the continuous monitoring as we go to cloud, to be able to hook systems together and know what the different systems are doing, so we can monitor user status whenever they're connected, instead of just when they tell us they're connected.
To get to that continuous ATO document, where instead of doing it on a year-by-year basis, we take what we find is the vulnerabilities, we take what we know is the known bad on a system, and we dump it on a repo where folks can check at any time.
To get some AI and ML agents in our pipelines to get improvement, both on the systems as they do the network detection, and in the pipeline to show where we're having problems in our pipeline as we speed up.
To accelerate from that chalk line delivery every three months to a release on demand. To just build up and build up and build up as we're doing deliveries and just put them in that feature folder. And every time somebody says, "You know what? I want that," they can go to the selection board, just like the Chinese menu, and they can say, "I want one, three, and 13. Can you bundle those up for me and release them?" We can bundle up one, three, and 13, and we can release those on demand. So instead of being a 3.4 to a 3.5, we call it a 3.4.1, or whatever we need to do for the administration, but we can move it forward.
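The release-on-demand idea above, where a user picks features one, three, and 13 and gets them bundled as a point release, can be sketched as follows. This is an illustration only: the function name, the feature catalog, and the rule of rejecting a request when any feature is not finished are assumptions, and the version string simply follows the 3.4.1 pattern mentioned in the talk.

```python
from __future__ import annotations


def bundle_release(base_version: str, patch: int, requested: list[int],
                   ready: dict[int, str]) -> dict:
    """Bundle the requested finished features into a point release on top of base_version."""
    missing = [number for number in requested if number not in ready]
    if missing:
        # Assumed policy: refuse the bundle rather than ship a partial selection.
        raise KeyError(f"features not ready: {missing}")
    return {
        "version": f"{base_version}.{patch}",
        "features": [ready[number] for number in requested],
    }


if __name__ == "__main__":
    # Hypothetical feature folder: feature number -> finished artifact name.
    feature_folder = {1: "feature-1", 3: "feature-3", 7: "feature-7", 13: "feature-13"}
    release = bundle_release("3.4", 1, [1, 3, 13], feature_folder)
    print(release["version"], release["features"])
```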
So those are our next steps. And we're going to get there by setting some clear goals. We're going to continue talking about accelerating the value delivery. We're going to have that quarterly release based on the on-prem solutions. We're going to have some rapid maintenance response with the service desk.
Because what I didn't talk about as much in this was the whole time we have a service desk. We're not only building a new system, but we're also responsible for maintaining the ones in the field. So folks are constantly calling. Well, not constantly calling in because the system works pretty well. But you've all worked with users, and they all have something they want to call in and talk about. So we had them calling in, and we talked to them. We worked with them as best we could.
We secured the test integration. We had our test processes complete. And again, we were processing so quickly that the operational test element that the Air Force uses to make sure that they all comply, we actually overran them. They couldn't keep up with the speed we were going. They said, "You know what? We're going to do this, but we're comfortable with the level you're producing. We're comfortable with the quality. We're only going to test every six months."
And we got through to comprehensive compliance. We're compliant now. We have that two-year process, and we're building it to a more integrated process along the way so that we can speed up and accelerate what we deliver, and deliver the best capability to defend everyone as we deliver for cyber vulnerability analysis and hunt.
So that does it for my presentation. I'm excited to be here at the DevOps Enterprise Summit. Again, my name is Dr. Mark Peters. You can find me online either on LinkedIn or on Twitter @TinyCyber. And I'll be in the Slack channel for questions, as well as Jeff Sorrell and Brian Butler.
Thanks for your time, and have a great day.