Shifting Sand: How a Large DoD Project Moved Toward DevOps

San Francisco 2016

Jeffery Payne

CEO · Coveros, Inc.

Dan Gahafer

Program Manager (former), DISA forge.mil Program · DISA

DevOps within federal agencies can be difficult to achieve when disparate, competing companies are involved in various aspects of a large-scale project. This case study discusses how one such program was successfully moved toward DevOps using value stream analysis and logic. Lessons learned and pitfalls will be discussed, as well as the metrics used to measure success. Topics tackled will include not only how CI/CD was implemented but also how security and compliance were integrated into the process.

Full transcript

Jeffery Payne

My name is Jeff Payne. I'm the CEO of Coveros. If you want to follow me on Twitter, that's my Twitter handle down there. I tweet a lot on DevOps, Agile, and the integration of security into those things.

My company helps organizations move to Agile. Our specialty is building secure applications using Agile and DevOps principles.

Dan Gahafer

I'm Dan Gahafer. I'm with Hewlett Packard Enterprise, soon to be CSC, and formerly I was the DISA PM for Forge.mil.

Jeffery Payne

So what we're going to talk about today is how, over really almost a five-year period, we helped, collectively, a large program move to DevOps. Now, I'd be lying if I said up front that was the goal. Our goal was just to make our customers happy and add value, and this is what transpired.

So to kick that off, Dan's going to talk a little bit about the program. It's called Forge.mil and what it was all about, and then we'll talk about the transformation journey.

Dan Gahafer

And I'll try to be brief. We don't have much time, and Jeff has a lot of material to cover at the end.

Jeffery Payne

He does.

Dan Gahafer

But Forge.mil was started in about 2008 by the Defense Information Systems Agency (DISA), primarily to foster information sharing, kind of an open source software repository for the Department of Defense. We called it Community Source.

There was some success with that. We fielded a version of an application life cycle management system that supports mostly software developers. And I'll go over some of the statistics as far as utilization, but when it was first put out there, it took a long time for people to start taking advantage of it.

But we had some success in the information-sharing side. Ultimately, though, I think that Forge.mil's real benefit to the department was as an enterprise ALM. And projects were able to kick off their software development efforts in really days instead of months.

The price was free, which is as cheap as it gets when it comes to ALM licenses, because it was paid for by DISA. And ultimately, we ended up with about 30,000 users on the free version. And we had a paid version where we implemented access control, which has today about 5,000 users with 900 projects.

Jeffery Payne

So we got started, we were brought in to help support the engineering of this program in 2009. I don't know when the first DevOps effort happened in DoD or happened in the government, but this was pretty early in the process.

And we call this "shifting sand" because what we're going to show is how, over time, a big old sand dune back here in manual deployment land ended up shifting over and becoming integrated into our Agile process. And we were able to automate more and more of the activities as we went through this five-year journey.

When we started and got involved, I would characterize the program as not Agile, but fragile. You've all heard that. We go into organizations all the time, help them transition to Agile, and it's very common to go in there and have the organization say, "We're already Agile. We're already doing Agile things."

When you get in there, what you find out is they're picking and choosing the principles they want to follow. They're not following all of the manifesto and the principles. And it's no great surprise that it doesn't work.

When we came into Forge, the program had been going for about a year. It was not yet very successful, and they were looking for additional engineering and Agile help to get it there. And so that's where it was at. We were part of the software development team, way over there on your left, and we were doing two-week sprints.

The teams had been doing two-week sprints, but there was no testing integrated into that. Those of you familiar with large-scale programs, DoD or commercial, there's often multiple stovepiped organizations involved, and this was no different.

Dan, you had all sorts of people involved.

Dan Gahafer

Yeah. To show how complicated it was, we had the engineering and development was one contract, the test and production deployment support was another contract, and operation support was a separate contract. So getting all those guys to work together was not an insurmountable task, but I'll say it was difficult and time-consuming.

Jeffery Payne

Yes. Well, and then you had the government folks, like the security organization and other people that had to approve the software before it went to production. And so that's the environment we stepped into.

The first thing we did, and I call it the winds of change, was we decided, first of all, we have to get our ducks in a row. Forge.mil was supposed to be an exemplar project for how Agile was done in the DoD. It was rolling out an ALM, an Agile lifecycle management capability, for all of DoD to use. How ridiculous would it look if the program itself wasn't Agile?

And so we started looking at how can we get our ducks in a row in development so that we can help facilitate change and make this program more Agile. And the first thing we did was to try to get configuration management and configuration control in place.

Dan will tell you, because of all the different environments that existed, and they were all different at this point, it was very hard to keep track of the software, what versions were current, what had been released, what hadn't been released, et cetera, et cetera.

And so we felt like, let's get our ducks in a row. Let's get continuous integration in place so that we start to test as we build and start to push out code that is, by the Agile definition, working software. Right? Not just software.

And the way we went about this is, and I'll put a plug in for SecureCI. SecureCI is an open source, continuous integration, continuous delivery platform that we built and put out to the community as an open source product. So it's available off of our website, coveros.com/secureci. It's just an integrated set of open source tools that you all probably use: Jenkins, Subversion, now Git, Selenium, SonarQube, et cetera, et cetera.

It's got probably, at this point, 20 or 30 pre-integrated tools in it. And we built SecureCI because every time we went in to do Agile development, the first thing we would do was set up our automated build and test. That's what you do first, and we kept doing the same thing again and again. So we just decided, let's just pre-integrate all these tools that we like, and to support the community, let's just put it out there.

So you can download it and use it locally. There's an AMI for Amazon if you want to use it in the cloud. It's a nice basic capability if you like open source. So that's what we started doing first.

Oh, by the way, we started looking at how we could speed up the release process. Down at the bottom I have that initially the release process on this program was six months long. That's how long it took to get anything out the door. And approximately three or four of those months were spent manually deploying and doing the types of things you needed to do to release into production at a DoD government facility.

The other big chunk was the time it took to test. And as Dan mentioned, testing was a completely separate contract, completely separate set of contractors. Right? And we had some struggles with them, right?

Dan Gahafer

Separate environments.

Jeffery Payne

Separate environments, everything. And quite honestly, they didn't really want to change. Is that fair?

Dan Gahafer

The idea of automating testing was kind of scary to them because they were the test contractor.

Jeffery Payne

Mm.

Dan Gahafer

So manual testing is what they performed. There were about 16 of them performing the testing, and they kind of felt that automated testing could interfere with their livelihoods.

Jeffery Payne

Yeah. And we had how many on the development side? What did we have total, roughly? Engineers, right?

Dan Gahafer

So by the time I came on board in 2010, there were three.

Jeffery Payne

Yeah. We had like three or four engineers. This was not a complicated system. It was mostly an integration of existing products with some custom code written around it. It was not very complicated, and we had 12 to 15 to 16 people testing it and doing it much slower than we were able to build it, which was really frustrating.

So the first thing we decided to try, as Dan mentioned, was, let's automate some tests. Not just the tests that we were using in development on our systems and in our CI environment, but let's automate some tests and give them to the testing organization. Let's let them own it, and we'll build it for them. We'll show them how to use it. We'll show them how to update the tests when they break, and let's see if we can get them engaged in automation.

How well did that work?

Dan Gahafer

Well, they wouldn't use the software.

Jeffery Payne

They wouldn't use the software, right? So for a while, I don't know how long it was, a month or two, we were inviting them to our stand-ups, and they would occasionally be involved, but none of the release processes were getting any faster. And finally we asked them about it, right? And it turned out they weren't actually using the automation.

Their excuse was it was too hard to understand. It would break. It was too brittle. The mistake we made was thinking that if we would train them how to use it, that they would just use it, and they wanted to use it. They didn't really want to use it.

Dan Gahafer

Similarly, we developed scripts, and Jeff will probably get into this later on, to deploy to production.

Jeffery Payne

Mm-hmm.

Dan Gahafer

So we had software that we developed, plus we had Puppet to deploy consistently into the different environments we had. But they didn't feel like that was helpful. They didn't think that that would provide as consistent an environment in each of these pre-deployment production environments as they could do it manually.

Jeffery Payne

Yep. Yeah. So we took the automation back, and we decided at that point that our role, our job on this program, was not just to build the application, but it was to create and build the automation that would help everybody else in this project. And we would give it to them. We would maintain it. We would support it. We would give them no excuse not to push the button and use it because we felt like it had to happen if we were ever going to be able to speed up our deployments.

And so our next thing was we pulled the automated tests back. We re-architected them in a combination of JUnit and TestNG, something that was a little bit less fragile and something that we could maintain and support and, oh by the way, start to ship downstream with the code that was developed. So we would send the whole package.

By this time we were using Nexus as our artifact repository. We were starting to do code analysis, static and dynamic analysis, first using open source tools, later using things like Fortify and other commercial tools to do security analysis and other things. And we started giving the test team, we called it push-button automation. All they had to do was press a button, and the test would run.
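As an illustration that is not part of the talk, the "push-button" idea can be sketched as a single entry point that runs every stage in order and stops at the first failure. This is a minimal sketch under stated assumptions: the stage names and the `run_pipeline` function are hypothetical, and the program's real pipeline was built on tools such as Jenkins rather than a script like this.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a zero-argument callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages in order; stop at the first failure and mark the rest skipped."""
    results: List[Tuple[str, str]] = []
    failed = False
    for name, action in stages:
        if failed:
            results.append((name, "skipped"))
            continue
        try:
            ok = action()
        except Exception:
            ok = False  # a crashing stage counts as a failure
        results.append((name, "passed" if ok else "failed"))
        failed = not ok
    return results


if __name__ == "__main__":
    # Hypothetical stages standing in for build, tests, analysis, and publishing.
    demo: List[Stage] = [
        ("build", lambda: True),
        ("unit-tests", lambda: True),
        ("static-analysis", lambda: False),
        ("publish-artifact", lambda: True),
    ]
    for stage_name, status in run_pipeline(demo):
        print(f"{stage_name}: {status}")
```

The point of the design is that the person pressing the button needs no knowledge of the individual tools; they only see which stage passed, failed, or was skipped.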

And they started sort of doing that, right? But they didn't want to admit there was any value to that, really.

Dan Gahafer

So part of the push to get them to start using the tools that we were providing within their environment, I think, was to eliminate the use of their environment. So what we did was we actually... I would not allow them to use their environment to test in anymore.

Jeffery Payne

Mm-hmm.

Dan Gahafer

So instead, I built an environment within DISA that was right next to our pre-production environment that met all the same security requirements. It was essentially an identical environment. And so we would promote our application, our code into that environment, and our testers would have to test in that environment. And I think that helped a lot.

For one thing, they were building these test environments as they needed them, so this test environment was there for them all the time. It modeled pre-production, which mostly modeled production.

Jeffery Payne

Mm-hmm.

Dan Gahafer

And so it was more valid. The testing that was performed was more valuable to the engineers because the feedback they got let them work immediately on problems that were showing up in these environments that more closely resembled production, and that caused the testers to start to see the value of the automated testing in those environments.
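As an illustration that is not part of the talk, one way to check that a test environment really "models" pre-production is to compare their settings and report every difference. This is a minimal sketch; the setting names and values are hypothetical, and the talk does not say how the team verified environment parity.

```python
from typing import Dict, Optional, Tuple


def config_drift(
    reference: Dict[str, str], candidate: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing key to (reference_value, candidate_value).

    A value of None means the key is missing on that side.
    """
    keys = sorted(set(reference) | set(candidate))
    return {
        key: (reference.get(key), candidate.get(key))
        for key in keys
        if reference.get(key) != candidate.get(key)
    }


if __name__ == "__main__":
    # Hypothetical settings for two environments.
    pre_production = {"os_patch_level": "42", "jvm": "1.7", "tls": "on"}
    test_env = {"os_patch_level": "41", "jvm": "1.7"}
    for key, (expected, actual) in config_drift(pre_production, test_env).items():
        print(f"{key}: expected {expected!r}, found {actual!r}")
```

An empty result means the two environments agree on every recorded setting, which is the condition that makes test feedback transferable to production.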

Jeffery Payne

And that got us to Puppet, which Dan mentioned earlier. We finally, after a couple of years, this is a couple of years into the project, I think, for us, we were asking for the manuals that the test team and the ops team used to provision machines, install the software, configure that software, test that software, and then push it and deploy it into the next environment.

And they fought giving us that for a long time. Finally, we got ops to agree to give us this manual, and this manual is about this thick. I don't know, it was huge. No wonder it was taking six months to put something into production.

We took those manual procedures, and we rewrote them all in Puppet. So we automated the entire thing. And all of a sudden, instead of it taking two weeks to provision and set up a test environment, it took five minutes. Instead of it taking a month to set up a new version for staging and get everything deployed on it, it took less than a day.
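As an illustration that is not part of the talk, the core idea behind replacing a manual procedure with a declarative tool such as Puppet is that you describe the desired state and let the tool apply only the differences. The sketch below is a toy model of that idea in Python, not Puppet code; the `converge` function and the package names are hypothetical.

```python
from typing import Dict, List


def converge(current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Bring `current` in line with `desired` and return the changes made.

    Running it a second time with the same inputs makes no changes,
    which is the idempotence property declarative tools rely on.
    """
    changes: List[str] = []
    for key, want in sorted(desired.items()):
        have = current.get(key)
        if have != want:
            changes.append(f"{key}: {have!r} -> {want!r}")
            current[key] = want  # stands in for the real install/configure step
    return changes


if __name__ == "__main__":
    # Hypothetical machine state and target state.
    host = {"java": "1.6"}
    target = {"java": "1.7", "app_server": "installed"}
    print(converge(host, target))  # first run applies two changes
    print(converge(host, target))  # second run applies none
```

Because a second run is a no-op, the same description can be applied to every environment repeatedly, which is what makes the environments consistent with one another.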

All of a sudden, we saw radical decreases, as you can imagine, in the amount of time it took. That alone drove down our release process to two months. It cut two-thirds out of the process by automating that. And as Dan mentioned, it really gave the downstream organizations no excuse for not using this automation that we delivered to them as really push-button capability.
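As an illustration that is not part of the talk, the figures quoted above can be checked with a few lines of arithmetic. The durations come from the talk; treating "two weeks" as 14 calendar days is an assumption made here only to put both numbers in the same unit.

```python
def fraction_saved(before: float, after: float) -> float:
    """Return the share of the original duration that was eliminated."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


MINUTES_PER_DAY = 24 * 60

if __name__ == "__main__":
    # Release process: six months down to two months.
    release = fraction_saved(6, 2)
    # Test environment setup: two weeks (assumed 14 calendar days) down to five minutes.
    test_env = fraction_saved(14 * MINUTES_PER_DAY, 5)
    print(f"release process shortened by {release:.1%}")
    print(f"test environment setup shortened by {test_env:.2%}")
```

Six months to two months removes four of six months, which matches the "two-thirds" quoted in the talk.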

Anything you want to add to that? All good?

But we still had one hurdle left, which was there were still late lifecycle activities that were just part of the government process. You had to get through a security audit and an assessment. You had to get through load and performance and analysis. You had to get through a final review and acceptance test.

And those things were typically done late in the lifecycle. They were not part of the sprints, and so they were things that really were a bottleneck to getting ourselves under two months of release into production.

We now had the test team, kicking and screaming, in the sprints. Because we had also set up every environment to look like production, we no longer had to test to see whether the software was going to work on this environment versus that environment versus that environment. They were all production-like, and we could release multiple times a week, and test would have something to test multiple times a week.

So now they were all of a sudden in the sprint, kicking and screaming. Right? Is that the best way to describe it?

Dan Gahafer

I would say so.

Jeffery Payne

Kicking and screaming. But they were in it, and it was productive. It was working. They had automation they could use. And if they felt there were more tests they wanted to do with exploratory testing or whatever other types of testing they wanted to do, there was now time to do that in the sprint. And that helped us tremendously.

In the end, because we now had production-like environments, and because we were able to automate the provisioning, installation, and configuration in those environments, and the pushing of code through to production, we started bringing security and performance into the sprint.

So we started using open source tools to do various types of security analysis as part of our standard CI process. And we started to identify security vulnerabilities and issues much earlier than they were ever found before. And that started to gain us trust with the ops people and with the security people that we were doing things proactively that they were used to having to catch at the end.

And I would say one of the breakthroughs, from my perspective, is that we got to the point where a lot of times, at the end of the release process, they would just ask the team what we had done. They'd want to look at what had been run, and they wanted to make sure we did all the things they were planning to do, and then they'd say, "Okay, I think we're fine. We're good."

Dan Gahafer

Part of the reason for that was Cyber Command, which would identify vulnerabilities that were present or could potentially be present on department systems. Before there was a thing called Cyber Command, there was JTF-GNO, which did the same thing. So this function has existed for well over the 15 years that I've been involved in this.

And we were actually determining that we were vulnerable in certain ways before we were notified by Cyber Command or JTF-GNO. So we were actually able to fix problems sometimes before we were even notified, which allowed the security people that oversaw our program, the auditors, if you will, to develop some trust in our ability to respond to security incidents in ways that no one else had.

So I think that was meaningful, and it was very helpful for us to get to the point where we could more rapidly deploy our application into production.

Jeffery Payne

That's a great point. Two things happened very early on. First, when we started doing security analysis, we started finding things that nobody had ever found before, because we were using tools that they weren't using or didn't have time to use later in the process.

So we started finding things and proactively fixing things and notifying them, "Hey, there's things in production that are not secure that were never found."

Second thing is, as mentioned, when a new vulnerability got introduced out there in a framework or in a third-party library or whatever, it got published to the National Vulnerability Database. And that's where this organization Dan mentioned would go. They would look at what vulnerabilities had been published, they would triage them in terms of their criticality, and they would flow it back down to programs and tell programs, "Hey, you got to fix this thing, or we're going to shut you down."

Right? What was it, 30 days, they'd shut you down?

Dan Gahafer

Typically, depending on the category. CAT ones were considered critical vulnerabilities.

Jeffery Payne

Yeah.

Dan Gahafer

So you had 30 days to fix them usually.

Jeffery Payne

Yeah. And we wanted to get ahead of that, obviously. And because we were running the same analysis and scans automatically and accessing the vulnerability database, the National Vulnerability Database, we were usually fixing the problems before they showed up and told us that there was a problem.
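As an illustration that is not part of the talk, getting ahead of a vulnerability notice amounts to matching an inventory of the components you ship against published advisories on every build. The sketch below assumes the advisories have already been fetched into memory; the `Advisory` type, the component names, and the advisory IDs are hypothetical placeholders, and the 30-day default reflects the deadline mentioned for critical findings.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Advisory:
    advisory_id: str  # placeholder identifier, not a real CVE number
    component: str
    affected_versions: FrozenSet[str]
    critical: bool


def find_exposures(inventory: Dict[str, str], advisories: List[Advisory]) -> List[Advisory]:
    """Return advisories that match a shipped component version, critical ones first."""
    hits = [a for a in advisories if inventory.get(a.component) in a.affected_versions]
    return sorted(hits, key=lambda a: (not a.critical, a.advisory_id))


def fix_deadline(published: date, days: int = 30) -> date:
    """Date by which a finding must be fixed, counted from publication."""
    return published + timedelta(days=days)


if __name__ == "__main__":
    # Hypothetical inventory of third-party components in a release.
    shipped = {"web-framework": "2.3.1", "xml-parser": "1.0.4"}
    feed = [
        Advisory("EXAMPLE-0002", "xml-parser", frozenset({"1.0.4"}), critical=False),
        Advisory("EXAMPLE-0001", "web-framework", frozenset({"2.3.0", "2.3.1"}), critical=True),
        Advisory("EXAMPLE-0003", "logging-lib", frozenset({"0.9"}), critical=True),
    ]
    for adv in find_exposures(shipped, feed):
        print(adv.advisory_id, adv.component, "critical" if adv.critical else "non-critical")
    print("fix by:", fix_deadline(date(2016, 1, 1)))
```

Running a check like this in the CI process on every build is what lets a team answer a notice with "it's in the next release" rather than starting the investigation when the notice arrives.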

And when we were able to demonstrate that every time they showed up and they said, "Hey, you got to fix this or we're going to have to shut you down," and we'd say, "Yeah, it's in the next release. It's going out in three days." They'd say, "How did you do that? We just told you." We'd say, "That's because we're not waiting for you to tell us. It takes you longer manually to tell us there's a problem than for us to automatically identify it and fix it." And that did gain us a lot of trust with the organization.

In the end, we got releases down to two-week releases, and I think they're still operating like that today, right?

Dan Gahafer

They are. They're not necessarily deploying into production every two weeks. It's up to the PM to decide how often he wants to put into production. So sometimes they'll just wait until it's significant enough just to avoid user complaints about changing interfaces.

So if they're holding off on production releases, it's not because we're worried about injecting defects into the production. It's because we want to make a more seamless user experience. And so I think that's the right reason to make the decision not to deploy.

Jeffery Payne

Absolutely. Yeah. It puts the release decision back in the hands of the business, which is what you want to do in Agile and DevOps, right? As fast as your customers can consume the change, you change. But the goal is we were now releasing production-quality software on a very regular basis and going through the whole process and having release-ready software.

So some results and lessons learned. First, we did reduce the release process from six months to two weeks. It did take almost five years to do that, but we did overcome some huge obstacles and hurdles in organizations we didn't control.

We were able to reduce the risk of release virtually to zero. When we first got involved, the release process was expected to be, "We're going to try this for a week, hope to God it works. If it doesn't work, we're going to have to figure out how to roll back." That was their philosophy.

We got to the point where we were releasing on a very regular basis, using the idea that if it's painful to do, you do it more, not less. In DevOps, we want to do releases more because it makes us practice and learn how to do it well, and it makes releases really a non-issue.

We integrated security and performance early in the process and into the sprints. Not all of the performance, by the way, mostly just watching to see whether our code and our database calls were getting slower, so we could tune and adjust the software as best we could, because we didn't have all the production equipment that you needed to do a full load and performance test. But we found that hugely valuable in keeping performance under control.
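As an illustration that is not part of the talk, "watching to see whether our code and our database calls were getting slower" can be done without production-scale hardware by comparing timings from one build to the next. This is a minimal sketch; the operation names, the sample timings, and the 20 percent tolerance are assumptions chosen for the example.

```python
from statistics import median
from typing import Dict, List


def slower_operations(
    baseline: Dict[str, List[float]],
    current: Dict[str, List[float]],
    tolerance: float = 0.20,
) -> Dict[str, float]:
    """Return operations whose median time grew by more than `tolerance`.

    The value for each flagged operation is its relative slowdown,
    for example 0.3 for a 30 percent increase.
    """
    flagged: Dict[str, float] = {}
    for operation, old_samples in baseline.items():
        new_samples = current.get(operation)
        if not old_samples or not new_samples:
            continue  # nothing to compare against
        old, new = median(old_samples), median(new_samples)
        if old > 0 and (new - old) / old > tolerance:
            flagged[operation] = (new - old) / old
    return flagged


if __name__ == "__main__":
    # Hypothetical timings in milliseconds from two consecutive builds.
    last_build = {"load_project_list": [100.0, 104.0, 98.0], "save_artifact": [50.0, 51.0]}
    this_build = {"load_project_list": [131.0, 129.0, 135.0], "save_artifact": [52.0, 50.0]}
    for operation, slowdown in slower_operations(last_build, this_build).items():
        print(f"{operation} is {slowdown:.0%} slower than the previous build")
```

Comparing relative change rather than absolute time is what makes this usable on smaller test hardware: the numbers will not match production, but a trend toward slower calls still shows up.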

And then I guess the last point was, ultimately, because of the automation, we were able to free up a lot of people to go do something else, go do something more productive. This test team was somewhere between 12 and 16 people; the number kind of fluctuated. In the end, with the automation that we built and put in place using both Puppet and our test tools, we were able to reduce that to really two people working.

Which, quite honestly, if you've got a development team of three or four people, two, three people's probably the right metric, if you're doing a good job in automating things. And that was a huge cost savings to the government. But pretty painful, right, for that organization?

Dan Gahafer

Not really.

Jeffery Payne

No? Not for you?

Dan Gahafer

No. The way it worked out, we went to contract with them 1 August, and they had heard that I was going to consolidate their contract into another one, so they sent my money back.

Jeffery Payne

Worked good for you.

Dan Gahafer

It's not like I fired them. They fired me.

Jeffery Payne

Well, all the taxpayers won.

Dan Gahafer

Yeah.

Jeffery Payne

So that was the good news, right?

Dan Gahafer

Right.

Jeffery Payne

Because clearly there was a better way than it was being done.

Some lessons learned. So we've started DevOps initiatives and transformations from almost every angle you can imagine. We've started top-down in commercial companies. We've started bottom-up from dev. We've started in test. We even started in security and ops.

And the reality is you got to start where you're at. If you're not coming top-down, you got to start with what you control. So start with what you control. Get your house in order before you start telling everybody else what to do. That's point one.

Point two is demonstrate the power of automation. When we first automated the Puppet stuff and showed you, and you showed others up the chain, the power of this, I think it really helped sell the organization on how effective this could be. Right?

Dan Gahafer

Oh, yeah.

Jeffery Payne

And that helped immensely. So automate and demonstrate capability, and then coach and help the people to at least use it as a push-button capability if they don't have the ability or the initiative to change.

Standardizing your environments. Can't stress that enough. The more you can make them all look production-like, it doesn't have to be perfect, it radically reduces the inefficiencies and the issues around your environments. And ultimately, to be successful, you've got to gain the trust of the other people in the organization. The more trust you can gain, the smoother it gets when doing these types of efforts.

Dan Gahafer

And that's the most important point I think we all learned in this journey. It took us a lot of time to do some of the other ones. But they weren't hard. They just took a long time to actually implement. Some policies had to be changed. People had to be convinced.

But trust takes a long time to build, and some things can occur early on that can prevent that from ever happening. So you should be careful when you're starting your DevOps journey to not let those things happen. Respect for each other. Everybody has something they're bringing to the table, and if you alienate people early on, then you're never going to be able to build that trust environment.

Jeffery Payne

Yeah. That's a great point. So I think we're out of time. Thank you, everybody, for coming, and we'll stick around afterwards for questions. I think we're the last session in here, right?

Dan Gahafer

Yeah, you are. Yeah.

Jeffery Payne

So if anybody wants to come up and chat, we'll be up here for a bit. Thank you.