Busting Silos & Red Tape: DevOps in Federal Government

San Francisco 2015

Senior Software/Research Engineer · Software Engineering Institute

All organizations face challenges in changing their culture and adopting DevOps philosophies. This is especially true in many federal government agencies. Through well-intentioned policies and procedures, many agencies have created extremely siloed environments where change is slow and difficult. Finishing the last leg of large-scale software development project acquisitions can be particularly challenging and expensive. Barriers often impede getting hardware and software systems fully tested, transitioned, and up and running in production on schedule.

We are a passionately DevOps-focused software development group within Carnegie Mellon University's Software Engineering Institute, a federally funded research and development center. In creating, delivering, and transitioning cutting-edge software solutions to government organizations, we have struggled with and overcome challenges in helping government adopt DevOps principles. Learn how we have conquered these challenges in shifting our government stakeholders' thinking by coaching and initiating DevOps in their operational and development environments.

Full transcript

The complete talk, organized by section.

Aaron Volkmann

I'm Aaron Volkmann. I work for the Software Engineering Institute, run by Carnegie Mellon University. I'm from Pittsburgh, Pennsylvania.

Today, I'll be talking about "Busting Silos and Red Tape: DevOps in the Federal Government."

Now, this next slide, I really didn't agree with the contents very much, but unfortunately, due to legal reasons, I was forced to include it in the deck, so everybody get ready. There we go.

So, in the beginning, just to tell you a little bit about what we do with the federal government and why I'm up here talking about doing business with them. Software Engineering Institute, we're an FFRDC. That's a federally funded research and development center. We've been around since, I think, 1984. The group that I work within is part of CERT, which started out as the Computer Emergency Response Team. They started back in 1988 in response to the Morris worm. But we regularly partner with government, private industry, law enforcement, academia to develop advanced methods and technologies that counter sophisticated, large-scale cyber threats.

So, just a little bit about the group that I'm in. We're inside of CERT. We started out as young pups, bushy-tailed, not quite as literally as these guys here, bright-eyed, optimistic. We were given some funding to develop some prototype software for a government agency. I can't mention which one it is, or else I'd have to kill you, or they would come and kill me or something. So, they shall remain anonymous.

But anyway, prototype software, pretty decent budget, development team of around 20 developers. Supposed to be from the beginning, the way the contracts were set up, they were supposed to be prototypes. But once it got into the users' hands, they liked it so much that those prototypes slid down the slippery slope into production.

Who here is familiar with prototypes in production? It's a fairly common thing that's happened at every organization that I've been at.

So, that's where we get into some friction and some challenges.

Anyway, we're in Pittsburgh, and our customer's in DC, of course. And the way it works is we have a liaison that's a Fed that works with us, that helps us navigate through the structure inside of their agency. And the liaison then talks to the business people who are in charge of ordering up what the software looks like that we're developing, then those people talk to the users.

So we're a couple of layers removed from our end user, which you could immediately see is a little opportunity for improvement. Sort of like Office Space, where you have a role that takes the specs from the developers and hands them to the customer, the people person.

And in the beginning, we had no direct contact with ops at all. We were developing relatively in isolation, happily plugging along, doing our thing.

So we got in our harnesses and our sled team, and we were ready to go.

We're an agile shop, meaning that we stood up every day and had Scrum meetings every day. I know a lot of people call themselves agile just because they do the Scrum thing. But I think everybody has their own implementation of agile and what works best for them. And through process improvement, they land where they need to as far as what kind of agile they practice.

We did continuous integration, continuous deployment, automated testing, all that good stuff. The continuous deployment, though, landed in a test environment that was on our network internal to Carnegie Mellon and had nothing to do with the production environment or the test environment at our customer site.

Because our customer's systems were so quarantined, we had no access to any test systems on their side. And actually, our environment parity was enforced through email from ops, where they would say, "Well, our environment looks like this, this, and this, running this version of Java," for instance, just communicated over email. So you can imagine, through the course of a year of development, some things could change where the email model of enforcing environment parity might break down.
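One alternative to the email model is for each side to publish a small machine-readable manifest of its environment and compare the two automatically. The following is only a minimal sketch of that idea, not what the team built; the function name `parity_drift`, the manifest keys, and the version strings are all hypothetical.

```python
from __future__ import annotations


def parity_drift(ours: dict[str, str], theirs: dict[str, str]) -> dict[str, tuple]:
    """Return every component whose version differs between two environments.

    A component missing from one side is reported with None for that side.
    """
    drift = {}
    for name in sorted(set(ours) | set(theirs)):
        mine, yours = ours.get(name), theirs.get(name)
        if mine != yours:
            drift[name] = (mine, yours)
    return drift


# Hypothetical manifests; the keys and versions are illustrative only.
dev_env = {"java": "1.7.0_79", "os": "rhel-6.6", "tomcat": "7.0.62"}
customer_env = {"java": "1.8.0_45", "os": "rhel-6.6"}

if __name__ == "__main__":
    for component, (mine, yours) in parity_drift(dev_env, customer_env).items():
        print(f"{component}: ours={mine} theirs={yours}")
```

Run on every build, a check like this would surface drift when it happens instead of a year later.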

Anyway, we went through one year of development. We're almost ready to deliver the product into production, and we got a little bit stuck in the snow.

We ended up testing in a pseudo-production environment where the software had to run on the production network but be pointed to test servers, if you could imagine that. Production problems were found very late in the game, right before funding dried up for our project. And we learned, whenever it was almost too late to save it, that the right stakeholders were not brought in at the right time, particularly ops and security.

So whenever money gets involved and money dries up, and money has to be borrowed from other places, what do we think might have happened? Things get particularly nasty, and there was some conflict that ensued. And this is a really mild version of the conflict, if you could imagine.

In industry, there's the abbreviation CYA, cover your A. That held very strong here. There was a lot of blame, where actors wanted to assign blame to why this happened. We ended up getting enough funding to solve this issue, but we had to borrow money from other projects and even created friction within our own organization to resolve these issues.

So what we did is we rested our little heads in the snow, closed our eyes, and took a step back to regroup.

Through working with our customer, we gained access to proper test environments on our customer's side because we learned that environment parity is key, and there's just no way that we're able to 100% replicate their production environment within our own organization.

By that time, we learned who the key people were who needed to be engaged early, and so they were involved and on our side. I think this whole bad experience that we had is what woke up our customer to being aware of the issues of actually implementing software from a third party.

The government deals with a lot of contractors, and those contractors for this particular agency were on site and talking to them in front of them every day. We were a little bit different because we were somewhat removed through a couple layers of abstraction through different people. So we didn't get the same kind of attention that their normal contractors that they dealt with day in and day out got.

So first we took a look and we looked at our workflow in delivering software to them.

Our customer is really used to dealing with monolithic, huge releases, maybe like once a year, twice a year, maybe quarterly. We were semi-agile, so we were able to release on a sprint basis every two to three weeks. We got them used to accepting these releases more often and testing them in their environment.

We got to the point where there was enough trust where we could VPN in and do continuous delivery into a production-like stage environment on their side where the customer could look at it and deal with it.

Another thing that we did workflow-wise was we started treating our documentation as source code. We did all our documentation in Markdown and checked those Markdown documents into our source code repository alongside of the application source code. Then those Markdown files would be built by a build server and transformed into PDFs, Word documents, and HTML, and automatically published to an intranet site that our customer could see.

So they'd have access to the most current documentation at all times. It also saved us from having to do the Microsoft Office shuffle and locating the correct version of documentation.
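A build step for documentation-as-code could look roughly like the sketch below. It assumes pandoc as the converter, which the talk does not name, and the directory layout and function names are hypothetical. Publishing to the intranet site is left out.

```python
import subprocess
from pathlib import Path

# Assumption: pandoc does the conversion; the talk does not say which tool was used.
# Note that pandoc needs a separate PDF engine installed to write .pdf files.
FORMATS = ("html", "docx", "pdf")


def build_commands(src_dir: Path, out_dir: Path) -> list:
    """Return one pandoc command per Markdown file and output format."""
    commands = []
    for md in sorted(src_dir.rglob("*.md")):
        for ext in FORMATS:
            # Mirror the source tree under out_dir, swapping the file extension.
            target = out_dir / md.relative_to(src_dir).with_suffix(f".{ext}")
            commands.append(["pandoc", str(md), "-o", str(target)])
    return commands


def build(src_dir: Path, out_dir: Path) -> None:
    """Run every conversion; a failing conversion raises and fails the build."""
    for cmd in build_commands(src_dir, out_dir):
        Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)
```

A build server would call something like `build(Path("docs"), Path("site"))` on every commit, so the published documents always match the checked-in Markdown.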

In looking at our workflow and looking at our bottlenecks, we found one really big one that made me particularly sad. It was a security bottleneck. The agency we were working for had a requirement for a really long security process in order to accept new versions of software into their environment.

So even for the smallest change, say we changed an icon to cornflower blue, the build would generate an MSI installer file with a different MD5 hash, and it would have to go through a multi-week-long security review in order to get into their environment.

So we developed a methodology by which we could satisfy those requirements and only do the really long process whenever it made sense. We'd go through the long approval process, then reuse that authorization over and over as long as the changes we were making were small enough where it didn't make sense to do a complete new security scan.

So this is what our process kind of looked like. Across the top is the standard CI/CD pipeline: continuous integration, we do our automated testing, and then we deploy into production.

This Dr. Bunsen Honeydew guy, the security controller, is actually a system, but at first it was a role. We got this process on paper, going manually to test it out before we built anything around it. The security controller is a web application with RESTful APIs that can integrate with anything and record everything that happens.

So just to walk you through this, if we were doing CI on a brand-new piece of software, the security controller would say, "Well, this is a brand-new piece of software. Lower the gate there. Production deployments are on hold. We're not going to deploy to production until somebody takes a look at this." And then we would automatically create a Jira ticket for our security team to go in and take a look at it.

Once the security assessment was completed and any mitigations were completed, the security team would go in and mark that build of the software in the security controller as, "Okay, this is A-OK." Then it would raise the gates and allow production deployments to go on.

So same thing: if a large change or a change that would affect the security profile of the application would be detected by the security controller, we'd close the gates. We would schedule a security assessment. Then once that was done, we would raise it up again and allow the pipeline to move through.
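The gate logic described here can be sketched in a few lines. This is an illustration under stated assumptions, not the actual security controller, which was a web application with RESTful APIs: the class and method names are hypothetical, approval is tracked per application rather than per build, and a plain list stands in for the Jira tickets.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Gate(Enum):
    OPEN = "deployments allowed"
    CLOSED = "deployments on hold"


@dataclass
class SecurityController:
    approved: set = field(default_factory=set)    # apps with a current authorization
    pending: set = field(default_factory=set)     # apps waiting on an assessment
    tickets: list = field(default_factory=list)   # stand-in for issue tracker tickets

    def evaluate(self, app: str, security_affecting: bool) -> Gate:
        """Decide whether a build of `app` may proceed to production."""
        if app in self.approved and not security_affecting:
            # Small change to already-assessed software: reuse the authorization.
            return Gate.OPEN
        # Brand-new software, or a change that affects the security profile.
        self.approved.discard(app)
        if app not in self.pending:
            self.pending.add(app)
            self.tickets.append(f"Security assessment needed: {app}")
        return Gate.CLOSED

    def approve(self, app: str) -> None:
        """Called once the security team finishes its assessment."""
        self.pending.discard(app)
        self.approved.add(app)
```

In this sketch the gate stays closed until `approve` is called, so builds that arrive during an assessment are held without opening duplicate tickets.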

Oh, a large change, how we would detect that. We started out with something as simple as having developers put in a reference, an issue in our issue tracker that would mean this was a security-affecting change. And any time our security controller saw that in a commit message, it would lower the gates.
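The commit-message convention could be checked with something as small as the following. It is a sketch only: the issue key `SEC-1` is a hypothetical placeholder, since the talk does not say which issue the developers referenced.

```python
import re

# Hypothetical convention: referencing issue SEC-1 marks a security-affecting change.
# The word boundaries keep SEC-10 or XSEC-1 from matching by accident.
SECURITY_ISSUE = re.compile(r"\bSEC-1\b")


def is_security_affecting(commit_messages) -> bool:
    """True if any commit message in the build references the marker issue."""
    return any(SECURITY_ISSUE.search(message) for message in commit_messages)
```

The result would feed straight into the gate decision: a `True` here lowers the gate and schedules an assessment.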

So we're also experimenting with using natural language processing, machine learning, to look at the contents of commit messages and also looking at code in order to determine if this change warrants us having a second look.

Eighty, 90% of the time, it follows this path, where we're just making small tweaks to the software after the initial security scan. We're just doing all those last-minute things to get the software working for realsies in production, and security controller sees a small change, and he says, "It's cool, deployments allowed."

So this is how we sort of went around the red tape by doing things that make sense and not following a process that doesn't make sense whenever we want to move fast and want to be able to deliver software quickly to the customer.

Another thing we did was foster an environment of experimentation and learning, where our customer learned through us deploying to them so often that early failures are a good thing instead of a bad thing, where we need to start pointing fingers and assigning blame, and that this is the normal way of doing things, and this is the only way that we're going to be able to innovate and do things well by getting our paws wet.

The result of this was we encouraged a little more risk-taking. We were able to do things without the fear of repercussions. All the risk-taking that we were taking, of course, was mitigated through proper back-out strategies or doing small pilot releases mainly, or small canary releases. All those techniques that I'm sure we're all a little bit familiar with.

Along with experimentation and learning, so we're doing all these experiments all of a sudden with the way we work and the way we operate. We did a little bit of Toyota Production System, and this is a really, really simple, effective way that we did some learning and did continual process improvement.

We did a very, very simple process called PCSAM. Very easy to remember. So what we did was every day at our stand-up meetings, everybody was required to bring up a problem they had or that they witnessed in their day-to-day work.

And a problem could be something like, "Oh, it was too slow to get this software into production," or, "The port was wrong in our web config," something like that.

We'd look at the problem, and then we would drill down to the root cause by asking why. "Well, why did this happen? Okay, it's because of this. Well, why did this happen?" We keep asking why, why, why, why, why, why, why, why, why, until we would get to what we thought was the root cause of the problem.

We would come up with a solution to that problem. The action is the actual instantiation of what we're going to do to solve the cause of the problem, and then we would also document a way of measuring to see how we know that this problem was fixed.

This is a way of making sure that mistakes only happen once and to minimize the chances of them being repeated in the future.

It can be as simple as just keeping a deck of index cards and having a stack of index cards with the PCSAM process on them, and then adding them to the backlog and taking care of them in the order where they could... We'd work on the ones that give us the biggest heartburn first.
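An index card like this maps naturally onto a small record. In the sketch below, the field names follow the steps just described (problem, cause, solution, action, measure), which is an inferred reading of the PCSAM acronym; the `pain` score and the example card are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PcsamCard:
    problem: str   # what went wrong or was too slow
    cause: str     # root cause, reached by repeatedly asking why
    solution: str  # the idea for removing the cause
    action: str    # the concrete step that will be taken
    measure: str   # how we will know the problem is fixed
    pain: int      # hypothetical 1-5 "heartburn" score used for ordering


def prioritize(cards):
    """Order the backlog so the most painful problems come first."""
    return sorted(cards, key=lambda card: card.pain, reverse=True)


# Illustrative card based on the web config example above.
example = PcsamCard(
    problem="The port was wrong in our web config",
    cause="Config values are edited by hand for each environment",
    solution="Generate config from one checked-in template",
    action="Add a templating step to the build",
    measure="Number of deployments failing on config errors",
    pain=4,
)
```

Because every card carries a measure, the same records can later be rolled up into the business metrics mentioned next.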

An interesting thing from that, by doing these measures, by measuring the solutions to our problems, we could trace those to actual business measures and find metrics that would be useful for tracking our progress and tying the changes that we made to actionable business measure-y things that we could use to display success.

Related to this, this is a very common DevOps thing, a quote from Bruce Lee: "I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times." And this just goes into constantly refining our process and making it better and better and better, instead of doing a shotgun approach or trying a bunch of different things and seeing what works. Incrementally changing.

Another thing that we did was we improved our feedback loops. We amplified them. We got in touch with our users so that we could have direct communication from our users to developers, that we would be aware of what they were experiencing, instead of playing the telephone game and going from users up to the business person, to our liaison, to us back in Pittsburgh.

Opening up a direct line of communication was very beneficial. The documentation server that I described was very beneficial, where the customer had a direct line into our line of thinking. We also gave them access to our issue trackers and to our wikis so that we had complete transparency of the status of our projects and what was going on, so that they would be aware of any incident like what happened before where we went over budget and over schedule.

Through the documentation exercise, we made sure that specialized knowledge wasn't in people's heads. We developed a culture where, if it's not in the documentation server, then it didn't happen. And if somebody wanted to reference something, it had to be properly documented, or else we wouldn't even treat that as a valid thing. That's still a work in progress and a cultural change on our part.

What that tends to do, if anybody read The Phoenix Project--isn't that requisite for being here? If not, go out and read it, or at least skim over it. It's very useful to show to your customers.

But here's my quote to describe the Brent effect in The Phoenix Project: "An actor operating as a singleton is sabotaging the system." So if you have some superhero or some genius who seems to save the day, they're maybe giving you short-term gains, but in the long term, they're sabotaging the machine and making it broken on a larger scale.

What I found is that a lot of the things that we did all come down to empathy and having empathy for the other side. We were in development, government's operations. Them having empathy for us and us having empathy for them and getting us to work together was a huge thing.

An interesting study was done at McGill University up in Montreal, where they put two mice together in a cage and measured their stress response. And they found that mice who were familiar with each other did not exhibit a stress response, but ones who were complete strangers did have a stress response to each other.

So what they did was they scaled that up to humans and showed that me and my buddy are in a room together, we're not feeling stress, but if I'm in a room with a complete stranger, there's a little bit of stranger danger there. And there's a stress response on both parties.

And they correlated that with the amount of empathy that you feel for the other person. So if you're with a friend, you're going to feel much more empathy for that person than with a complete stranger.

And they proved this by giving some sort of knockout drug that would diminish any stress response. And they put two strangers in a room, and they felt the same empathy as if two best buddies were in the room.

Well, anyway, since we can't take Xanax every day or some Vicodin or whatever, they found that through sharing in a shared experience, and they used the video game Rock Band, just by playing Rock Band for 15 minutes would move people from the stranger zone into the friend zone, and then they would exhibit the same empathetic response to that other person as if they were really familiar with each other.

So I think that's very telling in how to increase communication and collaboration between groups, is to have some sort of shared experience.

Just from my own anecdotes, pizza parties, they don't seem to work because people tend to gravitate towards people that they already know. After-work bowling parties, the same thing. Everybody aligns in their lanes of their own silos. But shared experiences tend to work. We can figure out what that is besides Rock Band, because not everybody likes Rock Band and some people just suck at it.

So, I know Doom Deathmatch, my boss back in the '90s, he used Doom Deathmatch, and that's what brought warring teams together into one tribe.

So the mathematical equation for this: strangers equals stress equals lower empathy. So if we divide stranger on top of stranger equals friend, and then stress turns into lower stress, then we have higher empathy.

So the results of all these activities resulted in a stronger partnership with our customer. On the empathy part, I think a lot of the empathy came through our shared experience of struggling through doing that painful last mile, getting something into production, and finally working closer with them.

We have increased visibility of progress, which is good. We've ended up with fewer defects, faster lead time for features, increased transparency, and everybody happier overall, as you can see.

If you want to know more about these little tidbits that I showed, the SEI has a DevOps blog: insights.sei.cmu.edu/devops. We publish every other week an article about DevOps and what we're doing with DevOps.

And in two weeks in Arlington, Virginia, for any East Coasters, we're hosting a free DevOps symposium. It's a one-day seminar on a Thursday, and there's the URL up there. If you want to go and register, it's totally free, and we have some great speakers and great activities going on.

So top five takeaways from my presentation and from my experiences, and it's really funny because this seems to mirror a lot of the other presentations that I saw this week.

Culture, that's the number one barrier to change. All the technology stuff's easy compared to changing people's attitudes. I think by increasing empathy, that's how you start to bend things around, particularly.

We talk about shifting left everything, shift left star. One thing that I never even dreamt of shifting left is our understanding of our key stakeholders, being outsiders to an organization. The big challenge, with all these things we have to shift left, is to keep a checklist of them so that we don't forget them the next time we do a new engagement.

Continual process improvement, the PCSAM process. It can expose very useful metrics in measuring your progress.

AppSec can't be fully automated yet, but we can do better through techniques like I showed.

And empathy's just huge, and it's fixable through shared experiences if you can get there without it costing a ton of money or being particularly painful.

As far as what I'm looking for help with, what I'm focusing on lately is automating all the security things and getting as much security assurance into an automated pipeline as possible so that we're able to keep up with having as much coverage as humanly possible.

Again, thank you. I'm Aaron Volkmann. There's my Twitter handle and my email. If you have any questions or want to collaborate on anything, please reach out. But thank you.

Q&A

I got about a couple minutes left. Are there any questions from the audience?

Q: I have one.

A: Yeah. I got a mic for you right back here.

Oh, cool. So you show up on the YouTube.

Oh, no.

Q: So pretty much when you were talking about being able to identify if there's any new changes--

A: Yes.

Q: --or small changes--

A: Yes.

Q: --if we are going to rely on the engineers to do that, that's a little hard.

A: Yes.

Q: So what do you guys exactly do to be able to identify those changes?

A: We did experiments looking at... Well, that was the first pass, is saying, "Engineers, you do this," and just trust in the code reviews that they'd be honest and show. Because that was an improvement over the existing thing where our security guys would just have to have a gut feeling of, "Oh, this is what's going on." So that was better than doing nothing.

We're looking at static code analysis to make those calls. Things like wrapping API calls in common libraries, so you could do regexes on those in source control, and if those come up, then you know that something new security-wise is going on. Doing machine learning and NLP on commit messages to look for keywords like security, things like that. Those are other things that we're still working on and continuously improving upon.
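The regex idea from this answer might look like the sketch below: if security-relevant API calls are wrapped in a common library, new calls into that library can be spotted in a diff. The `seclib` package and its submodule names are hypothetical, and the input is assumed to be a unified diff as produced by source control.

```python
import re

# Hypothetical wrapper library: all crypto, auth, and network calls go through seclib.
SENSITIVE_CALL = re.compile(r"\bseclib\.(crypto|auth|net)\.\w+\s*\(")


def flagged_lines(unified_diff: str) -> list:
    """Return added lines in a unified diff that call into the wrapper library."""
    hits = []
    for line in unified_diff.splitlines():
        # Added lines start with '+'; '+++' is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if SENSITIVE_CALL.search(line):
                hits.append(line[1:].strip())
    return hits
```

Any non-empty result would be treated as a security-affecting change, which removes some of the reliance on engineers flagging their own commits.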

Q: Okay, thanks.

A: Thanks. Anyone else?

Q: Have you gotten to the point yet where you start to embed security folks into the Scrum teams so that you can release--

A: We are totally not there yet.

Q: So I'm familiar with a couple of programs where people are doing that. It really does help because you can test the important controls before you release to production, and therefore, you can pre-accredit.

A: Right. I think that's definitely the ideal, but then you come up... Same thing with embedding, making ops folks first-class citizens on a project team. You can run into the resourcing issue, and that's what we run into, where people say, "Oh, we just don't have enough time to do that realistically." But I agree, that would be the ideal, and would definitely advocate for that if it's possible.

Q: So you would move a lot faster, then.

A: Right. Yeah, I think the challenge is to produce metrics that prove that in order to get the momentum moving on that.

Anyone else? All right. Alrighty. Well, hey everybody, have a good flight home if you're taking a flight, and enjoy the rest of your day. Thank you so much for coming.