A Principled Approach to Driving Change in DoD IT

San Francisco 2017

Data Systems Tech Lead · Joint Warfare Analysis Center

The military works best with a clear objective. It’s no different for the systems that support it. When you are running short on time, people and patience, it helps to see each release marching toward the goal. Just don’t forget about bringing the people with you!

I’ll share some lessons learned and forgotten, successes and mistakes made as we try to provide faster, more accurate options to our nation’s warfighters.

Full transcript

Kurt Hockaday

So, in case it isn't obvious, battleships can't turn on a dime. Neither, apparently, can fairly mature government organizations. From people who point to their SOPs and their position descriptions as evidence of, "This is not how it's done," to government hiring policy, which ensures a snail's pace of change to the workforce, introducing large-scale change into the DoD is never a quick endeavor.

But for the same reasons, momentum will eventually become your friend. Once things get moving, it's really easy to keep pushing.

I'll walk you through, at breakneck speed, our last three-plus years trying to modernize the way our IT department responded to our command requirements, highlighting some of what did and didn't work along the way. I'm only going to be skimming the wave tops. If you want any deeper explanations, please come and find me later. We'll see what we can do.

So, I work for the Joint Warfare Analysis Center, an operational Department of Defense command under the United States Strategic Command. We answer the questions that our nation's warfighters have about how to accomplish their mission.

We support long-term strategic planning options, spanning months and sometimes years, all the way down to immediate tactical solutions required in a couple of hours. We're located outside the traffic and craziness of D.C., but we still remain close to our intelligence community and DoD partners and data providers, helping us respond quickly when necessary.

Beginning in about 2011, JWAC had a number of events that drove us towards a different mindset in IT. About that time, our former headquarters, Joint Forces Command, was disestablished, and we were caught up in that reduction. Almost overnight, we lost 30% of our money and 30% of our people.

Now, a RIF, a reduction in force, is a mindless workforce-reshaping tool. The people with the least tenure are cut. Your performance, your talent, your value, your drive don't matter, only your time served. That was a psychologically devastating time, even for the ones who were safe.

Emerging from those ashes, now under STRATCOM, we needed to prepare to handle more diverse and more restrictive data sets. Some of this data couldn't even sit on the same hardware and architecture as other data types. Now, in the past, we would've just stood up another network, carved off a dedicated team to manage it, and we're good. That wasn't possible any longer. Fewer people, less money.

We were already operating in four primary air-gapped networks and were managing a whole host of other smaller ones. And in the middle of it all, Edward Snowden made some already super-careful data owners even more worried about losing their data. So, why not impose even more restrictions on how to handle it, to include the principle of least privilege and other things that would make it harder for IT admins to justify access to their data?

Also, budgets were tightening across the DoD, and we needed to be moving as much as possible to community-hosted solutions. This backdrop was a harsh reality check that we simply couldn't keep doing what we were doing.

But what do we do? Our remaining admins were overloaded just trying to keep up. Fortunately, we had begun paying attention to some incredibly bright and oh-so-helpful gentlemen with some intriguing ideas on how to make it all better. From John Willis to John Allspaw to Gene Kim, we were led to something called DevOps.

It all seemed so simple. Making sure we covered down on culture, automation, measurement, and sharing while observing the Three Ways would lead us to the land of well-oiled machines.

Given the problems that we were trying to solve, we started out focusing primarily on deploying and maintaining consistent systems and services. If we could do that right, then those services should be able to maintain themselves to a great extent, allowing us to finally have an environment deployed and managed by our analysts without any IT involvement, if the data requirement deemed it necessary. We heard some about that from Damon Edwards a little bit earlier.

So, to that end, we decided to come up with a collection of principles or behaviors derived from many conversations, books, podcasts, and articles. These principles read bottom-up and serve as guideposts to our decision-making.

When we start with system and service delivery, the main objective is that our environment must be cookie-cutter deployable, such that every system and service should be built from standardized templates and orchestrated by repeatable workflows, all of which are stored in a configuration management system. Service installation and configuration should be automated.
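The "cookie-cutter" idea can be illustrated with a minimal Python sketch of desired-state convergence. This is not the tooling described in the talk (PowerShell Desired State Configuration and Puppet); the `ServiceState` fields, version strings, and action names are hypothetical, chosen only to show how a template and a repeatable workflow fit together.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceState:
    """Hypothetical, simplified description of one service on one host."""
    package_version: str
    running: bool


def plan_changes(desired: ServiceState, actual: ServiceState) -> list[str]:
    """Return the actions needed to move `actual` to `desired`.

    An empty list means the host already matches the template.
    """
    actions = []
    if actual.package_version != desired.package_version:
        actions.append(f"install version {desired.package_version}")
    if actual.running != desired.running:
        actions.append("start service" if desired.running else "stop service")
    return actions


def apply_changes(desired: ServiceState, actual: ServiceState) -> ServiceState:
    """Simulate convergence: once the plan is applied, the host matches the template."""
    return desired if plan_changes(desired, actual) else actual


# The template would live in source control; the drifted host is made up.
template = ServiceState(package_version="2.4.1", running=True)
drifted = ServiceState(package_version="2.3.0", running=False)

print(plan_changes(template, drifted))
converged = apply_changes(template, drifted)
print(plan_changes(template, converged))  # a second pass has nothing left to do
```

The point of the sketch is idempotence: running the same workflow against a host that already matches the template produces no changes, which is what makes a deployment repeatable.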

Then we can move to proactive management. This one is primarily focused on monitoring and alerting. All services should be instrumented to feed a central monitoring system with enough information to know whether or not the service is functioning as expected. We also have a schedule in place for updates to ensure that they are tested before being applied in production.

Which then brings us to consistent operations. This deals with making sure that all changes are tested in a development environment which mirrors all of the production ones. Changes are not made in prod. They are developed, tested, weighed, and measured in dev, and then deployed in prod.

Also, services should be architected to make it easier to simply redeploy a system rather than troubleshoot it and fix the problem. Troubleshooting in production should only occur on repeat failures which indicate a larger problem.
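One way to picture the "redeploy first, troubleshoot only on repeat failures" rule is as a small policy object. The sketch below is an assumption-laden illustration in Python: the threshold of three failures and the `FailurePolicy` name are hypothetical and do not come from the talk.

```python
from collections import Counter

# Assumed threshold: how many failures of one service count as a "repeat failure".
REPEAT_FAILURE_THRESHOLD = 3


class FailurePolicy:
    """Decide whether a failed service is redeployed or investigated."""

    def __init__(self, threshold: int = REPEAT_FAILURE_THRESHOLD) -> None:
        self.threshold = threshold
        self.failures = Counter()  # failure count per service name

    def on_failure(self, service: str) -> str:
        """Record a failure and return the action to take."""
        self.failures[service] += 1
        if self.failures[service] >= self.threshold:
            # Repeated failures suggest a larger problem worth troubleshooting.
            return "troubleshoot"
        # Otherwise, rebuilding from the template is cheaper than debugging.
        return "redeploy"


policy = FailurePolicy()
for _ in range(REPEAT_FAILURE_THRESHOLD):
    print(policy.on_failure("example-web-service"))
```

A real policy would probably also age out old failures so that an occasional fault months apart does not trigger an investigation; that is left out here to keep the sketch short.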

That brings us to self-service provisioning, the ultimate end-all, be-all. All of the previous principles build to allow a service to function in this manner, where a non-IT user can deploy, manage, maintain, and destroy needed services without any admin involvement.

Once the principles were in place, then we began to work. But first, we needed a space. We took over a room and tore out all the cubicles in it. Then we filled it with tables, chairs, computers, and monitors, and we painted all the walls with whiteboard paint.

We were sitting shoulder to shoulder, face to face. What happened was pretty amazing. None of us knew how to build what we needed to build. None of us could see the path from A to B.

But being that close to each other and watching everyone struggle, trying to figure out which way was up, we all started jumping in on all the problems. Conversations all over the room were being joined in by anyone who thought that they could help. Behaviors changed, bonds formed, respect was earned. We did more and learned more as a collective in about eight months than we had done in years.

Then we needed to expand, to scale out. The little team that got us started was growing as we incorporated more services, and the room that we were in just wasn't big enough to hold us. We also needed to bring in additional people to help us run it.

So we decided to move into our war room, which is JWAC's command hub for dealing with crises in the world, on the understanding that if a crisis came up, we'd move out. It's big, it's open, it can support a lot of people. Sounded like a great option.

So we moved in. We arranged all the teams to facilitate conversations within the teams and sat the operators right next to the engineers. Two things happened, one good and one bad.

The bad first. The behaviors that we had built in the smaller room with no walls began to fade in a much bigger room with short walls but really bad acoustics. People were less able to hear the problems other teams were dealing with and less likely to help out.

On the positive side, sitting the operators right next to the engineers was a huge win. Making sure the problems with a service actually running in production were seen and felt by the engineer made the service and the dashboards much more robust.

We've since moved out of the war room into another room with cubes again, with shorter walls and grouped to encourage team collaboration, but we've got walls nonetheless. It seems unlikely that we'll recapture the magic of our small, collaborative environment now that we've had to scale out, but we're still looking for a way.

In trying to jumpstart our learning and get ourselves working together, sequestering ourselves off into that little room, we made a misstep there. It quickly began to look like we were fostering the team of haves to the detriment of the have-nots. We still needed to keep our production systems up and available, and we still needed to respond to the requirements and problems of the command, and yet we carved out some of our best IT admins and developers to build this cool new system. Against all of our hopes, we went from dev and legacy to dev versus legacy.

Now, we knew we had to automate the dickens out of all the things. We're primarily a Microsoft shop, so PowerShell and Desired State Configuration were our core staples for system config, with Puppet underwriting the Linux systems. We had used VMware for years, so vRealize Orchestrator was a natural fit for our overall orchestration tool, and everything was backed up in Visual Studio Team Foundation Server for source control.

With a few exceptions, this stack works well to keep everything consistent. The only tricky part is with Windows applications that aren't prepared for silent installs or automated configuration. Some of those take a lot of extra work to make hands-off.

In order to automate all the things, you have to monitor all the things. We use Sensu as our monitoring platform, and we have checks and automated remediations out the wazoo. This was one of our biggest successful behavioral changes, since before this effort our chief monitoring system was the telephone.
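For readers who have not written a monitoring check, here is a minimal Python sketch of what one can look like. Sensu checks report status through their exit code, following the Nagios plugin convention (0 for OK, 1 for WARNING, 2 for CRITICAL). The disk-usage example, the thresholds, and the function names are assumptions for illustration, not checks from the talk.

```python
from __future__ import annotations

import shutil

# Exit codes in the Nagios plugin convention that Sensu checks also follow.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_disk(percent_used: float, warn: float = 80.0,
                  crit: float = 90.0) -> tuple[int, str]:
    """Map a disk usage percentage to a (status code, message) pair.

    The 80/90 thresholds are illustrative defaults, not values from the talk.
    """
    if percent_used >= crit:
        return CRITICAL, f"CRITICAL: disk {percent_used:.0f}% full"
    if percent_used >= warn:
        return WARNING, f"WARNING: disk {percent_used:.0f}% full"
    return OK, f"OK: disk {percent_used:.0f}% full"


def current_percent_used(path: str = "/") -> float:
    """Measure how full the filesystem holding `path` is."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


status, message = evaluate_disk(current_percent_used())
print(message)
# A real check script would finish with sys.exit(status) so the monitoring
# agent can read the result from the process exit code.
```

Keeping the threshold logic in a pure function, separate from the measurement, makes the check easy to test in dev before it ever runs against production.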

The principles themselves have served as a great guidepost when we're trying to deploy a new service. For services that are built to meet the principles, there are few surprises, few problems, and happier folks.

I mentioned some of our problems with Windows applications. We also have some trickiness with LDAP and Active Directory integration on the Linux side that messes with our ability to seamlessly deploy as we would like, and we still have some people who don't want to put in the extra work to automate their service.

Now, once we could consistently deploy services and ensure they were managed with minimal fuss, we could start standing up our environment. We build and test in dev, review and approve for release into production, then we stamp out a production environment just as it existed in dev. Sweet, most of the time.
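The build, test, review, and release flow described here can be sketched as a simple promotion gate. This Python example is hypothetical throughout: the `Release` fields and the `promote` function are illustrative names, and the actual workflow in the talk ran through vRealize Orchestrator rather than anything like this.

```python
from dataclasses import dataclass


@dataclass
class Release:
    """Hypothetical record of one candidate release."""
    version: str
    tested_in_dev: bool = False
    approved: bool = False


def can_promote(release: Release) -> bool:
    """A release may go to production only after dev testing and approval."""
    return release.tested_in_dev and release.approved


def promote(release: Release) -> str:
    """Refuse to deploy anything that skipped a step; otherwise describe the deploy."""
    if not can_promote(release):
        raise PermissionError(
            f"release {release.version} has not been tested in dev and approved"
        )
    return f"deploying {release.version} to prod from the same templates used in dev"


candidate = Release(version="1.0.0")
print(can_promote(candidate))   # nothing has been done yet
candidate.tested_in_dev = True
candidate.approved = True
print(promote(candidate))
```

The gate itself is trivial; the value comes from the fact that production is stamped out from the same definitions that were exercised in dev, so approval applies to exactly what gets deployed.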

Those problem services from the last slide, yeah, this step becomes tricky for them. Luckily, the ones who simply don't want to do the work are realizing that it really isn't optional. The problems they have simply aren't shared by the automated services, and they reluctantly are beginning to add more of the principles in.

Now, if you have something in production, you need somebody to run it. We're not at real self-service yet. Remember that least privilege driver? One way we implemented that was to have the engineer with full system privileges on the development environment, but not in production. So we needed to bring in some operators.
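The split between engineers and operators amounts to a table of privileges keyed by role and environment. The Python sketch below shows one way to express that with a default-deny lookup. The action names (`admin`, `deploy`, `restart`) and the exact contents of each set are assumptions; the talk only says that engineers hold full privileges in dev and none in production, and that operators hold very few elevated privileges in production.

```python
# Hypothetical privilege table: (role, environment) -> allowed actions.
# Any pair not listed here gets no privileges at all (default deny).
PRIVILEGES = {
    ("engineer", "dev"): frozenset({"admin", "deploy", "configure", "restart"}),
    ("engineer", "prod"): frozenset(),
    ("operator", "prod"): frozenset({"deploy", "restart"}),
}


def is_allowed(role: str, environment: str, action: str) -> bool:
    """Return True only if the table explicitly grants the action."""
    return action in PRIVILEGES.get((role, environment), frozenset())


print(is_allowed("engineer", "dev", "admin"))
print(is_allowed("engineer", "prod", "admin"))
print(is_allowed("operator", "prod", "restart"))
```

Because unknown combinations fall through to an empty set, adding a new role or environment grants nothing until someone deliberately lists its privileges.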

We went back and forth on the names here. Names have power, after all. We really wanted to drive home the point that this was a different job than they were used to. We didn't want to continue calling them admins. We couldn't put engineer in their title. The government has some very specific definitions for who can be called an engineer. They were going to be doing operations, so we called them operators.

Ouch. That did not go over well. They felt devalued and little more than button pushers. It led to some bad feelings that haven't fully gone away.

Now, as we bring in new operators and walk them through the job and the responsibilities, it becomes less of an issue, but not everyone has been integrated yet, and the ones still on the outside, they have a bad taste in their mouth.

Now, compounding the labeling issues was an overall lack of understanding about what was going on and what would be expected. Everyone was busy, and there wasn't a concerted effort made to bring operators in early or even try to alter their current jobs with the set of principles and behaviors that we were trying to build into this new environment.

We could have altered their current tasking to include the expectation of scripting solutions to solve their current problems and build their own PowerShell expertise. We could have been more militant about attending training designed to get them up to speed with these new skill sets. We could have incorporated them into the project earlier. We didn't. Now we're struggling to find enough people with the skill sets that we need.

So where has all of this left us? Most relevantly, we have supported several analytic models and experiments that were impossible before, designed to respond more quickly to warfighters, with the added bonus that when they're actually ready for operations, there will be no surprises.

In one instance, we were able to bring in a model that had taken around four days to run and run it through in an hour and shut the whole thing down. Awesome.

For services which have adhered to our principles, we have not had a change break our production environment yet. Our build times for a complete environment have been reduced from six weeks to about a week. And we're living by the mandates of least privilege, with engineers having elevated but restricted privileges in the development environment and operators having very few elevated privileges in production.

This next one was huge. SME and domain-specific knowledge is now encoded into our processes. Anyone can run what used to be so complex that only a few people knew how to do it, and all of us can look through the code to understand how it's done.

Our auditors also have an easier time. We can print out the complete configuration of the system, walk them through all the settings, and only spot-check a few systems. It's a much easier process than it used to be.

Now, it's not all roses. We haven't gotten any service all the way to self-service yet. Our integration of operators into our engineering teams has been slower than we'd hoped, and that has had a number of impacts, chief among them the many challenges that are only seen once a perfectly engineered service hits the real world. Those have come as a surprise. We've had to learn how to inject some of these lessons on the fly.

And there are a few other things that we struggle with. If you deliver software or systems to the government, especially the intelligence community or the Department of Defense, I would recommend taking a look at this site. Here you'll find a lot of the security rules and implementation guidelines that we have to follow.

You can even look at the Secure Host Baseline, which is our OS build with the security already baked in, and you can understand how your software will function, or more often not, in our environment. And if you're delivering us a Windows application, making sure it supports a silent install and automated configuration would be huge.

Some things that we could use advice on are methods to switch between normal and privileged users when the normal user doesn't have privileged access, or when that user has both certs on their smart card.

Our networks are air-gapped from each other and helpful little things like the internet, so software that is required to phone home, that falls flat. And we have the added annoyance that really helpful stuff like continuous delivery is rather tricky when the bits can't get there from here.

And we have so many questions about how to make Active Directory handle all that we're throwing at it.

Now, last but not least, you can spread the word that we're hiring. We're currently looking for soon-to-be or recent college graduates, U.S. citizens, who are interested in working on an adaptive IT environment, which will be the platform upon which some of our nation's toughest challenges will be solved.

That's who I am. If you want any other information, feel free to drop me a line. Look me up. I'll be here through the end of the conference, and I think we've got some time for questions.

Q&A

Q: As you tried to get started on moving from traditional command-and-control style of operations that the military has afforded to this whole collaborative culture, how far up did you have to get to get support for that sort of initiative? That's breaking 200-plus years of, just celebrated our birthday, for one.

A: Yes. So, me. I was and am a 14. But I was a technical 14. I wasn't in the management chain.

But part of our model had already been to embed our developers with our analysts. So that helped. Our management team already had the understanding of, "Hey, that kind of makes sense, where you're taking the problem and you're putting the solution to the problem right in the middle of it."

So a lot of what we were describing early on resonated. We were also, like I mentioned, right in the middle of the transition to STRATCOM. And there was a heavy desire with our leadership team to have a pretty good story to our new parent command. So there were some elements of that, too.

STRATCOM was also looking to stand up a data center, and they were looking for expertise in how to monitor and alert on a lot of the things they were doing. So it was kind of a fortuitous turn of events, but I kind of started it based on some of the stuff I was seeing out in the real world.

Q: How did you manage to procure—

A: Ooh. Sorry. Yes.

So we did that in two thrusts. One, we made the very intentional decision that we were going to go after open source as much as possible, which sounds odd for a DoD organization. So we did that in very targeted fashion.

We went after the open source equivalents of paid things, so that there was a company out there that would support an enterprise version, like Puppet. And we just brought in the open source version of that, and then our cybersecurity shop, they made that translation in their head and said, "Yeah, that kind of makes sense. We don't have to question too hard there."

The other side, we really tried to stick with the tool suites that were already in our building. So I talked about VMware. We had been VMware customers for a while. So bringing in Orchestrator and vRealize Automation, that wasn't a huge stretch.

Our contracts folks, they were understanding, "Yeah, we see what you're doing, and we can do a mod to this contract and pull money from here." So that's kind of the way we attacked it, though. We didn't try to go too far afield. We tried to keep everything pretty familiar to the contract while still doing some pretty radically different stuff with it.

Anybody else? Go party.