DevOps Handbook Experiments in Accelerating Delivery

San Francisco 2017

Jim Grafmeyer

Digital Solutions Architect · Nationwide

Cindy Payne

Director IT Application Development · Nationwide

Continuing the Nationwide journey from the first 3 years, this focuses on the experiments which are based on areas from the DevOps Handbook and the results achieved working with 2 model line teams. The experiments include:

- Minimize WIP

- Fast and Reliable Automated Tests

- Automate/Enable Low Risk Releases

- Continuous Flow

- Continuous Automated Feedback Loops

Results (to date) include:

- significant reduction in lead time (over 60%)

- increased frequency in deployment

- significant reduction in time to run automated tests (20X factor)

- integrating security and performance testing into CI/CD flow

- reduction of manual activities, waste and wait states

This will include technology enablers (e.g. Docker) and collaboration enablers like ChatOps and GitHub and also the SRE role in the experiments.

We will also discuss the experiment design and interaction model, including the collaboration and balance between practitioner input and Systems Thinking (https://www.linkedin.com/pulse/systems-thinking-carmen-deardo) and also culture (https://www.linkedin.com/pulse/devops-culture-changes-start-respect-people-carmen-deardo).

Full transcript

Jim Grafmeyer

I'm sure everybody's seen the Peyton Manning commercials. You're well aware that we sell auto insurance. The message here is we're way more than just auto insurance. We're actually number one in a lot of things outside of that.

So we're number one in pet insurance. We're number one in 457 retirement plans, number one in corporate life insurance, number one in farm insurance. We've actually been around a long time, too. We've been around for about 90 years, and we are a mutual company, so we are not public, which gives us some flexibility, especially from an IT standpoint, because we can make more strategic long-term decisions and not have to have quarterly expense pressures.

Also, fantastic place to work. We've been on Fortune's Top 100 Places to Work for the past three years.

Cindy Payne

So the message here is really about scale. A lot of the things we talk about, if you think about it on a small scale, it's very achievable. With Nationwide, our challenge is that we want to do everything on an enterprise scale.

So we have 200 Nationwide Agile Lean development lines that we are trying to scale across. That's across 50 primary technologies that we deliver software solutions in. These aren't technologies that support those solutions. They're actually the technologies that we actually deploy to production, so very complicated, large enterprise situations. That's over 2,500 applications.

When you start talking about how many associates we have, we have 5,100 IT associates, and about half of those are developers and automated testers. Again, we're a fantastic IT department, rated by Computerworld Best Places to Work in IT.

And then a little bit about what Jim was getting at, which is we have a lot of different businesses. We've organized those into 23 different business units. Some of those are internally facing business units.

And then at the core, we have been very much a risk-averse, disciplined-type company, and we're really in the mode now where we're trying to balance innovation and discipline.

So a little bit more about how we're organized at Nationwide within our IT department.

One of the things I want to highlight here is that we have multiple shared services organizations across Nationwide IT. So we have an infrastructure and operations area that handles a lot of our tools, our run services, and our platforms. But then technical consulting and other professional services like requirements testing, project management, they all live under a different stack. And then we have this business unit that's called Enterprise Applications that runs IT for IT applications.

So when you start, and we'll get into this deeper, but when you start talking about architecting the delivery pipeline across three different CIOs, it becomes complicated at Nationwide. And then you can see Jim is actually a solution architect that faces off to one of our externally facing business units, Nationwide Financial.

Jim Grafmeyer

Okay, so a little bit more about Nationwide. We made a decision about 10 years ago that we were going to build a globally competitive in-house software development capability.

And at a high level, this is how we've been doing it for the last 10 years, with a little bit of adjustment here we'll talk about.

So first, it's Agile. We were an early adopter of Agile and Agile at scale, and that's how we deliver software with high quality.

This year, we actually added to this picture how we deliver with speed, efficiency, and low risk as DevOps.

How we manage IT at enterprise scale is with Lean.

And because we want to be globally competitive, in other words, we want to be primarily insourced, we have to prove that we are competitive, and that's how we use CMMI to benchmark ourselves against the industry.

So a little bit about our history and our journey so far.

If you rewind to about 2015, back then we were really figuring out which practices and cultural elements we wanted to focus on. And actually, our friend Carmen DeArdo, who's sitting up here, came up with this DevOps house.

But the idea was it was a metaphor, and there were two pillars of the house, and there were these seven practices and seven cultural elements that we were going to focus on.

At that time, we were kind of narrowly focused on our deployment software. We were using a legacy piece of software with high licensing costs, and we were focused on swapping that out for the UrbanCode suite.

Fast-forward a little bit to 2016, and that's when an exciting, large transformational initiative across all of IT spins up, internally called the IT Delivery Model.

This transformational program has a lot of work streams, one of them being DevOps, but there are also work streams around: how do we do Lean management at scale? How do we change our software development practices to evolve to the next level? How do we optimize our use of suppliers and contractors?

That last one, in particular, allowed us to harvest a lot of efficiencies that we could invest in things like DevOps.

2017 is where the majority of the rest of this talk will be focused, but to give you a preview, we took an experimentation approach and spun up what we called two model lines that we ran experiments through, and we'll touch on what those were and how we went through it.

2017 was also when we decided to focus on the developer experience. In the past, we probably had too much width, and we were all over the place on what experiences we were optimizing. We made a conscious choice this year to optimize the developer experience.

And then all these experiments we anchored back to The DevOps Handbook, which Cindy will talk about in a few minutes.

And then looking forward into 2018, we introduced a lot of new tooling, and it's really about how do we spread that new tooling to the 200 lines that we talked about?

There's also some work being done around Google's SRE model and how do we bring that in-house, as well as, as Cindy mentioned, treating the pipeline as a product.

So if you look at all the tools we use from when a business has an idea to when we're monitoring it in production, we hit about four or five different areas inside of Nationwide, which makes it very complicated to do CI across that pipeline.

Okay, so a little bit more about these experiments we are running. This is the interaction model, for lack of a better term, that we came up with to run these experiments. And really what this is, is kind of our plan-do-check-act cycle that we implemented inside for this construct.

We had a number of guiding principles when we kicked this off. We had three guiding principles, the first being speed. So all the experiments we ran had to be anchored back to speed. If we got efficiencies, awesome, but that was not the goal, to get efficiencies. It was all about being able to prove that this experiment could get us speed.

Number two was it had to be practitioner led. We did not want a top-down DevOps transformation that we were mandating these teams to go through. We wanted the teams themselves to give us ideas of where they thought they could get faster.

And then finally, the scope. So we could have gone crazy with our value stream and figuring out where we wanted to focus. We could have gone to the far left or the far right. For these experiments, everything had to fit between when a card hit the backlog of an Agile line and when it was ready for deployment. That was our scope.

So to the interaction model. In the middle there, you'll see the two teams, the two model lines that we spun up. And both of those teams are sending ideas up to a common DevOps backlog.

That common backlog is managed by the DevOps leadership team, which Cindy and I sit on, and Carmen, and a few others. And we're basically tasked with making sure that the experiments the teams are coming up with align to our guiding principles and the scope and the speed that we want.

And then we're also applying systems thinking when we're crafting the experiment to make sure that, if successful, it can scale to all 200 lines.

We also have a platform team, which we didn't start with initially, but we realized quickly that there are some problems we didn't want the model lines to solve themselves. Think of something like needing APIs to automate some of our change management processes. We don't want our model lines wasting time building that enterprise capability. We kick that to a platform team. They develop the enterprise capability, and then the model lines use it once it's done.

And then finally, there's a governance team. This governance team is probably different than most in that we have multiple CIOs that sit on it, and they are basically there to resolve blockers and issues for us. But they're also intimately involved in the work that the teams are doing. They attend show-and-tells after each iteration, and they're able to see the results hands-on, which we're finding is way better at influencing them than giving them a PowerPoint after the fact of what the results were of the experiments.

Cindy Payne

All right. So, as Jim mentioned, this had to be practitioner led. We didn't want to tell them how they were going to speed up their delivery pipeline. We wanted them to tell us what was slowing them down.

So in the beginning, we had so many ideas from these lines. They knew exactly what was slowing them down, which we knew would be the case, but we had a little bit of trouble organizing it as a team.

And very early on, we realized that pretty much everything they were suggesting tied back to The DevOps Handbook in some way, and we would start referencing it. In fact, Carmen goes around and he knows the page numbers of everything that we talk about.

So very early on, we decided that the epic stories that we would have and rally around in our product backlog would be the main themes of the handbook. And every time the team came up with an idea, we pretty much mapped it back to that. And if we couldn't find it in the handbook, we really questioned whether it was the right thing to do.

So because I'm sure everybody's read the handbook and is very familiar with it, I'm just going to dig into the automated testing one.

Especially because we're Agile and we've been doing automated testing so well for so long, I really didn't think it was going to be a place where we spent so much time. But it ended up being, when you just really listen to the teams and what they discovered about themselves, that this was a place where we had a lot of experiments.

So first of all, the teams were telling us that their test beds were actually too big and unmanageable. So when you've been testing for so long, I think it's pretty common that you develop a bloat around your test beds, and they really had not put in the time to keep their test beds lean and meaningful. So they'd lost trust in them. And often they were broken, and nobody was paying attention to them. So that was one thing they fixed right away.

The next thing they said was there's a lot of duplication. So we had silos within our lines where we had developers and testers that would have a handoff, even though they're part of the same team. And what was happening is they were testing the same thing, and so we had duplication in our test beds.

So they figured that out, and they wanted to do poly-skilling, and they figured out that if they stopped the dev-test handoff and they were just pairing with each other, that they didn't have duplication in their tests.

And once they got the tests kind of lean and green, and started trusting them, then they could develop a culture of having no tolerance for them being broken. It used to be that a test would be broken for a day or two and nobody would do anything. They changed their culture and said, "No, if it's broken, it's meaningful because we trust them now, so we all have to stop."

And then finally, when you get to something like performance testing, that was a whole other problem that we knew we were going to run into right away, which is we had a centralized performance testing team, and it was located in our I&O organization. And it was a 90-day process to request a performance test to when you actually got the results. And the results actually came when you were in code freeze and couldn't do anything about it.

So we knew both teams were going to hit that one. It was probably their longest wait state. And what we did there was an A/B test. They each had their own idea how they wanted to fix the problem. But mainly, they federated back. They took control of the scripts, they took control of the running, and they went from a 90-day process to a two-hour process where they're in complete control.

So just that one change alone had dramatic effects, which Jim will talk about in a little bit.

But anyways, that's how we organized everything. It made sense to our senior leadership. With the handbook there, we didn't have to say we're making this up. We always were grounded in things that we knew the industry had already figured out, and so it just sped up the whole process in trust and empowerment.

Okay. So Jim talked about we're also focused on the developer experience, and this is not something we had initially. He talked about our guiding principles initially, and developer experience wasn't part of it.

But what happened was when we were having these show-and-tells, the developers kept showing us that the developer experience was really what they were fine-tuning. So they love all the new tools and capabilities we brought to them and adopted them all, but they showed us all the context switching they had to do in the course of a couple of hours in order to use all the tools.

So right away they asked us for Slack, so that they can implement ChatOps and start automating that interaction with all these tools. And we ended up going with Rocket.Chat, which Jim will explain later why we did that.

But when we did that, we had to explain to our senior leadership why we wanted to add one more tool to the mix. So we get a lot of support when we go to our senior leaders with, "We want to take a tool out, we want to consolidate down, or we want to do a direct replacement that saves us money." But anytime we say we want to add a tool, that causes problems.

So we had to really paint the picture for them of what was going on with the developers. And I drew this so many times on a whiteboard, I finally drew it in a PowerPoint so that I didn't have to draw it anymore.

But I basically took them around and said, "Just to do one change, these are all the tools that they have to touch."

And we explained that basically when you add that new tool, Rocket.Chat in our case, that it really simplifies the developer experience. They still get all the benefits of the best-of-breed tools, but they don't have to interact with them directly. So they're really just writing code in whatever their favorite IDE or text editor is. They're using GitHub, have the collaboration, the pull requests. A lot of the static code analysis gets done there, and then everything else is an integration with a bot in Rocket.Chat.

And going back to what Jim said about show-and-tells, watching these teams show our CIOs all the steps that they had to do before they had Rocket.Chat, and then showing them how a bot works and eliminates all of those clicks.

One of our tools, I think, not to dig on UrbanCode too much, but I think there's 20-some clicks to do a deployment in UrbanCode. And then they go to Rocket.Chat, they just say, "Deploy master to IT," and it's done in a few seconds.

And so that show-and-tell was so powerful, and it was one of the things that gave us the support to add another tool.
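As an illustration of the chat command described above, here is a minimal sketch of how a bot might turn a message such as "Deploy master to IT" into a structured deployment request. The command grammar, the `DeployRequest` type, and the `parse_deploy_command` function are all hypothetical names chosen for this example; the talk does not describe the actual bot implementation, and nothing here uses the Rocket.Chat or UrbanCode APIs.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed command grammar (not from the talk): "deploy <branch> to <environment>".
COMMAND_PATTERN = re.compile(
    r"^\s*deploy\s+(?P<branch>\S+)\s+to\s+(?P<env>\S+)\s*$",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class DeployRequest:
    """Hypothetical structured form of a chat deployment command."""
    branch: str
    environment: str


def parse_deploy_command(message: str) -> Optional[DeployRequest]:
    """Return a DeployRequest if the message matches the grammar, else None."""
    match = COMMAND_PATTERN.match(message)
    if match is None:
        return None
    # Environment names are normalized to upper case so "it" and "IT" are equal.
    return DeployRequest(
        branch=match.group("branch"),
        environment=match.group("env").upper(),
    )


request = parse_deploy_command("Deploy master to IT")
```

In a real bot, the parsed request would presumably be handed to whatever deployment tooling sits behind the chat integration; that step is left out here because the talk does not describe it.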

Jim Grafmeyer

So we're running all these experiments, and word's getting out that something cool is happening in the DevOps area. And we're getting a lot of teams coming up to us and saying, "Hey, I want to get involved. How do I go on this journey myself?"

And this is what we came up with. This is our one-pager that we hand teams, and we tell them, "Hey, if you want to go on your DevOps journey, this is what it's going to look like."

And it's a mountain metaphor on purpose, because we know it's going to be hard for these teams, and we can't tell all 200 teams, "Here's how you do it." These are what we think the camps will look like as they move up the mountain, the base camps that their journey will look like.

We'll dig into a few aspects of this in the coming slides. But the idea is that we've organized the practices that we expect a team in the north camp, for example, to be practicing.

Cindy Payne

Can you back up one slide?

Jim Grafmeyer

Yeah, I can back up one slide. Yeah.

Cindy Payne

Thank you.

Jim Grafmeyer

Anything for you.

Okay, so this is the journey slide. Let's dig into a few things on it.

In the top there, we had a compass that kept us grounded on what our true North Star was. And for Nationwide, that North Star was lead time.

You'll notice that we have all the DevOps metrics called out around the outside: mean time to recovery, deployment frequency, change success rate. We've really rallied around lead time for one really good reason: we can measure it across the enterprise.

So one of the advantages of having consistency across 200 lines and doing this at enterprise scale is everyone's in the same tooling doing their Agile card management. And we can easily produce metrics at the enterprise level, at the business unit level, at a team level of what their lead time is, including process times and wait times.

So when teams are going on their journey up the mountain, they can quickly see each iteration: did I change my lead time? Did I make things faster? We're not relying on anecdotal evidence anymore. We have a frictionless metric that's created for the teams automatically.
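To make the lead time metric concrete, here is a minimal sketch of how lead time could be split into process time and wait time from a card's stage history. The `StageInterval` type, the stage names, and the dates are hypothetical and invented for illustration; the talk does not describe how the card management tooling records or computes these values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable


@dataclass(frozen=True)
class StageInterval:
    """Hypothetical record of one stage a card passed through."""
    stage: str
    entered: datetime
    exited: datetime
    is_wait: bool  # True if nobody was actively working the card in this stage


def summarize_lead_time(intervals: Iterable[StageInterval]) -> Dict[str, timedelta]:
    """Sum process time and wait time; lead time is their total."""
    process_time = timedelta()
    wait_time = timedelta()
    for interval in intervals:
        duration = interval.exited - interval.entered
        if interval.is_wait:
            wait_time += duration
        else:
            process_time += duration
    return {
        "process_time": process_time,
        "wait_time": wait_time,
        "lead_time": process_time + wait_time,
    }


# Invented example card, from hitting the backlog to ready for deployment.
example_card = [
    StageInterval("backlog", datetime(2017, 1, 2), datetime(2017, 1, 5), True),
    StageInterval("development", datetime(2017, 1, 5), datetime(2017, 1, 9), False),
    StageInterval("awaiting performance test", datetime(2017, 1, 9), datetime(2017, 1, 16), True),
    StageInterval("testing", datetime(2017, 1, 16), datetime(2017, 1, 17), False),
]
summary = summarize_lead_time(example_card)
```

Separating wait time from process time is what lets a team see that a queue, such as a long wait for a performance test, dominates its lead time rather than the work itself.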

At the very base of the mountain, there was what we call outfitters, but really tools that we expect you to have before you start your journey. You'll notice nothing here is too surprising. Rocket.Chat is the one that surprised at least me when we started the year.

Just a brief note on Rocket.Chat. So it's a free, open source, I'll call it Slack competitor. But the key for us is it lives on-prem. And since most of our infrastructure is on-prem and we were wanting to orchestrate and automate a lot of that infrastructure, something like Rocket.Chat eliminated all that friction.

And they provide a Docker image, and you can spin it up in about 30 minutes, and it was extremely easy to get set up, and teams immediately embraced it.

We also have the handbook on the outfitters, the only non-tool up there. We are expecting the teams that are going on this journey to not only read the handbook, but come together as a team and have a book club at least once a week and discuss the chapters that they read.

So when you have 200 teams, we know that some are going to be at various maturity levels. And we needed to have some sort of support model to help the teams of lower maturity. And we came up with this idea of Sherpas.

So Cindy's team and Carmen's team have groups of Sherpas that can help out and get embedded with a line for a given period of time to help uplift their process or their mindset or help drive cultural changes.

And then coming in 2018, we're going to have some experiments around immersive pairing. So taking members from a less mature team, embedding them on a more mature team, and vice versa, to really let people see that things can be done differently.

And then, of course, monthly dojos, and then Cindy will talk about this whole delivery pipeline support, which is a continued pain point for us.

Cindy Payne

So what? So all of this great DevOps work and progress that we've made in the last year at Nationwide, what's it done for our business?

And I wish we had a business partner up here, and that's one of my goals for next year. But this year, I'll tell you the business story for them.

So less than a couple of months ago, as I'm sure you all are aware, there was Hurricane Harvey, along with all the other hurricanes and fires and whatnot. Nationwide has been on your side, paying claims.

And so Hurricane Harvey, in particular, was predicted to be a wind event, or at least that's what we thought. And what ended up happening was a flooding event, if you remember.

So right away, our business realized that our online claims experience was not optimized for flooded cars, which is what the majority of the incoming claims were going to be. And so very quickly, they went to IT and said, "We really would like to optimize the online claims process for flooded cars."

Well, that's the first thing that's changed. The business went to IT the same day and asked for a change. A year ago, I do not think that that was possible.

The second thing that wasn't possible is that they would have actually gotten the change in time to make a difference for a Harvey claim, for our members that had Harvey claims.

So in this situation, the IT and business collaborated and redesigned the online experience to remove about 40% of the steps needed for putting in claims for flooded cars. That took about an hour. And then the next seven hours was the IT department hitting multiple applications that needed changes, and making all those changes happen within seven hours and releasing them with zero downtime deployment by the end of the day.

And this all happened during the business day. Nobody worked overnight, nobody worked outside their hours. And by the end of the day, they had the optimized claims process for our members, which is really what this is all about, and we try to keep in mind. And we did keep the goal of speed in our head all year, and I think this is the payoff, that we get to tell that story.

And a year ago, a change like this would have taken us probably 90 days.

Jim Grafmeyer

Yeah, absolutely.

We saw the model lines, even though they were some of our best teams to begin with. They were about a 90-day lead time, the lead time metric that we talked about earlier.

Just that performance testing change alone let them cut their lead time in half. And then they hit other bottlenecks and worked through those. So down to the point where they were able to design the change, and implement the change, and deploy the change in an eight-hour workday with not a lot of stress.

Cindy Payne

So the final slide: what we still need help with.

So we talked a lot about how the developer experience kind of came to us as we were watching the teams tell us their show-and-tells, and tell us what they were fine-tuning and speeding up. And one of the things I think that I've realized, that we've realized, is that we don't do the same thing for our leaders.

So when you're managing an IT shop at scale, you're trying to get real-time metrics to leaders so they can take action. Right now, Carmen loves to say that our senior leaders are still in the newspaper business. Once a month, they publish an ops review doc and try to make decisions off of that.

But we have this lead time metric that we've been able to deploy across all the 200 lines, and we have real-time data. So why aren't we providing our leaders with the real-time data? Why do we publish a monthly report on lead time?

So that's just an example. So if anybody's doing that really well, if anybody has optimized their leader experience as much as they've optimized their developer experience, we would love to hear those stories.

The next one is our matrix org model. So Nationwide is a matrix organization. Although, after reading A Seat at the Table, Mark Schwartz's book, which I recommend to everybody, I would say we're probably a sub-optimized matrix organization.

We oftentimes get out of balance with our horizontal thinking, and we let our project-- We're still very much subjected to project funding and project decisions. We let that overwhelm our horizontal thinking.

So if anybody's in a matrix org model, I know there's books written about this, but we'd really love to hear from you how you help balance that a little bit better.

And then finally, a little plug for Carmen and Mik Kersten's talk, which is on Wednesday, about the case for value stream architecture. I think it's still a place where we struggle.

Jim mentioned why we struggle, because we still have our pipeline spread across four CIOs and six different application owners. So we're working really hard to solve that problem, to figure out how to be able to drive our value stream architecture internally, but I'm really looking forward to hearing if anybody's really figured that out, especially in a complicated organizational model.

So that's what we have today. I appreciate your time and--

Jim Grafmeyer

We'll be around after.

Cindy Payne

I'll be around all week.

Yes. Thank you.