The Last Mile Continued - Incident Management

Las Vegas 2019

In the follow-up to Damon's 2018 "Operations: The Last Mile" keynote, this talk will examine incident management in the era of DevOps and SRE. Responding to incidents has always been the core job of Operations.

Today, the influences of DevOps and SRE are changing how Operations work gets done, and even who is doing the work.

This talk will look at how high performing organizations are applying DevOps and SRE practices to shorten incidents and reduce escalations. Less frustration for the engineers. Lower costs for the business. Everybody wins.

Damon Edwards

My name is Damon Edwards. I'm from a company called Rundeck. Just so you know, the slides are already online. There's a lot of content in the slides, so if you want to take a photo of this slide, you can. That's my Twitter if you want to tweet at me. I'll post the slides there as well, but you can get all the slides from that link, and then obviously they'll be in the conference SlideShare too.

How many folks saw my talk last year? Some folks did, good. All right, thank you. Thank you for coming and you're still here, appreciate it. My thesis last year was that operations is really the last mile. It's a thing we need to unlock the full value of these DevOps transformations. All this work we're doing, we'll not get the value out of it unless we unlock and transform how we do operations. I talked a lot about the issues we have with silos, with work queues, with excessive toil, low trust. That talk is online.

The thesis this year is really digging in a little bit deeper and going to talk more about incident management. Why? Because the ability to respond and resolve incidents is the true indicator of an organization's operational capability. This is where the rubber meets the road around how we can respond to and handle our incidents.

A little bit of a definition first, because there are a lot of definitions of what an incident is in the world, in fact maybe even in your own organization. I look at an incident as an unplanned disruption impacting customers, which is pretty obvious, or business operations. The customer side is pretty self-explanatory: outages, service degradation, kind of what we'd classify as incidents in the classic ITSM world. But I also look at it as unplanned disruption in our business operations: work interruptions, delay, waiting, short-notice requests. I use that last one as a euphemism for nobody letting you know until right when it's necessary. All those things impact people's work. That impact eventually bubbles up into delay at the delivery and business level. We end up putting off fixing technical debt, fixing problems that could prevent future problems.

To me, these disruptions in business operations end up being, in some way or another, customer impacting; it's just a little more diffused. Why would we separate these things out and say, "Let's talk about just the outages and service degradation as one thing, and then the rest of it is something completely different"? It's still the same people doing the work. We're still causing the interruption and a blast radius in the organization. That's my definition.

Quick show of hands: how many folks here actually work in operations? Anybody? Okay, keep your hand up. How many of you go more than an hour without somebody interrupting you, asking you to do something or something you didn't plan for? How many can go more than four hours? Take your hand down if I pass this. How many can go more than a week without somebody interrupting you? Anybody? Nobody? See, this is the world we live in. Those interruptions are just as important as anything.

The format of this talk: I'm going to talk about the life cycle of incident management and how I see a lot of high-performing organizations transforming and attacking this problem. There are a lot of people I'm going to mention or reference. They may not agree with everything I'm saying. I'm not implying they're endorsing this. And there might be people I left off. Please don't tweet at me.

Before we talk about that cycle of an incident, I want to talk about the context that we're living in, the stew that we're all marinating in, because I think this is very important to understand why we've arrived at where we're at. The first one: digital transformation. Nobody groan. Please, keep it to yourself. What's going on with digital transformation? There are a lot of definitions out there, probably more than DevOps. If you think about it, what's going on is this impulse from the board level, and I've seen enough of these communications from the board level down to the technology organization.

What are they really after? Number one, everything's got to be integrated. No more of this customer service agent does something on this machine, then this machine, then this window, then this window. They want to see it all integrated. They want to have one common system for things. They want things to be responsive. They don't mean web responsive; they mean that they want to be responsive to the industry, to the competition, to customer requests. They want to feel like the IT organization has that responsiveness, not that it just goes into a queue and sits there forever. They want it everywhere: on all the devices, on your mobile, on your laptop, in your Siri. And they want things always there. It's got to be on. The idea of maintenance windows in 2019 is an afterthought. All this is flowing down to the technology organization.

I think Cornelia Davis, one of the speakers here, does a great job in her book "Cloud Native Patterns" of breaking down what the digital transformation flow-down to the technology organization looks like. Digital transformation has driven us over to this idea of these cloud-native technologies, and there has been this explosion of new architecture, new platforms. John Willis and Kelsey Hightower are two people I follow that do a great job at deciphering what's real and what's not real. Obviously, Kelsey is a little more Kubernetes-focused than most, but still does a great job of bringing out the reality.

We've seen this explosion in new platforms. But really what that has enabled is this, I don't know if you've seen this yet, these Death Star diagrams. Cornell University does a great job of trying to do this visualization of microservices. This is actually for a popular online service. The complicated world of these microservices is exploding.

I think one of the best explainers of what this means to companies is Adrian Cockcroft. He did a great talk at DockerCon back in 2014 about how architecture enables speed. What we're after here is being able to decouple the organization. If you decouple the organization, they can move faster. Speed is that competitive advantage that is being driven down from that board level. At the technology level, we're pushing for that decoupling, we're pushing for that speed and that ephemeral nature of our infrastructure, which has led us all the way over to DevOps.

Well, that has driven the question: how are we going to manage these people? How are we going to take advantage of all this new infrastructure, these new capabilities? We're here at Gene's party, so might as well reference Gene here. He brought us the Three Ways, taking a lot of people's work and bringing it together and saying, "Look, it's about this flow, fast feedback loops." Now, with The Unicorn Project, we've got these Five Ideals. The reality is that a lot of that has been interpreted as it's all about dev, and it's all about the go, go, go. What about operations? Oh, that's the thing that's still burning in the background. That was the whole point of my talk last year.

What is that pushback if we're all go, go, go? I think that's really come in the form of the SRE movement. It's starting to provide that feedback from the system, to say what operations' response is to this go, go, go nature of the DevOps world. Ben Treynor at Google was kind of the first person to coin the term SRE and put together the first SRE organization, but it's really following patterns that a lot of cloud-native organizations are following. They really bring about these principles: SREs need service-level objectives with consequences; SREs have time to make tomorrow better than today; SRE teams have the ability to regulate their workload. It's all about that feedback: when everything's go, go, go towards ops, how do you provide that feedback to stay in control?
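To make "service-level objectives with consequences" concrete, here is a minimal sketch of an error budget check. The function name, the numbers, and the release-freeze consequence are illustrative assumptions, not something specified in the talk; organizations choose their own consequences.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget use for one SLO window (hypothetical helper)."""
    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures > 0:
        budget_used = failed_requests / allowed_failures
    else:
        budget_used = float("inf")
    return {
        "allowed_failures": allowed_failures,
        "budget_used": budget_used,
        # Assumed consequence: pause feature releases once the budget is spent.
        "freeze_releases": budget_used >= 1.0,
    }


# A 99.9% SLO over one million requests allows roughly 1,000 failures.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
print(report)
```

The point of the sketch is the feedback loop: the same number that tells dev it can keep shipping also tells it when to stop and spend time on reliability.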

Folks like Tom Limoncelli, Stephen Thorne, Liz Fong-Jones, Niall Murphy, all people I follow in this area, are doing a great job of surfacing what's special about this new world. And there's a third O'Reilly book called "Seeking SRE," which I'm going to plug because I actually wrote a chapter in it, so I think it's the best SRE book yet.

You start to put this together. What's actually happening here? Things like product not project, continuous delivery, shifting left, error budgets, toil limit, cloud-native technology, so on and so forth: it's really building this self-regulating system. We're breaking down our world, decoupling into these horizontal streams like Adrian was talking about, and we're really building self-regulating horizontal systems. John Hall from BMC is actually the first person to really point that characteristic out to me.

Whether you're in these pure cross-functional teams or you've still got a classic dev and ops organization, it's about building value-aligned, self-regulating systems and building shared responsibility models between people in "dev roles" or ops roles to balance out that work.

Now, let's compare this to what a lot of us grew up in, which is more of a traditional ITSM focus to the world, where everything was about a process, and the process got a process owner. That process had inputs, outputs, triggers, metrics. We assign some up-and-coming manager, saying, "This is your process. Here's your metrics, here's your triggers, here's your outcomes we want. Go manage that thing." And they, sharp elbows, are going to manage that thing to the best of their ability. They'll be the best firewall rule-changers west of the Mississippi.

Then you get something like ITIL starting in like '89. They clearly defined 26 of these, or formally defined 26 of these processes. Now they call them practices. On top of that, there is this notion of change authority, that there's some external body granting the authority for you to make change. Somebody else is going to tell you whether or not your change is going to be correct, or they'll say, "Just bring it to us, and we'll sort of give you advice on it." In general, the idea of authority is flowing from some place.

What happens here is, with that horizontal view of the world, we're unintentionally encouraging these silos. People are people. They want to achieve their KPIs and their OKRs, and we end up with unintentional silos that break the flow of work. On top of that, we're encouraging, and whether it's intentional or not is debatable, command-and-control management: this idea that an external source is going to save us from problems, is going to coordinate our change. If you think about the complexity, think about that Death Star diagram of microservices: how can anybody be that external authority?

Also, if we've all studied Deming, one of his main points is to cease dependence on inspection to achieve quality. External inspection has yet to achieve high quality; you get that by building quality controls into the system. What you see going on here is that this new way, these horizontal self-regulating systems based on DevOps plus SRE thinking and practices, is actually starting to replace and rebuild what we did in the traditional ITSM world. Fundamentally, I think they're actually quite incompatible, but that's another longer discussion maybe for a different talk. Rob England and Charles Betz, two faces probably familiar around here, may not agree with me on all this, but are also great people doing a great job of documenting how this world is fundamentally changing.

Moving along: we've got digital transformation driving our new architectures, and we've got a new way to run our people. One of the things we've realized is we've got this extremely complicated microservices architecture combined with the go, go, go speed. We're really living in a complex world. Paul Reed has done a great job of breaking this down to say that from the development side, people often think it's very deterministic: we know how NGINX works, and if there's a bug, we'll see it right there. But once you get into that Death Star diagram of microservices and understand the unpredictable user traffic going on there, we've moved into a complex system.

If you know anything about complex systems, it's that we can't perfectly predict what the behavior is going to be. We can't just break it down and say, "I understand how NGINX works. I understand how MySQL works. Therefore, I understand how this complicated system is going to work." We have to start thinking about it in terms of a complex world. People who work in operations, I'm sure all of you have felt this for a long time.

There's a seminal paper by Richard Cook. He's not a technologist; he has spoken here before. He's an anesthesiologist but a famous researcher nonetheless. He wrote this great paper in the late '90s, "How Complex Systems Fail." I highly recommend everybody look it up and read it. It will definitely, I think, change your mind if you're not already on board with this idea that you cannot stop failure happening in a complex world. You can only cope with it.

A great thought from Charity Majors, who has a very pithy way of saying things, is that distributed systems have an infinite list of almost impossible failure scenarios. With hindsight bias, we always say, "We should've seen that coming." But the reality is, it's almost impossible. And it's never going to happen again. As you talk to people in organizations who have spent a lot of time in resilience engineering trying to improve their operations, it only gets weirder as you go.

That brings us to this idea of safety science and resilience engineering and how that's starting to influence our world. Dr. Woods, Dr. Cook, Sidney Dekker: these are folks who are famous in the real world and they study aircraft disasters, nuclear power plant incidents, healthcare disasters. There are decades and billions of dollars of research and time and effort that have gone around the world into these domains, and they're now bringing them into our world.

John Allspaw, one of the people really responsible for bringing this line of thinking into our world, talks a lot about this. The reality is there is above the line and below the line. Above the line is all the things that we think we're doing. We see the people, we see abstractions. But the reality is the real system is underneath those abstractions, and we actually can't see it. We can never really get to it. It's just there. All that we have is an idea in our head of what that actually is. My idea of what it is and your idea of what it is, even though we think we're talking about the same thing, is probably very different.

The only way we can have a hope of managing these systems is to worry about the interaction between the people who have to work on the systems so we can learn together, so we can stay on the same page. It's about the people, just like it's about the people flying the airplanes, about the people operating in the operating room. The human management side is the difficult part, and this above-the-line, below-the-line metaphor is great to understand why.

More folks to follow: there's Paul, who runs a conference called REdeploy. I call it more of a gathering. It is now the epicenter of the resilience engineering folks who have taken these ideas from the broader world, from high-consequence domains, and bring them to managing the complex world that we live in. They hate slogans, so I'm going to put the stuff they're talking about in bumper stickers just to make them mad.

They talk about things like there is no root cause. That's just a political distinction. Human beings like to draw things in a straight line. We like to cast blame somewhere. Going back to the earliest times, there was always the idea of an act of God. We want to cast blame in some place when the reality is, where we stop is not actually a root cause. There are still many other contributing factors going into that. In the same way, they'll poke apart why the Five Whys doesn't achieve what we want to achieve.

There is a new idea of Safety-I and Safety-II. In the old world, Safety-I was we study the problems; that's how we stop future problems. We keep worrying about what went wrong and figuring it out. In the Safety-II model, they flip it around and say, "What went right?" The reality is that the same things that people do day in, day out that make the business run, that same effort, the same activities people do in a slightly different context, a slightly different combination, cause disaster. If you don't understand why your systems actually work, because it's probably a miracle that they often do in the first place, then you're not going to understand why they don't work.

I love this idea which is incidents equal unplanned investments. The question is just what's the ROI that you're going to get out of it?

You put all this together, and where we're going with this is it's about elevating the human. How do we get more of the Iron Man model, where we know we have to put the human, and how do we support the human, and how do we give them the tools that they need to get their job done? Not this idea that we're going to build robots that are somehow going to run these complex systems. Trust me, those other domains have spent billions of dollars and decades trying to get the human out of the operating room, the human out of the airplane, the human out of the power plant, and they haven't figured out a way to do it. We're probably not going to solve it on our end.

You also see the other movement: this idea that ops work doesn't have to be miserable. On-call doesn't have to be miserable. How can we focus on elevating the human? They are our best assets. How do we stop burning them out? A very good conversation happened on this stage. Christina Maslach, who John Willis and Gene Kim brought into this world, another world-famous researcher, actually goes on 60 Minutes and talks about human burnout. She has really identified what burnout comes from, what the causes are. It's not just overwork. There are a lot of other contributing factors. She has really brought a lot to this domain. Folks like Jane Groll are trying to identify and highlight who are the humans behind this and how do we elevate them.

Why we want to do this: this came from a recent S-1 I read. There are 18 million IT operations professionals on the planet. That includes networking and everything. And there are 22 million developers. How can we make all their lives a little bit better?

That was the stew, the context that we're living in. Now let's actually talk about the cycle of an incident. I broke it down into three areas so we can see what people are doing: observe, react, and learn. If you notice, this feels a lot like an OODA loop. If you don't know what that is, OODA loop was originally devised by Air Force Colonel John Boyd, very famous in the warfighter community, really talking about that all tactical activity involves observing something, orienting yourself to what you're seeing, making a decision on what you're going to do, and then acting. How fast you can go through that is how effective you can be in operating. In their scenario, it came out first with airplane dogfighting. If you can go through your OODA loop faster than someone else goes through their OODA loop, you're going to win the dogfight.

It's a really fascinating field to look into. This is actually the first drawing that he ever did of the OODA loop. It's so famous in the military world, it's featured in the Marine Corps Museum as a founding document. I'll put the little orienting aside on this map there for the OODA loop purists.

Let's talk about what's going on on the observe side. Monitoring: this one we've known for a long time. Spotting the knowns. How do we set the traps to look for the problems that have happened in the past? But talking about complex systems, talking about this infinite number of failure scenarios, how can we look for those patterns? We can't. That's where this field of observability is coming in, which is really about interrogating the unknowns. How do we look at the unknown events? How do we look at the activity of our systems and break down and try to use the human to figure out what is actually happening?

It breaks down into three parts. Number one is logging. That's the event. We need a record of the event. There are metrics, which are data points over time. We've lost all context of the event, but we can know, is this number higher or lower than it was before? It's an important leg of the three-legged observability stool. And the third one is tracing: those events in the context of a single request. How do we look at all the events that happen in the context of a single request from the human perspective? Charity Majors and Adrian Cole are two folks that I highly recommend following in this particular area.
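As a rough illustration of those three legs, the sketch below has one request handler emit all three signals using only the Python standard library. The names (`handle_request`, `events_for`, the in-memory stores) are hypothetical stand-ins for a real logging, metrics, and tracing stack, not any particular product's API.

```python
import json
import time
import uuid
from collections import Counter

log_lines = []       # logs: a record of each event, with its context
metrics = Counter()  # metrics: numbers over time, with the event context dropped
spans = []           # traces: timing for work done within a single request


def handle_request(path: str) -> str:
    """Handle one request and emit a log record, a metric, and a trace span."""
    trace_id = uuid.uuid4().hex  # ties every event back to this one request
    start = time.monotonic()
    log_lines.append(json.dumps(
        {"event": "request_received", "path": path, "trace_id": trace_id}))

    # ... the real work of the request would happen here ...

    metrics["requests_total"] += 1
    log_lines.append(json.dumps(
        {"event": "request_completed", "path": path, "trace_id": trace_id}))
    spans.append({"trace_id": trace_id, "name": "handle_request",
                  "duration_s": time.monotonic() - start})
    return trace_id


def events_for(trace_id: str) -> list:
    """Pull every logged event that belongs to one request."""
    parsed = (json.loads(line) for line in log_lines)
    return [event for event in parsed if event["trace_id"] == trace_id]


checkout_id = handle_request("/checkout")
print(metrics["requests_total"], [e["event"] for e in events_for(checkout_id)])
```

The counter can say whether traffic is higher or lower than before but nothing about any one request; the shared `trace_id` is what lets a human ask questions about a single request after the fact.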

There is a new kid on the block in the observe world: automated governance. In the enterprise, we can't forget about the controls we need to put in place and to make sure they're being followed. It's an emerging thing. John Willis has been working a lot on this. There was a DevOps Enterprise Summit forum. A lot of people got together and wrote a paper on this. The idea is, with monitoring, observability, governance, how do we put that in the hands of everybody? No longer that it's in the hands of a few for them to have isolated views of these things, but how do we diffuse these three facets of observation, of visibility, and spread them to everybody so they can actually take action?

Moving along on our cycle, we've gone from seeing what's going on, now we've got to start orienting ourselves to it and make a decision about what we're going to do. First step here is incident command: mobilizing, coordinating, communication between the people. A lot of this starts back with something that came from the government side again, the Incident Command System. Now it's under the auspices of FEMA. There has been a lot of study of how you mobilize, communicate with, and coordinate human beings to go resolve an issue.

One of the first people in our community that brought this in was Jesse Robbins. They used to call him the master of disaster; he ran the early game days at Amazon and was one of the big proponents of we have to break things to learn from things. Guys like Brent Chapman, now over at Slack, are doing the same thing. Ernest Mueller is another good influence in this area, as well as the folks at PagerDuty. They've started to document and open source in this GitHub project their incident response plan, all based on this incident command system. I put Matt Stratton in there because I think he does a great job of explaining to the world. I'm sure there are other people who are also doing a lot of work there.

The next thing is, we're going to be mobilizing our people. Who are we mobilizing? There is this kind of split that we see going on, this kind of divide and conquer. Operations first started to get very blurry. This is our T-shirt from the very first DevOpsDays Mountain View, the first one in the US. Andrew Shafer, it was one of his ideas, which was ops who are devs who like devs to be ops, who do ops like they're devs, who do dev like they're ops. If you were a teenager in the '90s, you know what the refrain is after that.

We saw this blurring, and now we see this division that's starting to take place, where organizations say, look, we're going to take what was traditionally the operations domain and split it into two distinct capabilities. One is platform engineering, which looks more like a product or development team, and that's a centralized organization. Then the people who operate the systems are going to be a distributed function. Some people call it SRE, some people call it other things, but there's the distributed function that is embedded in all the teams that is doing the actual running of the systems. Platform engineering is centralized.

If you follow the folks at Disney who are presenting here, I think today, they're doing a lot of work in this area. On the other side of the world, Sean Norris, who was at JPMC, Standard Chartered, now at Pivotal, was also driving large-scale operations in the financial world towards this way too, this divide and conquer.

We've got our people. We've got their incident command system. We're motivating them, we're moving them towards doing something. I think it's interesting to see the view on escalations. There are two themes here. One is avoid them at all costs. How do we push control closer to the people who first spot and respond to the problems? Jody Mulkey, former CTO of Ticketmaster, did a lot of fascinating work there. Took their MTTR from like 40-something minutes for major events down to four minutes. It's all from pushing control closest to the problem instead of having to escalate up through that.

John Hall again comes back in here. I think he does a really interesting job of bringing this idea of swarming that came out of the traditional human call center service management world and saying, instead of having these escalation trees, build organizations that have capabilities to swarm so you make sure that you're getting the problem to the right person as soon as possible, dramatically cutting down on these escalation chains.

Then comes along the return of runbooks. That's something I know a lot about. Being able to take action, diagnose things, restore your services. Runbooks kind of disappeared for a while. They were big in the enterprise. The configuration management world told us there was going to be NoOps; we didn't need runbooks. But now, thanks to the SRE movement, runbooks are back, and specifically runbook automation. My colleague Alex Honor and I focus a lot on this: how do you give safe self-service access to the expert knowledge that you need to take action?

The knowledge part: it's easy to move the bits. You've got the scripts, you've got the APIs, you've got the command line. But how do you take that knowledge out of those subject matter experts and formalize it and distribute it in the organization so those closest to the problem can use it? It's got to be self-service. Again, we have to empower those closest to the problem. We have to get rid of those escalations. And it's got to be safe, not only safe from the standpoint of mistake-proofing it as much as we can, to hand it off, to make smart choices, to have guardrails for people not to make problems, but also safe from the auditor and compliance perspective as well.

Before runbook automation, no matter how good we were at that first half of the cycle, one of three things happens. Either we're trying to decipher the wiki: is this thing right? When was this written? What's this person trying to say? Or we're doing ad hoc tool and script usage: what did they tell me on Tuesday? It's not dash I, it's dash E, or is this even the right version of the script? But most likely what we're doing is escalating on up. Someone else is going to be able to solve this problem.

With runbook automation, we're empowering those people closest to the signals to actually go ahead and take action by taking that knowledge and turning it into automation that can coordinate all of the incantations of the scripts and the tools and the APIs that we need to do it.

As an illustration, the old way is cartoonish, obviously, but at level one, there's a problem with that service, so let's call the SRE or the on-call for that service. They go, "This is a problem with the application." They call the developer on call. They go, "There's no data. It must be a database problem." They call the DBA. The DBA shows up and goes, "This is a network problem." We're spending all this time escalating up, and it's probably this larger blast radius than all of that. While that's happening, we're injecting all of this delay into our organization, taking people off of the other work they should be doing.

With runbook automation, how do we empower that level one with all of the knowledge of those different subject matter experts to first diagnose that problem, and if they can't solve it right there, know who to escalate to? Better yet, in the enterprise, problems happen over and over and over again. So we can give them the checks to see if a known problem is present. Run all these checks. And for the one that matches, here's the action you can take to repair it. We see people talking about 80 to 90% reduction in the time it takes to resolve these known incidents that happen over and over again.
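The check, fix, or escalate flow described above can be sketched in a few lines. This is a toy model under stated assumptions: `KnownProblem`, `run_runbook`, the example checks, and the on-call names are all hypothetical, and the "fix" only mutates a dictionary where a real runbook would call scripts, tools, or APIs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class KnownProblem:
    name: str
    check: Callable[[dict], bool]           # True when this problem is present
    fix: Optional[Callable[[dict], None]]   # None: no safe automated fix exists
    escalate_to: str                        # who to call if there is no fix


def run_runbook(state: dict, problems: List[KnownProblem]) -> str:
    """Run every diagnostic; apply the known fix or name the right escalation."""
    for problem in problems:
        if problem.check(state):
            if problem.fix is not None:
                problem.fix(state)
                return f"fixed: {problem.name}"
            return f"escalate to {problem.escalate_to}: {problem.name}"
    return "unknown problem: escalate to incident commander"


def clean_old_logs(state: dict) -> None:
    # Stand-in for a real cleanup procedure written by the subject matter expert.
    state["disk_used_pct"] = 40


PROBLEMS = [
    KnownProblem("disk_full", lambda s: s["disk_used_pct"] > 90,
                 clean_old_logs, "storage-on-call"),
    KnownProblem("db_unreachable", lambda s: not s["db_reachable"],
                 None, "dba-on-call"),
]

print(run_runbook({"disk_used_pct": 97, "db_reachable": True}, PROBLEMS))
print(run_runbook({"disk_used_pct": 50, "db_reachable": False}, PROBLEMS))
```

Even when no fix is available, the first responder leaves the runbook knowing which expert to call, which is what shortens the escalation chain.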

Also, we're stopping those interruptions. All those times that the subject matter experts are being interrupted, we're putting two painful things in the organization. One is interruptions and the other is waiting. Someone has to now delay and wait for somebody. So you're being interrupted all day long, and then it comes time to go do something, what happens? You're now waiting in somebody else's queue for somebody else to do something.

Another area where runbook automation is used very successfully is providing people with self-service so you don't have to constantly be in that chain. Now that we're getting into the DevOps world, and we've got cross-functional teams, we've got to let developers do restarts in production: build and run teams, you build it, you run it. How are we actually going to do that? With runbook automation, instead of saying, here's an SSH key and some sudo privileges and a shell script, and say a prayer and have a good time, we can give them named access to these particular procedures.

That's how we're going to get around these security and compliance issues, because we're able to run it through an SDLC. We're able to do a code review. Operations and security can say, yes, this is good. Then let's use the access control to turn it around and let somebody else do it. Earlier this week, Bhavik Gudka from Capital One was talking about the custom system that they built. It does a very similar thing. The whole idea of runbooks-as-a-service is to take all this and decentralize it because they want to do two things. One is they want to rapidly know, is this a known problem? Let's try the known fix. If it's not, let's figure out as fast as possible, through the automated system, how to call the right diagnostics so we know who to escalate it to.
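One way to picture "named access to particular procedures" is a small dispatcher that only runs reviewed procedures, checks the caller's role, and records every attempt for the auditors. This is a sketch under assumptions: the procedure, the roles, and the in-memory audit log are made up for illustration and do not describe Rundeck's or Capital One's implementation.

```python
from datetime import datetime, timezone


def restart_web() -> str:
    # Stand-in for a reviewed, version-controlled procedure.
    return "web restarted"


PROCEDURES = {"restart_web": restart_web}      # only reviewed procedures live here
ACL = {"restart_web": {"developers", "sre"}}   # roles allowed to run each one
AUDIT_LOG = []                                 # every attempt, allowed or not


def run_procedure(user: str, role: str, name: str) -> str:
    """Run a named procedure if the role is allowed; always record the attempt."""
    allowed = name in PROCEDURES and role in ACL.get(name, set())
    AUDIT_LOG.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "procedure": name,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"role '{role}' may not run '{name}'")
    return PROCEDURES[name]()


print(run_procedure("dana", "developers", "restart_web"))
```

The developer never receives an SSH key or sudo rights; they get the right to invoke one named, reviewed action, and the audit trail shows who ran what and when.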

Wrapping up, we've got the learn part. Usually that circle, the OODA loop, is the top half there, the observe and react. But after the incident comes the learning. You should look at a lot of things that John Allspaw has been talking about. He talks about how the problem is, in many enterprises, we think the value is the action items. Some people are like, "I'll send an email ahead of time. Maybe we'll begrudgingly get together to talk about the postmortem. Maybe we'll do a report." And all the execs want to know is what are the action items. The reality is, there's very little value in that. You're probably going to actually create more problems that will cause more outages in the future and not really get at the real problems.

The real value is in the journey along the way: the focus on the learning, the storytelling, understanding all the contributing factors, changing people's minds towards it's not about the outcome of the action items, it's about that collective understanding so we have a better understanding of how our above-the-line action helps the below-the-line action.

Again, why? Because these incidents are unplanned investments, and the ROI is up to us. The money's already being spent. The money's already being blown. How can we make that money less, but also how can we get the most value out of it for the organization? It's an investment in the survivability and the future of your organization.

To recap: don't forget all the things that we're stewing in. A lot of people jump to talking about we're just going to fix these, just change how we do incident management. The reality is, if we don't bring people through the path of how we got to where we are and what the contributing factors are in our industry, it's going to be harder to just jump to the answers. Then look to what a lot of these organizations are doing across these different parts of this cycle.

My name's Damon Edwards. That's my talk. Again, the slides are there. You can hit me on Twitter anytime you want, or just email me directly. I will be at the Rundeck booth the rest of the day over in the exhibit hall if you want to talk about any of these things. Enjoy lunch. Thank you.