Findings From The Field: Two Years of Studying Incidents Closely

London 2020

Principal/Founder · Adaptive Capacity Labs, LLC

In the past two years, we have had the opportunity to observe and explore the real nitty-gritty of how organizations handle, perceive, value, and treat the incidents that they experience. While the size, type, and character of these companies vary wildly, we have observed some common patterns across them.

This talk will outline these patterns we've discovered and explored, some of which run counter to traditional beliefs in the industry, some of which pose dilemmas for businesses, and others that point to undiscovered competitive advantages that are being "left on the table."

These patterns include:

- A mismatch between leadership’s views on incidents and the lived experience that engineers have with them.

- Learning from incidents is given low organizational priority, and the quality and depth of the results reflect this.

- “Fixing” rather than “learning” - the main focus of post-incident activity is repair.

- Engineers learn from incidents in unconventional ways that are largely hidden from management (and therefore unsupported).

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Without doubt, almost everyone in the DevOps community knows the name John Allspaw. In fact, if there were a starting gun in the DevOps movement, it was certainly the famous talk at the Velocity Conference in 2009, when John Allspaw and Paul Hammond talked about doing 10 deploys a day at Flickr.

I finally got to meet John Allspaw in person at DevOpsDays in Mountain View in 2010. It is a meeting I will never forget, and over the years I learned so much from him. He got his master's degree from Lund University back when he was the CTO of Etsy. His advisors included the famous Dr. Richard Cook and Dr. Sidney Dekker, famous for their contributions to the safety and resilience engineering community. He now works with both of them as partners at Adaptive Capacity Labs.

He told me a couple of months ago just how much he has learned working with them, taking his colleagues' expertise to study incidents deeply, such as what Erica just shared. What dazzles me about John's work is that he believes that by understanding how organizations handle incidents, we get incredible clues on how organizations learn.

I cannot think of a better presentation to follow Erica's presentation than John Allspaw, who has contributed so much to the space. Please welcome John.

John Allspaw

Before I start, I want to make a few things clear. The first thing is that these are only a few of the most common patterns, and they are across organizations that we have come into contact with. That means both clients and non-clients. And the second is that these are not judgments or comments on any given single organization, or client or non-client. I just want to make sure that we are clear on that. These are observations that seem to have some valid support.

Here is the summary slide, the bottom line up front. First, the state of maturity in the industry on learning from incidents is low. Really low. I have been studying this for a number of years with my colleagues, and I think that we in the industry, for the most part, believe that this is a solved problem and that we are quite good at it. This could not be further from the truth. I would say that a significant part, and I will go into this later, is that tech leadership in the industry fundamentally misunderstands what it means to effectively learn from incidents. Practitioners as well, but there is a gap there.

Second, that gap exists. There is a gap between how technology leaders and hands-on practitioners understand incidents, understand what they mean, and understand how to learn from them in a progressive and effective way.

Third, learning from incidents is given low priority relative to all of the other things that businesses are doing. This results in a really narrow focus on fixing rather than learning. This has a number of follow-ons or cascade effects that tend to be not good for business.

Lastly, there is really an overconfidence in what these shallow incident metrics mean. A great deal of energy is wasted on tabulating them and producing reports on them. These metrics are frequently gamed. They do not have any predictive value. They are about as useful as counting lines of code. When it comes to shallow incident metrics, we think we are doing astronomy, but we end up doing astrology instead.

Let's talk about this gap. The gap between technology leaders and hands-on practitioners is one of the more prevalent of these patterns, and it is a bit concerning. It shows up in three places: what is actually learned versus what we think we learn; how learning actually takes place versus how we think learning is supposed to happen, along with all of the programs we might put in place to support that; and what the incident actually means for the business, not just for right now, but for anticipating future incidents and vulnerabilities.

I want to be clear on this. When I say technology leaders, I mean people who are at the upper echelons, hierarchically, in an organization, and hands-on practitioners who have their hands in the guts of the technology. The reason I am making this distinction between tech leaders and practitioners is to draw attention to something so fundamental that we often do not acknowledge it.

In research in human factors and resilience engineering, technology leaders, those in positions to provide resources, policies, and compliance or rules, are known as the blunt end. This is in contrast to what is known as the sharp end, the people whose day-to-day work is about evolving and maintaining this technology. Practitioners are charged with the responsibility of designing and evolving technology in an organization, and leaders are charged with providing resources and policies that are supposed to at least support and enable practitioners to live up to that responsibility. But they are quite distant.

When it comes to incidents, technology leaders tend much more often to look at summaries and simplifications or abstractions, certainly lots of statistics about incidents, very rarely what makes incidents difficult. The people furthest away from the day-to-day details need to understand what it means to cope with the complexity at that sharp end, at least enough to support those hands-on folks effectively. There is somewhat of a dilemma there.

When we talk about incidents and having a close understanding of what is going on, how all of the different parts interact, and components and subsystems and behaviors, hands-on practitioners understand because it is part of their daily work. They might be on call. They might be charged with fixing bugs or anticipating new bugs. They have a much more almost palpable sense of the health of all of the myriad stuff in a much more organic, living fashion.

Technology leaders, to some extent, might have a little bit of an idea of what is going on in that stuff down below. But for the most part, these are generally simplifications. They wipe away a lot of what makes incidents difficult so that it can fit into a neat story. As a result, they tend to retreat back into a world that is nice and linear and falls into columns and fits in an Excel spreadsheet.

I am going to address technology leaders for this bit, and then I will talk about practitioners. The first pattern we see is that leaders are typically far away from the messy details of incidents. They may have begun their careers as hands-on practitioners, but that history frequently leads them to overestimate their ability to understand the real details.

Another pattern is that technology leaders frequently believe that their presence and participation in incident response channels like bridges, chats, and conference calls has a positive influence. I can tell you this is absolutely not the case. It is not guaranteed to be negative, but I can guarantee that nowhere have we seen it turn out to be uniformly positive.

Technology leaders also typically believe that incidents are adverse events that exist in an otherwise quiet and healthy reality, meaning that if you just do not touch the system it will be fine, or that incidents are these sort of epiphenomena. But incidents, as we know, represent very important signals to pay attention to. To dismiss them as defects, as some sort of anomalous situation where normal is zero, misses the point. There is no vision zero in modern software systems.

Technology leaders also typically fear how incidents reflect poorly on their performance as leaders more than they fear practitioners not learning effectively from them. There are a number of reasons for this. Certainly incentive structures can encourage leaders to put much more priority on how they look from a political standpoint in the organization than on supporting effective learning at the sharp end, the end that is further distant from them.

Technology leaders believe these abstract incident metrics tell enough of a story for them to understand the state of the system, when as it turns out, they do not. They typically believe that these metrics reflect more about their team's performance than the complexity those teams have to cope with. If there has been a rash of incidents, one common question is whether they are all happening within this team, or whether there is a common individual, group, or part of the organization, centering it on people's performance and not the system's complexity. Finally, technology leaders typically believe that all of these observations do not apply to them.

When I say abstract incident metrics, or shallow metrics, I mean all the ones you have heard of: the typical conventional mean time to resolve, mean time to detect, mean time to acknowledge or know, or mean time to something; frequency of incidents; severity; customer impact. This is all data. It is fine, and it is quite common to get this. But when it comes to learning from incidents, this data does not have predictive value forward. You cannot use it; there is no such thing as trending, despite popular belief, with this data.

If you see incidents, and here I am showing data from a well-known cloud provider, there is no predictive value forward looking at this data, and there is no explanatory value backward. These are just numbers, and they are divorced from the substance.
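As a small illustration of one way an averaged metric can be divorced from the substance, the sketch below compares two hypothetical quarters. The resolution times are invented for the example and do not come from the talk or from any real provider. Both quarters report the same mean time to resolve even though the incidents behind them are very different.

```python
from statistics import mean

# Made-up resolution times in minutes for two hypothetical quarters.
quarter_a = [30, 30, 30, 30]   # four similar incidents
quarter_b = [5, 5, 5, 105]     # three quick ones and one long, hard one

mttr_a = mean(quarter_a)
mttr_b = mean(quarter_b)

# The averages match, so the metric alone cannot tell these quarters apart.
print(mttr_a, mttr_b)
print(max(quarter_a), max(quarter_b))
```

The number says nothing about whether the long incident was difficult and handled well, or straightforward and handled poorly. That information has to come from the analysis itself.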

Quite often people will say, "Well, I get that, John, but we understand it does not provide a lot of insight into the incidents, but they help us ask deeper questions. They kind of point us in some directions." What I would say is that you do not need this chart to ask deeper questions about incidents. Just ask the questions. You either have an incident or you do not have an incident. Ask the questions. More importantly, from a leader standpoint, you are not the most important person to ask questions. The most important people to ask questions are the people who are ultimately going to be responsible, the people who are getting up in the middle of the night.

If you are going to ask the questions, then you should consider recording those questions and the answers to them, the multiple answers, so that others can find them in the future. This is what supporting learning looks like.

Unless leaders encourage and support capturing the difficulties of handling an incident, all they will have is shallow data. These data overrepresent consequences and impact, and maybe some performance. What I mean by performance is not the performance of the people; it is the performance of handling the incident. Was it difficult to handle or not? Did the incident last a long time because it was really difficult, and yet it was still handled amazingly well? Or was it a case that was straightforward but handled poorly? Incident metrics can signal consequences or impact quite often. Difficulty in handling the incident and performance in handling the incident are qualities that have significant importance when it comes to learning from them. Without those, you cannot understand what incidents mean in context.

I am going to move on to hands-on practitioners. When we look across experience, hands-on practitioners typically view post-incident activities as a check-the-box chore. They typically believe in a future world where automation will make incidents disappear. They typically do not capture what makes an incident difficult, only what technical solution there was for it.

There are organizations we have spoken with where the dialogue made it clear that they are writing incident reports and write-ups with leadership as the specific audience. To put the truth, which is "we do not actually know what happened here, we do not actually know how this came about, and even after days looking at it we are not really clear on this," is less palatable to some technology leaders. Therefore the write-up serves to provide some comfort for technical leaders. That is an illusion of comfort.

Practitioners typically do not capture the post-incident write-up for readers beyond their local team. This is not to say local teams and people who were in the incident are not learning from the incident. It is just not happening in a formal and captured way. It is happening in informal social exchanges, ad hoc conversations about this thing or that weird thing. Those are not what are captured in write-ups, and therefore what is known about the incident is truncated to this local group.

Practitioners typically do not read post-incident review write-ups from other teams. Who can blame them, because they are not very useful? They typically fear what leadership thinks of incident metrics more than they fear misunderstanding the origins and sources of the incident. They also typically have to exercise significant restraint from immediately jumping to fixes before understanding an incident beyond the surface level. To some extent this is very difficult to avoid. It is what engineers do. We like to fix things. We want to do that so much that as soon as we have a minimum viable guess about what happened, which may not be an accurate characterization at all, we will reach for fixes. They also typically believe the above observations do not apply to them.

Learning is not the same as fixing. The danger here is not that coming up with action items, follow-on remediations, or things to help in the future is a bad thing to do. In fact, they are amazing things to do. The question is not whether you are doing them or not. The question is how good they are and what understanding they are based upon. If you understand an incident at a very surface level, you run the risk of generating or hypothesizing follow-on action items that are either too large in a boil-the-ocean sort of way, or completely off the mark because they do not address what the incident actually involved.

At this point you might say, "Okay, I understand, John. We are not doing very well out in the world. What do you have to give us?" There are a couple of things beyond getting a master's degree and doing the work that we focus on every day.

Aimed at technology leaders, I would say this: learning from incidents effectively requires skill and expertise that most do not have, and by most, I mean all. I do not want to sugarcoat it. These are skills that can be learned and improved. You do not become an NTSB accident investigator without having learned some new skills. Prioritize learning this. Prioritize doing it well, and do it when things are going well. Do not wait until you have had the big one to pound the table and say, "We have to get better at incident analysis." There is no better time to fix the roof than when the sun is shining.

It will accelerate the expertise in your organization. With the amount of time, effort, and money you are spending on retaining talent, you could spend just a little bit more time and effort investing in accelerating the expertise that you already have on staff. I am going to link you here to the learningfromincidents.io site. This is not just Adaptive Capacity Labs paying attention to this. This is also a community of people from across the industry who are looking at what deep analysis looks like and how it looks different from typical template-driven postmortems.

Some other suggestions for technology leaders: focus less on incident metrics and more on signals that people are learning. How could you tell that someone is learning? Try to get analytics on how often an incident write-up is being read. Try to find out who is actually reading these write-ups by their own volition. Where are these write-ups being linked from? It may surprise you that we have seen organizations not only link to individual incidents from code comments in the code base, but also from architecture diagrams and roadmaps. I know of at least a handful of organizations that use incident analysis in new-hire onboarding and orientation because they are such good tours of how things work.

Support incident review meetings being optional and track their attendance. I can say this with some confidence: engineers do not go to optional meetings unless they are very interested, and if they are very interested, there is something there. I cannot learn anything if I do not go to the place where I can learn it. You want this number to go up. Track which write-ups link to prior relevant incident write-ups. Seeing how elements, aspects, and facets of your architecture and organization change over time can be done by cross-incident analysis, and the only way to do cross-incident analysis is if each individual analysis is rich enough for you to see these meta-patterns over time.
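A minimal sketch of the link tracking described above, assuming each write-up records the IDs of the prior write-ups it links to. The IDs, the `links` mapping, and both function names are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical data: write-up ID -> IDs of prior write-ups it links to.
links = {
    "inc-101": [],
    "inc-102": ["inc-101"],
    "inc-103": ["inc-101", "inc-102"],
    "inc-104": [],
}


def cited_counts(links):
    """Count how many other write-ups link back to each write-up."""
    counts = Counter({wid: 0 for wid in links})
    for targets in links.values():
        # set() ignores duplicate links inside a single write-up.
        counts.update(set(targets))
    return dict(counts)


def share_linking_back(links):
    """Fraction of write-ups that reference at least one prior write-up."""
    if not links:
        return 0.0
    return sum(1 for targets in links.values() if targets) / len(links)


inbound = cited_counts(links)
share = share_linking_back(links)
print(inbound)
print(share)
```

A rising share of write-ups that link back, and a few write-ups that many later ones cite, would be one rough signal that people are finding earlier analyses worth returning to.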

Now I am going to turn my attention to practitioners. Do not place all the burden on a group review meeting. The conventional view is that you have an incident, maybe make a timeline, and then have a big meeting. Use this meeting to present and discuss analysis that has already been done. There are too many potential pitfalls to bet everything on a single meeting. There is the highest-paid person's opinion that dominates the conversation. There is groupthink. People can get off on tangents. There are political redirections, elephants in the room, and being down in the weeds. This is an important meeting, so prepare for it like it is expensive, because it is very expensive.

Perhaps controversial is that practitioners analyzing an incident should consider themselves to not be stakeholders. Your role is not to tell the one true story of what happened. Your role is not to dictate or suggest what to do. Maintaining a non-stakeholder stance signals to others that you are willing to hear a minority viewpoint. Minority viewpoints are part of getting multiple diverse perspectives, which is the fuel that makes a story compelling. It is the fuel that makes something interesting. You want these interesting stories because without them people will not learn, because they will not pay any attention.

One way of thinking about this is that half of your job, fully 50% of your job as an incident analyst, is to get people to genuinely look forward to and participate in the next incident analysis. Having a neutral, certainly non-stakeholder stance, is a way of getting that done.

Perhaps even more controversial: separate generating action items from the group review meeting. Separate them in time. Make the group review meeting about understanding the incident. You are not going to construct a timeline; you are going to explore one that has already been constructed, and you are going to talk about themes synthesized from the data you have already analyzed. This makes the meeting focused, and it does not burden the meeting with coming up with action items.

Lots of people have experience being in a post-incident review meeting where, at some point, someone says, "We only have ten more minutes," or "I only have fifteen more minutes. We should come up with action items, so let's come up with some." If you generate action items later, after what we call soak time, even just a day later, and this can be done asynchronously, the action items will necessarily be better. This is the same dynamic most senior engineers know: when you are banging your head against a wall on a really tricky bug, stop doing what you are doing and go for a walk. Go to the gym. Take a shower. When you are not programming is when answers will come to you. It is the same dynamic.

I am going to put forth a challenge and leave you with this. I challenge you, no, I dare you all to take me up on this.

For technology leaders, I dare you to do this, and come back to me to tell me what happened. Start tracking how often incident write-ups are read voluntarily, not as a mandatory task, by people outside of the teams closest to the incident. Also start tracking how often incident review meetings are voluntarily attended by people outside of the teams closest to the incident. Do this for one quarter. Have somebody on your staff do this for one quarter. Do not publicize or advertise that this is what you are doing. How often people read and how often people attend is your starting point for making progress on learning. If people do not read or attend, no learning is happening. I thought that was obvious, but I figured I would spell it out.
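One way this tracking might be sketched, assuming read events can be logged with the reader's team and that reading is not mandated, so a logged read is a voluntary one. The `ReadEvent` record, the team names, and the `outside_readers` function are hypothetical, not something described in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadEvent:
    """One person opening one write-up (hypothetical log record)."""
    writeup_id: str
    reader: str
    reader_team: str


def outside_readers(events, involved_teams):
    """Count distinct readers per write-up who are outside the involved teams.

    involved_teams maps writeup_id -> set of teams closest to the incident.
    """
    readers = defaultdict(set)
    for event in events:
        if event.reader_team not in involved_teams.get(event.writeup_id, set()):
            readers[event.writeup_id].add(event.reader)
    # Report every known write-up, including those nobody outside has read.
    return {wid: len(readers[wid]) for wid in involved_teams}


involved = {"inc-101": {"payments"}, "inc-102": {"search", "platform"}}
events = [
    ReadEvent("inc-101", "ana", "payments"),  # inside the team: not counted
    ReadEvent("inc-101", "bo", "search"),
    ReadEvent("inc-101", "bo", "search"),     # repeat read: counted once
    ReadEvent("inc-101", "cy", "platform"),
    ReadEvent("inc-102", "dee", "search"),    # inside the team: not counted
]
counts = outside_readers(events, involved)
print(counts)
```

The same shape would work for meeting attendance by swapping read events for attendance records. Counting distinct people rather than raw opens keeps one enthusiastic reader from inflating the number.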

Practitioners, I have a dare for you. For every incident that has a red herring episode, a wild goose chase, a rabbit hole, capture the story of that red herring in detail in the incident write-up. Put it in its own section. Do whatever you need to do. Capture it. Write it. Ask the people who were there what it was. Most importantly, ask what made following the rabbit hole seem reasonable at the time. Anybody could say, "You should not have followed that red herring. Next time, do not follow that red herring." Red herrings and rabbit holes and goose chases exist because they almost always work, and we are convinced that they are working. Only hindsight tells us that they do not. That is my dare.

As you can imagine, I am interested in incidents. Dr. Richard Cook, David Woods, myself, and all of the people in the learning from incidents community, which is growing very quickly, are interested in hearing your stories. We want to know. We want to hear cases. There are academic works being written, degrees being achieved, doctorates being achieved on this topic. It is time for decades of research in human factors, resilience engineering, and cognitive systems engineering to reach the technology industry. Things are too important to screw this up. Bring your stories and reach out to us.

Thank you for listening. I very much care about this topic. These issues are so important. It is not just about the incidents; it is about how an organization learns and achieves any goals that it sets out to. I am very excited that I was able to speak right after Erica Morrison, who spoke pretty brilliantly about how these things, in a real case, can make a difference for your organization and your customers.

If you have any questions for me, if there is anything I can do to help you venture down this path, let me know. I have learned a lot over the past number of years. The benefits are so much larger than we think. You will find me in the conference Slack instance for the DevOps Enterprise Summit, and you can always reach out to me at @allspaw on Twitter and allspaw@adaptivecapacitylabs.com. Thanks. Thanks for listening.