Learning Effectively From Incidents: The Messy Details
Much has been written about "organizational learning" and "learning organizations." The continued and growing attention to these topics in the software world is encouraging and warranted! However, creating conditions for people to genuinely and effectively learn from the incidents they experience is difficult to do, never mind sustain over time.
The frequency, severity, and even absence of these events do not represent what is learned, who has learned what, or how learning might be taking place in an organization. This talk is about the "messy" and practical realities of learning effectively from incidents, including a number of paradoxes and ironies that technology leaders face as they work to make progress in their organization.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
Gene Kim: Thank you, Shelby and Liz. In my opinion, the person who is leading the charge in redefining the mental models we need in order to design and operate the massively complex socio-technical systems that we all live in every day is John Allspaw.
Certainly almost everyone in the DevOps community is familiar with his decades of work. In fact, if there were a starting gun to the DevOps movement, it was likely the famous talk that he gave at the Velocity Conference in 2009 with Paul Hammond, where they talked about doing ten deploys a day, every day, as part of their work at Flickr.
I finally got to meet John Allspaw in person at the DevOpsDays Mountain View in 2010. It is a meeting that I will never forget, and over the years I have learned so much from him. He got his master's degree from Lund University back when he was CTO at Etsy. His advisors included the famous Dr. Richard Cook, Dr. Sidney Dekker, and Dr. David Woods, famous for their contributions to the safety and resilience engineering community.
We all talk about our desire to create learning organizations, and that is why I find John's work so intriguing, because he believes that by observing how organizations respond to incidents, we can gain incredible clues on whether and to what degree organizations are actually learning. Here is John.
John Allspaw
John Allspaw: Thanks. My goal today is to describe one of the most effective ways we can learn from incidents. It might not be intuitive, but at the same time it might be very intuitive, just unclear on how to do it.
Before we jump into it, I want to start out by giving you the summary. I want to give you the conclusion of this talk before I start. So here is a too long; didn't watch slide, the main gist of what I want to get across.
The first is, learning is never not happening. It is what humans do. It is an integral human activity. Second is that it requires remembering. Learning and remembering are inextricably linked. That means that learning from incidents effectively means discovering and highlighting the aspects and qualities of an incident's story that make it more likely to be remembered.
So what are those aspects? They are elements of surprise and difficulty, misunderstandings, dilemmas, paradoxes. This is what makes for good stories. This is what makes for stories that can be remembered. If you can remember it, it is likely that learning is going on.
Something that I want to mention is that I am using the phrase "messy details" in the title of this talk very deliberately. You may have heard me reference this phrase before. It is a reference to this paper, and it captures quite eloquently, in my opinion, that when it comes to work in complex domains, the details of what people do and how they do it are what matter, almost more than anything else. These details are easy to miss, and they are not often looked at closely.
So that is what I mean when I say the messy details. If you have not taken a look at this paper, I cannot recommend it more highly. Just because the topic surrounds the domain of healthcare does not mean that it is not applicable to other domains.
I want to first start with a story from my time at Etsy, a company here in Brooklyn, New York, where I am right now. I worked there for a number of years. The story of this incident is that an engineer on the job for no more than a couple of weeks made a change that brought all of Etsy.com down. The site was not just slow or degraded, it was down hard.
This is sort of what a typical write-up looks like about incidents: recently hired engineer made a change to production, blah, blah, blah. It was about an hour and ten minutes for them to figure out what was going on and what to do. So nothing catastrophic, but nothing nothing either.
I want to park that for a second. We will come back to this story.
Let us talk about learning in general and learning from incidents. As I mentioned before, people are always learning. It is difficult to prevent people from learning. The question is not whether they are learning or not learning. The question is, what are they learning, and how useful or how productive is what they are learning going to be to help them do their work in the future?
The challenge is not getting people to learn. It is about creating conditions where a couple of things can happen. We want to create conditions where people at every level of the organization have opportunities to discover new things they did not know, or revisit things that they thought they already knew but either were wrong or slightly dated in their knowledge. It is also about creating conditions where experts are supported in describing and teaching others, telling stories about what they know and how they know it. This is actually a lot more difficult than just getting something on the calendar and asking somebody, "Tell me what you know."
What we know about studies of expertise is that experts are not necessarily expert at describing what makes them an expert. But rich stories are valuable. You want to create conditions where they are viewed as assets, just like any other valuable asset to the success of a business.
As I have mentioned before in some other talks, learning is not the same as fixing. Often, especially in the industry at the moment, the two seem to be confused or swapped for one another. One way of saying what is important about learning is: if you cannot remember something, you cannot say you have learned it.
So analyzing incidents means finding what made the incident surprising or difficult. These are what make for memorable stories. What if incident analysis is less about solving the problem that the incident responders responded to, and more about understanding how the incident responders understood and experienced handling and working through the incident?
We work with a lot of different organizations. One of the first questions that we always ask when we first talk to them is, "Do you have any stories about incidents?" That is it. That is the prompt that we give them. We do not give them anything else. We say, "Any incidents come to mind? Could you tell us a story?"
A couple of things show up for us that are always true. First, they are really enthusiastic. They always respond, "Oh, yeah. Well. Oh, God. Let me tell you about this one." They use their hands. Clearly, they sometimes almost flip into a different mode when they tell the story.
They tell the story in suspenseful ways. They know what is going to happen. That is how they have the story. We do not. Whether they know it or not, they lay out the story using what scholars in narrative composition would call suspense structures, even if they do not know what a suspense structure is. They include what was surprising. They include what was weird, strange, difficult. They give us a backdrop: "Oh, so you have to remember, right about this point was when our CEO took the stage at a thousand-person conference," or something along those lines. They are giving context for when in time, sometimes in space, the story took place.
They can tell it in detail, and this is one of the most gratifying and fascinating things for us. Even if it has been years since it happened, they can come up with all sorts of esoteric details. They could even write on the whiteboard what this little piece of code looked like.
After they are done telling these stories, we always ask them, "Hey, so this is an amazing story. Is there a place where I could read about this? Or when somebody joins the company, could they go read about it?" They always respond in some form of, "Oh, yeah, well, we have got a postmortem document. Hold on, let me see if I can get it," and we will take a look at it. What is interesting to us is that the story that they tell is always different than the official write-up. We wonder why that is. Does it have to be that way?
I am going to tell you a little bit of an example, to show how the telling of a story can or cannot reflect the richness of an event or series of events. Here it goes, one sentence: "A high school senior in Illinois led their classmates on an eleven-hour crime spree, committing fraud, grand theft auto, and cybercrimes."
That is the story, right? It is unclear whether you picked up that what I described is Ferris Bueller's Day Off. To be fair, it is not wrong. All of the facts in this sentence, all of the statements are true. It is just incomplete, and not only that, it may take some liberties in some of the descriptions or not. It can be true and also be pretty anemic as far as story is concerned. If I give you that sentence and you go see the movie, you will see that they are quite different. This is that richness and those messy details that I want to get across, such as Abe Froman, for those who have seen the movie.
So let us revisit this incident that I was telling you about before. September 2012, afternoon. This is a tweet from the Etsy status account saying that there is an issue on the site. I will give you a little bit of some flavor of what was going on in chat. People said, "Oh, the site's down." People start noticing that the site is down. A couple of observations: we are OOM-ing, out-of-memory errors, all over the place. More observations: signals that there is something about memory going on. Seems like some templates were rebuilt in the last deploy.
Interestingly, there was a deploy, but it was actually spaced in time from the outage. Usually, at least back then, if a deploy had an issue, some sort of bug or anomalous behavior, it would not take very long to show up. It would appear as soon as the code was out there, not five minutes later. This time, roughly five minutes had passed during which things seemed fine. So that was of interest.
Whatever was in the deploy still was not clear. People said, "Oh, well, maybe there was some sort of template-related thing." People said, "Well, it looks like we need to actually get a restart," and somebody said, "It is really hard to even connect to some of the web servers."
Meanwhile, people who were making changes, trying to work out what was going on, said, "Oh, well, we could deploy this, we could deploy that." People said, "Well, actually, it is going to be hard to even deploy because we cannot even get to the servers." People said, "We can barely get them to respond to a ping. We are going to have to get people on the console, the integrated lights out, for hard reboots." People even said, "Well, because we are talking about hundreds of web servers, could it be faster? We have to power-cycle these. This is a big deal here." So whatever it was in the deploy that caused the issue, it left hundreds of web servers completely hung, completely unavailable.
People said that even deploying, even if we knew what was going on, was going to be pretty hard to do until we could power-cycle everything. Somebody pointed out, "We are going to have to actually disable the load balancer bit or disable traffic coming in, because we do not want them to come back up after we power-cycle them because they are still going to have the code. Whatever is going on, it is only going to happen again. So we will block all the traffic, reboot all the boxes, deploy a change, whatever that is. We do not even know what that is yet."
But we had hundreds of web servers, so people were fanning out: "You get this number and you get this number. I will get web one through ten. You get eleven through twenty-one," so on and so on. They were in reboot fest. At some point, they get to a spot where they had walked through all of those steps. A lot of people ran a lot of commands in a very short period of time to get these boxes up and running. They finally got it up.
What is interesting about this? Let us go back to one of the changes. It seemed that there was something about templates. What they worked out afterwards was that there was a ticket. One of the tickets was for this newly hired engineer who was on boot camp. At Etsy, you would start in your first week, spend a week in this team, then spend a week in another team, and spend a week in another team. We called it boot camp. Lots of organizations do this. Then you would finally land at the team that you were going to be part of more permanently. It was like getting a bit of a tour.
One of the tasks was with the performance team, and the issue was old browsers. You always have these workarounds because the internet did not fulfill the promise of standards. So: let us get rid of the support for IE version 7 and older. Let us get rid of all the random stuff.
Etsy, if you do not know, was written in PHP. It might still be. We used a templating engine called Smarty to help put together composed pages. In this case, we had this base template used by, as far as we knew, everything. There was this little header-ie.css. This held the extra workarounds. The idea was: let us remove all the references to this CSS file in this base template, and we will remove the CSS file. This had been tested and reviewed by multiple people. It was not all that big of a change, which is why it was a task that was sort of slated for the next person who came through boot camp in the performance team.
They made this change, and like I said, some time passed. What they later figured out was happening is this: a request would come in for something that was not there. 404s happen all the time. The server would say, "Well, I do not have that, so I am going to give you a 404 page. So then I have got to go and construct this 404 page. Huh, but it includes this reference to this CSS file, which is not there. Which means I have to send a 404 page." You might see where I am going. Back and forth: 404 page fires a 404 page, fires a 404 page. Pretty soon, all of the 404s are keeping all of the Apache servers, all of the Apache processes across hundreds of servers, hung. Nothing could be done.
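The feedback loop described above can be illustrated with a small Python sketch. This is a simplified model of the description, not Etsy's PHP or Smarty code: the function name, the /assets/ path, and the worker limit are all hypothetical. It only shows why a 404 page that references a missing file keeps generating further 404s until every worker is busy.

```python
# Hypothetical model of the 404 feedback loop; names and paths are illustrative.
# Assumption: the 404 page still references a stylesheet that was deleted.
ERROR_PAGE_ASSETS = ["/assets/header-ie.css"]


def count_busy_workers(first_request, existing_paths, worker_limit):
    """Count workers tied up before the cascade ends or the pool is exhausted."""
    pending = [first_request]
    busy = 0
    while pending and busy < worker_limit:
        path = pending.pop()
        busy += 1  # one worker process is occupied serving this request
        if path not in existing_paths:
            # Serving a 404 page leads to a request for each asset it references.
            pending.extend(ERROR_PAGE_ASSETS)
    return busy
```

In this model, a single request for a missing page exhausts the whole pool when the stylesheet is gone, while the same request costs only two workers when the stylesheet still exists.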
The team looked at how many servers they had, and then when they split up, it became clear that they had to power-cycle. They would take ten at a time so that lots of folks could reboot them quicker in parallel.
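The way responders split the fleet into groups of ten can be sketched as follows. This is only an illustration of the idea: the function name, the host naming, and the responder names are assumptions, and nothing here reflects the tooling the team actually used.

```python
def assign_batches(hosts, responders, batch_size=10):
    """Deal fixed-size batches of hosts to responders in round-robin order."""
    assignments = {name: [] for name in responders}
    for start in range(0, len(hosts), batch_size):
        # Batch 0 goes to the first responder, batch 1 to the second, and so on.
        responder = responders[(start // batch_size) % len(responders)]
        assignments[responder].extend(hosts[start:start + batch_size])
    return assignments


# Hypothetical fleet of 40 web servers shared between two responders.
hosts = [f"web{n}" for n in range(1, 41)]
plan = assign_batches(hosts, ["alice", "bob"])
```

Each responder ends up with a list of non-overlapping hosts, so several people can power-cycle machines in parallel without stepping on each other.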
That is a little bit more of the story than what I first gave you. I just want to be clear on something. I am hoping that many people who worked at Etsy at the time see this talk. This is a story where one CSS change could not only break Etsy.com, but break it so spectacularly that its entire fleet of web servers required hard power-cycling. I am going to go out on a limb: it is one that I do not believe anyone who was there will ever forget. It is very memorable.
Side note on this particular case: this is the case that led us to build an award that we gave every year called the Three-Armed Sweater Award. I will leave that for a different talk. There are other talks about it.
What I am trying to get across here in this story is that we need to make an effort to highlight these messy details. What was difficult for people to understand? What was surprising for people about the incident? How did people understand the origins of the incident? When the people in the CSS case first went looking, they dismissed the change that had just been made as not being relevant because some time had passed, and that was a very reasonable thing to do. What mystery still remained for people? There are some details of this story that I am still not clear on, and I was there.
The goal of effective incident analysis is to capture the richest understanding of the event, represented for the broadest audience possible. This means multiple trade-offs at different levels. You do not want to capture in written form something that is so technically detailed that you have lost a whole bunch of readers. You also do not want it to be so vague and hand-wavy as to basically tell you nothing, like that first slide I showed you of the CSS case. It did not really say much.
There are many barriers, many challenges to getting this done well. Just one quick note on what I would say is the toughest: hindsight and hindsight bias, or as Baruch Fischhoff called it, the "I knew it all along" effect. This is a tendency to simplify these complex, messy details of the event down to the one true story. As a result, this tendency can basically produce a story where all of these details, these multiple perspectives, all get sort of wiped away in favor of a story that makes sense to me, the person looking back. We want to do it to be efficient and crisp, but that is lossy. It means smoothing out this messiness and boiling it down to how long the incident took: an hour and ten minutes. Is an hour and ten minutes the most interesting part of that story?
As for what you want to do in capturing all this, you will note that I have not told you how to do it, because that goes well beyond what fits in a talk. Regardless of how you do it, you want to support the reader. You want to write incident descriptions to be read, not just to be filed. You want to describe the data that you relied on in your analysis. Was it just Joe who responded to the incident and took ten or fifteen minutes filling out a template? I do not know. Maybe there is more than just Joe's view on it.
You want to make it easy for readers to understand terms or acronyms that they have not seen before, and you could use this proprietary knowledge trick here. You could use hypertext linking technology. Look it up, it is amazing. You want to have connections. Incidents are not these extra side distractions. They are a part of the work that you are doing. Remember, you are preventing them all the time. You want to increase the amount of preventing them all the time.
Use diagrams or other graphics to describe complex phenomena. Do not be afraid of using pictures. Make it easy for others to link to the write-up document.
How can you know if you are making progress? I have described some of these before. Here are some signals that can tell you that you are making progress in the right direction. More people will actually read post-incident write-ups, and you will know because you are tracking readership. More people will voluntarily attend post-incident group review meetings, and they will participate. They will talk about their view, their perspective, what happened for them, what was surprising for them. More people will link to these write-ups from code comments and commit messages, architecture diagrams, other related incident write-ups, and new-hire onboarding materials.
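One of these signals, links to write-ups from commit messages and code comments, is easy to count. The sketch below is a hypothetical illustration only: the URL pattern wiki.example.com/incidents/<id> and the function name are assumptions, and a real organization would substitute wherever its own write-ups live.

```python
import re
from collections import Counter

# Assumption: write-ups live at URLs like https://wiki.example.com/incidents/42
WRITEUP_LINK = re.compile(r"https://wiki\.example\.com/incidents/(\d+)")


def count_writeup_links(texts):
    """Count how often each incident write-up is linked across the given texts."""
    counts = Counter()
    for text in texts:
        # findall returns the captured incident id for every link in the text.
        counts.update(WRITEUP_LINK.findall(text))
    return counts


# Hypothetical commit messages and code comments.
samples = [
    "Fix retry loop, see https://wiki.example.com/incidents/42",
    "# context: https://wiki.example.com/incidents/42 and https://wiki.example.com/incidents/7",
    "Bump dependency versions",
]
link_counts = count_writeup_links(samples)
```

A count like this says nothing about what was learned, but a rising number of links over time is one of the signals of progress described above.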
I can say now, after working with a number of organizations for a couple of years, this happens. I know of one organization where eighty engineers voluntarily showed up to a group review meeting, and a huge majority of them added to, calibrated, and helped modify their collective understanding of the incident. Months after an incident write-up has been written, still people are commenting on it. Still people are linking to it. Still people are reading it. Tens of people a day are reading it and sharing it with their colleagues.
This is difficult. Organizations that we know of are doing it. I will say this: your competitors are hoping that you will not pay attention to any of it.
These markers of progress, I just wanted to point out here, I literally asked you and challenged you to pay attention to these things last year in the talk that I gave at the DevOps Enterprise Summit. So my snarky response now is, how is that going?
Here is the help that I would like. I would like, in the conference Slack channel, people to offer up their stories. I want people to challenge me on things that I have said in this talk. I want people to keep conversation about these messy details alive and moving and evolving forward. This is how we become better at learning from incidents.
Thanks very much. My name is John Allspaw, and I appreciate your attention.