Incident Analysis – Your Organization's Secret Weapon

Europe 2021

Nora is the co-founder and CEO of Jeli. She is a dedicated and driven technology leader and software engineer with a passion for the intersection between how people and software work in practice in distributed systems. In November 2017 she keynoted at AWS re:Invent to share her experiences helping organizations large and small reach crucial availability with an audience of ~40,000 people, helping kick off the Chaos Engineering movement we see today. She created and founded the www.learningfromincidents.io movement to develop and open-source cross-organization learnings and analysis from reliability incidents across various organizations, and the business impacts of doing so.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

I hope you are having an amazing and exothermic day two of this conference, and we have two great talks coming up next. So I've been following the work of the now many cohorts of graduate students who chose to enroll in the Human Factors and System Safety program at Lund University. Among the first of these was John Allspaw, famous for his 2009 talk, "10 Deploys a Day, Every Day" at Flickr. But I've been particularly interested in the work of Nora Jones, because her work is informed by so much of her first-hand experiences in some of the most famous properties in the world, such as being head of chaos engineering at Slack and being involved in so many aspects of chaos engineering at both Netflix and at Jet.com, which was acquired by Walmart.

I love her stories because they help bridge the world of theory and practice, underscoring why it's so important to deeply learn from incidents. In other words, how organizations prepare for and learn from incidents is so critical, famously called an investment that was made on your behalf but without your consent. I believe that doing this well is one of the key hallmarks of dynamic learning organizations. Nora is currently founder and CEO of Jeli.io. Okay, here's Nora.

Nora Jones

Hello, everyone. We've all had incidents. They're unexpected, they're stressful, and sometimes in management, there are inevitable questions that creep up. What can we do to prevent this from ever happening again? What caused this? Why did this take so long to fix? The organizations I've worked in, and the research that my team and I have done in this space, have shown the following responses to the question of why we do postmortems: "I'm honestly not sure." "Management wants us to." "It gives the engineers space to vent." "I think people would be mad if we didn't." "We have obligations to customers." "We have tracking purposes." "We want to see if we're getting better," or "We want to have the answers to the board's questions." I think we all know that some form of post-incident review is important, but we don't all agree on why it's important.

We want to make efforts to improve. We want to show that we're improving, but we're spinning our wheels in a lot of ways because we're not actually making efforts to improve the post-incident reviews themselves. We're making efforts to try to stop incidents. But without making efforts to try to improve the postmortem reviews or improve the incident reviews, we're actually not going to improve incidents on any level. The good news is incident analysis can be trained and aided, but it has to be trained and aided to be improved upon. In this very conference before, John Allspaw has talked to us about how the metrics we are tracking today, like MTTR and MTTD and number of incidents, are actually shallow metrics. Right? I get why we're tracking those things. It's an emotional release. It's something that can make us feel better.

But he posed an open question and challenge to the audience. He said, "Where are the people in this tracking, and where are you?" We haven't changed much as an industry in this regard. Gathering useful data about incidents does not come for free. You need time and space to determine it, and today I'm going to talk to you about why giving this time and space to your engineers and to your organizations to improve post-incident reviews can actually work within your favor. It can give you that ROI you're looking for and level up your entire organization. And I'm going to tell you about this through multiple stories that I've experienced myself and new paths on how you can do this in ways that are not disruptive to your business and next steps for you to embark on.

Spoiler alert, sometimes a thorough analysis or incident review actually reveals things that we're not ready to see, hear, or change yet. So as leaders, we have to be open to hearing some of these things. I'm Nora Jones. I've seen this on the front lines as a software engineer, as a manager, and now as I'm running my own organization. In 2017, I keynoted at AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering, purposefully injecting failure in production, and my experiences implementing it at Jet.com, which is now Walmart, and Netflix. And most recently, I started my own company based on a need I saw for the importance and value add to the whole business of a good post-incident review. However, I also saw the barrier to entry of getting folks to work on that.

I started an online community called Learning From Incidents in Software. This community is full of over 300 people in the software industry, and we're sharing our experiences with incidents, we're sharing our experiences with incident review, and we have folks from all over the industry. That led me to starting my own organization, Jeli, to help companies get more ROI from post-incident reviews. This equation is in a book called "Seeing What Others Don't" by Gary Klein. Gary Klein is a cognitive psychologist who studies experts and expertise in organizations. The equation he came up with says that performance improvement is the combination of error reduction plus insight generation. You can't have one without the other, yet we focus as an industry way too much on the error reduction piece and not on the insight generation piece.

Except we're not actually going to improve the performance of our organizations if we're only focusing on the error reduction piece. And I get it. That is an easy thing to measure. As software engineers, we're taught to look for technical errors. We're taught to look for some of these things. We're not so much taught to generate insights. We're not so much taught to disseminate insights, and we don't get celebrated for it. That's something that we can do as leaders is we can actually celebrate the insight generation and dissemination and training materials by folks in our organization. Today I'm going to tell you three different stories about the value incident analysis brought about in different organizations. These are based on true events I have witnessed or been a part of, but the names and details have been changed.

When I was at Netflix, I was on a team with three other amazing software engineers. We'd spent years building a platform to safely inject failure in production to help engineers understand and ask more questions about areas in their system that unexpectedly behaved when presented with turbulent conditions we see in everyday engineering, like injecting failure or latency. It was amazing, and we were happy to be working on such an interesting problem that could ultimately help the business understand its weak spots. But there was actually a problem with the way that we implemented the tooling and the way it was being used. And when I took a good look at it, most of the time I realized that the four of us were actually the ones using the tooling.

We were using the tooling to create chaos experiments, to run chaos experiments, to analyze the results, which meant what were the teams doing? Well, they were receiving our results, and sometimes they were fixing them and sometimes they weren't. I'm sure all of us have been a part of an incident where the action items don't get completed. It was a similar situation, and it wasn't the team's fault, right? So why is this a problem? Well, it was a problem because we were the ones doing most of the experimentation and generating the results, and we were refining our mental models, but we weren't the ones on the teams for the chaos experiments we were running. Right? We weren't on the search team, we weren't on the bookmarks team, but we were running experiments for them.

We weren't the ones whose mental models needed refining or understanding, but we were the ones getting that refinement and understanding, which actually didn't provide much benefit to the organization. We were leading this horse to water, but we were also pretending the horse was drinking the water, and sometimes teams would use it, but that would actually last only for a couple of weeks, and then we'd have to remind them to use it again. So we approached this problem like any good software engineer would approach it and started trying to automate away the steps that people weren't taking in order to get them easier access to the harder parts of the tooling. But that part isn't what this talk is about. It's about one of the other things we did.

We wanted to give them more context on how important it was, or wasn't, to fix a particular vulnerability that we found with the chaos tooling. So to know if something was or wasn't important to fix, I started looking at previous incidents. I started digging through some of them to try to find patterns, to try to find patterns of systems that were underwater, or incidents that involved a ton of people, or incidents that cost a lot of money, so that we could help prioritize the results we were finding with these chaos experiments. I wanted to use this information to feed back into the chaos tooling to help improve the usage of the tooling. But I found something that was much greater.

Incident analysis had a much greater power in the organization than just helping them create chaos experiments and prioritize the results better, and spending time on it opened my eyes up to so much more, things that could help the business far beyond the technical. And so here's the secret I found. Incident analysis is not actually about the incident. Right? It's this opportunity we have to see the delta between how we think our organization works and how it actually works. Yet most of the time we're not good at exposing that delta. It's a catalyst to understanding how your org is structured in theory versus how it's structured in practice.

It's a catalyst to understanding where you actually need to improve the socio of your sociotechnical system, how you're organizing teams, how people in different time zones are working together, how many people you need on each team, how folks are dealing with their OKRs, given all the technical debt that they're working through as well. Incident as a catalyst is showing you what your organization is good at and what actually needs improvement. This reminds me of a separate story. I was at an organization where an incident had occurred at 3:00 a.m. That's when all the bad incidents occur, right? I came into the office the next day and I was tasked to lead the investigation of this highly visible incident after the fact. This was something that made the news.

But a senior engineering leader pulled me aside in the office the next morning and said something along the lines of, "Hey Nora, I don't know if this incident is actually all that interesting for you to analyze. I feel like maybe we should just move on." I asked why, and they said, "Well, it was all... I know I'm not supposed to say this, but it was human error. Kieran didn't know what he was doing. He wasn't prepared to own the system. He didn't need to respond to that alert at 3:00 in the morning. It could've waited until he was in the office and he could've gotten help with it." I was shocked. This was in an organization that thought they were practicing blamelessness. Right? We've all heard about blameless postmortems, but yet we all use it a little bit incorrectly.

They thought they were practicing this without a deep understanding of it, and when something like this happens, a Kieran makes an error, it's usually met with instituting a new rule or process within the organization without publicly saying that you thought it was Kieran's fault. Yet everyone, including Kieran, knows that folks think that. That's still blameful, right? It's not only unproductive, it is actually hurting your organization's ability to generate those new insights from that equation we looked at earlier and build expertise after incidents. And so you're actually harming your organization's ability to improve your performance. I get it. It's easier to add in new rules and procedures. It's easier to add in gates. It's easier to update a runbook and just move on. It allows us to emotionally move on, and we need that as humans.

We need to feel like we're done with the thing. But these implementations of new rules and procedures don't actually usually come from the folks on the front line either. And that's because it's much easier to spot errors in hindsight, especially from a management perspective. It's much more difficult as leaders to encourage insights. But unfortunately, adding in these new rules and procedures actually diminishes the ability to glean new insights from these incidents. You're not giving people the space and time they need to glean these new insights. Because what Kieran did, someone else is going to do in the future, even if you add those guardrails up. So despite all that, I still decided I wanted to talk to Kieran, and I wanted to figure out what happened.

So according to the organization, Kieran had received an alert at 3:00 a.m. that, had he spent more time studying the system he was on call for, he would've known could've waited until business hours to fix. I came into a conversation with Kieran completely blank, and I asked him to tell me about what happened. Well, he said, "I was debugging a Chef issue that started at 10:00 p.m., and we finally got it stabilized. I went to bed at around 1:30 a.m. At 3:00 a.m., I received an alert about a Kafka broker being borked." Interesting finding number one, Kieran was already awake and tired and on call from debugging a completely separate issue. That's interesting to me. I wonder why we have people on call like that for two systems in the middle of the night, and we're not keeping an eye on them.

I asked him what made him investigate the Kafka broker issue. He said, "Well, I had just gotten paged for it. My team just got transferred this on-call rotation for this Kafka broker about a month ago." I asked if he had been alerted for it before. He said, "No, but I knew this broker had some tricky nuances." That led me to interesting finding number two. Kieran's team had not previously owned this Kafka broker. And I wondered, at this organization, why did they get transferred the on-call for this Kafka broker, and how do on-call transfers of expertise work? Who originally held the expertise for this Kafka broker, if not this team? I then asked him how long he's been at this organization. He said five months.

Interesting finding number three, Kieran was pretty new to the organization, and we had him on call for something like this for two separate systems in the middle of the night. I don't really feel like this is Kieran's fault so much anymore, and I'm starting to think that this really wasn't human error. If I was in Kieran's shoes, I would've absolutely answered this alert at 3:00 in the morning. I'm new to the organization. It's a new team that I'm on call for, and I know this broker has tricky nuances. It makes sense, but yet if we hadn't surfaced all these things and we hadn't had the opportunity to have a good incident review with Kieran, we wouldn't have surfaced this. Right? We would've kept repeating those hacky on-call transfers.

We would've kept putting new employees on call when they maybe weren't ready yet, or maybe when we hadn't trained them yet. And so by digging into this a little further, we were able to surface these things. But if we had just implemented a new rule or procedure, this kind of stuff would just get repeated again. Maybe not with this Kafka broker, but with another on-call system in this org. So, let's go back to this point. Most incident reviews are important, but they're not good. And what's worse is when an incident or event is deemed to have a higher severity, we actually end up giving our engineers even less time to figure out what happened. Sometimes it's due to SLAs that we have with customers.

But it's important that time and space are given after that customer SLA is met to come up with actually good action items, to come up with the how of how things got the way they are. Give your engineers space to work through them, especially if it was an emotionally charged incident. When you do an incident analysis of incident Slack channels or Zoom transcripts or chatting with people, you can talk to people one-on-one like I did with Kieran. We call this an interview or a casual chat. And these individual interviews prior to the bigger incident review can determine what someone's understanding of the event was, what stood out for them as important, what stood out for them as confusing or ambiguous or unclear, and what they believe they knew about the event and how things work that they believe others don't.

Especially with emotionally charged incidents, we should set up some one-on-one individual chats like this. If I had asked Kieran those same questions in the incident review meeting itself, it probably wouldn't have revealed all the things that he revealed to me in that one-on-one chat. Now, there are certain ways we can ask questions, and we call this cognitive questioning or cognitive interviews. Now, knowledge and perspective gleaned in these early interviews or the way we ask these questions can point to new topics to continue exploring. They can point to some relevant ongoing projects. They can point to past incidents. They can point to past experiences that are important for the organization, important historical context to know to help level everyone else up.

There's a bunch of sources of data that we can use to inform this incident review. Right? And we can iteratively inform and contrast the results of cognitive interviews with these other sources of data, like pull requests and how they're being reviewed, or how Slack transcripts are going, or docs and architecture diagrams, or even JIRA tickets where the project got created. Now, my last story is one that we might all be familiar with a little bit as a software industry. I was in an organization where promotion packets were due. Now, promotion packets in this organization consisted of an engineering manager putting together a little packet for someone on their team that they thought deserved to be promoted. As this organization was growing larger and larger, this became harder to read all the packets, and so they became very number-driven.

Did this person complete the things that they said they were going to complete at the beginning of the quarter? And so that's what it was mostly driven off of, is if they had completed those things. Right? And so people were losing promotions when they hadn't completed the things they committed to at the beginning of the quarter. But I know we've all been at organizations where we've committed to something at the beginning of the quarter, we get midway through the quarter and realize that that's not the most important thing anymore. But yet this is what we were judging people on. So what do you all think happened?

Well, people would commit to things at the beginning of the quarter, realize they weren't relevant anymore, but knew that that's what they were getting judged on for their promotions. And so they'd rush to complete those things just before promotion packets were due. Now, we saw certain upticks in incidents in this organization during the year. And as I was analyzing the incidents for this organization, I was analyzing individual incidents, but I was also analyzing historic themes and if we could correlate them with certain events, traffic spikes, big uses of the application. And I saw spikes in incidents around the time promotion packets were due, just a few weeks after, because we would see an uptick in things getting merged to production, maybe things that weren't ready.

And I would sit in some of these incident reviews and engineers would say, "Yeah, I wasn't going to get promoted unless I pushed this in." And so this engineering organization thought they were incentivizing the right things, but they were actually ending up creating poor incentive structures, right? This was the organization they were creating. But without actually looking into incidents and without actually looking into incident analysis, they weren't able to figure out that this is what was happening. And this is why this kind of stuff is important, is it can help you structure your organization better. A good incident analysis should tell you where to look. And I mentioned this before, we're not trained as software engineers to analyze incidents.

We're trained in different pieces of software and distributed systems, and we can figure out technically what happened, but we're not really trained to figure out socially what happened. And it can be kind of awkward sometimes, right? Figuring out what questions to ask, figuring out what people to talk to. But as leaders, we can help not make it awkward, and we can help make it psychologically safer. Now, I mentioned a good incident analysis should tell you where to look. This is really hard with some of the tools on the market today. And I want to show you a quick screenshot of Jeli, which is the tool my company is working on. We're not GA yet, but I wanted to give you a little teaser today just so you can see where incident analysis can show you where to look. So where would one look here?

Well, you can see a heat map of all the chatter on the team. You can see a heat map of when the Slack conversations were going off or when PagerDuty alerts were alert storming or when certain pull requests are going through. You might be interested in the absent chatter on early Saturday morning, where it looks like management was the only one online. Maybe that's a sign of actually good management taking one for their team there. You might be interested in the fact that customer service seemed to be the only one online late Friday night. I wonder if they were getting supported. You might be interested in some of the tenure of folks on the team and their participation level. Are we relying solely on folks that have been here for a while? What about folks that are fully vested? Are we relying on them a little bit too much?

What happens when they leave? You might be interested if we relied on folks that weren't actually on call, right? That can tell us if we need to unlock tribal knowledge, if we have knowledge islands in the organizations. You might be interested if people were on call for the first time ever and how we're supporting them. A good incident analysis should tell you where to look, but it can also help you with a number of things. It can help you with headcount, right? If you're always relying on people from a certain team or people that weren't on call, that can help you understand if you actually need to spin up a team there, if you need to spin up training there. It can help you with planning promotion cycles, as we talked about earlier. Quarterly planning, unlocking that tribal knowledge, figuring out what people know.

I was in an organization once where every time a certain guy came into the incident channel, everyone would react with the Batman emoji in Slack. And he was amazing, but it was actually a poor thing in this organization because we relied on him a little bit too much. Those engineers are expensive, and they usually leave organizations quickly because they burn out, right? And they take all that knowledge with them. This can help you see how you're actually supporting that. You can see how much coordination efforts are costing you during incidents. As an industry, we pay a lot of attention to the customer costs of incidents and the repercussion of the incidents. We don't pay a lot of attention to our coordination costs.

If we're working with a team we've never worked with before, if we're working with people we've never worked with before in the midst of an incident. And it can help you understand your bottlenecks, not just in your technical system, but in your people system. Now, you're probably thinking, I can't give one-on-one interviews for every incident. I don't have time to do this, right? And I want to go back to my earlier point. A lot of the reason that you don't have time is because the incident reviews today are not that great, and it feels like why should we spend more time on something that's not that great, right? But you can make it better. Now, there's some starting points you can do, like which kinds of incidents should be given more time and space to analyze. It doesn't have to be every incident.

So, and it doesn't have to be every incident that just caused customer impact or just hit Twitter big time. There's certain signals that you can use to see which incident should be given more time and space. Like if there were more than two teams involved, especially if they had never worked together before, or if it involved engineering and a non-engineering team, like customer service or PR or marketing working together. That's a good indication that more time and space should be given. Or if it involved a misuse of something that seemed trivial, like expired certs. I think every single organization I've been in, someone from leadership has been like, "Why are we having all these expired certs incidents? Let's look into them a little bit more."

Usually, when it's something seemingly trivial that is triggering a lot of incidents, it's actually an indication of a deeper organizational problem, not someone not knowing how expired certs work. If the incident was almost really bad, if we found ourselves going, "Whew, I'm so glad no one noticed that," that's usually an indication that we can dig into this deeper and that we have a lot to learn from it in a nice way that gives us time and space. If it took place during a big event, like an earnings call, or if the CEO was doing something within the organization, or if we had a big demo, or if promotion packets were due, or if everyone was out of the office. Those are usually indications as well. If a new service or interaction between services was involved, if more people joined the incident channel than usual.

Are you tracking how often there's lurkers as compared to actual participants in the channel? There's usually a lot of people wanting answers, but sometimes there's three or four people actually debugging the incident. That ratio of lurkers to actual participants can tell you a lot about the incident as well, and usually indicates there's more to dig into there. So when are we ready for incident analysis? When are we ready to level up our postmortems and not just have this standard RCA doc, and not just have this meeting that people feel like they wasted time at? You're ready now. Having customers means you're ready to benefit from incident analysis in some form. And the earlier you start, the better. The earlier you can ingrain this in your organization, the better. So what can you do today to improve incident analysis?

You can give folks more time and space to come up with better analysis, and this can be trained and aided. Use incidents that were not high profile, that didn't have a lot of emotional stakes, and give them a couple weeks to look at them in addition to their regular work. It doesn't need to be something that they drop everything and work on. But you can get a lot of value out of giving them some time and space to actually review the incident under a different lens. Come up with some different metrics. Look at the people. Don't just have MTTR, and MTTD, and error counts. But look at the teams. Look at if they've worked together before. Look at if they were pushing out a new service to production. Look at how many people they have on their team. Look at how often we're relying on people that were not on call.

Look at how many lurkers to actual incident responders we have. Look at the coordination costs of the incidents. You can do investigator on-call rotations. Treat this like you would incident response. Have folks that were not involved in the incident doing the incident review, because you get that unbiased perspective. You get someone that can ask Kieran those questions, without Kieran feeling like they're blaming him. Having folks that weren't involved in the incidents doing the incident reviews actually levels up your entire organization, because now they're learning about a system for an incident they didn't participate in, and that expertise is amazing to see. And allow investigation for the big ones. You need time for this.
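As a minimal sketch of the people-focused metrics described above, the Python below counts teams involved, lurkers per active responder, and responders who were not on call, then lists reasons an incident might deserve more time and space. The `Incident` and `Participant` types, the definition of a lurker as someone who joined the channel but posted nothing, and the default ratio threshold of 2.0 are assumptions made for illustration; only the "more than two teams" signal is taken directly from the talk.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Participant:
    name: str
    team: str
    messages: int   # messages posted in the incident channel
    on_call: bool   # was this person on call when the incident started?


@dataclass
class Incident:
    title: str
    participants: list = field(default_factory=list)  # list of Participant
    new_service_involved: bool = False
    teams_worked_together_before: bool = True


def people_metrics(incident: Incident) -> dict:
    """Summarize who took part, rather than how long the incident lasted."""
    # Assumption: a "lurker" joined the channel but posted nothing.
    responders = [p for p in incident.participants if p.messages > 0]
    lurkers = [p for p in incident.participants if p.messages == 0]
    ratio: Optional[float] = (
        len(lurkers) / len(responders) if responders else None
    )
    return {
        "teams_involved": len({p.team for p in responders}),
        "responders": len(responders),
        "lurkers": len(lurkers),
        "lurkers_per_responder": ratio,
        "off_call_responders": sorted(
            p.name for p in responders if not p.on_call
        ),
    }


def review_signals(incident: Incident,
                   max_lurkers_per_responder: float = 2.0) -> list:
    """Return reasons this incident may deserve a deeper review."""
    metrics = people_metrics(incident)
    reasons = []
    if metrics["teams_involved"] > 2:
        reasons.append("more than two teams were involved")
    if not incident.teams_worked_together_before:
        reasons.append("teams had not worked together before")
    if incident.new_service_involved:
        reasons.append("a new service was involved")
    if metrics["off_call_responders"]:
        reasons.append("relied on responders who were not on call")
    ratio = metrics["lurkers_per_responder"]
    # The threshold is an illustrative default, not a value from the talk.
    if ratio is not None and ratio > max_lurkers_per_responder:
        reasons.append("many more lurkers than active responders")
    return reasons
```

Calling `review_signals(incident)` on data exported from an incident channel returns a list of plain-language reasons, which could be attached to the incident record when deciding how much review time to give it. An empty list means none of these particular signals fired; it does not mean the incident has nothing to teach.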

And I know you're getting asked for answers by your boards, I know you're getting asked for answers by your C-suites, but giving them time and space is going to actually help with these big ones over time, and they're not going to seem as big over time. My company actually offers a couple things to help with this, too. We have a Move Fast and Learn from Incidents workshop where we give you two fake incidents that you can practice some of this in without using one of your real ones. And we also have a product that's in closed beta today, and I'll give you some more information on how to reach out about it afterwards. So how do you know if it's working? There are more folks attending the incident reviews and more folks reading them, not because they're being asked to, not because they're required to, but because they want to.

This is an indication that they're actually learning something. I actually saw folks get promoted because of what they were learning in these incident reviews at an organization where they really invested the time to level up their people and level up their incident reviews. You're not seeing the same folks pop into every incident. You're not having to react with that Batman emoji anymore. And folks are feeling more confident about their on-call rotations. They're not hesitant about ignoring an alert or responding to an alert. They're feeling better about it. Teams are collaborating more. You're not seeing as high of coordination costs in your incident. And there's a better shared understanding of the definition of an incident.

Something I challenge you to do is ask a few different folks in your organization what an incident is and see how many answers you get without them needing to pull up your SEV doc guide. That's also usually an indication that your coordination costs might be quite high. I want to share some testimonials from people that improved incident reviews in their organizations and spent time and space to do this. Someone said, "I just changed the way I was proposing to use this part of the system in a design that I was working on as a result of reading this incident review document." They were working on a completely separate project and were able to learn about how a piece of technology got implemented because of reading an incident review. That's what incident review should be for. It doesn't need to just focus on the socio or on the technical.

It's a training mechanism. I had someone say, "Never have I seen such an in-depth analysis of any software system that I've ever had the pleasure of working with." He was saying that folks that read this document are coming out with a better and more informed understanding of services that started out with just one or two people understanding them. And these end up being educational pieces that people pull up later in the organization. I've seen an incident review get published and people still pulling it up months later, not during incidents or anything, but as part of implementation, as part of onboarding guides, as part of getting ramped up to a team. They can be beautiful living documents. There are a few components I recommend as parts of a strong post-incident process.

An incident occurs, we assign an impartial investigator. The initial analysis is done by the investigator to identify if there's people we need to talk to one-on-one. And then they do an analysis of the dispersed sources involved, like the Slack transcripts, the Zoom transcripts, the PRs, the tickets. Then they might do some individual chats before the incident review. Then we might want to align and collaborate on something together, facilitate the meeting, output the report, and then after some soak time, after a day or so after this, then come up with action items. I promise your action items are going to be so much better if you don't do them right away, and you'll actually see people getting them done because they're inspired to, not just because they feel like they have to.
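The process above can be sketched as an ordered checklist with one guard: action items are not written until the soak time after the report has passed. This is only an illustration. The stage labels paraphrase the talk, the function names are hypothetical, and the one-day default follows the "a day or so" mentioned above rather than any fixed rule.

```python
from datetime import datetime, timedelta

# Stages in the order given in the talk; the label wording is paraphrased.
REVIEW_STAGES = [
    "assign an impartial investigator",
    "initial analysis: identify who to talk to one-on-one",
    "analyze the dispersed sources (Slack and Zoom transcripts, PRs, tickets)",
    "hold individual chats",
    "align and collaborate",
    "facilitate the review meeting",
    "output the report",
    "soak time",
    "write action items",
]


def next_stage(completed_stages: int):
    """Return the next stage label, or None once every stage is done."""
    if completed_stages < 0:
        raise ValueError("completed_stages must be non-negative")
    if completed_stages >= len(REVIEW_STAGES):
        return None
    return REVIEW_STAGES[completed_stages]


def ready_for_action_items(report_published: datetime,
                           now: datetime,
                           soak: timedelta = timedelta(days=1)) -> bool:
    """True once the soak time since the report was published has elapsed."""
    return now - report_published >= soak
```

For lower-stakes incidents the list can be shortened, in line with the suggestion that the process be condensed where the full version is not warranted.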

I realize this might feel like a lot for every incident, so think about the metrics I gave you earlier. For certain incidents, you should apply some of this, too, and it can be condensed and consolidated for other incidents as well. And if you're interested in further resources on incident analysis, the learningfromincidents.io community open sources a lot of our learnings. We write about how we're doing this in organizations, actual chop wood and carry water stories. Not so much on the theory, but actually how it's working in practice. And if you're interested a little bit more in the error-counting mechanisms I brought up earlier and why they can actually hurt us sometimes, there's a very quick paper, it's about two pages, called "The Error of Counting Errors" by Robert L. Wears. And it's taken from another industry.

Software has a lot to learn from other industries like medicine, and aviation, and maritime on how we look at accident investigation. We don't need to reinvent the wheel there. And it's a really great paper to look at. And Gene asked me to include this slide at the end on the help I'm looking for. As I mentioned before, I'm building a company around these capabilities, which are exactly the tools that I wanted to have when I was an engineer at previous organizations. I took the time to build them. If you have any interest in how we're thinking about this kind of work or the problem and solution, you can reach out to me via my email here or on the Contact Us form on our website, and I would love to talk. Thank you so much.