Rethinking Reliability: What We Can (And Can't) Learn From Incident Metrics

Las Vegas 2022

Courtney Nash

Internet Incident Librarian · Verica

Full transcript

The complete talk, organized by section.

Courtney Nash

[00:00:17] Hello. It is nice to see you all here in person. Thank you for joining me. I am here today to talk about rethinking reliability: what we can and cannot learn from incident metrics.

[00:00:29] A little bit about me. My name is Courtney Nash. I used to be the chair of the O'Reilly Velocity Conference. I was involved in some of these things and worked at some of these places. I used to study brains, and I think mountain bikes are the best technology that we have ever invented.

[00:00:47] Okay, cool. I am glad you all agree with me. We are going to do great.

01 How far DevOps has come

[00:00:50] The first thing I wanted to say is how far we have come from where we started to where we are now. 2009-ish, I would say: the first DevOpsDays, all the way through The Phoenix Project and Accelerate, and then there is some weird stuff that happened, and we are going to pretend like that did not happen. And now we are all here, so many years later, together again.

[00:01:12] I think the coolest thing about this is that along the way we figured out how to take this skunkworks project, this weird goofy way of doing things, and get it embraced on such an enterprise and global scale. We did that with data, right? We convinced people of the value of what we are doing with metrics and with data. I think all of us here are metrics data nerds.

[00:01:39] But I think we have been wrong about one particular thing. So we will just talk about the MTTR DevOps elephant in the room. Does anybody remember this one? Anybody remember the DevOps elephant? Yeah, I first learned about this one from Andrew Clay Shafer at some various Velocity Conference back in the day.

[00:02:00] The actual metaphor does not really matter so much, but I want to get into why I think we might have an issue with this. How many people here track this at your organization? Do you track MTTR? Okay. Does everybody know what I am talking about when I say MTTR, mostly? Okay, so mean time to resolve.

02 The VOID

[00:02:26] Why do I think this is a problem? Well, let us talk about this project that I started last year. The VOID is the Verica Open Incident Database. It makes public software-related incident reports available to everyone. There are tons and tons of security breach databases and all those things out there, and now we have one for those of us that care about things that fall over for other reasons.

[00:02:50] We really want to raise awareness of, and hopefully increase understanding about, software-based failures, and maybe bust some myths along the way, because the internet, we really want it to be resilient and safe. As you all know, the internet kind of runs the world, and we run that, so we want to do a better job of that. I put a URL to the community link for that down there.

[00:03:12] What is in the VOID? These are public incident reports. These are artifacts that have been written by somebody about some point where software stopped working as intended. It can be anything from social media posts to status pages, blog posts, all the way up to super meaty post-incident reviews and postmortem reports.

[00:03:33] We currently have almost 10,000 in the VOID across 613 organizations. They cover, I think, the past 15 years, so these actually date back to just a little bit before the beginning of DevOpsDays times.

[00:03:49] We collect a bunch of metadata about all of those reports directly from the information in the reports. We are not guessing. If we have to guess, we do not put it in. Things like the organization, the date of the incident, what kind of a report is it, the duration if that is available, if it was DNS, because we track those things and mark them down.

[00:04:12] We also note a few other things like the impact type, whether people are using a particular kind of analysis, and then severity, if that is available, which typically comes from status-page type things. Today we are going to really home in on duration.

03 Duration as grey data

[00:04:30] I like to call duration grey data. It is high in variability. It is low in fidelity. It is fuzzy on both ends. When did it start? When did it end? Who decided when it started? Who decided when it ended? Was that automated? Sometimes it is, sometimes it is not. Sometimes it is updated. Sometimes it is not.

[00:04:48] Ultimately the duration, as we perceive it, of an incident is a lagging indicator of things that have happened in your system, and it is inherently subjective. There are a lot of grey areas, and when you average a bunch of grey areas together, you get one big grey blob, which I will argue we use as a very objective measure of system reliability. But I do not think it is that, and that is exacerbated by the distribution of these types of data.

[00:05:26] I do not have my speaker notes, so I am running with how many times I have practiced this one. This is just one view of what I call these grey data. These are actual MTTR data collected from status pages from a variety of companies.

[00:05:41] What I want you to be doing right now is looking at the numbers and thinking, whoa, those numbers are all over the place and they vary. Instead, what you all are literally doing right now is asking, is Cloudflare doing better than Atlassian? Are you not?

[00:05:54] Because that is what we want to believe. We want to believe that if we look at those numbers, we can deduce some meaning about what those companies are doing. But you cannot. This just means some stuff happened. That is all that means. One of you is now going, what happened at Wistia in Q3? Let me tell you: they updated some logs. It looks really bad, but they were not that worried about it. It just took 116 hours.

[00:06:23] This is the point I want to make: these numbers are not a measure of reliability, of how well these teams are responding, or of the complexity of their systems. They are a reflection of the underlying distribution of these kinds of data, of these kinds of incidents, and the way they arise in the world.

[00:06:44] This is what we are used to seeing. This is a really nice big normal curve, right? There is a mean in the middle. You can get standard deviations off the side. That is not what your data look like. Based off of almost 10,000 incidents in the VOID, your data look like this. There is a big hunk of them up here, and then there is this nice big long tail full of updating your logs.

[00:07:12] This matters. You will notice I am not telling you which company is which in this case, but as this goes through them, you will see the pattern is incredibly consistent, and it does not look anything like that one.

[00:07:24] When you have skewed data like this, the mean is not a useful measure. It is not a good way to understand your data. Statistically speaking, some people can argue that the median is a better version of that, but I am going to come back to that in a minute, because even with these kinds of data, when you try to use something like a median, which is not as affected by all of these outliers that you are seeing, you still cannot tell what you think you can tell from MTTR.
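
A minimal sketch of why the mean misleads on skewed durations, using synthetic data. The log-normal shape and its parameters are assumptions chosen for illustration; they are not fitted to the VOID data.

```python
import random
import statistics

# Synthetic incident durations in minutes. The log-normal parameters
# (mu=4.0, sigma=1.2) are illustrative assumptions, not VOID values.
rng = random.Random(42)
durations = [rng.lognormvariate(4.0, 1.2) for _ in range(1000)]

mean_minutes = statistics.fmean(durations)
median_minutes = statistics.median(durations)

# Share of incidents that were shorter than the "average" incident.
share_below_mean = sum(d < mean_minutes for d in durations) / len(durations)

print(f"mean:   {mean_minutes:.0f} min")
print(f"median: {median_minutes:.0f} min")
print(f"{share_below_mean:.0%} of incidents are shorter than the mean")
```

With a long tail like this, a handful of very long incidents pull the mean well above the median, so most incidents end up shorter than the "average" one.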

04 Detecting change in MTTR

[00:07:53] We measure something for a number of reasons. Some of them are better than others, but typically we want to know: did we get better at that? Did we get worse at that? You want to be able to detect change in that metric.

[00:08:09] As I was going through all of these data in the VOID last year, I thought, these distributions look really familiar. I have seen this somewhere before.

[00:08:17] Last year a Google engineer took data from Google incidents and scraped incidents from a couple of different companies' status pages. He did not say which ones. He did kind of the same thing we did: looked at them and said, oh, that is interesting, those are these skewed data. When you have skewed data, you really have a hard time having an accurate view through the mean.

[00:08:39] What he did was he took a big chunk of them, a couple hundred in each case, split them in half, and ran Monte Carlo simulations. Monte Carlo simulations are like A/B tests, really, but with data you already have rather than production data coming in from your website or from your systems. Then he made half of the incidents faster, right? Just magically, poof, the best product in the world. Then he ran a bunch of simulations collecting the mean for those.

[00:09:08] What he found, and what we found when we replicated the same experiment with about 7,000 incidents across nine or 10 companies, is that that variability, that skew in those underlying data, made it basically impossible to accurately detect changes in MTTR across all of those simulations.

[00:09:26] Even when he was, and when we were, consistently making a big chunk, half of the incidents, faster, a third of the time the detected change in MTTR was actually longer. Sometimes it was way, way better than you made it. Now you are in this universe where you cannot trust MTTR. We want to trust it. It feels solid. It feels real, but the underlying data make it essentially untrustworthy.
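
The experiment described above can be sketched roughly as follows. This is not the original analysis code: the synthetic log-normal durations, the 10% improvement, the trial count, and the function name `simulate_mttr_change` are all assumptions made for illustration.

```python
import random
import statistics

def simulate_mttr_change(durations, improvement=0.10, trials=2000, seed=7):
    """Repeatedly split the incidents in half at random, genuinely shorten
    one half, and record the observed MTTR difference (control - treated)."""
    rng = random.Random(seed)
    observed = []
    for _ in range(trials):
        shuffled = durations[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        control = shuffled[:half]
        # Every incident in the treated half really is faster.
        treated = [d * (1 - improvement) for d in shuffled[half:]]
        observed.append(statistics.fmean(control) - statistics.fmean(treated))
    return observed

# Synthetic stand-in for a couple hundred real incident durations (minutes).
data_rng = random.Random(1)
durations = [data_rng.lognormvariate(4.0, 1.2) for _ in range(200)]

diffs = simulate_mttr_change(durations)
# A negative difference means the improved half looked slower on MTTR.
wrong_direction = sum(d < 0 for d in diffs) / len(diffs)

print(f"trials where MTTR pointed the wrong way: {wrong_direction:.0%}")
print(f"smallest and largest observed change: {min(diffs):.0f}, {max(diffs):.0f} min")
```

Because the improvement is applied in every trial, any trial with a negative difference is a case where MTTR reported a regression that did not happen.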

05 What duration and MTTR cannot tell you

[00:09:58] If you leave here with one thing in your head, I want you to understand that duration and MTTR cannot tell you how reliable your software or systems are, how agile or effective your team or your organization is, if you are getting better at responding to incidents or not, whether the next one will be longer or shorter.

[00:10:20] Then we are going to take one more little statistical detour: how bad any given incident is. I think we all have these intuitions that longer incidents are somehow worse.

[00:10:35] We had severity data along with those duration data for 7,000 or so incidents from status pages. Longer incidents are not worse. We did not find any statistical correlation between the duration and the severity of any of the incidents across those 7,000 or so. One tiny caveat: two of the companies showed, statistically speaking, the tiniest of effects, but it is not the kind of thing that would ever make it into a peer-reviewed journal as a big deal.

[00:11:04] This is what the lack of relationship between the duration and the severity of your incidents looks like. You can have long ones that are terrifying. You can have short ones that are terrifying. You can have long ones that are updating the logs for 116 hours. You can have long ones that are brutal, either on you as the people responding or on your customers.
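
One way to check for a duration and severity relationship in your own data is a rank correlation. The sketch below uses Spearman's rho, which is a choice made here because severity is ordinal; the talk does not say which test the VOID analysis used. The data are synthetic, with severity drawn independently of duration, so the output only shows what "no relationship" looks like.

```python
import random
import statistics

def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Synthetic data: durations in minutes, severity levels 1 to 4 (assumed scale).
rng = random.Random(3)
durations = [rng.lognormvariate(4.0, 1.2) for _ in range(500)]
severities = [rng.randint(1, 4) for _ in range(500)]

rho = spearman(durations, severities)
print(f"Spearman rho between duration and severity: {rho:.3f}")
```

To try this on real incidents, replace the two synthetic lists with your own durations and severity levels, in the same order.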

06 Shallow data and incident analysis

[00:11:29] Now what? Duration, MTTR, severity are what our good DevOps friend John Allspaw calls shallow data. He calls them that because they obfuscate the messy details that are actually what is going to tell you about what is happening with your systems, how reliable or resilient they are, how your team is poised to adapt and to act or not.

[00:12:01] This is a topic near and dear to my heart currently. It is real smoky in Washington right now, and many of you have probably lived through this in California and Nevada and everywhere.

[00:12:11] In my mind, measuring the complexity of our software systems is like judging how well we are doing at responding to wildfires by looking at the number of wildfires or how long it takes to fight any given one of them, without really understanding the complexity and the reality on the ground, or the people who are actually tasked with dealing with these things when they have really no idea how it started, or they did not have a say in how it started.

[00:12:42] If not MTTR, then what? I get asked this all the time. Do not take it away, but if you are going to take it away, what are you going to give me in return? I offer this: incident analysis.

[00:12:57] You are like, that is not a metric, Courtney. No, it is not. I am sorry. It is a process. It is an investment. But what I am trying to get you to understand is you do not need MTTR. It falls apart after a while. It is like one of those words you say too much. You say orange and then you cannot even say it anymore.

[00:13:17] You do not need MTTR to tell you to go look at your incidents to actually get those deeper understandings of what is happening with your systems. You just go do it, and then you will know more. But it is a process. It is not a number, but you can put numbers on it.

07 Sociotechnical data

[00:13:34] Your systems are sociotechnical systems. They are not just technical systems. They comprise humans and machines and code and assumptions and algorithms and all of those things working together on a daily basis to make the machines run. So you need to collect sociotechnical data.

[00:13:53] If you want to know how your sociotechnical systems are doing, you have got to get in, poke around, and talk to people.

[00:14:01] These are my three favorites, if you are looking for this, and you can put numbers on these. If someone up the chain needs a slide with numbers on it about how you are doing, these are the ones I would pick.

[00:14:13] Cost of coordination: this comes from the doctoral thesis of Dr. Laura Maguire, who is a researcher at Jeli, which makes a product that plays very squarely in this space in terms of helping people analyze their incidents. You can look at things like the number of people who are hands-on involved in an incident. Some people like to even go as far as saying how many people were paged. Were they paged overnight? How long did they end up having to do that?

[00:14:36] You can look at how many unique teams were involved, using which tools, across how many Slack or Teams channels, were there concurrent incidents running at the same time, and did it get to the point where PR and comms were involved? You can put numbers on all of those things, and you can track how you are doing over time.
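
The cost-of-coordination numbers listed above could be recorded per incident and rolled up like this. It is only a sketch: the `IncidentCoordination` record, its field names, and the sample incidents are hypothetical, and which fields matter is for each organization to decide.

```python
from dataclasses import dataclass
from typing import Set

@dataclass
class IncidentCoordination:
    """Hypothetical per-incident record of coordination effort."""
    incident_id: str
    responders: Set[str]       # people hands-on involved
    teams: Set[str]            # unique teams involved
    chat_channels: int         # Slack or Teams channels used
    paged_overnight: int       # people paged outside working hours
    concurrent_incidents: int  # other incidents running at the same time
    comms_involved: bool       # did PR or comms get pulled in?

def summarize(records):
    """Roll per-incident records up into numbers you can track over time."""
    n = len(records)
    return {
        "incidents": n,
        "avg_responders": sum(len(r.responders) for r in records) / n,
        "avg_teams": sum(len(r.teams) for r in records) / n,
        "avg_channels": sum(r.chat_channels for r in records) / n,
        "overnight_pages": sum(r.paged_overnight for r in records),
        "share_with_comms": sum(r.comms_involved for r in records) / n,
    }

# Made-up example incidents.
records = [
    IncidentCoordination("INC-1", {"ana", "ben", "chi"}, {"payments", "sre"},
                         chat_channels=2, paged_overnight=1,
                         concurrent_incidents=0, comms_involved=False),
    IncidentCoordination("INC-2", {"ana", "dev", "eli", "fay", "gus"},
                         {"payments", "sre", "network"},
                         chat_channels=4, paged_overnight=3,
                         concurrent_incidents=1, comms_involved=True),
]

summary = summarize(records)
for name, value in summary.items():
    print(f"{name}: {value}")
```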

[00:14:55] Participation: you can also track the number of people reading write-ups, number of people voluntarily coming to post-incident review meetings, the number of links to those write-ups from things like code comments and commit messages, from architecture diagrams, from this incident pointing to that incident and going, oh, maybe something is going on here that we could go look at even more closely.

[00:15:17] The other one is near misses. Admittedly, this one is harder than the other ones, but I do believe if you invest in this, you are going to start to see some very interesting things. How many near misses did you have versus actual incidents?

[00:15:31] How people decide what a near miss is or is not might be just as hard as saying what MTTR is, so I will give you that. But if you can look at how many times things did not fall over compared to how many times they did, then you are starting to actually look at the numerator and the denominator. You start to understand how much more success you are having and what the sources of those successes in your systems are.
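
A minimal sketch of tracking near misses against actual incidents month by month. The event log format, the labels `"near_miss"` and `"incident"`, and the counts are made up for illustration; deciding what counts as a near miss remains a judgment call.

```python
from collections import Counter

# Hypothetical event log: (month, kind) pairs.
events = (
    [("2022-07", "near_miss")] * 3 + [("2022-07", "incident")] * 1
    + [("2022-08", "near_miss")] * 4 + [("2022-08", "incident")] * 2
    + [("2022-09", "near_miss")] * 2
)

def monthly_near_miss_ratio(events):
    """Near misses per incident for each month; None if there were no incidents."""
    counts = Counter(events)
    months = sorted({month for month, _ in events})
    ratios = {}
    for month in months:
        incidents = counts[(month, "incident")]
        near_misses = counts[(month, "near_miss")]
        ratios[month] = near_misses / incidents if incidents else None
    return ratios

ratios = monthly_near_miss_ratio(events)
for month, ratio in ratios.items():
    print(month, "no incidents" if ratio is None else f"{ratio:.1f} near misses per incident")
```

Months with no incidents are reported as `None` instead of dividing by zero, so a quiet month stays visible in the series.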

[00:15:58] You also find out weird things like knowledge gaps, assumptions, and misaligned mental models, and you start to develop a sense of what your safety margins are for your systems before they get to showing up on your status page.

08 IBM example and closing

[00:16:12] Right now you are thinking, Courtney, that is ridiculous. That will never work. That will never fly at my organization. This is entirely impossible. And yet, up one floor and down the hall somewhere, I do not know exactly what room he is in, David Lee from IBM is giving a talk about how they are doing this right now in an organization of 12,000 people.

[00:16:33] These are his beautiful slides. I have two of them that I put in here: his T-shirt-size learning from incidents, how they have managed to look at the ways that they do these, what the effort is, and all of those things. They have implemented a monthly CIO learning from incidents meeting.

[00:16:52] They went from doing a root cause analysis only when things were on fire, then moving on with checklists and never looking back again, to starting to track exactly the kinds of numbers I was just telling you about.

[00:17:04] Starting in early 2022, they started looking at how many people are coming to these monthly CIO learning from incidents meetings. I am nearsighted, so I cannot read those. I am going to stand over here: 85 to 100 participants attending each monthly meeting, 27 people attending more than one meeting, 100 unique views of the meeting recordings, and 300 unique views of the reports.

[00:17:30] We run internet properties, right? We know how to collect these kinds of metrics. This is easy. Actually, you can do this. I want to encourage you to invest in that so that you can learn more from your systems.

[00:17:44] That is my general argument. We need a new mindset, toolset, and skill set for talking about, analyzing, learning from, and sharing incidents.

[00:17:53] I harbor a pretty deep suspicion that, much as DORA and all of the metrics that we have collected so far about DevOps have shown, organizations that do these things will have some form of competitive and performance advantage.

[00:18:13] This new approach treats incidents as opportunities to learn, favors in-depth analysis over shallow metrics, views humans not as the things that cause the problems but as the solutions and the ways that we actually develop adaptive capacity for future problems, and studies what goes right along with what goes wrong. There is a whole secret sauce in there that companies can unlock if they are willing to invest in it.

[00:18:40] Gene makes everyone put one of these slides in, but I had it anyway. How can you help? How can you get involved, if this piques your interest in the slightest bit? Analyze your incidents. Start with one. That is the DevOps way, is it not? Start small, demonstrate the value, get a little skunkworks team of people who want to do that, who want to go along with you, and then build that trust within your organization. That is how David did it at IBM. That is exactly how he did it.

[00:19:07] Here are a couple of links for places you can go to learn more about developing that skill set within your organization. Then, of course, I want you to submit your incident reports to the VOID, because the more data we have, the more we can learn and the better we can get at understanding what is happening here. There is a really not very good form. If you have a lot of them, you can just email me, and I will help you out.

[00:19:28] Last but not least, get involved. We do have a membership program that helps people learn more from these types of data, from these types of incidents, connects like-minded folks together, and we would certainly love to have you all join us in the VOID.

[00:19:46] That is the end. There are two reports you will be able to download soon. The first one, if you go to this URL, you can get a lot of the data and a lot of the ideas behind what I have already explained today. You will get that immediately. I cannot tell you yet when you will get the second one, but it is coming soon. If you go to that page and you put in your email, you will get the first one immediately and the second one very soon. With that, I say thank you.