How Your Systems Keep Running Day After Day

San Francisco 2017

My goal today is twofold. One, I'm intending to challenge you. I'm hoping to provoke new thoughts, new questions in your mind. If, by chance, any of these new questions give rise to some anxiety, I want to assure you that that's quite normal. Don't worry. We'll get to some sort of resolution at the end. The anxiety may remain.

John Allspaw, CTO/Researcher, Adaptive Capacity Labs

Full transcript

The complete talk, organized by section.

John Allspaw

But before we get started, you'll notice I changed the title of the talk to "How Your Systems Keep Running Day After Day," because that's really the general gist of this.

So before we get started, I want to start with something. Can everybody read this? I don't know why everybody's laughing. I want to ask, don't worry, it's rhetorical because I have the microphone: is this dangerous?

Well, at the very least, my expectation is that you'd say that it depends, right? Much like Nicole was saying earlier.

All right, so let's take another one, a little bit more complicated. All right, so what you see here is a diff. You see a change. This change is to an HTML comment. Changed the case on the K, right?

Is this dangerous?

Would your answer change if I tell you that this is for a load balancer health check?
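To make the health check example concrete, here is a minimal sketch of why a one-character change to a comment can matter. It assumes the load balancer decides a server is healthy by looking for an exact, case-sensitive marker string in the response body. The marker `<!-- OK -->`, the page contents, and the function name are all hypothetical; the slide's actual diff is not reproduced here.

```python
def health_check_passes(body: str, expected: str = "<!-- OK -->") -> bool:
    """Return True when the response body contains the exact marker string.

    Assumption: the check is a case-sensitive substring match.
    """
    return expected in body


# Hypothetical page before the change: the comment matches the marker.
before = "<html><!-- OK --><body>status</body></html>"
# Hypothetical page after the diff: only the case of the K changed.
after = "<html><!-- Ok --><body>status</body></html>"

print(health_check_passes(before))
print(health_check_passes(after))
```

Under that assumption, the second page fails the check even though nothing a browser renders has changed, so whether the diff is dangerous depends entirely on what reads the comment.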

Okay, so let's get started. The point of both of those is that all work is contextual. "It depends" is an answer we give quite a lot, and that's important. We'll come back to this.

So here's a slide about me. I won't spend too much time on it. Here are some of the places I've worked and things I've written, some places that I've studied.

As Gene mentioned, I gave this talk. I want to just point out that the last time I felt so strongly about the topics that I'm about to talk about was 2009, when I gave that talk with Hammond. So what I want to talk about is new, it is different, and I feel very, very strongly about this.

So another piece that might be relevant is the degree in human factors and system safety. My thesis was Trade-offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages. This helps set the stage, I guess, a little bit.

I don't want you to worry too much about this. Some of you may have heard of what's called the STELLA Report. I'll put the link up later. At a high level, this report is the result of a year-long project of a consortium of industry partners: IBM, Etsy, and IEX, which is a trading company, a trading exchange in Manhattan.

And over this year, folks from The Ohio State University Cognitive Systems Engineering Lab, David Woods, Richard Cook, and a number of other folks, looked deeply at an incident in each of those organizations. And despite the fact that those organizations differ from a funding, from a resourcing, from a market standpoint, from a population standpoint, they found these six themes that were common across all of them.

What's most important is, certainly the results are quite important. It's how that research was done that I want you all to take a look at a little bit later. And yeah, just as a quick little bit of a cliffhanger: postmortems as recalibration, I'm going to talk a little bit about that. Blameless versus sanctionless, controlling the cost of coordination, visualization, strange loops, and something that I want to pique your interest on, dark debt. Okay, so that's the STELLA Report.

Here are my main points that I'm going to give you.

One, we have to start taking human performance seriously in this industry. And if we don't, we will continue to see brittle systems with ever-increasing impacts on our businesses and on society.

Number two is that we can do this by looking at incidents, going beyond what we currently do in postmortems or post-incident reviews or after-action reviews or whatever the hell you call them.

Number three is that there do exist methods and approaches from the study of resilience in other domains, but they require real commitment to pursue. We're going to talk about this.

Doing this is both necessary and difficult, but it will prove to be a competitive advantage for businesses who do it well. So that's the high level.

So first, I want to start with a little bit of a baseline, a bit of vocabulary that's going to be important as I sort of walk you through this. I'm going to describe a sort of picture representation, like a mental model of your organizations, and it's going to have an above-the-line region and a below-the-line region.

So we'll start with this. If you imagine what we have depicted here, don't worry about it being a cloud, just think of it as like a bubble. What this is here is your product, your service, your API, or whatever, that your business derives value from and gives to customers, okay? And inside there, what you see are your code, you see your technology stack, you see the data, and some various ways of delivering this, right? Presumably over the internet or some other sort of way.

Now, if we stay here, nobody's going to believe me that that's what we call the system. It's fine, but it's not really complete. So what's really connected, and I think what a lot of people have been talking about here in this community, especially in the last couple of days, is all of the stuff that we do, and this is really familiar, to manipulate what goes on in there.

And so we have testing tools, we've got monitoring tools, we've got deployment tools, and all of the stuff that's sort of wired up. These are the things that we use. And you could say that this is the system, because many of us spend our time focused on those things that are not inside the little bubble there, but all of the things that are around it.

But if we were to stay just with this, we won't be able to see where real work happens. So what we're going to do here is we're going to draw this line. This is a line that we call the line of representation. And then, play with this a little bit further.

What we see here is you, all the people who are getting stuff ready to add to the system, to change the system. You're doing the architectural framing. You're doing monitoring, right? You're keeping track of what it's doing, how it's doing it, and what's going on with it.

Now, you'll notice that each one of these people has some sort of mental representation about what that system is, and if you look at it a little bit more closely, you'll see that none of them are the same. By the way, that's very characteristic of these types of roles. Nobody has the same representation of what is below the line.

So to summarize, I'm going to get in a little bit of a view here. Your product or service is here. This is the stuff you build and maintain it with, and here's where work actually happens.

So this is our model of the world, and it includes not just the things that are running there, but all of you, the kinds of activities you're performing, the cognitive work that you're doing to keep that world functioning. And if we play with this a little bit more, we end up with this kind of model.

So this model has a line of representation going through the middle, and you interact with the world below the line via a set of representations. Your interactions are never with the things themselves. You don't actually change the systems. What you do is that you interact with a representation, and that representation is something about what's going on below. You can think of those green things as the screens that you're looking at during the day.

But the only information that you have about the system comes from these representations. They're just a little keyhole, right? And what's significant about that is that all the activities that you do, all of the observing, inferring, anticipating, planning, correcting, all of that sort of stuff, has to be done via those representations.

So there's a world above the line and a world below the line. And although we mostly talk about the world below the line as if it's very real, as if it's very concrete, as though that's the thing, here is the surprise. Here is the big deal.

You never get to see it. It doesn't exist. In a real sense, there is no below the line that you can actually touch. You never ever see code run. You never ever see the system actually work. You never touch those things.

What you do is that you manipulate it in a kind of, well, not imaginary. It's not imaginary. It's very real. But you manipulate a world that you cannot see via a set of representations, and that's why you need to build those mental models, those conceptions, those understandings about what's going on.

Those are the things that are driving that manipulation. It's not the world below the line that's doing it. It's your conceptual ability to understand the things that have happened in the past, the things that you're doing now, and why you're doing those things, what matters and why what matters matters.

So once you adopt this perspective, once you step away from the idea that below the line is the thing you're dealing with and understand that you're really working above the line, all sorts of things change.

And what you see in the STELLA Report and that project, and other projects that we've been engaged with, is taking that view and understanding what it really means to take the above-the-line world seriously. So this is a big departure from a lot of what you've all seen in the past, but I think it is a fruitful direction that we need to take.

So this is the bit here that I want to bring your attention to. In other words, these cognitive activities, in both individuals and collectively in teams up and down the organization, are what makes the business actually work.

So now I've been studying this in detail for quite a while here, and I can tell you this: it doesn't work the way we think it does.

And so finally, to sort of set this frame up, the most important part of this idea is that all of this changes over time, right? It is a dynamic process that's ongoing. So this is the unit of analysis I want you to have in your mind as we talk through this.

So once we take that frame, we can ask some questions. We can ask some questions about above the line, like this. How does our software work, really, versus how it's described in the wiki and in documentation and in the diagrams? We know that they're not comprehensively accurate.

How does our software break, really, versus how we thought it would break when we designed safeguards and circuit breakers and guardrails?

What do we do to keep it all working?

So question: imagine all of your organizations today, starting today. Imagine at 6:00, all of your companies, hands off keyboard. They don't answer any pages. They don't look at any alerts. They do not touch any part of it, application code or networks or any of it.

Raise your hand if you are confident that your service will be up and running after a day.

I thought I'd actually see more hands.

For those of you, raise your hands really high. All right. There you go. Keep your hands up if you think it'd still be working after a week. You're not touching anything. You cannot respond to any bit.

Keep your hand up if you think that your system is still going to be running after a month.

All right. I want to talk with you two afterwards.

So, the question then is how to discover what happens above the line. Well, there's a couple of things. We can learn from the study of other high-tempo, high-consequence domains, and if we do, we can see that we can study incidents.

By the way, when I say incidents, I mean outages, degradations, breaches, accidents, near misses, glitches, basically untoward or unexpected events.

What makes incidents interesting? Well, the obvious one is lost revenue and reputation impacts on a particular business. I want to assert a couple of other reasons why incidents are interesting.

One is that incidents shape the design of new components, subsystems, and architectures. In other words, incidents of yesterday inform the architectures of tomorrow, right? Incidents help fuel our imaginations on how to make our systems better. And therefore, what I mean is incidents below the line drive changes above the line. And that's the thing.

This can cost real money. Incidents can have sometimes almost tacit or invisible effects, sometimes significant. Raise your hand if you're splitting up a monolith into microservices.

No kidding. Right. A lot of people do that because it provides some amount of robustness that you don't have. Where do you get that? You're informed by incidents.

Another reason to look at incidents is that incidents tend to give birth to new forms of regulations, policies, norms, compliance, auditing, constraints, that sort of thing. Another way of saying this is that incidents of yesterday inform the rules of tomorrow, which influence staffing, budgets, planning, roadmaps.

Let me give you an example. In financial trading, the SEC has put into place Regulation SCI. SCI, I'm just going to go out on a limb and say, is probably the most comprehensive and detailed piece of compliance in the modern software era. And the SEC has been very explicit: we have this as a reaction to the flash crash of 2010, to Knight Capital, the BATS IPO, the Facebook IPO. It is a reaction to incidents.

Even if you go back a little bit further, it's often cited that PCI DSS came about when MasterCard and Visa compared notes, realized they lost about 750 million over 10 years. So incidents have significant... And by the way, as a former CTO of a public company, I can tell you, I can assure you, that this is a very expensive, distracting, and inevitably burdensome albatross for all of your organizations.

So incidents are significant in this way, too.

But if we think about incidents as opportunities, if we think about incidents as messages, encoded messages that below the line is sending above the line, and your job is to decode them. If you think about incidents as things that actively try to get your attention to parts of the system that you thought you had a sufficient understanding of, but you didn't. These are reminders that you have to continually reconsider how confident you are about how it all works.

Now, if you do, and if you do take this view, a whole bunch of things open up. There's an opportunity for new training, of course, new tooling, new organizational structures, new funding dynamics, and possibly insights that your competitors don't have. I'm going to unpack this a little bit.

Incidents help us gauge the delta between how your system works and how we think your system works. And this delta is almost always greater than we imagine.

I want to assert perhaps a different take than you might be used to, and it's this. Incidents are unplanned investments in your company's survival. They are hugely valuable opportunities to understand how your system works, what vulnerabilities in attention exist, and what competitive advantages you are not pursuing.

If you think about incidents, they burn money, time, reputation, staff. These are unavoidable sunk costs. Something's interesting about this type of investment, though. You don't control the size of the investment.

So therefore, the question remains, how will you maximize the ROI on that investment?

So, switch gears a little bit. When we look at incidents, these are the type of questions that we hear, and it's quite consistent with what researchers find in other complex systems domains.

What's it doing? Why is it doing that? What will it do next? How did it get into this state? What the fuck is happening? If we do Y, will it help us figure out what to do? Is it getting worse? It looks like it's fixed, but is it? If we do X, will it prevent it from getting worse, or will it make it worse? Who else should we call that can help us? Is this our issue, or are we being attacked?

Right. This is consistent with many other fields: aviation, air traffic control, especially in automation-rich domains.

Another thing that's notable is that at the beginning of any incident, it's often uncertain or ambiguous whether this is the one. This is the one that sinks us. This is our Equifax moment. This is our Three Mile Island moment.

At the beginning of an incident, we simply don't know, especially if it contains huge amounts of uncertainty and huge amounts of ambiguity. If it's uncertain and ambiguous, it means that we've exhausted our mental models. They don't fit with what we're seeing, and those questions arise. Only hindsight will tell us if that was the event that brought the company down or if it was, eh, a tough Tuesday afternoon.

Incidents provide calibration about how decisions are focused, about how attention is focused, about how coordination is focused, about how escalation is focused, the impact of time pressure, the impact of uncertainty, the impact of ambiguity, and the consequences of consequences.

Research validates these opportunities. We should look deeply at incidents, non-routine challenging events, because these tough cases have the greatest potential for uncovering elements of expertise and related cognitive phenomena. From Gary Klein, the originator of naturalistic decision-making research.

And there's a family of well-worn methods, approaches, and techniques: cognitive task analysis, process tracing, conversational analysis, the critical decision method.

How we think postmortems have value looks a little bit like this. An incident happens. Maybe somebody will put together a timeline. We have a little bit of a meeting. Maybe you've got a template, and you fill that out, and then somebody might make a report or not, and then you've got, yeah, action items. Finally. Right?

And we think that the greatest value, perhaps maybe the onliest value, is this, right? When you're in a debriefing and people are walking through the timeline and you're like, "Oh my God, we know all this. Can't we just get to the..."

This is not what the research bears out.

The research bears out that if we gather subjective and objective data from multiple places, behavioral data, what people said, what people did, where they looked, what avenues of diagnosis they followed that weren't fruitful, then well-facilitated debriefings get people to contrast and compare their mental models, which are necessarily flawed.
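As a rough illustration of gathering data from multiple places before a debriefing, the sketch below interleaves events from separate sources into one timeline ordered by time. Everything in it is an assumption made for illustration: the `Event` fields, the source names, and the sample entries are invented, and it does not represent the method used in the STELLA Report.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import chain


@dataclass(frozen=True)
class Event:
    at: datetime   # when it happened
    source: str    # hypothetical source name, e.g. "chat" or "alerts"
    who: str       # person or system that produced the event
    what: str      # what was said or done


def build_timeline(*sources):
    """Interleave events from every source into one list ordered by time."""
    return sorted(chain.from_iterable(sources), key=lambda e: e.at)


# Invented sample data, for illustration only.
chat = [
    Event(datetime(2017, 1, 1, 14, 3), "chat", "dba", "is the database wedged?"),
    Event(datetime(2017, 1, 1, 14, 9), "chat", "app-eng", "turning off feature flags"),
]
alerts = [
    Event(datetime(2017, 1, 1, 14, 1), "alerts", "monitoring", "health check failing"),
]

timeline = build_timeline(chat, alerts)
for e in timeline:
    print(f"{e.at:%H:%M} [{e.source}] {e.who}: {e.what}")
```

A merged timeline like this is only raw material for a facilitator. The comparing and contrasting of mental models still happens in the debriefing itself.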

You can produce different results, including things like boot camp, onboarding materials, new hire training. You can have facilitation feedback if you build a program to train facilitators. You might make roadmap changes, really significant changes based on what you learn.

I can tell you this from some experience. There is nothing more insightful to a new engineer, or an engineer just starting out in their career, than being in a room with a veteran engineer who knows all of the nooks and crannies, explaining things that they may not have ever said out loud. They have knowledge. They may draw pictures and diagrams that they've never drawn before because they think everybody else knows it.

Guess what? They don't.

The greatest value is actually here, because the quality of these outcomes depends on the quality of that recalibration. This is an opening to recalibrate mental models. From the STELLA Report, it informs and recalibrates people's models of how the system works, their understandings of how it's vulnerable, and what opportunities are available for exploration.

In all of the research contained in the STELLA Report, and it fits with my experience at Etsy as well, one of the strongest reflections from people who, in a facilitated way, do this comparing and contrasting is: "I didn't know it worked that way."

And then there's always another: "How did it ever work?"

Which is funny until you realize it's serious. And what that means is not only did I think it worked a different way, now I cannot even imagine it. I can't even draw a picture in my mind of how it could have possibly worked. That should be more unsettling.

By the way, I want to say, this is not alignment. Like I said, via representations, we necessarily have incomplete mental models. The idea here is not to have the same mental models because they're always incomplete, because things are always changing, and because they're going to be flawed. We don't want everybody to have the same mental model. Then everybody's got the same blind spots.

Blameless. This is the blog post that I wrote in 2012.

Blameless is table stakes. It's necessary, but it's not sufficient. You could build an environment, a culture, an embracing, welcoming organization that supports and allows people to tell stories in all of the messy details, sometimes embarrassing details, without fear of retribution, so that you could really make progress in understanding what's happening.

You can set that condition up and still not learn very much. It's not sufficient. It's necessary, but not sufficient.

So what I'm talking about is much more effort than typical post-incident reviews, right? This is where an analyst, a facilitator, can prep, collating, organizing, analyzing behavioral data, what people say, what people do.

There's a raft of data that they can sift through to prep for debriefings, a group debriefing or a one-on-one debriefing. Postmortems hint at the richness of incidents. Following up on this takes a lot of work.

By the way, everyone's generally so exhausted after a really stressful outage or incident or event that sometimes everything becomes crystal clear. That's the power of hindsight. And because it seems so crystal clear, it doesn't seem productive to have a debriefing because you think you already know it all.

The other issue is that postmortem debriefings are constrained by time as well. You only have the conference room for an hour or two, right? Everybody's really busy, and the clock is ticking. So this is a challenge for doing this really well, even given those research methods.

There's a handful of things on this slide that people in the back aren't going to be able to read. I'm going to read it to you. Don't worry about it.

The other issue, especially if you build a debriefing facilitation training program like I did at Etsy, there's still challenges that show up. What I like to call it is everyone has their own mystery to solve, or don't waste my time on details I already know.

And in a cartoonish way, you can think about it this way.

Before you go into a debriefing, a network engineer might say, "Oh, this outage seems pretty straightforward. I just don't know why they didn't call me to help sooner."

A DBA might be thinking, "I understand how the database got wedged. I hope we don't waste time going over that part. I just have no idea about how the load balancer got involved in all this, and I hope we cover that because that's the real mystery."

The CEO might think, "Well, I need answers quick. I got three board member voicemails lighting up my phone, and we don't have time to waste on details that don't matter."

Customer service agent says, "I hope I can get a word in edgewise in this. I don't understand how it's so hard to give customers updates more frequently, and that's the real priority here."

And the application engineer says, "I'm glad we used all those feature flags to turn them off, because without them it could have been a lot worse. I just don't understand how the database got stuck. It's such a black box."

By the way, you have an hour. Extract as much learning as you can.

All work is contextual. For the two examples I gave you earlier, the answer is it depends. Your job to maximize ROI is to discover, explore, and rebuild the context in which work is done: in an incident, how work was done and how people thought above the line.

Assessments are trade-offs, and those are contextual.

All right. We are almost at time here. I'm going to leave you with a couple of ideas here.

All incidents can be worse. A superficial view is to ask what went wrong, how did it break, what do we fix? And these are very reasonable questions.

If we were to take a deeper level, we could ask, what are the things that went into making it not nearly as bad as it could have been? Because if we don't pay attention to those things and don't identify those things, we might stop supporting those things.

Maybe the reason why it didn't get worse is because somebody called Lisa, and Lisa knows her shit.

Something from research is that experts can see what is not there. If you don't support Lisa, and you don't even identify that the reason why shit didn't get worse is because Lisa was there, forget about action items for fixing something. Imagine a world where Lisa goes to a new job.

Useful at a strategic level is a better question: how can we support, encourage, advocate, and fund the continual process of understanding our systems, and really take above the line seriously in a sustained way?

So here are some challenges for you.

One, circulate the STELLA Report in your company and start a dialogue. Even if you're too busy or you're not in a position to read it yourself, give it to people who are. Ask them what resonates. Ask them what doesn't make sense. Ask them what makes them say, "Yes, this is a thing that I didn't have words for before, but now I do." Start a dialogue.

Second, look deeply at how you're handling post-event reviews. And most importantly, go find the people who are the most familiar with the messy details of how shit gets done and ask them this: what value do you think our current post-incident reviews really have?

And listen.

Will you learn more and faster from incidents than your competitors? You're either building a learning organization or you're losing to one who is.

So lastly, we need to take human performance seriously. This discussion is happening. It's happening in nuclear power. It's happening in medicine. It's happening in aviation, air traffic control, in firefighting.

The increasing significance of our systems, the increasing potential for economic, political, and human damage when they don't work properly, and the proliferation of dependencies and associated uncertainty all make me very worried.

If you look at your own system and its problems, I think you'll agree that we have to do a lot more than acknowledge this. We have to embrace it.

So what you can help me with: please spread this presentation and these ideas. Come to me. What resonated with you about this? What didn't? What challenges do you face in your org along these lines? Come tell me. I'm on Twitter. You can find me. It's quite easy to find me. It's actually quite hard to get rid of me.

The last is that if you're interested in talking more, the group that I've been working with, we have two different vehicles. The STELLA Report was produced by this consortium. This is otherwise known as the SNAFU Catchers. And for direct sort of partnership in a more traditional sort of consulting and training situation, we've launched Adaptive Capacity Labs.

Thank you for listening.

Gene Kim

Actually, whoa. Actually, one moment.

John, if you can just stay on the stage for one moment. I just want to take a moment to thank John for his contributions to our profession. I think it's impossible to overstate how much John has contributed to the work that we do every day. In fact, some of the wisdom that's come from your disciples, from people who work with you, is almost poetic.

John Allspaw

Don't say that.

Gene Kim

Bethany Macri said, "Safety requires prevention. Prevention requires honesty. Honesty requires absence of fear."

For me, it almost borders on poetic. So if you don't mind, could you just all take a moment to say, "Thank you, John."

Audience

Thank you, John.

Gene Kim

All right. Thank you, John.