Working at the Center of the Cyclone
Ironically, our increasingly automated systems have increased the demand for thoughtful, skilled, and skeptical human experts capable of understanding them and intervening, often under intense pressure and high uncertainty.
As the underlying technical entities increase in complexity and importance, these pressures, uncertainty, and the stakes of expert decisions also increase. We’ve been studying what these people are doing and how they do it.
The results are exciting, encouraging, and worrisome. Exciting because the studies open a new window into how distributed systems are resilient. Encouraging because the expert work is so sophisticated and purposeful. Worrisome because it is clear that we are not yet building technologies that are team players and the path forward is not clear.
What is clear is that understanding the cognitive work of experts is a powerful way of understanding the "system" and grounding our own understanding of where opportunities and vulnerabilities lie.
Full transcript
The complete talk, organized by section.
Introducer
Gene had asked me if I wanted to introduce the next speaker, and I can tell you I'm honored to do it. I am a superfan of Dr. Richard Cook. I think the stuff that he is introducing in our industry is incredibly important.
I can go over all the reasons why I think he's an amazing man, but then he'd get mad at me. I will say one thing: my introduction to him was almost 10 years ago, a paper called "How Complex Systems Fail."
Again, it's my honor to introduce Dr. Richard Cook.
Dr. Richard Cook
Any Buckeyes? Yeah, a few? Okay. It's an in-joke.
Actually, let me start with this. I'm a middle-aged--actually, I'm an older white guy from Middle America who's had a very conservative career. And so now I find, given the political situation in the United States, that I have to differentiate myself from people who are other older white guys from Middle America, who are thought often to be supporters of various kinds of reactionary views.
And so my political position is exactly described by this. And if anybody wants to look at this, you'll understand who I am and where I come from. And anybody who wants to come to my session tomorrow, I'll give you a sticker like this for you to put on your computer.
I want to talk to you a little bit today about some of the problems that we have in the center of the cyclone, the center of the cyclone being the place where people are trying to keep the systems that you build running.
This is really the result of a bunch of research that we've done. I'm going to tell you a little bit about that and go on and give you some basic ideas. But there's really just three kinds of things that I want to talk about.
The first is clarifying the system. And by describing what is the system, we get a completely different view of the world. And I'm hoping that this will change everything for you. It's something we've been working on for a while.
The second idea is that experience with the incidents that we're having doesn't fit our paradigm. And this problematic nature of the incidents that we're having should inform us about how to approach systems differently, but we're having trouble with that. There's a whole bunch of problems with that.
And I'm going to end up with a little bit of a theoretical thing and ask you to think about whether or not one of the things that we derive from this view of the system might actually be true.
This is all part of a much larger story. I'm not going to tell you that story today, or even try to. Those of you who have some interest in this and want to talk some more, we're going to have an informal conversation tomorrow with circular chairs, hopefully, in Breakout D at 3:15 in the afternoon. So if this stuff piques your interest and you want to talk some more about it informally, we have an hour and a half to do that tomorrow afternoon.
Okay? What I'm going to talk to you about today is a set of research results that have been going on since we began this work around 1987 or so, which is before some of you were born. And we've done this work in a whole bunch of different areas. We've done it in nuclear power plants, in aviation, in military systems, in semiconductor wafer fabrication, in a whole bunch of different places.
The results are basically the same across all these industries, including yours. Okay? So what I'm talking about is a very broad kind of idea, but I'm going to try and focus it in on IT systems and how this works.
The most surprising--well, it's not surprising that your system sometimes fails. What's surprising is that it ever works at all. This is really true. If you were to look inside the systems that you folks build, the conclusion that we would come to is: none of that ought to work. The way you do it, the way you build it, what you're building it on, it should never work at all.
And in fact, it's quite surprising that it works at all. It is not the case that it's surprising that it sometimes fails. Failure is the normal function of your systems, not the abnormal one.
Just to give you a sense of how this might play out, let me give you this example. Here are some failures on the right side of the screen. You may recognize those. Please don't feel bad if your company is not included. It just meant that it was an oversight on my part.
Yes, there are actually some local things here that might be of interest to you: pound sterling, King's College, UCL, NHS. A few of those, and some others. And you can see them on the left-hand side here.
Oh. Well, you get what I'm after.
Systems fail. Systems fail all the time. And by the way, for each one of these celebrated failures that you could read about in the press, there are hundreds or even thousands of failures of the systems that you run happening all the time.
In fact, I've been unable to find any large enterprise that doesn't experience at least one significant outage a day. Now, that doesn't mean that the customer notices it, but that does mean that there are people scrambling around in the background trying to keep the thing alive before the customer does.
Anybody in here who doesn't think that is happening in your system should probably look a little more closely at what's going on.
Now, there is some good news. The good news is that every outage could've been worse. Even the outages where you're down for several hours, that could've been worse, right? And the other nice thing here, the good news is, you've got a whole bunch of people who are working to limit the damage and fix the broken stuff, and they are remarkably successful at what they're doing.
Okay? So feel good about it. Every outage could have been worse, and you've got people to fix stuff. Now, that's the good news.
The not-so-good news is something's always broken. Your systems are always broken. There's never a time when you don't have broken stuff. And by the way, the complexity of your system is increasing, and everybody who's spoken today has been talking about trying to thread the needle on how to deal with increasing complexity and the problems that it brings.
So this is not necessarily a really good augury for our future. But the point here is that broken, abnormally functioning systems are the normal. That's not the abnormal. Your system is broken all the time. That's its normal state.
Rather than imagining that you can somehow build a system that doesn't break down and is somehow protected against that, what you might do is think that the systems that you're dealing with are themselves intrinsically failing and fail-prone.
And of course, the complexity thing that's going on here: all of the bits of your system are constantly changing. It's never static. It never stops. It is always changing all the time.
And it's complexity, but complexity and change are actually the same thing. There is no difference, truthfully, between complexity and change. Anything that is complex will necessarily be changing. Anything that is continuously changing will necessarily be complex. They're synonyms.
The big problem is that nobody knows what's going to matter next. Everybody's guessing. Everybody's trying to figure out what is going to be the problem next. Nobody knows. People who stand up and say, "This is what the next problem is going to be," are inevitably proven wrong.
A lot of you are very focused on continuous deployment. I think continuous deployment is a great idea, except that it requires continuous attention, scrutiny, and recalibration. All right? That's the downside, the dark side of continuous deployment.
This recalibration thing is a really critical idea. It goes to the core of what your system is, and we're going to talk a little bit about recalibration and why we need it, and how it's being done, how can we afford to do it, because it's very expensive, and how can we do it better.
Let me just say that other domains have similar problems. There's lots of domains other than IT that are confronting complexity and change and constant failure. I work in medicine. All right? Need I say more?
There's lots of solid research results available. There's a lot of resilience in your system that's already present and being used. Resilience is not something you're going to go out and buy. It's already there. It's playing out every day.
And there's diverse efforts underway to try and make systems work better. And I think we should pay attention not to this approach or that approach, but to the entire span of approaches that are being employed.
Now, there's a group of us who have been studying this. We call them the SNAFU Catchers. It includes companies like Salesforce, IBM, New Relic, KeyBank, IEX, which is an exchange company, Etsy, and of course my home, Ohio State.
SNAFU Catchers comes from a term that was used in the Second World War, which is SNAFU, which stands for "situation normal, all fouled up."
This is an early World War II term. There was even a group of people who flew aircraft to pick up downed sailors called the SNAFU Snatchers, and they would fly these flying boats out. They would go out and pick up downed airmen in the Pacific Ocean and bring them back.
This air-sea rescue thing, this idea of SNAFU snatching or catching, is a very important one because it illustrates the theme that I'm trying to get at very well.
During the Second World War, the Pacific campaign, there was a lot of expanse of ocean that people were flying over, and a lot of planes ended up in the ocean. They would break down. They'd have mechanical problems. They'd run into bad weather. They'd be shot down.
But there were a lot of airmen floating in life rafts in the Pacific Ocean, and there was a need to figure out a way to go out and find these people and bring them back. And one of the things that was used was a PBY flying boat. You can see it's got a boat shape to it, and you could actually land this on the ocean, pick people up, and then take off from the ocean and fly them back.
And it was actually called a flying boat. And there's lots of fascination for this thing. You can actually go see some of these in some museums and so on.
But the idea of this is that the boat itself is a kind of tooling, and you guys are mostly gearheads. That's a nice way of putting it. You like mechanics and tooling and machinery and so on. This is the tooling.
But the important thing about this event, the SNAFU snatching, was that in order to make use of this tooling, it was necessary to have a whole organization and group that was tuned to this mission. It was built over the space of six months, functioned very well for about three years, recovered huge numbers of air people who had been downed in the Pacific Ocean, was enormously successful.
You build things differently when you expect them to fail.
You build things differently when you expect them to fail.
Let me say this one more time, so I'm very clear about this. You build things differently when you expect them to fail.
Failure is normal. The failed state is the normal state. And the way you deal with failure is not by pretending that you are going to build systems that are completely defended against failure, but by building an organization that is able to recover from these failures and restore the functionality.
Now, we've written a thing called the STELLA Report. It came from a winter storm that occurred in New York when we were having a meeting to talk about this stuff with the SNAFU Catchers group. And the report's available to you out there if you would like to go read it. Some of this stuff is covered in there, including some of the pictures that I'm going to show you.
The bottom line here is that there's a lot of stuff that's happening on the other side of deploy. Almost all of our discussion so far has been about how to get things deployed to do stuff. But there's another story which is on the other side of deploy. What's happening over there? What's going on after deploy that makes these systems functional?
And the first thing that you realize when you look at this is that we don't have a very good description of what is our system.
Most of the time when you say to someone, "Tell me what your system is," they'll give you a description like this: we've got some internally sourced stuff, our software that we wrote, our applications and stuff. We got some delivery stack. That's stuff that delivers this thing to the outside world. And we've got this externally stored stuff, databases and things like that, all that stuff. That's the system.
But if you step back from that for just a moment, what you see is there's a lot of other stuff going on. There's code-generating tools and code libraries and test cases and code stuff and deployment tools and organizing tools and container tools and all the rest of that stuff, as well as some external services and monitoring stuff and all that stuff.
Really, we ought to include that in the system because that's how you make it happen, right? And so that is part of the system.
But as soon as you do this, you realize that you've got this kind of other stuff, right? You've got people who are writing the code. They're making the stuff, getting it ready to run. They're adding stuff to the running system. They're framing and building architectures. They're keeping track of what the system is doing.
They're engaged in all this stuff, and they're doing it continuously and intimately in ways that don't allow you to really separate that from the code.
So maybe this is the system.
By the way, there's lots of communication going on between those people, and they are engaged in a bunch of thinking about what's going on. They're building mental models of the system, and those mental models are unique. Sometimes they're deep, sometimes they're shallow. They're always incomplete. They're constantly being tested, and they're always being recalibrated.
And another thing that's kind of interesting here is that these people are seeing the system through a bunch of representations. That is, if you think about these green things here on the top of the code generating and all the rest of these, those are all representations of the system. They're the screens, the screen stuff over here that you scroll up. It used to scroll up.
Back in the old days, it would scroll up. It was green, too. All of it was green.
But these representations are what people are actually seeing. They're not actually seeing the system.
And what are they doing? They're observing, they're inferring, they're anticipating, they're planning, they're diagnosing. That's the work which we see as a group of activities. When you and I look at them, we see them communicating and talking and acting, but what they're doing is something else, this other work that's underneath here.
And in some ways, the key message here is that these representations form a kind of line. I'll call it the line of representation, and that there's stuff above the line and below the line.
And what's significant about this is, well, you never get to see what's below the line. Below the line is completely invisible. The only thing that you can see are the representations.
Now, this is a challenge to you because many of you believe that you know what's down there, but you never see it. It is not visible to you. You may never touch that. Those diagrams, that stack, all of those things, all those tools, you never actually touch those or see those things.
You cannot see them. They are invisible to you. All you can see are the representations. You have the screen, but you cannot see behind the screen. The only thing that is visible about your system is what's shown along that line of representation.
Now, I always get some pushback on this. There are always people who tell me, "I know what's down there." But you don't really. You know what's going on in the representations, and you might be able to reach out and do tests of things and see how those representations change, but it's all inference.
And all the action in your system is above the line. Everything that happens is, in fact, above the line.
Now, this is a really kind of difficult thing to get your mind around, but in fact, you get nothing but the representations. There's nothing to see here, folks. There's nothing underneath except what you infer. It's all in those mental models that those people are making.
And from this, you will realize that incidents, when they occur, are occurring up here, not down here. An incident is something that occurs in the mind of the person who's looking at the representation, not something that's occurring down in the system, in that underneath thing, below the line.
This is a really different way of looking at the world. But I think if you test it for a moment, you'll realize that it's correct. We never ever see what's below the line. We simply infer its presence from the representations that we make and manipulate.
And as a consequence, the mental models that people have, and you can see it in the drawing here, are not the same model. They're very much different because people are looking at different representations of the system and forming those models in different ways.
And there is no privileged mental model. There's no model that is the system. There are only different mental models which can be compared and contrasted and tested in a variety of ways.
This brings us to this idea about incidents being different from the paradigm. And I want to go through these things sort of one by one, but the way we understand what you understand about systems is not by asking you how they work. It's by watching how you deal with incidents.
The study of incidents is the revealing thing about how systems actually work and what's going on. And we use incidents to tell you.
The system, by the way, is awash in incidents. There's different kinds. There's the "this might be an incident" incident. You know those? And then there's the incident incident. And then there's the incident which might be the big incident. And then there's the OMG incident, which is thankfully fairly rare.
By the way, all the incidents look the same at the beginning. The OMG incident looks like the "this might be an incident" incident at the very beginning.
By the way, mean time to repair is nonsense. Anybody who tells you they know what the mean time to repair of their system is is clearly delusional.
Do you know how I know that? Because in one of the SNAFU Catcher studies, I went to a group and I said, "What's your mean time to repair?" He says, "Well, we had one this morning. It was 37 minutes."
I said, "That's pretty good. Tell me about that incident."
He says, "Well," he says, "a couple of weeks ago, the system started behaving sort of strangely."
If you define the beginning of mean time to repair as the moment you classify something as an incident, you've got to really be... well.
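To make the arithmetic behind that anecdote concrete, here is a minimal sketch. It is not taken from the SNAFU Catchers study: the calendar date, the variable names, and the 14-day figure standing in for "a couple of weeks" are assumptions for illustration. Only the 37 minutes comes from the story above.

```python
from datetime import datetime, timedelta

# Hypothetical timeline; the calendar date itself is arbitrary.
resolved_at = datetime(2000, 1, 15, 9, 37)
declared_at = resolved_at - timedelta(minutes=37)    # someone calls it an incident
first_symptom_at = resolved_at - timedelta(days=14)  # "started behaving sort of strangely"


def time_to_repair(start: datetime, end: datetime) -> timedelta:
    """Duration from whichever moment we choose to call the start until the fix."""
    return end - start


from_declaration = time_to_repair(declared_at, resolved_at)
from_first_symptom = time_to_repair(first_symptom_at, resolved_at)

print("measured from declaration:  ", from_declaration)
print("measured from first symptom:", from_first_symptom)
```

The same incident yields a repair time of 37 minutes or two weeks depending only on which moment is called the start, so any mean built on those numbers inherits that choice.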
Ordinary firms--I'm talking about firms like the ones that you work with--are experiencing one to five acknowledged incidents per day. That's one to five events that occur where people have to figure out: is this a minor thing? Is this an incident? Is this an important incident? Is this an OMG incident? And respond to it. This is the normal.
By the way, managing incidents has become a thing. Those of you who study this stuff will know this. Managing incidents is now a thing, right?
You have, in some cases, we found as many as 40 people responding to an incident. By 40, I mean 40 people check into the Slack channel after the incident is declared to say, "I don't know what's going on."
And you have multiple channels in IRC now. It's not just the war room channel. It's the customer channel. It's the "what are we going to tell the boss" channel. It's the "well, we don't really know what's going on, but we have to put something out for the media" channel, and all the other ones that you use for incidents.
In fact, the structure of your incident channels is very much a map of the functional organization.
You've got formal role assignments. How many of you have incident commander roles? Anybody have an incident commander? Yeah, that's the person who's going to have to bear the brunt of figuring out what's going on and fixing the problem.
You've got rules of behavior. Don't talk about this in the channel. Talk about that in the other channel.
You've got formal escalation policies. If the incident has lasted for more than 36 minutes, it's going to be declared a category one severity, and therefore PagerDuty will call up these five senior managers, and then the heat will really be on. So let's get it fixed before then.
This is the way it's really being done out there.
And by the way, almost everybody's involved in some sort of automation experiment. Everybody is building bots or applications to manage the incident. So automatically, when you declare an incident of a particular severity, if a certain period of time goes by without it being fixed, the robot will say, "Oh, well. Time to go up to the next level." Bump, and all of a sudden you're at the next level.
Anybody have anything like that? Yeah. It's coming soon to an operating world near you.
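A time-based escalation bot of the kind described here might be sketched as follows. This is an illustration under stated assumptions, not any vendor's or firm's actual policy: the severity numbering, the starting severity, and the 15-minute step are invented, and only the "more than 36 minutes means category one" rule comes from the talk. No real paging API is called.

```python
from datetime import timedelta

# Hypothetical policy. Severity 1 is the most severe.
INITIAL_SEVERITY = 3
ESCALATIONS = [
    (timedelta(minutes=15), 2),  # assumed intermediate step
    (timedelta(minutes=36), 1),  # threshold mentioned in the talk
]


def severity_for(elapsed: timedelta) -> int:
    """Return the severity an unresolved incident has reached after `elapsed`."""
    severity = INITIAL_SEVERITY
    for threshold, escalated in ESCALATIONS:
        if elapsed > threshold:  # "more than", so exactly 36 minutes does not trip it
            severity = escalated
    return severity


def should_page_senior_managers(elapsed: timedelta) -> bool:
    """In this sketch, only category-one incidents page the senior managers."""
    return severity_for(elapsed) == 1


for minutes in (10, 20, 36, 37):
    elapsed = timedelta(minutes=minutes)
    print(minutes, severity_for(elapsed), should_page_senior_managers(elapsed))
```

Note that the bot only watches the clock. It knows nothing about what the responders have learned so far, which is part of why the pressure lands on the people in the channel.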
The interesting thing about this is that, unlike the deploy side, which you guys have spent just an unbelievable amount of time talking about, there are very few pieces of automation over here. It's as though you kind of ignored the other side of deploy.
You guys have got Travis and Jenkins and all. You have names for everything. What names do you have for this side? Nothing. It's zip. You've done nothing. I mean, nothing. You've done nothing to help people over there.
Why? Well, because you keep imagining that at some point you're going to have this perfect development process, which is going to just make all that stuff go away.
Learning from incidents is actually essential, but it's really quite hard. Most of you do post-incident kinds of things. Those are usually regarded as chores by the people in your organization. People don't want to be in those incident reviews. They have to go to them. It's usually the four people who are stuck with that and have no other alternatives.
The focus is limited to what I call microfracture repair. This is broken, fix this, go on. This code, fix that, and go on. It's often a very pro forma chore, and there are virtually no deep lessons learned from this. Most of the time it never happens. So there's not really a problem with sharing the lessons because no lessons were learned.
It doesn't have to be this way, but this is the way that it functionally is, and it's largely because the pace of incidents is so high that nobody has time to do anything except the microfracture repair.
"As the complexity of the system increases, the accuracy of any agent's model of that system decreases." This is a statement from David Woods. It's a really important one because it, I think, goes to the core problem, which is that our ability to understand the systems that we have is being stripped away by the increased complexity of the systems that we are building.
That means that no individual agent can have an accurate model of the system. All we have are these models, and we have to exert constant effort in order to keep those models up to date and keep them coherent across the group.
And it turns out that this is a really fundamental problem. This is what we call the recalibration problem, which is that as we have experience with this system, as it generates behaviors that we can understand below the line, we have to, above the line, be constantly updating those mental models so that we get a representation that is useful to us in the next round of dealing with the system.
There's a wisdom in incidents. The wisdom in incidents is that they point to the specific places where the mental models of the system are out of phase with the functioning of this thing below the line.
If I tell somebody who's coming into the world that I work in, "I want you to know the system," and they say, "How do I do that?" And I say, "Well, I'd like you to go read these manuals," we're never going to see that person again.
You could easily take new people coming into your organization, you could say, "Well, here's a manual for Linux, and here's a manual for the programming languages, and here's a manual for the database, and go back and read all these, and when you've understood all that, come back to me and we'll talk some more."
It's an impossible task.
The most important information about where people are uncalibrated about what the system is doing comes from the incidents that you are having. They are untyped pointers. Does anybody know what an untyped pointer is? This is a test for the audience.
Ooh, great. I was at a CIO meeting, and no one knew.
These are untyped pointers that point to areas of the system that are of interest. They're untyped pointers. When you point to that area, you aren't pointing to anything specific. Your job is to figure out what type of thing is there in order to interpret that result.
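For readers who have not met the term, an untyped pointer is an address that carries no claim about what kind of thing lives there. The sketch below uses Python's standard `ctypes` module to show this. The byte string and variable names are arbitrary choices for illustration and have nothing to do with any real incident.

```python
import ctypes

# Raw bytes in memory: b"ABCD" plus the trailing NUL that ctypes appends.
buf = ctypes.create_string_buffer(b"ABCD")

# An untyped pointer: just an address, with no type attached.
untyped = ctypes.cast(buf, ctypes.c_void_p)

# Interpretation is the reader's job. Same address, two different readings.
as_text = ctypes.cast(untyped, ctypes.POINTER(ctypes.c_char * 4)).contents.raw
as_numbers = list(ctypes.cast(untyped, ctypes.POINTER(ctypes.c_uint8 * 4)).contents)

print(as_text)     # the four bytes read as characters
print(as_numbers)  # the same four bytes read as small integers
```

The pointer says where to look but not what is there. An incident, in this analogy, works the same way: it marks a place in the system, and the responders have to work out what kind of thing they are looking at.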
Incidents are messages sent from this thing down below the line about how the system really works. They're the only important pieces of information that can lead to a recalibration of a stale, inaccurate mental model.
And by the way, they are the prompts to engage in calibration, recalibration activities, but we're not using them that way. That's not what we are doing. We're using them as signals of microfractures that need to be repaired.
The problem that we encounter now is that you're shifting the kind of world that you're in and making it actually harder in a variety of ways. There are ever more dependencies that aren't yours to deploy. This stuff over here that I've diagrammed as external services is becoming a bigger and bigger part of the world that you work in.
And many of the incidents that we see now are not incidents related to the code that was written by people who work for you, but functions of the systems that are provided by things outside. The most obvious one that people have had trouble with is single sign-on or identity management, but there's lots and lots of others.
We are finding fewer and fewer deploy-related events. That is, this idea that we had, which was a piece of almost religious belief, that if we get these changes small enough, and we make them fast enough, they'll break the system immediately, and we'll immediately know what was wrong and be able to back it up, is no longer true.
What we're doing is injecting new modes of vulnerability, new kinds of failures into the system that play out over longer periods of time. And as a consequence, we're finding it more and more difficult to build mental models of the system that accurately represent what's going on.
That is, the kinds of things that used to work when we were responsible for the whole picture and could break the system in these little tiny bits are no longer playing out. It turns out that less than half of the incidents that we see are related to the last deploy.
By the way, that completely throws the idea about rollback out the window. If you think rollback is keeping you safe, you better think again. Because half of the incidents that you're likely to have are not from the most recent deploy.
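One way to see why rollback offers limited protection is a simple timing check, sketched below. It is a heuristic assumed for illustration and not a method from the talk: the function name and timestamps are hypothetical. The check is necessary but not sufficient, since trouble that starts after a deploy may still come from an external service rather than from the deploy itself.

```python
from datetime import datetime, timedelta


def rollback_is_plausible(first_symptom_at: datetime, last_deploy_at: datetime) -> bool:
    """Rolling back the last deploy can only undo trouble that began after it."""
    return first_symptom_at >= last_deploy_at


# Hypothetical timestamps; the calendar date is arbitrary.
last_deploy_at = datetime(2000, 1, 15, 8, 0)

slow_burn = last_deploy_at - timedelta(days=14)       # odd behavior began weeks ago
fresh_breakage = last_deploy_at + timedelta(minutes=5)

print(rollback_is_plausible(slow_burn, last_deploy_at))       # symptoms predate the deploy
print(rollback_is_plausible(fresh_breakage, last_deploy_at))  # symptoms follow the deploy
```

When the first symptoms predate the most recent deploy, reverting that deploy cannot be the whole answer, and the responders are back to rebuilding their picture of what the system is doing.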
The step change in complexity is not being matched by the monitoring tools that we have been building into the systems. We are building in complexities at a rate that is much higher than would be the case if we were writing the code ourselves. Because now we just turn it into a service thing. Oh yes, we're going to go out and buy this piece of SaaS stuff, and it will work just fine until it doesn't.
The look-back, roll-back technique is becoming less useful. You're reaching a crisis state. Your ability to fix the system is no longer supported by this idea that we will just simply roll back to the last deploy, because the last deploy, in many cases, is not the source of failure.
As the complexity of a system increases, the accuracy of any agent's model of that system decreases. It is becoming harder and harder to keep the mental models refreshed.
Well, I see we're coming close to the end of our time together. That's what my friends in the psychiatry department say. Those who laughed have been in psychotherapy.
So complexity has changed. I want to introduce this. This is a hypothesis that I have for you: structure and function above and below the line are interwoven. They're not separate things. The system includes above the line and below the line.
If you don't believe that, you are absolutely sunk. You won't be able to get anywhere. The system is what's above the line and below the line, and they are interacting constantly.
Any analysis that is based on looking at one side or the other, either a management-focused analysis or a technical system-focused analysis, will fail. Only analyses that take both and look at the interaction will be successful.
The changing pattern of incidents is pointing to the places where recalibration is likely to be valuable. That's the value of incidents. Incidents are bits of wisdom, and you ignore them at your peril.
Change and complexity are the same entity. Complexity is impossible without change. Change flows from complexity. They are not distinct things. This is a dynamic world. Any kind of static analysis you do is wrong. Anything you do that is not dynamic is wrong.
So here's the challenge to you. You'll all know Conant and Ashby's paper, "Every Good Regulator of a System Must Be a Model of That System." This is the basic idea of the law of requisite variety that's a fundamental part of control system theory.
The fundamental idea is that in order to have a working system, you have to have a controller that has the same complexity. I'm going to make this hypothesis, which says that all the distributed system behaviors that you find below the line, you will also find above the line. All the distributed system qualities that you know about that exist below the line, you will find above the line.
Okay. That's it.
You have an offer to make. Yes. Here we go. I have an offer to make.
Let me do this. These are the points. Come and see me tomorrow. We'll talk--I'm sorry. We'll talk...
I've got to walk through this thing again. Sorry. I did it almost perfectly. I was supposed to stop there at the end for the offer. I apologize.
Here we go. Here's the offer. Tomorrow, Tuesday, 3:15.
The previous speaker was talking about his long history in the industry, and I wanted to show you that myself. That picture that you'll see at the end is me back in 1974 with my first true love, the IBM 360 Model 44, which still has a lot to say for it.
What we have seen in the past is prologue to the future, and it's very important for us now to start thinking about ways in which we will handle the complexity that is coming to us in the future.
You are the resource that's going to do that, and anything that we can do that will help you, like talk about this some more tomorrow, is welcome.
Thank you very much for your time. Thank you, Dr. Griesemer. Thank you.