Fight, Flight, or Freeze - Releasing Organizational Trauma
In this talk I will explain the background of fight, flight, and freeze, and how it applies to organizations. I will give examples and suggestions on how to identify your own organizational trauma and how to help heal it.
Matty is a DevOps Advocate at PagerDuty, where he helps dev and ops teams advance the practice of their craft and become more operationally mature. He collaborates with PagerDuty customers and industry thought leaders in the broader DevOps community, and back when he drove, his license plate actually said "DevOps".
Matty has over 20 years of experience in IT operations, ranging from large financial institutions such as JPMorganChase to internet firms, including Apartments.com. He is a sought-after speaker internationally, presenting at Agile, DevOps, and ITSM focused events, including ChefConf, DevOpsDays, Interop, PINK, and others worldwide. Matty is the founder and co-host of the popular Arrested DevOps podcast, as well as a global organizer of the DevOpsDays set of conferences.
He lives in Chicago and has three awesome kids, who he loves just a little bit more than he loves Doctor Who. He is currently on a mission to discover the best pho in the world.
Matty Stratton
This talk is called "Fight, Flight, or Freeze: Releasing Organizational Trauma." Sometimes I feel like the title of this talk is longer than the actual talk itself, so one thing I'm going to start working on next year is maybe having more succinct titles for my talks, but that's all right.
Very quick introduction about myself, because you're not really here to hear about me. You're here to hear about things we can do to make our organizations better. My name is Matty Stratton. This is me on Twitter, engage with my brand. I podcast at Arrested DevOps. If you're a podcasting kind of person and you want to learn about DevOps, check us out.
I work for a company called PagerDuty. This is our mascot, Pagey. How many people have heard of PagerDuty? Okay, cool. Great. Now I've talked about PagerDuty, so now I'm done with that part. I'm not going to be talking about our product today, but I will give some examples of how we do work at PagerDuty, where that might come up. Ultimately, the whole point of a resume slide like this is to provide validation as to why you should listen to me. Basically, this is why: apparently I have the best hair out of any developer advocate.
Let's get right into this. The first thing is a forewarning: there's discussion of trauma and post-traumatic stress in this. I'm not going to talk about specific personal trauma, but I always like to make sure that that's clear. Disclaimer the second: while I am a trauma survivor, I am not a mental health professional. None of this is personal mental health advice.
We got that out of the way. Let's talk about zebras.
Imagine a zebra. It's easy to imagine because there's a picture of one. A zebra that's at rest with no threat of predators is operating within what's called the rest and digest function of the parasympathetic nervous system. He's just chilling out there, being a chill zebra, maybe eating some grass out on the savanna, thinking about making little baby zebras. Everything's cool and it's fine.
Then along comes a lion, and the lion starts chasing the zebra. When this happens, drastic physiological changes happen with the activation of what's called the fight-or-flight response of the sympathetic nervous system. The zebra's heart rate increases, the breathing increases, large amounts of stress hormones like cortisol and adrenaline are released into the bloodstream, pupils dilate, the blood pressure increases, and any non-essential functions like digestion stop. The zebra's nervous system is preparing for it to literally run for its life.
If the zebra is caught, its nervous system is overwhelmed and it's got no further solutions. So what does it do? It plays dead. It freezes. This is called the freeze response. The lion may then drop the zebra and say, "Okay, I caught this one. I'm going to go catch another one." So the zebra might get away. If the lion leaves and the zebra is able to escape, does anybody know what a zebra does in this case? It's okay if you don't. It shakes. It literally shakes it off, and it returns its nervous system back to a resting state, back to rest and digest. A traumatic thing happens to the zebra, it's able to release it and move on and become a chill zebra.
I'm about to tell you something. There's a lot of really great learning that you're going to have here at DevOps Enterprise Summit over these days. But if there's one thing that is probably the most important thing you're going to learn over these three days, it's this next bit. Be ready: humans are not zebras.
If you learn nothing else, know this. What do I mean by this? The autonomic nervous system is common to all mammals. That's the fight-or-flight response. But humans have this thing called a prefrontal cortex. Zebras and lesser mammals don't have that. Usually, the prefrontal cortex is an advantage. This is where executive function happens. This is how we are able to identify the difference between good and bad, same and different, and understand consequences.
The problem is this is also where we mentally replay traumatic scenarios, and this activates our sympathetic nervous system exactly like the real threat would. That's the difference between humans and other mammals: the perception of trauma will have the same physiological response because of our prefrontal cortex.
I didn't make up the zebra metaphor myself. This comes from Dr. Peter Levine, who's done a lot of work on post-traumatic stress. Dr. Levine says animals in the wild are not traumatized by routine threats to their lives. They get attacked by a lion, they move on, they go back to being a chill zebra. Humans, on the other hand, are readily overwhelmed and subject to traumatic symptoms of hyperarousal, shutdown, and dysregulation. We don't always know how to get back to being a chill zebra.
This is something called the window of tolerance. In a healthy nervous system, something will happen. We aren't always perfectly in the middle. Arousal activation happens, activates the sympathetic nervous system, fight or flight, and then when we're done, we come back down into parasympathetic, we settle, and we move through these waves. This is operating within a range called the window of tolerance. The window of tolerance is a zone of emotional arousal that's optimal for well-being and effective functioning. This is okay. This is how we normally operate. Things happen, we move on.
The parasympathetic is rest and digest. It's conservation of energy. Sympathetic is the fight-or-flight response. When trauma is introduced, though, when we have undischarged traumatic stress, instead of just having activation and then moving up and moving down, we either become very dysregulated and spiky up and down, like a crazy Nagios graph, or we can get stuck on or stuck off.
Symptoms of being stuck on are anxiety or panic, hyperactivity, hypervigilance. Digestive problems can happen because there are physiological responses. Symptoms of being stuck off are depression and lethargy, also digestive problems, because all these things are connected. When this traumatic thing happens and it's undischarged, we either fluctuate back and forth very erratically, or we get stuck on, or we get stuck off.
Trauma is not simple. It's very nuanced. Nuance is a thing that we as humans have a big problem with. We have a hard time understanding. Things to keep in mind: trauma is when the nervous system has become overwhelmed and your solution, the active response to the threat, does not work. That's what introduces trauma. Nervous system activation is the same for real and imagined threats. Whether it's actually happening or it's the memory or the perception of it, your nervous system will react physiologically the exact same way.
This is really key: it is subjective and relative. What I perceive as overwhelming, or my capacity for a solution, might be different than what you perceive as overwhelming. When I was in treatment for post-traumatic stress, there were people I was in treatment with who had some very severe PTS. I looked at what they had experienced and thought, "Wow, you have been through so much more than... how are we even in the same group? I feel not invalidated, but I don't even belong here. My problems are nothing." But it's nuance and it's all different. This is really important when we're thinking about how we communicate with others when we understand trauma.
The other thing to bear in mind is that trauma is not a zero-sum game. My traumatic experiences do not take away from yours. I'm not invalidating a bad thing that happened to you by talking about a bad thing that happened to me. We have to remember that they're nuanced.
Why does this matter to organizations? One of the great things about DOES is you hear from all these different companies, and they're all different than you. Their stories can be helpful, but there's no one particular answer. Understand that how your organization responds is going to be different than how another one does, and that's okay.
How does this apply to an organization? You're like, "Okay, great. We learned about zebras and prefrontal cortexes and activation and windows of tolerance and all this nonsense. That's great. I don't know, Matty, if you heard or not, but this is called the DevOps Enterprise Summit. How does this apply to an organization or a company?"
Let's go back and look at the window of tolerance again. Imagine this as your company. Things happen, and you respond to them, and then you go back to business as usual. A traumatic event for a lot of organizations is an incident or an outage. A severe incident, a severe outage can trigger the same way that trauma might trigger an individual. It can cause undischarged traumatic stress to the organization. I'm not just talking about the actual trauma that it might cause to the individuals; that's a whole other thing. Bear with me. We're going for a metaphor here. Your organization's like a person.
An incident or an outage can be undischarged traumatic stress. Then your organization can either do this flip-flop, or you can get stuck on, or you can get stuck off. Being stuck on is hyperarousal. This is fight or flight. An organization that is stuck on, that is hyperaroused, displays effects of constant vigilance. It sounds like they're fighting Voldemort. They're hyper-aware of threats. By no means am I saying don't be concerned about threats. They happen. But organizations that are hyper-aware are expecting this to happen all the time, and it is their first, foremost, and only concern. What happens in this case is it takes energy away from moving forward. They're stuck on. It's like anxiety all the time as the organism of the organization.
How can you see that this is happening? This is reflected in how leadership approaches outages and issues. You see a lot of wartime metaphors in organizations like this. A symptom of this is production support teams, where you have a whole team whose only job is fixing issues, which we're all DevOps friends here, so we kind of know why that might be a problem. In an organization that's stuck on, they don't move forward because they're putting all their energy into playing defense all the time.
You can get stuck off, which is hypoarousal. The outcome is very similar, which is still not innovating. This is an organization that is stuck off and they don't make any changes. That's a broad stroke. It's usually not as simple as that. How many people here come from a more ops background than a software engineering background? Okay. What makes things break? Changing stuff, right? That's the metaphor. We're like, "Well, we can't change anything."
I had an organization I used to work with. I had a friend who was the tech ops director for an e-commerce site that was a sister site that I worked with. It was a great example of how the incentives to her team and to her reflected that the organization was stuck off. Her bonus was a multiplier of the uptime of their website. How many changes do you think she ever wanted to go out? This is literal money out of her pocket. These things can happen. Again, this is all nuanced. I'm painting big, broad strokes here, but I want you to start thinking about smaller symptoms that might indicate these things.
We want to talk about inappropriate response. Remember, we talked about this prefrontal cortex, and one of its advantages is that it helps us with pattern recognition. As humans, we like pattern recognition. It makes us happy. However, systems are complex. There's not always an easy answer. We see symptoms that seem like something that happened before. We look for this pattern recognition because the signals look like this thing that happened before. But complexity makes that response inappropriate.
Back in the olden days of 20 years ago, when I was hiring for tech ops, I used to say that the best system administrators are ones who would sit there, stroke their chin, and say, "I've seen this before." It's not so good anymore, because we go down that road and we're going to keep looking for the same stuff.
How many people got on an airplane to come here? How many of you had to take off your shoes? I know this is all different depending on what country and everything like that, but this is a little bit of the U.S. metaphor of this. On December 22nd of 2001, Richard Reid tried to ignite explosives hidden in his shoes when he was flying from Paris to Miami. In the U.S., the TSA began randomly searching people's shoes. Then in 2006, they said that all passengers have to remove their shoes. In the U.S. now, you can actually pay money and not have to take your shoes off anymore. That's a whole other question.
What's happening here is, again, looking for a pattern. We're like, "Well, the pattern is people who wear shoes try to blow up airplanes." I'm being a little glib with that. Here's maybe a better example. Jennifer Brea says there's a saying in medicine that when you hear hoofbeats, the first thing that should come to mind is horses, not a zebra. Then she says that too-cute-by-half phrase has killed many zebras. This means we have to stop jumping on that first root cause. The first thing that it sounds like, we're going to say, "Oh, it must be that." But we have complex systems.
What can we do to understand the window of tolerance that we have as an organization? How can we identify when we're dysregulated? Those are the things I want you to walk away from here and think about. What are some indicators from how your organization responds to incidents and outages and change? Are you able to stay within a window of tolerance, or do you become dysregulated? I don't have an answer to tell you if you are or aren't, because the way that you respond is different because it's nuanced. But I'm going to tell you some things you can do to help once you know you are dysregulated. Chances are you probably are, because most of us are.
You can't be a big DevOps thought leader unless you write a letter. We talked about what Dr. Peter Levine said, and here's another way to think about it: resilient organizations are not traumatized by routine threats to their mission or business. Non-resilient organizations are readily overwhelmed by these routine threats. They're subject to symptoms of overreaction, shutdown, and lack of regulated effort. This wasn't said by Dr. Peter Levine. That was actually said by me, and I'm not a doctor.
It's taking that same metaphor. We talk about how zebras are not overwhelmed by trauma, but humans are. If your organization is resilient, these things happen. Can we prevent all incidents? No. Should we even try? No, because incidents are a gift. Incidents are a way that our system is telling us something we didn't already know. Am I trying to say don't try to have resilient systems and it's great if everything breaks all the time? No. But understand that they are a learning opportunity, and if we're trying to prevent them from happening all the time, we're chasing down the wrong thing. We have to be resilient and respond to them.
How do humans and organizations take a dysregulated state and become regulated? What's the actionable stuff that we can do? One of the treatments for post-traumatic stress is eye movement desensitization and reprocessing. From now on, I'm going to call it EMDR because that's way too long. In this case, a patient's difficult memories are offset with a positive association that's reinforced through external stimuli. You're being walked through a thing that associates with a safe space, and then they activate both sides of your brain through lights or clickers to reinforce it. The important thing to think about here is that we want to create a physiological association with that difficult memory. That's what happens in EMDR.
How can we do that for a company? We're not going to put little blinking lights and little tappers in all of our people's hands, but what are some of the things we can do to achieve a similar result?
One thing is working through game days. We want to make an association with outages and issues with a safe place. Game days can really help do this, but they have to be done properly. A game day is where you're saying, "We're going to introduce some type of exercise that's a false incident." Basically, you're practicing incident response. But the trick is to keep them low-stress and safe. The idea is not to practice under pressure, but to associate response with a safe environment.
I've seen organizations do game days where they're trying to trick their responders. It's not about practicing troubleshooting. Trust me, you're going to get plenty of practice of a stressful situation. We're actually trying to do the opposite. We're trying to create an association of the mechanics of incident response with calm response that's happening in a safe and regulated way.
You need guidance when you're doing that, so someone has to run it. It's kind of like meditation. With some kinds of trauma, meditation's really helpful, but if it's not guided, people could go into their inner landscape and encounter their trauma and not know what to do. Your game days have to have rules and guidance, and you have to have someone running it so you're keeping it safe. The idea is not to stress people out. Also, you don't want to make it so safe that it's not like a real incident. You still want to follow your process.
Another way to take this to the next level is planned failure injection or chaos engineering. If you do planned failure injection in a game day, run it like a real incident. At PagerDuty, we have something called Failure Fridays. We don't always do them on Fridays anymore, but we did for a while and we like the name, so we still call them Failure Fridays. It's planned failure injection. We introduce failure into the system to test the resiliency of the system. We run it like a real incident, even if we hopefully didn't actually cause an incident. We have our incident commander running the activity. We're going through the whole motions of doing this.
This creates an organizational association of a safe place for incidents because we do them on Fridays during business hours. The reason we do that is twofold. Number one, we're PagerDuty. We don't have a maintenance window. There's no good time for us to be down. The best time, if we're going to introduce failure, is when most of our engineers are already working anyway. Secondly, it's a low-stress thing. It's the middle of the day on Friday, we're just hanging out, and we're going to spend an hour and run Failure Friday. Then we're used to that process, so when it does happen at 2:00 in the morning, our incident commanders and our incident responders are just associated. It's just a thing we do.
This is also why you should run all incidents at their initial severity. I have a saying I like to say in incident response: during an incident, don't litigate severity. Don't have a big argument about whether this is a SEV 1, SEV 2, or whatever. You'll get alerted and say, "Oh, we have an incident." Then shortly after you start working, maybe you realize, "Oh, actually, nothing was wrong after all." At PagerDuty, we still run it all the way through. We do a postmortem and everything. The reason is because it gives us more practice. We want this to be a thing where we're just used to responding to incidents, so that it's business as usual and doesn't create extra stress.
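As a rough sketch of "don't litigate severity" (not PagerDuty's actual tooling; the `Incident` class, its fields, and its method names are all hypothetical), an incident record could refuse to change severity mid-response and refuse to close until a postmortem has happened:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Incident:
    """Hypothetical incident record: severity is fixed for the whole response."""
    title: str
    initial_severity: Severity
    resolved: bool = False
    postmortem_done: bool = False
    notes: List[str] = field(default_factory=list)

    def propose_severity(self, suggested: Severity) -> Severity:
        """Write the suggestion down for later; keep running at the initial severity."""
        self.notes.append(f"{suggested.name} suggested; revisit in the postmortem")
        return self.initial_severity

    def resolve(self) -> None:
        self.resolved = True

    def complete_postmortem(self) -> None:
        if not self.resolved:
            raise ValueError("restore service before reviewing the incident")
        self.postmortem_done = True

    def is_closed(self) -> bool:
        # Even a false alarm is only closed after the full process has run.
        return self.resolved and self.postmortem_done


incident = Incident("Checkout latency alert", Severity.SEV2)
print(incident.propose_severity(Severity.SEV3).name)  # still SEV2 during response
incident.resolve()
print(incident.is_closed())  # False until the postmortem is done
incident.complete_postmortem()
print(incident.is_closed())  # True
```

The point of the design is that a "nothing was wrong after all" incident takes exactly the same path as a real one, which is where the extra practice comes from.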
You need to be able to process failure. Blamelessness and blameless postmortems or post-incident reviews are just the beginning. It's not enough just to be blameless or blame-aware. We have to process the failure of the outage through all the information we have, and more importantly, it has to have a conclusion. We have to tell it as a story, and the story has to have an end. Otherwise, it's just more unprocessed trauma for our organization.
There's a misconception that to process trauma, you just have to get it all out. Really what you have to do is process it. We have to integrate your experience into a coherent whole, which includes telling stories as well as making associations with our autonomic nervous system. In postmortems or in an organization, this has to do with how we learn to respond to outages by sharing.
J. Paul Reed did his dissertation on technology postmortems. One thing that stuck with me was that the larger the organization, the less likely teams were to share the results of their postmortem outside of their team. The irony is the bigger your organization, the more necessary it is to share the results of your post-incident reviews outside of your team. It's twofold. One is because our systems are very interdependent, so the chances are there are other teams that will care about what you learned. But also, that act of storytelling helps us process it as an organization. Write-only postmortems don't help anyone. It's great that you got your postmortem form, you fill it out after an incident, it gets shoved off into Google Drive, and nobody ever looks at it again. That didn't help anybody. You need to think about it as storytelling. This is how we process things, by telling stories.
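One way to make "write-only postmortem" concrete is a small check on the review record itself. This is only an illustrative sketch: the `PostIncidentReview` class and its fields are assumptions made up for the example, not any real tool's schema.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PostIncidentReview:
    """Hypothetical review record: a story, its ending, and who has heard it."""
    incident: str
    owning_team: str
    narrative: str = ""
    conclusion: str = ""
    shared_with: Set[str] = field(default_factory=set)

    def share(self, team: str) -> None:
        self.shared_with.add(team)

    def is_write_only(self) -> bool:
        """True when nobody outside the owning team has received the review."""
        return not (self.shared_with - {self.owning_team})

    def is_processed(self) -> bool:
        """The story needs a body, an ending, and an audience beyond its authors."""
        has_story = bool(self.narrative.strip())
        has_ending = bool(self.conclusion.strip())
        return has_story and has_ending and not self.is_write_only()


review = PostIncidentReview(
    incident="Checkout outage",
    owning_team="payments",
    narrative="What we saw, what we tried, what surprised us.",
    conclusion="What we changed and what we now understand.",
)
print(review.is_processed())  # False: filed, but nobody else has read it
review.share("search")
print(review.is_processed())  # True
```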
In top-down therapy, which is traditional talk therapy, it starts with the emotions: why do you feel the way that you do? This is very similar to when we get very obsessed with looking for root cause initially. How many people are familiar with the theory that root cause is not a very good phrase? I'm going to introduce it to all of you. I want everybody to learn a new term today. Instead of root cause, we use the term contributing factors.
The reason is there is no root cause. I'm sorry, there is one root cause. You want to know what the true root cause is? The Big Bang. Our systems are complex, so we want to think about contributing factors. If we start looking for root cause while we're resolving an incident, two things are going to happen. Number one, during an incident, the only thing that matters is restoring service. Anything outside of restoring service is wasted time, so we shouldn't be doing that. The other thing is, by thinking root cause, the first contributing factor we find, we're going to jump on it and say, "Aha, got it. There is my smoking gun." We want to collect contributing factors while we're dealing with the incident, but we're not looking for the root cause first.
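A minimal sketch of that ordering, assuming a hypothetical `IncidentTimeline` structure (the names are invented for illustration): observations get collected as contributing factors while the incident is open, there is deliberately no "root cause" field, and the only recommended action before service is back is restoring it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentTimeline:
    """Hypothetical structure: a list of contributing factors, no single root cause."""
    service_restored: bool = False
    contributing_factors: List[str] = field(default_factory=list)

    def note_factor(self, observation: str) -> None:
        """Record an observation and keep going; nothing here is 'the' cause."""
        self.contributing_factors.append(observation)

    def next_action(self) -> str:
        # During the incident, restoring service is the only priority.
        if not self.service_restored:
            return "restore service"
        return "review contributing factors"


timeline = IncidentTimeline()
timeline.note_factor("Deploy finished shortly before the first alert")
timeline.note_factor("Connection pool was near its limit")
print(timeline.next_action())  # restore service
timeline.service_restored = True
print(timeline.next_action())  # review contributing factors
print(len(timeline.contributing_factors))  # 2
```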
This is why in somatic experiencing, we talk about working with the body before the emotions. It's the same thing. We want to think about the physiological activity of incident response.
I'm going to talk really quickly about cognitive distortions. They're exactly what they imply: distortions in our cognition, perspectives with bias, irrational thoughts and beliefs that we unknowingly reinforce over time. There are as many as 16 generally accepted cognitive distortions. I'm not going to go through all of them. I'm just going to touch on a few that might be causing issues in your organization, ones that when I learned about them, from my history with IT, I've seen a lot of IT organizations doing.
One is polarized thinking, also known as all-or-nothing thinking. This is seeing everything in extremes, either perfection or total failure. Remember, our systems are always in some state of failure. We don't have perfection. There are only so many nines that make sense.
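To put rough numbers on the nines, here is some illustrative arithmetic (my own addition, using a 365-day year and ignoring leap years) for how much downtime each extra nine leaves you per year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for `nines` nines of availability (2 -> 99%, 3 -> 99.9%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability = 100 * (1 - 10 ** (-n))
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines ({availability:.3f}%): about {minutes:.1f} minutes per year")
```

Each additional nine cuts the budget by a factor of ten: roughly 3.65 days at two nines, under 9 hours at three, under an hour at four, and about 5 minutes at five. That shrinking budget is why perfection is not a realistic target.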
Then overgeneralization is where we take a single instance and generalize it to an overall pattern. On a personal level, this would be something like, "Oh, I got a C on this test. I'm stupid and a failure." In an organization, it can be, "We had a SEV 1 incident on this particular service, so it's very unstable." This is making overgeneralizations.
Fortune telling, boy, do we love to do this in IT. This one is very prevalent. We feel like if we have enough data, we'll be predictive. Machine learning, anybody? If we only know enough, we could predict the future. Yes, we can start to get ideas on what is likely to happen, but guess what? Our systems are complex. Our users are complex. Our organizations are complex. We have to understand that our predictions are not fact, but just one of several possible outcomes. If you've watched the Avengers movies, Doctor Strange saw one and a half billion different possibilities, and there's the one that's most likely.
Control fallacies manifest in one of two beliefs: either that we have no control over our lives and we're helpless victims of fate, or that we have complete control over everything, which gives us responsibility for the feelings of those around us. Both of them are damaging, and they're equally inaccurate. No one's in complete control over everything, and no one is completely out of control.
Where this comes in in technology, coming back from my background in ops, is that we believe either that if only we had enough control, we could fix everything - if I could control what those damn developers did, everything would be great - or practitioners feel like they have no control because they're being dictated to. You always have some kind of impact. If nothing else, you have the ability to respond mentally.
Being resilient means that we don't see things from the perspective of things that happen to us. It's a matter of what we can do going forward. A culture of blame creates a culture of helplessness. In our organizations, these are not about things that happen to us. Incidents happen, but it's what can we do going forward.
Thank you very much.