Beyond the Retrospective: Embracing Complexity on the Road Towards Service Ownership
At DOES15, we presented the work we'd done at Salesforce to take their SRE teams to the "blameless cloud." We worked with various roles in the SRE teams so they could start asking the right questions about failure, and through the postmortem and retrospective process, begin to make lasting changes in how Salesforce worked with and remediated identified failures.
But DevOps espouses less siloed thinking and more shared responsibilities, so we found postmortems within the SRE organization weren't enough. As Salesforce was moving toward a model of "service ownership," teams along the entire software delivery value stream needed to start to understand their roadblocks to remediation and what aspects of the complex system they worked in were impeding their ability to "own their service."
We'll discuss the second phase of our work in helping these operations _and product_ teams gain a deeper understanding of service ownership, and why just "DevOps'ing it up" wasn't quite enough on its own to help. Plus, we'll introduce an expanded model from last year's talk that incorporates human factors and complexity theory. These additions helped prime the teams to more effectively grapple with the challenges facing them on the road to true service ownership.
Full transcript
Kevina Finn-Braun
This is a little bit of a combination of some work that Paul and I did at Salesforce, and then a continuation onto a new adventure.
I am Kevina Finn-Braun. I am Director of Product Infrastructure Service Management at Intuit. Prior to joining Intuit, I was Director of Site Reliability Service Management at Salesforce, and also ran business continuity at Yahoo.
My Twitter handle is @KFinbraun, so if you'd like to live tweet how awesome we are, you can do that as well.
J. Paul Reed
And my name is J. Paul Reed. This slide has a bunch of stuff on it, but it's mostly about my Twitter name. So if you would like to tweet me things and questions and stuff, that's totally cool: @JPaulReed on Twitter.
My background's actually release engineering, which I guess everybody calls a DevOps consultant now. I don't know how I feel about being called a DevOps consultant, but whatever.
And I note that I'm actually working on my master's in human factors and system safety because it'll actually come up later in the presentation. We're going to be talking about some weird ways that safety science actually comes into incident management in large-scale IT infrastructure.
Kevina Finn-Braun
So I want to give you a quick recap from our last DevOps Enterprise Summit. We presented on some of the work, again, that we had done at Salesforce, discovering how to work across silos and to remediate incidents.
Also, some of the work we did about teams using the old view, and the fact that we kept asking why, and we kept asking who.
Who owns the investigation? Who owns the follow-through? How many whys did you ask?
J. Paul Reed
Five.
Kevina Finn-Braun
Always five?
J. Paul Reed
No, sometimes less, but not more than five.
Kevina Finn-Braun
Okay.
J. Paul Reed
You're not going to ask me why, are you?
Kevina Finn-Braun
No.
J. Paul Reed
Okay, good.
Kevina Finn-Braun
How many people have had this experience of the old view, asking why, asking who? Come on, I need to see more hands than that.
J. Paul Reed
Asking exactly five whys.
Kevina Finn-Braun
Five whys, no more than five.
J. Paul Reed
Yeah.
Kevina Finn-Braun
Who thought that worked? Did it work? Yeah. Who thought that worked for them? Yeah.
Surprise, it didn't work for Salesforce either.
So we started digging into the postmortem and incident analysis process, which led us to really look at how the teams were working together. At the same time we started on this journey, we also had some big org changes happening with some new vice presidents coming in.
So our new vice president said, "Hey, go do the DevOps, and we're going to call it the service ownership initiative."
So what's service ownership? You just said do the DevOps, right?
So service ownership was our way of discovering and understanding how each team's operating context worked, and working to help them build a unified context. And also, trying to get them to help us understand: we knew what wasn't working. Did they know what wasn't working? And if they did know what wasn't working, what did they think would work?
J. Paul Reed
So one of the interesting challenges at Salesforce when they said, "Okay, let's go do the DevOps," is there were a multitude of definitions. And of course, we've all seen this, right? If you Google DevOps or "what is DevOps," it's like, hey, there's 80,000 hits, and you're like, "Let me get on reading all 80,000 of those opinions on what DevOps is."
So what was interesting at Salesforce is there were no fewer than five, ironically, definitions that were circulating the organization, official high-level sponsored definitions of what DevOps was. So it kind of was, what flavor of DevOps would you like, right?
But one of the other interesting things about it is that there were actually metrics attached to these different definitions, right? So metrics about how quickly you should respond, and things that people were actually being held accountable for.
And what we found was, you might think, well, what kind of ice cream flavor? And we found if you looked across the different definitions, some people were actually talking about cheeseburgers, or some people were talking about Chinese food. So it wasn't really, what flavor of ice cream? It's like, what food do you want to have, right?
Kevina Finn-Braun
I want any of those.
J. Paul Reed
Any of those, right?
Kevina Finn-Braun
No.
J. Paul Reed
So that's the thing: in a large-scale organization like this, when you have a sort of "do the DevOps" mandate, one of the problems that you often run into, and that we noticed at Salesforce, is these often actually competing definitions.
It's really serious because, again, people are expected to be held accountable to these different definitions, so it's kind of weird to deal with.
Kevina Finn-Braun
So back to the how do we do the DevOps. We started working with the teams on service ownership, and we went in with a little bit of a bias.
Our initial explanation for why people weren't doing the DevOps was that they had this learned helplessness, right? Everybody is familiar with learned helplessness. You have a bad event happen, you can't control it, so you have this perceived lack of control, and then it turns into generalized helpless behavior.
But as we started working with the teams, what we realized was they actually had no conception of the structure of other teams. But more importantly, they didn't understand the work that the other teams had to do and the work that the other teams wanted to do.
J. Paul Reed
So one of the things that we decided to help sort of with this structural blindness problem was to run a series of workshops with a bunch of the different teams around sensemaking of what service ownership means to them within their local context.
So one of the definitions of DevOps that was kind of winning had a bunch of metrics and monitoring numbers and uptime numbers associated with its definition. And so what we did is we actually went down, and you'll see a version of this model, but we actually had the teams create sort of at a beginner or a kickoff level, intermediate level, and an advanced level of what they thought was important.
So we had them identify what they thought was important, and then we had them identify sort of the leveling that they would define that at. And then we sort of had them look at the tasks that they would do. This is actually an example from the workshop where we did that.
And then we had them sort of look at their most and least actionable items. And the point of this was we wanted to make sure that we got their local context for what they thought service ownership for DevOps was. We wanted then to be able to sort of compare that with all of the definitions, but certainly the one that they were going to be matched against.
One of the interesting things is we purposely did not bound them or their system. So what we ran into, which was kind of interesting, is they would ask questions about, "Well, can we hire a database person? Are we going to have database skills on our team? Are we going to have sysadmin skills on our team?"
And we were like, "Well, what do you think? Do you need them? Do you need that in-house?" And different teams fell in different areas about whether they wanted to bring that sort of skill set onto the team or not, and use sort of the shared service model of another team.
Again, we did that so that we could see where they put the boundaries if you asked them that question, as opposed to where the organization had structured the boundaries. And again, just to see kind of what they would come up with.
Kevina Finn-Braun
So we had some surprises. I think Paul mentioned a little bit earlier about the five different definitions. Well, some of the teams that we did these workshops with, it was their vice presidents that were writing and circulating these five definitions of DevOps.
And what we found out was that regardless of what anyone was saying was the definition, the important thing was to understand the team's local context and their rationality. And again, the idea of service ownership didn't mean the same thing to everyone, even people on the same team.
And that was actually one of the aha moments when you're doing the workshop with the teams, is that you say, "Go do the DevOps," when within the team they sort of disagree on what that means, or even what beginner and intermediate and advanced would be on a certain skill set.
And the other thing was, going back to what we thought was learned helplessness, the teams actually really saw the value in changing the way they worked. They really wanted to do DevOps, but they were really frustrated that they couldn't do the transformations that they wanted to do. They were time and resource constrained.
And I, again, was shocked at the fact that their VPs were saying, "Go do this, but we're not going to help you get there. Just go figure it out."
I think the thing that sadly shocked Paul was that none of the teams actually talked about retrospectives and doing postmortems and incident analysis, which was a little bit of a surprise.
J. Paul Reed
Yeah, that was one kind of pattern that we consistently saw, was that this concept of incidents and then management of incidents was kind of not on the radar when it was like, "Own your service," but then how do we manage when something goes wrong? So that was kind of fascinating.
One of the things I actually want to point out, and this actually comes from the safety sciences. This is called the Rasmussen Safety Triangle. I note that he did a bunch of work after Three Mile Island melted down on how people do their work and how people respond to incidents.
The reason that I bring it up: the way the triangle works is on one boundary, the top boundary here, you have the boundary of financially acceptable behavior. So that basically is, if you cross that boundary, it's too expensive. We wouldn't do that. So there's a business function, basically cheaper, better, faster, pushing people in the system down.
And then on the bottom, you've got the boundary of acceptable workload. So that's people doing the work. And the way I like to say this is humans are lazy, and I don't mean that as an insult at all. We all want the most bang for our buck and our brain with the sugar that we're utilizing, right? So we're always finding ways to be more efficient in the work that we do.
So there's a natural gradient towards the least amount of effort in the work that we do for the most bang, the most effect of the work we do. So those two arrows there, we call them pressure gradients. They're kind of pushing towards that third boundary.
And if you wonder what that third, or are curious what that third boundary is, that's the boundary of acceptable behavior or acceptable risk. So what that means is when your system crosses that boundary, it's when you have an accident or an incident.
So the thing is, is that you've got the business pushing towards that boundary, and you've got ourselves, teams, sort of pushing towards that boundary. And it comes up with this interesting question: well, how do we kind of counteract that or how do we deal with that? Because certainly we don't want to go across that boundary all the time.
So one of the things here is this sort of "experiments to optimize locally," and it's kind of interesting because optimizing locally is a problem. If you've done the DevOps or read any of the DevOps reading, that's a problem you have to think about. Lean Thinking talks about that too.
But the notable part is that area, those little arrows, is called the discretionary space. And the goal of the workshops was to get the teams to sort of explore their discretionary space along that boundary. Because the point of doing that is that you want to get the teams better at detecting when they are trundling close to that boundary, and maybe they're going to have the system fall across that boundary into an accident or an incident.
And so that was really what we wanted to do: get them familiarized with their own context, what that discretionary space looks like, and also get the organization familiar with letting teams explore this space and figure out what that looks like for them.
Kevina Finn-Braun
So that's where our work focused on in the earlier part of this year with Salesforce teams. But then there was a new adventure.
Who's heard of Intuit? Cool. Excellent. I'm sure you'll be hearing more about it.
So I took over leading Intuit's service management team in May, and so for the past six months, Paul and I have been digging into the transformation that they are trying to make.
I think that there have been some different challenges at Intuit, and there are also some of the same challenges.
So quickly going through them: Intuit wasn't born in the cloud. Here we have this fun little Quicken for Windows.
J. Paul Reed
Windows 3.1 for life.
Kevina Finn-Braun
1983, the internet was still a baby, and we all did our taxes on our desktop.
So incidents meant something different. An incident meant, "I'm going to call into customer support because my desktop software isn't working, and somebody's going to try and help me figure out what's going on."
And because an incident was different, some of you may recognize this, there was no Bermuda Blob.
J. Paul Reed
What's the Bermuda Blob?
Kevina Finn-Braun
What is the Bermuda Blob?
J. Paul Reed
The Bermuda Blob is where incidents go to die.
Kevina Finn-Braun
Pretty much. Yeah. It's that little pink blob area.
J. Paul Reed
So this is actually something we showed last year as well. This was kind of our attempt to value stream the incident life cycle at Salesforce. And what we found out was that as we started to talk about it, we found some artificial boundaries in this giant area where incidents go to die.
So this was a fascinating project because how many people have a diagram like this? Yeah. Okay.
So what we did is we took this, and then I just did a bunch of interviews, and I said, "Okay. Who talks to who when this happens? Who doesn't talk to whom when this happens?" And this is what we came up with.
And actually, as Kevina was saying, these boundaries, and kind of there's a shwoop up there, those are all actually, turns out, organizational boundaries. Now, we didn't know that at the time when we set out to do this, but those fall along pretty well-defined org boundaries. So they're kind of defining...
Kevina Finn-Braun
It's Conway's Law all over the place.
J. Paul Reed
Pretty much. Yeah.
Kevina Finn-Braun
Yeah.
J. Paul Reed
Yep.
Kevina Finn-Braun
So again, no blob because incidents had a different meaning.
Another big thing, and actually something I've had to get used to, is a totally different business life cycle. There are three distinct peaks for Intuit, three times a year when it's all hands on deck. And I think you all know the one time.
We don't really want intuit.com going down on April 15th. We don't want it to go down either because we want your taxes done, not just due.
So the idea of them working in a peak cycle instead of a consistent all-around cycle, that's been an adjustment as well in trying to get the teams to understand that they need to look beyond a peaked work stream, I guess, is the best way to put it.
J. Paul Reed
Very bursty.
Kevina Finn-Braun
So we also found some similar challenges because, of course, Intuit started its journey to SaaS about four-ish years ago.
One of the things we discovered, similar to Salesforce, was that operational responses weren't documented, or they were documented, but that wasn't actually what was going on. We also saw some inconsistencies as well.
Much to Paul's glee, because it means more work, they are in love with the old view and the five whys and doing blameful postmortems. So we're trying to turn that around as well.
Also, closing the loop on incidents. They didn't necessarily have a Bermuda Blob, but they just didn't talk about them. They just, "Okay. It happened."
J. Paul Reed
Oops.
Kevina Finn-Braun
Yep.
And the other thing was taking a closer look at P3s and P4s, which is always where I think you find the most interesting things. There's a surprising number of service impacts that they're classifying as P3s and P4s, and, again, they're not really looking at remediating those or fixing those problems.
And again, the age-old question of what is an incident? As I stated earlier, in the old days, an incident was, "I'm going to call the call center, and I'm going to ask somebody to help me with my software."
Now, an incident is live production. You've got customers relying on you.
We just had an issue the other day of an incident that wasn't customer impacting, but it blocked the path to production. So now we have this big debate of, you've blocked the path to production, but this isn't an incident. We diagnosed it all in email.
So again, it was people trying to do the right thing, but it caused some strife. And I said, "Hey, guess what? You need to now figure out what you're going to do with an incident that blocks your path to production because it does cause customer impact indirectly."
J. Paul Reed
Right. It was interesting, right? Because the thinking was, well, if it's not customer-facing, it doesn't matter. But it was blocking the customer getting value, new value. And so there was kind of a shift in framing about that, that was an interesting conversation.
So this slide is actually...
Kevina Finn-Braun
Similar to last year's.
J. Paul Reed
Oh, it's the same as last year's slide.
Kevina Finn-Braun
Okay. It's the same slide from last year.
J. Paul Reed
But I wanted to put it in because I wanted to ask the question: when I say blameless postmortem, how many people have this reaction? They're like, "I don't know what that means. That doesn't feel right. I don't get it. I don't know how we do it in our organization. That just seems very weird."
Don't worry, if you raise your hand, I won't blame you.
Okay. Yeah.
So the thing is that that is a natural reaction. A lot of people hear this blameless postmortem movement, and they're very confused by it. And the reason I wanted to put this in here and discuss it again briefly is because it's totally natural.
This is Brené Brown, a research sociologist who does a lot of work on vulnerability and how people sort of interact, and she has a great quote. She says, "Blame is a way to discharge pain and discomfort."
So the thing is, her point is that we are actually hardwired by evolution to use blame as a coping strategy to sort of get rid of things that are uncomfortable. So when you hear somebody say blameless postmortem, it feels wrong.
The analogy I always use is like walking into a meeting, we're all going to pretend we don't have arms. It's like, I can see my arms, and I can see your arms, and let's just all pretend we don't have arms. It doesn't work. There's this bit of cognitive dissonance around that. So that's actually why.
A lot of times, too, when you hear blameless postmortem, people take issue with postmortem, right? Because it's like nobody died. So there's this heavyweight connotation to it. So you see a lot of weird reactions to it.
It's like, well, we shouldn't do a postmortem because nobody died, or we shouldn't expend that much effort on it because it's not that big a deal.
So you actually see discussions about learning reviews or actionable retrospectives. Some people use the term awesome postmortems, and I just want to point out, don't call them that, because I don't know about you, but I don't want my postmortems inspiring awe. They should be good, and they should be sustainable and actionable. But awe-inspiring, not so much.
Kevina Finn-Braun
Yeah. We don't want that. That was the first thing that my boss said to me at Intuit. "We want our postmortems to be awesome."
J. Paul Reed
No, you don't.
Kevina Finn-Braun
You actually don't.
J. Paul Reed
Yeah.
So I wrote a little bit more about this, about blame aware, and it discusses this sort of blameless versus blame aware and what that means. I don't think we're going to start calling them blame aware, but if it doesn't sit right to you, that blog post explains why.
So this is the incident analysis, or basically postmortem retrospective model, that we introduced last year. It's based on the Dreyfus model. You'll notice it has levels similar to the ones we talked about in the workshops.
But what's interesting, it's not super prescriptive. It's not like a capability maturity model. It's sort of based on behavior and language. So it's in an organization, what language would you hear people saying and using? What behavior would you see?
So this is last year. If you're curious about this, you can watch the presentation from last year. I'm not going to go through all of them.
I will point out one of the interesting transitions you see: novices on this model will talk about incidents as being bad, like, "My job is on the line," whereas at the expert level we move to, how does our team or system contribute to our success? So that's a very big mind shift there.
Kevina Finn-Braun
But that's not all.
No. So again, we started with incident analysis because that's where we were having the most discussions with Salesforce, and then what we did was we actually looked at it and said, "Hey, what about all the other areas?" as we were going into this, again, the service ownership journey with Salesforce.
And what we found was that we needed to also look at incident detection, response, remediation, and incident prevention, with the idea that all of these things support the life cycle. So one feeds into the other.
And what we discovered, unfortunately, was that incident analysis was actually the middle child, and it needed its supporting siblings.
So incident detection. This is where you start at the beginning, and this is where a lot of my work is focused at Intuit right now: no monitoring. Other teams are going to notify us there's a problem. They're monitoring their service; they'll call us.
And then as we move along, we get to the proficient area, and this is what I'm really pushing for: monitoring is a first-class citizen and we prioritize it in everything we do.
J. Paul Reed
So after that you're looking at incident response. And so what I love about this, at the novice level, it's like, did you try turning it off and turning it on again? We've all kind of had that as an incident response.
What you really see as you increase to competent and sort of expert is this idea of the incident management system, which actually comes from the fire service. So incident management behaviors and language start being used more as you progress.
One of the other really interesting things that I like to point out is at the expert level, and there are organizations that are at the expert level, they consider incidents inhumane. They use that word because they want the organization to react in that fashion.
It's not so much about the business impact that's important. It's more about the human impact of the people that have to deal with that.
Kevina Finn-Braun
Incident analysis, we talked about that, so we'll skip past that.
J. Paul Reed
The middle child.
Kevina Finn-Braun
Yeah, the middle child.
And then to incident remediation. What do you do following an incident? How do you make sure that you've corrected the problem?
Novice: oh, file a ticket. Go file a ticket. My personal favorite: "Hey, we need more process. Where's the process gaps? Let's put more process in."
And then ultimately to the point of, the resiliency of the system is considered in the design phase. It's talked about. It's part of the planning. And the remediation work is not considered a separate activity. It's all one and the same.
J. Paul Reed
And then finally we get to incident prevention. And one of the interesting things about incident prevention that you actually see is that as you progress, you see more talking about formation of crews to deal with incidents and issues, injection of failure, so game days, people going into the data center and yanking out cables, stuff like that. Chaos Monkey, if you're familiar with that, that Netflix uses.
And to the point about process, less emphasis on docs, documentation, and process.
One thing that I would point out that's interesting about prevention, you might have noticed actually there's an asterisk there. The big shift in thinking between the novice and the expert on prevention is that you realize you really actually can't prevent incidents in complex systems.
And so the focus changes from sort of "prevention" to being really good at dealing with these incidents and issues that come up. In some sense, actually being really good, being in our own skin organizationally in that discretionary space. We can really sense where that boundary is.
Now, I noticed a lot of you taking photos. The good news is this is all...
Kevina Finn-Braun
That was me.
J. Paul Reed
...Creative Commons licensed, so you can actually read it on GitHub. And feel free to submit pull requests to update it. We would love people's input, especially based on things in their organization.
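Editor's note: the five-phase model walked through above can be written down as a small data structure. This is only a minimal sketch under stated assumptions, not the speakers' published model. The names `Phase`, `SIGNALS`, `next_phase`, and `signals_for` are hypothetical, the cells hold only examples mentioned in the talk, the level label on a few cells is an editorial reading of the talk, and the wrap-around from prevention back to detection is an assumption drawn from "one feeds into the other."

```python
from enum import Enum


class Phase(Enum):
    """The five lifecycle phases, in the order the talk walks through them."""
    DETECTION = "detection"
    RESPONSE = "response"
    ANALYSIS = "analysis"
    REMEDIATION = "remediation"
    PREVENTION = "prevention"


LIFECYCLE = list(Phase)  # Enum iteration preserves definition order

# Behaviors and language quoted or paraphrased from the talk, keyed by
# (phase, level). Cells the speakers did not describe are simply absent.
# Level labels on the remediation and prevention rows are an assumption.
SIGNALS = {
    (Phase.DETECTION, "novice"): "No monitoring; other teams will notify us.",
    (Phase.DETECTION, "proficient"): "Monitoring is a first-class citizen.",
    (Phase.RESPONSE, "novice"): "Did you try turning it off and on again?",
    (Phase.RESPONSE, "expert"): "Incidents are considered inhumane.",
    (Phase.ANALYSIS, "novice"): "Incidents are bad; my job is on the line.",
    (Phase.ANALYSIS, "expert"): "How does our team or system contribute to our success?",
    (Phase.REMEDIATION, "novice"): "File a ticket; add more process.",
    (Phase.REMEDIATION, "expert"): "Resiliency is considered in the design phase.",
    (Phase.PREVENTION, "expert"): "Incidents in complex systems cannot truly be prevented.",
}


def next_phase(phase: Phase) -> Phase:
    """Assumption: prevention feeds back into detection, closing the loop."""
    index = LIFECYCLE.index(phase)
    return LIFECYCLE[(index + 1) % len(LIFECYCLE)]


def signals_for(phase: Phase) -> dict[str, str]:
    """Return the level -> example mapping recorded for one phase."""
    return {level: text for (p, level), text in SIGNALS.items() if p is phase}
```

A team could fill in the missing cells with language heard in its own organization, which is the kind of local context the workshops were trying to surface.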
Kevina Finn-Braun
So takeaways.
J. Paul Reed
Takeaways.
Kevina Finn-Braun
So whether it's DevOps or incident management or any sort of change really, you need to facilitate teams exploring their discretionary space. It's super important, and it's one of those things that I think we've heard this repeatedly at the summit this year about giving teams the time and space to sort of figure out what that means for them, bringing their context to whatever the initiative might be, whether it's CI/CD or incident response or the DevOps in general.
Incident response equals incident management. So this is one area where your response should be the same every single time. Repeatable. It should be rote. The team should know it. They should do it all the same, whether it's a P1, a P2, a P3, or even a P4.
My favorite, I have a dear friend of mine, and I grew up in the fire department, and so he likes to say, "Fires aren't an emergency to the fire department because it's what we do."
We should adopt the same mentality when we're talking about incidents. An incident isn't an emergency. It's just what we do. We know what to do.
And also fundamentally, when we talk about the incident value stream, we're talking about bringing this idea of the Three Ways of DevOps into incident management.
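Editor's note: the "same response every single time" takeaway, together with the earlier debate about an incident that blocked the path to production, can be sketched in a few lines. This is a hedged illustration only. The step names in `RESPONSE_STEPS` are hypothetical (the talk does not list them), and `Incident`, `response_plan`, and `counts_as_incident` are invented for this example.

```python
from dataclasses import dataclass

# Hypothetical steps; the talk only says the response should be rote and
# identical for every priority, not what the steps are.
RESPONSE_STEPS = (
    "declare the incident",
    "assign an incident commander",
    "communicate status",
    "mitigate",
    "record the timeline",
)


@dataclass(frozen=True)
class Incident:
    summary: str
    priority: int  # 1 through 4, for P1 through P4


def response_plan(incident: Incident) -> list[str]:
    """Return the same ordered steps regardless of priority."""
    if incident.priority not in (1, 2, 3, 4):
        raise ValueError("priority must be 1, 2, 3, or 4")
    return list(RESPONSE_STEPS)


def counts_as_incident(customer_facing: bool, blocks_path_to_production: bool) -> bool:
    """A blocked path to production counts, since it impacts customers indirectly."""
    return customer_facing or blocks_path_to_production
```

The point of the sketch is that priority is recorded but never branches the plan: a P4 gets the same steps as a P1.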
J. Paul Reed
So this is actually a takeaway from last year. It's the same one: you are never done. And this provides job security for...
Kevina Finn-Braun
It does.
J. Paul Reed
...for both of us. I wish it wasn't the case, but hey, job security.
But I want to repeat that. It's really important. And I think you see that in the journey in our work together at both Salesforce and Intuit.
There was an interesting DevOps Enterprise Summit webinar that I was on a couple of days before the event, and one of the people on the webinar said, "Cultural change is really hard." And I was like, "No, it's actually not."
The way we change culture, the patterns, those are fairly well-established, and there's fairly good data that they work. The problem is if we change our priorities every quarter, then that's the problem, right?
And you've heard it repeatedly, a bunch of case studies. I mean, Target's in year five.
Kevina Finn-Braun
Mm-hmm.
J. Paul Reed
Right? I went to the Verizon presentation. They were talking about the seven- to 10-year plan.
So cultural change is not hard. It just takes a long time, and it takes concerted effort. So this goes to the you are never done, and sort of improving the way that you deal with incidents and your incident life cycle.
And of course, that's the whole point, is you want to continue to get better at that. And so you're never done at getting better.
Kevina Finn-Braun
Keep going and keep leading the change. It's true.
J. Paul Reed
So...
Kevina Finn-Braun
So avenues for collaboration. Take a look at our life cycle model on GitHub, pull it down, and actually send us pull requests. And compare your own documented or undocumented incident life cycle against the actual value stream and see what you find...
J. Paul Reed
Find your own...
Kevina Finn-Braun
...and share it.
J. Paul Reed
Find your own Bermuda Blob.
Kevina Finn-Braun
Yeah. Find your own Bermuda Blob.
Thanks.
J. Paul Reed
Thank you.