The Blameless Cloud: Bringing Actionable Retrospectives to Salesforce
As organizations experiment with greater concurrency and integration between their departments and move toward continuous delivery of customer value, failure is assured. Asking “how can failure be avoided?” isn’t as useful or relevant as asking:
– “How does our organization react when failure occurs?” and
– “How do we create a sustainable, actionable process for describing, exploring, and remedying failure?”
These are the questions that presented themselves to Salesforce’s Service Reliability Engineering team. Their SREs had received training in incident response and management, but were still struggling to incorporate that feedback into the organization at large to improve outcomes. Feedback loops weren’t always closed, so many opportunities for improvement were lost.
This is the story of my months-long journey with Kevina and her team to identify the specifics of what made reliability retrospectives difficult to have, why actionable takeaways were often lacking, and how the feedback loops within the company’s operations organization weren’t serving Salesforce’s needs.
We then ran a series of experiments together, putting the SRE team on a road to improving their ability to respond, react, remediate, and reincorporate learnings from failure into the organization.
Full transcript
The complete talk, organized by section.
Kevina Finn-Braun
Thank you. Welcome to The Blameless Cloud. This talk is about how J. Paul and I brought, and are bringing, actionable retrospectives to Salesforce.
I am Kevina Finn-Braun. I am the Director of Site Reliability Service Management for Salesforce. Prior to joining Salesforce, I led business continuity at Yahoo. If you'd like to tweet me as we're going through our presentation, I am `@kfinbraun` on Twitter, and in my spare time, I am prepping for the zombie apocalypse.
J. Paul Reed
I'm Paul Reed. I go by `@jpaulreed` on Twitter, although you might have heard me as Sober BuildEng. There's a long story behind that. But this slide is mostly about tweeting me, so if you want to do a live actionable retrospective on Twitter, that's totally cool.
I also do a podcast called The Ship Show, if that's your thing. We talk about build engineering, DevOps, release management, and everything in between on the show.
Actually, we were laughing about this. The program had a typo in it. I am not the Director of Site Reliability at Salesforce. I'm not doing Kevina's job. I'm actually a principal consultant at Release Engineering Approaches, and I talk to organizations of all shapes and sizes about the DevOps.
Kevina Finn-Braun
Site reliability at Salesforce. What are we, what do we do, and who are we?
We're a globally dispersed team. We have a focus on remediating operational service-impacting incidents, but more importantly, learning from those service-impacting incidents. Also, we analyze chronic problems and look for solutions to them, and we try to drive resiliency improvements and operational improvements throughout Salesforce.
So what brought J. Paul Reed and I together? We were facing some hurdles at Salesforce. We had a process that wasn't being consistently followed. We also had some inconsistent information.
But why? Because we had some siloed boundaries for incident handling and remediation activities. They crossed different boundaries of different teams.
But why was that a problem? Because we had some lack of clarity in service ownership and who owned what, and how, and why.
But why did you have that, Kevina? Because we had some restructuring that we did, and we split some things off. We did try and solve it, though, with a giant meeting once a week that was really heavyweight, and that's how we tried to solve it.
J. Paul Reed
But Kevina, why? I'm still confused.
Kevina Finn-Braun
Stop asking me why.
J. Paul Reed
So you may have noticed I asked why a number of times, a specific number of times. That was one of the other things that we were working with: this idea of the old view of human error, where we talk about things like five whys, root cause analysis, using language like, "Why didn't your team do this? Why didn't your team notice those metrics were down? Your team should have done that," and also talking a lot about best practices. So that was actually one of the hurdles that we found as well.
This is a slide from Adam Jacob's talk on DevOps Kung Fu, a talk he gave earlier this year. What I like about it is that his keynote is something like an hour long, but he talks about incidents lead to postmortems, where you invoke the space, describe the incident, establish a timeline, identify contributing factors, describe customer impact, describe remediation tasks, and describe improvement tasks for response improvement.
What's interesting about that is that's actually how quickly he goes over it in his talk. He does that in 15 seconds. So when you're talking about what does the structure of a new-view, modern postmortem look like, this is it. I guess we can all go home now, right?
But it turns out that in complex organizations like Salesforce and a lot of the organizations that you're probably all a part of, it takes a little more than 15 seconds to just go through this list and say, "Hey, let's do a postmortem." And that's where a lot of our work focused as well.
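The sections listed on that slide lend themselves to a simple record. What follows is a minimal sketch in Python, not something shown in the talk: the class name `Retrospective`, its field names, and the `missing_sections` helper are all hypothetical, and the sample values are invented for illustration. "Invoking the space" is a facilitation step, so it has no field here.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Retrospective:
    """Hypothetical record mirroring the sections named on the slide."""

    incident_description: str = ""
    timeline: List[str] = field(default_factory=list)
    contributing_factors: List[str] = field(default_factory=list)
    customer_impact: str = ""
    remediation_tasks: List[str] = field(default_factory=list)
    improvement_tasks: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only; none of this data comes from the talk.
retro = Retrospective(
    incident_description="Illustrative incident",
    timeline=["10:02 alert fired", "10:15 incident declared"],
    contributing_factors=["stale runbook", "noisy alerting"],
)

# Sections nobody has filled in yet: a rough signal that the
# retrospective has not produced actionable output.
print(retro.missing_sections())
```

Filling in the fields is the 15-second part. The empty remediation and improvement lists are where the longer organizational work described in the rest of the talk shows up.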
Kevina Finn-Braun
Yeah. Since we've invoked the space and we've talked a little bit about the incident and the problems, let's do the timeline.
We first met in October of 2014, talked about what we wanted to do, how we wanted to fix things, how we wanted to change things. In subsequent meetings, January rolls around, and we say, "Well, we need to blow up this big meeting that we have." That was one of the first things we did.
Shortly following that, we did a status check in April where we shared the assessment with our senior leaders and got some feedback on what we were doing and what worked and what didn't work.
Then in May, I'd like to think as a result of that status assessment, we had a shift in service ownership roles and responsibilities, where there was, I'd like to call, a stake put in the ground on who would do what.
J. Paul Reed
One of the interesting things we noticed about the status assessment, too, was that a lot of it was things that people knew. It was one of those examples where you actually were in the forest and you could see the trees, but it was able to add some color, maybe color the leaves, color the birds, about what was going on and contextualize it a little better.
Kevina Finn-Braun
In July, we did an initial workshop with the larger group to introduce them to the new view of retrospectives and postmortems. So we introduced them to those concepts and started getting people thinking about that.
J. Paul Reed
In August of 2015, we identified a group to pilot the coaching that we were working on. That was the database team at Salesforce that we worked with.
Then, from August to now, we've also had a real focus on the weekly status meeting. Kevina's going to talk more about that. But we wanted to make sure that we were looking at what the teams were doing and how they were contributing to the larger meeting. I can remember, actually, we did a lot of debriefs right after that meeting to figure out what was going on.
Kevina Finn-Braun
Our own retros. We retroed our retros.
J. Paul Reed
Yes. And then, of course, we continued working weekly sessions with the database team. We're getting them to focus on their data presentation, collection, and narrative, which we'll talk about as well.
Kevina Finn-Braun
One of the first things we did in our October meeting was we retroed our current process. This was a process that I created after talking to several people. It was created about two years ago by the time J. Paul and I first talked.
We realized what worked and what didn't work. As we went through this, I realized a couple of things. I realized that while I had designed this process to fit all of our incidents, what we were really only focusing on was the big, high-profile incidents. That was mainly because I also anchored this process around this big Wednesday meeting.
J. Paul Reed
One of the things we did with this flow diagram, you're probably familiar with the nice process flow diagram, and it's all crisp and clean. One of the things we did is that we did a value stream map of the incidents that come through this process.
We found out that when you actually go around and look at what that flow looks like from a value stream map perspective, it looks a little different in real life. It's kind of interesting.
One of the things we found when we were looking at how it actually plays out is you can kind of see there's a highlighter mark there, a fairly evident silo transition. So where incidents are actually crossing silo boundaries. That was an interesting thing that we noticed.
Another interesting thing, you can see this up in the upper right-hand corner: there's a big arrow, a swooping arrow kind of to the left upper-hand corner. That was an example of where people that needed information to do their jobs, the process didn't account for that.
Now, that doesn't mean those people weren't talking to each other. They totally were. It's just that when we designed that process, or actually, when Kevina designed it, it was something that we didn't account for or figure out until we actually ran the process.
We did a sample of incidents, and we found that there were some consistent inconsistencies. What we found is that certain parts of the process were actually just like clockwork.
Kevina Finn-Braun
Yep.
J. Paul Reed
Adhered to very well, the inputs and outputs very defined, got a lot of good data out of that. But then we found that some teams would take that curated data about the incident, with all the information, and then they would just kind of set it aside and do their own thing.
A lot of those happened, again, across those transition boundaries. And then, of course, the meeting that Kevina was talking about, there's this Bermuda blob at the bottom where incidents go into the triangle, and maybe they come out, and maybe they don't. So that was another realization that we had.
Once we did that, we actually wanted to step back from the value stream map of an incident because it was really important to talk about actionability of retrospectives. This idea that you may do retrospectives, but is the output of those actionable? Are you feeding that back into your organization?
So we actually stepped back from the value stream map that we did, and we started looking at what structure could we put in place, what work could we do with the teams to improve this. Then we started asking ourselves a lot of questions about what are the current state of the teams? What are they working on? What makes it that they aren't necessarily taking all of the data that we collect during an outage or an incident?
We found out: is the weather good for doing the kind of improvement work that we need to do? We found a lot of the teams were running into this: "Oh, flying cow," in their day-to-day lives.
Kevina Finn-Braun
No cows were harmed in the making of this presentation.
J. Paul Reed
So we decided to actually go for a lighter touch and do mini experiments. They're lightweight. They're easy to conduct. Oftentimes, people weren't even aware that we were doing an experiment.
In fact, it was so funny. Kevina and I were talking about some of the larger meetings where we would make minor, small tweaks, and people would come up to us after the meeting and they'd say, "Wow, that meeting was shorter." It was a shorter meeting, which was great because it used to be a three-hour meeting. Or, "That meeting just felt better." We were able to look back at some of the small, minor tweaks that we were doing. It also made it easier for larger changes that we were going to make.
Kevina Finn-Braun
We decided to head first right into that storm and just go for it.
I decided after talking with J. Paul, and after some feedback from our execs to understand what information they wanted to get out of this meeting, that we just needed to blow it up. We needed to start over, from the language we used, to what we called it, to how it was structured.
The other thing we realized is that we really needed to focus on these smaller cloud-level retrospectives rather than trying to do a big, giant one. Salesforce is made up of several different clouds who actually do the retrospectives, and so we wanted to really focus in those areas.
The first thing we started with was language. Language matters. We talk every day. I'm up here talking. J. Paul's up here talking.
The first thing we did was change the name of the meeting. It went from being called the High Availability meeting, which had a lot of connotations of bad retrospectives and bad blood, into being called the Weekly Service Reliability Review.
While that's a mouthful, we also toyed around with something called the Weekly Availability Report, but we couldn't go with that moniker because it translated into WAR. And what's that good for?
We also were really focused on the language of postmortem versus retrospective, which J. Paul will talk a little bit about.
The other piece of it was problem team versus solutions team. I don't want to manage problems. Neither did my team. We really wanted to manage solutions. Yes, there's a problem, but let's focus on the solution.
The other piece of it was the concept of a root cause versus a proximate cause. Salesforce is a complex system, much like any other complex system that many of you manage sitting in this room, and you're not going to find one single cause for an issue. Most of them are going to be several different proximate causes that feed together to create the explosion that becomes the incident.
J. Paul Reed
Whenever I hear root cause, I have a single tear roll down my cheek. I think I got you to that point, too, when you would hear it. It's just a single tear.
Kevina Finn-Braun
Yes. The email?
J. Paul Reed
Yeah, the email.
Behavior matters, too. When you're talking about retrospectives and where you're asking people to be honest and, in some sense, be vulnerable about what they're doing, behavior in those meetings mattered.
We spent a lot of time observing and really looking at inter-team behavior, people in the smaller meetings, how they're reacting to each other and interacting, what those interactions look like. We did actually a lot of ethnographic research on some of the interactions that we were seeing and trying to get a better understanding of it.
Then, of course, in the larger meeting, we were looking at behavior between teams and trying to figure out what was going on there.
It's funny. I have a friend who uses the hashtag NAFB. That stands for New Age Fluffy Bunnies. It's really funny. People say, "Well, this behavior stuff, it's just touchy-feely."
But Sidney Dekker has a quote there about, "People in complex systems create safety. The occasional human contribution to failure occurs precisely because humans have an overwhelming contribution to safety."
So it's important to realize that the reasons that we experience failure are the same reasons that we experience success. We often look at the failure, and it's because humans are contributing to that. If they're in an environment where there's counterproductive behavior or a lot of finger-pointing and arguments and that sort of thing, it's going to limit your ability to explore the human contribution to failure, but also success.
Kevina Finn-Braun
One of the other things we did was look at the structure. How were we presenting the information?
We realized that we were putting our engineers up into a big, heavyweight meeting to basically give a book report. A lot of them were standing there just like our little friend in front of the classroom, like a deer in headlights: "You want me to do what?"
So we started changing that, and part of that was working with the teams in the smaller retrospectives. We realized in those meetings that these teams were actually focusing on writing and rehearsing their book report instead of planning and talking about what had occurred, how they were going to fix it.
So we wanted to really refocus them to work on how they discuss the incidents. Not necessarily prepping for a book report and writing slides, but how you discuss it and the outcomes that come from those discussions.
J. Paul Reed
It was really funny. I remember a meeting that we sat in, and there was a particularly serious incident that the team was discussing. They had done a great job of discussing it, and then got to the point where the team really felt like they understood the issue. The manager asked, "Okay, who's going to present this in the weekly meeting?" And everyone was like, "Not me."
Kevina Finn-Braun
Not it.
J. Paul Reed
Not it.
Kevina Finn-Braun
Not me.
J. Paul Reed
Right.
What we've really been focusing on is more, especially at the summary meeting, around constructing a narrative. A narrative around what a particular cloud is doing, maybe what incidents they ran into, maybe what hurdles they're facing that are not related to an incident, but that may be precursors or indicators of a possible incident because their attention or focus may be somewhere else.
Now, what's interesting, when I say construct a narrative, a lot of you might think, "Oh, you mean construct a story, construct a fib." That's not what I'm talking about at all. These are narratives that are data-informed.
There were actually a couple of teams in the weekly meeting that were really just naturally inclined to this form of reporting, and it was actually really amazing when you sat back and watched it. We found that when they did their report in the meeting, they had fewer incidents. They were able to contextually answer all of the questions that people had. So there were less pointed questions or questions about what's actually going on, because they were able to say, "We had some outages," or, "We're having some trouble with this thing, but we're on it."
That was another fascinating effect of using narrative to tell the story. But whenever they were asked a question where they wanted detailed information, they were always on point with that data. They had it ready to go.
So that was really a way to give everyone in the meeting, and maybe also the executives who were there, a deeper understanding, because, as humans, we understand narrative better than a book report or a graph with a bunch of statistics and bullet points about what happened.
Quick question, little bit of a survey. For people that are new to this whole blameless postmortem thing, I'm going to ask a question. Let's just be honest. How many people think the term blameless postmortem is kind of bullshit? Like, when you hear it, you're like, "Yeah, I don't think that actually really exists."
Yes, that's why they never happen.
This is Brené Brown. She's a sociologist, I believe, at the University of Austin. She's done a lot of research on interactions, vulnerability. This is from a TED Talk she did actually on the topic of vulnerability. It's really interesting. She says the research describes blame as a way to discharge pain and discomfort.
We are hardwired as humans by millions of years of evolutionary psychology to blame people. So I always like to say, when you go into an environment and say, "Hey, we're going to do blameless postmortems now," it's kind of like everybody saying, "Oh, we'll just pretend nobody has arms." And we can see each other's arms. We're playing this game, right?
So what we like to talk about is blame-aware postmortems, which is where we are very conscious of the human tendency to do that, and then address that in the moment. It takes a lot of work, but the outcomes are kind of amazing.
The other kind of problem with postmortem is a lot of times you'll hear people say, "Well, nobody died, so why do we have to look at this? This isn't a plane crash. Nobody died, so let's not do a postmortem."
Or on the other end, you'll hear, "Well, postmortems, doesn't that mean like the NTSB shows up, and there's this big heavyweight formal thing, and they do all this stuff, right? We don't have the time or the money to really do that."
What's interesting is those are ways to deflect being vulnerable and being in a vulnerable space, and then learning from that. That's why I don't actually like the term postmortem. We need to find a balance that matches the cognitive context with our organization's need to learn, and that's why we call them retrospectives, and that's also why we try to make them actionable.
Now, one thing a lot of people say: "Oh, awesome postmortems. Let's have an awesome postmortem." Really?
Kevina Finn-Braun
Yeah.
J. Paul Reed
I don't know about you, but I don't want the postmortems that I'm doing to literally inspire awe in my employees or the people that I'm working with. They should learn something. It should be a good experience, but I'm not so sure that there should be awe involved, at least not in every postmortem.
One of the things we also did, when we were trying to help the other people we were working with at Salesforce understand what was going on, was come up with a Dreyfus model of skill acquisition for postmortems.
Those of you familiar with the Dreyfus model, it's kind of a five-point model, but one of the interesting things about it is that it's meant to describe behaviors. It gives you a bunch of examples of things you might hear or behaviors that you might see in your organization so that you can understand where people are in the spectrum.
Teams can be in different points. Different individuals can be at different points on the spectrum. Also, if you look at a broad skill like retrospectives, people can be a novice in one part and maybe competent in another part. They can actually have different things based on behaviors.
Kevina Finn-Braun
Yep.
J. Paul Reed
So we came up with that for retrospectives.
Here's an example of some things you might hear. A novice might say, "Incidents are bad. My job is on the line, so obviously they're bad." They may finish the postmortem paperwork, and they think of it as paperwork I've got to fill out.
Then we move on to the beginner. The beginner has a little bit more knowledge under their belt. They're more familiar with doing a postmortem. They've been in a couple. But they tend to jump to the focus on the why right away. Why, why, why? They also want to say, really, "How do we fix this? Got to fix it as fast as possible." They have that kind of beginner energy.
Following on that is the competent practitioner. The competent practitioner says, "Look, we need to find the root cause, and we're going to follow the prescribed format for retrospective." That's the one we saw earlier that we went through in about 15 seconds.
Kevina Finn-Braun
Yeah.
J. Paul Reed
Again, doesn't work in 15 seconds.
Then we move on to the proficient practitioner. The proficient practitioner says, "We've established the what. Can we now talk about the how? Let's talk about the how as opposed to the why."
The other part that this proficient practitioner can do is identify the inherent bias that they have in themselves and maybe also start to understand the inherent biases that the participants in the room have as well.
Then an expert practitioner in terms of retrospectives would ask questions like, "How does our team or system contribute to our success, not just our failures? What are we doing there that makes us successful?" They'll also be able to facilitate retrospectives by healthily helping others address our tendencies to blame, and then also our personal and systemic biases.
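The five stages and their example statements can be written down as a small lookup table. This is a rough sketch in Python rather than an artifact from the talk: the names `Stage`, `EXAMPLE_SIGNALS`, and `next_stage` are hypothetical, the signal phrases are condensed from the descriptions above, and the skill areas in the sample assessment are invented for illustration.

```python
from enum import IntEnum
from typing import Dict, List, Optional


class Stage(IntEnum):
    """The five Dreyfus stages, in the order described in the talk."""

    NOVICE = 1
    BEGINNER = 2
    COMPETENT = 3
    PROFICIENT = 4
    EXPERT = 5


# Things you might hear at each stage, condensed from the talk.
EXAMPLE_SIGNALS: Dict[Stage, List[str]] = {
    Stage.NOVICE: ["Incidents are bad.", "The postmortem is paperwork."],
    Stage.BEGINNER: ["Why, why, why?", "Fix it as fast as possible."],
    Stage.COMPETENT: ["Find the root cause.", "Follow the prescribed format."],
    Stage.PROFICIENT: ["We've established the what. Can we talk about the how?"],
    Stage.EXPERT: ["How does our system contribute to our success, not just our failures?"],
}


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage after `stage`, or None if already at EXPERT."""
    return None if stage is Stage.EXPERT else Stage(stage + 1)


# Hypothetical self-assessment: a team can sit at different stages in
# different parts of the broader skill. The skill areas are illustrative.
assessment = {"facilitation": Stage.NOVICE, "timeline building": Stage.COMPETENT}

for area, stage in assessment.items():
    target = next_stage(stage)
    print(f"{area}: {stage.name} -> listen for: {EXAMPLE_SIGNALS[target][0]}")
```

Keeping a stage per skill area, rather than one stage per team, reflects the point that the same people can be novices in one part of the practice and competent in another.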
Gene asked us to come up with the five takeaways.
Kevina Finn-Braun
Five takeaways.
J. Paul Reed
The top five list.
Kevina Finn-Braun
Not the five whys.
J. Paul Reed
So these are the things that hopefully you can take back to your organization when you're looking at doing retrospectives, making them actionable, not calling them postmortems.
Retrospectives facilitate both the service improvement process and the design improvement process. The biggest thing here is that if you have an improvement process or an improvement team but you're not doing retrospectives, there's no point.
Being too busy to learn or improve means that you are in a downward spiral by definition, because, it turns out, entropy. If you're not improving the way that you're working or learning and you're just focused solely on features or keeping the lights on or whatever it may be, entropy is basically eating away at any past improvements you've made. It's eating you alive. Entropy is mean that way. The saying is that if you're staying static with the way that you do work, you're actually backsliding.
Kevina Finn-Braun
It's not about the outcome. It's about the response.
A lot of businesses focus on MTTR, customer downtime impact. That's important. I wouldn't say otherwise. But what is really important to take into account is how you respond.
The notable takeaway here is focusing on outcomes doesn't fix the next incident. Focusing on response, and your team's response, your organizational response, feeds directly into the next incident that you're going to have.
Etsy and Netflix, we talk about them as unicorns. Both of them still take postmortems very seriously, their retrospectives very seriously. They continually work on perfecting them. When they hire new people aboard, they acculturate them to their style of doing retrospectives. So they take it very seriously.
Why and how is more important than the what. This is my personal favorite. As we were observing the smaller conversations and even the one in the bigger meeting, I realized that we're really, really good about talking about what happened. We could talk forever about what happened, but we really need to focus on those nuggets that come from the discussions of the how and the why.
How did this happen? Why did this happen? What were the contributing factors? What did we do right, and how do we take what we did right and proliferate that the next time?
J. Paul Reed
Finally, you are never done. Let me repeat that. You are never done.
As a corollary, it takes longer than you might think. I think if you think you're going to practice your postmortems for two months, call them retrospectives and practice them for two months, and it's suddenly magically going to work, that's actually not how it works. This takes a long time to change behavior.
Kevina Finn-Braun
Does it?
J. Paul Reed
Unfortunately not.
So when you do start looking at this, the model is a good place to start. See where you are on that model, map what your current state is onto that model, and then engage with the artifacts, the tangible things that you produce from a postmortem process or a retrospective process.
This is where coaching can actually be really helpful, because a coach can help you look at those artifacts and improve them. So it really has to be something that permeates the work that you do. You can't just bolt it on as some sort of improvement process that ends after two to six months or whatever. People really need to intrinsically live that as they do their work.
Kevina Finn-Braun
Keeping with our weather theme, because you all saw the weather theme throughout the presentation.
J. Paul Reed
And the flying cows.
Kevina Finn-Braun
And the flying cows.
Our forecast for the future: we are working at Salesforce to tailor the concept of service ownership for Salesforce. It's not one size fits all. What service ownership means at Salesforce may not be the same thing that it means at Etsy or Netflix or Yahoo. There's still a basis of service ownership at the base level that we all practice, but there's different nuances when it goes from company to company.
The other thing that we're doing is focusing on coaching. Coaching those teams, documenting some guides, if you will, on how to run an actionable retrospective.
Focusing on the service health instead of what happened or this individual incident, looking at the system as a whole.
At the end of the day, we're doing all this because we want to give the business what they want, which is a good business outcome. For infrastructure engineering, which is my team, that means providing a highly available service. All of this work and everything that we're doing goes back into ensuring that the service is highly available and reliable.
J. Paul Reed
The other question Gene had for us to share was areas for collaboration, how you can help us.
We would love to see how the Dreyfus model that we've come up with applies to other organizations. If you're seeing behaviors and language that we might be able to add to it, that would be wonderful to hear.
Then, sharing stories from other enterprises with their retrospectives. Who should do them, who is doing them, where they should live within the organization, what kind of processes and procedures you put in your organizations around that would be great to share, because that makes us all better at this.
Thanks.
Kevina Finn-Braun
That's it.
J. Paul Reed
So we actually have time for questions.
Kevina Finn-Braun
I think we do.
J. Paul Reed
Any questions? There are bright lights, so I can't see hands.
Kevina Finn-Braun
Okay, we have a microphone here.
J. Paul Reed
The light is speaking to us.
Kevina Finn-Braun
I know.
Q&A
Q: You said in your old process that it was intended to apply to all incidents. How does the new process apply? Does it still apply to all incidents, or is there still just kind of the big incidents?
A: It applies to all incidents. But again, with the aspect of coaching the teams and working and focusing in the smaller groups, we're able to talk about the zeros, the ones, the twos, the threes.
The old process really was very heavy on getting ready for this big meeting and getting ready to present at this meeting. So we're giving the teams a little bit of breathing room, I guess is the best way to say it. They can sit back and talk about it more. So it still applies to all of them. It's just a different way of approaching.
A: One of the things I'd say, too, is it was kind of funny. The old process actually had someone who would give the teams the incidents that they were responsible for reporting on. Which is fine. There are certain cases where there's a lot of attention on a particular incident.
But with the teams that are doing a structured narrative, they're able to bring in a sev two or a sev three incident that maybe was interesting to them and they think would be helpful for the broader team, and may be an indicator of a sev one or a sev zero incident.
That gives them the flexibility to bring up to the broader team what they think is important, not just what books they were supposed to read for the book report to do the book report.
Q: How bought into this is your executive leadership?
A: They're very bought into it, actually. They have seen the work that we've been doing in several conversations with them, and part of the status reports and giving them regular updates.
They've also seen it in the meetings, the Wednesday meetings. Some of them were coming to us and saying, "Wow, that went really well," or, "What did you guys do? It went faster, it was different, it was better."
So they're slowly starting to see that the changes we're making are having an effect on our availability, our service, and the way we're approaching problems.
Q: I have a follow-on question to that, though. Do you guys take your biggest, worst, "this is really a big problem for us" incidents to the leadership still? And if you do, who does it?
A: We do still do that. The person responsible for doing or driving that conversation is the service owner, which in a sense is the cloud leader. So if it's a database incident, it's our database leader that goes and presents and talks about it that way.
We're not hiding things. We're still really focusing on the big incidents, and mainly that's to give our execs the visibility that they need in case they're approached by a customer. But in the smaller, broader meetings, we're painting a picture of how that one incident was part of a greater, probably systemic issue. Maybe not, maybe, but we're having those discussions as well. So it's two types of—
A: I will note one thing really quickly. Talking about the executives and their view of it, the coaching and working with the teams on that, they've realized that's so important and they're seeing some benefit. There's actually a couple of headcounts that have been opened up to help teams perfect this process across the company, which is pretty interesting, I think.
Moderator: Okay, well, we need to get the next speaker, which is Mark Horneck, up. Thank you.
Kevina Finn-Braun: Thank you.
Moderator: A round of applause.