How We're Transforming the Practice of Learning From Incidents in a 12,000 Person Organization

Las Vegas 2022

The IBM CIO organization is making significant progress along our journey to learn more from our incidents. Where previously a small number of our leaders would discuss a handful of "major incident RCAs" focusing almost exclusively on what broke and how we're fixing it, we are now having broad and open discussions that are improving everyone's understanding of our systems, the expertise that keeps them running normally, and the challenges that can overwhelm that expertise. A centerpiece of the IBM CIO Learning from Incidents Program is a monthly discussion in which the CIO senior leadership team, all technical leaders in the organization, and many others gather to review and discuss the story of a recent incident. Hundreds of CIO team members have participated in these meetings and/or viewed the recordings, and these monthly meetings have led to several significant outcomes. These monthly meetings are also inspiring CIO teams to improve their own practice of learning from incidents.

From this presentation, the audience will understand:

- Why the IBM CIO organization is adopting a resilience engineering approach to learning from incidents
- How this approach differs from the traditional RCA and Problem Management practices
- What the IBM CIO organization has been doing to improve our ability to learn from incidents
- What themes and outcomes we've observed
- How we got here, and how other organizations can join us on this journey

Full transcript

The complete talk, organized by section.

David Leigh

All right. Thank you all for coming. Let's go ahead and get started.

I'm David Leigh, and I work in the IBM CIO organization. I've been at IBM a while. For most of my career, I've gravitated to projects involving adoption of modern practices and tools.

I actually attended the first DevOps Enterprise Summit back in 2014. At that time I was involved in internal adoption of DevOps practices. It was in that context that I stumbled across the work that John Allspaw was doing in bringing the field of resilience engineering to IT and vice versa.

I spent the next several years scaling tools like Slack Enterprise and GitHub Enterprise across all of IBM. With hundreds of thousands of Slack users and tens of thousands of GitHub users, we regularly pushed these tools beyond their limits, and we had many opportunities to learn from incidents.

Since 2015, I've had the opportunity and honor of representing IBM in the SNAFU Catchers Consortium. All of this led to the role I have now, which is engineering resilience for our 12,000-person CIO organization. This talk is about the journey that we've been on to transform how we learn from incidents. I'll start by explaining where we started and where we're headed.

Here are two perspectives of system safety. The traditional view is that the systems we build are perfectly designed, and if there are failures, it's either because someone screwed something up or because of a failure in some technical component of the system. This is the view which has informed the root cause analysis approach that we're all familiar with. Under this traditional view, if we fix that person or that technical component, then the system will go back to being perfect. We typically rely on enforcement of processes and rules to help make things go better. ITIL is 100% informed by this traditional view.

The new view is that our systems are not perfect. In fact, they're typically on the edge of failure. It's people who are actively keeping our systems running by continually adapting to gaps, challenges, and surprises. We know this is true because if people walk away from these systems, they quickly fall over.

Safety is the presence of expertise that is observing, anticipating, planning, diagnosing, and reacting in real time, and preventing the incidents that don't happen. Based on this, the way to improve safety is to look at how people adapt during anomalies and disruptions: how they coordinate, how they decide, assess risks, and make sacrifices. This is really about understanding the expertise that is keeping our systems running so that we can enhance and sustain that expertise.

Here I want to distinguish the root cause analysis practice from what I'm calling a resilience engineering approach to learning from incidents. The RCA approach is based on a linear model which suggests that if we fix the thing that caused the problem, then everything will be fine. Its simplicity is attractive and intuitive, but it ignores the real complexity of the system and often leads to bad choices on how to improve the system.

Learning from incidents is based on systems thinking, which acknowledges that our systems are complex, they have emergent properties, and failures are the result of multiple factors, each of which is necessary but none uniquely sufficient to cause the incident.

The focus of the RCA approach is on finding and fixing the thing that caused the incident. The focus of the learning from incidents approach is on learning. Learning involves updating mental models of the system, including an understanding of how the system is supposed to work versus how it actually works, understanding hazards, and recognizing challenges that people face during disruptions. This type of learning is itself a system improvement.

The RCA approach often finds that the system would be fine if only the humans followed the rules or procedures. The learning from incidents approach recognizes that the system would not work if not for the people who are adapting, improvising, and dealing with surprises all the time. In a very real way, human variability is generating success.

The RCA goal is to prevent the system from failing due to the same disruption happening again. The learning from incidents goal is to improve the likelihood of success regardless of the next disruption the system faces.

This is the journey that we're on, and I'll spend the rest of this talk explaining how we're doing this. I want to emphasize that this shift is not straightforward. It is not intuitive, and it takes a little while for people to grasp it. Fortunately, once you make this mind shift, you can never go back.

I was wondering if I can get a show of hands. Are any of you making or thinking about making this type of shift in your organizations? Awesome. Excellent. Oh, we're not alone.

If any of you have doubts or are skeptical about the importance of this, right now down the hall Courtney Nash is blowing apart the myths surrounding MTTR and other shallow incident metrics, and making the case for this type of learning from incidents approach.

Cooking is one of my favorite hobbies, so I couldn't resist the metaphor here. When I step back and think about what has made this transition work as well as it has over these past 10 months, I think it comes down to these four things. These are the key takeaways of this talk. If you're trying to make this shift in your own organizations and you ignore the rest of my presentation, I want you to remember these four things.

Leadership appetite for a new approach. You don't need leaders to understand resilience. You don't need leaders to stop saying root cause. You just need them to be open to trying something new.

Credibility with practitioners. You need to be able to talk to engineers with empathy and an understanding of the challenges that they're facing all the time, as well as a sincere respect for, and interest in, how they adapt to those challenges.

Readiness to seize an opportunity. Leaders' appetite for a new approach may be fleeting. It's important to do the groundwork necessary on a smaller scale so that you're ready to go big when the opportunity arises.

The last is engagement with owners of existing processes. This is about the ITIL problem managers and other folks who are driving RCA processes. These people can be your greatest allies or your biggest obstacles, and it's largely about how you engage them.

Before I go further, I want to give you an idea of the breadth and depth of the IBM CIO organization. IBM as a whole has over 300,000 people in 170 countries, and the 12,000-person CIO organization is responsible for keeping this large, complex, global enterprise up and running.

This is a thumbnail of the CIO portfolio. As you can see, we manage the systems that enable IBM to buy and sell things, provide compensation and benefits, manage our networks and real estate, our finances, etc. We've got every kind of IT system you can imagine, from mainframes to microservices, custom applications, vendor software that we run, SaaS applications, you name it.

Prior to 2022, this was our state of learning from incidents in the CIO organization. Senior leaders attended a weekly meeting that would feature a presentation of RCAs from recent major incidents. As RCAs tend to do, these presentations would focus on the technical system and how the failure could be prevented. For many teams, if you had an incident and it was not characterized as major, then no organized post-incident learning activity happened. Any of you work in organizations that have a practice similar to this? Great.

I joined the CIO organization in 2015, and I was part of a team that had some experience in studying our incidents differently. We used Sidney Dekker's Field Guide and Etsy's Postmortem Facilitation Guide as inspirations, and we did this more or less under the radar of senior leadership.

Eventually we had some incidents that were characterized as major, and we would come to that weekly meeting with senior leaders. But instead of presenting an RCA, we'd present a postmortem that didn't identify a root cause, but did highlight what was surprising, what was difficult, and what prevented the incident from being a lot worse than it was. Over the years, we gained some reputation about this approach, and I was eventually invited to help some of our peer teams study some of their incidents.

In the summer of 2021, a little after our new CIO came into her role, there was a very big incident. It lasted a while, it made a lot of IBMers unhappy, and it was in the newspapers. A big incident can be an incentive to doing things differently. Last December, during discussions about those weekly RCA meetings, I had the opportunity to propose a new approach. This was accepted.

My proposal was to host a monthly meeting to which we would invite all of our senior business and technical leaders in the organization, and we would discuss one incident in significant detail. The graphic represents our previous practice, shown in red: we discussed only major incidents, discussed only the technical aspects of those events, and did so with a pretty small audience. In contrast, the new approach looks at near misses and security incidents as well as major incidents. It looks at the whole socio-technical system involved, and we do all this with a much larger group.

Once I got the green light from our CIO to do this, I thought, oh shit, I actually have to do it. I had several concerns. First, could I do the analysis and share the story of an incident in a way that these busy leaders would actually find interesting? Can I make this engaging so that people will ask questions and discuss aspects of the event? What will be the response of those people who are accustomed to focusing on a root cause? Can I pull this off every month? Well, it turns out it's not as hard as it sounds, and it gets easier as you go.

In each monthly meeting, I share this slide to remind everyone what learning from incidents is all about. Repairing mental models: learning from incidents is one of the best ways we have to update our understanding of our technical systems, including how they're supposed to work and how they actually work.

Hazards: we want to improve our collective awareness of the types of surprises that can show up in our systems.

Patterns: the best conversations about incidents are less about specific failures that occur in one system and more about the themes and patterns that apply to many systems.

Challenges: if we can understand the challenges and difficulties that people face in resolving incidents, then we can try to make these less difficult.

Culture: we want to make it safe for people to talk about the mistakes we make so that we can help others avoid similar mistakes and design systems that reduce the likelihood of mistakes.

In each meeting, I share this slide to describe what I hope all the participants will get out of the meeting. We want them to understand the story of the incident and its themes. We want them to see the benefits of the improvement opportunities. We want them to learn something new about our systems. We hope that they recognize the differences between what we're doing here versus the RCA approach, and we hope they find this meeting interesting and engaging. I remind them that real learning requires engagement with the materials, so I encourage them to ask questions. I also hope that at the end they're looking forward to the next meeting.

I'm pretty excited about these outcomes. We've hosted nine monthly meetings and we regularly get between 85 and 100 people attending. Over 270 people have attended at least one meeting, and many others are viewing the recordings and the reports. The ad hoc feedback we've received after these meetings has been fantastic. In addition, we've seen some things happen that were directly inspired by these monthly meetings.

Practice drills: one of our learning from incidents meetings focused on a Slack outage that occurred in February. Since the CIO team manages Slack for the whole company, it's up to us to help the rest of the company understand what's going on during something like a Slack outage. This case focused on the unique coordination challenges that occur during incidents that affect our primary collaboration tools. A tangible outcome of this was the introduction of practice drills so that when this type of thing happens again, the right people know where to go to collaborate.

The inventory project: near the beginning of this year, we studied how the CIO organization responded to the Log4j vulnerability that occurred in December. One of the challenges was due to the fact that our applications and infrastructure were being managed in two completely separate sets of processes and tools. This learning from incidents discussion highlighted the need to create a single comprehensive inventory of applications and infrastructure, and to apply consistent processes across this spectrum. That project is well underway.

I'm super excited that leaders at various levels of the organization are regularly reaching out and asking for their surprises to be included in this monthly series. The outcome I'm most excited about is that more teams are beginning to host their own learning from incidents meetings to discuss their own recent incidents on their own.

The first three months of this year, I was doing this solo. By April, these monthly meetings had demonstrated enough value that I was able to hire an incident facilitator to help me out. In August, I was able to bring on a designer to help scale this program beyond the monthly meeting series. Those weekly RCA discussions continued to run in parallel with this monthly meeting series, but I was recently asked to replace those discussions with something based on the learning from incidents approach. Just last week was the last of those RCA discussions.

This is a picture we put together to describe some aspects of our scaling strategy. The idea is that our monthly meeting series has demonstrated what we can do for a so-called large or extra-large incident. The opportunity, and our focus in scaling, is on the small and medium incidents, and we want to enable teams to facilitate their own studies of these smaller incidents.

My past experience with transformation projects showed me the importance of being design-led, which starts with an understanding of your users. We've created these four personas to represent the users that we're currently focused on.

Carlos is the ITIL problem manager who's now in the process of becoming a learning from incidents coach. Sharice is the manager who's protective of her team and maybe has some trepidations about this new approach. Parth is the person who's being asked to facilitate his first learning review with his team. Gina is the person who is genuinely curious about what she can learn from other teams' incidents.

We've created a set of enablement materials organized around these personas, including our own field guide, which is largely based on the wonderful Howie guide from the folks at Jeli.io. We're rapidly iterating on these enablement materials based on ongoing user research as we scale this program.

Here are some of the remaining problems that I'd love to chat about with those of you who are interested in this problem.

Growing facilitation capacity: our scaling strategy hinges on teams being able to study their own incidents, but as I mentioned before, this shift is not intuitive or straightforward. We're interested in how best to help teams grow this capability.

Making it easy to share: a key aspect of this program is the sharing of stories about incidents. We're creating a simple library where people can contribute their stories and where Gina can browse through this library looking for interesting stories to learn from. I'm curious if any of you have created anything like this type of library of learning.

Socializing the contents of the library: another key aspect of the program is how to make the organization aware of the stories worth reading that are in the library. Everyone is busy, so we're currently thinking about the best ways to make people aware of the stories that they might find interesting and useful.

Wrapping up, I am absolutely standing on the shoulders of giants here. There is so much to read and watch on these topics, and I've selected a few of the papers and books here that I keep coming back to. As many of you know, we lost Dr. Richard Cook at the end of August this year, and his work has had a profound influence on me and on so many others.

The last thing I want to say is that there are several similarities between industry adoption of this modern approach to learning from incidents and the industry adoption of DevOps. The first is that many actually misunderstand what it means to do DevOps or to do this type of learning from incidents. The second similarity is that within traditionally minded organizations, both DevOps and this form of learning from incidents are heretical perspectives. Both of these began within small communities within the IT industry, and both are generally adopted via grassroots or bottoms-up initiatives. The last similarity is that there's general agreement that both DevOps and learning from incidents are valuable approaches that should not or must not be ignored.

Thank you very much.