Lightning Talk: How to Avoid Knee Jerk Response to Incidents
Josh is Splunk's Senior Technology Advocate focused on next-generation IT operations and DevOps. He is the co-author of several popular books, a serial podcaster, has led numerous technology user groups, and is an awarded public speaker. Josh has more than 20 years of experience in IT working with a wide range of technologies. His most recent focuses have been in DevOps, Digital Transformation, and IT Transformation.
Full transcript
Josh Atwell
Hi, I'm Josh, and I think a lot about failure, and that makes me really popular at parties.
But last weekend, I was celebrating Father's Day, and I was following the news. My favorite retailer was having an issue. I'm not picking on them, but I want to talk about how we respond to it.
Now, Dr. Richard Cook, who talks about how complex systems fail, mentions that as our systems become more complex, there are broken parts throughout the system, and humans are doing work to keep that system from completely failing. Those failures will present themselves in multiple ways. You'll have total outages, where everything's just gone, like what Target, the retailer, experienced; partial outages, where some services may go down; and then degradation, where it's incremental, like Wi-Fi.
And so the impacts of these outages are pretty consistent. Obviously, revenue will be lost. But confidence is also lost, from your customers, from the market, and from your leadership, and your employees always feel terrible about it because they know that they've let somebody down. Now, when this happened to Target, the retailer in the US, they actually lost share value, market value. Walmart did as well, because when we decide how confident we feel, we look at the whole spectrum of an industry, which makes no sense whatsoever.
So if you're a leader, and this is what was bothering me, and you're working with high-performing teams, your natural tendency when there's an outage is to say, "Slow down. Stop. We've got to rethink what we're doing. Maybe we shouldn't be going so fast. We did something wrong." And the problem is that this reaction creates worse problems. You further reduce morale. You're going to impact the progress that you made. You're going to have people questioning whether they're doing the right thing, and you're forgetting about the fact that there is risk with these rewards.
So how do you solve for this? So I was thinking around this. How do you restore the lost confidence? How do you maintain that pace of progress? And then how do you prepare for the next time? Because we are going to have more outages. You can't avoid those.
Now, the first thing I did as I was thinking about this was look at what their CEO did. Right after, he's obviously on the news. He's telling customers, "We know we had a tough weekend. We know we disrupted you. We're sorry." He told the market, "We totally expect that we're going to continue to hit all of our targets. Everything's fine." What I really loved was that he talked about how his team responded to the outage. They identified the problem, they pushed the fix out to the stores, and they got everything running. "We're disappointed, but we responded." I don't know what he said internally. This is just external.
So the next step in this is believing in the process. As you're implementing transformation, whether it's digital transformation, DevOps, whatever, you have put forth a process. You have an end goal. You're going to serve your customers. You have to continue to believe in that process. Review it, but believe in it. And understand that failure should make you stronger as an organization. When you exercise, you break down your body, your body rebuilds itself, and it becomes stronger. Your organization should do the same thing whenever it fails. It shouldn't just give up, which is what a lot of us like to do after we work out.
So the other thing to remember is that IT and software are driving your business value. We have no way of knowing what Target's valuation would be this year if they didn't apply DevOps. But I think we could all agree it'd likely be lower. And the market agrees that the work that they've done has been transformational because they recognize that Target's implementation of technology has become a standard that all of retail should adopt. It used to be Amazon was going to destroy retail, but retail's still doing okay.
Also remember that that added value was not a result of reducing cost. All the work that you've done was based on speeding up your delivery, just like they talk about in Accelerate. Moving faster is what got that value from your customers. And it's also important to keep in mind as a leader that with that added value, there's a gap. Like SREs, we talk about our error budgets, or error targets, so there's a little bit of failure that we should expect.
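The SRE idea of an error budget, the small amount of failure a team agrees to expect, can be made concrete with a little arithmetic. The sketch below is only an illustration and is not from the talk: the `ErrorBudget` class, the 99.9% availability target, the 30-day window, and the downtime figures are all assumed values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: downtime allowed by an availability target."""

    slo_target: float      # assumed target, e.g. 0.999 for 99.9% availability
    window_minutes: float  # length of the measurement window, in minutes

    @property
    def allowed_minutes(self) -> float:
        # The budget is the fraction of the window that may be "bad".
        return (1.0 - self.slo_target) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # A negative result means the budget has been overspent.
        return self.allowed_minutes - downtime_minutes

    def is_exhausted(self, downtime_minutes: float) -> bool:
        return self.remaining_minutes(downtime_minutes) <= 0


# Assumed numbers: a 99.9% target over a 30-day window.
budget = ErrorBudget(slo_target=0.999, window_minutes=30 * 24 * 60)

print(f"Allowed downtime: {budget.allowed_minutes:.1f} minutes")
print(f"Left after a 20-minute incident: {budget.remaining_minutes(20):.1f} minutes")
print(f"Exhausted by a 2-hour outage: {budget.is_exhausted(120)}")
```

Framed this way, an outage becomes a question of how much of the agreed budget was spent, which gives a leader something more measured to discuss than "slow down."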
Now, going forward, you really have to focus on how do you increase the stability in the systems and the process that you have without slowing things down. And there's a variety of ways that you can do that, but the key point is don't slow down. Keep moving. So as a leader, you want to continue to encourage everyone to automate their processes, especially how you respond. Increase that visibility and awareness of the system so that you can see these activities happen in advance, hopefully, and practice incident response. Get better at recovery.
Most importantly, though, as a leader, you have to continue to focus on your customers because your customers are not going to reduce their expectations of you. You may have lost their trust for a brief moment. You may have disappointed people, but they still want more from you.
So there are three things that you have to ask yourself for long-term success. Are we able to restore trust? Do we know how to do that? Are we able to increase our stability? Do we have a plan for that? And are we able to talk about the amount of risk we're willing to take? In doing that, you can avoid knee-jerk reactions.
Thank you.