Incident Management Meets DevOps

Las Vegas 2019

Surya Avirneni

Senior Manager, Lead Software Engineer · Capital One

Bhavik Gudka

Director Software Engineering · Capital One

Learn how we are applying DevOps principles to innovate Incident Management Process, using a homegrown "Runbook as a Service" platform. We are empowering engineers to automate restoration of known issues using known solutions by leveraging their runbooks.

Full transcript

The complete talk, organized by section.

Bhavik Gudka and Surya Avirneni

Bhavik Gudka: My name is Bhavik Gudka, Director of Software Engineering at Capital One. I am currently in the Card Tech space, managing a site reliability engineering group as well as another team dedicated to the authorizations platform. I also have a colleague here today with me: Surya Avirneni.

Surya Avirneni: Thanks, Bhavik. My name is Surya Avirneni, and I am a software engineering lead for the Capital One Technology Operations Center. We are glad to have all of you here. Thanks.

Bhavik Gudka: Thank you all for coming here. I know it is right after lunch. Let me quickly check, because after lunch people get sleepy, so let us see how sleepy we are here. Good afternoon. Okay, from the last rows I do not hear that. Good afternoon. Okay, cool, we have a wide-awake audience here.

How many of you are developers here? Quite a few. Operations? Production support? A few hands. That is good. And how many of you are DevOps engineers? Come on, guys. All of us are. All of us.

Let us start this presentation with a quick refresher on what DevOps is. Most of us know what DevOps is, and most of us practice it every day in our lives. DevOps is basically a philosophy and a practice where you bring developers and operations together to shorten the product life cycle. It enables enterprises to deliver features much faster to the market. At the same time, it creates a feedback loop for developers, a much-needed feedback loop that did not exist before DevOps, so they can rapidly release high-quality features that also align with the enterprise's business objectives.

Some of the benefits of DevOps: what is in it for you as a developer? You get continuous integration and continuous deployment capabilities, which let you release new features rapidly. What is in it for product managers and business executives? Faster time to market. Reaching customers first and creating great experiences for them is important for any business that wants to grow and capture its audience. That is evident in today's world, where companies like us and many others are excelling at reaching customers faster and creating experiences that keep those customers for a long time.

That is about developers and product managers. But when I talk to operations folks or incident managers, they do not give a damn about DevOps because, so far, it has not benefited them.

The question today that I want to ask, or some of you might be asking if you are on the incident management side, is: what is DevOps giving you? If you are an incident manager or operations person, how has DevOps helped you? Now code is moving into production left and right, and a lot of changes are going rapidly. But incidents and issues will happen all the time. While the speed of development has gone up, we have not seen the full benefits of DevOps for the other side of the fence, which is operations or incident management.

Today we are going to talk about how DevOps can play a role in incident management. We are going to talk about what incident management is, the incident management life cycle, the goals of incident management, how incident management has evolved over the years, and how DevOps can play a role.

01 Incident management definition and goals

Bhavik Gudka: What is incident management? Anyone? Give me one or two words when you talk about incident management. These are some of the words I hear when I talk to people about incident management: boring, labor, stress. Developers say, "I hate it." Some folks who are already in incident management say it is a thankless job. Escalations. You blame others for incidents. It is a waste of developers' time. I used to say that at some point: it is a waste of my time. Nighttime calls. The moment you hear incident management: "Oh boy, I just wake up at night and do things." No life. And of course there are some people who love it, but their sentence is, "I love it when I am on vacation." They love incidents, but only when they are out of town. That is how people see incident management.

What is the incident management life cycle? It is a clock. That clock is ticking. The problems start for our customers, and from an incident management perspective, we have to make sure the incident gets detected, we have the right folks or right tools to solve that problem, and finally the problem gets solved. You detect a problem, get the right people or tools mobilized, and then solve the problem. Most of the time, solving a problem might take five, ten, or fifteen minutes, but the things before that might take hours.

The goal of incident management is to reduce the time to recover from problems, because any company has one goal: always on. We want our customers to be happy. You can have the fastest time to market, but you have failed if there is downtime, if there are incidents, and you cannot resolve those incidents quickly for your customers or give them an always-on experience.

How do you reduce time to recover? You reduce TTD, time to detect; TTM, time to mobilize the right people; and finally the time to restore service once they are engaged. In general, we also want to reduce incident count, but today's topic is not about that second bullet because it takes a lot more to reduce incidents. We are going to focus on this: when there is an incident, how do we reduce the time to recover?
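The breakdown above can be written as simple arithmetic. The following is a minimal sketch, assuming time to recover is just the sum of the three phases; the class name, field names, and the minute values are illustrative assumptions, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentTimeline:
    """Minutes spent in each phase of one incident (hypothetical values)."""

    detect_min: float    # TTD: problem starts -> problem detected
    mobilize_min: float  # TTM: detected -> right person or tool engaged
    restore_min: float   # engaged -> service restored

    def time_to_recover(self) -> float:
        # Total customer-facing downtime is the sum of all three phases.
        return self.detect_min + self.mobilize_min + self.restore_min

    def share_before_fix(self) -> float:
        # Fraction of the outage spent before anyone starts fixing anything.
        total = self.time_to_recover()
        return (self.detect_min + self.mobilize_min) / total if total else 0.0


# Made-up example: a short fix that sits behind long detection and mobilization.
example = IncidentTimeline(detect_min=30, mobilize_min=60, restore_min=10)
print(example.time_to_recover(), example.share_before_fix())
```

With numbers shaped like these, most of the outage is spent before the fix starts, which is why the talk focuses on detection and mobilization.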

02 Evolution stage 1: manual reaction on the go

Bhavik Gudka: Before I talk about that, we are going to talk about how incident management has evolved over these years. We will talk about the old days, 15 or 20 years back, and how incident management used to happen. Let us assume Surya is a developer and I am an incident manager or operations person. I get a call or somehow learn that there are problems in my system. Somebody told me that Surya is the person who can solve it.

Surya Avirneni: Hey, Bhavik. Hello.

Bhavik Gudka: Hey, Surya. We are having a lot of problems, and it looks like we might have to fail over our application to our secondary region. We need you on the call. Can you please join?

Surya Avirneni: Hey, Bhavik, I am at the beach. I cannot join the call right now.

Bhavik Gudka: This is important, man. Your manager is also on call with me. We need you.

Surya Avirneni: But I did not carry my office laptop with me. Sorry, Bhavik.

Bhavik Gudka: Okay, how about just guide us?

Surya Avirneni: There is a lot of background noise. Can you come to a better area so I can talk to you?

Bhavik Gudka: Okay, give me 15 minutes. It is 15 minutes, or 16 minutes now. Hey, Surya, where are you, man? It is 16 minutes already.

Surya Avirneni: Give me another five minutes.

Bhavik Gudka: I am not going to finish that whole call. He joins, he guides us, and the problem is solved. The problem got solved in five minutes after he joined or gave us direction, but it took us 20 minutes, and this was the happy path. What if it had taken an hour? Does anybody remember those days?

What happened there? Manual monitoring, manual mobilization, and on-the-go remediation. On the fly, he told me something on the phone and we did that. What are the results, Surya?

Surya Avirneni: With manual monitoring, the way we used to monitor our systems back in the day, detection of issues is slow. You cannot detect a problem automatically; someone has to report it to you, whether that is your agents, your call center associates, or your customers calling in. Mobilization, as we have seen, is almost not there at all. Someone has to call someone, then desperately wait for that person to join the call and fix the problem. Remediation was slow because we could not mobilize the right person at the right time, and the fix itself then took more time on top of that.

03 Evolution stage 2: manual reaction using a runbook

Bhavik Gudka: Things improved. Some monitoring automation helped, but everything else was still not that great. One thing that might have changed was that the steps he was telling me on the phone, he started putting in some form of runbook, guide, or operational guide. Whatever he was going to tell me on the phone, he has put on paper and I have a document somewhere.

Let us look at stage two: react manually using some runbook. A runbook will have some statements written in English, some steps.

Surya Avirneni: Hello, Bhavik.

Bhavik Gudka: Hey, Surya. We are trying to fail over an application to a secondary region. I have the runbook that your team gave me. I followed all the steps. The error rate has gone down, but we see a lot of latency. Something is not right.

Surya Avirneni: Hey, can this wait?

Bhavik Gudka: No, Surya, this cannot wait. Our customers are waiting.

Surya Avirneni: Why don't you try the document that is in some XYZ site?

Bhavik Gudka: No. I followed all the steps there. Any idea why there would be latency? I do not think that was expected.

Surya Avirneni: Did you fail over all the services?

Bhavik Gudka: Yes, whatever was on the guide, I failed over all those things. What do you think I am missing?

Surya Avirneni: Did you move the caching service to the other region as well?

Bhavik Gudka: Let me check and get back to you. I look at the caching service, and yes, it was not failed over. I do that and call him back. Hey, Surya, thanks for the tip. That helped. But why did you not put that in the runbook?

Surya Avirneni: I am new to this team. I do not know.

Bhavik Gudka: Can you please get that document updated for me?

Surya Avirneni: All right. Afterwards.

Bhavik Gudka: Automated monitoring, manual mobilization. I still had to get that document and still had to talk to the person when I ran into issues. Planned remediation was a good thing; at least remediation was planned, not on the fly. But the result was not exactly what we wanted.

04 Evolution stage 3: scripts as automated runbooks

Bhavik Gudka: Stage three: things have improved. The document is not a document anymore; it is a script. It is reacting using a script, but manual intervention is still required.

I am trying to fail over this application. We have some issues going on. I am not able to download your script.

Surya Avirneni: Did you get the right script and the right URL that I sent you last week?

Bhavik Gudka: I think so. Let me get back to you. After doing some more digging, finally I get a proper URL. Looks like the team had sent some URL that we missed. Hey, Surya, I found the new URL that you sent me in the email. Sorry for calling you so late, but things are good. I ran your script and it is fine. Thanks.

Surya Avirneni: I hope he never calls me again.

Bhavik Gudka: Automated monitoring, manual mobilization. I still had to find that script somehow. Here I was not dependent on Surya, I was dependent on the script. I could not find the script, so I again depended on Surya. Automated remediation because the script is doing that for me. Result: fast detection, slow or fast mobilization depending on the day, slow or fast remediation depending on the day.

05 Evolution stage 4: DevOps for incident management

Bhavik Gudka: Then we said: we have a known problem and a known solution, but we still have this manual business going on. How do we fix that? Can we use the DevOps way to do incident management?

DevOps is about shortening the life cycle of software development or business goals. My definition of incident management using DevOps is: shorten the incident management life cycle while delivering features, fixes, and updates frequently in close alignment with operational objectives. Our approach is to automate the end-to-end process, similar to a CI/CD pipeline, connecting the three stages of a production outage: detecting the problem, using that detection to find the matching script, and triggering that script.

Solve known problems using known solutions automatically. Sometimes a human will still be required, for example if the script fails for whatever reason. In that case, and for unknown problems, page the relevant teams automatically.

This is how the flowchart looks. Monitoring tools automatically detect the problem. Events come in, and there is a runbook-as-a-service platform. If it is a known problem and there is a known runbook, the runbook can be a script and we invoke it. If it is a new problem for which I cannot find the script, then I page the on-call. Here we are trying to cut down the time it takes to mobilize a script or mobilize a person if the script cannot be found.

The result is fast detection, fast mobilization of resources - script or team - and fast remediation of known issues using known actions. If it is an unknown issue, or a known issue with an unknown solution, the first time it will take time, but once we do it the first time, the second time it should be automatic.
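The flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Capital One platform: the event shape (a dict with a `signature` key), the `handle_event` function, and the `page_on_call` callback are all hypothetical names.

```python
from typing import Callable, Dict

# A runbook is any callable that takes the event and reports success or failure.
Runbook = Callable[[dict], bool]


def handle_event(
    event: dict,
    registry: Dict[str, Runbook],
    page_on_call: Callable[[str, dict], None],
) -> str:
    """Run the matching runbook for a known problem, otherwise page a human."""
    runbook = registry.get(event["signature"])
    if runbook is None:
        # Unknown problem, or a known problem with no runbook yet.
        page_on_call("no matching runbook", event)
        return "paged"
    try:
        succeeded = runbook(event)
    except Exception:
        # Treat a crashing runbook the same as a failed one.
        succeeded = False
    if not succeeded:
        page_on_call("runbook failed", event)
        return "paged"
    return "remediated"
```

The point of the design is that both branches are automatic: either a script is mobilized or a person is, and nobody has to search for a document or a URL first.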

06 Runbook as a Service at Capital One

Surya Avirneni: Why did we build this platform? As Bhavik explained, when you solve known issues with known remediations that are automated, you reduce overall downtime or application instability issues, and you help your customers access your apps without downtime.

Imagine putting runbook as a service out as an enterprise service for all of your tech teams to automate their known problems. For known problems, we can execute a runbook in the registry and fix the problem. For unknown problems, you page someone to come and fix the problem. At the same time, you can apply learning: what was the fix? Was this a known problem? Will it reoccur all the time? Can we solve it at the root, or is this something we have to live with? When you ask that question, you can implement a runbook for that problem or scenario.

We wanted our developers and tech teams not to have to worry about another automation solution. We wanted them to use existing automation toolsets that they are already familiar with and use in their DevOps pipelines. The runbook-as-a-service platform creates a unified runbook language across the company. Whether you are in team A or team 100, you speak the same language when you talk about a runbook. When you move across teams, there is no confusion. This abstraction also leverages all the existing toolsets: sometimes a shell script, sometimes a more advanced Terraform job or something like that.
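One way to picture a unified runbook language over existing toolsets is a common record plus pluggable executors. The sketch below is an assumption made for illustration; `RunbookSpec`, `run_runbook`, and the executor mapping are hypothetical and do not describe the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# An executor wraps one existing tool: it takes a command and returns an exit code.
Executor = Callable[[List[str]], int]


@dataclass(frozen=True)
class RunbookSpec:
    """The same description for every team, whatever tool does the work."""

    name: str
    owner_team: str
    tool: str            # e.g. "shell" or "terraform" (illustrative labels)
    command: List[str]   # what the chosen tool should run


def run_runbook(spec: RunbookSpec, executors: Dict[str, Executor]) -> bool:
    """Dispatch the runbook to the executor registered for its tool."""
    executor = executors.get(spec.tool)
    if executor is None:
        raise ValueError(f"no executor registered for tool {spec.tool!r}")
    # By convention in this sketch, exit code 0 means the runbook succeeded.
    return executor(spec.command) == 0
```

In a real setup an executor might wrap a process launch or a call to an automation service; here it is left abstract so that teams keep the tools they already use.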

With deep integration into enterprise functions like change management and incident management, we can follow enterprise policies for those processes while reducing downtime of our applications. The constant learning in the runbook life cycle helps us automate more runbooks as we progress across the company. Last but not least, it becomes a hub for all enterprise runbooks, so we can promote reusability across the enterprise for similar tech stacks. You can have one template created for known problems in a type of tech stack and apply it in hundreds of places without teams having to build their own automation.

Some use cases we have seen so far: multi-region failover of databases. Most databases will not replicate across regions, purely for cost reasons, so they operate in one region. Region outages can happen, and availability zones may have problems. In those cases, you have to move them over to another region. It involves multiple steps, and we do not want someone to log in to the cloud provider or data center and do those steps manually. That is one straightforward use case we can apply across tens of applications across our company.

Automated disaster recovery is another. You can easily automate disaster recovery steps, so someone does not have to perform all those actions manually. There can be other items as well, but my favorite is automated diagnosis and troubleshooting. Automated runbooks save time in troubleshooting, pinpointing, and triangulating the problem, especially in an enterprise with a large tangle of applications that depend on each other. You can get closer to remediation rather than spending all the time during an incident call troubleshooting.

07 Always on, learning, and future goals

Bhavik Gudka: Our goal is the same as any company: we want to keep our enterprise always on. We can build this platform and have a standard way for every team to have their runbook. From a developer's perspective, when you are working on a problem and trying to fix something in production, there are protocols you need to follow. You have to file a change order, file a ticket for auditing, and alert people. The runbook-as-a-service platform will do all those things for you. As a developer, I only focus on my problem and the logic that is going to fix my problem, and I do not worry about the peripheral things that also need to happen during an incident management process.
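The split between the developer's fix logic and the surrounding protocol can be sketched as a wrapper. This is a hypothetical illustration: `run_with_protocol`, the step names, and the in-memory `audit_log` stand in for whatever change management, ticketing, and alerting systems an enterprise actually uses.

```python
from typing import Callable, List, Tuple


def run_with_protocol(
    runbook_name: str,
    fix: Callable[[], bool],
    audit_log: List[Tuple[str, str]],
) -> bool:
    """Wrap a team's fix logic with the peripheral incident steps."""
    # Steps the platform takes on the developer's behalf before the fix runs.
    audit_log.append(("change_order_opened", runbook_name))
    audit_log.append(("teams_alerted", runbook_name))
    try:
        succeeded = fix()  # the only part the owning team has to write
    except Exception:
        succeeded = False
    # Record the outcome so there is an audit trail either way.
    outcome = "ticket_closed" if succeeded else "ticket_escalated"
    audit_log.append((outcome, runbook_name))
    return succeeded
```

A team would pass only its own `fix` callable; the order of the peripheral steps here is an assumption chosen for readability.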

More importantly, once there are more and more runbooks, we can also do machine learning out of that once we have the data. We can figure out if there is an application that just keeps on patching stuff. Sometimes, in the manual world, you will not realize that a team is taking shortcuts and patching the problem by doing the same thing again and again. With this platform, I can pull metrics and see that Team A got this problem 10 times a month and every time this runbook was run. That is not good. Something is not right. While we encourage people to put runbooks in place, we can also find out if they are doing too much of it, which means they are not fixing the root problem.

For disk space cleanup, it is okay if a runbook runs once in six months or once in three months. If it is running every day, something is wrong: they are not rotating logs, or they are writing a lot of junk in their logs. Another example is failover. If somebody is failing over every day, something is not right. Failover once a month, once in six months, or once in three months makes sense. While we encourage teams to build runbooks, we can pull metrics on how many people are patching software rather than building good software.
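The "patching too often" metric can be sketched as a count of runbook executions per team within a time window. The function name, the record shape, and the default threshold below are assumptions for illustration; the talk does not specify what limit the platform uses.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

# One execution record: (team, runbook name, time it ran).
Execution = Tuple[str, str, datetime]


def overused_runbooks(
    executions: Iterable[Execution],
    now: datetime,
    window_days: int = 30,
    max_runs: int = 3,
) -> Dict[Tuple[str, str], int]:
    """Return (team, runbook) pairs that ran more than max_runs times in the window."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        (team, runbook) for team, runbook, ran_at in executions if ran_at >= cutoff
    )
    # Anything above the threshold suggests the root cause is not being fixed.
    return {key: runs for key, runs in counts.items() if runs > max_runs}
```

A sensible threshold would differ per runbook type: a failover once a quarter is normal, while a daily disk cleanup is a signal to look at log rotation.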

Another aspect of machine learning would be: if you know a runbook is written properly for one team, and other applications have events that do not find a matching runbook, maybe we can tell them that another team has a runbook for a similar event and they should talk to each other. If you have a platform and a registry in one place, anybody can learn. As a developer, I want to learn from my mistakes, but I also want to learn from others' mistakes. I want to make sure that I do not run into the problems others had, or maybe I can reuse their solution if it is valid for me.
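Recommending another team's runbook for an unmatched event could start far simpler than machine learning. The sketch below swaps in plain word-overlap (Jaccard) similarity as a stand-in; the function name, the registry shape, and the 0.3 cutoff are assumptions, and a production system would likely need something more robust.

```python
from typing import Dict, List


def suggest_runbooks(
    event_text: str,
    registry: Dict[str, str],
    min_score: float = 0.3,
) -> List[str]:
    """Rank registered runbooks by word overlap with an unmatched event."""
    event_tokens = set(event_text.lower().split())
    scored = []
    for name, description in registry.items():
        tokens = set(description.lower().split())
        union = event_tokens | tokens
        # Jaccard similarity: shared words divided by all distinct words.
        score = len(event_tokens & tokens) / len(union) if union else 0.0
        if score >= min_score:
            scored.append((score, name))
    # Best match first.
    return [name for _, name in sorted(scored, reverse=True)]
```

Because every runbook lives in one registry, even a crude match like this can point a team at a colleague who has already solved a similar problem.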

Surya Avirneni: As part of Capital One TOC, the Technology Operations Center, we are responsible for keeping the lights on in the company. As we go into 2020, we have bold goals for how we can automate most of the runbooks for all known problems and, as Bhavik mentioned, use machine learning to apply those automated runbooks across the company for similar events. Our ultimate goal is to see the day when we do not have to have a runbook and there is no issue. But software can fail and systems can fail, so it is always handy to have runbook automation in place so we can respond to those events whenever they occur.

That is our talk.

Q&A

Audience: What is the ratio between known problems and unknown problems? Have you seen any decrease in unknown problems? If you have very few known problems, the solution does not really help.

Bhavik Gudka: Right now we are in the early phase of this platform. You are right: we have more unknown problems than known problems. But it is not that the problems are unknown. It is mainly because we do not have a matching runbook, because it is a rule-based engine. On day one, when there are no runbooks, there are no rules. As more people adopt and build more runbooks, that ratio will get better.
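The point about a rule-based engine can be made concrete: with no rules, every event counts as unknown, and the known share rises only as teams register rules. The rule format below (a `when` dict of required field values plus a `runbook` name) is a hypothetical shape chosen for this sketch.

```python
from typing import Dict, List, Optional


def matches(rule: Dict, event: Dict) -> bool:
    """A rule matches when every field it names has the required value."""
    return all(event.get(field) == value for field, value in rule["when"].items())


def find_runbook(event: Dict, rules: List[Dict]) -> Optional[str]:
    """Return the runbook of the first matching rule, or None if nothing matches."""
    for rule in rules:
        if matches(rule, event):
            return rule["runbook"]
    return None


def known_ratio(events: List[Dict], rules: List[Dict]) -> float:
    """Share of events that the current rule set can handle automatically."""
    if not events:
        return 0.0
    known = sum(1 for event in events if find_runbook(event, rules) is not None)
    return known / len(events)
```

On day one, `known_ratio` is zero for any set of events; each new rule can only move it up, which is the adoption curve described in the answer.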

Audience: Some use cases you showed were disk space cleanup or restarting an instance. If you had an automated runbook to do that, restarting an instance again and again might not attach to the root cause and might end up as a bigger time bomb. Do you plan to force a policy around saying, "You restarted this eight times last month; we are not going to do that"? Also, what underlying tools like Slack or Jenkins have you used to implement this?

Surya Avirneni: Once you establish a platform and see those remediation use cases happening, we want our problem management practice to pick up the use cases that should no longer be needed. Disk space cleanup or restarting instances may no longer be relevant in the world of cloud, where you have auto-scaling and similar features. We want to develop that metric and have our problem management practice apply root cause analysis across teams as an enterprise standard.

To answer your second question, you can use any automation behind the scenes. We are using serverless Lambdas and step functions, and then we integrate with other toolkits across the AWS cloud and monitoring suites.

Bhavik Gudka: Eventually, if somebody is patching too much, we will stop them because of the metrics we pull out. But there is always a gray area where you know the team has to fix it the right way, and they may need three or four weeks. During those weeks, if the problem can happen ten times, how do we make sure that a known problem is solved by a shorter known solution with minimal intervention as quickly as possible? It is okay to have a patching kind of runbook for some time, but not forever.

Audience: You said you are using machine learning or artificial intelligence.

Bhavik Gudka: We will. It is our roadmap.

Audience: Typically, for using machine learning models, you need a lot of data. How much data are you planning to capture from runbook information? It will not be that many records, right?

Surya Avirneni: It may not be just the runbook information. We already have incident information from the past and what actions were performed during those incidents as part of incident activity and restoration processes. We can apply natural language processing and textual analysis on all those incidents to come up with automated recommendations. The runbook information will complement that.

Audience: If you take even the past 10 years, you might not see millions of records.

Bhavik Gudka: You are absolutely right, and that is why we said it is our target. We have not started doing it right now because it is a new platform. As we get more data, it is not just about the runbook itself; it is about how many times the runbook is triggered. Even that data will be helpful.

Audience: Have you explored this for solving application issues, like multiple microservices where one service is down or taking a long time? The use cases so far are more infrastructure use cases that cloud platforms often provide.

Bhavik Gudka: You are absolutely right. We can automate anything. If my team is getting involved in an application issue and following eight, nine, or ten steps, and they know that is what they are going to do whenever that issue happens, they can automate that. Your runbook is not always about fixing a problem. Your runbook can also be about troubleshooting. If you want data from three or four places to make a decision, you can make a runbook out of anything where you know you have a known problem and a known set of steps you would like to do. In fact, there might be more runbooks for troubleshooting than for actual fixes.

Thank you all.