How to Crush Major Incidents with DevOps Agility

Las Vegas 2020

Head of Product Marketing, ITSM & ITOM · Atlassian

The key to DevOps success is being prepared for incidents, responding quickly and ultimately getting services back up and running. In this session, we’ll explore how, using Atlassian tools, Dev and Ops teams can work together to do just that: respond quickly to incidents, troubleshoot their cause and restore services as fast as possible. We’ll also take a close look at how teams are improving communication and collaboration across their entire organization, minimizing the business impact of unplanned service disruptions.

Darren leads the product marketing team for Atlassian’s ITSM and ITOM Products. He previously held executive positions at Opsgenie, Onshape, InVue, and DS SolidWorks and has a passion for exceptional technology. He enjoys spending time with family, camping, and more recently, woodworking. Darren holds a degree in mechanical engineering from the University of Florida.

This session is presented by Atlassian.

Full transcript

Darren Henry

Hi, my name is Darren Henry, and I lead the product marketing team for the ITOM and ITSM products at Atlassian. Thank you for joining my virtual session. It's entitled "Crushing Incidents with DevOps Agility."

I'm not going to keep my webcam on during the presentation. I don't think my face adds a lot to the content. But I wanted to introduce myself and kick off the session. At Atlassian, we believe that incidents are inevitable. And as practitioners and early adopters of DevOps and agile methodologies, we think there's a lot you can do to accelerate your time to resolution. That's what this presentation is all about. I hope to not only teach you a little bit about our vision and what we're investing in across our products, but also give you some ideas that will help you reduce the time it takes to resolve incidents. So let's get started.

I think a good place for me to start is to define what I mean when I say incident. It's an event that causes a disruption or a reduction in the quality of a service. It requires an emergency response. And believe me, incidents are going to happen. If you Google major service disruptions, you'll see every month there are many that occur. Just this week, Dunkin' Donuts' mobile app was having issues. That's a major catastrophe in the Boston area where I live.

Now, ITIC calculated that the daily costs of incidents are about $4 billion. Gartner stated it was even higher. Any way you cut it, incidents suck, and they're expensive.

Now, we believe that the operations teams that deal with incidents are either IT or dev teams, but they look at incidents through slightly different lenses. The IT teams are responsible for numerous services. They are often made aware of incidents by people reporting issues, and in the case of major incidents, several teams may need to be involved to get to a resolution.

Dev teams are focused on the services they deploy. Real-time monitoring is critical because incidents often arise due to the high frequency of change. Most incidents have to do with either a code change or a third-party service. Atlassian is investing in and focused on incident management. We not only develop, but also use, our own products: Opsgenie to manage alerts and on-call schedules, Statuspage to communicate outages to customers and stakeholders, and Jira Software as well as Jira Service Desk to manage service requests and the work associated with restoring services and fixing the underlying problems.

We are focused on empowering Dev and Ops teams to respond to, resolve, and learn from every incident. And in each of these areas, we think there are big opportunities for improving things. To be clear, all this is more important when you have embraced an agile mentality. We place the whole monitoring and incident management practice firmly on the right side of the DevOps infinity loop, and we recognize that delays here often kill your entire DevOps momentum.

So let's talk about practical and tactical ways to improve how you respond to, resolve, and learn from every incident. We'll start with respond.

Now, over the last two years, most companies we've talked to have automated their incident response. They started programmatically managing alerts and on-call schedules. If you haven't done this yet, it's the first step you should take to really gain efficiency in incident management.

Now, we strive to make this easy with Opsgenie. Opsgenie integrates with all monitoring tools and manages your on-call schedules. A great tactic to improve your team's understanding of alerts is to normalize the way they're displayed to your teams.

Here, you can see an alert from AWS CloudWatch. It may be a little confusing, and I recognize it's hard to see in this slide, but it has to do with a Lambda execution failure. Well, we can take this alert and reformat it with Opsgenie so that the information is clearer. We can specify the source, the issue, and the region right within the name of the alert.

What's really nice is you can normalize all the alerts from other tools like Datadog, New Relic, and Splunk so that they appear in a similar fashion. Responders see the information in a consistent way, and they're spending less time deciphering the alert.
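To make the normalization idea concrete, here is a minimal Python sketch. It is not Opsgenie's configuration or API; the payload keys, the `NormalizedAlert` type, and the title format are all assumptions chosen for illustration. It shows alerts from two tools being mapped to one consistent name that carries the source, the issue, and the region.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class NormalizedAlert:
    """Hypothetical common shape for alerts from any monitoring tool."""
    source: str
    issue: str
    region: str

    @property
    def title(self) -> str:
        # Put source, issue, and region directly in the alert name.
        return f"[{self.source}] {self.issue} ({self.region})"


# The payload keys below are simplified assumptions, not the tools' real schemas.
def from_cloudwatch(payload: dict) -> NormalizedAlert:
    return NormalizedAlert(
        "CloudWatch", payload["alarm_name"], payload.get("region", "unknown")
    )


def from_datadog(payload: dict) -> NormalizedAlert:
    return NormalizedAlert(
        "Datadog", payload["title"], payload.get("aws_region", "unknown")
    )


# One normalizer per tool; adding a tool means adding one entry here.
NORMALIZERS: Dict[str, Callable[[dict], NormalizedAlert]] = {
    "cloudwatch": from_cloudwatch,
    "datadog": from_datadog,
}


def normalize(tool: str, payload: dict) -> NormalizedAlert:
    """Convert a tool-specific payload into the common alert shape."""
    return NORMALIZERS[tool](payload)


if __name__ == "__main__":
    alert = normalize(
        "cloudwatch",
        {"alarm_name": "Lambda execution failure", "region": "us-east-1"},
    )
    print(alert.title)
```

Keeping one small function per tool means responders always read the same title layout, whichever monitoring system raised the alert.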

When an alert is of high priority, we can write a rule that will escalate it to a major incident so that response teams are alerted. And if you're not familiar with Opsgenie, it will notify response teams using on-call schedules and escalations. In fact, we use different notification channels, including push notifications, SMS, voice calls, emails, and chat. So no matter where your team is, critical alerts are never missed.
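The rule and the escalation behavior described here can be sketched in a few lines of Python. This is an illustration only, not how Opsgenie implements it: the `EscalationStep` type, the choice of "P1" as the threshold, and the example delays are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    """Hypothetical escalation step: notify someone after a delay."""
    after_minutes: int
    notify: str


# Assumption: only the highest priority is promoted to a major incident.
MAJOR_INCIDENT_PRIORITIES = {"P1"}


def should_escalate_to_incident(priority: str) -> bool:
    """Rule: high-priority alerts become major incidents."""
    return priority in MAJOR_INCIDENT_PRIORITIES


def responders_to_notify(
    policy: List[EscalationStep], minutes_unacknowledged: int
) -> List[str]:
    """Everyone whose step has come due while the alert is unacknowledged."""
    return [
        step.notify
        for step in policy
        if step.after_minutes <= minutes_unacknowledged
    ]


if __name__ == "__main__":
    policy = [
        EscalationStep(0, "primary on-call"),
        EscalationStep(5, "secondary on-call"),
        EscalationStep(15, "engineering manager"),
    ]
    if should_escalate_to_incident("P1"):
        print(responders_to_notify(policy, minutes_unacknowledged=6))
```

In practice each responder would also have their own ordered list of notification channels (push, SMS, voice, email, chat); that layer is omitted here to keep the sketch short.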

We have some practical advice regarding resolving incidents as well. Collaboration is essential, so we recommend you tie your incident management tools to your favorite collaboration methods. Now, Opsgenie has an incident command center. It has built-in video conferencing, chat, and incident timelines. It can spawn a virtual war room immediately when an incident occurs, and it can direct teams to that room right within their notifications.

We recognize that many of our customers prefer Zoom for video conferencing, so we enabled Zoom to be used alongside their command center. That link can be included in the notifications. We also saw that people prefer Slack and Microsoft Teams, so we built strong integrations with those products.

So here you can see individual Slack channels can be used for each incident. Our tools invite all responders, post the critical information in the header of the channel, and enable responders to take action. Finally, any message that occurs in Slack can be easily recorded back in the incident timeline to keep track of every important action that was taken.

So the practical advice is to utilize the collaboration processes and tools that your response teams like and be very flexible in allowing that. You'll get the most efficient collaboration.

We recognize that quickly investigating the root cause also drastically accelerates incident resolution. And I mentioned earlier that in DevOps environments, many incidents are caused by code deployments. To determine if this is the case, we recommend you look for ways to correlate your incidents to your deployments. And here's an example of how you can do that. With Opsgenie and Bitbucket, we use a service list. We're able to relate incidents to the services that are disrupted, and we can also map our code repositories to the services that they control.

So when an incident occurs, you can use Opsgenie's incident investigation to get a quick understanding of the services that are disrupted, including the service dependencies, and you can see the deployments related to those services. So here we're looking at a very visual indication of the last 24 hours, and we can see clearly successful deployments, failed deployments, and past and ongoing incidents in one place. The halos represent the number of file changes, and if there was a deployment near the time of the incident, it can be tagged as a likely cause.
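The correlation step can be approximated with a simple time-window check. The following Python sketch is an assumption-laden illustration, not the logic Opsgenie or Bitbucket actually uses: the `Deployment` type, the one-hour window, and the service names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment record tied to a service."""
    service: str
    deployed_at: datetime
    files_changed: int


def likely_causes(
    deployments: Iterable[Deployment],
    affected_services: Set[str],
    incident_start: datetime,
    window: timedelta = timedelta(hours=1),  # assumed look-back window
) -> List[Deployment]:
    """Deployments to affected services shortly before the incident, newest first."""
    candidates = [
        d
        for d in deployments
        if d.service in affected_services
        and timedelta(0) <= incident_start - d.deployed_at <= window
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)


if __name__ == "__main__":
    start = datetime(2020, 1, 1, 12, 0)
    history = [
        Deployment("api", datetime(2020, 1, 1, 11, 30), files_changed=12),
        Deployment("api", datetime(2020, 1, 1, 9, 0), files_changed=3),
        Deployment("web", datetime(2020, 1, 1, 11, 45), files_changed=5),
    ]
    for d in likely_causes(history, {"api"}, start):
        print(d.service, d.deployed_at, d.files_changed)
```

Passing the disrupted service together with its dependencies as `affected_services` widens the search in the same way the service dependency view does.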

By surfacing this information, you can see the developers that were involved, include them in the response team, and strategize a fix. This might be rolling back the deployment, turning off a feature flag, or maybe creating a hotfix.

And by the way, you can use our action channels to run diagnostic tools or even take remediation actions as well right from your mobile device. So, for example, you could run an EC2Rescue playbook or an EC2 restart using AWS Systems Manager, all with the tap of your thumb. In summary, you want to use automation and correlation to find fast ways to troubleshoot and remediate incidents.

Now, another great way to crush your incidents is to proactively communicate with users and stakeholders during the incident. This not only builds trust and preserves your reputation, but selfishly, it minimizes distraction by deflecting the redundant reporting of incidents and issues and minimizes people asking you for status updates.

Now, you should set up notifications for stakeholders similar to alert notifications. But a great way to further your communication is with status pages. Right from your incident management tool, you can spawn a public-facing status page when appropriate. We use status embed widgets that can also add messages to our webpages, our help portals, and even our applications.

We believe another great way to proactively communicate is to surface major incident information directly within ITSM and help desk tools. Here you can see our Jira Service Desk offering. And over on the left, you may notice we've added visibility into major incidents with a seamless integration with Opsgenie.

Now, as requests come in, agents can quickly link them to incidents. By defining this relationship, everyone wins. Responders get a sense of the blast radius of the incident and can change priority as needed. Support agents see the status of incidents and can respond faster to help seekers. It's a complete crush.

If a major incident is human-reported, the support agents can even create a major incident directly from the ITSM tool and start the teams swarming on a response.

Okay, the last section of my presentation is fast to review, but also very important, and it has to do with the learn stage of incident management. We believe it's important to track your progress and look for ways to improve, so you should run reports and discuss the trends as a team. Our most popular reporting and analytics include measuring the mean time to assemble and the mean time to resolution. Many people ignore mean time to assemble, but the manner in which you get the right people to start taking action is usually a great place to start the improvement process.
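Both measures are plain averages over incident timestamps. A minimal Python sketch follows; the `IncidentRecord` fields and the definition of "assembled" (the moment the response team is in place) are assumptions for illustration, not Opsgenie's report definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, List


@dataclass(frozen=True)
class IncidentRecord:
    """Hypothetical incident timestamps used for reporting."""
    opened_at: datetime
    assembled_at: datetime  # assumed: when the response team was in place
    resolved_at: datetime


def mean_minutes(durations: Iterable[timedelta]) -> float:
    """Average a set of durations, in minutes (raises if there are none)."""
    return mean(d.total_seconds() for d in durations) / 60


def mean_time_to_assemble(incidents: List[IncidentRecord]) -> float:
    return mean_minutes(i.assembled_at - i.opened_at for i in incidents)


def mean_time_to_resolution(incidents: List[IncidentRecord]) -> float:
    return mean_minutes(i.resolved_at - i.opened_at for i in incidents)


if __name__ == "__main__":
    incidents = [
        IncidentRecord(
            datetime(2020, 1, 1, 10, 0),
            datetime(2020, 1, 1, 10, 10),
            datetime(2020, 1, 1, 11, 0),
        ),
        IncidentRecord(
            datetime(2020, 1, 2, 12, 0),
            datetime(2020, 1, 2, 12, 20),
            datetime(2020, 1, 2, 12, 40),
        ),
    ]
    print(mean_time_to_assemble(incidents))    # minutes
    print(mean_time_to_resolution(incidents))  # minutes
```

Tracking the two numbers separately shows whether time is being lost in getting people together or in the fix itself.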

You should also look for ways to analyze which notification channels work best for your teams. Compare the on-call responsibilities and work distribution, especially after hours. Examine which teams were notified and which teams resolved the issue. Sometimes you're notifying the wrong team for a type of issue. And always look for the sources of the most common incidents. We pre-built a ton of reports to help you gain the insight from these metrics.

And hey, we're talking about DevOps environments, so let me talk quickly about how you can measure DevOps performance. Remember how I talked about relating services to code repositories and CI/CD tools? Well, if you do that, you can quickly understand three of the four DORA metrics and trend them over time. Look at this report that Opsgenie can generate in real time when you share the list of services with Bitbucket and Bitbucket Pipelines.

You can see we provide deployment frequency and change failure ratio, as well as mean time to resolution. We can even trend this data over a period of time so you can see how you're doing. We'll tell you clearly the number of deployments, incidents, and alerts, and then map the deployments versus failures, deployments versus file changes, and overall service health versus team reaction. By aggregating data across systems, you get a clear picture of what the hell's going on and how to improve.
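Deployment frequency and change failure ratio reduce to simple arithmetic once deployments are linked to incidents. The Python sketch below is illustrative only: the `DeploymentOutcome` type and the `caused_failure` flag (for example, a deployment tagged as the likely cause of an incident) are assumptions, and the way Opsgenie computes its report is not shown in the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeploymentOutcome:
    """Hypothetical record: did this deployment lead to a failure?"""
    service: str
    caused_failure: bool


def deployment_frequency(deployments: List[DeploymentOutcome], days: int) -> float:
    """Average number of deployments per day over the reporting period."""
    if days <= 0:
        raise ValueError("days must be positive")
    return len(deployments) / days


def change_failure_ratio(deployments: List[DeploymentOutcome]) -> float:
    """Share of deployments that led to a failure (0.0 if there were none)."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d.caused_failure)
    return failures / len(deployments)


if __name__ == "__main__":
    period = [
        DeploymentOutcome("api", False),
        DeploymentOutcome("api", True),
        DeploymentOutcome("web", False),
        DeploymentOutcome("web", False),
    ]
    print(deployment_frequency(period, days=2))
    print(change_failure_ratio(period))
```

Computing the same figures per week or per month, and plotting them side by side, gives the trend view described in the talk.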

Now, my final advice on crushing incidents is to document and share the knowledge. It goes without saying that every incident should have a postmortem report, but many people fail at documenting what transpired or how the incident was resolved. At Atlassian, we invested in making this easy. When the incident is resolved, you can easily create a postmortem report. Because Opsgenie records everything that transpired, it populates a template, and then the incident commander or the response team can add commentary.

A key point is that the report needs to be shared, so we added the ability to export the report to Confluence, and it can easily be distributed across the organization. This will help speed resolution of similar incidents and help teams avoid common pitfalls.

So how do you crush incidents with DevOps agility? Well, here's a summary of my practical and tactical advice. You should centralize the alerts and then normalize their format for easy reading. Route them to the right teams with strong, redundant notification. To resolve incidents faster, be flexible in the ways teams collaborate and start connecting systems like CI/CD tools and ITSM tools to troubleshoot at lightning speed.

Finally, track and trend data using strong reporting and analytics. Find out the areas of success and the opportunities to improve.

Atlassian wants to help. All the tools I mentioned are available in free versions, and we have trials of all the advanced plans. We also have some killer resources that you should check out, like our incident management handbook that you can download for free, or our incident management website that is chock-full of best practices.

This concludes my presentation. Thanks for spending time with me today. I hope you found this session helpful, and I hope your major incidents are few and easy to smack down when they arise.