How Engineering does DevOps using Slack

London 2020

Around the globe, DevOps teams use Slack for everything from code reviews to cross-functional communication. Now you can learn how to unlock these capabilities, as well as how to build a DevOps culture that’s ready for change.

In this session, we’ll show you how technical teams can centralise and automate their workflows through Slack using all their favourite tools and apps. You’ll come away with an understanding of how DevOps teams can work more efficiently with Slack, leading to stronger features and faster releases.

This session is presented by Slack.

Full transcript

The complete talk, organized by section.

V Brennan

Hi, I'm V Brennan from Slack. I'm delighted to be here today to talk to you about how we leverage Slack as a platform for our DevOps activities. Please come along to our Slack channel after the session for a Q&A.

But let's dive in. I'm going to start by telling you a little bit about me. I've led development teams, operations teams, and even IT teams. I'm a big proponent of DevOps. I think it brings together two very important and different skill sets. I think we build better software together.

I started my technical career out at BNZ. It was surprisingly modern for a 150-year-old bank. It was where I had my first introduction to Agile, DevOps, test-driven development, and Lean. I got to work on some very sexy features like foreign exchange and digitizing and modernizing a bank. From there, I moved to Spotify, and because of my interest in Agile, it was very like going to church. It was quite a ride at Spotify. I really thoroughly enjoyed my time there.

But we decided to move back home to Ireland to be closer to family. When I was looking around for somewhere to go to, Slack was obviously a standout candidate for me. I'm really passionate about communication and collaboration, so it was something that resonated with me on a personal level. I've been really excited to see Slack grow, especially in the recent months, and be part of a tool that's literally helping people stay connected in times like this. So it's a very inspiring and personal connection that I feel I have for Slack.

So what are we going to talk about today? On the agenda, we have a quick intro to Slack for those of you who don't know our history, and then we're going to look back at how DevOps emerged at Slack. We're going to talk about some of the challenges that modern software teams face, and we're going to show you some examples of how Slack innovates to actually address those kinds of challenges.

So what is Slack? For those of you that are not familiar, Slack started out as a video game called Glitch. So it's not my current day job. I kind of wish it was. This is a screenshot from the game called Glitch. It was created by our founders, the same founders who created Flickr, which was an early photo-sharing platform back in the early 2000s.

Unfortunately for Glitch and its small but loyal fan base, the game never really had broad appeal, and so they decided to shut it down. But when they explained the situation to their investors, they asked if there was anything else that they thought they could take to market. The team had developed an app to encourage communication and collaboration, and they really felt that they couldn't work without this app, and maybe that this was something that other people would enjoy also.

So Slack back then was all about persistent chats. It had a few integrations, like you can see the Google Calendar integration right there, but the user experience was really clean, and it was something that everybody really enjoyed. Anyone who interacted with it really enjoyed the user experience.

Fast-forward to today. Since 2013, we've grown to over 12 million daily active users, and over half of them are outside the US. Collectively, those users send a billion messages a week, and last year, we reached a significant milestone by going public. There's no doubt we benefited from a great product-market fit at the time.

So let's jump into our DevOps story. We've observed, at least, that systems are becoming more and more complex. Teams have had to change to keep up. Since the early days, DevOps has become a standard way of working, but also a competitive advantage. Operators write code, developers write config, and need to understand how the system works under the hood. The world continues to move faster and get more complex, and the result of this can really be overwhelming if you're not in front of it.

DevOps is about leveraging people with broad system knowledge around a particular change or problem. We've approached it from the perspective of service ownership and tooling, and I'll talk about both in turn.

So when we talk about service ownership, we talk about it from a perspective that teams own their end-to-end customer experience. So we do that through faster tooling in Slack. This is a mindset and a culture we've fostered, and it's very much a journey. It's not complete. When we first launched the idea of service ownership, it was quite scary for development teams, and it's been something that we've actually had to work on a lot and offer more and more tooling.

We do believe it's true to our legacy of not creating silos and customer handoffs. But as we've evolved, it has become harder. Slack is harder to navigate. There are fewer people in the organization who actually understand how the full system works end to end. And we've needed to help developers feel successful and safe, and for us, tooling is the key.

So these tools make all the difference. We've built tools to support developers to manage deploys, logging, alerting, escalations, and support. We also have an embedded SRE model, which is helping us grow broader skills in teams.

But assuming this ownership is not free. We don't just shift the work from one team to another. There's an emphasis on preparation and support, and the goal is to reduce the burden but still empower responsibility through visibility. So we believe incident response works best when we have both the system and the developers responding together.

We have provided tooling so that developers don't have to understand how Prometheus or Terraform works. What we want is to make the developer experience efficient and joyful.

So what do we mean by service ownership? Well, like I've already said, we're talking about the teams being responsible for managing their customer experience end-to-end. This means managing their monitoring and delivering their software to users in production. But it also means after that, that they've got service health instrumentation around service-level objectives, that they've got good monitoring and alerting for rapid response when there is a problem. We also have production readiness reviews and deployment risk assessments to make sure that when we're putting a major new feature into production, people have taken the time and energy to make sure that it's not going to break production.
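To make "service health instrumentation around service-level objectives" a little more concrete, here is a minimal sketch of an error-budget check. Everything in it is an assumption made for illustration: the 99.9% target, the request counts, and the function names are hypothetical and do not describe Slack's internal tooling.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the success-rate objective, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def should_alert(remaining: float, threshold: float = 0.25) -> bool:
    """Alert once less than `threshold` of the budget is left (hypothetical policy)."""
    return remaining < threshold
```

With a hypothetical 99.9% objective and one million requests, 250 failures would leave roughly three quarters of the budget unspent, while 900 failures would drop below the assumed alert threshold.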

Capacity and performance monitoring is something that's really important, and we saw that really play a huge role back in March when we had a surge in activity when the whole world started to work from home. Thankfully, our capacity and performance planning had kicked in, and we saw the majority of our systems just scale automatically. It was quite a thing to see.

We also ensure that teams have PagerDuty rotations. We've got a solid incident response process and post-mortem activities. So the goal is that we take our teams from feeling scared and worried and unsure to creating this joyful experience. So it's about taking the anxiety out of operations and replacing the ambiguity by abstracting that away.

There is a lot of surface area for people to cover, so we have used Slack as a platform in order to insulate them from all of that change and all of that knowledge that they would normally have to have.

So this is where DevOps comes in. So DevOps means reducing risk through tools and culture. At least that's what it means for us. As systems grow, the number of platforms you need to prevent fatal failures increases. Like all DevOps organizations, we look to reduce repetitive manual tasks and reduce the need for operator intervention. The difference is that we've leveraged Slack as a platform to do that.

So here are some examples. This is an example from Internet Relay Chat of developers, operations, build logs, alerts, and monitoring all having a conversation with each other, and the goal here is to enable transparency, collaboration, and integration.

We leverage Slack as a platform to better enable automation and reduce toil. We're connecting development and operators and putting key context in place to unlock new workflows. So in the very early days, we used Internet Relay Chat, and one of the important innovations that we made early on was that we saved this information so that it could be searched later and provide a lot of context for teams at a later date.

So here's an example of one of our most popular and joyful integrations. This is Deploy Wizard. Through Deploy Wizard, we can look into our continuous integration, continuous delivery pipeline and tell developers when their PRs are being deployed and by whom. We do have automation and alerting around these things, but it is really nice for developers to know when these things are actually happening. So it's a really simple interface that empowers important behavior by integrating several systems and workflows all in a single place.
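The idea of telling developers when their PRs are being deployed, and by whom, can be sketched in a few lines. This is an illustrative sketch only: the `DeployEvent` type, its fields, and the message wording are assumptions, and nothing here reflects how Deploy Wizard is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployEvent:
    """One PR going out in a deploy (hypothetical shape, for illustration)."""
    pr_number: int
    pr_author: str
    deployer: str
    stage: str  # e.g. "staging" or "production"


def format_deploy_notice(event: DeployEvent) -> str:
    """Build the message a PR author would see in chat."""
    return (
        f"@{event.pr_author}: PR #{event.pr_number} is being deployed "
        f"to {event.stage} by @{event.deployer}"
    )


def notices_for_deploy(pr_authors: dict[int, str], deployer: str, stage: str) -> list[str]:
    """Produce one notice per PR in the deploy, ordered by PR number."""
    return [
        format_deploy_notice(DeployEvent(number, author, deployer, stage))
        for number, author in sorted(pr_authors.items())
    ]
```

In a real integration the resulting strings would be posted through a chat API; that step is deliberately left out so the sketch stays self-contained.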

Jira is our source of truth for work in progress. So Jira integrations allow us to see the context of a ticket in Slack without having to leave the app. So there's no time needed to switch from one application to another. I get regular updates from Jira bot of tickets that I'm interested in, or maybe new tickets that have been assigned to me, new comments, status updates. I can also create a new ticket directly within Slack using the Jira integration.

Escalation Bot solves one of the tried-and-true troubles that we have with operations: who owns this feature? As an incident commander myself, I know how hard it is to figure out who owns what, never mind grapple with team names. So Escalation Bot helps us save time in putting the right people in front of the right problem. Previously, escalations went via a human operator. Now escalations go straight through to the right team and their PagerDuty rotation if necessary.

This is critical to save time and help enable teams to keep up with an ever-changing landscape. We are still available as major incident commanders if teams need help, but it does strike a balance between creating that independence for folks and autonomy, while also being there to provide support if they need us.

So the app is just a slash command away anywhere in Slack. There's a lot of functionality here, but what I want to highlight is the centralized feature of who owns what. So this is rooted in some fuzzy logic that helps people enter a description of a feature or a problem that they're seeing, and we see that we have some non-obvious results that the system throws up.
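A "who owns what" lookup with fuzzy matching can be approximated with Python's standard `difflib` module. The sketch below is a guess at the general idea, not a description of Escalation Bot: the ownership catalog, team names, and cutoff value are all hypothetical, and `SequenceMatcher` is simply one convenient similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical feature-to-team catalog, invented for this example.
OWNERSHIP = {
    "message search": "search-team",
    "file uploads": "files-team",
    "channel sidebar": "sidebar-team",
    "push notifications": "notifications-team",
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def who_owns(description: str, catalog: dict[str, str] = OWNERSHIP,
             limit: int = 3, cutoff: float = 0.3) -> list[tuple[str, str]]:
    """Return up to `limit` (team, feature) pairs, best match first."""
    scored = [
        (similarity(description, feature), feature, team)
        for feature, team in catalog.items()
    ]
    matches = [entry for entry in scored if entry[0] >= cutoff]
    matches.sort(key=lambda entry: entry[0], reverse=True)
    return [(team, feature) for _, feature, team in matches[:limit]]
```

Because the score tolerates typos, a misspelled query such as "file uplods" still ranks the hypothetical files team first. Returning several candidates rather than one leaves the final choice to the person escalating.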

In this case, I'm going to stick with the Anatomy team. Inline, we provide access to their Jira project, as well as a collection of special-purpose channels that we have related to this team and this service. So let's look at direct actions that we can take from here.

We're going to escalate. We can either page a team or start a major incident. So in this case, we're going to page an engineer. We can decide the severity. Again, we see the fuzzy logic here for the team that we're looking for. We've got some choices. I would page our Anatomy team, but for today, let's leave them to carry on with their job.

A further innovation that we have to that is our Incident Bot. The purpose of Incident Bot is to speed up the creation of incidents, establish clear communication and ownership really, really quickly. We automate the creation of Jira incident tickets and notifications to relevant channels, and we can also provide a technical summary based on the information that's been entered into the channel. So let's have a look here.

Again, it's just a slash command away. You can just type incident PDE, and you can see a list of all of the different types of activities. Here we've run the command to view all current open incidents. In this case, we're going to open a new incident. So we write a brief summary, choose the severity level, and set the incident commander.

Automatically, we can see a Jira incident ticket is being created. The Slack channel name is being created, and it's posting to the incident notification channels. Depending on the severity, that could be to the team, it could be to our exec team.
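The sequence just described (create a ticket, derive a channel name, notify channels according to severity) can be sketched as follows. This is an illustration under stated assumptions: the ticket key format, channel naming scheme, channel names, and severity routing rule are hypothetical, and the ticket is only simulated with a local counter rather than a real Jira call.

```python
import re
from dataclasses import dataclass
from itertools import count

_ticket_ids = count(1)  # stand-in for a ticketing system (assumption)


@dataclass
class Incident:
    ticket_key: str
    channel_name: str
    summary: str
    severity: int
    commander: str
    notified: list[str]


def slugify(text: str, max_len: int = 40) -> str:
    """Lowercase the text and collapse anything non-alphanumeric into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_len].rstrip("-")


def notification_channels(severity: int) -> list[str]:
    """Hypothetical routing: everyone hears about it, execs only for SEV1."""
    channels = ["#incidents-team"]
    if severity <= 1:
        channels.append("#incidents-exec")
    return channels


def open_incident(summary: str, severity: int, commander: str) -> Incident:
    """Simulate ticket creation, channel naming, and notification routing."""
    ticket_key = f"INC-{next(_ticket_ids)}"
    channel_name = f"#incd-{ticket_key.lower()}-{slugify(summary)}"
    return Incident(ticket_key, channel_name, summary, severity,
                    commander, notification_channels(severity))
```

Keeping the routing rule in its own small function makes the severity policy easy to read and change without touching the rest of the flow.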

So the incident channel has been created. What we're going to do now is update the incident commander. Again, it's just a slash command away. We just type IC, and from the dropdown, we select the new incident commander. We can also update the severity of the incident at any time by just typing, again, sev, and we choose from a dropdown and change it to severity two. And lastly, when it's under control or resolved, we can change the state. So we just type state under control. It's automatically updated the heading with under control.
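The three updates shown here (incident commander, severity, state) amount to a small command handler that changes the incident and refreshes its heading. The sketch below assumes plain-text commands of the form `ic <name>`, `sev <number>`, and `state <value>`; the allowed states, the severity range, and the heading format are hypothetical and not taken from Incident Bot.

```python
from dataclasses import dataclass

VALID_STATES = {"active", "under control", "resolved"}  # assumed set of states


@dataclass
class IncidentState:
    commander: str
    severity: int
    state: str = "active"

    def heading(self) -> str:
        """Channel heading reflecting the current state (format is an assumption)."""
        return f"[SEV{self.severity}] {self.state} | IC: {self.commander}"


def apply_command(incident: IncidentState, text: str) -> str:
    """Apply one command to the incident and return the updated heading."""
    verb, _, arg = text.strip().partition(" ")
    verb, arg = verb.lower(), arg.strip()
    if verb == "ic":
        if not arg:
            raise ValueError("ic needs a name")
        incident.commander = arg
    elif verb == "sev":
        severity = int(arg)  # raises ValueError for non-numeric input
        if not 1 <= severity <= 4:
            raise ValueError("severity must be between 1 and 4")
        incident.severity = severity
    elif verb == "state":
        if arg.lower() not in VALID_STATES:
            raise ValueError(f"unknown state: {arg!r}")
        incident.state = arg.lower()
    else:
        raise ValueError(f"unknown command: {verb!r}")
    return incident.heading()
```

Rejecting unknown states and out-of-range severities up front keeps the heading trustworthy, which matters when many responders are reading it at once.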

One of the other things I love about this tool is that I can get a summary of what has been happening in the channel for my technical summary. So PDE incident summarize. It goes to Slackbot. Slackbot drafts a message for me. I can edit it, fill in some more of the known detail about how many responders, how many users were affected, a description of the impact, the times, and we use Pacific time for all of our incidents. So I like that it automatically creates that because that causes me a bit of a headache trying to translate it from British Summer Time.
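The time zone headache mentioned here is easy to illustrate with Python's standard `zoneinfo` module (Python 3.9 or later, with time zone data available on the system). This is only a sketch of the conversion: the function names and output format are assumptions, and it says nothing about how Incident Bot handles times internally.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")
PACIFIC = ZoneInfo("America/Los_Angeles")


def to_pacific(local_time: datetime, source_zone: ZoneInfo = LONDON) -> datetime:
    """Convert a time to US Pacific time.

    Naive datetimes are assumed to be wall-clock times in `source_zone`.
    """
    if local_time.tzinfo is None:
        local_time = local_time.replace(tzinfo=source_zone)
    return local_time.astimezone(PACIFIC)


def format_for_summary(local_time: datetime) -> str:
    """Render a time the way a summary might show it (format is an assumption)."""
    return to_pacific(local_time).strftime("%Y-%m-%d %H:%M %Z")


# 14:30 British Summer Time on 15 June 2020 is 06:30 Pacific Daylight Time.
print(format_for_summary(datetime(2020, 6, 15, 14, 30)))  # 2020-06-15 06:30 PDT
```

Using named zones rather than fixed offsets means the weeks when the UK and US clocks change on different dates are handled correctly.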

So once that's done, I get an automatic technical summary in the format that we've agreed, and it's automatically posted to the incident channel. So this also reduces the cognitive load on the incident commanders at any given time. The incident commander should be focusing on who's doing what and ensuring that key streams are being addressed during an incident. Actually trying to figure out how to format my technical summary or my conditions, actions, and needs report should be the last thing that worries me. So Incident Bot really, really helps us out there.

Integrations mean we don't have to leave Slack. They make life simpler, they drive particular behaviors, and they help us streamline process. And it does go beyond our dev teams, too. Any team can use these integrations to update their objectives or key results for a quarter, approve an expense claim, or even apply for leave. Some of the most common ones, I've already mentioned Jira. We also have Code Review Minion, who makes sure that our PRs don't hang around too long without any attention. Slackbot reminds us about our standup on a daily basis, and we can also use it to remind us to start a thread about awesome things that we've worked on this week.
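As one example of the kind of check a review reminder could run, the sketch below finds PRs that have gone too long without attention. It is purely illustrative: the `PullRequest` type, the 24-hour limit, and the function name are assumptions and are not based on how Code Review Minion works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PullRequest:
    """Minimal PR record (hypothetical shape, for illustration)."""
    number: int
    title: str
    opened_at: datetime
    last_review_at: Optional[datetime] = None


def last_activity(pr: PullRequest) -> datetime:
    """Most recent review time, falling back to when the PR was opened."""
    return pr.last_review_at or pr.opened_at


def stale_pull_requests(prs: list[PullRequest], now: datetime,
                        max_idle: timedelta = timedelta(hours=24)) -> list[PullRequest]:
    """Return PRs idle for longer than `max_idle`, longest-waiting first."""
    stale = [pr for pr in prs if now - last_activity(pr) > max_idle]
    return sorted(stale, key=last_activity)
```

Passing `now` in as an argument, rather than calling `datetime.now()` inside the function, keeps the check deterministic and easy to test.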

We can also centralize feature requests, creating a clear space for dedicated content to the development teams, connecting them directly with our customers and social media input so that product teams can be reacting and hearing the voice of the customer on a daily basis. It's quite a powerful tool.

So when we think about DevOps by function, we think about teamwork, we think about observability, and we think about how we manage our pipelines. And these are just some of the examples of the integrations that we have. Today, Slack integrates with thousands of these applications. So even if you don't see the tools that you use here, there's a very, very strong likelihood that the tools you use will already have an existing integration with Slack.

Slack builds Slack with Slack, and that is something we're committed to doing. We're committed to evolving how we work this way, and we want to share that experience with the wider community. We believe it creates opportunities every day for us to improve our DevOps workflows. We also think it creates opportunities every day to help us improve Slack and how it works.

We believe we're making a more joyful experience for our developers. So we know that the methodology of transparency, collaboration, and integration is applicable beyond DevOps, and we hope to continue sharing what we learn, not just across the DevOps and engineering community, but with the broader community that uses Slack also.

Thanks so much for your time. Join us over in Slack Does DevOps in our Slack channel for Q&A now. I really appreciated you taking the time to spend with me, and we'll see you there. Thanks.