Building Confidence in Your SRE Team

US 2021

Senior Director, Software Development & Engineering · Comcast

Does it seem like your SRE team is starting to look way too much like the familiar Operations team that you have always known? It’s easy to fall back on the well known patterns of production support. While it is critical to demonstrate strong Operations expertise, it is equally critical that your SRE team adopt a new mindset.

You may think that your first step in building a healthy SRE team is to adopt all the acronyms: SLIs, SLOs, SLAs. It's enticing to immediately implement Error Budgets with consequences. But you first have to build a culture of trust and measurable performance.

We strive to drive our operational burden to zero. We look to automate last year’s work to make room for new challenges. We make the time to eliminate toil in our daily tasks.

In this talk, we will take a close look at how process and automation can be the driving force behind a truly empowered SRE team.

Full transcript

Host Intro (Gene Kim)

We have a phenomenal set of sessions for you this morning.

Up first is Michael Winslow, who has spoken here at DevOps Enterprise six times. He is Senior Director of Engineering at Comcast, currently ranked 26 on the Fortune 500, with thousands of software developers.

Over the years, he has had a variety of roles, including leading Xfinity Mobile and Xfinity Stream. Today, he will be sharing the story of how, after a decade of leading dev teams, he was asked to lead the SRE initiative for a critical service, Xfinity Backoffice. When I say critical, it is because it enabled revenue-generating services such as plan activations, whether direct to consumer or through storefronts.

He will be talking about the SRE principles he tried to apply, the lessons he learned, and the value that he found within the SRE body of knowledge, as well as the capabilities and value that he and the teams created.

I am so delighted for my friend Michael because, after we recorded the session, it was announced that he was being promoted to become a distinguished engineer, which represents less than 1% of the engineering community. So here is Michael.

Michael Winslow

Thank you so much for that introduction, Gene. I cannot believe this is already my sixth talk. Hearing you say that takes me back to 2018, to that first lightning talk that I did in front of the crowd. I am really excited to talk to the audience one more time.

Like Gene said, I am Michael Winslow, currently Senior Director at Comcast. A fun fact about myself is that I still code for management tasks. I find a lot of joy in it, and I would say that anybody who moves over to management from technology should consider doing the same. Today I am going to talk to you about building confidence in your SRE team.

Before I do, I want to stay in touch with everybody. If you go out and Google me right now, you will not find me. You will actually find Michael Winslow, who makes all the sound effects on "Police Academy" and was recently on "America's Got Talent" as well. He is the most famous Michael Winslow. If you want to stay in touch with me, you are going to want to reach out to me directly. A funny story with that is that a friend said, "If you have the same name as somebody else, just use your middle name and start using that as your professional name." The problem is that my middle name is Scott, so anybody who is a fan of "The Office" knows there is a more famous Michael Scott out there as well. So, to be safe: Michael S. Winslow at either Twitter or LinkedIn. Let us stay in touch.

Let us start off. This is where I work. This is where I went into the office every day before the pandemic started. I honestly cannot wait for the numbers to get back down and go back there. This is the Comcast Technology Center, and I want to call out specifically the word technology. Comcast is not always top of mind when people think about technology companies, but we have in fact made the change over the years from a cable provider to a technology company that provides cable and other things.

To illustrate that, let me give an overview of what Comcast looks like as a whole. I work in the area that is Xfinity Mobile, X1, XFI, XHome; in product and services we have FreeWheel, Spotlight, Comcast Business; and in Comcast as a whole we have Spectacor, Spectra, a lot of other operating companies, including Comcast Ventures. In 2018, when we really started to build our family up and bring on Sky and NBCUniversal, we were working with so many operating companies that DevOps became absolutely critical. As we brought on new operating companies, we wanted to remove the artificial lines between the companies as much as possible and share information across them. DevOps has been crucial in all of that.

Right around 2016, I joined Xfinity Mobile. At the time it was a super-secret project, because Xfinity Mobile did not actually launch until April of 2017. We were acting as a startup. Specifically, my part and the team I worked on was the Xfinity Mobile Backoffice. We supplied all of the APIs and orchestration for the direct-to-consumer website, the future stores that would be popping up, and the Xfinity Mobile stores. We did a lot of the logic and guts behind being able to sell and operate the phones.

We adopted this DevOps model, which worked well because, unlike Comcast as a whole, which is a large enterprise, Xfinity Mobile at the time acted as a startup. It was easy to pick up these practices and get teams to buy into them.

After the launch of Xfinity Mobile and after I was on the project for a while, I moved to a different part of Comcast called Software Strategy and Transformation. We were the engineers who helped the other engineers at Comcast. When we moved over to that group, I remember specifically wanting to bring over all of my knowledge of DevOps, and they said something interesting to me. They said, "On this team, we do not do DevOps. Instead, we do SRE."

I had heard about SRE, but I had not dug deep into exactly what SRE was and how it differed from DevOps. Since I was starting on a new team, I wanted to get an idea of how they were doing this, so I asked them, "What is your definition of SRE?" At the time, the team said, "Our software engineers develop the software, and our site reliability engineers operate the software." I thought probably the same thing that a lot of you are thinking right now: how is that not just operations, where developers toss code over to operations and have them operate it in production?

I was a little skeptical, but I still wanted to work with the team to really define what SRE was. I joined a book club with the team where we were going to read "Site Reliability Engineering" and see what practices we could take out of it. I started with the mindset that I wanted to bring a lot of my knowledge of DevOps into this and find a starting point that let me relate to exactly what site reliability engineering is.

As I looked at the hierarchy, I wanted to find a place that really spoke to me. There it was: testing and release procedures. That made me think of the CI/CD that I had done previously, and I thought this would be my starting point to build out from.

My OCD kicked in, and one of the first things I noticed was that the P in release procedures was not capitalized. It was not title case. I did not think much of it at first, until I went to the part where they were supposed to be explaining what this part of the hierarchy was. They had testing in there, but absolutely nothing for release procedures. You can go online to the "Site Reliability Engineering" book right now; this is what it looks like. It really started to feel to me like release procedures in "Site Reliability Engineering" was almost an afterthought. You can see testing is there, but not release procedures.

What I wanted to do with this team, to truly bring my expertise in since I was leading it, was modify the definition of SRE a little bit upfront. I wanted to combine SRE and DevOps so that our definition became: site reliability engineers use DevOps principles to operate and improve software. That was the approach that we were taking at the time.

Let us get into some of the issues that we had in the beginning. When you say you are going to start doing site reliability engineering, anyone who has ever gone to a talk or looked into it knows there are acronyms: SLOs, SLIs, SLAs, objectives, agreements, and error budgets. The problem we were having in this particular organization was that we had very mature software engineering groups. They were not just going to buy into this idea of a group called site reliability engineers who had not proven themselves. You cannot just slap a title on somebody and then everybody automatically gives them the gravitas to have power over how you develop your software, especially when coming from an operations standpoint, like we were.

What we needed to do was find ways to build trust and confidence in the team. The best way I could think of was to dive back into my expertise: DevOps. Right there at the top, for anybody who knows the CAMS acronym, you have culture, automation, measurement, and sharing. I said, let us really get good at automating what we are doing right now in the Software Strategy and Transformation group. Once we have built up that confidence in our teams, let us slowly bring in these other ideas of SLOs, error budgets, SLIs, and SLAs.

We had an SVP at the time of reliability engineering, Dana Wilson, and she was quoted as saying, "We must automate away the hundreds of routine tasks which create the fog that impedes our vision." I was able to hang this up and give the team a little bit of a North Star while we were going through these initial automation improvements.

One of the problems that we had was that the folks who had set up any automation previously worked a lot like Brent. For anybody who knows "The Phoenix Project," Brent was that very powerful, very capable engineer that everybody went to, and he became overloaded. We had our version of a Brent on the team. We had all of these great engineer twos and engineer threes who could be doing some of this work, but they did not have all the knowledge that was locked in Brent's head.

Our first mission was to take some of these procedures Brent had in his head and make sure the team was skilled in them. The first thing I did was sit down and say, "Brent, you need to document how you release, how you do your deploys." That was the first step. One quality you might find in Brents is that they do not always enjoy documentation. He actually delivered only five lines with very little detail for releasing our software.

Thanks to Confluence and being able to go back in history, I am able to show you what our Brent put out as his first offering of documentation on how to release our software. Like I said, five steps: verify the release notes; verify the release file; verify with the release manager that notifications have been sent for release; send communication about the start of the deployment; disable the VIPs.

Clearly, this was not enough instruction for anybody to pick up, but it was a good starting point. At least Brent put something there. Then we put our plan into operation. We had Brent sit down with a junior developer and said, "This junior developer is going to do the next deploy." You could see it on Brent's face: he knew his documentation was not thorough enough. But we said, "Do not worry about that. We are going to improve it over time."

The process we put in place was: first, the engineer executing the runbook of the procedure should ideally be one of the most junior on the team. Brent can help, but only when the engineer gets stuck on the documentation that Brent put together. There would be a third engineer in place, which helps increase the spread of knowledge and frees up Brent's hands to help the junior engineer. Every time the engineer has to interact with Brent, the third engineer documents the differences and the missing pieces of the documentation. We would repeat this process over and over again with every deploy until errors were few. Once we had that document in place, it became the pseudocode to automate it eventually. This was the process I had gone through several times before and was now bringing to my new team.
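The loop described above can be sketched in code. This is a minimal illustration, not the tooling the team used: the `Runbook` class, the "two clean deploys in a row" threshold, and the pairing of each access gap with a particular step are all assumptions made for the example. Only the five starting steps and the two missing prerequisites (Jenkins and GitHub access) come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """A release runbook that grows as gaps are found during deploys."""

    steps: list[str]
    gaps_per_deploy: list[int] = field(default_factory=list)

    def record_deploy(self, gaps: dict[int, str]) -> None:
        """Fold the observer's notes from one deploy back into the runbook.

        `gaps` maps a step index to the detail the junior engineer had to
        ask the expert about while executing that step.
        """
        for index, detail in sorted(gaps.items()):
            self.steps[index] = f"{self.steps[index]} ({detail})"
        self.gaps_per_deploy.append(len(gaps))

    def ready_to_automate(self, max_gaps: int = 0, clean_deploys: int = 2) -> bool:
        """True once each of the last `clean_deploys` deploys needed at most `max_gaps` questions."""
        recent = self.gaps_per_deploy[-clean_deploys:]
        return len(recent) == clean_deploys and all(g <= max_gaps for g in recent)


# The five original steps from the talk.
runbook = Runbook(
    steps=[
        "Verify the release notes",
        "Verify the release file",
        "Verify with the release manager that notifications have been sent",
        "Send communication about the start of the deployment",
        "Disable the VIPs",
    ]
)

# First deploy by the junior engineer. Which step surfaced which gap is
# hypothetical; the talk only says that both kinds of access were missing.
runbook.record_deploy({0: "prerequisite: Jenkins access", 1: "prerequisite: GitHub access"})
runbook.record_deploy({})  # later deploys with no questions for the expert
runbook.record_deploy({})

print(runbook.ready_to_automate())  # True
```

Once `ready_to_automate()` holds, the accumulated step text plays the role of the pseudocode mentioned above: each line is specific enough to translate into a script.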

We can go back to that guide in Confluence and show how the instructions evolved over time using this process. Just on the first iteration, we found out that the junior developer did not have access to Jenkins and did not have access to the right place in GitHub. This was invaluable on its own: having somebody other than the person who had always been doing it do the procedure let us find all the prerequisites.

The notes slowly got larger and larger. Instead of just five lines, we had links, if statements right inside the step-by-step guide, and a lot more detail. Over time, it got to a point where it was so repeatable that we automated the process. We became so confident in the deployment of this particular software that we attached it to an AWS button and put it right next to the water cooler so anybody at any time could walk by, push this button, and confidently release our software to production. I have repeated this process on at least three or four different groups in Comcast now, and it is a good feeling when deployment is so easy, such a small step.

One thing that came out of this was great. Our VP of engineering at the time, Gustavo Paspuel, provided this quote: "What was most amazing to me was how automating our mobile back-office deployments positively affected our relationship with other teams. When the product team was faced with an urgent customer need, we could make the change and deploy it to several environments quickly for review. Delivering our software became a non-event." That is how you build confidence. That is how you build gravitas. If your team can put together something like this, they might just be open to other things.

The second thing, once we automated several of the deployments in Software Strategy and Transformation, was that we wanted to bring over the idea of reducing toil. We had a lot of very repeatable tasks that the team was taking on. Vivek Rau of Google says, "Toil is the kind of work tied to running a production service that tends to be manual, repetitive, and automatable."

To give you an idea of how we tackled too much toil, the red bar represented the amount of time we spent on toil: repeatable tasks, manual, mindless tasks. The green box represented engineering work. We wanted to spend more time engineering and less time on toil. My job as a leader was to provide enough time for the group to work on automating some of these things.

One of the first things we did to free up this time was automate the deployments that we did so often. Once you have automated a task, you want to make sure it is available for many people to use and add it to your automation library. Then, once the time that task used to take becomes time you can spend on engineering work, all of a sudden you realize that you have enough time to start automating some of the other repeatable tasks and other toil.

The next thing we tackled at Software Strategy and Transformation was key rotation. It did not happen often at the time, maybe once a month, but it was enough of a disruptor when it did happen that it took us away from other things. We automated key rotation and added that to our library, which then freed us up for a bunch of other things we could automate: self-healing instead of manually bringing down and bringing up microservices; certificates for servers not yet using containers; patches; release notes; testing; and common support questions in Slack channels. We created a Slack bot that could answer a lot of basic questions.
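As one illustration of what an entry in such an automation library might look like, here is a minimal sketch of the scheduling half of key rotation: deciding which keys are due. The key names, the 30-day maximum age, and the function name are assumptions for the example. The rotation itself is deliberately left out, because it depends on whichever secret store is in use and the talk does not name one.

```python
from datetime import datetime, timedelta, timezone


def keys_due_for_rotation(
    created_at: dict[str, datetime],
    max_age: timedelta,
    now: datetime,
) -> list[str]:
    """Return the names of keys whose age has reached `max_age`, sorted by name."""
    return sorted(
        name for name, created in created_at.items() if now - created >= max_age
    )


# Hypothetical inventory: key names and creation dates are made up.
now = datetime(2021, 10, 1, tzinfo=timezone.utc)
created_at = {
    "orders-api-signing-key": now - timedelta(days=45),
    "store-portal-tls-key": now - timedelta(days=10),
}

print(keys_due_for_rotation(created_at, timedelta(days=30), now))
# ['orders-api-signing-key']
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test and to run on a schedule.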

We started seeing so much toil reduced and so much additional time for engineering that these things that needed to be done became clear to us. We were able to spend time creating more robust dashboards, working with other teams to learn what they were doing right with monitoring and bringing that into our group, and alerting. These are not always easy to automate, but you do need to spend time on them. We were able to say, generally, we spent about half our time on engineering work and half on new toil that came in.
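The arithmetic behind the red and green bars can be made concrete with a small sketch. All hours below are hypothetical, chosen only so that the example ends at the roughly half-and-half split mentioned above; the talk gives no actual figures.

```python
def toil_fraction(toil_hours: float, engineering_hours: float) -> float:
    """Share of total working time spent on toil."""
    total = toil_hours + engineering_hours
    if total <= 0:
        raise ValueError("total hours must be positive")
    return toil_hours / total


def automate(
    toil_hours: float, engineering_hours: float, task_hours: float
) -> tuple[float, float]:
    """Move the hours of one automated task from the toil bucket to engineering."""
    if not 0 <= task_hours <= toil_hours:
        raise ValueError("task_hours must be between 0 and toil_hours")
    return toil_hours - task_hours, engineering_hours + task_hours


# Hypothetical 40-hour week: 30 hours of toil, 10 hours of engineering.
toil, engineering = 30.0, 10.0
print(f"{toil_fraction(toil, engineering):.0%}")  # 75%

# Automating deployments frees a hypothetical 6 hours per week.
toil, engineering = automate(toil, engineering, 6.0)
print(f"{toil_fraction(toil, engineering):.0%}")  # 60%

# Automating key rotation frees a hypothetical further 4 hours.
toil, engineering = automate(toil, engineering, 4.0)
print(f"{toil_fraction(toil, engineering):.0%}")  # 50%
```

Each automated task shrinks the toil bucket and grows the engineering bucket by the same amount, which is why the first automation makes the next one easier to schedule.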

One thing that happens when you get good at this is that the green time of engineering work becomes your team's premium time. Groups that see your team is not busy on toil all the time may try to dump more of their toil work on you. That is where the next step we implemented comes into play. We wanted to make sure we were selective with the work we took on.

Once you free up that time and have that time to work, you have a choice. You can either bring on work that actually helps make your team stronger, or you can take whatever is dumped onto you. I love this quote by Damon Edwards, which illustrates this: "If an SRE team cannot regulate its own workload, it becomes the aggrieved party." It becomes the group that, no matter how bad an offering a software engineering team throws over the wall and gives to you, you have to take it and support it.

In Software Strategy and Transformation, as the leader of this SRE group, I stopped that. I did not just say, "We will take anything that a software engineering group gives to us." Instead, we started creating a maturity model that allowed us to determine what work we wanted to take on. We looked at it like this: what if handing off to our SRE group was not a right, but something that needed to be earned? We started treating our SRE group as a premium service that people would want to offload their work to, but had to do things in order to do so.

A benefit is that it encourages standards. It encourages teams not to be as snowflake, and it allows us to get the benefits of economy of scale. If the team is using Prometheus, that is something we are really strong at, so that is a checkbox. We will take a team on as long as they are putting out good endpoints that we can scrape with Prometheus. If they are using containers, and a good containerization strategy that fits in with ours, it makes it much easier for us to support that. If they do not, we can send one of our SRE engineers to work on that team for a while and find out whether it is possible to get this team to align with what we are willing to take on. If so, we do that. If not, we say, "You will have to either find another team or continue to work in DevOps fashion on your own."
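A checklist like this can be expressed as a small data structure. The sketch below is an assumption about how such a maturity model might be encoded, not the model the team built: the two criteria (Prometheus-scrapable endpoints and a compatible container strategy) come from the talk, while the class, field names, and example services are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProfile:
    """What the SRE team knows about a service that wants to be onboarded."""

    name: str
    exposes_prometheus_metrics: bool
    runs_in_containers: bool


# Each requirement maps a profile attribute to what the owning team must do.
REQUIREMENTS = {
    "exposes_prometheus_metrics": "publish endpoints that Prometheus can scrape",
    "runs_in_containers": "adopt a container strategy that fits the SRE team's",
}


def missing_requirements(service: ServiceProfile) -> list[str]:
    """List the requirements the service does not yet meet."""
    return [
        description
        for attribute, description in REQUIREMENTS.items()
        if not getattr(service, attribute)
    ]


def onboarding_decision(service: ServiceProfile) -> str:
    """Accept a service that meets every requirement; otherwise name the gaps."""
    missing = missing_requirements(service)
    if not missing:
        return "accept"
    return "embed an SRE to close gaps: " + "; ".join(missing)


# Hypothetical services.
print(onboarding_decision(ServiceProfile("activations-api", True, True)))
# accept
print(onboarding_decision(ServiceProfile("legacy-billing", False, True)))
# embed an SRE to close gaps: publish endpoints that Prometheus can scrape
```

Keeping the requirements in one table makes the standard visible to other teams and lets new criteria be added without changing the decision logic.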

To recap: building confidence in your SRE team can happen when you build your team's automation skills. At least that is what we did. We were able to get teams to trust us more by giving them wins with the strength we already had in automation.

We tracked down and eliminated toil. If you become a really popular SRE team, you do not want to spend all your time on toil and overload your engineers on that work. Instead, you want them to be able to find some joy in engineering work to really solve problems that come to them.

Third, we were selective about our workload. It is one of the only ways you can make sure you have that time for engineering work.

That is where the SRE team I lead is now. Our next step could use a lot of conversation with you and possibly some advice. The help I am looking for at this point is: who is actually doing error budgets for real out there? Who has created a good relationship with your software engineering team in order to say, "There are error budgets that we have in place, and if you fall below those standards, there are consequences"?

We have tried it in the past, and quite frankly, priorities would always win out over what we called error budgets, what we called important. I would love to see how you got leadership to really buy in, and how you got individual contributors to be okay with this SRE team being able to set standards of quality to the point where they could say, "Stop creating new features and start working on the errors that you are creating as a group."
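For readers less familiar with the term, the usual request-based calculation of an error budget can be sketched as follows. This shows the general idea only, not a policy the team has in place: the 99.9% SLO, the request counts, and the rule that feature work pauses once the budget is spent are assumptions for the example.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    With an SLO of 0.999, up to 0.1% of requests in the window may fail.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def feature_work_allowed(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical consequence: new features pause once the budget is spent."""
    return error_budget_remaining(slo, total_requests, failed_requests) > 0


# Hypothetical window: one million requests against a 99.9% SLO.
print(round(error_budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
print(feature_work_allowed(0.999, 1_000_000, 250))    # True
print(feature_work_allowed(0.999, 1_000_000, 1_200))  # False
```

The calculation is the easy part. The question raised above is about the second function: getting leadership and individual contributors to honor the result when priorities compete.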

With that, I say thank you so much, and I hope you can build your own confidence in your SRE team. Thank you.