How Google SREs Modify Production Resources Securely & Safely

Las Vegas 2022

Most outages are caused by changes to production resources. Automated production changes are typically fast and secure, but can't address every use case -- especially during an incident. An SRE with production access can fill this gap, but that access introduces reliability and security risk if they make a mistake or their account is compromised. To balance this risk, Google developed a framework that automates the majority of production operations, while providing routes for manual changes when necessary.

Full transcript

The complete talk, organized by section.

Brett Beekley

Morning. My name is Brett. I'm joined by my colleague Michael, and we're here to talk about how Google SREs like us manage production resources securely and safely.

We'll start with a quick introduction of ourselves, our team, and the stakes of the problems that we solve, and then we'll get into the talk.

As I said, Michael and I are site reliability engineers at Google. I think SRE is a really interesting field that attracts people from a variety of different paths. I, for one, didn't really know what SRE was before I started talking to Google, so I wanted to show how we got here.

I started with graduate school, where in particular I learned how to create and test hypotheses, which is important when things are going wrong and you need to figure out why. Then I went into technical consulting, where I learned how to build a holistic view of a problem by gathering requirements, talking to users, and figuring out what's going on. Then I went to a startup where we scaled up our backend 10-100x while I was there, and that all prepared me to be an SRE at Google in privacy and security.

Michael Bird

My first professional experience involved very little interaction with DevOps. I was just on the software engineering side of things as a compiler engineer. From there, I went on to co-found a small consulting firm where I wore all the hats, everything from sysops to principal software engineer.

I was more introduced to the formal processes of DevOps in a payments processing firm, where we led an effort to move away from a large monolith to service-oriented architectures, kind of in the Strangler pattern that we saw earlier. We followed a similar pattern.

Then I ended up at Google, in software engineering in Ads and YouTube, where I learned a lot about feature flags, launch, dark launch, good stuff, and eventually ended up in SRE. I'm very happy to be here.

Brett Beekley

Thank you. I want to add, as a caveat before we really get into things, that this presentation is a reflection of our experiences on our teams. Google is a really big place, so this doesn't necessarily reflect every single process that every single SRE or engineer follows.

All right. So what are we? Privacy and Security SRE is a group of Google SREs and other staff that run corporate and cloud security services. There's a large overlap between security and reliability, since both are about keeping a system usable for expected users and protected against malicious users. Since our services involve security, poor reliability in our services results in unavailability of the services that we protect.

We'll frame this within the scope of the core SRE mission. There are many talks and books describing the aspects and practices of site reliability engineering, but we'll focus on the use of software engineering principles that let what we run grow more than linearly while headcount grows less than linearly.

In short, we look for improvements to operations like we look at software problems. We contextualize, gather requirements, design solutions, prioritize and implement, and review launch success metrics. These solutions are often, but not always, delivered via code and automation.

Michael Bird

If SRE is about writing software to solve operations problems, why don't we just write software and automate every single operations problem, and therefore eliminate manual operations altogether?

The first thing is, we only have a finite amount of time. We can't spend infinite engineering effort to automate every single process in the world. Another is that even if we had a lot of time, we can't always predict what will happen. An incident is an easy example of a time when we might not know what's going to happen in production, and therefore what would need to be changed in production.

But as we'll talk about more in our talk, there are also plenty of times even during day-to-day ops or day-to-day project work where we didn't predict what we had to change in production. So eventually a human will have to make some change to and interact with production.

But opening human access to production introduces risk if their account is compromised or they just make a mistake. So that comes to the core question of this talk, which is: how can we give SREs access to production while minimizing the risk of manual changes?

Brett Beekley

Our general solution has been to establish a prioritized list of methods for changing production. All changes should use the highest possible method in this list, which would provide very little or no access to production to a human SRE. If there is a good reason that we can't do that, then we use the next item on that list, which grants just a little bit more access to a human to make a change to production. If that's not possible, then the next one, all the way down to the bottom, which has the most human access to production.

We'll talk through each of these in detail and how they apply to both responding to incidents and day-to-day project work.
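The prioritized list can be sketched in a few lines of Python. This is only an illustration of the selection rule described above, not Google's tooling; the names `ChangeMethod` and `choose_method` are hypothetical.

```python
from enum import IntEnum


class ChangeMethod(IntEnum):
    """Ways to change production, ordered from least to most human access."""

    AUTOMATION = 1
    TOOL_GUIDED = 2
    PEER_REVIEWED = 3
    BREAK_GLASS = 4


def choose_method(available: set) -> ChangeMethod:
    """Return the highest-priority method that is able to make the change."""
    if not available:
        raise ValueError("no method can make this change")
    # The lowest value is the method that grants a human the least access.
    return min(available)


# Automation does not cover this change, but a tool and peer review both do.
chosen = choose_method({ChangeMethod.TOOL_GUIDED, ChangeMethod.PEER_REVIEWED})
print(chosen.name)  # prints "TOOL_GUIDED"
```

Ordering the methods as integers keeps the rule explicit: a change only moves down the list when every method above it is unavailable.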

Michael Bird

At the top of the priority list is automation. The vast majority of production changes should be done automatically, without a human having to trigger or even monitor it. Importantly, automation requires validation that it's doing its job correctly.

An example of automation in everyday operations is CI/CD release pipelines to improve software delivery performance. The validation of this pipeline should include closed-box testing and open-box monitoring. An example of automating incident response is load balancing and DDoS mitigation, which implicitly requires automated load monitoring.

Let's talk about a case study. In our team, we have to work through the manual replacement of hardware as it ages out or needs to be repaired. When we're expanding our production footprint, new machines are given new IP addresses within an existing pool. Getting production traffic to and from them requires updating several services in our software-defined network stack.

While hardware acquisition is necessarily a manual process, manually updating the network stack would be far too error-prone and toilsome. So our solution is to build an ingestion pipeline that discovers the new production machines and updates the software-defined network stack. While this delays the machines' ability to host jobs by several days, the tradeoff is well worth the automation and consistency.
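One way to picture the core step of such a pipeline is a reconciliation between the machines found in inventory and the addresses the network stack already knows about. The sketch below is an assumption about the general shape only; the function name, the data model, and the example addresses are hypothetical, and the talk does not describe how the real pipeline works internally.

```python
def plan_network_updates(inventory: set, configured: set) -> dict:
    """Compare discovered machine addresses with those already configured."""
    return {
        # Machines that exist but cannot receive traffic yet.
        "add": sorted(inventory - configured),
        # Addresses still configured for machines that have aged out.
        "remove": sorted(configured - inventory),
    }


# Hypothetical addresses, for illustration only.
inventory = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
configured = {"10.0.0.1", "10.0.0.4"}

plan = plan_network_updates(inventory, configured)
print(plan["add"])     # prints ['10.0.0.2', '10.0.0.3']
print(plan["remove"])  # prints ['10.0.0.4']
```

Computing a plan from two sets makes the update repeatable: running it again after the changes are applied yields an empty plan.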

Brett Beekley

The next level down: as we mentioned back at the start, we can't automate everything. For these cases, we can try the next best thing in that priority list, which is manually performing a process through a mostly automated tool that keeps us on the rails. This is toilsome.

In the case where we haven't built automation for the production changes yet, we create a feature request to implement it. If automation should have caught this but did not, this is a bug. Since we have finite time, we triage these requests, weighing the security and operational benefits against the implementation costs. Sometimes a good general solution to these problems requires taking a step back and seeing patterns in the ops load.

For example, say automation got something wrong and there's a user-reported bug. There should be a really easy way to click a button and get an automated rollback started.

Michael Bird

For our case study, this highlights how we might decide not to automate something and instead use tool-guided manual processes.

We regularly and automatically test our critical user journeys, the set of steps that a user might take to perform critical actions in our services. These tests signal that a code change doesn't break anything and are used to gate our release pipeline. However, one of those user journeys requires pressing a physical button, sort of like a YubiKey, so it can't be automated like other journeys.

We decided to build an on-rails testing experience to make it easier for a tester to trigger the test, press the button, and then approve the release. A fully automated solution might involve building a virtual interface that can be programmatically touched. In the meantime, the only toil is to run a couple tests. It takes a couple minutes. You verify and you push the button that allows the release to proceed.
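A release gate that combines automated results with one human-run test could look roughly like the following. This is a minimal sketch under assumptions: `ReleaseGate`, the journey names, and the tester address are all hypothetical and are not taken from the team's actual tool.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseGate:
    """Holds a release until automated and human-run journey tests have passed."""

    # Hypothetical name for the journey that needs a physical button press.
    required_manual: tuple = ("physical_button_journey",)
    automated_results: dict = field(default_factory=dict)    # journey -> passed?
    manual_attestations: dict = field(default_factory=dict)  # journey -> tester

    def record_automated(self, journey: str, passed: bool) -> None:
        self.automated_results[journey] = passed

    def attest_manual(self, journey: str, tester: str) -> None:
        """Record that a tester ran the guided test and verified the result."""
        if journey not in self.required_manual:
            raise ValueError(f"unknown manual journey: {journey}")
        self.manual_attestations[journey] = tester

    def may_release(self) -> bool:
        automated_ok = bool(self.automated_results) and all(
            self.automated_results.values()
        )
        manual_ok = all(j in self.manual_attestations for j in self.required_manual)
        return automated_ok and manual_ok


gate = ReleaseGate()
gate.record_automated("sign_in", True)
print(gate.may_release())  # prints False: the button-press journey has not run
gate.attest_manual("physical_button_journey", "tester@example.com")
print(gate.may_release())  # prints True
```

The human only supplies the one step that cannot be scripted; everything else about the release stays on the automated path.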

As with most things, there's a back and forth between the security implications of automating something like this that grants access versus requiring that a human being goes through the process.

Brett Beekley

If the manually controlled tools aren't capable of making the change, then we have to fall to the next item in the priority list, which is where a human can make a change themselves, not going through a tool, but that requires some peer review steps as a security gate.

At Google, we developed a system to facilitate this process. First, an SRE makes a request to a peer, usually someone on their team, that contains the change they want to make or the command that they've planned to execute, as well as a justification for why they need to do this. Then the reviewer, their peer, will look at it, make sure that it makes sense to execute this thing, and give it the thumbs up or the thumbs down.

If it's a thumbs down, the SRE does not get access to be able to make that change. If they get the thumbs up, then they have one-time use in order to execute that specific command or make that specific change in production.
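The request, review, and one-time use steps can be sketched as follows. The class and function names are hypothetical, and the rule that requesters cannot approve their own requests is an assumption drawn from the word "peer"; this is not the system Google built.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessRequest:
    requester: str
    command: str
    justification: str
    approved_by: Optional[str] = None
    used: bool = False


def review(request: AccessRequest, reviewer: str, approve: bool) -> None:
    """Record a peer's thumbs up or thumbs down on the request."""
    if reviewer == request.requester:
        # Assumption: a request must be reviewed by someone else.
        raise PermissionError("requesters cannot review their own requests")
    request.approved_by = reviewer if approve else None


def execute(request: AccessRequest, command: str) -> str:
    """Run the approved command exactly once (simulated here as a string)."""
    if request.approved_by is None:
        raise PermissionError("request has not been approved")
    if request.used:
        raise PermissionError("approval was already used")
    if command != request.command:
        raise PermissionError("command differs from the one that was reviewed")
    request.used = True
    return f"ran: {command}"


request = AccessRequest(
    requester="sre-a",
    command="reroute-traffic --away-from region-x",  # hypothetical command
    justification="automation did not drain the failing region",
)
review(request, reviewer="sre-b", approve=True)
print(execute(request, request.command))
# prints "ran: reroute-traffic --away-from region-x"
```

A second call to `execute` with the same request raises `PermissionError`, which is the one-time-use property described above.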

This happens very rarely and should only really happen if the tools or the automation that Michael described are unavailable or don't yet support the change that we're trying to make. Because that's rare, this is something that's generally only used for incident response.

For example, if I was an SRE on an incident and there's a problematic region that I needed to route traffic away from that automation didn't catch, I need to do something in production. So I will send a request to run that traffic rerouting command, and then a peer would give it a thumbs up and I would be able to do it. It takes a minute.

Conversely, if I'm working on a feature for some long-term proactive project where I would have to push some config change in a way that we haven't done yet, and therefore would require a command or some production change that doesn't have a tool yet, then I should focus on building the tool to automate that before I actually make that production change.

This case study is actually one of those times when we decided the cost of building the tool is not yet worth the benefit of automation. Like many teams at Google, our team has a set of credentials that we rotate periodically. Like many teams at Google, we automate the rotation of those credentials so that a human doesn't have to do anything.

But our team sits at a unique intersection of various domains where these credentials might span different teams or different sets of infrastructure that we don't necessarily own. Because of this unique case, there's not an off-the-shelf solution for how to rotate these few sets of credentials.

So what we decided to do instead was to keep a manual, human-led process where a human would run specific commands to rotate these credentials, but we gated it behind a second-factor process, the security tool I just described. While automation would have saved us on the order of maybe a couple of hours per year, that's just not worth the engineering effort to build the automation when we have other more proactive work to do. That said, we still file the bug and have it in the backlog, just in case we get rid of all the other proactive work and we get to that one.
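The triage reasoning here is simple break-even arithmetic. In the sketch below, the two hours per year comes from the talk ("a couple of hours per year"), while the 40-hour build cost is a purely hypothetical figure chosen to show the calculation.

```python
def years_to_break_even(build_hours: float, toil_hours_saved_per_year: float) -> float:
    """Years of saved toil needed to repay the cost of building automation."""
    if toil_hours_saved_per_year <= 0:
        raise ValueError("automation must save some toil to ever break even")
    return build_hours / toil_hours_saved_per_year


# 40 build hours is an assumed, illustrative number; it is not from the talk.
print(years_to_break_even(build_hours=40, toil_hours_saved_per_year=2))
# prints 20.0
```

Operator time is only one input. As described earlier in the talk, the team also weighs security and operational benefits when triaging these requests.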

Q&A

Question: How do we audit this review process?

Brett Beekley: Great question. In case anyone didn't hear: how do we audit this review process? The best way to describe this is that we home-built a system for this, and as part of the process, one can look up every single approval and therefore see who gave the thumbs up and who gave the thumbs down.

What's nice is that each team sets within the team's domain what makes sense for those audits and what needs to be in the request. We have some specifics for what we're trying to accomplish and why we can't do it through automation at this moment. There's also a team that goes across product areas that will look into those through the auditing log and say, hey, we think you can improve here, and they'll work with us for that.

Michael Bird: That team actually comes up in the next slide.
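As an illustration of the kind of lookup described in this answer, the sketch below keeps one record per review decision and queries it. The record fields and function names are hypothetical; the talk does not describe the schema of the real audit log.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    requester: str
    reviewer: str
    command: str
    justification: str
    approved: bool


def decisions_by_reviewer(log: list, reviewer: str) -> list:
    """Everything a given reviewer approved or denied."""
    return [d for d in log if d.reviewer == reviewer]


def approval_counts(log: list) -> Counter:
    """How many requests received a thumbs up versus a thumbs down."""
    return Counter("approved" if d.approved else "denied" for d in log)


# Hypothetical entries, for illustration only.
log = [
    Decision("sre-a", "sre-b", "reroute traffic", "incident response", True),
    Decision("sre-c", "sre-b", "push config", "no tool exists yet", False),
    Decision("sre-b", "sre-a", "rotate credential", "scheduled rotation", True),
]

print(len(decisions_by_reviewer(log, "sre-b")))  # prints 2
print(approval_counts(log)["denied"])            # prints 1
```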

Brett Beekley

For the fourth option: if all of the above fail and we can't do any of those other three options, then we still have an option to directly modify production through an emergency process, or break-glass process.

This is incredibly rare and should really only happen if all of the following occur. First, an incident requires an immediate production change. Second, tooling doesn't support this change or the tooling itself is unavailable. Third, the second-factor approval process that we just described in the third pattern is unavailable itself and can't be used. Fourth, the person trying to make the change has to be someone who should have access to the thing they're trying to change in the first place.
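The four conditions combine as a single conjunction, which can be sketched like this. The function and parameter names are hypothetical, and a real system would derive these values from health checks and access policy rather than take them as arguments.

```python
def may_break_glass(
    *,
    incident_needs_immediate_change: bool,
    tooling_can_make_change: bool,
    peer_approval_available: bool,
    requester_is_authorized: bool,
) -> bool:
    """Emergency access is justified only when all four conditions hold."""
    return (
        incident_needs_immediate_change
        and not tooling_can_make_change   # unsupported change or tool is down
        and not peer_approval_available   # the review system is down too
        and requester_is_authorized
    )


# Correlated outage: the tools and the approval system are both unavailable.
print(may_break_glass(
    incident_needs_immediate_change=True,
    tooling_can_make_change=False,
    peer_approval_available=False,
    requester_is_authorized=True,
))  # prints True

# The approval system still works, so the peer-reviewed path must be used.
print(may_break_glass(
    incident_needs_immediate_change=True,
    tooling_can_make_change=False,
    peer_approval_available=True,
    requester_is_authorized=True,
))  # prints False
```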

While that combination is very rare, there's still a non-zero chance of it happening, especially if there's a correlated outage that takes down several of those services at once. So we keep this break-glass process as defense in depth and to make sure that we don't lock the keys in the car.

Because this is rare, each use of emergency access is logged, monitored, and audited. Every time we use it, someone comes in from this team that Michael just described and knocks on our door and says, hey, I noticed that you used the emergency process. Do you have a good reason for doing so?

The closest thing to a case study we have is that we practice this process very regularly, because whenever there's a rarely used process it's easy to get bad at it. When it comes time to need it, we want to be well practiced in it. So we'll use it periodically just to make sure that it is a well-lit path in case we ever need to take it to make a production change.

All of this stuff that we've described has been a long time in the works. Our team started enforcing all of this as a policy at the beginning of the year. As a result, we dropped our manual production access by half compared to last year. By manual production access, I mean that third step, where someone had to send a request to a peer to run a specific command in production.

We have zero people on our team with ambient unilateral access to production. This is very similar to what was mentioned about Fidelity, where there are zero people with persistent access to production. We have the same policy here. Instead, people have to go through one of those processes in order to make a change to production, at the very least break glass and go through that audited emergency process.

This is a process of continuous improvement. What we're currently working on, and we would love help to work on, is to do these two things.

First, we want to improve our tools. The easiest way for a production change to move up in that prioritized list toward automation is to improve the tools, so it's easier to automate something, meaning we have more features that cover more options for manual changes.

Second, we want to improve the reliability of our tools, because the easiest way to fall down that priority list is for a tool or automation to become unavailable, or have a bug, or for some other reason force a human to bypass the tool and make a manual change themselves.

If you have any stories, ideas, or thoughts on this, come find us at the conference or message us on Slack. We'd love to chat about it.

In summary, we modify production broadly in four ways. First, most changes use automation. Then very few are human-led through some tool-guided, on-rails manual process. Then even fewer are individual, peer-reviewed changes by a human SRE. Finally, very, very, very rarely, and only in emergencies, we can bypass all the above, break glass, and directly access production.

I think a key takeaway that we had is that site reliability engineering is fundamentally a software engineering practice. We use a software engineering approach to decide how to prioritize our work that iteratively approaches, even if it may never fully reach, zero human access to production.

Thank you.