Getting Started with Site Reliability Engineering
Jian Ma, an SRE from Google, will talk about SRE key concepts and practices, and provide some insights into how to build an SRE team.
Jian Ma has been working for Google as a Site Reliability Engineer for more than ten years. He was the first Google Ads SRE and witnessed how the SRE concept started and progressed at Google. After that he worked as an SRE for Android systems before moving to his current position in CRE (Customer Reliability Engineering). As a CRE, he works directly with customers on Google Cloud systems. He is also one of the authors of "The Site Reliability Workbook: Practical Ways to Implement SRE".
Full transcript
The complete talk, organized by section.
Jian Ma
Good afternoon, everyone. My name is Jian Ma. I am an SRE coming from Google. This afternoon, I'm going to talk about some of the principles and practices used by Google SRE, and hopefully provide some insight about how to build an SRE team.
This is roughly what I'm going to talk about: introduction, service level objectives, error budget policy, making tomorrow better than today, shared responsibility model, and summary.
First, introduction. What is SRE? SRE is short for Site Reliability Engineering. This is a concept, actually a way of analyzing ops work, that originated at Google in 2003. You can find different kinds of definitions. One way we have defined it is that it's a framework for operating large-scale systems reliably. The second one we typically paraphrase from Ben Treynor, the VP of engineering inside Google, who many consider to be the founding father of this concept: SRE is what happens when you ask a software engineer to design an operations function. Let me emphasize the two key things here: software engineer and operations function.
Inside Google, SRE owns running the system in production at every operational level, or put another way, we own the production system. Google SRE published two books on this topic. The one on the left is the Site Reliability Engineering book we published in 2017, which was a pretty big hit. The one on the right, we just published this year in July. I think this slide is a little bit out of date. This is the companion book of the first one. On top of the concept, it also provides examples of how to get things done and how to organize.
Who am I? My name is Jian Ma, as I said. I have spent 14 years working as a Google SRE. I started as a Google Ads SRE, continued on to Android, and now I'm on a team called CRE. Actually, I was the first Google Ads SRE, many, many years ago. At that time, Google was much smaller than it is now. As a benefit of that, I witnessed how the SRE concept started and progressed at Google: the ups and downs, the corrections we made, this kind of thing.
CRE, the team I'm working on now, is the reason why I'm here to talk about this. It is short for Customer Reliability Engineering. This is Google's way of basically trying to push the SRE concept into the industry to help Google Cloud customers operate large-scale systems in a reliable way. Also, I'm one of the authors of the Site Reliability Workbook, the one we published this year.
Today, I want to talk about three principles of how Google SRE operates. Number one, SRE needs service level objectives with consequences. Number two, SREs have time to make tomorrow better than today. Number three, SRE teams have the ability to regulate our workload. What does that mean?
First, let me start with the service level objective, which is not a new term for many, but it's a key concept for us. Let me describe why it's so important for Google SRE. Service level objective, or SLO for short, is a goal to measure how the system behaves. On top of that, it's specifically trying to measure the customer experience, or in plain English, it basically means that if customers are happy, roughly speaking, the SLO goal has been met.
Typically, you can define it in many different ways. Let me give some quick examples. The first two are about availability. The first is uptime: 99.9% a month, or three nines. If you do the math, it basically means that in a month you could have 43 minutes of downtime. The math is quite simple. The second is the 200 OK ratio. You can say four nines in a month, which gives you a little over four minutes' worth of non-200 responses. The third one is latency: 50th percentile under 300. The fourth one is an example about log processing: 99% of the log requests processed within five minutes. This is typically what you see with pipeline-style systems, where you have transactions, you have a backlog, and you are processing it.
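As a rough illustration of the availability arithmetic above, the following sketch converts an availability target into allowed downtime. It assumes a 30-day month, and the function name `allowed_downtime_minutes` is made up for this example rather than taken from the talk.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes; assumes a 30-day month


def allowed_downtime_minutes(slo_percent: float,
                             window_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the downtime, in minutes, that an availability SLO permits."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # Three nines and four nines, the two targets mentioned in the talk.
    for target in (99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes")
```

A different window, such as a quarter, changes only the `window_minutes` argument.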
Okay, so we just went through the examples. What's the difference between an SLO and an SLA? SLA stands for service level agreement. In our experience, an SLA is typically defined as part of a contract between two companies. There are financial consequences, penalties, and other things if it is not met. Perhaps because of that, what quite often happens is that the customer's experience cannot be sufficiently expressed by the SLA. That's where the SLO kicks in. The SLO concentrates on user experience.
What next? Now we have a number, we have a monitoring line. Then what's the consequence? Because without consequence, as we just mentioned, there is no financial penalty, so what's the consequence? Here is a second concept I want to talk about today. Inside Google SRE, we use this word: error budget policy.
This is what we found out. If you ask anyone how reliable they want their system to be, the answer quite often is: the more, the better. But everybody knows, especially the crowd here, that 100% reliability is expensive. You have to sacrifice development velocity and engineering time. So what do we do? We introduce a budget. An error budget.
The error budget is basically the gap between perfect reliability, 100%, and the target we defined earlier: three nines or four nines, let's say. The budget is there to be spent. In plain English, what it means is this. Say we just had an incident with 20 minutes of downtime, on a system targeted at three nines reliability. We can say, without feeling guilty, that this month we still have 23 minutes left in our error budget, because three nines gives us 43 minutes to spend for the month.
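The incident example above can be written down as a small calculation. This is only a sketch: the `ErrorBudget` class and its field names are hypothetical, and the window is assumed to be a 30-day month.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: tracks a time-based error budget for one window."""
    slo_percent: float                   # e.g. 99.9 for three nines
    window_minutes: int = 30 * 24 * 60   # assumes a 30-day month

    @property
    def total_minutes(self) -> float:
        # The gap between 100% and the SLO target, expressed in minutes.
        return self.window_minutes * (100 - self.slo_percent) / 100

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # A negative result means the budget has been overspent.
        return self.total_minutes - downtime_minutes


budget = ErrorBudget(slo_percent=99.9)
print(round(budget.remaining_minutes(20), 1))  # 23.2
```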
So what's the policy? Here I give examples of how we define it. Different teams on different services inside Google define this differently. These are just examples, to give you a taste. Our goal is to have a visible improvement in reliability. That's the goal. As an example, we as SRE can stand up and tell all the counterparties inside the company, feature developers, infrastructure, and many others, "No new feature launches allowed." Essentially, that means we have the power to call a feature freeze. We can also say that during this feature-freeze period, a team, either a feature development team or an SRE team, takes its work items only from the postmortem action items. Freeze. And we could also say that we want to have a daily meeting with them to discuss what improvements we can make.
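One way to picture such a policy is as a rule that maps budget consumption to a consequence. The sketch below is an assumption for illustration only: the talk does not say exactly when a freeze is triggered, so the rule here (freeze once the budget is fully spent) and the names `Action` and `policy_action` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NORMAL_LAUNCHES = "launches proceed as usual"
    FEATURE_FREEZE = "no new feature launches; work only on postmortem action items"


def policy_action(budget_minutes: float, downtime_minutes: float) -> Action:
    """Pick a consequence based on whether the window's error budget is spent.

    Assumed rule: freeze as soon as downtime reaches the budget.
    """
    if downtime_minutes >= budget_minutes:
        return Action.FEATURE_FREEZE
    return Action.NORMAL_LAUNCHES


if __name__ == "__main__":
    # Three nines over a 30-day month gives roughly 43.2 minutes of budget.
    print(policy_action(43.2, 20).value)
    print(policy_action(43.2, 50).value)
```

Real policies would likely have more steps, such as the daily meeting mentioned above, but the shape is the same: measure, account, and act.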
Let me summarize principle number one. We, SRE, demand and define the SLO, with consequences, as the first thing. It also means that any organization, even without hiring a single SRE, can have the same error budget policy. This is just an idea. You can implement it today by starting to measure, account, and act.
Okay. Now let's dive a little deeper into why we insist on this SLO. We want to make tomorrow better than today. The SLO, the SLA, and the error budget are only the first step. The next step is staffing the SRE role. The SRE role should have real responsibility. It's not just advisory. Real responsibility. That's what we found when we built up new SRE teams inside Google. We define and refine the service level objective as the number one task. This person, or these several people, are in a position to evaluate and sound the alarm that the SLO is not met, that customers are experiencing pain, and that we want action to be taken. I just gave some examples of such actions: a feature freeze, and other things.
Here is what we consider toil. I think this might be a little bit surprising for many. Toil is a negative word in our world. It covers things like you are going on-call, you are doing your firefighting, you are doing the incident management, this exciting part of the thing, to the not-so-exciting part of the thing, capacity planning. For example, as a part of a release, you are checking here and there, dashboards, and looking for the success or failure of the canary before it goes everywhere. All of this, inside the Google SRE circle, is considered to be toil. And toil is negative.
As such, it's a general practice inside Google SRE that we want this part of the work to be no more than 50% of our time. Among the many things we have tried to do, some successfully and some not, this is one we have done quite successfully. We review it. If a team finds that it spends more than 50% of its time on the toil I just described, that's not good.
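The 50% review can be sketched as a simple check over a team's logged hours. Everything concrete here is an assumption for illustration: the category names follow the toil examples in the talk, while the hours and the function name `toil_fraction` are invented.

```python
TOIL_CAP = 0.5  # the 50% ceiling mentioned in the talk

# Categories the talk describes as toil; the labels themselves are illustrative.
TOIL_CATEGORIES = {"on-call", "incident management", "capacity planning", "release checks"}


def toil_fraction(hours_by_category: dict[str, float]) -> float:
    """Return the share of logged hours that fall into toil categories."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in TOIL_CATEGORIES)
    return toil / total


if __name__ == "__main__":
    # Made-up hours for one engineer-week.
    week = {
        "on-call": 12,
        "incident management": 6,
        "capacity planning": 4,
        "release checks": 3,
        "project work": 15,
    }
    fraction = toil_fraction(week)
    status = "over the cap" if fraction > TOIL_CAP else "within the cap"
    print(f"toil: {fraction:.1%} ({status})")
```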
So what is not toil? What is not toil is project work. What do we consider project work? The list is quite long. Let me give you some examples. First thing we do a lot is consulting with system architecture and design. Basically, it means that there is a team, a PM, a developer team, wanting to start a new service. They have a design doc, they have review. SRE gets quite actively involved even at this stage. We go there, we tell them that from our experience, in order to operate or design a high-reliability system, what route is better and what is not. Essentially, everybody knows that during design phase, everything is a trade-off. So we provide the SRE perspective of the trade-off. This is one part of that.
We also author and iterate on our monitoring. Actually, we do a lot of coding in this area. My personal engineering efforts are in this area. In one of the projects I finished last year, four SREs wrote a system that performs anomaly detection on 1.1 billion time series in real time. That's the kind of thing we do. We also do automation: we automate all of the repetitive work.
The last one is also quite important. What we find is that writing a postmortem and producing a whole bunch of action items is not difficult. But that's only the first step, because this long list of action items could belong to either the feature developer team or the SRE team, or quite often has no clear owner. Left like that, it just doesn't work. SRE quite typically takes on the role of coordinating the implementation, essentially becoming some kind of PM, but with a passion. Because we got paged, we did the firefighting, we got excited, now we want to see the thing fixed. This is the full circle.
Let me summarize what we talked about for principle number two. SREs have time to make tomorrow better than today by making this very clear: we are not there to take the operation load. We are there, taking the pride, to make tomorrow better than today.
This is the third topic I want to talk about today: the shared responsibility model. Here is the way we do it. This is one thing that, from what I have heard about how many other companies do things, I think is quite unique. Google is a very big company and has a lot of services. If you count the number of projects, or the developers working on certain projects or services, the majority of them have no SRE support. Put another way, SRE only supports a minority of the services Google provides. This is counting the number of projects, which in a way represents how many developers there are. However, of course, if you look at it another way, by QPS, users, or revenue, the majority is supported by SREs.
What does it mean? It means that we do not, by default, take on a production system. By default, we do not. They have to work together with SRE to pass a certain bar in order to get SRE support. The bars include, of course, the obvious one: you have enough users, your service means something. But also it means that your system has to be reliable enough. You follow the SRE practices. You listen to us. We all work together to get it reliable.
Also, almost as important, the management team, the executive team of that service, buy in. This word buy-in is not new. I've heard of many teams inside different companies doing this executive buy-in. In our world, the buy-in is real. If you don't do anything, by default, the feature development team, your executive, your whole team, is in charge of a service. We can help, we can do the architecture design and consulting, all of these things, but we don't take it.
In order for us to take it, here is Google's way. SRE has a totally separate management structure. However, our headcount is funded by the service owners, which means that there are real financial decisions for the top executives to make in order to get SRE support for the service. This way, they have to think about it. After that, as we said, they have to pass certain bars, reliability and other bars, before we take it over. But it also means that we can control our workload. If we are overloaded, there's no way for us to write code to improve things, to do the project work, to make tomorrow better than today. This way, we can control it. We can regulate our workload.
There are some examples of different teams doing things in different ways. For example, we can give 5% of the operations work back to the developer teams. This is for a mature system, like Ads; I'm not talking about the transition period. We give them a little taste of on-call shifts, load management, and ops tasks. What we found is that this is quite useful for getting them to understand our principles and our operating model, and for motivating them to respond to what we ask, rather than keeping the two teams separate.
SRE project work is, as I described earlier, real project work. We are software engineers, most of us. Our projects have a design doc, have everything. Actually, inside Google, it's basically typical practice that there is no boundary to what SRE really is. Let's say you are very passionate about Java garbage collection tuning, which many consider to be an art. Nowadays, inside Google, if you are looking for the best Java garbage collection tuning experts, they are typically SREs. Not everyone wants to do this kind of thing, but it gives you a flavor of what SREs do. There is no boundary. We are real software engineers. We do a whole bunch of things.
We only onboard them if they can be operated safely. Let me explain the last part a little bit. If every problem with the system has to be escalated to its developer, give the pager to the developer instead. What it means here is this. Quite often, what we find is that during the transition, early stage of the service going from the developer into SRE, the service is so immature that not only we don't quite understand what's going on, even the developer team, different parts of the subteam, don't know what the other subteam is doing. So we have to go back to ask whoever wrote this specific feature, what's wrong? If we see this thing happen several times, Google SRE's general practice is that we'll just send the alert back to whoever is writing it. We will say that this is not the best way to utilize our time or your time. You have to make it more reliable and make it more uniform. We will help you, but this is your responsibility. Before you do that, I'm sorry, but this is your pager.
Leadership buy-in, as I mentioned earlier. For the last part, I'll give one example of how the whole thing ties together. When we run out of error budget, we tell the leadership of the feature team for the service that they have to put their developers on reliability work, or, because everything is a budget, everything is math, they can loosen the SLO. Let's go from four nines to three and a half nines, or three nines, or two and a half nines. Typically, at this stage, the service owner team, the business owner of the service, gets quite nervous, because for them this is a real number going down. They say, "We were a four nines system, now we are three nines." But we make it clear this is one option. This way, we help them understand that it's much better to consider reliability early in the whole life cycle of the service, rather than finishing the design and throwing it over to an operations team, which in this case is SRE.
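To make the "loosen the SLO" option concrete, the sketch below checks which targets a given month of downtime would still satisfy. It rests on assumptions not stated in the talk: a 30-day month, and the common convention that "three and a half nines" means 99.95% and "two and a half nines" means 99.5%. The names `TARGETS` and `targets_met` are hypothetical.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumes a 30-day month

# Assumed mapping from the spoken names to percentages.
TARGETS = {
    "four nines": 99.99,
    "three and a half nines": 99.95,
    "three nines": 99.9,
    "two and a half nines": 99.5,
}


def budget_minutes(slo_percent: float) -> float:
    """Downtime allowed by an availability target over the window."""
    return WINDOW_MINUTES * (100 - slo_percent) / 100


def targets_met(downtime_minutes: float) -> list[str]:
    """List the targets whose budget covers the observed downtime."""
    return [name for name, percent in TARGETS.items()
            if downtime_minutes <= budget_minutes(percent)]


if __name__ == "__main__":
    # With 15 minutes of downtime, four nines is blown but looser targets hold.
    print(targets_met(15))
```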
Automation is also what we do a lot. We eliminate toil, capacity planning, and fix issues automatically. The last item, to fix issues automatically: the internal saying is this. If you can write the fix in a playbook, in a process, in documentation, you can make the computer do it. Essentially, it means writing code to fix the system automatically.
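The idea of turning a playbook entry into code can be sketched as a lookup from alert to automated fix, with a human as the fallback. This is not a description of Google's tooling: the registry, the alert name `disk_almost_full`, and the functions below are all invented for illustration.

```python
from typing import Callable

# Hypothetical registry: alert name -> function that applies the playbook fix.
REMEDIATIONS: dict[str, Callable[[dict], str]] = {}


def remediation(alert_name: str):
    """Decorator that registers a function as the automated fix for an alert."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("disk_almost_full")
def rotate_logs(alert: dict) -> str:
    # A real fix would rotate or delete logs; this sketch only reports the step.
    return f"rotated logs on {alert['host']}"


def handle(alert: dict) -> str:
    """Run the automated fix if one exists, otherwise fall back to paging."""
    fix = REMEDIATIONS.get(alert["name"])
    if fix is None:
        return "page a human"
    return fix(alert)


if __name__ == "__main__":
    print(handle({"name": "disk_almost_full", "host": "web-1"}))
    print(handle({"name": "unknown_failure", "host": "web-2"}))
```

Each playbook entry that becomes a registered function is one fewer page that needs a person.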
Let me be honest: even in the programmer circle, from time to time, we push back and say, "It's so complex if I write code to do this. It's much easier if I write documentation, and next time you read it." I'm talking about among ourselves, not outsiders. But still, because the majority of Google SREs have a background as software engineers, we quite often hear people just say, "Okay, let me show you how to do it." That's the end of the discussion.
Let me summarize the third principle. SRE teams have the ability to regulate our workload so that we can spend time working on projects to make tomorrow better. Throughout all of this, you have to remember that our goal is to make the customer happy. How happy the customer is is a number, it is math, and the math is calculated from the SLO.
That's the summary for the third one, and that's the summary of everything I have talked about in these 20 minutes. SRE needs an SLO with consequences. SREs have time to make tomorrow better than today. SRE teams have the ability to regulate our workload in different ways, including pushback. Thank you.