Getting Started with SRE

London 2018

Senior Site Reliability Engineer · Google

Stephen Thorne is a Senior Site Reliability Engineer at Google. He learned how to be an SRE on the team running Google's advertiser and publisher user interfaces.

Stephen went on to work on the challenge of Google App Engine as it grew rapidly and broke scaling barriers.

Before Google, Stephen fought against spam and viruses in his home country of Australia, where he also gained his Bachelor of Science in Computer Science. Now he works to integrate Google Cloud customers' operations with Google SRE in his role in Customer Reliability Engineering.

Full transcript

The complete talk, organized by section.

Stephen Thorne

My name is Stephen Thorne, and this talk is Getting Started with Site Reliability Engineering, and I'm here from Google.

Typical. Only just works. So this is my agenda for today. I'm going to run through a quick introduction. We're going to talk about service level objectives, error budget policy, making tomorrow better than today. In essence, I'm going to be taking you through a very quick tour of the principles behind site reliability engineering.

But first of all, I really need to give you a little bit of an introduction, help you understand where I'm coming from, where this talk is coming from. So first of all, Google has found that site reliability engineering has been a very successful model for operating applications in production without being overwhelmed. Our VP, Ben Treynor, has quipped that SRE is what happens when you ask a software engineer to define an operations function.

So what I'm going to try and do today is take you through some of the philosophy of how SRE works. And we believe that it's a versatile model which allows you to operate mission-critical systems, no matter if they're small, medium, large, huge, or very fast-growing.

What this talk is not about is to try to convince you that SRE is the right thing for your organization. What I'm actually going to try and do is show you how it works, the philosophy, the principles. Whether or not you think it's the right thing for you, that's up to you to decide.

We have published, in 2016, one book on the subject of SRE, Site Reliability Engineering. This book covers a lot about how Google invented SRE, where it comes from, and how Google runs systems in production. The byline, "How Google Runs Production Systems," is right there on the book.

What we've learned since publishing that is that the way Google runs things isn't very relevant to a lot of people, but we want to talk to you about how you might implement SRE and how you would be able to implement some of the same things we've done. And so that's why we're publishing a new book, The Site Reliability Workbook, and that will be coming out in 28 days, the 24th of next month.

But who am I? My name is Stephen Thorne. I am a site reliability engineer at Google. I worked as an IC on both the ads side and the cloud side of the business. I ran App Engine for a number of years. And my new team, Customer Reliability Engineering, works with our cloud customers to help them understand how their products and their applications are running on our platform and how our products impact them, and to help them move fast and do things better on our platform.

Finally, the thing that I want to talk to you about today is the principles of site reliability engineering. These are the ones I'd like to cover: site reliability engineering needs service level objectives with consequences, site reliability engineers have time to make tomorrow better than today, and site reliability engineering teams have the ability to regulate their workload.

You can see very visibly that this is a progression, that you can do site reliability engineering without actually having a site reliability engineer, and you can have site reliability engineers before you actually have a site reliability engineering team. And each of these builds upon the previous. You can't run an effective site reliability engineering org unless you're monitoring and reporting on your SLOs and actually worrying about the reliability of your system. It just doesn't make any sense. So we're going to go through that today.

First of all, what are service level objectives? I believe they're fundamental to the SRE practice, and so we're just going to briefly introduce them, talk about them, and how to apply them.

Service level objectives define a goal for reliability, and we believe that we, the people who are running this application, and this is in concert with our product managers, with our business, we believe that this is the objective for how reliable it needs to be to meet the needs of its users and the business. We measure how well the application is performing and aim to keep it above that level of reliability.

There's an ideal to defining an SLO that I apply: that if the customers are happy, then the SLOs are being met. This, of course, ends up with situations where you're not meeting your SLOs and your customers are perfectly happy. That means you probably actually want to revise your SLOs.

These are typical SLOs. These are the simple ones. I've got to emphasize, these are the simple ones. Your site is up enough. Your HTTP server responds with success often enough, fast enough. Your log processor processes enough log entries fast enough.

Defining a good SLO means picking goals for measures that track your customers' actual experience. Ideally, you want to set goals that your customers really care about. So just to give you a counterexample, you don't want to set an SLO on, say, the CPU utilization of your backends or your network throughput and things like this, because your customers don't see those as errors. Your customers see HTTP errors. It might be that you have all of these causative things and you want to monitor them, of course, but your SLO is how you think about the reliability of your system and how you account for it.
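As a rough illustration of measuring "success often enough, fast enough" from the customer's side, here is a minimal Python sketch. It is not from the talk: the `Request` record, the 300 ms latency threshold, and the 99.9% target are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to respond, in milliseconds


def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
    """A request counts as good if it did not fail server-side and was fast enough.

    The 300 ms threshold is a hypothetical value, not one given in the talk.
    """
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that were good; 1.0 when there was no traffic."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if is_good(r))
    return good / len(requests)


# Made-up sample traffic: one slow request and one server error.
requests = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 80.0),
    Request(200, 95.0),
]
slo_target = 0.999  # hypothetical target
measured = sli(requests)
meeting_slo = measured >= slo_target
print(f"SLI = {measured:.3f}, meeting SLO: {meeting_slo}")
```

Note that nothing here looks at CPU or network throughput: the measurement is built only from what a user would experience.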

SLOs can get very complex, especially when you start to involve your users' actual expectations. So these are just simple examples. And every time I talk about SLOs, the immediate conversation jumps to the one which is one letter different, which is the SLA.

In my view and definition, SLAs are your legal agreements between organizations, with penalties for not meeting them, typically monetary. My definition of an SLO, or at least a good SLO, is that when your application's not meeting it, your customers are starting to get unhappy. This is well before the point where an SLA should be, which is when they're so unhappy they deserve their money back. So you want to be well on this side of that.

And I believe SLOs are best used within your company, not between organizations. You can use them between organizations, but you sort of have to have a very deep understanding that you're not held accountable to them in terms of refunds and monetary response and things like that.

But much like an SLA, an SLO has to have consequences, otherwise it's just a metric and it can be met, not met, that's fine. They're not going to have any useful impact. So I strongly advocate using SLOs as your primary way that your on-call engineers get alerted for emergencies, but that's not what this talk is about.

And if you'd like to learn more about how to monitor and alert and configure your monitoring system in order to be able to respond to SLOs and defend your SLO, then there's actually a preview chapter of The Site Reliability Workbook up on Safari Books Online, or you can wait till it comes out next month. And there's plenty of things written on the internet about this.

The thing that I want to talk to you about today, which is actually much more fundamental to how site reliability engineering works, is error budget policy.

And so first, this is the question which you ask somebody in context. If you ask, "How reliable do you want to be?" the answer is always, "More, the most reliable, 100%." But we know that the only truly reliable system is one that does nothing at all and can never change.

So you need to have some budget for failure. You have to know what an acceptable level of failure is, and you have to balance that against the needs of your business. It needs to do releases, have development velocity, to reduce costs. And so what we do is, much like you have a budget for how much you're going to spend on your cloud services or your development, you have a budget for failure.

And so we have this error budget. Now, I define an error budget as the difference between 100% reliability, the most reliable system you could ever have, and your SLO. In my example, if 99.9% is your SLO for uptime, that difference is 0.1%. Putting that into much more real terms, 0.1% of a month is about 43 minutes.

So if you have a 20-minute downtime, that's a terrible thing. The VP is on the phone. Everybody's angry that it's an outage. You get it fixed, you're back up. You still have 23 minutes of error budget left for the rest of the month, if you're accounting monthly.
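The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and a 30-day month is assumed, which gives 43.2 minutes (the talk rounds to 43, hence the 23 minutes remaining).

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a period.

    The budget is the gap between 100% and the SLO, applied to the
    total number of minutes in the period (30 days assumed by default).
    """
    total_minutes = period_days * 24 * 60
    return (1.0 - slo) * total_minutes


budget = error_budget_minutes(0.999)  # 99.9% uptime SLO
remaining = budget - 20               # after a 20-minute outage
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
# prints "budget: 43.2 min, remaining: 23.2 min"
```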

So that's essentially what an error budget is. You can have error budgets on a percentage of requests succeeded, failed, too slow, whatever, and you end up having this gap between 100% reliability and your SLO, and that's budget to spend. The idea is that you're allowed to have errors. You account for them. You say, "Okay, we agreed that this is the level at which our customers are going to understand how bad the system is and get angry at us." But before that, they're probably just going to hit reload in their web browser because they believe it's their home network and not our website.
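The same idea applies when the budget is counted in requests rather than minutes. The sketch below is an assumption-laden illustration, not the speaker's tooling: the function name and the traffic numbers are invented.

```python
def budget_consumed(total_requests: int, bad_requests: int, slo: float) -> float:
    """Fraction of the request-based error budget that has been spent.

    1.0 means the budget is exactly used up; above 1.0 means the SLO was missed.
    "Bad" can mean failed or too slow, whatever the SLO counts as an error.
    """
    allowed_bad = (1.0 - slo) * total_requests
    if allowed_bad == 0:
        # A 100% SLO (or no traffic) leaves no budget to spend.
        return float("inf") if bad_requests else 0.0
    return bad_requests / allowed_bad


# Hypothetical month: 1,000,000 requests, 400 of them bad, 99.9% SLO.
spent = budget_consumed(1_000_000, 400, 0.999)
print(f"{spent:.0%} of the error budget spent")
```

With a 99.9% target, one million requests allow about 1,000 bad ones, so 400 bad requests spend roughly 40% of the budget.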

Legitimately, that's something you should think about. For instance, who wants to run a four-nines-reliable mobile backend when mobile operators are having trouble reaching three?

So what do you do when you spend your whole budget? It's a budget. You run out of it. You've spent it. What do you do? It means you didn't meet your reliability goals and something needs to change.

Now, this isn't an SLA and the response isn't to say sorry. It's not to give anybody money. You have to do something to make your system more reliable. And the short-term goal is to stop the problem from getting worse, and long term, to make sure it doesn't happen again.

So I've worked with a product that had an excellent operations team, great incident management, heroes all around. That wasn't enough. Something had to change because our users were simply not happy with the reliability.

So what we had to do was institute our error budget policy. Now, it's always better to agree these policies beforehand, but sometimes you don't get there. But these are some example policies. The one that suits your business is probably very specific to what you need. And as a policy, it has to be supported from the highest level of your organization. It's no good having a policy if nobody follows it.

In my team's case, we found the biggest cause of instability was pushing new versions of the code. We had a situation where it was a monolithic release. Dozens of development teams all went into the same release, and every time we released, there was at least one problem with one of those things, and we had to roll back, and it was a big issue. We were burning error budget all the time. So we reduced the number of releases until we could break down the release into smaller releases and then move forward.
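An error budget policy of this kind could be written down as a simple decision rule. The following Python sketch is hypothetical: the three states and the 75% threshold are assumptions for illustration, and a real policy would be whatever the organization has agreed to and leadership supports.

```python
def release_policy(budget_spent_fraction: float) -> str:
    """Map error budget consumption to a release posture.

    Thresholds are illustrative assumptions, not values from the talk.
    """
    if budget_spent_fraction >= 1.0:
        return "freeze"  # budget exhausted: only reliability fixes ship
    if budget_spent_fraction >= 0.75:
        return "slow"    # budget nearly gone: fewer, smaller releases
    return "normal"      # budget available: release as usual


for spent in (0.4, 0.8, 1.2):
    print(spent, "->", release_policy(spent))
```

The value of writing it down is that the response is agreed before the outage, rather than negotiated in the middle of one.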

So this is our first principle, that SRE needs service level objectives with consequences. And if you have good SLOs and you have an error budget policy and you follow that error budget policy, I would legitimately say that's SRE. That is what I perceive as SRE. And you can do that today in your organization without having a single site reliability engineer, but you are doing site reliability engineering.

Now, you don't need to implement all the principles at once. That's very clear. And I've got to point out that this goes the other way, and I'll tell you another story.

I worked with a team. So I was engaging with a cloud customer, actually, that was running an ad system. And they had a log processing pipeline. This log processing pipeline was the major cause of all of their concern. They had on-call engineers dealing with it whenever it had problems.

Mostly, it had throughput issues. The latency would spike. It would go from 10 minutes to 20 minutes, 30 minutes. They were getting woken up in the middle of the night to deal with this. So I went, did a workshop with these folks, and we sat down and worked out what an appropriate SLO was for this pipeline.

It was a pipeline that regenerated the recommendation model, so the next time somebody came to the website, they would have much better ads for them. But you don't come to the website again in 30 minutes. You typically come the next day or the day after. So we sort of thought, well, what's the closest time that somebody would experience pain? In this case, we're defining that as lower-quality ads.

We picked six hours. And then suddenly, because latency was only ever spiking to a whole 30 minutes, we weren't exceeding the error budget, ever. With an SLO of six hours, we have so much error budget that we can reduce the amount of resources we spend on the pipeline and let it go into high latency more often, because the throughput doesn't matter. We know that it's acting reliably. We'll respond if it ever goes over that threshold, but otherwise it doesn't matter. And they haven't been paged about it since.
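To make the effect of the new threshold concrete, here is a small Python sketch. The latency samples and the 15-minute "old" paging threshold are invented for illustration; only the six-hour SLO and the 10-to-30-minute spikes come from the story above.

```python
from typing import Sequence


def runs_over_threshold(latencies_min: Sequence[float], threshold_min: float) -> int:
    """Count pipeline runs whose latency exceeded the threshold (in minutes)."""
    return sum(1 for latency in latencies_min if latency > threshold_min)


# Made-up pipeline latencies in minutes, spiking between 10 and 30.
observed = [10, 12, 20, 30, 11, 25]

old_pages = runs_over_threshold(observed, 15)      # hypothetical old alert threshold
new_pages = runs_over_threshold(observed, 6 * 60)  # six-hour SLO from the workshop
print(f"pages before: {old_pages}, pages after: {new_pages}")
# prints "pages before: 3, pages after: 0"
```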

And that's down from three engineers spending at least half their week dealing with it before we had the workshop. So it does go both ways. You can be within or outside of your error budget. That is a terrible yellow. I am very sorry.

So going on, our next principle: making tomorrow better than today.

SLOs and error budgets are fine, and the next step is defining what your first site reliability engineers will actually go and do. So I believe SREs must have real responsibility, meaning they must be both engaged with operating the system, but also empowered to do something about it when it goes wrong, and we're going to talk about how we achieve that.

What will your first SRE work on if you're starting with SRE in your organization? The first thing that they should work on is defining and refining service level objectives. Even if you have them, they can probably be improved. In my experience with most error budget excursions, the first thing you check is whether the service level objective itself was legitimately the problem, and you might want to improve it.

They're the best-placed person to actually enact your error budget policy, and they need to be accountable and responsible to the application and the fact that it's meeting its reliability expectations.

A major part of what SRE does is toil. Because they're operating a system in production, everybody knows you've always got to do more things with systems running in production. You might have to get more servers. You might have to go into a new cloud zone. You've got your weekly releases. Not everybody has zero-touch CI/CD releases. I know everybody wants that, but legitimately, somebody has to set that up.

So one system I worked on was reporting its SLOs quarterly, and every quarter was showing it to be consistently meeting every single one of its SLOs. And that meant that the team was freed from the reactionary operational work, and in the spirit of if you do a good job, your reward is more work, then they kept on being given more work to do, more systems to maintain, more responsibility. And so they spent a lot of time burning down this toil, because they had so much toil, but they had it capped.

And the reason we cap it is because if you're just doing toil, you can't improve anything. So we say that at least 50% of your time should be spent on project work, and at most 50% of that time be spent on toil. Of course, this is very qualitative, and essentially, if you ask somebody, "Are you spending too much time on toil?" they'll tell you.
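If a team does track its time, the 50% cap can be checked with trivial arithmetic. The sketch below is an assumption for illustration only: the function names and the hour figures are made up, and as noted above the measure is qualitative in practice.

```python
def toil_fraction(toil_hours: float, project_hours: float) -> float:
    """Share of tracked time spent on toil; 0.0 if nothing was tracked."""
    total = toil_hours + project_hours
    return toil_hours / total if total else 0.0


def over_toil_cap(toil_hours: float, project_hours: float, cap: float = 0.5) -> bool:
    """True when toil exceeds the cap (at most 50% of time by default)."""
    return toil_fraction(toil_hours, project_hours) > cap


# Hypothetical week: 26 hours of toil and 14 hours of project work.
print(over_toil_cap(26, 14))  # 65% toil, over the cap
print(over_toil_cap(18, 22))  # 45% toil, within the cap
```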

So SRE does project work, and the focus is making things better for themselves. So first, they might do whatever is required to actually meet the SLO. That's their first responsibility. Any project work required to do that, it might be, "Oh, we're having too much instability in one cloud region. Let's run in two cloud regions." It might be, "Our releases cause too much downtime. Let's actually do progressive releases." Do the project work required to address the SLO, and then move on to, they might improve the monitoring. They might work on automation. They might hold folks accountable to the postmortem action items. Whatever is required in order to make running those systems less toilsome, more productive, more reliable.

And that's our second principle. It's very simple. SREs must have time to make tomorrow better than today. Because if you're not capping that toil and allowing them to actually go and implement that monitoring work, then all they're going to do is get totally overloaded with toil, and then they won't be able to do any project work. And so the next time they actually need to do something to improve the reliability of the system, they're too overloaded.

And I think any org with one or thousands of SREs needs to be able to apply this principle. There must be this ability for the SREs to address the toil and do the project work.

So we're talking about getting started with SRE, and-- just checking time. I want to talk about our third principle. And this is about SRE teams.

So I've got to emphasize here that this is just how I see Google doing things. I don't want to be prescriptive and say, "Okay, first of all, what you have to do is restructure to the way I do things." Think about this in relation to your own company and how you might implement it.

And this is how I see it being possible for a team to be responsible for running mission-critical software in production and keeping it reliable. I don't think you can create a site reliability engineering team just by taking an existing team and changing the name on the door.

Getting started with SRE will work better if you start with SLOs, build up some principled approaches to engineering to those SLOs, and prioritize giving the most mission-critical systems to your SREs for support in production.

Note that I say mission-critical and giving. That means that there's plenty of room for other teams to run things. SRE don't implicitly have the only keys to production, and they're there to force multiply, not to gatekeep. Your SREs are there to be able to take responsibility for the reliable running of your systems. They're not there to take over running all reliable systems.

So by sharing the responsibility of running your application in production, with some products being partially or entirely owned by your dev or DevOps org prior to SRE taking them on, you're probably going to have a whole lot better synergy between your teams. So we see SRE as a team that needs to be able to do a good job of dealing with the most important pieces and, as that team grows, take responsibility for more and more production systems as they mature.

In order to be able to continue to do a good job, to make tomorrow better than today and engineer to the SLO, our first two principles, there have to be ways that this team can push back when their workload becomes too much to handle.

We apply that pressure in many different ways. I'll give you a few examples here. Giving 5% of the operational work to your developers is an excellent way to make them actually understand what it means to do that operational work. It stops things from continually getting put to the back of the backlog if they have to participate.

You, of course, can do the usual project management thing, track completion. SRE teams are often very good at analyzing new production systems because they've been running all the old production systems for so long, and so you see all the same patterns coming up again. And we all know that fixing things upfront before they're deployed is much, much cheaper than fixing them after they're deployed.

Leadership buy-in. So I believe that this is required for every aspect of SRE. It's required for the error budget policy, it's required for our cap on toil, and it's especially important for shared responsibility, because this is where your SRE team will interface with other parts of the organization, and you need management coverage.

If your leadership doesn't back up your error budget policy, you probably have to go all the way back to principle one, and probably again and again. When your application misses your SLOs, it puts a large amount of additional load on the SRE teams, who by necessity have to spend much more time fighting fires and accrue more tasks to improve the system. And so your leadership is additionally best placed to help here, either by funding more focus on reliability, funding software development where it's required, not necessarily by the SREs, but also with dev buy-in.

Or, and this is something that I didn't mention earlier, this is very important, you can always loosen your SLO. It might be that when you actually go to your CTO or your CFO and you say, "We need to spend this much extra money to meet this SLO, this target that we have," and they say, "Oh, well, that's a lot of money. Why don't we just do a worse job and eat that unreliability?" Okay, that's actually perfectly reasonable. They're the best-placed people to make that decision.

I have seen a team literally fall apart and be disbanded because the application was consistently out of SLO, and the development team refused to commit to actually addressing the problem. This is a direct quote: "If the pager is going off so often, can't you just hire more people to answer the pager?" That doesn't actually address the fact that the system is chronically unreliable and there's nothing that we can do about it.

So there is such thing as leadership helping the wrong way. The correct thing to do there was to freeze the pushes, fund the effort on reliability, burn down that backlog, actually do the things the SREs were telling the devs to do, which they were ignoring, and actually improve the health of the application.

So I don't have to tell you that fixing issues after release is always more expensive. Architecting resilient systems, maintain-- oh, and consistency. This is another great thing that SRE can bring: if your SRE team is responsible for running many, many systems, the more alike those systems are, the more systems they can run with less toil. And so SRE is often very well-placed to drive consistency across an organization.

I was talking to a gaming company (sorry, that's computer gaming, not gambling) who said that the one big thing that they enjoyed about SRE was the fact that they now had a team that could say, "Well, we're running all these systems in production, but they're all different. Every time you iterate, you have to make them more consistent," so that they could drive down this toil and drive up the productivity of the SRE teams. And it costs quite little if you do it upfront.

And I'm sure you can appreciate how much good automation can benefit the running of a system in production. Automation doesn't have to be written by SRE, I should emphasize that. SRE might do a lot of automation work, but it could be done by development teams, by your DevOps teams, by whoever you have. But your company should specifically address automation where it helps drive down toil, and prioritize it. This is how you'll be able to get your SRE teams to be responsible for more services, and yet not suffer from operational overload. This is how we scale up our team non-linearly with the number of services they support.

And so that's our third principle. SRE teams have the ability to regulate their workload. So once you have a team up and running, this is the principle you can apply to allow for growth and better harmony with the rest of your organization, scale up your operations, and develop even larger systems.

So in summary, these are our principles, which I'll restate for you. Site reliability engineering needs service level objectives with consequences, and you can do this in any organization. Site reliability engineers have time to make tomorrow better than today, and even your very first SRE needs to be able to do this. They need to be able to both run your systems and make them better. And site reliability engineering teams must have the ability to regulate their workload so that your teams can flourish and grow.

And thank you very much.

Q&A

I'm slightly over time, but I would take a question or two. No? Oh, there is a question.

Q: How many SREs are in Google?

A: How many SREs are in Google? I don't have access to accurate numbers. It's about 2,000. Sorry.

Q: Is an SRE part of a DevOps team, or is--

A: In Google, SRE is not part of our DevOps team. Not at all. We have both. Our SRE teams are much more focused on running systems in production than the DevOps role is. Our dev partners are actually more DevOps than we are.

Thank you very much, everyone. Thank you.