Ten Things We've Learned From Running Production at Google


Las Vegas 2023


SRE Engagements Engineering Lead · Google

10 key lessons that Google has learned from 20 years of SRE: How to establish a reliability culture, fight toil, and manage change.



Christof Leng

Thank you. Thank you, Gene. Thanks for having me back. Hello, everyone.

Today I want to talk to you about 10 things that we've learned over the years from running production infrastructure at Google.

As Gene already said, SRE is something that was started by Ben Treynor Sloss at Google pretty much exactly 20 years ago. We are celebrating the 20th birthday right this month. The idea behind it is taking a software engineering mindset and methodology to design and run operations.

It started at Google, but it's an industry-wide practice now. Over the many years, we collected proverbs that describe best practices and common pitfalls that we have encountered. Some of them originated at Google; others we have adopted. We just call them "prodverbs."

Now, Google SRE is thousands of engineers in one organization who work on basically every major Google product. And so we have many, many years of experience in very different areas. Google products can be very large or relatively small, very fast moving or very stable.

I think these topics apply widely, but you shouldn't just copy-paste them to whatever you are doing, because every environment, every organization is unique, and that won't get you very far. You should use them as food for thought. I hope they are useful.

Now, the first of the three sections is about culture. They famously say culture eats strategy for breakfast. And I believe, much like the DevOps movement, culture has been very essential to SRE's success.

So let me start with the first principle: reliability can't be taken for granted.

It's a little bit like these basic things in day-to-day life: air, food, and so on. It's easy to forget that they even exist, that you need them, while you have plenty. But when you run out of them, it can be very existential, and it can be very hard to get back to normality. Because when you run out of reliability, not one thing has gone wrong. Many things have gone wrong, and you need to fix a lot of them to get your systems and products reliable again.

So that's why there always needs to be a voice for reliability at the table. That is the role that SRE strives to fulfill, and it's why our official motto is, "Hope is not a strategy." It needs something better than that.

Number two is the metaphor: cattle and pets.

You should not have pets. Your systems should not be pets. Your systems should be cattle, because pets are unique. They have names, they have personalities. You invest a lot of time and energy and money into maintaining them because you care deeply about each individual one of them, and each individual one of them is different.

Whereas for cattle, you care about the herd, not the individual. They're uniform. They typically have numbers instead of names, and individually they're cheaper.

So this is not about actual animals. This is about machines. I don't endorse industrial farming, but luckily machines are not living beings. So we should standardize them. We should scale them. That makes it a lot easier for us to run large systems, run them efficiently at low cost, and also allows us to change relatively fast and easily.

And don't forget: cognitive load is an important bottleneck.

Number three: you probably have all heard about blameless postmortems. Let me talk a little bit about the why. Why do we care about blamelessness?

A lot of people think it's about just being nice to each other. And it's true, yes, it is, but it's not all of it. There's more to it.

When you create an environment where people are afraid to speak up, to admit mistakes that they have made, to flag when there is a problem, you will not get the full information about the weaknesses in your infrastructure, in your systems. And then you can't fix those. You can't do good risk management.

So you need people to be able to speak up, to be willing to speak up, without a fear of consequences.

And also, finger-pointing doesn't help with anything, because you won't be able to fix people. You need to fix the systems and the processes.

So if there's this big red button that Christof pushed that took down the whole system, the question is not, "Why is Christof so stupid?" Believe me, I like big red buttons. The question is, "Why do we even have that button in the first place, and why is it so easy for Christof to press it without thinking too much?"

Number four: measuring.

I don't have to tell this community about the importance of metrics. We've talked about it at length over the last few days. But metrics also have risks.

So when you put out a metric and say, "This is important to the business. This is important to each individual's career," they will start optimizing towards that metric. So that metric better actually align with business outcomes.

Often we use proxy metrics, and it can be very misleading if people try to optimize for them instead of the actual business outcomes.

And also, when you don't measure something, it typically gets worse because people are not paying as much attention. They're very focused on improving that other metric that hopefully will get them a promotion.

So you should be very careful of what you select and what you measure and what you don't measure, because you can't measure everything. If everything is important, nothing is.

And then always iterate on your metrics. There's not just this gold set of metrics that will always be true for everybody everywhere. It really needs to align with your business. And then apply context and not just follow the numbers.

The second section is about operations. SRE is not an operations role, but operations plays an important part in what we do.

So let me explain. First of all, I like to say that the only way to really understand the limitations of a system is to see it go up in flames with live user traffic. And you have to be there. Don't read the postmortem; be part of that experience.

In the situation, you will have so many more opportunities to dig a little bit deeper, to understand what has happened. And only with that information can you really improve the system.

But that is the goal, not the on-call. The on-call is a means to an end.

And there's another thing to it. When you are on-call, when you are working together with the developers on these systems, you have skin in the game. They trust you because you are in the same boat, and then they're more likely to listen to your advice than if you stand on the sidelines and give smart comments.

But when you're on-call, don't be a hero.

We don't need heroes. They're actually very bad. First, heroism is bad for the heroes themselves. If you look at literature, heroes tend to have a much shorter life expectancy. It's not good for your health, mental and physical.

It's also not good for the team because it creates a certain culture that is not very sustainable. If you say, "Look, Christof was so great. He stayed all weekend to fix the system, and the customers are so happy," everybody else in the team thinks, "Oh, I should also do that."

Don't applaud people for that. Sometimes we need to go the extra mile. But you should be very careful to not create a culture where this is expected.

Because it's also bad for your products and your systems. If the best thing is to extinguish fires, it's boring if things never catch fire, right? People won't invest so much time into improving the systems early on because they're very busy with fighting fires.

And also, the on-calls should never be alone.

The worst thing that can happen to you is getting paged at 3:00 AM. Nobody's around. You have no idea what this alert means. You have no idea what to do. You're stuck, and you don't know how to escalate or who to escalate to. There's no way forward.

I never want to be in that situation, and I don't want any of the engineers to be in that situation either.

So there should always be people around in one way or another. There should be clear escalation paths. On-call should never be alone. It leads to faster mitigation. It leads to better insights on how to improve the system. And it makes on-call a lot less scary.
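One way to make "clear escalation paths" concrete is to write the path down as data, so that a page always has a next person to go to. The following is a minimal Python sketch under stated assumptions: the tier names, the acknowledgement timeouts, and the function `current_escalation_target` are hypothetical and are not something described in the talk.

```python
# Hypothetical escalation policy: (who to page, minutes they have to acknowledge).
ESCALATION_CHAIN = [
    ("primary on-call", 5),
    ("secondary on-call", 10),
    ("team lead", 15),
]


def current_escalation_target(minutes_unacknowledged, chain=ESCALATION_CHAIN):
    """Return who should hold the page after this long without an acknowledgement."""
    deadline = 0
    for target, ack_minutes in chain:
        deadline += ack_minutes
        if minutes_unacknowledged < deadline:
            return target
    # The chain is exhausted: the page stays with the last tier.
    return chain[-1][0]


for minutes in (0, 5, 20):
    print(minutes, "->", current_escalation_target(minutes))
```

The point of the sketch is only that the answer to "who do I escalate to?" is decided ahead of time, not at 3:00 AM.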

We say you should automate yourself out of your current set of tasks about every 18 months.

And the reason for that is that if you keep doing the same things that you have been doing, you will always get extra work on top: new systems, new things, increasing toil from increasing user traffic.

You will drown in toil. You will only do this repetitive work, and you won't have any time for engineering and for improving the systems. So simply for staying in the same situation where you can actually have some time for engineering, you need to aggressively automate.

But most people think that automation is first and foremost about efficiency. I would argue it's more about consistency, because the automated script will always do the same things.

If you ask me to turn up five clusters by hand, no one of them will look exactly like the other. I will be very, very careful, but I might overlook something here, forget something there. That command couldn't run there, so I tweaked it a little bit. So every one of them will be slightly different, and they will become pets.

And you don't want pets, because these things will blow up later because nobody expected that this one cluster had this flag set differently.
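The consistency argument can be illustrated with a small Python sketch: build every cluster from one template, and check mechanically for flags that differ. Everything here is an assumption made for illustration, including the flag names in `BASE_FLAGS` and the helpers `render_cluster_config` and `find_drift`; it is not Google's tooling.

```python
from collections import defaultdict

# Hypothetical baseline; the flag names are illustrative only.
BASE_FLAGS = {"replicas": 3, "log_level": "info", "tls": True}


def render_cluster_config(name, region):
    """Build every cluster from the same template, so only identity differs."""
    return {"name": name, "region": region, "flags": dict(BASE_FLAGS)}


def find_drift(configs):
    """Return {flag: {value: [cluster names]}} for flags that differ across clusters."""
    all_flags = set()
    for cfg in configs:
        all_flags.update(cfg["flags"])

    drift = {}
    for flag in sorted(all_flags):
        by_value = defaultdict(list)
        for cfg in configs:
            flags = cfg["flags"]
            value = repr(flags[flag]) if flag in flags else "<missing>"
            by_value[value].append(cfg["name"])
        if len(by_value) > 1:  # more than one distinct value means drift
            drift[flag] = dict(by_value)
    return drift


clusters = [render_cluster_config(f"cluster-{i}", "us-east") for i in range(5)]
print(find_drift(clusters))  # empty: every cluster came from the template

clusters[3]["flags"]["log_level"] = "debug"  # simulate a hand-tweaked "pet"
print(find_drift(clusters))  # reports log_level, with cluster-3 as the odd one out
```

A check like this finds the one cluster with the flag set differently before an outage does.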

The third section is about change. Change is super important. Change is happening all of the time in our organizations and the world around us and in our software systems. Software is fast moving.

So how does change impact how we run production systems? First of all, it breaks them. Change is the number one reason for outages.

So should we stop changing software? Quick show of hands.

A few.

So probably not. It would make life easier, right? But also not very enjoyable. We love changing things, and it makes our products better and makes our users happier.

There is some inherent risk to change. It's a risk that we're willing to take. But there's also accidental risk from poor change management, and that is something that we can minimize.

So first of all, you shouldn't just roll out a feature globally and see what happens. I can tell you what will happen.

Use incremental rollouts and test at every single step. Test in non-prod. Test with a small percentage of users. Test with a small percentage of the regions, and so on.
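As a rough illustration of "test at every single step," here is a minimal Python sketch of a staged rollout loop. The stage names and traffic fractions in `STAGES` are hypothetical, and `deploy`, `is_healthy`, and `rollback` are stand-in callables that a real system would have to supply.

```python
# Hypothetical stages: (label, fraction of production traffic on the new release).
STAGES = [("non-prod", 0.0), ("canary", 0.01), ("one region", 0.10), ("global", 1.0)]


def staged_rollout(deploy, is_healthy, rollback, stages=STAGES):
    """Advance one stage at a time; roll back and stop at the first failed check.

    Returns the labels of the stages that passed their health check.
    """
    passed = []
    for label, fraction in stages:
        deploy(label, fraction)
        if not is_healthy(label):
            rollback()
            break
        passed.append(label)
    return passed


# Usage with stand-in callables: the check fails once the release reaches a region.
log = []
result = staged_rollout(
    deploy=lambda label, fraction: log.append(("deploy", label)),
    is_healthy=lambda label: label != "one region",
    rollback=lambda: log.append(("rollback",)),
)
print(result)  # ['non-prod', 'canary']
print(log)
```

In this run the global stage is never attempted, which is the whole benefit: the failure is caught while only a fraction of users can see it.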

Do not deploy a config that hasn't been submitted to a code repository and been code reviewed. Because if you don't do that, if you just run the config from your command line, you don't actually know what's being deployed in production.

And I've been debugging outages for way too long only to find out that somebody else on the team had pushed this one flag from their workstation.
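One possible way to enforce this rule is a gate that refuses to deploy any config whose exact content is not known to the repository. The Python sketch below assumes a hypothetical set of content hashes for reviewed configs; how that set is obtained from a real code review system is left out, and the helper names are made up for illustration.

```python
import hashlib


def digest(config_text):
    """Content hash that identifies one exact version of a config."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def check_deployable(config_text, reviewed_digests):
    """Raise if this exact config content was never submitted and reviewed."""
    d = digest(config_text)
    if d not in reviewed_digests:
        raise PermissionError(f"config {d[:12]} has not been reviewed; refusing to deploy")
    return d


# Stand-in for the digests of configs that were submitted and code reviewed.
reviewed = {digest("max_connections=100\n")}

check_deployable("max_connections=100\n", reviewed)  # passes

try:
    # The same config plus one flag pushed from a workstation.
    check_deployable("max_connections=100\nexperimental_flag=true\n", reviewed)
except PermissionError as err:
    print(err)
```

With a gate like this, what runs in production is always something that can be looked up in the repository.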

And don't deploy on Fridays, or whatever Friday is in your part of the world. Not on weekends, and not on holidays when nobody is around. Because as long as your rollouts are not great, and nobody's rollouts are really great, something will break, and there won't be anyone around to fix it.

Do that during business hours. It's much more convenient, believe me. You'll have coffee and everything.

But if you ever get to a state where actually rollouts just never break, roll out all the time, obviously. And tell me how you did it.
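The deploy-window rule can be written down as a simple check. This is a minimal Python sketch; the business hours, the default blocked weekdays, and the function name `is_deploy_allowed` are assumptions chosen for illustration, and the blocked days are a parameter because "Friday" differs by region.

```python
from datetime import date, datetime

# Python's weekday(): Monday is 0, so 4, 5, 6 are Friday, Saturday, Sunday.
DEFAULT_BLOCKED_WEEKDAYS = frozenset({4, 5, 6})


def is_deploy_allowed(now, holidays=frozenset(),
                      blocked_weekdays=DEFAULT_BLOCKED_WEEKDAYS,
                      start_hour=9, end_hour=16):
    """Allow rollouts only on staffed days, inside business hours."""
    if now.date() in holidays:
        return False
    if now.weekday() in blocked_weekdays:
        return False
    return start_hour <= now.hour < end_hour


print(is_deploy_allowed(datetime(2023, 10, 4, 10, 0)))  # Wednesday morning: True
print(is_deploy_allowed(datetime(2023, 10, 6, 10, 0)))  # Friday morning: False
print(is_deploy_allowed(datetime(2023, 10, 4, 10, 0),
                        holidays={date(2023, 10, 4)}))  # holiday: False
```

A team whose rollouts really never break could simply drop the check.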

Number nine: outages will happen.

There's an inherent risk to change, and it will cause outages. And other things cause outages too, like a hurricane or an earthquake.

That's okay. The idea is not to prevent every outage. It is to minimize their impact: minimize the blast radius, meaning how many users, how many regions, and how many of your products are affected, and reduce the time to mitigation, so that the system is up and running again quickly.
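To see why both levers matter, impact can be approximated as affected users multiplied by minutes until mitigation. This "user-minutes" measure is an assumption made for this sketch, not a metric named in the talk, and the numbers are invented.

```python
from datetime import datetime


def impact_user_minutes(affected_users, started, mitigated):
    """Rough outage impact: affected users times minutes until mitigation."""
    minutes = (mitigated - started).total_seconds() / 60
    return affected_users * minutes


start = datetime(2023, 10, 4, 10, 0)

# Global rollout, mitigated after 30 minutes.
print(impact_user_minutes(1_000_000, start, datetime(2023, 10, 4, 10, 30)))

# Same bug caught at a 1% stage: smaller blast radius, same time to mitigation.
print(impact_user_minutes(10_000, start, datetime(2023, 10, 4, 10, 30)))

# Same 1% stage with a fast rollback: smaller blast radius and shorter mitigation.
print(impact_user_minutes(10_000, start, datetime(2023, 10, 4, 10, 5)))
```

Shrinking either factor shrinks the product, which is why incremental rollouts and quick rollbacks reinforce each other.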

So root-causing is super important, but it can wait. Collect all of the data during the outage and then analyze later. First, the system needs to be up and running again.

Be able to roll back things quickly, and then analyze why the release didn't work.

And to be able to support your postmortems and your root-causing, use written communication during the incident as much as possible, because then you will have a paper trail of what actually happened. It will be easier for others to join you and help you and read up on what had already been tried.

And give everybody access to the source code so they can actually figure out what happened in the code that caused this outage.

Last but not least: no haunted graveyards.

If you don't manage your technical debt actively, it will get worse, and you might reach a point of no return, and then nobody wants to go near that code anymore.

Have you ever seen something like, "Don't touch this code. Very important"? Yeah, I love these TODOs. The first thing I clean up in the code base is, like, what happens if I remove that? Okay, interesting. Hmm. That doesn't work here.

Because these things are booby traps for change. If you have many of these things in your code base, they're just unmaintained, and nobody wants to go near them anymore. They will trigger when you change something else, and they will blow up in your face in the worst possible moment.

So clean them up early on.

So let me summarize. What did we learn?

First of all, running production systems is a team sport across silos. Build relationships, work together, or you will fail and have a horrible time.

Second, there is change. Your systems need to change. Your products need to change. Your organization needs to change. You need to keep changing. That is good. That is healthy.

But don't try it by working hard. Do it by working smart. Do it through engineering. We are engineers.

And try to keep things simple. Everybody can build a complex system. That's not hard. That's not something to applaud. Try to build boring systems.

Thank you so much.

My collection of topics is definitely not complete. So the help that I'm looking for is: what are typical principles and proverbs that you know, that you have learned, and that you might want to share with me?

I will be at the Google booth downstairs, upstairs, somewhere later today.

Thank you so much.