SRE in Enterprise

Las Vegas 2023

Steve McGhee

Reliability Advocate, SRE · Google

James Brookbank

Cloud Solutions Architect · Google

A talk on Enterprise Roadmap for SRE, an O'Reilly report authored by Steve and James. We review challenges that enterprises are having with adopting SRE today and how to overcome them. Paper copies of the report will be available.

Full transcript

The complete talk, organized by section.

Steve McGhee

Hello?

James Brookbank

Hey. You got the clicker?

Steve McGhee

I got the clicker.

James Brookbank

Got the clicker. Got the books.

How's everyone doing? Everyone had lunch? Coming back from lunch. Yeah, this is the hard one. This is when you have to stay awake.

Steve McGhee

I think you can do it. Can you do it, James?

James Brookbank

We're going to make it fun.

Steve McGhee

We're going to make it fun. Don't worry.

James Brookbank

It's a fun talk.

Steve McGhee

That's right. Yeah, it starts with a cat.

James Brookbank

The reason why it's going to be fun is because we haven't got this approved. So that's the best bit, right?

The lawyer cat is here to remind everyone: these are our opinions. We love this stuff. We're passionate about it. They're not the opinions of our employer. Do not tweet my boss. Don't put stuff on Instagram. Well, do that, but not to my boss. These are our opinions. The lawyer cat has spoken.

However, the book is approved. The book is official. You can read the book. You can quote the book. You can take pictures of the book. It's fine.

Steve McGhee

So we're hoping that this will give you a post-lunch introduction to the book. How about that?

I'm Steve McGhee. I'm the guy on the right in black and white. I was an SRE inside Google for about a decade. Then I left. I joined a company that was adopting cloud and had a DevOps team, and that was where I learned the word DevOps. I came back to Google, and now I help customers adopt reliability practices. And I work on the DORA team as well, as you can tell by my shirt.

James Brookbank

Yeah. I'm like the reverse experience. I've spent all my time in enterprises, like banks. I'm the dinosaur here. The fun dinosaur, not the scary one. I'm the fun dinosaur here.

I've done a lot of time working for enterprises, learning about DevOps, and then finding out about SRE. That's something that I stopped doing a little bit, coming to Google. I now work with Steve inside Google Cloud. Together, our experiences come from two different backgrounds.

Steve McGhee

That's right. We're sort of like peanut butter and jelly. Maybe peanut butter and chocolate.

James Brookbank

Sure. One of those. It's a mix.

Steve McGhee

We're not sure.

James Brookbank

So let's go a little bit into why we're doing this. Why are people interested?

You can hear Google talk about SRE. That's good, right? We like that. But also, a lot of people are like, "What about enterprises? I kind of want to be like Google, so I just get the SRE, right?"

Steve McGhee

You just download it and install it.

James Brookbank

Kind of, right? It's great. That enthusiasm is there. This is fantastic. It's awesome that people want to do SRE. This is great.

But we've not seen as much successful adoption as we'd like. We think that reliability is the most important thing. We're SREs, so we think that. But actually, for a lot of people, that's not the case. It's not always the most important thing. That's okay.

It's also sometimes expensive or difficult, so you've got to be a little bit careful around that. We get it. And maybe you don't want it the Google way. Maybe that's not what you're looking for, but you want something that's better than today.

Steve McGhee

Progress.

So this is what we're going to talk about today. If this is the wrong talk for you, feel free to go out the back. I can't even actually see you, so I won't be offended.

James Brookbank

There's still coffee.

Steve McGhee

There's plenty of coffee and food out there. That's fine.

James Brookbank

You can make a run for it now, and we won't mind.

Steve McGhee

Right. So we have a book.

James Brookbank

We have a book.

Steve McGhee

This is the official one. You can go get the book. You can come to the Google stand and acquire this. We wrote this for a very good reason. We have another book on SRE, and none of you read it. That made us very sad. It was a really big book.

James Brookbank

It was quite thick.

Steve McGhee

And some people told us they read it when they hadn't. So we made this smaller one just for you or your boss. You can give it to your boss. It's super thin. Read it on the plane, basically.

James Brookbank

Yep. This is the CliffsNotes version, and it has all the extra stuff. So if you don't want to read the big SRE book, this is for you. This is why we're here.

Steve McGhee

And we have them at the Google booth out in the vendor hall.

Let's get started. You may recognize some of these words. If you don't know, that's okay.

James Brookbank

If you don't know these words, that's also fine.

Steve McGhee

The point, though, is that SRE does not take over any of them. It's meant to be complementary.

We have actually seen several teams saying, "We have DevOps already. That's great. And we're going to now have SRE instead." That's not the plan. We don't want to say instead. Think of it as complementary.

Again, think about ITIL. ITIL is this giant thing. SRE does not do all the ITIL things. There's plenty of things in ITIL that have nothing to do with SRE practices. So try not to overthink what SRE might accomplish, and instead think of it as complementary to your existing setup.

So you don't need to burn everything down in order just to graduate to this.

James Brookbank

Please don't. Please don't burn things down.

Steve McGhee

Right. James, what the heck is a CAB? What's a CAB?

James Brookbank

We did the token CABs. They did the token CABs. That's right.

Change advisory boards are something that, if you know, you know. And it's not always the fun thing that you might think.

Network operation centers, they're awesome. They've got lots of screens. It's pretty exciting. You don't need to get rid of these. In fact, don't get rid of these. That's kind of bad, actually. The only thing worse than a CAB is a CAB in the shadows, happening without you realizing.

So if you get rid of it and then people are just doing changes, that's worse. That's materially worse.

Don't think about these things as the things to worry about with this interplay of dependencies. Think about the philosophy. Stop centralizing stuff. It's too much. Start thinking about scaling out instead of scaling up.

Steve McGhee

Just like how we all kind of decided that waterfall maybe wasn't the best plan. It's just another form of centralizing and trying to make it perfect. That's not the approach that you want to take.

You may have heard of DORA. I heard a new report came out today, I think.

James Brookbank

Yeah, you should definitely check it out.

Steve McGhee

Who here thinks DevOps is a pretty good idea? This is kind of the conference where people might agree with that. Great. Excellent.

James Brookbank

If you don't like DevOps, this is probably the wrong conference. It's a bit late now.

Steve McGhee

That's right. You are here. There's a bit of a selection bias here, perhaps.

But what we're trying to say here is, start with DevOps, please. Don't go, "We're going to skip DevOps, and we're going to go straight to SRE." Bad plan. Don't do that. Do the DevOps. Follow DORA, if it's helpful for you. You can add SRE to that as well. In fact, there's a part of the DORA reports that talks about reliability. It's all really part of the same thing. Don't skip the DevOps and try to graduate straight up. Don't do that.

Another thing to point out: in last year's State of DevOps Report, as of today because we just put out the '23 one, you can see that we showed that if you improve your software delivery performance, you get better business outcomes. Sure. But if you also improve your reliability, it gets even better. In fact, there's a number you can see, 1.8x, which is pretty good.

If you have them both together, it's just additive. It's not just that reliability is a good idea. It actually has better outcomes for your business. Do both.

James Brookbank

If you don't believe that, read The Phoenix Project. It's amazing.

Please don't rat me out on these as well. Don't call your boss and be like, "James said cut your training centers." That's not what we mean here. Don't take this literally.

Maybe think, when you're doing training centers or centers of excellence, think about enablement. Think about a center of enablement as helping your teams work. Think about dojos as training people in these spaces.

This is the same for DevOps as SRE. Don't learn in isolation. They're not doing this in the ivory tower. They need real hands-on practice. Don't just stop doing these things and fire all your trainers. Training is good, but don't lock them all in the ivory tower. Let them out into the world.

Don't big bang stuff. I feel like if you're doing that at a DevOps conference, other people can fight with you on that. But we're going to also say, please don't big-bang approach things.

Steve McGhee

That's right. Gradual change is a tenet of SRE, and you should apply gradual change to SRE itself as well. When you introduce it, introduce it gradually.

James Brookbank

So this is me. I used to be a cloud architect in enterprise. It's a tough journey, right? I'm super sympathetic.

Your cloud journey, and sure, we work for a cloud provider, so I get that this is like, "Of course you would say cloud is a good idea." It's a good idea for SRE. It doesn't have to be our cloud. You can use a private cloud, hybrid cloud, all the different types of cloud are good.

But you've got to do some of these things for SRE to work. Your SREs need flexible, scalable environments. Flexible infrastructure is critical for SRE. If you're giving SREs one server and saying, "SRE this," it won't go well. They'll start building clouds if you let them. That's also fine as well. But just be intentional about your cloud journey and your SRE journey.

They're not the same thing. You've got to work out how they work together. You don't want to tightly couple these two journeys together. They're complementary, but they're not the same journey.

Steve McGhee

Yep. Another thing that has come up a couple times during this whole week, or you may have heard, is culture.

We actually have this theory, which we can't prove, that SRE is emergent from culture itself. We found that teams that are following this generative model are already doing things that we recognize as SRE, even if they're not calling it that.

So which came first, the culture or the practice? We're not really sure.

James Brookbank

Probably the culture.

Steve McGhee

You think it's the culture?

James Brookbank

Almost definitely.

Steve McGhee

I think you're probably right about that.

However, if you go the other way, if you have a culture which isn't so right-shifted on this chart, and you try to apply SRE to it, it may not go quite as well as you'd like. If you heard some of the other talks today about how you go about changing cultures, I'm not going to go through that today. It's a whole different talk. But make sure you are pushing your culture to the right if you're trying to adopt SRE practices. That's the point.

Another point of this is, it's very rare that one entire culture will be consistently all in one column. You can actually have a very erratic set of things.

James Brookbank

Bubbles.

Steve McGhee

Bubbles. You can have bubbles within your organization. You can have parts of your culture which are far to the right, and parts which are in the middle, and parts that are just all over the place. So be aware of that as well.

So why even SRE, James?

James Brookbank

Why are we doing SRE? Why is SRE happening to us? I thought DevOps was good enough. We were doing the DevOps and now suddenly I've got all these SRE people. Why have I got DevSecOps? What happened? Why is the ML everywhere? It's got everywhere.

Selection pressure is what's happening. Things are changing. There was a time when you could just have the network down for the afternoon. We'll just come back tomorrow. That's not the case. Everything is 24/7. You've got to keep your operations running in a completely different way from 10 years ago, 20 years ago.

That's why SRE is suddenly popping up in these spaces. The evolution is happening. The Red Queen effect is real. Some people have to run to stay still in this space, because the expectations now are that you're as good as Amazon. If you're a retailer, those expectations keep going, even if you don't. So this is why we're in this space.

Steve McGhee

That's right.

Okay. This is a question for the audience. Now I'm going to ask you a question, and just keep it in your head. Just imagine yes or no.

Can you build more reliable services on top of less reliable infrastructure?

Huh. Is it a trick question? Is he tricking us about the trick question?

The answer is yes, you can. To think about this, think about RAID. You can build a more reliable system, like a SAN, on top of less reliable parts: disk drives. We know this is true. We have things like resilience through redundancy. We have controllers that manage the disk, things like that. Basically, we're adding complexity to a system through software, which makes up for the fallible parts below it.

This is an important consideration. I hope you believe me because that's going to make the next slide a lot easier.

Lots of times I have customers come to me and say, "But Steve, we want to build a reliable system. But your cloud fancy VMs only have like three nines of availability. How am I supposed to build a five-nine system on top of three-nine VMs?"

The reason you're asking is that you're thinking on the left side. You're thinking that your system needs to be less available than the parts below it, that your infrastructure needs to be more reliable and more available than the system that you present, that you put on top of it.

But cloud actually goes the other way. Cloud goes this way, which is super weird.

If your brain is on the left and you're working in cloud on the right, this is where cognitive dissonance happens. This is where you have all of those meetings where people yell at each other and no one really understands. And the cloud team says, "Ours is fine." And the app team says, "No, it isn't."

So you have to teach everyone: if we're on the cloud, we're on the right side. And there's questions like, "Why are we on the right side? Why is cloud upside down? It's crazy." The answer is kind of just scale. It just works out that way, I'm afraid. But it's true.
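The arithmetic behind the three-nines question can be sketched in a few lines. This is only an illustration, assuming replicas fail independently and that failover between them is instant and perfect, which real systems only approximate. The function names are ours and do not come from the talk or the report.

```python
def combined_availability(component: float, replicas: int) -> float:
    """Availability of N redundant replicas where the service is up if any one is up.

    Assumes independent failures and perfect failover (a simplification).
    """
    return 1.0 - (1.0 - component) ** replicas


def replicas_needed(target: float, component: float, limit: int = 100) -> int:
    """Smallest replica count whose combined availability meets the target."""
    for n in range(1, limit + 1):
        if combined_availability(component, n) >= target:
            return n
    raise ValueError("target not reachable within limit")


if __name__ == "__main__":
    three_nines = 0.999   # availability of one VM, as in the customer's question
    five_nines = 0.99999  # availability the service wants to present
    print(replicas_needed(five_nines, three_nines))
```

Under those idealized assumptions, a small number of three-nines parts is enough on paper. In practice, correlated failures (a shared zone, a shared config push) and imperfect failover eat into that margin, which is where the software complexity Steve mentions comes in.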

James Brookbank

You can keep your traditional model if you want. If you want to keep your mainframes, we're not stopping you. Keep your traditional model.

But just be aware that if you cross the streams here, if you mix these two styles of infra and app together, you're going to have a hard time.

Steve McGhee

That's right.

James Brookbank

Pick one of these models to use. That's crucial.

Should you even do SRE? That sounded hard. Should I just keep the mainframes? Well, sure. We're engineers. We're not cops.

The decision making, though: when, why? Here's some guides. This isn't an SRE thing. This is like a McKinsey model. You can show this to your CIO.

If you're going to do SRE, you've got to do it in production environments. Do it with the stuff, the Horizon 1 stuff, that's running now, that's already in production, that needs to scale. That's where your SRE money should go, your effort and intention.

If you've got a demo app and you're showing it at this conference, you don't need SREs. They're busy already. Don't assign the SREs to that stuff. Be intentional.

Most teams in Google don't have SREs. That's okay. Weird, right? Maps does, Search does. Those are still working well. But a lot of services don't have SREs.

Steve McGhee

That's okay. The other thing that helps to explain that is, if you're adjacent to a team that has SRE, you still benefit from that. You can still call them.

James Brookbank

You can still use their toolchain.

Steve McGhee

Well, page them, right? Things like that.

James Brookbank

So why else should we use SRE?

If you don't believe the horizons, that's cool. But also maybe go through the checklist. Think about this: is reliability a product differentiator? Do your users care? If you're like, "I have to have it up at the weekend," but no one's using it at the weekend, you don't need it.

Do you have critical risks? Does the government call you if your system is down? Let's maybe think about some SRE. This is probably important. No one likes that call from the government.

The hyperscale services: if you've got one server, it's not going to go well. If you've got a billion servers, suddenly you'll be thinking, the SRE will be happening to you at that kind of scale.

So think about this checklist. Think about where to put SRE first. Just try it. Try some of these things. Use some of these guides.

Sorry, this one, I was told there'd be memes.

It's going to get confusing when you talk to your accountants. They're like, "SRE, this looks expensive." SREs are expensive. This is pretty expensive.

The reason that Google did SRE, one of the main core principles, sublinear scaling, is about cost reduction. We didn't want to add administrators in a linear way for the amount of servers that we had.

So at the core of SRE, we are in the cost reduction space. Don't stop talking to your accountants at that point. That's bad now.

Good cost reduction, this is great. Think global cost optimization. You're optimizing for the fleet. You're optimizing for a large number of areas. Invest in people who are doing software development so they can write the tooling that saves the money.

Keep the global and local pieces in your mind at all times when it comes to the cost optimization pieces.

Steve McGhee

So we're not going to reduce the cost of one department, right?

James Brookbank

One SRE is not going to fix the one department. It's a broader impact.

Steve McGhee

Let's talk about some principles.

This is possibly my favorite meme of all time. Has anyone suffered from this, renaming your—

James Brookbank

He's just calling people out right now.

Steve McGhee

I'm sorry. You don't need to raise your hand if this is something that you've experienced.

But we see this again and again and again. We saw this with DevOps teams as well. We saw this with cloud teams.

James Brookbank

Just renamed it.

Steve McGhee

We're just doing it again. We're just doing it again. Just call them SRE. They'll love it.

Please don't do this. I see it all the time. Yes, it makes me sad. What should we do instead, James?

James Brookbank

Again, some of you might like the DevOps, so this is not new information for you.

Please think through these lenses. If you want to get started, you don't necessarily want to go in for the tooling and all the cool stuff. Just start small. Go find the problems in your environment that you need to fix and go start fixing them. Build those incrementally.

Do not start this conversation with organization-destroying mistakes. Even if you rename everyone to SRE, that's not so bad. They're just called SREs now. It didn't help, but it didn't make it worse.

If you just fired everyone, it's not going to go well.

Steve McGhee

That's harder to undo.

James Brookbank

That's bad. Yeah, it's hard to undo that. So try having a reorg first. You can always do another reorg. That'd be fine. Do that first. Sorry.

Steve McGhee

With the people that are running your team, it's really important to make sure that you invest in that team. Not only in terms of pay, but in education as well. We want to make sure that we have the right mix of people and they're doing the right work.

Within Google, the model that we suggest, that we have used in the past, is your SRE team should be a mixture of sysadmins and software engineers. They're not on two different teams. They're on the same team together. They work side by side.

Again, as James alluded to, we don't want to just fire the people who can't code, for example. We want to bring them up. We want to make sure that the folks who aren't traditionally coding, we give them the opportunity to learn how.

If you have folks who are traditional SAs, and you suggest, "Hey, we can help you learn how to code," and they say, "That's not for me. No way," maybe that's another discussion to have here. We do want to make sure that the people are growing, that they want to grow, and they're being empowered to grow.

But don't just get rid of them and hope to hire perfect unicorns off of some hiring site. That's not a great plan. Use your employees that already know your business. They're the best people to bring up.

Remember also that part of SRE, or I would say even the core of SRE, is embracing risk. You want to make sure that people understand that in their day-to-day job and that it's a place that is safe to fail.

The best way I can describe it is through a story. If a person pushes a button and accidentally breaks something on your site, then immediately fixes it, what do you do with that person? Do you bring them off to the side and say, "Never do that again"? Do you fire them? Or do you put them on a stage and say, "Look at what they did. It was so great. Good job, everybody"?

Showing that it's okay to have a little bit of failure, as long as you're able to recover from it, is the best thing you can do as a leader.

Another thing that you can think about is you can ask your teams, "How many projects did we have that failed recently?" And if the answer is zero, we're not really taking risks, are we? We're only investing in projects that we know are going to succeed, or we're lying to ourselves, which is always an option.

So make sure that you think of your failures, even your incidents and outages, as just unplanned learning opportunities. This is a chance we have to learn about our system through those incidents. That is the best way to move forward.

James Brookbank

Yes. So this is good, right? You've got all the information. You can just do these things and it's going to work.

The great news is that, from the DORA research that we've done, that is true. It does work. The curve goes up and to the right. That's a good curve. That's how curves are supposed to work. It's fantastic.

But then it gets a bit weird. The curve does sort of plateau off, becomes some kind of J-curve in this space.

Luckily, we've also seen this before. So we know that for DevOps adoption, SRE adoption, the same problem is occurring. You start off well. Everything goes in a good way, but then you find new problems. The automation introduces new issues. The SREs suddenly start having to really deal with the next set of problems that they've almost created or uncovered.

So this plateau is often very frustrating. You've made quick wins, you've made quick progress, and you're like, "Maybe we should just give up on SRE now." Don't do that. This is good. You are making progress. It won't always feel like progress, but give it time.

Steve McGhee

Yeah. You want to make it out to the other side of the J-curve. Don't give up too early.

Has anyone heard of platform engineering? Has that come up at all? We agree, and we think this is a good idea.

I like to think of a platform as an abstract set of capabilities. You can buy them. You can build them. You tend to not get them all at once. You sort of develop them over time. Then think of your platform as one thing, and then the services that are adopted onto the platform as another thing.

You can bring the services on at different times. You don't have to put them all in at the very beginning, and then adopt and adapt your platform capabilities as well.

One thing that you don't want to do, and I'll show you a graphical example in a second, is to pick the biggest, baddest, most awesome thing first and put that on the brand-new platform that we're not really sure how it works yet. You can do it. I mean, you can do it, but—

James Brookbank

It's going to hurt the whole time.

Steve McGhee

I don't recommend it.

So this is a graphical way of thinking about your platforms over time. You're adding capabilities to your platform. CI/CD would be an example of, well, two capabilities, I guess. Rollbacks. You can think of small capabilities like canary releases. We want to be able to do canary releases. That itself is a capability.

As you're building these capabilities, start adding these services onto this new platform, and start with the low-risk ones. Start with the ones that don't need all of those features first.

Once you're happy with that, once you've gained some confidence in those capabilities, like you know how to do the canary releases without messing it up and how to roll it back, that's when you bring in the big guns. That's when you put the billing service, or the thing that makes the money, onto your platform.

You do not want to do this one. You do not want to put the money-making service onto the platform too early before it has the capabilities that you actually need to survive. This one is the bad plan.

James Brookbank

It's the bad plan.

Steve McGhee

This one is the good one. Good one, bad one. I think we get it.
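The good plan and the bad plan can be expressed as a simple readiness check: a service goes onto the platform only once the platform has every capability that service needs. The sketch below is hypothetical throughout. The service names, capability names, and functions are made up for illustration and are not taken from the slides.

```python
def missing_capabilities(required: set[str], platform: set[str]) -> set[str]:
    """Capabilities a service needs that the platform does not offer yet."""
    return required - platform


def ready_to_onboard(required: set[str], platform: set[str]) -> bool:
    """A service is ready only when nothing it needs is missing."""
    return not missing_capabilities(required, platform)


def onboarding_plan(services: dict[str, set[str]], platform: set[str]) -> list[str]:
    """Names of the services that can safely move onto the platform today."""
    return sorted(
        name for name, required in services.items()
        if ready_to_onboard(required, platform)
    )


if __name__ == "__main__":
    # Hypothetical platform that has CI and CD but no canaries or rollbacks yet.
    platform = {"ci", "cd"}
    services = {
        "internal-wiki": {"ci"},                           # low risk, few needs
        "billing": {"ci", "cd", "canary", "rollback"},     # the money maker
    }
    print(onboarding_plan(services, platform))
    print(missing_capabilities(services["billing"], platform))
```

The point of the check is the ordering, not the code: low-risk services build confidence in each capability before the billing service depends on it.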

James Brookbank

So this is a graphical discussion around platform engineering. I hope that works.

The good piece of that is that SREs are naturally building platforms. When they do toil reduction, when they automate things, they're building these platforms for you. Just let them do that.

While you're doing that, avoid some things. Avoid making your SREs developer support. I've had the call to my team: "The printer is broken." That's not what the SREs do. That's not it. If that's happening and you're printer support, you're doing it wrong. Stop doing that.

The other problem here, in many ways as well, it's okay for prod to be broken. It's a combo. It's a DevOps problem in many ways. That should be all of your teams working together in this space. Don't just throw stuff over the wall.

It's going to take a while. The J-curve is real. If you want to start somewhere, start with your incidents. You already know what your incidents are. You are already having outages. Go find that team. Go ask them what you need to start on, and keep your feedback loops running.

Cause-based and symptom-based. That's kind of fun.

Steve McGhee

Yep. Cause-based is just kind of an abstract way of talking about things that we can measure that represent how the customers are actually feeling.

So if you've heard of SLOs, those are a great example of a symptom-based metric.

If we have a cause-based alert, which is like, "The disks are too full," or "The computer is too hot," we're making a guess as to what we think is causing a problem. That may not actually be a real problem.

If instead we're listening to the symptoms, like from an SLO, we know that there's something wrong and we know that we're not in false-positive land.

So we want to make sure that you start with the symptoms, and then you just dig down to find the cause instead of going the other way around. Sorry, I flipped those two words.
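With the two words the right way round, as Steve corrects himself, the symptom-based signal is the one tied to what customers feel, and the cause-based one is a guess about the machinery. A minimal sketch of the difference follows, assuming a simple request-based availability SLI. The class, function names, and the 90% disk threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over some measurement window."""
    total_requests: int
    good_requests: int


def availability_sli(window: Window) -> float:
    """Fraction of requests that succeeded from the user's point of view."""
    if window.total_requests == 0:
        return 1.0  # no traffic means no user-visible failures
    return window.good_requests / window.total_requests


def symptom_alert(window: Window, slo_target: float) -> bool:
    """Fire when users are measurably worse off than the SLO allows."""
    return availability_sli(window) < slo_target


def cause_alert(disk_used_fraction: float, threshold: float = 0.9) -> bool:
    """Fire on a suspected cause, whether or not any user is affected."""
    return disk_used_fraction > threshold


if __name__ == "__main__":
    healthy = Window(total_requests=1000, good_requests=1000)
    # The disk is nearly full, yet every request succeeded.
    print(cause_alert(0.95), symptom_alert(healthy, slo_target=0.999))
```

In the example at the bottom, the cause-based alert fires while the symptom-based one stays quiet, which is the false-positive case described above.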

James Brookbank

That was wonderful. Learning.

Steve McGhee

We're learning from this incident.

James Brookbank

Learning as we go.

So that's happening on the ground. That's all your grassroots stuff, which is great. But at the top, we've got to do something as well.

We've never seen a chief reliability officer. That doesn't exist. We just made it up. But you need someone with a seat at the table. Shout-out to Mark.

Who has the ownership of those strategic reliability decisions? Money. They have money. You might have one. It might be the VP of engineering or the COO. But if you don't have executive sponsorship, it's like a security team with no CISO. It works, but you're going to keep wondering why there's no CISO, why there's no one really owning this stuff.

In order to do SRE properly, to treat it like an investment instead of a cost center, do all of that good stuff, there's got to be some executive sponsorship. Grassroots only goes so far in this space.

Steve McGhee

Losing the reliability officer is also bad. We've seen that several times with customers we worked with, where they had a person who was doing exactly this job. It was going great. They left. They went to another company. What happened? The team imploded.

James Brookbank

Yes.

Steve McGhee

So this is a very serious outcome that you don't want to happen.

How do you know if it's working? Unfortunately, we don't have the perfect metrics for reliability. We don't really know. We can't give you the perfect graph on if your system is reliable or not.

We have a lot of proxy metrics that you can use to understand. Sure, you can measure availability. That's not always the only metric that matters, though.

Here are a few that may not be obvious. Are you able to enforce consequences for exhausting your error budget? That is, do you change your behavior when you're having trouble with your system, or do you just hope that it's still going to keep getting better? Being able to change under stress is an important characteristic of "this is starting to work well."

Are you still praising heroes? We heard about this earlier this morning from Christophe. You shouldn't be having heroes. Your system shouldn't depend on heroes. If you're still having that scenario where the heroes are being praised, maybe that's not really working.

Are you funding your SRE team after every outage? That's actually not a good sign. That means that you're being entirely reactive. You're not being very proactive.

And finally, are you actually celebrating success? Do you just expect that it's always up all the time, and no one should be surprised that it's up? Or are we impressed that it's up quite as much as it is? We want to celebrate the fact that the system, as complex as it is, is continuing to work.
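The first of those signals, consequences for exhausting an error budget, can be made concrete. The sketch below assumes a request-based SLO. The 25% "slow down" threshold and the policy names are invented for illustration, and an actual error budget policy is something each organization negotiates for itself.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        return 0.0
    return 1.0 - failed_requests / budget


def release_policy(remaining: float) -> str:
    """Hypothetical policy: behavior changes as the budget runs out."""
    if remaining <= 0:
        return "freeze"   # stop feature launches, work on reliability
    if remaining < 0.25:
        return "slow"     # smaller, more careful rollouts
    return "normal"


if __name__ == "__main__":
    remaining = budget_remaining(slo_target=0.999, total_requests=1_000_000, failed_requests=900)
    print(release_policy(remaining))
```

What matters for the proxy metric is not the thresholds but whether anything actually changes when the function returns something other than "normal".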

James Brookbank

All right. We're nearly there. Keep going. Keep with us.

This is how we think that success goes. We go through the innovation curve. The early adopters are good. We get to the chasm and we cross that. And then the majority are kind of with us in this space.

But no, this is going to hurt. The chasm in enterprises is a very different place. This is a really tough environment. So be prepared.

As you introduce SRE, things are going to go well to start with, and then you're going to need to do a lot of work, keeping that flywheel going, making sure that this is happening. Crucial.

Steve McGhee

One of the ways to get you through that chasm, to get you through that J-curve, is to make sure that your culture is helpful. It's in the generative way.

How do we get there? It's not the free food and the ping-pong, I'm afraid.

James Brookbank

You can have free food.

Steve McGhee

You can have free pong. That's not going to solve the problem.

Instead, though, you should check out Project Aristotle. It is on the Google re:Work site. It's been there for years. You should check it out. It's a bunch of published methods. Think about the things on the slide here. Psychological safety is the number one thing to think about within your teams. If you're not sure what these are, there's an entire site describing it all.

James Brookbank

And if you're not sure what culture is, it's that. Go look it up.

Steve McGhee

Yeah. Other things to look out for, you can just read here, but the number one thing with SRE, in my opinion, is that you want to aim for sublinear scaling.

You want to make sure that you don't need 10 times as many operators if your business gets 10 times as big. That's the concept of sublinearity.
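One way to check for sublinearity is to compare growth factors: headcount should grow by a smaller factor than the load it supports. The sketch below is an illustration only. The power-law projection and its 0.5 exponent are assumptions chosen for the example, not a figure from Google or from the report.

```python
def growth_factor(before: float, after: float) -> float:
    """How many times larger 'after' is than 'before'."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after / before


def is_sublinear(load_before: float, load_after: float,
                 operators_before: float, operators_after: float) -> bool:
    """True when the operator count grew by a smaller factor than the load."""
    return growth_factor(operators_before, operators_after) < growth_factor(load_before, load_after)


def projected_operators(current_operators: float, load_growth: float,
                        exponent: float = 0.5) -> float:
    """Headcount under an assumed power law; any exponent below 1 is sublinear."""
    return current_operators * load_growth ** exponent


if __name__ == "__main__":
    # The business grows 10x; did the operations team also grow 10x?
    print(is_sublinear(100, 1000, operators_before=10, operators_after=30))
    print(is_sublinear(100, 1000, operators_before=10, operators_after=100))
```

The gap between the two curves is what pays for the software engineers who write the automation.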

You want to make sure that your teams are able to move around within your company as well. For example, inside of Google, if you are an SRE and you want to move to software engineering, that's totally reasonable. It's not a demotion. It's not a promotion. People move back and forth all the time.

James Brookbank

Crucially, your money is the same.

Steve McGhee

Your money stays the same.

James Brookbank

Pay your devs the same as your ops.

Steve McGhee

That's right. Take us out, James.

James Brookbank

Hopefully this helps you on this journey. If you didn't pay attention, if you were on your phone, we have books. Please read the books. Give them to your boss. Go download them from the website.

What's missing from a lot of our SRE guidance is the vision, the reason why we do these things. It's not the minutiae and the tooling. All your SREs know how to do that. They're already aware of how to do all of that stuff.

They need vision and inspiration. They need to connect to this journey and understand what's happening outside of Google. We've seen it work.

Steve McGhee

That's right. That's it. Thanks very much.

James Brookbank

You're welcome. Thank you very much.