Google SRE Interview

GeneCon 2023

Jennifer Petoff, PhD

Director, Google Cloud Platform and Technical Infrastructure Education · Google

Christof Leng, PhD

SRE Engagements Engineering Lead · Google

Gene Kim

Author, Researcher & Founder · IT Revolution

Jennifer Petoff and Christof Leng, senior leaders in Google's Site Reliability Engineering organization, dig into the real mechanics of how SRE operates at scale — from negotiating service level objectives with product teams to managing on-call load, deciding which services deserve SRE support, and handling security incidents in production. They push back on common misconceptions, including the idea that 50% ops time is a target rather than an upper bound, and explain why SRE scarcity is a deliberate design choice that preserves the discipline's value. Petoff and Leng draw on their combined decades of experience at Google to describe what healthy SRE engagement actually looks like versus the dysfunctional patterns that waste everyone's time. In this talk, you'll learn how Google structures SLO design and violation response, how SRE teams decide which products to support and when to exit an engagement, and what capacity planning looks like when you are operating at planetary scale.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Okay, next up is Dr. Jennifer Petoff and Dr. Christof Leng, both leaders in the Site Reliability Engineering program at Google. SRE was famously created at Google in 2003 by VP of 24/7 Engineering Ben Treynor Sloss, who very purposefully built the SRE organization outside of the product engineering organization.

It's so fascinating to me to see the backgrounds of SREs, who I admire, and how many of them have PhDs. Many come from outside technology, but they're all systems thinkers. In fact, Jez Humble, author of the Continuous Delivery book and co-author of the DevOps Handbook, is also now an SRE supporting Google Cloud Run and other products. I just can't overstate how much I've learned from both Dr. Leng and Dr. Petoff.

So, could you quickly introduce yourselves, Jennifer, Christof, and describe your role? We have a bunch of exciting questions queued up.

Jennifer Petoff

Sounds good. Hey, Gene, and hey everyone. I'm Jennifer Petoff, and I'm currently Director of Google Cloud Platform and Technical Infrastructure Education at Google. I've been at Google for almost 17 years now. I'm currently based in Lisbon, Portugal, spent a lot of time in Ireland. I'm from the US originally, and I'm probably most well known for spinning up the SRE education program at Google. Education programs are my specialty, and I'm one of the co-editors of the SRE book that we wrote back in 2016.

Christof, would you like to say hello?

Christof Leng

Thank you, Jennifer. Hi everyone. I'm Christof. I am based in Germany. I've been with Google for about nine years, and I lead a horizontal group in Site Reliability Engineering for central programs: the Production Excellence program that I talked about at DevOps Enterprise Summit before, and the SRE engagement model, how SREs collaborate with the dev teams at Google. Very excited to be here today. Thank you for having us.

Q&A

01 How do you negotiate SLOs with the business, and what happens when targets are missed?

Gene Kim: Thank you. We queued up a bunch of questions that people have been dying to ask you, that Jason Cox has been accumulating for years. The first question is: how do you negotiate these service level objectives with the business? And let's be honest, what really happens when you miss those targets? Can you give us some insight in terms of what those conversations sound like and what actually happens as a result?

Jennifer Petoff: Christof, I think you're going to take this one.

Christof Leng: Thank you. I think that's a great question. I can describe the optimal case, how everything should go. The product side already has an idea of what the quality requirements of the products are and understands that SLOs can help them with that, and approaches SRE to help them with designing that. SRE has a lot of experience in that space and can do the monitoring and reporting of those.

Typically good SLOs are end-to-end from the user's perspective. The users don't care about the systems. They care about what they see on their screen, or however they interact with the product. These SLOs should be monitored in real time, and any substantial violations should cause an incident, where the primary responders, whether they are on the dev team or the SRE team, look into this and try to mitigate the situation, because SLO violations actually measure user pain if they're well designed.

Leadership typically reviews the compliance over longer periods of time. If the SLOs are missed, then you shift from feature development to quality improvements.

However, it's not always optimal. Often we see that the product side doesn't have a clear picture on the quality requirements. Product managers often think in features, not in terms of availability or response time or throughput or these things. Often people take an easy approach of defining SLOs around their systems, what's easy to implement, and then it doesn't actually correlate with the user pain.

Then we also see, on the reaction side of things, behavior on one side where people completely ignore the SLOs because they don't actually monitor them actively, and then that's kind of a waste of time. Or they overreact and kind of freeze everything, and the UX developers are not allowed to move buttons on the UI because the backend is slow. That's not helpful either, especially since then you accumulate a lot of feature backlog, and then you will launch a lot of things in short succession or as one big hairball, and then you run right into the next SLO violation.

The assumption here is that, as the resident SRE, you can help those product managers navigate the path to balance agility and reliability and raise awareness of how it actually can help them, because no product manager likes it when their customers complain about poor quality. It keeps them from launching new features.

Gene Kim: Yeah, totally. You're right. That's so good.
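
For illustration, the review loop Christof describes (measure compliance against a user-facing SLO, then shift from feature work to quality work when the SLO is missed) can be sketched in Python. This is a minimal sketch and not Google's tooling: the `SloWindow` shape, the function names, and the 99.9% target used in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one review window (hypothetical data shape)."""
    total_requests: int
    good_requests: int  # requests that met the user-facing success criterion


def compliance(window: SloWindow) -> float:
    """Fraction of good requests; treated as 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return window.good_requests / window.total_requests


def budget_spent(window: SloWindow, slo_target: float) -> float:
    """Fraction of the error budget consumed; values above 1.0 mean overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * window.total_requests
    actual_bad = window.total_requests - window.good_requests
    if allowed_bad == 0:
        return 0.0  # no traffic, so nothing was spent
    return actual_bad / allowed_bad


def recommended_focus(window: SloWindow, slo_target: float) -> str:
    """Missed SLO -> quality improvements, otherwise feature development."""
    if compliance(window) < slo_target:
        return "quality improvements"
    return "feature development"
```

With an assumed target of 0.999, a window of 1,000,000 requests with 999,500 good ones has used about half of its error budget and stays on feature development, while a window with only 998,000 good requests has overspent the budget and shifts to quality improvements.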

02 Is 50% operations work the target for SRE teams?

Gene Kim: The SRE community talks a lot about, ideally, SREs are spending 50% on operations, 50% on improvement activities. Is that right? And if so, what does that look like in a regular week?

Jennifer Petoff: Gene, I think it's important to call out that the 50% ops threshold is an upper bound. It's not a target. I was talking to Christof about this earlier, and you could think of it like an alerting threshold. You ideally want to be well below this alerting threshold. Most teams have a much lower ops percentage.

The goal here is to do enough ops to really maintain that wisdom of operations while ensuring there's enough time to focus on things that really matter, the value-added, engineering-driven projects.

In terms of how we monitor these things or pay attention to these things, on a weekly basis different service teams will have their production meetings so they can be aware of what sort of ops load they're under and whether things are getting critical. Cross-service, we do have a more general ops review, which looks at the state of production more generally, more of a grassroots effort that tends to be based in different SRE sites at Google. Also, our leadership runs a process called ProdEx, which looks at key indicators. So leadership is paying attention to this as well, including whether there are hotspots in terms of ops load and toil.

In most teams, there's one person who's on call per site. We have geo-distributed on-call shifts, so they're handling ops for a given period of time while the rest of the team is really focusing on their engineering projects, and/or background ops tasks that might be occurring, like service turnups, cluster moves, things like that. But again, we try and give people enough room to focus on the stuff that matters, basically.

Gene Kim: By the way, Jennifer, I apologize for offending your sensibilities by the 50% ops work.

Jennifer Petoff: I'm not offended. We just want to make sure that we set the record straight. We set a high bar. We want to make sure that there's enough ops, like I said, but we don't want the team to be crushed under that load, basically.
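
Jennifer's framing of 50% as an alerting threshold rather than a target can be made concrete with a small Python sketch. Everything here is hypothetical apart from the 50% figure itself: the team names, the hours-per-week input format, and the function names are assumptions, and this is not how ProdEx or any Google review is implemented.

```python
from __future__ import annotations

OPS_UPPER_BOUND = 0.50  # from the talk: an upper bound, not a target


def ops_fraction(ops_hours: float, total_hours: float) -> float:
    """Share of a team's working time that went to operations work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours


def find_hotspots(
    weekly_hours: dict[str, tuple[float, float]],
    upper_bound: float = OPS_UPPER_BOUND,
) -> list[str]:
    """Return teams whose ops share exceeds the upper bound, sorted by name.

    weekly_hours maps a team name to (ops_hours, total_hours) for the week.
    """
    return sorted(
        team
        for team, (ops, total) in weekly_hours.items()
        if ops_fraction(ops, total) > upper_bound
    )
```

A team sitting exactly at 50% does not trip the check in this sketch, but it is not "well below" the threshold either, which is the state Jennifer says most teams aim for.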

03 How does Google decide which teams get SRE support?

Gene Kim: This is something that Jason Cox from Disney talked about yesterday, around SRE talent and skills being so scarce, and supply being outstripped by demand. He talked about how, as much as Jason is a nice guy and wants to help people, he's very careful about committing to helping a certain team and is taking on a more strategic perspective: is this going to be a long-term client? Is this going to help achieve objectives? Can you talk about what is the process to decide which teams to support? We've written about above the line, below the line, but can you tell us in more detail what that process looks like as people are clamoring for SRE support?

Jennifer Petoff: I guess the rule of thumb is, at least at Google, SRE is a scarce resource by design. It helps promote that healthy tension. It makes sure that the work that they're doing is really valued, and also that we're supporting the most mission-critical services and not just holding a pager or doing ops. The goal is to really focus on the largest, most business-critical systems.

Those could be revenue-generating services, like ads types of services; in our case, enterprise-focused services, so Google Cloud services; and various underlying infrastructure. There's also Google's broader infrastructure on which other services depend, thinking about Borg and our other infrastructure services, even SRE tools, the tools that SREs use to be able to monitor, react, and basically deal with issues as they come up.

Do you have anything to add to that, Christof, in terms of more of the specifics of how the allocations happen, or is that a pretty good summary?

Christof Leng: I like to think about an SRE engagement as just another engineering project. It should focus on the same things: what are the outcomes we want to achieve, sustainable outcomes that outlast the engagement, and what is the timeline for that?

If you have engagements that basically go on forever, where you will be on call for that service and probably do some engineering work (we don't know, not sure what), this is generally a waste of SRE resources. When you have achieved these goals or realize that you can't achieve them, it's time to revisit that engagement in the context of all other possible engagements within your scope and maybe shift your attention to something more critical and more impactful.

Jennifer Petoff: Can I ask one quick follow-up question, Gene? It sounded like Jason was almost coming at this from an angle of SREs are difficult to recruit. Is that also on the radar, or was it more of this design principle?

Gene Kim: Yeah, I think it was a combination. One is more demand for his team than he can possibly have or source immediately. Two is his desire to keep some excess capacity on the bench to do saves when they're needed, say when a certain movie can't make it to the movie theater because the mastering pipeline broke, so he could parachute some people in to save the day. I think he was expressing some degree of judiciousness about what it takes to enter into a longer-term engagement.

Jennifer Petoff: That makes sense. The reason I brought that up, I gave a talk at SREcon this October about some work that we did: basically, can you take new grads and turn them into high-functioning SRE team members? We had some great success with that when I was working in the Google Dublin office. Maybe I'll propose that at the top for the next...

Gene Kim: Oh my goodness. Yeah, next.

Christof Leng: SRE hired me, and I didn't even know what SRE was.

Jennifer Petoff: You weren't a new grad either, Christof.

Gene Kim: That's astonishing, because according to Stack Overflow salary surveys, those are among some of the highest-paid roles, and to be able to get to some level of competency with new college grads is astonishing. Let's count on that for Enterprise Technology Leadership Summit 2024. By the way, I am just sort of dying to ask, and you can just say pass, but given the fact that SREs are so in demand, what is the craziest thing you've seen someone do to try to get SRE talent on the team?

Jennifer Petoff: To try and hire someone and attract someone to join the team?

Gene Kim: Or, yeah. I'm not saying there's ever duffel bags full of cash showing up, or escalations like, I really, really, really need this, even though I'm not above the line.

Jennifer Petoff: I don't know if I have any examples that come to mind in that regard. Did you do anything crazy to get hired, Christof?

Christof Leng: No, our life is actually very boring. That is the way SREs like it. Excitement is always a bad sign.

Gene Kim: I love it. Just to clarify, have you ever seen a product owner do something outlandish, outrageous, to try to get SRE attention to help their teams?

Jennifer Petoff: Oh, okay. I see what you're saying. Like, look at what we're doing. We're going to do this dangerous, risky thing; come pay attention to us.

Gene Kim: Come save me.

Christof Leng: Honestly, I'm always grateful if the developer side says, we need SRE support, we want SRE to take a look, because the worst thing is: we better not show this to the SREs because they will have opinions about this. Let's launch this without anyone looking. This typically goes wrong.

Gene Kim: That's awesome. The dynamic is: before I do this very dangerous thing that could jeopardize the organization, let me call my friends in SRE to have them take a look and get an opinion before I actually push the button. I love that.

04 How should SRE leaders balance agility, security, compliance, and production risk?

Gene Kim: What kind of advice do you give to SRE leaders about helping them balance the need for agility, as described by the product owners, and the importance of security and compliance in production systems? I suspect that this will have something to do with the incredible error budget tools that you've developed.

Christof Leng: I think security and compliance are important topics, but they're somewhat separate from SRE. We have security experts at Google. Every major launch, not every push to production, but major launches and major changes get reviewed by security, by legal, by SRE, to make sure that the risk is well understood.

The idea is to surface these things before they happen. Generally, it's better to do smaller steps, like with any kind of software development. If you do a whole big launch where you change all kinds of security surfaces, it's very hard to really identify the risks.

In practice, during operations, when we detect a security problem, SRE often gets involved to act because we are the experts on mitigating all kinds of things. I've been in security incidents myself where we had to quickly close a security hole while the developers were working on actually fixing the code base, and then we work hand in hand with our security experts.

Gene Kim: Another reason to have SRE friends in your Rolodex.

05 How should teams think about capacity planning without Google's scale?

Gene Kim: We have time for one last question. Life must be great at Google running one of the few planet-scale networks. What advice do you have for capacity planning and scaling production systems when you don't have planet-scale networks?

Christof Leng: Most of our services run on Google's internal production platform, and that is pretty elastic. The smaller services don't really have to worry about capacity, and they shouldn't invest too much time into it, because engineering time is also expensive.

But for the bigger services, things like Google Search, Ads, YouTube, we're talking about mind-blowing amounts of resources, and they have to plan ahead of time and get the capacity ahead of time. Sometimes that means new data centers need to be built, and that is not something that you can just order online. They have specific plans for that.

They implement monitoring to get alerted if utilization reaches critical targets so that they can actually get emergency loans or free resources while they figure it out. I literally got paged (it wasn't the same team as Jennifer's) for: storage is low, you only have one petabyte left. You need to do something now.

Then it's time to talk to the program managers and the resource managers, and to figure out where we can get additional resources and figure out why the utilization has increased and what we can do about that.
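
The kind of capacity alert Christof describes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the one-petabyte default comes from his anecdote and is only illustrative, the function names are hypothetical, and the linear `days_until_full` estimate is an added simplification that the talk does not mention.

```python
from typing import Optional

PETABYTE = 10**15  # bytes, decimal definition


def free_bytes(capacity_bytes: int, used_bytes: int) -> int:
    """Remaining capacity, never negative."""
    return max(capacity_bytes - used_bytes, 0)


def should_page(
    capacity_bytes: int,
    used_bytes: int,
    min_free_bytes: int = PETABYTE,  # illustrative threshold from the anecdote
) -> bool:
    """Page when free capacity is at or below the minimum headroom."""
    return free_bytes(capacity_bytes, used_bytes) <= min_free_bytes


def days_until_full(
    capacity_bytes: int,
    used_bytes: int,
    growth_bytes_per_day: float,
) -> Optional[float]:
    """Linear estimate of days left; None when usage is flat or shrinking."""
    if growth_bytes_per_day <= 0:
        return None
    return free_bytes(capacity_bytes, used_bytes) / growth_bytes_per_day
```

The page answers "do we need to act now", while the days-left estimate is the kind of number a team might bring to the follow-up conversation with program and resource managers about where additional resources could come from.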

Gene Kim: So good. I must say, as much as I've loved the conferences that we've done, the fact that we have this opportunity to do real Q&A sessions is something that's been so unexpectedly fun and energizing. Thank you so much. I love seeing you both again, and I'm excited that we have some stuff to work on together for 2024.

Jennifer Petoff: Thanks, Gene. Thanks for having us.

Christof Leng: Look forward to it. Thank you.

Gene Kim: Thank you, Dr. Petoff. Thank you, Dr. Leng. Catch you all soon. Bye.