Las Vegas 2022

ProdEx - Google's Production Excellence Program

ProdEx is Google SRE's flagship program for production health and operational risk. It has been running for 7+ years across the SRE organization. SRE directors assess the health of SRE teams and provide coaching in an interactive review setting based on production metrics and business context.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

So this week you'll see two prominent themes in the talks. The first is definitely site reliability engineering. The second is about how we organize to best achieve the outcomes that we want. So in many ways the next talk is about both.

There's so many things interesting about site reliability engineering principles and practices, which Google pioneered all the way back in 2003. I think it's one of the most incredible examples of how one creates a self-balancing system that helps teams go to market quickly without jeopardizing reliability and correctness of the services they create.

So for over a decade, I've wanted to better understand why Google chose a functional orientation for their SREs. To this day, thousands of SREs at Google are in one organization reporting to Ben Treynor-Sloss, VP of 24x7 Engineering, which includes SRE very purposely outside of the product organizations.

So the next speaker is Dr. Christof Leng, SRE Engagements Engineering Lead. Over the years, he's managed and worked on various parts of Google services, including Cloud, Ads, and internal developer tooling. And over the pandemic he spoke with Dr. Jennifer Petoff, Global SRE Director of Education, and I learned from him personally so much about how Google SRE leadership interacts with Dev leadership in this sort of functional organization. And so he is presenting today on how he's helping his organization ensure that they all have excellent SREs who can help the customer succeed, and how SRE fosters production health across Google's fleet of services. Here's Christof.

Christof Leng

Thanks for having me. Hello everyone. Thank you.

I'm going to talk about ProdEx, Google's Production Excellence Review program, which helps us to manage operational risk and promote best practices across Google SRE. But why are we doing it? To explain it, I have to talk a little bit about organizational structure first.

SRE at Google is a central specialist organization with its own reporting hierarchy and organizational structure. But it is matrixed into the individual product areas, the business units, to support them, to align with them, much like a lot of other specialist roles at Google as well.

Well, this is the classic matrix model, and it comes with challenges. You need to align on the vertical, on the business alignment. The SREs in the individual teams need to understand the systems and the business needs in that area.

But also, SRE is a community, a community that learns from each other, that builds platforms together, that establishes standards and promotes best practices. So we learn from each other. And to be able to have our full potential, we need to encourage that exchange across SRE, which is a very large organization by itself.

And there is a lot to say about the business alignment, about the system reviews, and I hope I can talk about this another time. But today I'm going to talk about this other dimension, the horizontal alignment, the promotion of best practices, standards, and mutual understanding.

Now the goals of the ProdEx program are to drive these operational best practices that we have, that are constantly evolving, and production health across all of SRE and, by extension, across all of Google's products.

And the idea is to assess the main risk areas for the SRE-owned production services to see, like, what can we do about them? How can we manage them? How can we mitigate them? And especially to identify individual SRE teams that may need more help. There are always some hot spots. They have challenges, and it's a very dynamic situation.

But it is not an audit. It's not a compliance exercise. It's a coaching opportunity for these SRE teams to learn from more senior leaders and to understand better how they are doing and what they're doing. But it's also an opportunity for these reviewers to get cross-SRE visibility and awareness, and bubble that up to SRE leadership to inform the overall SRE strategy.

What it is not is the business alignment, the vertical thing, nor the compliance thing, here are the policies that you need to follow. It is not there to criticize. It is to help and to add perspective.

So how do we do that? First of all, a lot of these review programs that probably all of you do in one form or another are kind of unstructured. Everybody has their own slide template, and these evolve, and every team does it a little bit different. So these things are not repeatable. Everybody is using a slightly different set of signals.

And that really prevents us from using the reviews as a data source to bubble information up. And it also is a lot of overhead and preparation for the individual teams. It's not uncommon that a team spends multiple days, or even more than a week, to prepare such an unstructured review.

And we don't want that. We want shared metrics, and we build dedicated tooling that actually automates the data collection. You can still tweak it if our automation may have missed some of the systems that you're responsible for.

And then we also want to apply context from the team. We are a data-driven company. Data is extremely important, objectively measurable data especially. But the context, the perspective of the team, and the business context also matters. So there's plenty of room for them to provide speaker notes, to annotate, to explain why the data looks the way it is.

And then there are two senior reviewers per review session. They are typically directors or principal engineers. And they review the findings together with the team. And important here is they do not typically come from the same area. They're not from the reporting chain. They're not the bosses. They are senior leaders from elsewhere in the organization. So they can provide an outside perspective, and it's less intimidating.

And generally we aim for every team to get reviewed at least once per year. But if we see a low score, if we see a lot of risks in a team, the teams get automatically scheduled more often. So we keep tabs on those and try to help them, and make sure that they dig themselves out of that hole.
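To illustrate the scheduling rule just described, here is a minimal sketch. The talk only says "at least once per year" and "more often" for low-scoring teams, so the score bands, the day counts, and the function name `next_review_date` are assumptions made for this example, not details of the ProdEx tooling.

```python
from datetime import date, timedelta

# Hypothetical cadence: only "at least once per year" comes from the talk.
# The shorter interval for low-scoring teams is an assumed value.
REVIEW_INTERVAL_DAYS = {"high": 365, "medium": 365, "low": 120}


def next_review_date(last_review: date, score_band: str) -> date:
    """Return the latest date by which the team should be reviewed again."""
    return last_review + timedelta(days=REVIEW_INTERVAL_DAYS[score_band])
```

With these assumed values, a low-scoring team is always due for its next review earlier than a high-scoring team reviewed on the same day.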

And talking about all of these things is nice, but the review itself is only the starting point. The real goal is actually to identify actions and to track these actions and make sure that things actually change, things improve. So that's an inherent part of the program, to generate action items in areas where we see the need for improvement. And then in a next review session, to review these together with the team and review the progress on those.

So all of that might sound nice as an idea, but can we actually show that it works? And to give you a little bit of context, the program started seven years ago. There are over a hundred SRE teams signed up now. We do not centrally force teams into the process. The individual product area leads, the directors, actually sign their teams up because they see value in that, because they see value for their teams.

And over these years there are over a thousand reviews that were conducted, and over 40 different reviewers have participated. But that's output. That's not outcomes. I will talk about the outcomes later, but to better understand them, let me dive into more detail on how the program actually works, what is being discussed in review.

Well, there are six areas. The first one is team information. It's typically a quick one. It's just to make sure that the team actually has a purpose and understands this purpose, has its scope clearly defined, and has a plan to work to its mission. Every SRE team is expected to have a charter, with the charter being signed off and up to date, and a roadmap on how to work towards that charter. If you don't have that, if you cannot clearly articulate what your purpose is, then a lot of other problems will ensue.

Second, on-call health. ProdEx is a lot about operational risk, and pager fatigue is real. It's a huge risk for an SRE team, because operations is a means to an end to an SRE team. The actual work is engineering projects, but if you spend too much time on operations, responding to incidents, you will not have time and energy to do that.

So we look at how many incidents do you have? How many of these incidents are actually alerts on something that you cannot do anything about, because, I don't know, the network is down, you cannot reach the database, so, well, still you might get paged and distracted from your engineering work. Also, how noisy is your alerting? If every time a single thing goes wrong a hundred alerts fire, that is a problem. What is the staffing? Do you have enough engineers that you are not on call every other week? It's important to make time for actual project work.
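The on-call questions above can be turned into a few simple ratios. This is a sketch under stated assumptions: the `OncallStats` fields, the `oncall_summary` function, and the one-week shift length are hypothetical and are not taken from the ProdEx tooling.

```python
from dataclasses import dataclass


@dataclass
class OncallStats:
    incidents: int       # incidents in the review period
    non_actionable: int  # incidents the team could do nothing about
    alerts_fired: int    # total alerts fired in the period
    rotation_size: int   # engineers sharing the on-call rotation


def oncall_summary(stats: OncallStats) -> dict:
    """Derive simple health ratios from raw on-call counts."""
    incidents = max(stats.incidents, 1)  # avoid division by zero
    return {
        "non_actionable_fraction": stats.non_actionable / incidents,
        "alerts_per_incident": stats.alerts_fired / incidents,
        # Assuming one-week shifts, each engineer is on call every N weeks.
        "weeks_between_shifts": stats.rotation_size,
    }
```

A high `alerts_per_incident` value points at noisy alerting, and a `weeks_between_shifts` of 2 corresponds to the "on call every other week" situation mentioned above.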

But we also look directly at the project work, and we look at that through the OKR completion rate. How many OKRs does the team have? Does it actually do proper goal planning? And how many of these have a very low score, say below 0.5? If you start a lot of things but don't finish them, it's not helpful. You need to focus better.
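As a small illustration of the OKR signal, the following sketch counts how many OKRs a team has and what fraction scored below 0.5. The 0.5 threshold is the one mentioned in the talk; the function name `okr_summary` and the shape of its result are assumptions for this example.

```python
def okr_summary(scores: list[float], low_threshold: float = 0.5) -> dict:
    """Summarize OKR scores (each between 0.0 and 1.0) for one team."""
    if not scores:
        # A team with no OKRs at all is itself a finding worth discussing.
        return {"count": 0, "mean_score": None, "low_fraction": None}
    low = sum(1 for s in scores if s < low_threshold)
    return {
        "count": len(scores),
        "mean_score": sum(scores) / len(scores),
        "low_fraction": low / len(scores),
    }
```

A large `low_fraction` would suggest the pattern described above: many things started, few finished.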

Also, how does this compare to other operational toil work that might not be an actual outage, not an incident, but like a ticket queue that you have to work through? And what about tickets that turn out to be actually projects in disguise? Do you have a policy on how to put them into your backlog and not disrupt your actual project planning? If all of these things are working, then we are likely to have an SRE team that can actually deliver valuable engineering work.

Another area that we look at is how service level objectives are managed. And it's not only about everything being green. It's actually: do you have SLOs defined for all of your critical aspects? And are they signed off by your stakeholders? Do you have rationales for them? And a rationale might just be, well, that is the historical performance of the systems. We do not know any better. That's not great, but it's at least honest.

And if you put in a magic number, like, it always has to be 77, then future generations of engineers will work very hard to make it so, not understanding that you just had no better idea. So please write down all the rationale.

And do you actually measure them, and are they working? And if they are not, do you write postmortems? And do the postmortem action items actually get resolved? Because if you have outages and nothing changes, you will have more outages and you will not deliver sustained value.
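The SLO questions above amount to a checklist per SLO. Here is a minimal sketch of such a checklist; the `Slo` record, its field names, and the `slo_gaps` function are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    target: float     # e.g. 0.999 for "three nines"
    rationale: str    # why this target was chosen
    signed_off: bool  # approved by stakeholders
    measured: bool    # actually monitored in production


def slo_gaps(slo: Slo) -> list[str]:
    """List which review questions this SLO does not yet answer."""
    gaps = []
    if not slo.rationale.strip():
        gaps.append("missing rationale")
    if not slo.signed_off:
        gaps.append("not signed off by stakeholders")
    if not slo.measured:
        gaps.append("not measured")
    return gaps
```

Note that a rationale such as "historical performance of the system" passes this check, which matches the point above: an honest rationale is better than an unexplained number.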

Now another area that's very important is data integrity. It's something that's easily overlooked because as long as you don't lose any data, nobody notices that you don't have a proper restore plan. Well, when you do notice, it's kind of too late. So it's important to talk about these things up front.

And first of all, it's important to identify what business-critical data sets do you actually own. What data do you not want to lose? And for these data sets, do you have data integrity plans where you explain why you need to get them back, how you need to get them back, what are the constraints, how quickly do you need to get them back, how much data is it? And again, get this signed off by your stakeholders.

And it might just be like, this data set is very large, but it's generated, so it doesn't make sense to back it up. It would only consume resources, and we can just as easily and sometimes even more quickly regenerate it. Write it down, get it signed off, so everybody understands that.

But if you need to back it up, also restore-test it, because if you back up things and never restore them, by the time that you need to restore them, it might turn out you can't. And you can do this manually, but it's not really necessary. You actually want to have automation and do this on a frequent basis, so you detect any kind of change, any kind of regression early on that breaks your backups.
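An automated restore test can be very small. The sketch below restores a backup into a scratch directory and compares checksums with the source. It assumes single-file backups and uses a plain file copy as a stand-in for a real restore step; the function names `sha256_of` and `restore_test` are made up for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_test(source: Path, backup: Path) -> bool:
    """Restore `backup` into a scratch directory and compare it to `source`."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / source.name
        shutil.copy2(backup, restored)  # stand-in for a real restore procedure
        return sha256_of(restored) == sha256_of(source)
```

Run on a schedule, a check like this would surface a broken backup as soon as it stops matching, rather than on the day the restore is actually needed.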

And last but not least, capacity planning. You both want to make sure you're not wasting machine resources, but also you're not wasting engineering time on over-optimizing your capacity management. So it needs to be size-appropriate. You need to look at the utilization. You need to make sure that you have alerting, because running out of capacity almost always means that your service is down. And you can also look at how often does the team manually adjust resource allocation, because that is a sign of poor planning and room for improvement.
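The two capacity concerns above, wasting resources and running out of them, can both be expressed as a utilization check. This is a sketch only: the 30% and 80% thresholds, the status strings, and the function name `capacity_status` are assumptions, not values from the talk.

```python
def capacity_status(used: float, provisioned: float,
                    low: float = 0.3, high: float = 0.8) -> str:
    """Classify utilization as over-provisioned, ok, or close to exhaustion."""
    if provisioned <= 0:
        raise ValueError("provisioned capacity must be positive")
    utilization = used / provisioned
    if utilization >= high:
        return "alert: near exhaustion"
    if utilization < low:
        return "over-provisioned"
    return "ok"
```

For example, `capacity_status(90, 100)` falls in the alerting band, while `capacity_status(10, 100)` flags wasted machine resources.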

So that is what's being discussed in the ProdEx review. And how does this impact our work?

So first of all, in the first year that we did these reviews, only 23% of all reviewed teams were high-scoring. And over the years that has increased to 66%. And at the same time, the fraction of these teams that were scoring low, that were at risk, that had actual problems, dropped from 44% to a staggering 9%.

Now you could argue that maybe just the review has gotten soft. But we can actually also see this by the underlying metrics. For example, the pager load, the incident rate, dropped by 34%.

And looking at the results over these many years, from statistical analysis, we see that actually data integrity is the most predictive section of the overall score. If you're doing poorly in data integrity, it is unlikely that you are a well-performing team. And if you are doing well there, you probably are. That's correlation, but it shows you how important it is to have a good grasp on data integrity.

And last but not least, because we ran so many reviews and we invested heavily in automation, we were able to save thousands of hours of leadership time, both from the managers and tech leads of the teams being reviewed and the reviewers via the automated review preparation.

And I would argue that this is a critical ingredient in the program actually being successful: that it was able to be adopted by so many teams, that it was able to scale, and that the review fatigue is not so big that the program breaks down. Because if every reviewed team complained, like, why do we have to do a week of preparation, then their leaders would probably remove them from the program. But instead we still see more and more teams signing up to the program.

So what do our stakeholders and our users say? So Ben Treynor-Sloss, who founded SRE and is still our VP, says it's one of his most important bits of telemetry for him and his leadership team about the health of SRE teams. So being able to aggregate information up informs leadership about the strategy, about widespread risks.

Jessica, who is an SRE director for networking at Google and one of our reviewers, says it's one of the fastest mechanisms to build insight into the challenges and best practices of SRE teams from all corners of the company and share that knowledge back into the organization. So as a reviewer, which is a lot of work, it's still valuable for them because they get a chance to see teams from very different parts of the organization and learn from them for their own organization.

And Philipp, one of the managers of an SRE team, says ProdEx helped us to keep track of our operational risk and provided valuable mentoring for a long-term strategy, because it really helped his team to identify a fundamental problem they had with their strategy, something that they overlooked, and it gave him a lot of homework to think about and really restructure the team around that.

So that is it for me. The thing that I would really like to learn from the community is, how would you assess operational risk health of SRE teams? And what are the metrics that you would look at? I know that the metrics that we have are not perfect. I know of some gaps that are kind of obvious, but there are other areas that I might not even have thought of, and I would love to hear what you think. What really is the health of an SRE team? Thank you so much.

Host Outro (Gene Kim)

Thank you, Christof. A testimonial from Treynor-Sloss himself. That's awesome.