How to SRE Anything

Las Vegas 2025

Director, Google Cloud Platform and Technical Infrastructure Education · Google

This presentation introduces the core principles of Site Reliability Engineering (SRE) and demonstrates how they can be applied to improve reliability, efficiency, and satisfaction in any area of business and life. Drawing an analogy between software development, business operations, and everyday activities, attendees will learn how to define Service Level Objectives (SLOs), balance reliability with other priorities, embrace failure as an opportunity to learn and improve, and foster a blameless culture that encourages innovation. Through real-world examples attendees will gain practical insights into applying SRE principles to achieve success in their own work and personal lives. Join us to discover how SRE can empower you to work smarter and live better!

Full transcript

The complete talk, organized by section.

Jennifer Petoff

All right. Hello everyone, and welcome to my talk today.

First, a quick question, a little poll of the room. How many people are here today because they're SRE enthusiasts and came to this talk? Yay, some SRE enthusiasts. How many people are here because my talk did not have AI in the title? Anybody? All right, thanks very much. Full disclosure: I did sneak a little bit of AI into the talk, but that's not the primary focus today.

For those that don't know me, I'm Jennifer Petoff, and I am really excited to be here to talk about a topic that's near and dear to my heart. I am a quintessential SRE enthusiast. I've been working on the Google SRE team for over a decade now and live and breathe SRE. But what I'm going to do today is come at it from a bit of a non-traditional angle. Instead of focusing on how we use SRE principles and best practices to run services in production, I'm going to talk about how we can extend these principles and best practices to a wide range of situations in both work and life. How can we use SRE to work smarter and live better?

Here's what I plan to cover. First I'll do a quick level set to make sure we're all on the same page with respect to key SRE principles. This is where we started, our origin story from an ETLS perspective: SRE applied to software operations. From here I'm going to talk about how we can apply SRE at work, SRE for life, and SRE for our personal wellbeing, before we finish with a few key takeaways.

Starting with the foundational SRE principles, I actually like to use the software development lifecycle as a construct for explaining how you can indeed SRE anything. The TL;DR is that we need to consider both the what and the how. In the software development context, your what is your shiny product features, and the how is deploying those services to production in a reliable way that meets the needs of our users. The how in this case is essentially the SRE aspect of the software development lifecycle.

Now let's consider how we can apply these concepts to other things, like basically anything you can imagine. In order to do this, we really need to define three key things. First, what is the domain that you're working in? Then you need to think about the what, or the thing that you're trying to do or build. Finally, you need to think about the how, or basically the means to achieve the thing that you're trying to actually do.

Just to run through a couple of examples to see how this rubric works: for many years I led learning and development programs for Google's SRE and Google Cloud engineering teams. My team and I really prided ourselves on applying SRE principles to the training program itself. In this case, the domain is training programs. The thing is your shiny training content, and the means to get there is deploying a consistent and reliable training program that meets the needs of our students.

Here's another example. Let's talk about customer service, which I think is important to many of us. In this case, the domain is customer service. The thing, the what you're trying to deliver, is the customer experience. The means to get there is solving, or better yet preventing, issues in a timely manner to keep our customers satisfied.

You'll notice that in each of these examples, the focus is on satisfaction and not perfection. We're meeting the needs of our students; we're keeping our customers satisfied.

The key realization that Ben Treynor Sloss, the founder of SRE at Google, had was that 100% is the wrong reliability target for basically everything. Will your users even know if you're not 100% reliable? Do you have to hit 100% reliability to achieve customer happiness? No, you do not. If you start at 100% reliability and you demand perfection, you need to treat any failure, no matter how small it is, as an emergency. But in truth, we know that the only absolutely reliable system is one that does nothing at all, one that can never change or have anything change around it. It's existing in a vacuum, hermetically sealed from the world, and really something that's impossible to achieve in perpetuity. It demands heroics to achieve over any appreciable timeframe, which isn't sustainable. Perfection stifles innovation. If you're so worried that something's going to go wrong and feel the need to maintain perfection, you don't have time or the willingness to take the risks that lead to those step changes in improvements and novel approaches.

Another key concept that I'm sure many of you are aware of is the SLO. If 100% is not the right target, what actually is? The SLO is basically your goal for reliability, and it should be targeted to meet the needs of your users, tracking that customer happiness. The concept of an SLO can also apply to the SRE-anything concept. It's the goal for how well the thing should actually operate, and again, it should track the experience of your users.
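To make the SLO idea concrete outside of software, here is a minimal Python sketch. It assumes the goal can be counted as "good events out of total events" (for example, satisfied survey responses). The `SLO` class, the function names, and the numbers are hypothetical illustrations, not something shown in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A goal for how well 'the thing' should operate (hypothetical structure)."""
    name: str
    target_percent: int  # e.g. 90 means 90% of events should be good


def allowed_bad_events(slo: SLO, total_events: int) -> int:
    """How many bad events the target tolerates; integer math avoids float rounding."""
    return total_events * (100 - slo.target_percent) // 100


def meets_slo(slo: SLO, good_events: int, total_events: int) -> bool:
    """True when the share of good events is at or above the target."""
    if total_events == 0:
        return True  # assumption: no data yet counts as "not violated"
    return good_events * 100 >= slo.target_percent * total_events


# Illustrative numbers only, not data from the talk.
training_slo = SLO(name="students satisfied with onboarding", target_percent=90)
print(allowed_bad_events(training_slo, 200))  # 20
print(meets_slo(training_slo, 185, 200))      # True
```

A target below 100% leaves explicit room for some unhappy responses, which is the point: the goal is satisfaction, not perfection.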

SRE is really about achieving balance between competing concerns. It focuses on establishing what an acceptable level of failure is and balancing that need against the other aspects of your business, and lets you move fast when things are good and respond appropriately to failure when it happens. If we want to take this out of the software development context and apply it to the concept of SRE anything, you want to think about these appropriate trade-offs between reliability, time, velocity, and cost.

Here's another page from the SRE playbook. Is more effort always better? No, it's not. Just like when you're SREing software or a software service, when we SRE anything we want to strive for this balance between competing forces. Balance the effort versus the results that you're trying to achieve. Do just enough to meet the needs of your customers, keep them happy, but not too happy. Consider the trade-offs and avoid polishing a diamond. If you do anything above and beyond that happiness threshold, you're wasting time and effort that could be better applied to other things.

It can be so tempting to polish a diamond. I always like to think back to the SRE EDU team, where I started out in my early days in Google SRE. Our SRE EDU orientation or onboarding program was really our flagship offering. This was getting the new SREs up to speed, and our team really loved working on this program. They loved chipping away at it and delivering incremental improvements. However, our survey results showed that our students were actually quite happy with the program and what we were offering. By continuing to invest and spending all this time on it, we were robbing ourselves of the time and energy to really innovate and spin up additional programs with greater impact, for example, our annual week of education and ongoing education in general. Again, you want to think about making appropriate trade-offs.

A couple other quick principles and patterns are important to an SRE practice. Users should never notice an outage before you do. You want to engineer solutions to eliminate classes of errors rather than being satisfied with point fixes. Don't feed those machines with human toil. And of course, failure is an opportunity to improve, not to brandish pitchforks.

We can adapt these principles to any topic you can imagine. It's a simple reframe. Users should never notice a problem before you do. For the second bullet, the solutions don't have to be engineering-based; they just need to be broad-based rather than point fixes. Substitute: develop solutions to eliminate classes of errors rather than being satisfied with point fixes. In the third point, we're talking about SRE in terms of the software and the machines that it runs on. To make this more general, we can swap in the domain for the machine: don't build the thing that you're trying to build with human toil. "Failure is an opportunity to improve" is timeless and applies no matter what domain you're tackling.

Continuing on the theme that failure is an opportunity to improve, let's talk about why blamelessness is so important, and why it matters when you're trying to SRE not just software, but anything you can imagine.

A big part of SRE is thinking about what happens if something breaks. As we discussed earlier, since 100% is the wrong target for basically anything, it's not if something breaks; it's when something breaks. Murphy's law tells us that no matter how hard we try to prevent it, things are going to break. When something breaks, how do people on the team respond? How does that organization respond to failure?

Is failure viewed like this? "I'm extremely angry right now. People should lose their jobs if this was an error." This was a statement made by Hawaii State Representative Matt LoPresti in reference to the nuclear alert false alarm back in 2018. Fun fact, I just learned that there is now a new off-Broadway show playing about this particular incident. I'm going to be in New York next week, so I'm excited to check that out. But I digress. For those who don't remember, human error caused this statewide emergency alert to go out about an inbound nuclear missile attack. Pretty serious there. But the reaction here is clearly very blameful.

You may be wondering: what is the harm in blame? Why shouldn't heads roll if a mistake was made? The reality is that if there's an expectation you're going to be blamed if something goes wrong, you're incentivized to cover it up, to sweep it under the rug, so to speak.

Again, failure happens. There's no way around it. Murphy's law tells us so. If you acknowledge that something has gone wrong or a mistake has been made, no matter what domain you're operating in, you and the team can start working to fix it faster, which leads to improved time to detect and time to resolve issues. Also, if people feel like they can come forward when they make a mistake or if they notice a problem, you open this opportunity to proactively address a point of failure and ultimately deliver more robust systems, more robust things. If you keep hiding the problems, you're going to end up with a brittle system in the general case, or a brittle, unreliable thing.

Simply stated, failure is an opportunity to improve, not to brandish pitchforks. Our goal is always to learn from failure. You've already paid the price when something goes wrong, so write a blameless postmortem and share it widely so others can learn too. This is a foundational concept of SRE, and it can apply well beyond the software engineering context.

Now let's get into a couple of examples of how to SRE anything at work.

This is the service reliability hierarchy that was included in the original SRE book that we published back in 2016. The hierarchy covers the elements that go into making a service reliable, from the most foundational to the most advanced. It turns out that the service reliability hierarchy is actually a great framework for SREing anything you can imagine.

We can start with the training program context, which of course is near and dear to my heart. The big aha moment for me was realizing that those elements of the service reliability hierarchy could be adapted to a training context. The things we were teaching our new SREs could be used to actually make the program itself better.

If we start at the bottom of the pyramid, we do monitoring in the form of attendance tracking and survey feedback. We address issues that surface via that monitoring. We occasionally write postmortems when things go wrong so that we can learn from failure. We do a lot of testing of new content and programs. Canarying is super important: piloting things and progressively rolling out versus launching with one big bang. All the while, we're scaling our operations by looking for opportunities to vanquish toil through automation so that we can make the most of our limited human resources. It's only when we do these things that our program can be fully actualized and we can realize the full potential of the curriculum design and the program itself.

Take a quick example of how we use monitoring to drive improvements to our Google SRE onboarding program. Starting at the base of the pyramid, looking at the monitoring, what did our monitoring tell us six to 12 months after we launched our original program? Some key themes were emerging. SREs really wanted a more hands-on education program. Here we're spot-checking the open-text survey feedback. The beauty now with the GenAI tools that we have available is that we can derive insights over wide data sets much more easily, things that we could never do before. But again, digressing. The key theme is emerging: people really want a hands-on education program. They don't want a lecture series. There was this strong request for learning by doing.

What did we do? We iterated. We came up with a new program design that addressed these concerns. We moved away from passive listening. We built a real system that was designed for students to troubleshoot in a safe way. We gave our facilitators instructions to back off more and more as the course progressed, and we had people working in groups of three, with the least experienced person in the middle driving the group exercises. The goal was not to spray people with a fire hose of information, but rather to instill confidence, because confidence is what really drives behavior, and behavior repeated over time is what drives your culture.

What does our monitoring tell us now? TL;DR: people are much happier with the hands-on exercises. This is a good example of how we put the principles into practice to drive improvements in a non-traditional space. In addition, version two is equipped with better observability. We've got concrete behaviors being demonstrated that we can observe. Students are learning how to use a system diagram, diagnose issues using key SRE tools, annotate outages, mitigate realistic production issues, and find root causes and propose solutions. There you have it: a key example of how we applied the SRE principles to those training programs.

Now let's consider how we can apply the service reliability hierarchy to other domains. Basically, we can ask different questions at each level of the pyramid. For monitoring: what can you observe in this particular domain? For incident response: what are you going to do when things go wrong? For postmortems and learning from failure: how are you going to learn from the situation when things go wrong? For testing and release: how can you pilot new things? For capacity planning: how well does the thing, or the new features you're adding to the thing, scale? For development: how will you ultimately achieve the objectives that you set out to with the thing? In terms of the product: how do you turn that thing into a well-oiled machine, one that's reliable, sustainable, and meets the needs of its users?
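The questions above can be captured as a reusable worksheet. The following Python sketch assumes a plain-text layout; the `HIERARCHY_QUESTIONS` list paraphrases the questions from the talk, and the `blank_worksheet` function and its output format are hypothetical.

```python
# The seven levels as listed in the talk, from the base of the pyramid to the top.
HIERARCHY_QUESTIONS = [
    ("Monitoring", "What can you observe in this domain?"),
    ("Incident response", "What will you do when things go wrong?"),
    ("Postmortems", "How will you learn when things go wrong?"),
    ("Testing and release", "How can you pilot new things?"),
    ("Capacity planning", "How well does the thing, or its new features, scale?"),
    ("Development", "How will you achieve the objectives you set for the thing?"),
    ("Product", "How do you turn the thing into a well-oiled machine?"),
]


def blank_worksheet(domain: str, thing: str, means: str) -> str:
    """Build a fill-in worksheet for one domain (layout is an assumption)."""
    lines = [
        f"Domain: {domain}",
        f"The what (thing): {thing}",
        f"The how (means): {means}",
        "",
    ]
    for level, question in HIERARCHY_QUESTIONS:
        lines.append(f"{level}: {question}")
        lines.append("  Answer:")
    return "\n".join(lines)


print(blank_worksheet(
    domain="customer service",
    thing="the customer experience",
    means="solving or preventing issues in a timely manner",
))
```

Keeping the questions in one ordered list means the same worksheet can be generated for any domain by changing only the three inputs.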

I like to think about the top two items highlighted in green as our aspiration. It's envisioning what you're trying to achieve and how you're going to know if the thing and the domain are fully actualized. Underpinning this, you've got your tactics. These are the tactics you're going to use to achieve the aspirations that you set out to.

Let's look at another domain that's important to all of us: customer service. Before I joined the SRE team at Google, I worked in the AdWords global customer services team, so I know a little bit about this. If we start with the aspirations at the top of the pyramid, what are some measures of customer experience that we're striving for? Maybe it's customer satisfaction above 90%, call volume trending down, net promoter score rising, positive social media sentiment, for example. What does our well-oiled machine look like? Maybe we've got AI delivering our frontline support. Maybe we've got evidence that our customer experience reps now have time to proactively identify opportunities for customers to grow and add value to the bottom line, like Miles was talking about.

Now we can enumerate our tactics. What can we observe? Email, call, chat volume, staffing levels, CSAT, NPS. What do we do when things go wrong? Let's brainstorm. Maybe we page additional customer experience reps if support volume spikes. How are we going to learn from this? Maybe we're holding per-market weekly ops reviews. Maybe we're bubbling up insights in a monthly all-team meeting. We're feeding insights from our tickets and customer calls to the product team. How can we pilot? Maybe we have a pilot template and a process for sign-off; keep it simple and keep the teams investing here. How will we scale? Maybe we're going to test automation and AI enhancements. This is a nice framework to get you thinking about going from zero to this well-oiled machine.
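The "page additional reps if support volume spikes" tactic can be sketched as a simple alerting rule. This Python example assumes a spike means the current volume exceeds a recent average by some factor; the function name, the `spike_factor` default of 1.5, and the sample numbers are hypothetical and would need tuning for a real support team.

```python
from statistics import mean


def should_page_extra_reps(recent_volumes: list[int],
                           current_volume: int,
                           spike_factor: float = 1.5) -> bool:
    """Return True when current volume is well above the recent baseline.

    recent_volumes: ticket/call/chat counts for comparable recent periods.
    spike_factor:   assumed threshold multiplier, not a value from the talk.
    """
    if not recent_volumes:
        return False  # no baseline yet, so do not page on a guess
    baseline = mean(recent_volumes)
    return current_volume > spike_factor * baseline


# Illustrative numbers only: the baseline averages 100 contacts per hour.
print(should_page_extra_reps([100, 110, 90], current_volume=160))  # True
print(should_page_extra_reps([100, 110, 90], current_volume=120))  # False
```

Defining the rule ahead of time mirrors incident response for software: the decision about when to call for help is made calmly, before the spike happens.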

Those are just a couple of examples of how to use these foundational SRE principles in a work context. I've included a blank worksheet here that you can use to design and SRE your own work-related things.

Even better, use AI. Surprise, I told you I snuck a little AI into the talk. Oftentimes the domain we're trying to tackle is brand new to us. It's not our area of expertise; it's not in our wheelhouse. I have found that AI tools can be a great thought partner when you're starting something new. For example, I went ahead and codified the principles that I just talked about into a rubric. I asked Google's NotebookLM to do this using my deck and my speaker notes as the source. I fed that rubric and the operating instructions into a Gemini Gem, and now I can brainstorm on any new project. In this particular case, it can be anything. I brainstormed: how would I start a clown college? I'm a fan of The Simpsons, in case anybody caught that reference. It turns out Gemini Gems are meant more for individual use and can't typically be shared outside of the workspace organization that you're a part of, at least not yet. However, I am able to share a copy of my clown college conversation and the info that I used to build the Gem if you want to replicate it and try out this AI-assisted workflow with your own use case. You can find that on my website, reliablepgm.com/howtosreanything.

There it is. There's our AI interlude.

We've talked a lot about how to SRE anything at work. Now I want to talk about SRE in life situations.

Ever since I joined the SRE team at Google in 2014, I've seen opportunities to apply SRE principles and best practices to a huge array of situations. I actually wrote about how I applied SRE principles to deal with a family medical emergency, when my mom got sick against the backdrop of the pandemic. I've also done one on SREing a travel emergency. Once you've seen how well these principles apply, you can't unsee it.

We can take a look at how the framework applies to this situation. Let's take the case of developing a family emergency plan. That's the domain. The domain is a family emergency plan. The thing we're trying to achieve is reliable support available to our loved ones when needed, and information flowing appropriately in a timely manner. What does the well-oiled machine look like? You've got family members and friends seamlessly working together to take care of that loved one when, for example, you can't be there yourself.

Now that we've got the aspiration detailed out, let's look at our tactics. What can we observe? For loved ones who live alone, a lifeline is a great form of monitoring. It's a little device that they can wear around their neck and press to call for emergency support if something is going on. It'll also trigger if a loved one falls down. Just like with a production incident, when you lower the time to detect a problem, you're more likely to achieve a better, faster, and more positive resolution. Another good thing in terms of observability: make friends with their friends so that you'll have people with line of sight when you can't be there yourself.

What do you do when things go wrong? Be clear who's on call and who can support you, avoiding single points of failure, and make sure you've got emergency contacts specified locally. Create a detailed playbook. Ideally, this will be in the form of a healthcare proxy written by your loved one, with a detailed description of what they would want or not want doctors to do if folks can't speak for themselves. DR-test those family emergency plans. Just like with a software system, disaster recovery testing is critical to stress-test the plan before you need it. Do a Wheel of Misfortune exercise. You don't want to find out that you've missed a key element or a point of access in a crisis. Think about what could go wrong, preferably with your loved one if they're willing to talk it over with you, and discuss what you would do in a particular situation.

How do you learn from the situation? You do a retrospective based on the DR test and take action to ensure your documentation is up to date. Think about what you can pilot: get and test that access. How will you scale? Understand who can support you. Knowing that you're not alone is a key to the whole endeavor.

That was the example of SRE for life. You can think of other examples where this might apply, but I want to talk through one more example: SREing your own wellbeing. That last example of SRE applied to life situations was more focused on emergencies or potential emergencies. But I do think that SRE principles and best practices are very well suited to proactively managing your own wellbeing.

It turns out two SRE principles in particular, capacity planning and load shedding, are actually fantastic life hacks. First, capacity planning. Software and services running in the cloud require machine resources or quota to run effectively. I like to think of my mind as my personal quota of machine resources. Typically, SREs would build a capacity plan and provision that service to run optimally. If a service ends up running hot and exceeding that allocated capacity, at best that service is going to degrade. At worst, you're going to get a cascading failure and a total outage. To avoid those global outages, SREs spread that capacity around, maybe in three different cloud regions with some slack capacity built in for unexpected spikes in demand.

To protect my wellbeing, I use a similar capacity planning construct. I limit the number of hours I spend on things that I enjoy but are not necessarily rewarded: things like mentoring, coffee chats, public speaking. I would do this all day if I could. I set aside a certain amount of quota, and when that capacity is reached, I have to start saying no.
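The "say no when the quota is used up" rule can be written down as a tiny quota tracker. This is a minimal Python sketch; the `Quota` class, the monthly time unit, and the hour values are hypothetical.

```python
class Quota:
    """Tracks hours committed against a fixed personal capacity (hypothetical)."""

    def __init__(self, limit_hours: float) -> None:
        self.limit_hours = limit_hours
        self.used_hours = 0.0

    def request(self, hours: float) -> bool:
        """Accept the commitment if it fits; otherwise say no and change nothing."""
        if self.used_hours + hours > self.limit_hours:
            return False
        self.used_hours += hours
        return True


# Illustrative numbers only: 10 hours a month for mentoring and coffee chats.
mentoring = Quota(limit_hours=10)
print(mentoring.request(6))  # True
print(mentoring.request(5))  # False: this would exceed the quota
print(mentoring.request(4))  # True: exactly fills the remaining capacity
```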

Load shedding is another great construct that can be applied to wellbeing. I set that quota for the various aspects of my life: work commitments, family fun, and as many categories as you think you need. Sometimes things unexpectedly come up and demand exceeds the quota in one aspect of my life. If I don't do something about it, I would expect a degradation in my wellbeing or, in the worst case, burnout, that complete outage scenario. In this image, this would be like a cascading failure where all the dominoes start to fall, each one catalyzed by the last.

To avoid this outcome, when demand for my time and attention spikes in one area of my life, I use the opportunity to load shed and borrow quota from the other aspects. For example, with the family emergency issue, when my mom got sick, I had to make room to manage that situation. I did this by load shedding anything non-essential at work. It also works the other way. Sometimes there are spikes in my work commitments. It could be executive fire drills, time-sensitive program launches, whatever it is. In this case, I load shed on the personal side. I ask my husband not to over-program us. I affectionately call him our CFO, our Chief Fun Officer.
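Borrowing quota from other areas during a spike can be sketched as follows. This Python example assumes total capacity is fixed and that the other areas give up hours in proportion to their normal quota; that proportional rule, the `load_shed` function, and the weekly numbers are hypothetical choices, not a method described in the talk.

```python
def load_shed(quotas: dict[str, float],
              spiking_area: str,
              spike_demand: float) -> dict[str, float]:
    """Return a new plan that fits the spike inside the same total capacity."""
    total = sum(quotas.values())
    granted = min(spike_demand, total)          # cannot exceed total capacity
    overflow = granted - quotas[spiking_area]   # hours to borrow from elsewhere
    if overflow <= 0:
        return dict(quotas)                     # demand fits; nothing to shed
    others_total = total - quotas[spiking_area]
    plan = {}
    for area, hours in quotas.items():
        if area == spiking_area:
            plan[area] = granted
        else:
            # Shed in proportion to each area's share of the remaining quota.
            plan[area] = hours - overflow * hours / others_total
    return plan


# Illustrative weekly hours only.
normal_week = {"work": 45.0, "family": 20.0, "mentoring": 5.0}
print(load_shed(normal_week, spiking_area="family", spike_demand=34.0))
```

The total stays the same before and after the spike, which is the property that prevents the cascading failure: one area grows only because the others deliberately shrink.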

Thinking about capacity planning and load shedding in this way has effectively helped me to SRE my own wellbeing, and I hope you'll find thinking about it this way helpful to you as well.

All right, let's bring it on home with the two minutes I have left. Some key takeaways on how to SRE anything. Don't forget the what and the how are equally important. The domain, the thing, and the means to get there: that's your winning combination that really unlocks this ability to SRE anything. The service reliability hierarchy is a useful rubric for applying SRE principles to any situation, whether it's in work, in life, or for your wellbeing.

Where I need help: share your stories of how you've applied SRE principles in non-traditional ways. I'd love to hear more examples to test how general this rubric actually is. Try out my AI-assisted process. I'm curious if it works for more than just me.

I also wanted to highlight that we are working on a second edition of the Site Reliability Engineering book, with the idea that we're going to have it ready to go in time for publication with the 10th anniversary of the original next year. Very excited that Tim O'Reilly's here, because I want to chat with him. We published through O'Reilly. What topics do you think are important to include? Spoiler alert: AI will feature heavily, so don't be surprised there. I'd love to hear what you think has aged well, what you think hasn't aged well, and where we should take it.

Thank you. I'm out of time, but let's connect. I've also included a few links if you want to learn about any more of the concepts covered here. Thank you. Thanks for making the time.