Swim Don’t Sink: Why Training Matters to a SRE / DevOps Practice

Amsterdam 2023

Director, Google Cloud Platform and Technical Infrastructure Education · Google

Do you offer training to the engineers in your organization or do you throw them off the deep end to “sink or swim”? Providing training and education is universally important to set team members up for success in your organization and is critical for establishing a thriving Site Reliability Engineering (SRE) or DevOps practice and culture in the first place.

The specific training needs of each engineer vary depending on several factors, including:

- The maturity of your organization in adopting DevOps / SRE principles, practices, and culture

- The knowledge those individuals have about your organization and infrastructure

- The experience of the individuals being trained, both in terms of technical skill and familiarity with the SRE / DevOps model

This talk will explore the business case for training, the trade-offs between cost and effectiveness, and best practices for training design and deployment depending on where your organization lies on the spectrum of size and maturity.

Learn why training is not about unleashing a fire hose of information upon unsuspecting engineers but about giving those engineers the confidence to run production systems at scale.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Gene Kim: One of the prominent themes in recent DevOps Enterprise Summits has been site reliability engineering, which I've always been dazzled by. There are so many interesting things about the principles and practices that Google pioneered back in 2003, and I think it's one of the most incredible examples of how one creates a self-balancing system, where helping product teams get features to market quickly can happen in a way that doesn't jeopardize the reliability and correctness of the services they create.

For over a decade, I've wanted to better understand why Google chose a functional orientation for the site reliability engineers. To this day, thousands of Google SREs still report in one organization to Ben Treynor Sloss, VP of 24/7 engineering, which includes SRE, very purposely built outside of the product organizations.

I've wondered for over a decade why and how they do this. I'm so grateful that I met Dr. Jennifer Petoff, currently global director of Google Cloud Platform and Technical Infrastructure Education, who helped me understand how SREs work across product teams across Google. Which is, I suppose, not surprising because she literally wrote the book, the most widely cited book on SRE practices.

Here to teach us about how Google trains SREs, and things to think about when training members of your own organization, is Jennifer.

Jennifer Petoff

Jennifer Petoff: Hello, everyone. How's everyone doing this morning? All right, good energy. I do think we saw a lot of pizza in our talks yesterday. I think this talk has less pizza, possibly more unicorns. Today what I'd like to talk to you about is why you need to help your people swim and not sink, and the importance of learning to an SRE or DevOps culture.

I think Gene covered the highlights of my professional background. I thought I'd throw in another couple of tidbits just to maybe spark a conversation over coffee or lunch. Back in the day, I got a PhD in chemistry, so my friends call me Dr. J. In my spare time, I love to travel, and I'm a part-time travel blogger at Sidewalk Safari. I'm super excited to be here and have an excuse to come to Amsterdam and explore all that the Netherlands has to offer.

Before we jump into the main topic of the presentation, which is about learning programs and training SREs, I wanted to give you the TL;DR on site reliability engineering, just so we're all on the same page. What is SRE? It's a discipline that originated at Google back in 2003. It's a framework for operating large-scale systems reliably. Ben Treynor is famous for saying that SRE is what happens when you ask an engineer to design an operations function, or ask a software engineer how to manage systems in production. Really, it is this focus on running systems in production that is the key to the practice.

There are four key principles that we like to think about as foundations to SRE. Number one, site reliability engineering needs service level objectives with consequences. SREs must have time to make tomorrow better than today. SRE teams do need to have mechanisms to regulate their workload so that they're not buried under toil. Finally, failure is an opportunity to improve, not an opportunity to brandish pitchforks, as we like to say.

How does SRE fit into the product lifecycle? It's this bridge between the business, development, and operations. It provides a common language. It provides shared goals. It helps to take any negative emotion out of decision making.

With that out of the way, let's talk about learning and why learning is important. I hope you'll agree that when it comes to becoming an SRE, or practicing DevOps in the first place, or even just ramping up on a new service that you're in charge of, there's a lot to learn. There are the ins and outs of those production systems, incident management, best practices, and more. The list goes on.

Some people think about training as it being all about cramming as much information into people's heads as possible. It's setting up this proverbial fire hose of information. But the reality is, no matter how much you tell people, they're only going to retain a very small amount of what you tell them, especially in a lecture format.

Training is not about the fire hose. It is about believing in yourself. For adult learners, it's all about building confidence. It's about fighting impostor syndrome, especially for people who are new to a team.

Going beyond confidence, training is also about driving and perpetuating an organizational culture. Just like these hot air balloons, training can really give your culture a lift.

In light of all this, how should you go about thinking about and actually training your SREs or your DevOps practitioners? There's actually a continuum of training options that you could consider. There's sink or swim at one end of the spectrum, so basically do nothing, and then there's the systematic training program at the other end of the spectrum.

You really want to avoid sink or swim if you value inclusivity. Just letting people figure it out on their own can breed stress and frustration, and can even lead to attrition. It can lead to impostor syndrome. Definitely, that's one to avoid at all costs.

For other options, you're going to want to consider the return on the effort invested. If you focus in on a higher-touch option, like a systematic training program or even ad hoc classes, this can signal a few things. Number one, it signals leadership commitment to the development of employees. It can help ensure that everyone's actually speaking with one voice. It can help instill that desired organizational culture and reinforce the desired behaviors. At the end of the day, confidence drives behavior. Behavior repeated over time is what drives the culture.

The first question that you may ask is, what should I teach? What's this all about? What you include in your training program really boils down to three different dimensions. We'll look at maturity, familiarity, and experience. Maturity is all about the maturity of your organization in adopting SRE or DevOps principles, practices, and the culture associated with that. Familiarity addresses the knowledge that individual engineers have about your organization and infrastructure: how familiar are they with you and your systems? Experience is about the experience of the individuals being trained, including their technical skill and their familiarity with SRE or DevOps.

Let's start by considering the case of those just starting out on that DevOps or SRE journey. How can we ensure that we minimize the pain on that potentially long and bumpy road ahead? This is a road leading to Torres del Paine, a national park in South America, so again, riffing on the travel theme.

This is really the case of low organizational maturity. You're just getting started in adopting SRE principles and practices. What I would really recommend is to start by addressing any skills gaps. Ask yourself some foundational questions about your teams, and you can see some of these questions here. Really, foundational-level stuff. If your answer to the majority of these questions is no, start here and address those gaps with training.

Now on to step two. Once you've gotten everyone to a common baseline, you'll want to consider who you have on your team and perhaps how to tailor that message. The reaction of any individual team member to a proposed organizational transformation will likely depend on how familiar they are with your organization and the current culture, crossed with their level of SRE experience. Have they practiced SRE elsewhere?

If you look at people who are low experience or no experience and are new to your organization, I expect these people will be more receptive. They're more likely to just go with the flow. If you have people who have low experience with SRE or DevOps and have high familiarity with your organization, these are the people that you potentially need to watch out for. They might be resistant. The idea here is you'll need to address, what's in it for me? Why is it different this time, if this latest transformation is seen as the new hotness that will quickly fade over time?

Finally, for people who have existing SRE or DevOps experience, these are your catalysts. These are the people that you want to empower to tell their stories about the impact that this has had on the work that they do, saving their weekends, et cetera. Tell those stories, much like we heard about yesterday. Again, you'll want to tailor the format and content of your proposed training to address the unique needs of these individuals.

What about an organization that already has a well-established SRE team or practice? Now it's all about building confidence and reinforcing that culture that you've built. Let's talk about building a training program for this use case.

This is the high organizational maturity use case. You've already got this well-established culture and practice, and now we need to dig in and evaluate the mix of people that you have on your team here as well. Once again, we'll do this along the dimensions of organizational familiarity and SRE experience.

Let's talk newbies. Newbies are the folks who are just joining your organization and they're also less familiar with SRE and DevOps practices. You've got your internal transfers. These people are expected to have a high level of familiarity with your company, but maybe low experience with an SRE or DevOps practice. Old-timers are the folks that have worked for your company for a while and know the company and that great culture that you've built. Finally, industry veterans are people who have a high level of experience with SRE and DevOps practices, but are new to your company.

Let's look at those learning needs in a little more detail. You've assessed the mix. You see who you've got. Let's figure out where to focus your training plans. Newbies typically need the most attention. They need to learn your infrastructure, and they also need to learn that culture and imbibe that culture. Internal transfers may know your systems, but they're perhaps less familiar with DevOps and SRE practices at your organization. Old-timers we expect to be experienced on both dimensions, so you're going to want to go for technical depth to unlock their career growth. Finally, industry veterans need to ramp up on your infrastructure and your processes, and they may need to unlearn some bad habits. Focus on the specifics in this particular case.
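The four learner profiles described above can be written down as a small lookup table. This is only a sketch of the two-by-two grid from the talk: the `Level` enum, the `LEARNER_PROFILES` table, and the `training_focus` function are hypothetical names introduced for illustration, and the focus descriptions paraphrase the speaker's wording.

```python
from enum import Enum


class Level(Enum):
    LOW = "low"
    HIGH = "high"


# Keys are (familiarity with the organization, SRE/DevOps experience).
# Labels and focus areas paraphrase the talk; the table itself is an
# illustrative assumption, not something shown in the presentation.
LEARNER_PROFILES = {
    (Level.LOW, Level.LOW): ("newbie", "infrastructure and culture"),
    (Level.HIGH, Level.LOW): ("internal transfer", "SRE/DevOps practices as applied here"),
    (Level.HIGH, Level.HIGH): ("old-timer", "technical depth for career growth"),
    (Level.LOW, Level.HIGH): ("industry veteran", "organization-specific infrastructure and processes"),
}


def training_focus(familiarity: Level, experience: Level) -> tuple[str, str]:
    """Return (profile label, suggested training focus) for one engineer."""
    return LEARNER_PROFILES[(familiarity, experience)]
```

For example, `training_focus(Level.LOW, Level.HIGH)` returns the industry veteran profile, whose focus is the specifics of your own infrastructure and processes.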

Up until this point, we've really focused on the what: what you might want to include in the training program. However, there's a whole other side that we need to consider. I actually love drawing an analogy between the software development lifecycle and running training programs. In software development, the what is your shiny product features, and the how is all about how to deploy to production in reliable ways to meet the needs of our users. In the training program context, the what is all about the content. What are you teaching? How's that all going to work? The how is, how do you deploy a consistent and reliable training program that meets the needs of our learners?

Now let's talk about the how of delivering a training program. This is really about operations, and there are two dimensions you'll want to consider in this case. There's the size: how big is your organization? And the growth rate: how fast are you growing? This will help inform how much effort you might want to invest in training.

If your organization is small and growing slowly, focus on one-on-one knowledge transfer through mentoring or shadowing. If your organization is large and growing slowly, investing in ongoing education is a great retention tool and a career development tool. You want to invest in your people and keep your people. If your organization is small but growing rapidly, invest in onboarding. That's the easy win. If you're both large and growing rapidly, it makes sense to invest in a full-lifecycle training program, from onboarding through ongoing education and beyond.

The Google SRE team actually falls into this upper-right quadrant, and this is why we've invested in a full-time team to develop this full-lifecycle training program. A small full-time team is supported by basically an army of volunteers who have a lot of passion for sharing what they know with their fellow Googlers, fellow Google SREs.

To summarize, you want to invest the most under conditions of rapid growth. Small but rapidly growing organizations benefit the most from onboarding training. Don't forget about your existing workforce, though. Invest in that ongoing education for their career development.
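The size and growth-rate guidance above amounts to a four-way decision, sketched here as a single function. The function name `training_investment` and its boolean parameters are hypothetical; where the line between "small" and "large" or "slow" and "rapid" falls is left to the reader, since the talk gives no thresholds.

```python
def training_investment(is_large: bool, is_growing_fast: bool) -> str:
    """Map organization size and growth rate to the suggested training focus.

    The four outcomes paraphrase the quadrants described in the talk.
    """
    if is_large and is_growing_fast:
        return "full-lifecycle training program"
    if is_growing_fast:
        # Small but growing rapidly: onboarding is the easy win.
        return "onboarding"
    if is_large:
        # Large and growing slowly: retention and career development.
        return "ongoing education"
    # Small and growing slowly: one-on-one knowledge transfer.
    return "mentoring or shadowing"
```

A large, rapidly growing organization, the quadrant the speaker places Google SRE in, maps to the full-lifecycle program.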

We've talked about the what, and we've talked about the how. Now let's talk about the where, as in where to begin. I have two words for you: ASBATs, or one acronym, basically. A student should be able to. This is how we think about learning objectives for a training program. As you're thinking about putting together a training program, you want to define these learning objectives. A student should be able to...

Your first thought might be, a student should be able to understand the thing. But the reality is that this is written in a really passive and hard-to-observe way. Instead, focus on the behaviors that you want to drive. Think about using a tool, interpreting a graph, moving traffic away, draining. You want to be able to observe and measure how the training is actually applied, as opposed to someone saying, yep, I get it.
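One rough way to apply this advice is to check draft learning objectives for verbs that describe an internal state rather than an observable behavior. The sketch below is an assumption-laden illustration: the `VAGUE_VERBS` list and the `is_observable` helper are hypothetical, and a real review would still need human judgment.

```python
# Hypothetical list of verbs that describe what a student thinks or feels
# rather than something a facilitator can watch them do.
VAGUE_VERBS = {"understand", "know", "appreciate", "learn", "grasp"}

ASBAT_PREFIX = "a student should be able to"


def is_observable(objective: str) -> bool:
    """Return True if the objective's leading verb is not on the vague list."""
    text = objective.strip().lower()
    if text.startswith(ASBAT_PREFIX):
        text = text[len(ASBAT_PREFIX):]
    words = text.split()
    # An empty objective has no behavior to observe.
    return bool(words) and words[0] not in VAGUE_VERBS
```

With this check, "A student should be able to understand the thing" is flagged as not observable, while "A student should be able to interpret a graph" passes.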

There's a simple model that instructional designers often use to aid in the development of training content. It's called the ADDIE model, and it stands for Analyze, Design, Develop, Implement, and Evaluate. It's very helpful for setting up a tight and iterative feedback loop to drive continuous improvement.

Let's get metaphorical for a moment. This talk is about training site reliability engineers, and it turns out that the foundational principles of SRE actually apply really well to the training program itself. I like applying SRE principles in different contexts. This is the service reliability hierarchy that we published in the SRE book, which talks about the elements that go into making a service reliable, from the most foundational to the most advanced.

This shows how those principles can be applied in the training program context. We do monitoring in the form of attendance tracking and survey feedback. We address issues that surface via that monitoring. We occasionally write postmortems when things go wrong, so we can learn from that failure. We do a lot of testing of new content and programs. All the while, we're scaling our operations, looking for opportunities to vanquish toil through automation so that we can make the most of our limited human resources. It's only when we do all these things that our program can be fully actualized and we get the benefits of the curriculum design and the program itself.

Here's another page from our SRE playbook. Is more effort always better? No. Just like SRE, we strive for a balance between competing forces. You want to balance your effort versus the results that you get from that effort. Do just enough to meet the needs of your students, keep them happy, but not too happy. Think about your SLO and that point where happy becomes sad. Consider the trade-offs. If you've got a good program, avoid polishing a diamond. Don't polish your onboarding diamond; invest that effort into ongoing education, for example.
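The idea of an SLO for a training program can be sketched with survey data. Everything specific here is an assumption made for illustration: the 1-to-5 rating scale, the threshold of 4 for a "happy" response, the 80% target, and the function names are not from the talk.

```python
def satisfaction_ratio(scores: list[int], happy_threshold: int = 4) -> float:
    """Fraction of survey responses at or above the 'happy' threshold."""
    if not scores:
        raise ValueError("no survey responses to evaluate")
    happy = sum(1 for score in scores if score >= happy_threshold)
    return happy / len(scores)


def meets_slo(scores: list[int], target: float = 0.8, happy_threshold: int = 4) -> bool:
    """Return True if the share of happy responses meets the target.

    The default target of 0.8 is a placeholder; pick the point where
    'happy becomes sad' for your own learners.
    """
    return satisfaction_ratio(scores, happy_threshold) >= target
```

If a course comfortably meets its target, the talk's advice suggests spending further effort elsewhere, for example on ongoing education, rather than polishing that course.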

This is just a quick example of how we used our monitoring to drive improvements to the Google SRE onboarding program. We looked at the base of the pyramid. What did our monitoring tell us? We could see a key theme emerging from the student comments in our surveys. Clearly, they wanted more hands-on education, less of a lecture series. Our first incarnation of SRE orientation was very much a sit-and-get, like let's aim the fire hose at you. We heard loud and clear that this was not what our audience wanted.

We developed a second generation of our orientation program, which moved away from long-form lectures. We set it up so that students troubleshoot a real system, but a safe system that they could break with confidence. Facilitators back off more and more over the course of the week. The whole goal here is to instill confidence, not fill people's heads with facts about production.

What does our monitoring show us now? Our students are clearly happier with those hands-on exercises. I love this one example in red: "It was the funnest week I've had this year overall. It made me feel more connected to production and the technology, which made me really happy." This was actually one of our VPs of SRE, Ben Lutch, who came through orientation just to reconnect with the organization. The fact that he said this was very, very gratifying.

Our new version of orientation is better instrumented for observability, with concrete behaviors being demonstrated throughout the course. There you have it. It's a key example of how we applied SRE principles to our training program at Google.

Quick takeaways: training is an investment, an investment in your organization and in your people. Evaluate the costs and the benefits. You want to make sure you're making the right level of investment based on your organizational circumstances and the maturity of your organizational transformation. Where to invest is going to depend on the what and how of those organizational circumstances.

Finally, walk the talk. Apply SRE and DevOps principles to the training program itself for this consistent and reliable experience.

I'll leave you with a few resources that are available for free on the sre.google website, including Training Site Reliability Engineers, which goes into more depth on the topics that I've covered today.

Finally, help I'm looking for. I'd appreciate any help in terms of elevating training and recognizing the impact that it has on an organizational culture. Throughout my career, I've found that there can be unconscious bias in this space: oh, I created a training program that one time; it's fluffy, it's easy, it's not impactful. But if training is something that people don't perceive as important, people aren't going to spend time on it. You need those engineering subject matter experts to participate in order for the program to be successful.

I actually find I spend quite a bit of time getting creative on how you measure the unmeasurable, how you actually back up what you're doing with data. If you have examples of trainings that have worked well at your organization, ways that you've measured the impact that that's had on your business, I'd love to hear how others have done it so that we can again lift all our boats, so to speak, in terms of how we talk about that.

With that, you can find me on Twitter. You can find me on LinkedIn. I'll be around through lunch. Looking forward to chatting with you, and thank you very much, Gene.