Overcoming Challenges for a Successful SRE Organizational Setup

Europe Virtual 2024

SRE Engagements Product Area Lead · Google

SRE is not only a role, it typically is organized as a center of excellence that is matrixed into the product organizations. There are good reasons for this approach, but it creates challenges. This talk discusses strategies to overcome these challenges and why the matrix approach is key to the success of SRE despite them.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Up next is Dr. Christof Leng. As I mentioned yesterday, this is a community that has adopted SRE principles and patterns at a large scale. I love that statistic from the 2019 Stack Overflow Developer Survey that shows that DevOps specialists and SREs are among the highest paid, most experienced, and most satisfied with their job.

For me, the statistics show to what extent SREs are actually a very different functional discipline than your typical developer. One of my main learnings is that the more functional specialties you have, the more sophisticated your organizational wiring must be. This is what we call layer three in the Wiring the Winning Organization book. The deeper the functional specialty, the more we need matrix structures, where the functional owner can define the who and the how, and the product or value stream owner can own the what, why, and when.

Functional owners own the who and the how. Product owners own the what, when, and why. There we go.

Here to teach us about how SRE uses matrixed organizations at Google scale is Dr. Christof Leng, SRE Engagements Product Area Lead at Google. This is his fourth time speaking at this conference, and I learn something every time he talks. Christof, over to you. I can see you, I can see your slides, and maybe we can hear you.

Christof Leng: Hi, team. Thank you. Can you hear me?

Gene Kim: Very good, thank you. Over to you.

Christof Leng

Okay. Thank you so much. Happy to be back.

Over the years I have spoken about many aspects of site reliability engineering: what SREs do, how they do it, and so on. But one topic that comes up quite often when I talk to other organizations, to Google Cloud customers, or to groups within Google, is: how do you actually set up SRE for success organizationally?

So I wanted to talk a little bit about that, what kind of challenges there are, and how you overcome them.

One thing that we all know is that developing new software is just a small part of the overall development lifecycle. Running the software in production at scale, reliably, while you keep adding, changing, or removing functionality all the time, is a big part of it that deserves attention and requires a lot of effort.

The traditional split between dev and ops, with its silos, is not helpful in that context. DevOps teaches us to break that down: everybody should have some production knowledge and should take on some part of it, because it is part of the job, much like software testing or requirements engineering. You can't completely outsource that into some kind of black box.

That is definitely a great idea that I think universally applies at every organization, no matter what you do, no matter how big you are, no matter how you do it. Often it might be all that you need. Your production complexity is limited. You have a small footprint. It's not very critical. It's relatively simple.

But what happens when that is no longer the case, when your complexity is growing and is becoming scary?

The problem here is cognitive load, which we have heard about here before. It can be overwhelming for software engineers if they need to understand not only the business domain they're working on, but also the many horizontal topics out there, like testing, security, UX, you name it, and then production knowledge on top. They can't be a master of all of these topics. So you often still need specialists who support the more generalist developers in these areas.

Now the key topic of my talk is: if you need these specialists, how do you organize them in your bigger organization?

The naive, obvious solution is: every team, or at least most teams, needs some kind of production knowledge. So in each team you develop or hire one or two full-time or part-time production experts. I don't personally think that this is a good idea. Your mileage may vary, but I think there are a number of big risks attached to that.

The obvious one is that now you have some people that do the dirty work. All of the ops work gets thrown at these people, and the rest of the team starts to disengage from that space. They then don't really have time to actually engineer production. They're just busy with the ops aspect of it.

Also, they could be pulled into feature development work because this thing has to launch really, really quickly and it's an all-hands-on-deck situation. Then again, there is not enough time for long-term tech debt reduction and improvements in the infrastructure.

They also don't have a community, so they can't really learn and grow and exchange ideas and evolve their discipline. Also, it's kind of a career dead end. What kind of bigger space of responsibility can you take on, either as a people manager or as a tech lead, if there's just one or two people in each team and they're all separate? People who are really hungry for doing bigger things, who really want to advance your organization and deliver, might not be interested in that role. So you don't really get good talent for that.

What are the alternatives? You could have a community of experts, where you put all of these experts into a center of excellence. I will call them site reliability engineers because that's what most people do; there are a few other names in the industry, but it doesn't really matter. It could be one center globally in the company. This is what Google does. But there are alternatives that might work as well or better for you. It could be individual teams in your development organization, or it could be groups of teams within a bigger business unit, with several such groups across your entire enterprise. You can read more on this in the resource that I linked here, if you're interested in what applies when.

Now when you have that community of experts, is it all rainbows? All problems solved? I can end my talk here? You can go quicker to lunch? Fine.

No, actually, I'm sorry. It's not that easy. There are challenges with that model, and the most obvious is: you now have two teams again that need to collaborate. You have the silos again. So how do you make these silos not hurt and not lose the advantage that you've already gained through a DevOps mentality?

The problem here is that often when you have these silos, the SRE team might not have the business alignment. At least, it might not be as close to the business as the development team. Then they might not work on the things that really matter for the business. They might lose touch with what needs to be delivered, and they might actually prioritize things that get in the way of what the business needs. That is really bad. That can lead to conflicts and to problems for the business. That's definitely not what we want. We want to help, not get in the way.

What are the things we can do? You can obviously make the SREs use the product, understand what the product needs, what the users do. If that's a product that's not easy to use because it's for expert users, you can expose your SREs to the users much as you should do with your developers.

You can also expose them much more to the development team, having joint planning and joint projects where SREs and developers work hand in hand together towards a common goal, and also spend time together, either because you put their desks in the same space, or you visit quite often, or you at least have regular video conference meetings, or a combination of those.

You can also focus on the end-to-end user journeys, so you do not just optimize this little aspect of the system, but you actually look at the bigger picture and look at it through the lenses, through the eyes of the user. That is how you should improve your internal system structure.

SLOs, especially user-centric SLOs, are a good tool for that. You can also align your OKRs with the user happiness and the business success, instead of some arbitrary internal targets.
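
As a rough illustration of a user-centric SLO, here is a minimal Python sketch that measures one end-to-end user journey and reports how much of its error budget is left. The journey name, the 99% target, and the counts are hypothetical values chosen for the example, not figures from the talk, and the formulas are the commonly used success-ratio and error-budget definitions rather than anything the speaker prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneySLO:
    """An SLO defined on a whole user journey (hypothetical example type)."""
    name: str      # e.g. "checkout", an assumed journey name
    target: float  # fraction of journeys that should succeed, e.g. 0.99


def success_ratio(total: int, failed: int) -> float:
    """SLI: fraction of attempted journeys that succeeded."""
    if total == 0:
        return 1.0  # no traffic means nothing was violated
    return (total - failed) / total


def error_budget_remaining(slo: JourneySLO, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo.target) * total
    if allowed_failures == 0:
        return 1.0 if failed == 0 else 0.0
    return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    checkout = JourneySLO(name="checkout", target=0.99)  # assumed target
    print(success_ratio(total=1000, failed=5))
    print(error_budget_remaining(checkout, total=1000, failed=5))
```

Because the budget is expressed in failed user journeys rather than per-component errors, SREs and developers can plan against the same number.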

The second antipattern is that your SRE team might get spread too thin. That happens when you peanut-butter your engagements across many different things, especially if these things are not coherent. There's this one system here, and this one product over there, and this other thing, and they're roughly in the same corner, but they don't really have enough touch points. They don't form a bigger picture. They're just a collection of things.

Then SREs cannot get a deeper understanding, a bigger picture, where they can add a perspective that might not even be present for most of the developers. Instead, they split their attention. It increases the cognitive load again and causes all kinds of problems.

But even when you have defined a coherent space and say, this is the responsibility for this SRE team, it could try to do too many things in parallel in that space. So limit the number of your concurrent engagements. Really prioritize where can this SRE team have the biggest impact, and put more wood behind fewer arrows, as we say.

Only do this where you actually have an impact that outweighs what the developers could do. If the developers could just do it, let the developers do it, and don't hire as many SREs. A generalist developer is a much more fungible resource than a specialist, so you can use them more flexibly. That is what the business really needs. I think there are always enough opportunities for SRE, so this isn't a risk for our job security; it actually underlines our value.

The organizational overhead can also be accidental, where you still have the right set of engagements, but you are in every design doc, in every meeting, and everything. So you're not getting any time to actually work. Focus on being in the loop on the things that actually matter, and do this during your office hours, especially in a global organization. Having these late 8:00 p.m., 9:00 p.m. meetings every week is not going to help your productivity.

Number three: SRE is not an ops team. SRE does operations, but as a means to an end, to actually understand the system better, to engineer the system better. But often that is perceived as, oh yeah, SREs are great on-callers. They are super good at firefighting and fixing systems, and this is really the value that they add, because the developers no longer have to be on call.

But if you're getting put in this corner, then you might have a lot of ops work on your desk, and then you might not be able to actually have time for the real engineering work.

So how do you deal with that? First of all, explain over and over why SRE is on call and where it applies, namely where it supports engineering goals. Only start with that conversation about who's going to hold a pager once you have specified what the engineering goals are, and how on-call may or may not help with achieving them.

Then when you have achieved things, or when you have figured out that these goals are no longer realistic or no longer a priority, the pager should go back. While you are holding the pager as the SRE team, do share the on-call load with the developers so that giving it back is easier. It's less of a shock for the dev team, and you also gain a shared experience of being on call. It's helpful for the developers to learn how to build better systems by seeing how these systems run in production.

You have much better communication when each side knows what the other is talking about. SREs should know how to code, and developers should know how to run production, so you can have a productive conversation. But SREs should not only know how to code, they should also be allowed to do it: allowed to contribute, with access to the source code, able to read it and able to write some of it where needed.

Number four: make sure that being an SRE is not a career dead end. If there is a reputation of, this is a boring, crunchy job that gets you nowhere, it will be hard to get the right talent and it'll be even harder to keep the talent.

Make sure that SRE is part of the shiny projects, the biggest thing that your organization is trying to do, and is not just keeping the lights on in the basement. In the context of these projects, provide interesting work in the design and the coding, being part of the things that engineers really enjoy, and make sure that the SREs don't do this for their entertainment, but because they actually contribute value there.

Provide sufficient career paths. This is something that I touched on earlier: an individual in a single team doesn't really have much of a career path. But even inside a bigger SRE organization, there are lots of things you can do that limit career paths. You should be very conscious about those. For example, have SRE representation in promo decisions, so that there is somebody who understands what an SRE does and how impactful the work of a candidate who is up for promotion was.

It's much easier to keep people in SRE if you don't force them to stay. Enable very low-barrier mobility between dev and SRE, so people can leave anytime they want. Then they don't have the urge to escape, and they can stay willingly because they know they can leave when they need to, when they want to. If you invest enough in the learning and the training of your SREs, and in career growth and skill growth, they might want to stay.

Last but not least, a problem can be that the SRE team tries to justify its existence by making production more complicated than it needs to be, in a way that only we can handle this. I call this the human abstraction layer for production.

It is not with malicious intent, at least in the cases that I have encountered, but it is actually SRE trying really, really hard to provide a clean abstraction layer, a clean API that the developers can use and don't have to worry about how the sausage is being made. Then investing both manual operational toil, as well as plenty of coding for automation and tooling and infrastructure, to make this very convenient for the developers. But at the end of the day, you cannot then remove that work from SRE again, because only they know how to maintain and how to run this complex infrastructure.

Keep an eye on your production complexity. Try to have some rule-of-thumb measurements of how complex things are and where you are headed. Is it getting more complex? Very actively reward the team for making things simple. Do not punish the SRE team for actually succeeding at making things simple. If the reaction is, oh, production got so simple, we can cut the SRE team in half, and if there is a risk of that, or there are examples of it having happened in your organization, well, guess what? SRE will only ever get close to making things simple; things will still remain a little bit complex, so you can never fire these people. That's not what you want. You want people to actually go all the way and deliver.
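
One way to make "where are we headed?" concrete is to track a single proxy number over time and look at its trend. The Python sketch below fits a least-squares line through periodic samples and reports the slope. The choice of proxy (here, an assumed count of manual runbook steps per quarter) and the sample values are hypothetical; the talk does not prescribe any specific metric.

```python
def trend_slope(samples: list[float]) -> float:
    """Least-squares slope of samples taken at evenly spaced intervals.

    A positive result means the proxy is growing per period,
    a negative one means things are getting simpler.
    """
    n = len(samples)
    if n < 2:
        return 0.0  # not enough data to talk about a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical quarterly counts of manual runbook steps.
    runbook_steps = [10, 12, 14, 16]
    slope = trend_slope(runbook_steps)
    direction = "more complex" if slope > 0 else "simpler or flat"
    print(f"change per quarter: {slope:+.1f} ({direction})")
```

A rule of thumb like this is only a conversation starter; the point is to notice the direction early and to reward a falling trend.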

Work on the mindset of SRE. It's not a laundry service where people drop off their dirty clothes and get them back clean and folded while never looking behind the curtain. I see SRE more as a sports coach who helps you get better at this. They work with you and you need them, but at the end of the day, you still have to do the hard work yourself.

Provide more active work than building this complex machinery, for example, as I said earlier, being part of these shiny projects that really advance the business and helping land them together with your developers and other specialist groups.

That is it. Let me quickly summarize. We understand that breaking down silos is critical, but you might still need full-time experts for production. They can only succeed if you provide an environment for them that works, and that comes with a risk of creating these silos again. So help them to align their work with the business goals and their partners throughout the product lifecycle, through very close collaboration and the various ideas that I presented. There are probably tons of others that you might also have. Keep an eye on this. Thank you very much.

Q&A

Gene Kim: Thank you, Christof. If you look at the Slack channel, you'll see all the comments reacting to your stories. Just real quick before we go to break, and turn it over to Jeff, is there any help you're looking for?

Christof Leng: Yeah. I am always interested in how others are doing that. This is a collection of things that I got from many different organizations, and I think this is just the tip of the iceberg. So if you've seen any problems that you struggle to overcome, or if you have had hard problems that I haven't listed and you have a solution for them, please let me know. I would be really interested in learning more about this space.

Gene Kim: Very good. Christof, always good seeing you again, and thank you so much.