Engineering ITSM Through Site Reliability Engineering

Europe 2021

We all know SRE as a growing engineering practice for IT operations. In fact, SRE is a modern ITSM framework that reimagines service management as an engineering practice with a singular goal of consistent reliability. This session will explore SRE in the context of ITSM, with particular insight into how SRE approaches service level, change, incident, problem, and capacity management. The session will also explore SRE as a self-regulating ITSM system that most closely aligns with Agile and DevOps as three continuous flows of managing services.

Full transcript

Jayne Groll

Hi, everyone. I'm Jayne Groll, CEO of the DevOps Institute, and I'm super excited to be with you today at DevOps Enterprise Summit talking about a topic that's near and dear to my heart, which is engineering service management using site reliability engineering practices.

A little bit about me. I am currently CEO and one of the co-founders of the DevOps Institute. You may know me from my days in the ITIL/ITSM space. I was one of the co-founders of ITSM Academy, longtime ITIL/ITSM expert, the last several years spent in the DevOps space. I'm also a former IT Ops director, so I've been in the IT space a fairly long time and have had, I think, a bird's-eye view of the evolution of the tech community. I'm also author of the Agile Service Management Guide, which you can access for free at the DevOps Institute website.

Speaking of DevOps Institute, I'd love to tell you a little bit more about us. Our mission is to advance the human element of DevOps. We're a professional member association where we try to create a safe environment for our members to connect with each other, to upskill their knowledge, to grow their careers, and then hopefully be able to support their organization's digital transformation. We have multiple levels of membership, including a basic free membership. So go onto our website, become a member. There's lots of assets and resources for you, and you'll be able to connect with other humans at DevOps as well.

So what are we going to talk about today? Well, just to level set everyone's understanding, I'm going to provide a very brief introduction to site reliability engineering, and particularly SRE principles. Then we're going to look at SRE really through the lens of IT service management. I'm not going to compare it to ITIL. I don't think that's fair. I really want to look at SRE as a standalone framework focusing on site reliability engineering practices. And then we'll wrap up by looking at SRE in industry. We'll talk about the role of the site reliability engineer and some opportunities perhaps for you to learn more. So stay tuned.

So it's no surprise to anybody that the last year was particularly challenging from a human and from an organizational perspective. Nobody expected this. Coming out of 2019, entering into the new decade, the challenges that were faced across the world were just unfathomable, and organizations had to pivot very quickly. Those that were able to adapt to a digital landscape survived, some maybe even thrived. Those that couldn't faced challenges that were never expected, and unfortunately, some organizations did not survive.

But whatever this new normal is going to look like as we come out of 2020 and half of 2021, one thing's certain: digital transformation is just no longer optional. The digital landscape is real. And organizations, I think, have learned, in some cases the hard way, that they need to be able to adopt a digital approach as we move forward through the next decade and beyond.

But if you're going to be digital, then you also need to be reliable. And I think that's where site reliability engineering really comes in, where we understand that reliability, access, the usability of a service is really the only way to truly measure value. And so we're going to take a look at some of the practices and principles that Google described in the SRE series of books. And you may not be Google. That's okay. But the practices and principles are modern, and I think they really adapt to the digital landscape in many ways better.

So what is site reliability engineering? And again, some of you may have some preconceived understanding or knowledge about it. I'm going to stay pretty high level when we look at SRE as a service management framework.

It all started with the Site Reliability Engineering book, authored by several members of the Google team who really wanted to describe how they're able to keep their very complex environment, large-scale systems, reliable. And it went viral very quickly. It addresses the operational side of the house, but definitely steps back into pre-production.

And then while the Site Reliability Engineering book describes the practices, the principles, and has some prescriptive guidance, it was followed on pretty quickly with the Site Reliability Workbook. And then just recently, Building Secure and Reliable Systems as part of site reliability engineering was introduced as well. You can read these books for free on the Google site. You can buy them through Amazon if you prefer a hard or a digital copy of your own.

But each of these books was intended to describe how to manage services in large-scale environments by creating roles, practices, and the elimination of manual work through an engineering mindset. Look at the description from Ben Treynor Sloss of Google: "SRE is what happens when you ask a software engineer to design an operations function." So we really are looking at engineering operations, but you also have to engineer process and practices in order to have intelligent automation, and I think that's a key aspect of SRE as well.

Google, by its own definition, considers SRE its approach to service management and calls it out in the early parts of the first book. And look at what an SRE team is responsible for: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Those are traditional classic IT service management practices. But then rules are codified for how the SRE teams are going to interact with their environment. So not only post-production, post-deployment, but also pre-production by interacting with product development teams, by interacting with testers, by interacting with users.

So SREs are actually key members of the development team. The operational perspective is brought into the DevOps and the Agile teams so that by the time the code or the service goes into production, there is a shared understanding of service-level objectives, and there is a shared understanding of engineering for reliability and what that means.

To my mind, SRE is the capstone of a self-regulating system that started with Agile software development and then grew further with DevOps, build and deploy, and now looking at operations as a self-regulating system as well. We needed to be able to empower teams. We needed to be able to really build on some of the principles that first came out in the Agile Manifesto about self-organizing teams. It is an alignment that I think works very well with Agile and with DevOps, and I think is the third piece of that puzzle where we have these autonomous systems that are looking to deploy faster, more frequently with higher quality.

So let's talk a little bit about the SRE principles, and I kind of divided it up into two sets of principles, one of which really focuses on the human aspects of SRE, and the other that looks at it more from a tactical, technical perspective.

At the center of site reliability engineering is the concept of service-level objectives. That's the objective for the performance of the service. You might think of SLAs. In SRE, we really focus more on the objectives for the service. And because SLAs really took on too many different contexts and too many definitions, agreeing on what an effective service-level objective is gives everybody involved, from the Agile teams all the way through and beyond production, a shared understanding of how this service is expected to perform, what its reliability is expected to be.

I mentioned self-organization, or self-regulation. One of the key principles here is the team's ability to regulate its own workload. The team, the individual engineer, has to be empowered to regulate their workload as long as they're meeting the service-level objectives.

And then in order to be able to regulate their workload and perhaps to make the achievement of service-level objectives easier, one of my favorite parts of SRE is the ability to have proactive time. Half of an SRE's time is allocated to reactive work, but the other half of the time is allocated to proactive work. Perhaps it's automating manual tasks, perhaps it's looking at ways to improve process, but we have to be able to have the time and the resources to make tomorrow better than today.

And then, like we see in other frameworks, the hyper-focus on continuous learning is essential. Failure has to be approached as an opportunity to improve, and blameless postmortems have to be the mantra of the day. We have to be able to step away from indictment and become a learning organization that is always looking at having time to make tomorrow better than today, the ability to regulate workload, and then, of course, managing to the achievement of the service-level objectives.

The second set of principles of SRE really are more tactical. So it is about embracing risk, intelligent risk-taking, managing to service-level objectives, monitoring distributed systems. I'll tell you a little bit about observability in a bit. All focused on the elimination of toil. You know what toil is: manual, repetitive work that could be automated if we have to do it more than once or twice, but unfortunately consumes a lot of human time. And humans, uniquely, at least today, are capable of higher-level thinking and innovation. So the ability to eliminate or automate toil is essential to reliability.

And so we look at that automation, but we also want to look at simplicity in terms of how we manage our services. The two go hand in hand. You have to have intelligent process for intelligent automation, but you also want to keep it simple. We don't want to have bureaucratic or difficult-to-navigate process or automation. And then it's all about the engineering of releases. If releases are engineered well, and this is where the SRE can play a key role pre-production, then the quality of the service post-production will, of course, be higher, and the value delivered to the customer will be greater.

So I want to look at engineering reliability through service management. So for the next few minutes, what I really want to do is take a look at specific practices that you probably are familiar with, if you have any involvement in IT service management, and how SRE approaches it. I wish I had time to do a really deep dive into each of these. I encourage you to get education, to do some self-exploration about SRE, read the books, because there's a lot of really deep guidance in there. But for today, we're just going to touch on each of these practices, and I'll give you some key takeaways in terms of how SRE approaches them.

As I said, I'm not going to compare SRE to ITIL, but to my mind, SRE is the most modern approach to IT service management since the early days of ITIL. And as I said, Google, by its own admission, considers SRE as its approach to service management. So again, if we're going to have services and we're going to focus on reliability, then those services have to be managed as well.

SRE is really about systems engineering. It's engineering a system for managing services that's lightweight, that's integrated, that, as I said, is self-regulating, inclusive, accountable, automated, and proactive. I think we need to engineer our practices, not only engineer our automation, so that the system of service management is optimized for the digital enterprise.

So let's look at a couple of practices. We're going to focus on service-level management, change, event, capacity, incident, and problem management.

The heart of SRE is service-level management. In SRE, services are managed to service-level objectives. So the service is managed to its SLO, but it's measured by its service-level indicators. So those are more discrete. It may be measuring the performance of the application stack, the infrastructure. It may be measuring the performance of testing or security or release. So SLIs are the measurements; SLOs are what are managed to.

And SRE, while it references service-level agreements, really steps back from the focus on the SLA. Over time, the definition of an SLA has taken on so many different meanings in different contexts that your understanding of what a service level is and, of course, the hyper-focus on the contract aspect of an SLA has really detracted from the true meaning of the service-level objective. So SRE references SLAs, but the key focus here is indeed the SLO, and nothing can happen until the SLO is established, or SLOs are established.
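To make the distinction between an SLI and an SLO concrete, here is a minimal Python sketch. It is an illustration, not something taken from the talk or from the SRE books: the names `SLO`, `sli`, and `meets_slo` are hypothetical, and it assumes an availability-style indicator computed as the ratio of good events to total events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A service-level objective: the target the service is managed to."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be good


def sli(good_events: int, total_events: int) -> float:
    """A service-level indicator: the measured ratio of good events."""
    if total_events == 0:
        # Assumption for this sketch: no traffic counts as fully reliable.
        return 1.0
    return good_events / total_events


def meets_slo(slo: SLO, good_events: int, total_events: int) -> bool:
    """Compare the measurement (SLI) against the objective (SLO)."""
    return sli(good_events, total_events) >= slo.target


availability = SLO(name="availability", target=0.999)
print(meets_slo(availability, good_events=99_950, total_events=100_000))
```

The design point is the separation of roles: the SLI function only measures, while the SLO object carries the agreed target that the team manages to.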

Now, here's where I think SRE really kind of moves the needle in terms of change management. If we know what the service-level objectives to be achieved are, then we can also establish error budgets. Remember "embrace risk"? Well, it has to be embracing intelligent risk. Error budgets are defined so that changes can pretty much happen at will as long as the service is within its budget.

You know about budgets. You may budget your own personal finances, and if you stay within your budget, you're fine. But if you overdraw, you're not so fine. There are steps that you have to take. Well, the same here. An error budget is meant to be spent. It's part of the self-regulating system where the team can agree that the code or aspects of the service need to be deployed. So it basically raises the definition of a standard change. But if the budget is breached, then there are consequences, and the consequences don't only affect the SREs; they affect the development schedule. They affect a lot of other things that happen all along the value stream.

So as long as the team is staying within the error budget, then, again, changes can happen at will, and there are thresholds and whatever. Highly dependent on automation, but it does increase the velocity of the releases. In many ways, it removes some of the human elements, like the change advisory board or the change approval board. It reduces the number of people that have to touch a change before it can be deployed, and it empowers the teams to be responsible for their own quality. Nobody wants to release something that is low quality, and it avoids issues like fatigue or contempt for change management, or even the desire to circumvent existing process. Personally, I think it's one of the coolest aspects of SRE.
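The error budget idea described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not Google's implementation: the function names are hypothetical, the budget is expressed as a count of failed events allowed by the SLO target over some window, and the only "consequence" modeled is that changes stop once the budget is spent.

```python
def allowed_failures(slo_target: float, total_events: int) -> float:
    """Error budget for the window: the failures the SLO tolerates."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int,
                     failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = allowed_failures(slo_target, total_events)
    if allowed <= 0:
        # A 100% target leaves no budget to spend at all.
        return float("-inf") if failed_events else 0.0
    return 1.0 - failed_events / allowed


def change_allowed(slo_target: float, total_events: int,
                   failed_events: int) -> bool:
    """Gate for this sketch: changes proceed only while budget remains."""
    return budget_remaining(slo_target, total_events, failed_events) > 0.0


# With a 99.9% target over 100,000 requests, roughly 100 failures fit
# in the budget, so 40 failures leaves room for more changes.
print(change_allowed(0.999, total_events=100_000, failed_events=40))
```

In practice a check like this would sit in a deployment pipeline so the decision is automated rather than routed through an approval board; how a breach is handled (freeze, slowdown, reprioritization) is a team policy that this sketch does not model.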

Now, event management is growing up as well, and so monitoring certainly is a key element of understanding how the service is performing. And monitoring still very much exists. We have to look at reaction to the performance, whether it's latency, what the traffic is like, any errors that have occurred, saturation. But those are mostly reactive. Now we introduce another level known as observability, which takes both an internal look at the individual components of the service, whether it's application or otherwise, and an outside-in look, an external perspective of observation. And developers can observe their code, and DevOps teams can observe the code, and certainly operations teams can observe the code. Observability is really rising as a new and interesting practice. I encourage you to learn more about it.

And then capacity management's always been one of those practices that happens, but nobody's really sure who's responsible for it. Well, in SRE, the SRE teams are put in charge of capacity planning and provisioning because, again, capacity is going to drive reliability. And so understanding organic growth, natural usage of the service that happens because of more transactions or just normal day-to-day business, and then inorganic growth that may be event-driven, it may be seasonal, it may be certain things that are happening at a time of the year. But it looks at capacity consumption based on certain kinds of events. In today's world of elasticity, managing capacity and having the skills and the guidance to understand how to manage capacity is important. It is critical to availability, and therefore, SRE assigns that responsibility, that accountability to the SRE teams, of course working with others as well.
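As a rough illustration of planning for both organic and inorganic growth, the Python sketch below projects demand and sizes capacity. Everything in it is an assumption made for the example: the function names, the compound monthly growth model for organic demand, the single multiplier for an event-driven or seasonal spike, and the 25% default headroom are all hypothetical, not guidance from the talk.

```python
import math


def projected_demand(current_peak: float, monthly_growth_rate: float,
                     months_ahead: int, event_multiplier: float = 1.0) -> float:
    """Project peak demand (e.g. requests per second).

    Organic growth compounds month over month; inorganic growth is
    modeled as a one-off multiplier for a launch or seasonal event.
    """
    organic = current_peak * (1.0 + monthly_growth_rate) ** months_ahead
    return organic * event_multiplier


def instances_needed(demand: float, per_instance_capacity: float,
                     headroom: float = 0.25) -> int:
    """Instances required to serve the demand with spare headroom."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    return math.ceil(demand * (1.0 + headroom) / per_instance_capacity)


# Six months of 5% organic growth, then a seasonal event that doubles load.
demand = projected_demand(1000.0, 0.05, months_ahead=6, event_multiplier=2.0)
print(instances_needed(demand, per_instance_capacity=300.0))
```

Keeping the organic and inorganic terms separate makes it easier to revisit each assumption on its own as real usage data comes in.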

And then incident management, mostly major incidents, sets up an incident command system. So very structured, sets up a command post, identifies an incident commander who's going to really structure how we're going to respond, who's going to do what, removes impediments. It's almost like a scrum master. And keeps a living document. And so that living document is available to everyone working on the incident. Could be a ticket in an ITSM system, but it avoids some of the delays that are associated with a traditional escalation by having a command post, particularly in events or incidents that have a major impact.

It also affects how on-call happens. So having an incident commander, knowing the SREs are usually the ones on call, and being able to manage that is essential as well. So again, very clear incident response, but it avoids that kind of management by running around, particularly when you're in a major incident situation.

And then the goal of incident management is to restore service, so being able to identify and remove the root cause of those incidents requires some guidance as well. In SRE, we don't call it problem management, it's called effective troubleshooting, but it takes almost a medical approach, a scientific approach of triaging the situation, examining the symptoms, diagnosing, at least coming up with a first diagnosis, testing different treatments, and ultimately finding a cure. And this is where blameless postmortems are so essential. They have to be a key aspect of SRE culture where we move away from indictment: who did this, what caused it, and look at what did we learn, and how can we avoid this in the future? How do we remove the root cause permanently so future incidents won't happen? And so again, movement to a blameless postmortem, I think, is also part of the autonomy or the empowerment that these teams will start to feel.

So those are basic ITSM practices you're likely familiar with. SRE also provides some pretty tangible guidance on emergency response, load balancing, security, software engineering as an operational practice, product launches, the human skills of communication and collaboration, and managing operational load. And all of this has to be engineered. You may think of engineering or engineers specifically when it comes to enterprise architectures or automation, but this is systems engineering. And systems engineering has a people, a process, and a technology element. And each of these is a key contributor to the quality of the service itself.

So the question I often get asked, particularly by those in the ITSM community, is, isn't SRE more technical than traditional ITSM? It's an engineering practice. Well, as I just said, it's systems engineering. We have to engineer the systems of service management. If we put a durable focus on engineering, where we understand the intelligent process that's needed for intelligent automation, we embrace the principle of elimination of toil and optimizing for automation, we develop more technical skills like Python, right? One of the top skills for site reliability engineers, the ability to write scripts, right? If you're on call in the middle of the night, to really look at it as an engineering role, then the answer is yes.

But we are IT, right? We are information technology, and all of us, and I'm the least technical person in the room, all of us have to be technical at some level. But again, it's mostly about developing an engineering mindset where we look at ways to improve, and we look at the service management architecture as a system of people, process, and automation.

SRE is on the rise. Three years in a row, DevOps Institute has run the upskilling community survey and report. The new report was recently released, and year over year, we're seeing an increased interest at the enterprise level in site reliability engineering as an operational practice. Individuals in the organization are moving into SRE roles. They're actively learning about it. According to LinkedIn in 2020, it was the fifth fastest-growing role that organizations were looking to hire. It's not necessarily a one-to-one, one SRE to a development team, but these are becoming key roles, key teams, and, most importantly, key perspectives, particularly within the enterprise.

It's a real job, right? Over 10,000 jobs were posted in the U.S. recently, most of which were paying $100,000 or more, right? There's some upskilling that may be necessary for you if you're looking at moving into an SRE role. Some software engineers from the development side of the house have moved into SRE roles. Systems administrators, systems engineers, automation architects have also moved into these SRE roles. So it's a real job, right? It's a real job. It's got a real job description, and I encourage you, if you're looking at your own personal career growth, to learn more about SRE as a role. Some will call it a reliability engineer. But there's an offshoot of network reliability engineers and customer reliability engineers. But at the end of the day, the core practices and principles are very similar.

So with that, I want to thank you. As I said, I've watched the evolution of the IT community for a long time. I think SRE was born out of real-life practices. As I've said several times, you probably aren't Google. Maybe you are. I think there's a lot of good knowledge and education. There's peer-to-peer shadowing, right? SREs shadow developers, developers shadow SREs.

I think it's really cool and innovative, and it doesn't supplant or throw away existing ITSM practices. Remember what I said. No framework is perfect, right? SRE is not perfect. ITIL is not perfect, right? Agile is not perfect. DevOps is not perfect. As humans, our mission is to adopt and adapt.

So I hope today you've looked at engineering IT service management through the lens of the site reliability engineering practices, and maybe it sparked some ideas for you in terms of how you personally or your organization can start to move forward, particularly if you are in the midst of a digital transformation.

So with that, I thank you. I thank the IT Revolution team for inviting me to be here again. It's really a delight, and I hope to take your questions, and I hope to meet all of you in person someday soon. Thank you very much. I'm Jayne Groll, CEO of the DevOps Institute. Appreciate it, and have a great day.