London 2020

Demystifying DevOps & SRE

SRE is a large, complex topic, so we’ll start with common terminology and theory, then dive into practical examples—including lessons learned from our own journey here at Datadog.


We’ll further cover the relationship between SRE and DevOps, what success looks like (and how to measure it), and how to identify and nurture both internal and external talent in order to build a cross-functional team.


This session is presented by Datadog.

Full transcript

The complete talk, organized by section.

Daniel Maher

Hi, everybody. My name's Daniel. I'm a developer advocate on the community team at Datadog, and for the next 30 minutes or so, I'm going to be talking a little bit about DevOps and SRE.

Just to give you some context, I am a recovering system administrator, and I spent most of my career either building and maintaining infrastructure or coding. I can say without hyperbole that when I first discovered the DevOps movement about a decade ago now, it absolutely changed my life, as it did for so many people. And then when I encountered SRE, there again too, it was like a massive revolution for me and for the industry.

And so I'm pleased as punch to be able to share a little bit about our journey at Datadog and my journey, and hopefully help you on your DevOps and SRE journey as well.

So if you'll permit me a brief moment, I'm just going to share my screen so that we can get some slides started. All right. So this is not a talk about Datadog, but it is incumbent upon me to at least explain to you what Datadog is.

So Datadog is an observability platform that provides full visibility across your entire organization. We span end-to-end from your infrastructure and your network, to your applications and your services, all the way to your end users, wherever they may be, on mobile, in a data center somewhere.

This enables everyone, ops and developers certainly, but security people, finance people, HR if you like, anyone in the business, to have a shared understanding of your systems and the ability to communicate upon and resolve problems when they arise, and ideally before they arise.

Also, for more information about how Datadog can help your organization or to sign up for a 14-day free trial, no credit card required, visit Datadoghq.com. All right. That's the spiel out of the way. Let's get into the meat.

Topics

What are we going to be covering today? Essentially four things. All right. And this is as good a place to start as any, is what are those four things? First things first, and maybe the most important takeaway, we want to clear up some confusion about what DevOps and SRE mean. These are loaded terms. Very heavyweight.

Lots of baggage, right? So I just want to clear the air, when I say DevOps, when I say SRE, when Datadog says them, what do we mean when we say them? And by the way, these aren't necessarily our definitions, they're fairly industry standard, but I do want to make sure we get through them before we continue.

We're going to move on to talking about teams, people, humans, and how those people can be organized together to achieve success. What does that look like within the context of DevOps and SRE?

And of course, those teams, as I mentioned briefly, are made up of people. So where do those people come from? Who are they? How can you nurture those people, those relationships, those structures? How do you put them into place?

So we're going to talk about that a little. And then finally, we're going to wrap up with some practical suggestions, hard-won lessons and truths from the trenches of my own career as well as at Datadog, and share some pitfalls and things to avoid on your journey going forward as well. Okay, great.

Clearing up confusion about DevOps and SRE

So first things first, let's start right there. To clear up some confusion about DevOps and SRE. A key part of effective communication is having a common language. I'm speaking English right now, and hopefully, you're understanding most of what I'm saying. If I were speaking French or Portuguese, perhaps you'd understand less. In other words, words have meaning, right?

And unfortunately, when it comes to the terms DevOps and SRE, the internet as a whole has played pretty fast and loose with those meanings, with those definitions. So I'd like to establish what our common language is right now.

DevOps is a professional and cultural movement that focuses on openness, sharing, and mutual respect. It seeks to improve the quality of life for its adherents and practitioners, for their company, for their organization, for their customers, for anybody participating in sort of the grand dream of DevOps.

An interesting aspect of improving quality of life is the idea of availability and reliability, and that's where SRE comes into play: how can you ensure that systems can be trusted? How can you ensure that people have the confidence that the systems and services that are in place will be there when they need them? And that's actually an important part of DevOps in a way, and certainly a key element of SRE, and we'll start to see how these two things sort of get put together.

DevOps and CAMS

We talk about DevOps. We could talk about a lot of things, but I like to talk about an acronym. We're technical people. Technical people love an acronym. So how about this one?

CAMS, C-A-M-S. In no particular order, well, perhaps a particular order, since if you were to spell it S-C-A-M, that'd be a word that nobody would want to associate with their business. So CAMS, again, is an acronym. The C is for culture.

Right? And a full discussion of culture is well out of scope for this presentation, but let's talk about it in terms of the way that people choose to organize and the social contract that they put in place with one another. Then there's A, automation.

This is a big one, and it's the one that I think is, in many ways, easiest for technical people and technical organizations to sort of get behind. But it doesn't necessarily just mean writing Bash scripts, although that could certainly be a part of it. When we talk about automation, we're talking about unlocking human potential. All right?

We're talking about allowing the computer to do what the computer does best and allowing the humans to do what the humans do best. Repetitive tasks, things that could be scripted, things that would be better if they were scripted, should be done by computers.

One, because it's more efficient, but two, because it unlocks the person that was formerly doing it and allows them to express their full creative potential within the organization, and that's powerful. M, measurement, sometimes metrics.

I prefer measurement because there's a lot of things to measure that aren't metrics. The idea here is not just measuring, but knowing what to measure and knowing how to measure it, and critically, how to interpret what's being measured.

Super important when we talk about DevOps is measuring it and doing it well. And finally, S, sharing. It's all well and good to have this excellent culture and automate things and to measure all the things you've automated, but unless you're sharing that information and helping people around you to also improve their quality of life through the DevOps practices, then you're not really succeeding.

As a sidebar, you may have seen sometimes an L kind of crammed in here, CALMS, SCAMEL. The L stands for lean. I have no particular opinion on whether there should be an L in here or not, but sometimes you'll see one in here. So there you go.
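
As a minimal sketch of the measurement point above (this is not from the talk), the snippet below takes one thing worth measuring, request success, and turns the raw number into a statement a team can act on. The `RequestCounts` type, the sample counts, and the 99.9% target are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical counters for one service over some time window."""
    total: int
    failed: int


def availability(counts: RequestCounts) -> float:
    """Fraction of requests that succeeded; 1.0 if there was no traffic."""
    if counts.total == 0:
        return 1.0
    return (counts.total - counts.failed) / counts.total


def interpret(counts: RequestCounts, target: float) -> str:
    """Compare the measured availability with an agreed target."""
    measured = availability(counts)
    if measured >= target:
        return f"meeting target: {measured:.4%} >= {target:.2%}"
    return f"missing target: {measured:.4%} < {target:.2%}"


# Made-up numbers for one week of traffic.
week = RequestCounts(total=1_000_000, failed=1_200)
print(interpret(week, target=0.999))
```

The measuring itself is one division. The interpretation step, comparing against a target that people have agreed on and can share, is where the "how to interpret" part of measurement shows up.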

SRE as implementation

We talk about SRE, then. We can't really talk about SRE without mentioning the elephant in the room, or perhaps the big lizard, I guess, in the room, and it's this book, the "Site Reliability Engineering" book, and a handbook that was released along with it not long ago.

This came out in 2016, and it's a tome of how Google runs their systems, or at least how a part of Google ran some of their systems up to 2016. Now, I mention this because, again, it's basically impossible to talk about SRE without talking about this book because it's hugely influential in how we understand and design SRE programs and environments today.

But I think it's important also to note that this is just one interpretation, and it's an important and good one, but it's one interpretation. And how a part of Google did something in 2016 is not necessarily how you should do something today, although it's a good starting point.

It could be said that DevOps is an idea and SRE is an implementation. I said it could be said that because not everybody agrees with that, but it's an interesting framework to think about it. I'm not going to talk about this book anymore, but if you haven't read it, I suggest you at least do take a look at it.

So we talk about DevOps, we're talking about ideas. Okay? And that's the key element that I'd like to get across there. And when we're talking about SRE, we're talking about practicalities. We're talking about implementations.

We're talking about actually doing something. Right? And so if we have a philosophy and a practicality, we have DevOps and we have SRE. Fair enough? All right. So we've cleared up some confusion about DevOps and SRE.

Team and organizational structure

We're going to move on to topic number two, team and organizational structure. In order for DevOps to succeed, in order for SRE to be effective as a practice, you need to consider how your organization is structured.

This is actually super important. You can't just declare DevOps victory and name some team the SRE team and, "Whoop, whoop, we're done." Right? It's not how it works, okay? You can't just rename something and then it's something else. That's not...

Anyway. So let's talk a little bit about teams and organizational structure. At Datadog, as well as in many organizations, we can think about organizing people in three different ways that are pertinent to SRE and DevOps in particular, and that's product teams, squads, and guilds.

Product teams, squads, and guilds

So starting from the top, we'll talk about product teams. Teams are big. Sometimes really big. And every single role and function and responsibility that needs to be addressed is accounted for by at least one person on that team. If you can actually reasonably state these things and have them be true, then you may very well have a product team already in your environment.

It's self-sustaining. It's autonomous. Everything that that particular vertical needs to survive and function and succeed is actually contained within a single team. An interesting element of the product team model that applies in particular to DevOps is it's how DevOps scales.

There's a lot of questions around, okay, well, DevOps works in startups or in really small companies, but how could DevOps work in a really big company? Product teams are one of the ways that DevOps scales to large companies, to enterprise companies.

So when we talk about DevOps in the enterprise, we are also very likely talking about product teams. But, and this is a big but, the product team is not the end-all be-all. Product team is just one aspect of that organization, and it's a big one, both in terms of importance and size. But sometimes you don't want to be talking about something as massive as a product team.

It's just ungainly. Sometimes you have a little problem. Or even a big problem that you don't need 100 or 500 people or however many people to be targeting on. And that's when we talk about something a little bit smaller.

That's where we talk about squads. Squads are short-term groups that are organized to solve a single goal or problem or accomplish sort of a single unit of work that wouldn't fit neatly onto a team. Right?

A team's scope is vast and large, which makes sense. The team itself could be vast and large, right? The squad, again, might focus on a single thing, like a single OKR or one particular intractable problem.

Ideally, that single OKR or intractable problem would benefit the team, right? If it doesn't benefit the team, why are you doing it? Ideally, though, it's going to benefit a large number of teams, because even across a large and varied organization, there's going to be commonalities between different product teams, between different product groups, and so that means there's going to be commonalities in terms of problems and challenges across those groups.

Assembling a squad to really focus in on a specific challenge or problem or outcome is a great way to benefit across teams in a product team structure. I mentioned that they could be focused, for example, around an OKR.

To put it in more concrete terms, at Datadog, we've had squads organized around such things as recruiting, right? How do we build good coding tests, for example. Analytics. Right? How do we actually help our customers to better understand this particular data science issue, right? We even form squads around hackathons, ideas for hackathons, how we want to do them, what sorts of cool outcomes we want to get from them.

The key element here is that squads are short term. All right? They have a defined beginning, and they have a defined end, and that defined end is not only goal-oriented, it's probably also time-boxed. Right?

You don't want a squad that just goes on forever, because then that probably means that the scope creeped or that the goal you defined was just the wrong one, right? So a small sort of tiger team centered around a very specific goal or outcome, so that there's a definition of done that's easy to explain there, and that has a time box around it. Those are some good ways of figuring out, okay, what's a squad, and did we make a good one?

So we talked about the very large, right? That's the product team. We talked about the very small. That's the squad. But is there something in the middle there? Of course, the answer is yes, right?

It's a leading question. And that's guilds. So what's a guild? A guild owns and shepherds an important part of the organization, of the architecture, of the culture that crosses many teams. Guilds are larger than a squad, but generally smaller than a product team, for values of product team.

They're semi-permanent versions of a squad, maybe as a way of thinking about it. I mentioned that squads have time boxes and fixed outcomes. Guilds can have time boxes and fixed outcomes as well, but there is an opportunity to make those scopes larger, right? They can exist for longer.

They can try to accomplish a series of goals, for example. And I mentioned that they own and shepherd an important part of the organization, right? They cross product teams. They could be formed of a series of squads from a series of product teams, for example, or could operate independently.

They work on things like culture. They work on things like standards around automation. They work on things like, how do we interpret these measurements, and how can we best share the results of our findings across the organization? You're going to have a lot of different stakeholders here.

You're going to have a lot of different roles and responsibilities represented within a guild, because you need to have that plurality, that diversity, in order to properly assess the success and suitability of a guild's outcomes for the organization. So we have the teams, right, responsible for the product.

We have the squad responsible for a little problem, and then we have the guild responsible for kind of stuff that's in between. This is how we've organized it at Datadog with a, frankly, quite a good degree of success and how we've seen a lot of our customers do it as well, and this has worked particularly nicely in large scale and enterprise environments.
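
The squad described earlier in this section has a defined beginning, a defined end, a single goal, and a time box. As a rough sketch (not something shown in the talk), those properties can be written down as a small record with one check. The `Squad` class, its field names, and the dates are hypothetical; the goal text borrows the recruiting example mentioned above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Squad:
    """Hypothetical record of a short-term, goal-oriented group."""
    goal: str          # the single definition of done
    start: date        # defined beginning
    deadline: date     # defined end (the time box)
    done: bool = False

    def needs_review(self, today: date) -> bool:
        """True when the time box has expired without the goal being met."""
        return not self.done and today > self.deadline


# Illustrative values only.
squad = Squad(
    goal="Agree on a shared coding test for recruiting",
    start=date(2020, 1, 6),
    deadline=date(2020, 3, 27),
)
print(squad.needs_review(date(2020, 4, 1)))
```

A squad that trips this check is the "goes on forever" case: either the scope crept or the goal was the wrong one, and it may be a candidate to become a guild instead.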

SRE teams

So you might stand up an SRE team, right? And this is one way that SRE is distinct from DevOps. You may have a specific role or a dedicated team called SRE in a way that you would not have a specific role or a specific team called a DevOps or a DevOps team.

That said, there are different ways that this might take form. We think about the cohort of site reliability engineers. So a group of humans, right, organized and managed as a group. They're going to form an SRE team. So you could have an SRE, a site reliability engineer, on an SRE team in a way that you couldn't have a DevOps on a DevOps team. It just doesn't make sense, right?

SRE teams are versatile. This is one of the key aspects of a good SRE team. They can participate in a variety of things. Code reviews, right? Incident reports. They can help facilitate postmortems. SRE teams can be involved sort of in every aspect of the life cycle of a product or application, right from whiteboarding the idea to retiring it in five years.

The SRE teams could support a dedicated portfolio of products and services, of functionality within the organization.

They could exist solely to support a particular product team or series of product teams, for example. And any one individual SRE could rotate through these functions. You could have an SRE, for example, that is primarily assigned to a specific product team, but maybe they're just getting bored, or maybe their skills and talents could be used elsewhere. Don't worry about what the hierarchy looks like. SRE teams are versatile. Move them around.

Figure out what works. This is one of the ways that you can feel that an SRE team is being successful, if there is this concept of mobility and versatility. So we talk about DevOps as an idea, and again, you can't have a DevOps.

It doesn't exist. We talk about SRE as a practice because you can have an SRE. You don't organize DevOpses into roles in the way that you organize SRE into roles. So here we're, again, talking about what these words mean.

Finding and growing SRE talent

So we've talked a little bit about, again, these words at length, teams and organizational structures, how these things can be organized. When we talk about people, we start to get into the real nitty-gritty of it.

Where do these people come from? How can we, not only as an industry, but any one particular organization, find and grow and nurture the talent necessary to successfully implement DevOps principles and, more concretely, successfully build and see the longevity and power of a good SRE team?

So we'll start at the start. SREs are people. They're human beings. A DevOps is not a person. An SRE, that's a person. And people have personalities, sometimes very strong personalities. And these strong personalities can come out in a variety of different ways.

Anybody could potentially do anything, but some people are happier doing some things than others. I, for example, really like being on stage. I really like getting up in front of people and waving my hands around and getting excited. Not everybody likes doing that, although, of course, anybody could.

It's the same with SREs. People who are successful in the role of SRE tend to have certain personality attributes, although this is not an exhaustive list, nor is it an exclusive list. But from what we as an organization have seen within our walls and outside within our customers as well, these are some common personality traits among successful SREs in the industry.

First one is, well, just patience. Patience for staring at code, patience for staring at infrastructure, patience for being able to analyze just absolute masses of data for long periods of time. Patience is a critical aspect of any one successful SRE and any successful SRE program. You have to be willing sometimes to wait things out. And that's not for everybody.

I'm personally not a particularly patient person. It's something I'm working on. But patience is a big part of being successful and ultimately happy in this role, and you want your employees to be happy. That's super important.

Most SREs enjoy problem-solving, and I know that sounds sort of banal. Show me a job requirement list that doesn't include problem-solving on it somewhere. Yeah, of course. So here when I say that most SREs enjoy problem-solving, I don't just mean in the sense of doing crosswords or something like this.

The trick about being an SRE is you're often working on systems that were not maybe necessarily designed by you, were not necessarily implemented by you, although, of course, this could be true. Systems that were not necessarily programmed by you, software that you've maybe never seen before.

And so that aspect of diving into the unknown, that aspect of working with software and systems and processes and people that you don't necessarily have any control over, has to be enjoyable. And again, it's not for everybody.

But for the people who do enjoy it, it's just aces. So this is another personality aspect we see oftentimes in successful SREs. The capacity for self-teaching and self-learning is super important. You cannot currently get a degree in SRE, in the way that you can get a computer science degree or a computer engineering degree or a mechanical engineering degree.

You can't get even a diploma in SRE. So for the time being, this is something you have to want to teach yourself and want to learn from others who are themselves self-taught to a large degree. And again, having not only the capacity, but the desire there is super important. Most SREs have a wide range of technical interests.

I would say that most technical people do, but in the SRE world, that's super critical. And again, it comes down to having to deal with an incredibly diverse array of things getting thrown at you sometimes. And that has to be enjoyable.

Most SREs have found success not only with their technical aspect, but also with the human aspect of things. Big part of being an SRE is communicating, is teamwork. Right? So good SREs are good at communicating in different ways.

Text, to peers, to CTOs, to the outside community. Right? Good communication skills, super important. I mentioned that they work on teams. SREs need to be teammates, but also team builders. Building up that mutual trust and that confidence between their roles and their functions, and the roles and functions of other people within the organization.

And finally, as I mentioned, the world of SRE is something you have to teach yourself and be taught by people who taught themselves. And so a big part of SRE is that mentoring and teaching aspect. Also super critical.

SREs have backgrounds. Right? All SREs have backgrounds. And just to give you an idea of some of those backgrounds at Datadog: certainly from traditional ops and dev backgrounds, but we've also hired into the SRE program from our own customer success agents.

We have at least one person in the SRE team who has a PhD in computer science, and at least one who dropped out of high school. So again, the backgrounds there are varied. Don't fixate on where these people came from.

Worry about... Not worry, concentrate more on their personalities, and their capacities, and their desires, and what's enjoyable. That's how you're going to identify that good SRE talent. In other words, great SRE talent can come from anywhere.

Practical suggestions and pitfalls

So last thing that I want to touch on today, just some practical suggestions about how to implement DevOps practices, and in particular, SRE within your own organization, and talk about some pitfalls that we've seen, not only at Datadog, but also outside in the industry as well.

So first things first. Is SRE a standalone team? Is it an embedded team? Is it something else? Well, in the words of my esteemed colleague, Waldo, I hate to give an, "It depends," but it depends mainly on the talent and personalities that you have access to. Right? The important thing here is that you're willing to try some different modes. You're willing to try some different ways of looking at it, and you're willing to run those experiments to see what works within your own organization.

This is a super important part of not only DevOps, but of SRE, especially if you're building it from scratch: you have to be willing to experiment, and you have to be willing to go, "Okay, this failed, but that's okay because we learned and we're going to try again." And this is where that CAMS acronym really comes into play, when we're talking about developing SRE.

And as much as I'd like to give you a definitive right way, it just doesn't exist. Right? You have to be flexible. I have a quote here I'd like to share with you.

"Archaeology is the search for fact, not truth. If it's truth you're looking for, Dr. Tyree's philosophy class is right down the hall." This is a great Indiana Jones quote. But to take it into the context of DevOps and SRE, we might say that SRE is the search for fact, not truth.

If it's truth you're looking for, Gene Kim's DevOps class is right down the hall. Right? Pitfalls. This could be an hour-long presentation on its own. Right? We've run into problems, and we've seen other people run into problems as well.

And so what I'd like to do is just share with you some common things to avoid. And I would say the number one thing that you're going to want to avoid is dogma. All right? Don't look at how another organization has implemented these things as the rigid, engraved into stone tablet way of how you should be doing it. You need to figure out what works for your organization, for your reality, with the talent you have today. Right?

Don't fall into the trap of dogma. Look to others for suggestions, come out to the conferences, go to the meetups, talk with people, read the blog posts, bring all that information together, and figure out how DevOps and SRE could look and work within your organizations. And again, be willing to run those experiments.

On the topic of dogma, you'll see a lot of chatter about how, "Oh, if you're doing SRE, you need to be doing it in Go." Or, "Oh, Rust is the one true path." Again, no. Right? Look at what's working within your organization: what programming languages and competencies people are comfortable with. Right?

You're also going to be bombarded with information from consultants who want to sell you a DevOps. No one can do that. Right? It's impossible to purchase.

You have to build it yourself, and it's a very long path that never really ends. Right? It's ongoing. And that's just something that you have to accept and embrace, because that journey is empowering. All right, so what did we talk about today?

We talked about DevOps and SRE. What do those terms mean? We talked about team and organizational structure. We talked about how to find and nurture people within those structures, and I gave you some tips and some things to avoid.

I hope you found this enjoyable, or at least informative. I had a lot of fun getting this all prepared for you, and I would like to thank DevOps Enterprise Summit for giving me the opportunity to share with you here today. Normally there's a Q&A that's going to be happening.

If not, you can see my Twitter handle over there in the corner, and you can see it on most of the slides, @phrawzty, P-H-R-A-W-Z-T-Y. Feel free to hit me up. That's it for me. Thank you very much.