Lessons Learned from a Parallel Universe

San Francisco 2016

Just within the last ten or so years, we have seen at least two separate communities evolve at the crossroads of development and operations.

The first, DevOps, grew up very much in public; the second matured sequestered within the halls of “special” companies like Google and Facebook and is only now starting to gain visibility and traction in the wider world.

The DevOps and Site Reliability Engineering (SRE) communities barely speak, yet both have common ancestors and much to offer each other. Let’s look at what they have in common, how they differ, and the key things we can learn from both.

Full transcript

The complete talk, organized by section.

David N. Blank-Edelman

Hi, my name is David Blank-Edelman. I am a technical evangelist for a company called Apcera. If you haven't ever heard of us, we're in San Francisco. We're a company that DevOps typically really love because we make this software that's container management software that allows you to take whatever you want, from a shell script to an operating system, and run it either in the public clouds or privately, move it around.

And the reason why DevOps love us is that we have a policy engine in the middle that lets you set up guardrails for your users. But we're not really going to be talking about that. What we're going to be talking about today is another one of my passions.

So I'm really into SRE. And if you don't know much about SRE, you're in the right place because we're going to be talking about that sort of stuff. I'm one of the co-founders of SREcon, which is this now global set of successful SRE conferences that take place all around the world. And in fact, we have one of the chairs of 2017 in the room right now, and I'll out him later. Whoop, step.

So let's just get into this sort of thing. How many people here know roughly what the term SRE is? How many people don't? It's okay to wave your hand. Great. Okay, awesome.

So here. Here's the answer to the question, what is SRE? Can anybody tell me what this is?

That is a blank screen. No, try again. That's a lighted screen. This is what happens when your application fails, and this is what people see when your application fails. Or maybe they see this. And I apologize for putting this up because I know it's causing a few of you to have these little twitches.

Here, let me go back to the PHP version of the error screen that looks a little bit like this.

So the idea behind SRE goes like this. Once upon a time, Google hired a person by the name of Ben Treynor. He had a software engineering background, and he was trying to figure out how to do operations sort of stuff. And he realized pretty quickly that one of the things you should be optimizing for in any environment is reliability, hence the R in the middle of SRE, Site Reliability Engineering.

Because it was Google and they had a site, that's where the site comes in. But it's not necessarily just about websites.

And so what he realized is that you can have the most feature-full app you want, but if it is not up and if it's not reliable, it's not worth, well, whatever word you like, to anybody. And so the idea was, is it possible to create a group that would engineer failure out of your system to make things more reliable, to focus specifically on that? And so that's where this came from.

And really, he was dealing with a couple of things that you all know very well as DevOps people. He was dealing with this sort of tug and pull between the developers and sort of the operations people. So on one hand, you have developers whose job it is to write software, to make new features, to build things, to continue to iterate. And you have this notion that that's what they do for a living.

And on the other hand, you have this set of people whose job it was to make sure whatever was running would stay stable and would be reliable. And so you have this tension, this fundamental tension where somebody's job is to make things change as much as possible, and the other one's job is to say change can be a real problem.

And so this was one of the things that they were attempting to address. And similarly, they were also attempting to address this problem here, right? This problem of going from laptop to production in such a way that things would work.

And finally, the other thing that I think that you can safely say is that they quickly realized at Google that SREs do not scale. People do not scale linearly with the amount of work that is laid on them. And we all know this, right?

So the question is, what do you do about this? And all sorts of stuff that you might talk about would be things like automation and stuff, but they had a different take on this at the time.

Before I go more deeply into sort of some of the things that are useful to know about SREs, I want to just get this out of the way. One of the experiences I know that I have when I go to these conferences is that you feel like, okay, I just got the DevOps things down, right? It is the case that my child calls me Scrum Master instead of Daddy, or once a year, we have a nativity play where we act out The Phoenix Project at work, or you've just got all the automation. You've got all this stuff going, and you walk into one of these sessions and somebody says, "Here's a new possible way you could be thinking about or doing things." And you're like, "Aah."

So I want to be clear that I am not attempting to tell you you need to change anything about what you're doing. What I'm trying to do is bring to you from this other parallel stream things that might improve your DevOps world. And I do believe that there are goals in the middle. So that's thing number one that I sort of want to get out of the way.

The other thing I want to get out of the way is represented in this picture. Progress of man, right? We've all seen this picture. In the early days of DevOps, there was some notion, and it's not really the case now, where somehow on this end of the spectrum, there were the sysadmins, and that the DevOps peoples were the ones that were standing upright and had evolved out of our primitive ancestry of sysadmin stuff.

And so what I want to tell you is that I'm not here to tell you that now DevOps is going to somehow evolve into SRE. Like I said, I think they're parallel streams. I don't think that one is meant to be super cool, better, and uses more of the brain than the other. Okay?

So can we agree that this is all fine and you're going to be okay as we move forward? Yes? Yes. Okay.

So, let's go. Let's talk a little bit about this. The place I want to start talking about SRE stuff is with the folks and with the work that had been done at Google, but I'm not going to stay there entirely.

So there is a book out, and I'm showing you this book. If you like the sort of stuff I'm talking about and are curious about stuff and you want to hear how Google thinks about things, this is a lovely book. I will be clear on this topic that I reviewed it, and I am an O'Reilly author, so I have some bias about this. But if you want to know about SRE stuff, this is a nice place to go deeper because we're not going to be able to go very deep in this talk. Only have about 25 minutes.

Okay. So the first question that often comes up when I start saying this is you're like, "Well, this is awesome. This is Google. How does this matter to me?" Right?

Because I know that I used to have this attitude that was like Google would show up and they would talk at a conference like this, and it was like somebody showing up with a Gila monster at a reptile show. Right? You'd go, you see the session, it would be like, "That's really cool. It's amazing what Gila monsters can do." And then you would go back home to your geckos and it would have no connection to you.

And my assertion is these days that we're all figuring out that, congratulations, scaling is important. Congratulations, dealing with automation is important. Dealing with failure is important. And so it's not just that. And I also want to say that in this talk, I'm not going to talk just about what Google has done or thought about, but I just wanted you to sort of start there because I think there's sort of a bias. And I'm not trying to say Google is great. And I'm not an employee. There are ex-employees in the room, and so they can correct me if I'm wrong.

So the place that I like to start when I talk about SRE is a talk that was given at SREcon in 2014. So the very first talk at the very first SREcon was given by Ben Treynor, the guy I told you that had started the SRE group at Google. If you want to hear an hour's worth of sort of the best concentrated version of what SRE is, I would watch this talk. In fact, I like this talk even better than what's in the book, because I think it just lays it out.

And so what I'm going to do is I'm going to shamelessly crib from Ben Treynor's talk. He put up a slide that looks like this. This is my dramatic recreation of his slide. I never asked him for his slide. I just simply typed it myself.

There's a lot of stuff on this slide. I'm not going to expect you to read this, but we are going to go through some of it together. And the place I want to start is with these three things. And if I want to give you one of the things that I think, one of the concepts that I thought was particularly cool from Google, it's incorporated in these three things.

So this is what they suggest. They suggest first, you want to have an SLA for your service, which is defining what the service level you expect is, and we'll dig into that in a moment. Then once you have that, it's time to measure and report performance against that. Okay? And then finally, once you do that, you can do something known as an error budget, and I'm about to tell you what an error budget is. Okay?

Everybody take the picture of this, by the way. Do you want to pose with it?

Okay, so here's what I mean by error budgets. I know it said SLA on a previous slide. In Google, they talk more about SLOs, service level objectives. And the idea is you're going to launch something, or you want to take a service that exists. The first step is to figure out what are the expectations around how reliable it will be. Okay?

So you have to define for yourself things like some definition of, I'm expecting this service to be up 90% of the time. Okay? Because most of our software that we write, except for, say, the things that keep the airplanes flying and the things that are like running your pacemaker, don't necessarily have to be up 100%, right? We all have maintenance downtimes and stuff like that. Or maybe once a year, you can take it down for 10 minutes.

So you first figure out what level of reliability must this thing have? And you want to be really good about that. The other thing you also want to do is define how do we know? What does it mean to be up? Right? What's the throughput or other way? There are lots of ways to measure what's it mean to be up, but the first thing is you sit down and you figure this out sort of really, really carefully.

Once you have that, the next step is to take those metrics and to start monitoring that service. Okay, so far so good? Based on those metrics. Okay? And everybody agrees. You set up a monitor and everybody agrees this will be the canonical monitoring of it, and these are where the numbers are going to come from, and we all agree that this is correct and transparent, etc.
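To make the "measure against the agreed definition" step concrete, here is a minimal sketch. It is not from the talk or from Google. It assumes that "up" has been defined as "a health probe succeeded" and that the canonical monitor records one True/False result per probe; the names availability and meets_slo are hypothetical.

```python
from typing import Iterable


def availability(probe_results: Iterable[bool]) -> float:
    """Fraction of probes in the window that found the service up.

    Assumption: each element is one probe from the agreed-upon monitor,
    True meaning "up" under whatever definition the team settled on.
    """
    results = list(probe_results)
    if not results:
        raise ValueError("no probe results in this window")
    return sum(results) / len(results)


def meets_slo(probe_results: Iterable[bool], slo_target: float) -> bool:
    """True when measured availability is at or above the agreed target."""
    return availability(probe_results) >= slo_target


if __name__ == "__main__":
    # Hypothetical window: 19 successful probes and 1 failed probe.
    window = [True] * 19 + [False]
    print(availability(window), meets_slo(window, slo_target=0.90))
```

The point of the sketch is only that the numbers come from a single agreed source; a real definition of "up" might be based on throughput or latency instead of a simple probe.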

And then once you've done that, then you can start using that information to decide things like, how often am I going to release new software?

So, for example, if you go and you look at your metrics and it shows that in this particular case that you have not exceeded your... You said you needed to be up, say, 90%, and it's been up 95% in this time period, awesome. So go ahead and push a new piece of software out. Go ahead and push out a new version. Because one of the things that we all know sort of intuitively is that one way to perturb reliability is to update things, right? Potentially. So you have the ability to know, well, okay, I'm still within my error budget. We're still cool. We still have this amount that we can use up.

However, if in fact your service has been down more than your error budget, then you go, "Wait a second. Sorry. We have all agreed on what level of reliability it must have." Then we can start making this decision that says, "Okay, I'm not going to push out a new feature for a while." And then maybe what the group does is they go back and they figure out what was causing this unreliability. Maybe that's what they have to spend time on for a little while instead of features.
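The error budget arithmetic described above can be sketched in a few lines. This is an illustration under stated assumptions, not Google's implementation: the budget is expressed as minutes of allowed downtime in a fixed window, and the class ErrorBudget and its method names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float      # agreed availability, e.g. 0.90 for "up 90% of the time"
    window_minutes: float  # length of the measurement window, in minutes

    def allowed_downtime_minutes(self) -> float:
        """The budget: how much downtime the agreed target tolerates."""
        return (1.0 - self.slo_target) * self.window_minutes

    def remaining_minutes(self, observed_downtime_minutes: float) -> float:
        """Budget left after subtracting the downtime actually observed."""
        return self.allowed_downtime_minutes() - observed_downtime_minutes

    def can_release(self, observed_downtime_minutes: float) -> bool:
        """Push new versions only while some budget is left."""
        return self.remaining_minutes(observed_downtime_minutes) > 0


if __name__ == "__main__":
    # Hypothetical 30-day window (43,200 minutes) with a 90% target.
    budget = ErrorBudget(slo_target=0.90, window_minutes=30 * 24 * 60)
    # Up 95% of the time means 5% of the window (2,160 minutes) was downtime.
    print(budget.can_release(observed_downtime_minutes=2160))
    # Up only 85% of the time means 6,480 minutes of downtime.
    print(budget.can_release(observed_downtime_minutes=6480))
```

In the first case the service is still within budget, so releases go ahead; in the second the budget is overspent, which is the signal to pause features and work on reliability.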

And what this is doing is this is fundamentally creating a situation where everybody is working towards the same thing. They're all working towards the notion that this is how reliable we expect it to be, and we all know what that means. And then if we start falling off of that, we're all going to put energy on that next, and then you can come back and add more features.

Okay? So, that's to my mind, really cool.

Another thing, also in this sort of realm, that I think is useful to know is that they have a common staffing pool for SRE and dev, and excess ops work... I'm just reading this and then I'm going to explain it. Excess ops work overflows to the dev team.

So if it is the case that the service that you're running is generating more sort of operational load, or they might use the word toil, then what should happen is that excess tickets over a certain amount will go back to the developers. Now all of a sudden, the developers will be responsible for some percentage of the tickets.

And this is similar to the situation that says if in fact you hand your developer a pager and that pager goes off at 2:00 a.m., it will be the last time that particular bug manifests itself in production.

And so you have this sort of virtuous cycle where everybody is working towards this, and they cap their operational load, just strictly the toil stuff, at 50%. If it gets over 50%, then they take action.
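One way to picture the cap and the overflow together is the sketch below. It rests on an assumption the talk does not spell out: that the "certain amount" at which work overflows to developers is the same 50% cap on toil. The function name split_ops_work and the hours-based accounting are hypothetical.

```python
from typing import Tuple


def split_ops_work(
    ops_hours: float,
    sre_total_hours: float,
    toil_cap: float = 0.5,
) -> Tuple[float, float]:
    """Split operational load between the SRE team and the dev team.

    Assumption for illustration: SREs absorb toil up to toil_cap of their
    total hours, and everything beyond that overflows to the developers.
    Returns (hours kept by SRE, hours overflowing to dev).
    """
    sre_limit = toil_cap * sre_total_hours
    sre_share = min(ops_hours, sre_limit)
    dev_overflow = ops_hours - sre_share
    return sre_share, dev_overflow


if __name__ == "__main__":
    # Hypothetical week: 40 SRE hours available, 30 hours of tickets arrive.
    kept, overflow = split_ops_work(ops_hours=30, sre_total_hours=40)
    print(f"SRE keeps {kept} hours, dev team takes {overflow} hours")
```

The feedback loop is in the second return value: the noisier the service, the more of its operational cost lands back on the people who can fix the cause.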

So the observation that I want to make here in general about SRE is that it's really cool to set up these sort of virtuous and reinforcing feedback loops. Senge? I don't know how to pronounce Peter's name. Somebody can help me with that because I've only seen it written. It has these lovely sort of loops that are meant to be good feedback loops. And the notion is if you can build into your world these feedback loops that do the right thing and continue to make things better, then that's a really good thing.

Okay. And then the last thing out of these things, and we can talk about the middle ones if you want at some point, is something that has shown up, I think, really in a lovely way in the DevOps community. We've all realized that you probably want to do a postmortem for every... And he doesn't say every significant event, but it is the case he means every significant event. And that the postmortems have to be blameless, meaning you want to focus on process and technology, not on people.

You don't want to say, "Bob or Susan over there screwed up." You want to say, "What led to Bob or Susan making the wrong decision? Why wasn't the doc good? Why wasn't the process good? What technology should have been in place?" That's the sort of things you want to be thinking about.

Because one of the things I've said in a talk similar to this before that I think people tend to dig is you can't fire your way to reliable. And we hear about that all the time, where there was some big screw-up somewhere, and they just fired that person as if that was supposed to fix it. Because eventually, if you were just firing people every time somebody made a mistake, it would be just the one guy who's sitting there, teeth-chattering, really sad in the corner. And then your system isn't any more reliable, because now you've only got the one person there who just happened not to have caught the bullet.

So I think this is an important part to realize from sort of the reliability thing is that you just can't. It just doesn't work that way. Okay?

So I've talked only about Google so far, but I think the thing that you might be curious about is what about sort of the rest of the SRE universe? Because it is the case that SRE isn't just a Google thing. So here is some example of some of the places that have SRE teams that I have personally interacted with. In fact, we have one of the SRE leads for... Can I say it? For LinkedIn in the room. They're around. They exist.

And I realize that this isn't very enterprise-y, but I'm seeing it more and more in enterprise companies. And because everybody has sort of a little bit of a different twist on how they implement SRE stuff, I'm only going to concentrate on one of the things from this list. I'm only going to talk about Facebook.

And I want to make it clear, just like with Google, I am not an employee of Facebook, which means I can't use their official logo. So here, I used this one instead. Does that work for you? Can we call that Facebook? Okay.

And everything that I'm saying now has, in fact, when I talk to the fine folks at Facebook, has been cleared through PR and stuff like that. So they're all cool.

So Facebook has, they don't call their SREs SREs, due to mostly historical facts. But they call them production engineers. And if you want to know more about production engineering, per se, congratulations. You might be surprised to hear that the next year after we had Ben Treynor, we had Pedro, who came and talked about production engineering. So you can go watch that talk as well. Equally cool.

And so in that talk or in another talk like it, he said production engineers at Facebook are hybrid software systems engineers who ensure that Facebook services run smoothly and have the capacity for future growth. They are embedded in every one of Facebook's product and infrastructure teams and are core participants in every significant engineering effort underway in the company.

Can anybody tell me what the keyword out of that big blob of text is? Prod. Yeah, good for you. That is indeed the keyword.

So one of the things that Facebook does differently from Google is Google has a notion that there are SREs that, though they are associated with products and groups and stuff like that, they have more of an allegiance to a more central, there's a more central nature to them. And it is the case that not every team gets an SRE. Often you have to show that you are ready for the giving of the SRE to your team, and that you are at the level where you could really use some help with, you're at the maturity level where you could really use that sort of expertise.

The difference is that Facebook sort of starts with this notion that there will be a production engineer along with you.

And so my next observation I want to say is that based on my experience, there is no one right way to put together an SRE team. But there are definitely wrong ways.

Examples are if you hear somebody who says, "Hey, what our SREs are is really what they are is they're like tier three. There's a NOC, and we have a ticket system, and they're the extra special super rocket scientists people who are doing stuff like that."

That is not an SRE team. That is just a case of SRE washing or trying to, like DevOps engineer wasn't good enough anymore, so we're going to call it SRE because that's cool. There's a fundamental philosophic difference between that and the ticket and the NOC sort of thing. And we can talk more about this in depth, like how do you know whether you have a real SRE team?

I have encountered multi-billion dollar organizations that have SRE teams, and I walked in there and said, "I don't really see the E here. I don't see the engineering here, and I don't really see the R here. So in what way is this an SRE team?" And they didn't really want to hear that.

Okay, so let's go back to sort of the production engineers back at Facebook. So one of the things you might be interested to know is that at Facebook, there's approximately a 1 to 10 ratio between people who are doing software engineering and the PEs, which is an interesting piece of data. Because we used to do this thing where we used to say, how many machines per person? But now we kind of think about how many operations people per people creating software.

And if you are a PE, it is the case that you're in with a group for 18 or 24 months or 36 months, with a service, which is what they call their products, whatever you want to call it. So it's not like it's a SWAT team. It's not like you go, "Wait a second, we have operational concerns," and somebody rappels down the side of the building. Or down that big thumb sign over near the Facebook sort of thing.

Nor is it the case that the PEs in a group are like the ops monkeys. It's like, "Oh, awesome. We have a PE. What they will do is they'll handle all the tickets." That doesn't fly at all. Because at Facebook, they do very much think that what you need to do is, if you are going to build software, you're also responsible for helping to run it. Okay?

You might also find it interesting to know that if you've heard of Facebook's onboarding situation, they call it boot camp, right? And it's this thing where they come in, they indoctrinate you into the Facebook way, and things like impact and all this other good stuff like that. But if you're a PE, you do in fact go through boot camp, and they put a bunch of the SRE training stuff into their boot camp.

There's also extra training when you join a service, and I can talk about that process as well, where that particular service will indoctrinate you in how they do stuff.

Okay? So, it's an interesting thing. Google has a more separate program when it comes to training and spends a little different time. And they're thinking about that as well. They used to have a centralized version. They don't as well.

And here's perhaps a really key thing, so much so that I'm going to make it red. The lead of Facebook's SRE group, and in fact, we have the lead for LinkedIn here, reports to the Facebook head of engineering, which means that the same person in the Facebook organization is responsible both for operations and engineering.

And they think of this as totally key, because if you have two separate things in a political infrastructure, where somehow the operations person has to face off against the software development person or the engineering person, you're in for a bad time. Right? Really, you need to have, and this is, I guess, a really key thing that's important to know about SRE, and if this makes you sad and you have to walk out of the room right now, that's okay. But if you want to do SRE-type stuff, it's absolutely crucial that you have management support up to the highest levels.

You have to have people at the top who go, "Yes, we believe reliability is important. Sometimes it's more important than features." And you have to have that support, and you have to be able to have somebody whose job it is to see both sides of the picture all the time.

Okay, so the other last thing I want to talk about when it comes to Facebook, and as we walk towards the end of this talk, and we can do some questions if you want, is something that I think that was interesting. It doesn't come up at Google, comes up only at Facebook. They're really big on this maturity model where they want to determine what kind of SRE you get, or, sorry, not SRE, what kind of PE you get, or what sort of support they're going to give you, because some PEs are really good at the initial stages of building things, and some are really good at the later stages.

So they look at things like the maturity of where the teams are and they're working and what's the maturity level of the service. Some services, as you can imagine at Facebook, are more mature than others at this moment.

The other thing that they look at, which I thought was really super important, is they look at the relationship between the PEs and the people that are writing the software. I'm using, I think they call it SEs, not SWs. That's a Google term.

So there's a relationship. So if in fact you find the PEs are always arguing with the developers there, that's a bad sign, and they're going to do something different than if everybody's getting along harmoniously and singing "Kumbaya" on a regular basis.

And finally, the other thing that they also do when they're figuring out who goes where is they're looking at sort of these stages, right? They're breaking down the stage of the services between sort of the bootstrap and the places where you start, and then the next step is how do you scale it? What's the initial deployment? And then finally, they call it awesomize. I didn't make that up. Okay.

And so awesomize is that process where you look at the last 5% of the outliers and you try to make them go away, or you look at, could we make this production system better? And so various PEs specialize sometimes in various places, just like, and I'm going to use a sports ball analogy, some people are closers and some people are openers, right? And so they move people around like that.

How they handle this, and the fact that people hang out in a team, is somewhat different from the Google model. In the Google model, SREs are also free to sort of, within certain constraints, move from place to place whenever they want, and there's this big ejector button right next to their seat that they can push.

And so one of the things that goes on in Google often is that managers will spend a bunch of time making sure that the SREs they have working for them are happy in a different way. And it's not to say that Facebook doesn't as well, but Facebook sort of tries to do that a different way.

Okay, so with that, I'm going to sort of come to the end of my talk. All I've wanted to do is give you a couple of cool ideas from the land of the SRE, so you can put the title in your head, because I think you're going to be seeing more and more of this as Google and Facebook and LinkedIn and Twitter and Dropbox start to be more vocal about how they do things internally. And they have some really cool things to teach the DevOps movement, and I think that similarly, DevOps has cool things to teach them, and we can go into that as well.

So with that, I'm going to finish up and let's see where we are in terms of... A minute. We have a minute for questions, so ask them all very quickly at the same time.

You have a question? Because there is a mic here. If you don't want to do public speaking, you can, by all means, tackle me at any point in time in this conference, and I'd be delighted to talk about this sort of stuff. I could go on and yammer for hours.

Or he'll have a dance-off with you if that's what you want. That's good. Thanks for that offer.

So thank you very much for your attention. I really appreciate it and please do enjoy the rest of your time.