Operations Reform: Tom Sawyer-ing Your Way to Operational Excellence


Las Vegas 2018


Stack Overflow, Inc. had 150+ services across 3 teams with very little consistency in operational hygiene. Using a simple self-assessment, teams were motivated to improve. The self-assessment is blameless, non-bureaucratic, and succeeds where KPIs fail.

Tom is an internationally recognized author, speaker, system administrator and DevOps advocate. His latest book, the 3rd edition of "The Practice of System and Network Administration" (http://the-sysadmin-book.com) launched in 2016.

He is also known for The Practice of Cloud System Administration (http://the-cloud-book.com), and Time Management for System Administrators (http://www.tomontime.com). He works in New York City at Stack Overflow (stackoverflow.com).

Previously he worked at Google, Bell Labs / Lucent, AT&T and others. His blog is http://EverythingSysadmin.com and he tweets @YesThatTom.


Full transcript

The complete talk, organized by section.

Tom Limoncelli

My name is Tom Limoncelli. I'm the SRE manager at StackOverflow.com. Who here has heard of StackOverflow.com?

Okay, one, two, three. Okay, some of you. That's good.

How many of you have heard of Stack Overflow Enterprise?

Oh, okay, not as many. You get the whole Stack Overflow Q&A thing, but it's for your enterprise. You can use it for your internal technology projects.

More about myself: I've been a system administrator for much too long. I do a lot of writing. I blog, I tweet. I have a column in the ACM Queue website, and I've written a number of books.

A little bit about this talk, as mentioned earlier with Jason Cox's unhappy face: this is one of the talks that's trying to fix that. Most talks at DevOps conferences tend to be on the dev side. This talk, we're going to be on the operations side. So thank you for coming.

Now, I speak at a lot of conferences, and usually I try to have my slides all fancy with pictures and everything, and I'm doing something a little different this time. I'm just going to keep it simple on the slides, and I'm really going to focus on three stories.

So the first story is about a big initiative. And these are all true stories, by the way.

First story: rewind back to 1995. I'm early in my career. I'm at this big telecom bureaucratic company. And like a lot of companies, every quarter, you have a big all-hands meeting, and executives stand up on stage and talk about these exciting new initiatives.

And being a young engineer, I was so inspired by this talk about this big initiative. Afterwards, we're walking out of the auditorium, and I'm talking to my coworker who's been with the company for more than a dozen years.

And I said, "Oh, I'm so inspired by this, and I'm going to do this and this, and I think our team should do this and this and this. What do you think we should do, Andrew?" Name changed to protect the innocent.

And he said, "I'm not going to do anything."

And I was shocked. I said, "What do you mean?"

He said, "Tom, someday you'll get to learn that these initiatives, they get announced a lot. And if everyone goes and they do all sorts of work, and then the initiative goes away, that was all just wasted. I do nothing, and I get to the same place and do a lot less work."

And I was devastated. I was shocked. Here I am. I'm 23, and I'm like, "This can't be how technology works."

But it does. It does often work that way.

Operations says no often. Executives say yes, but we say no because we're under-resourced. We don't have the time. We don't have the resources. There's too much complexity and history to try these new things.

And so the lesson that I took away from this story was I want to make it a big part of my career to find non-top-down ways of motivating people. I want to learn how to motivate things through peer influence and anything but executives saying, "This is what we're going to do, and we're going to do it by fiat."

One of my role models in this area is Tom Sawyer, the story of painting the fence. He was not the executive that said, "Paint the fence." He was the executive that said, "I'm really good at painting this fence, probably better than you. I can't let you paint the fence." And that just made everyone want to paint the fence, right?

IT Revolution, the people who started this conference: one of the books that they published recently, a short topic book, you can download it for free. I was involved in this project. This book, Expanding Pockets of Greatness. This is four case studies of successful DevOps transformations, and one thing you'll notice: in none of them was there a top-down edict. It was all from the grassroots building up, sometimes with management support, sometimes working around management. I think they're really good lessons to learn.

Okay, so story number two.

I was at Google from 2006 to 2013, some of their biggest growth years, where we went from a search engine to a million different applications. Around 2009, Google SRE, which at that point was probably 50 different teams (it's now something like 100 SRE teams), realized that their operational hygiene was a bit uneven: some services were run better than others.

And by hygiene, I mean the things that every operational team should do. We all know we should do backups. I think if you disagree, you're at the wrong conference, right? I shouldn't have to convince you.

And we use the term hygiene because it's like brushing your teeth. We all agree we should brush our teeth every day, right? Now, there might be disagreements about which toothpaste we should use, and there might be disagreements about how to best do backups, but we all agree we should do backups, right? I'm not going to write an ROI case study to explain that we should do backups. It's hygiene. It's stuff that we should do.

Now, I didn't invent this. The brilliant management at Google had this very good insight that if they said, "Hey, our hygiene isn't good in these nine different areas. Go fix it," well, that would just piss people off.

So they had a very Tom Sawyer-ish way of working on this. They had every team do a self-assessment of their services, and they came up with these different categories. And they didn't build this fancy app to track all this. They just did it in spreadsheets.

And so here's a mock-up of what the spreadsheet kind of looked like. One is bad, five is good. The columns are every month. Every month, each team would fill in a column. So you have December, January, February, March, and you had these different categories of hygiene.

So how are we doing on regular responses? Think of that as transactional requests, tickets. Emergency responses: IR kind of stuff, when you get paged. And monitoring, capacity planning, et cetera.

And this wasn't a painful, big audit where, at a big bureaucratic company, each team would maybe hire a full-time person to manage it. That's crazy. No, this was just one hour a month. Teams were expected to spend one hour a month just going through this for each of their services and do a very simple, basic assessment.

And the goal here was data gathering, not project planning. In fact, I don't think I ever heard a manager say, "You got a low score on this. You should do a project to fix it." It was, "We're just collecting data." And the engineers were inspired to fix things. Engineers don't want to see bad scores. They came up with their own projects.

By the time this exercise would complete, they'd have a general idea of what kind of things they want to schedule for the next month. It gave people the data they needed to do their job better.

So teams could do a roll-up to the service level, and this gives the team two things: A, where should we put some of our focus? And B, how have we been doing? So you can get that good dopamine feeling from knowing that things have gotten better over time.

You can see service A has been kind of teetering between one and two, while service C has been slowly improving over time.
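As a rough illustration of that roll-up, here is a minimal Python sketch. The talk recommends a plain spreadsheet for the real thing, so this only spells out the arithmetic. The service names, categories, and scores are invented for the example, and averaging the categories is an assumption about how a roll-up might be computed, not a description of what Google did.

```python
from statistics import mean

# Hypothetical scores (1 = bad, 5 = good): service -> category -> one score per month.
scores = {
    "service_a": {
        "emergency_response": [1, 2, 1, 2],
        "monitoring": [2, 1, 2, 1],
    },
    "service_c": {
        "emergency_response": [1, 2, 3, 4],
        "monitoring": [2, 2, 3, 3],
    },
}

def rollup(service_scores):
    """Average the category scores month by month: one number per month."""
    months = zip(*service_scores.values())
    return [mean(month) for month in months]

def trend(monthly):
    """Change from the first month to the last; positive means improvement."""
    return monthly[-1] - monthly[0]

service_rollups = {name: rollup(cats) for name, cats in scores.items()}

for name, monthly in service_rollups.items():
    print(name, monthly, "trend:", trend(monthly))
```

With these made-up numbers, the first service stays flat while the second climbs month over month, which answers both questions: where to focus, and how things have been going.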

It also gives management the data they need. You could do a roll-up by team and see where resources are most needed. Now, notice I said where resources are most needed, not which teams are doing badly. Because an important part of this is the assessment is judging the service, not the people. And that's how you keep it from being... You want to make it a blameless situation. And I'll talk more about that in a second.

So why did this work? Well, psychologically, this works because it's so simple. It's a spreadsheet, and it's only an hour-a-month commitment. So that makes it a very low barrier to entry.

It leverages pride and ego. No one wants to see a lot of red on their chart, so people are self-motivated to fix things.

It also creates good culture. It's blameless. We're assessing the service, not the people. And it's transparent. These spreadsheets were visible, and the roll-ups were visible across all of the company. So you could see how you're doing in respect to other teams. And that helped create a culture of wanting to fix things instead of hiding things.

It also had non-monetary recognition of good work. So you want to encourage greatness. And engineers, money is kind of motivating, but there are other things that are much more motivating.

So, for example, if you want to improve how your team is doing in a certain area, now you have the data. You could look at what other teams are ranked better in that category and go talk to them. And what's a greater motivator than being the person that people come to? Like, "Hey, how did you guys get such a high score?" That kind of recognition, more valuable than money.

It also helped direct projects. So a good sysadmin or a good IT worker fixes a problem. A great one fixes a problem permanently, and a really great engineer builds a new paradigm that fixes or eliminates the problem company-wide.

Well, now, instead of thinking, "Hmm, I think we kind of generally do bad in this particular category," say backups. Instead, you have the data. You can look and you can say, "Oh, yeah, 30% of our teams aren't doing well in capacity planning. I'm going to build that new paradigm that lets us do that really well." So it lets your engineers guide their career in terms of making bigger impact and helps them achieve that greatness.
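To make the "you have the data" point concrete, here is a small Python sketch of the two questions an engineer might ask of one category's scores: what share of teams is struggling, and which teams are worth asking for advice. The team names, scores, and the threshold of 3 are all hypothetical, and a spreadsheet formula would do the same job.

```python
# Hypothetical latest scores (1 = bad, 5 = good) for one category, per team.
capacity_planning = {
    "team_01": 4, "team_02": 2, "team_03": 5, "team_04": 3, "team_05": 1,
    "team_06": 4, "team_07": 3, "team_08": 2, "team_09": 5, "team_10": 4,
}

def share_below(scores, threshold):
    """Fraction of teams whose score is below the threshold."""
    low = [team for team, score in scores.items() if score < threshold]
    return len(low) / len(scores)

def best_teams(scores):
    """Teams with the top score: the ones worth asking how they did it."""
    top = max(scores.values())
    return sorted(team for team, score in scores.items() if score == top)

print(f"{share_below(capacity_planning, 3):.0%} of teams score below 3")
print("Ask these teams how they did it:", best_teams(capacity_planning))
```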

Let me talk more about non-monetary recognition of good work. Let's say that bonuses were tied to these scores. Well, first of all, everyone would magically have high scores, right? Because it would encourage lying, and you don't want to do that.

Also, if a service is struggling, no one would join that team because reforming a struggling service could take two or three performance review cycles to improve. And that's basically guaranteeing crappy bonuses for two or three performance cycles. Who would join a team that is struggling? That's the opposite of what you want.

You want your best people to always be looking at these charts and saying, "Oh, that's red. I want to join that team." You want your best people to be hopping to the biggest fires, putting them out, and leaving good culture there, good technology, good practices. And this creates a virtuous cycle or virtuous circle that encourages that kind of behavior.

Another reason it works so well is it seeks perfection but doesn't require it. You should absolutely never have an initiative that's like, "We want to see all fives across the board," for many reasons.

First of all, perfection is impossible.

Second of all, you would be wasting the company's money. That last 10% of perfection is probably more expensive than the first 90%. So if you have absolute perfection, you're probably wasting money.

Now, for example, backups are super important for Gmail. If Google lost people's Gmail, that's like stabbing someone in the heart, right? You've lost their personal data. There's a certain commitment there.

But maybe, and I'm just making this up, Google Finance, maybe backups aren't so important, so a four there is fine. A five for Gmail is more important. So you don't want your engineering time wasted by going for perfection.

Okay. So now the third story.

In 2013, I joined Stack Overflow, and in 2016 I became the manager of the SRE team. And one of the first things I wanted to do was implement this kind of self-assessment program at our company.

Now that's really difficult because in our case, Stack Overflow is a little bit smaller than Google, and so we had to scale it down.

So, for example, we only had one SRE team with many responsibilities instead of many, many SRE teams. We had a more granular definition of service. So the way I scaled the process down was it was one spreadsheet. I let the SREs create their own rubric.

They were kind of intimidated by this process because they were like, "What's..." It's really difficult to tell someone that they have an ugly baby, right? And that's what this is about. This is a polite way of telling someone that they have an ugly baby. The best way to do that is let them tell you that they have an ugly baby.

I also... So wow, I didn't expect that to get such a laugh. Okay. I'm going to have to tweet that tonight.

So also, to keep it simple, I said, "Let's just have a pass-fail. We're just looking for where are the areas that we're kind of in trouble," right?

This made it a little less insulting because people were like, "Oh, no problem. I'll just mostly be pass and a couple failures where we need work." That was what they thought.

What they got when they started scoring their first stuff, they came back to me and said, "Tom, pass-fail's not good enough. There are some things that it's like fail with an asterisk. Like, negligently fail. We want management to take notice."

I said, "Okay. Well, we'll add one more grade."

And then they came back the next day and said, "We were thinking about it, and we want a really good pass, like pass, and this is so good everyone should copy what we're doing."

I said, "Okay."

So now it's a four-point scale, but people were empowered to make their own scale. In my mind, it's still pass-fail, but in their mind it's a four-point scale. It just works.
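One way to picture how a four-point scale can still be read as pass-fail is the Python sketch below. The grade names are hypothetical labels for the four levels described above, not the wording used on the actual spreadsheet.

```python
from enum import Enum

class Grade(Enum):
    # Hypothetical names for the four levels the team asked for.
    NEGLIGENT_FAIL = 1  # "fail with an asterisk": management should take notice
    FAIL = 2
    PASS = 3
    EXEMPLARY_PASS = 4  # so good that everyone should copy it

def is_pass(grade):
    """Collapse the four-point scale back to the manager's pass-fail view."""
    return grade in (Grade.PASS, Grade.EXEMPLARY_PASS)

def needs_escalation(grade):
    """Only the worst grade is meant to get management's attention."""
    return grade is Grade.NEGLIGENT_FAIL
```

The team sees four grades and the manager sees two, and both are reading the same cell.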

This is what our spreadsheet looks like. This is real data. I didn't doctor this. I just won't tell you what month it's from.

And the system worked really well. Even though it's one team, we kind of have these sub-teams, and some of them just did this. Some did it together. Some, their leader went through and did the first draft and people updated it.

What else should I point out about this? I had nine different categories that I wanted to self-assess on, but that felt very intimidating. So for the first iteration, we just did the four categories in light blue up top. Drill down on that. So we did it iteratively.

We started just pass-fail with these four categories. Next iteration, we added more categories. These categories, I see a lot of people taking pictures of this slide. These categories work for us. Your company, maybe totally different categories. Do what works for you.

So why did this work? We kept it simple. Simple, simple, simple. If you find yourself writing an app to manage this, please don't. You'll still be working out the database schema by the time you could've been done already.

It's also blameless. As I say, assess the service, not the people. It motivates people to expose their own warts, and that makes people want to fix them.

People are more motivated to work on a problem when they thought of it, which is why, as a manager, I never say, "That's broken." I say, "How's that doing?" And they say, "Oh, that's really broken." "Oh, would you like to fix it?" "Oh, let me tell you, I can't wait to fix it." Yeah. It's like Jedi mind trick.

And also, these problems had existed for a while, but they were invisible to management. And engineers tend to think that managers have ESP.

And I actually, at one point, I said, "Wow, this is great. I can show this to our management, and they'll see where the problems were." And they're like, "Oh, they know all these problems."

I'm like, "Wow, you really believe that the executives have ESP? That's so adorable."

Ironically, if I did it the other way, if I did the assessment for them and handed it to them, "This is what I think you're doing," right? A, they'd be insulted. B, and no one ever said this because I didn't do it, but I bet they would've come back and said, "Oh, don't tell me what's wrong. I've been complaining about this for weeks, but no one listens," right?

Because engineers, when they mumble underneath their breath, they think that CEOs hear that.

Not true.

It did create a new problem, which was that now all those red squares were 100 or so new projects that they wanted to work on. And if we tried to fix everything all at once, we wouldn't have time for feature-related work.

So I approached this problem in three ways.

One is I tried to rate-limit stuff. In our monthly and quarterly planning, I said, "Let's limit fixing things in this area to 20% of our work hours."

The second thing I did was we established theme months. We had a theme month of backups one month. And so all the different sub-parts of the team, or everyone, was working on it. And that worked really well for two reasons.

First of all, it helped morale. People were feeling a little isolated in their work, but because they were all working on backups, it actually improved team cohesion, even if they were working on backups of totally unrelated things.

My team works all remote. Well, two of us are in the New York City office. Everyone else works from their home, in many different time zones, and that feels a little isolating. But the fact that everyone was talking about backup-related things, that helped morale.

The third thing we did is I tried to focus on the theory of constraints. How many people are familiar with that concept?

It's explained really well in The Phoenix Project. Who here has not read The Phoenix Project?

Yeah. Raise your hand in shame. No.

Oh. That's a great book. Oh, sorry. That was kind of not blameless of me. But it got a laugh, so I'm sticking with it.

So the theory of constraints says if you have a process, say it's four steps, and each step can process 10 items per week, but one step, say step three, can only process five, you're going to get a backlog of work between steps two and three.

And when you're picking projects... Well, the theory of constraints says you should focus all of your energy on fixing that backlog on step three. Because if you improve the system downstream, well, that downstream is starved for work. You're just making that more efficient for no good reason. And if you make upstream steps more efficient, you're just contributing to the problem of a large backlog or a large bottleneck.

So we tried to identify, we put a lot of thought into, of all these things that are red, which of them would be fixing step three, for example? What would be fixing stuff at the bottleneck?
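The four-step example can be checked with a toy simulation. This is a minimal Python sketch under simplifying assumptions that are mine, not the talk's: 10 new items arrive every week, and an item can move through several steps in the same week if capacity allows. The function name and the eight-week horizon are hypothetical.

```python
def simulate(capacities, arrivals_per_week, weeks):
    """Return (queues, finished): items waiting before each step, and items completed."""
    queues = [0] * len(capacities)
    finished = 0
    for _ in range(weeks):
        queues[0] += arrivals_per_week
        for i, capacity in enumerate(capacities):
            done = min(queues[i], capacity)  # a step cannot exceed its weekly capacity
            queues[i] -= done
            if i + 1 < len(capacities):
                queues[i + 1] += done  # hand the work to the next step
            else:
                finished += done  # the last step completes the item
    return queues, finished

# Step three is the bottleneck: 5 items per week instead of 10.
baseline = simulate([10, 10, 5, 10], arrivals_per_week=10, weeks=8)
# Speeding up the downstream step changes nothing: it is starved for work.
downstream = simulate([10, 10, 5, 20], arrivals_per_week=10, weeks=8)
# Fixing the bottleneck clears the backlog and doubles what gets finished.
fixed = simulate([10, 10, 10, 10], arrivals_per_week=10, weeks=8)

print("baseline:  ", baseline)
print("downstream:", downstream)
print("fixed:     ", fixed)
```

In this toy model the backlog piles up only in front of step three, and only the change at step three increases the number of finished items.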

Okay. So that's the end of my three stories, but there's actually a fourth story because we have time. That story is your story. I'd like to see you all go home from this conference and write your own story. Take these three stories and apply them in your organization.

Shameless plug. One thing that'll help you is, in a book I wrote, chapter 20 is kind of the instruction manual for doing this. And appendix A is all these sample assessment questions and also what I call look-fors.

If your capacity planning is at level three, here's what you should look for. You should see these things, and that's how you know you're at level three. Or here's how you know you're at level two. It's something like 40 pages of look-fors.

Don't try to implement every single damn look-for. That would be crazy. But use this as a guide, as inspiration.

So do this in your enterprise. Try to do it grassroots. Management should set high standards, but the engineers should be figuring out how to get there. Let teams create their own rubric, maybe even their own... Well, I think you should create your own grading system, but be flexible.

Start with one team, and then prove that success there, and then grow. Don't do a big corporate edict that all teams must do this assessment. At Google, this was only done by some of the highest performing teams to kind of kick the tires and get it working, and then over time, all other teams started doing it.

Oh, there we go. Yeah, let the rubric grow over time. And people are going to need resources to do this. That means postponing some projects or adding people, but now you're going to have the data that's going to help you better focus those resources.

And lastly, oh, no. Yeah, use a spreadsheet. Don't write code for this.

And it's so important that you do this in a blameless way. You want to encourage blamelessness, transparency.

When I gave this talk as a dress rehearsal at the New York City DevOps meetup... By the way, anyone here from New York?

Okay, come to my meetup.

Someone said, "Tom, if you're giving this talk at an enterprise conference, no one here, but other enterprises have kind of a toxic culture, and this is just setting people up with a big target on their back. This is going to be weaponized against them. You should have some advice in that situation."

Well, if you're really in that kind of culture, I have three bits of advice.

One is maybe transparency isn't for you. Maybe you need to do this on one team and not be externally transparent until you've proved it out. One of the biggest motivators is people like to copy success, and so if you can be successful in one team, let other people copy it.

The second bit of advice I have is have all of your management read this book, Beyond Blame. Fantastic book. It's like The Phoenix Project in that it's a fictional story that you read instead of a textbook. It's very readable. It's a fast read, like 100 pages. All about why it's so important from an executive level to have a blameless culture.

And the third thing you can do is try to change your culture, try to change your culture, try to change your culture, and if those things fail, send out your resume.

I'm very serious about that. For decades, actually about 20 years ago, I wrote an article. It was called "Just Quit," and it was saying, try your best, but sometimes the best thing to do is send out your resume.

So that's my talk. We do have five minutes for questions, and thank you all for being here.