Radical Ideas Enterprises Can Learn From The Cloud

San Francisco 2015

The enterprise is adopting the service reliability techniques learned from the “cloud computing” and “distributed computing” world. Tom will highlight some of the most radical ideas from the new book “The Practice of Cloud System Administration”.

The book focuses on “distributed” or “cloud” computing and brings a DevOps/SRE sensibility to the practice of system administration in ways that apply to the enterprise, the cloud, and everywhere.

Some of the radical advice includes: improving uptime by using cheap, unreliable hardware; why you should crash servers at random times; and why you should make peace with outages.

Info about his new book can be found at http://the-cloud-book.com

Full transcript

The complete talk, organized by section.

Tom Limoncelli

Hi, my name is Tom Limoncelli from Stack Overflow.

A little more about myself. I've been a system administrator for much too long. And I've also been to a number of large companies. I was at Google for seven years, Bell Labs for seven years. And I blog, and I tweet, and I write a number of books, as she mentioned.

And often when you invite an author to a conference, the talk that they give is just a thinly veiled sales pitch for their book. But I promise that it's not that kind of talk. If it were, I'd have a slide like this, and I'd wait until someone tweeted a picture of me with it.

Okay, maybe not. And I'd have a slide with a table of contents, and I'd explain how it's really two books in one, and how the first half is about building large systems and the second half is about operating big systems.

But it is not that kind of talk. This is really going to be a talk about the cloud.

The cloud. The cloud. The cloud. The cloud. We love the cloud, right? We heart the cloud. Raise your hand if you heart the cloud.

Everyone hearts the cloud. Of course, we heart the cloud, and why do we heart the cloud? We heart it because the cloud solves all problems, right? Right? There are no problems left in the world. Just ask any person from sales. They'll tell you this is true. This is true.

But no, we laugh because technical people, we generally don't like the term "the cloud." It's kind of marketing-y. But I think the fundamental reason why technical people don't like the term "the cloud" is because it's ambiguous. It means different things to different people. And I think there's three major definitions.

Consumers think of the cloud as putting my stuff on someone else's computer. They say, "I have my music in the cloud, and therefore I can listen to my music collection no matter where I am from different devices."

To executives, the cloud generally means some kind of elastic computing. I need 1,000 machines just for a one-week advertising thing, and at the end of the week, I'm going to give them away, and that's really radical in itself.

But to computer scientists, the cloud means distributed computing. Distributed computing is when we have a lot of computers, or we have a large job that has to be done, and we divide that work among many computers.

So a little computer history. In the old days, computers were large. The important thing is, in the old days, computers were something you had to go to. You went to the computer, you did your work, and you took your results away. And even when they got smaller, it was still very much like that until networks were invented.

And networks begat client-server computing. And client-server computing was awesome because you could have these cheap clients and a big server serving lots of people, and you had economies of scale and stuff. And it was so successful, we started getting bigger and bigger servers because we wanted to put not 100 email users on a server, but 1,000 or 10,000, or databases, et cetera.

But that had its limits. The servers couldn't get as big as we wanted. And in the late '90s, the concept of distributed computing became more popular. That's where it started in research and then trickled down to other things.

The basic concept is we're going to take a large job and split it over many computers. And to do that, the computer science wasn't ready. We didn't have a way of coordinating such work, either with a conductor orchestrating everything or with decentralized control. These things had to be developed.

And as a result, we have these breakthroughs in genomics because we have hundreds of computers, each taking a chunk of DNA information, processing it, and rolling up that information.

At Stack Exchange, even though we're in the Quantcast top 40 biggest websites, we actually only have 10 machines that are our front end. So all of your web requests come in through a load balancer that hits one of 10 machines, and each one of those machines gets approximately one-tenth of the traffic. So we're distributing that work across 10 machines.

Of course, there's lots of back-end servers and other things, but fundamentally, 10 front-end machines.
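A minimal sketch, not part of the talk, of how a load balancer can spread requests so that each of 10 front ends sees about one-tenth of the traffic. Round-robin is an assumption here (the talk does not say which balancing policy is used), and the host names and request count are hypothetical.

```python
from collections import Counter
from itertools import cycle


def distribute(num_requests, backends):
    """Assign requests to backends in round-robin order and count them."""
    rotation = cycle(backends)
    return Counter(next(rotation) for _ in range(num_requests))


# Hypothetical front-end host names; only the count of 10 comes from the talk.
backends = [f"web{i:02d}" for i in range(1, 11)]
counts = distribute(1_000_000, backends)

for host in backends:
    share = counts[host] / sum(counts.values())
    print(f"{host}: {counts[host]} requests ({share:.0%} of traffic)")
```

Real load balancers usually also weigh health checks and current load, which this sketch leaves out.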

And distributed computing isn't just for processing power. It can also be for storage. So Gmail has many thousands of machines. All of your billions of Gmail messages aren't on one machine. They're distributed over many.

So distributed computing can do more work than the single largest computer: more storage, more compute power, more memory, whatever. The problem is that with more computers, we have more problems. So it's a bigger risk. If the whole system is down, the failures are more visible.

So the first email system I ever ran had 50 users on it. If it was down for an afternoon, I had 49 coworkers that were kind of angry, but they understood. If Gmail goes down for an afternoon, it's a headline in The New York Times the next day. So outages are much more visible.

Also, automation becomes mandatory. It goes from a "would be nice" to a "we have to." Facebook, Google, all these companies could not possibly scale if they had to hire another sysadmin for every 100 machines. You would hire everyone in the United States.

And cost containment becomes critical. If you can save $50 per machine on thousands of machines, now you're talking real money.

So in response, the distributed computing world came up with a lot of really radical ways of dealing with these issues. We started talking about not just reducing risk, but system safety. Can a large system be safely administered by people?

Reliability became a competitive advantage. Early on in the web history, companies discovered that people don't go to a website that's down. Not only that, but they go to your competitor's website, and they don't come back. Right? You run a sports site, they go to some other sports site, and then it's good enough, so they stay there.

Entirely new automation paradigms, things that you've been hearing about today. And also cost and economic models that are radically different. This is why Facebook, Google, Microsoft, they design their own hardware, because they want to get vertical integration and totally optimize the heck out of their hardware.

But to me, the most interesting thing about distributed computing is we learned that we have to make peace with failure. Everything can fail. When you have a large enough system, everything can fail. A one-in-a-million possibility of an outage means that's going to happen every day at Google and some of these other companies.
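The arithmetic behind "one in a million happens every day" can be sketched as follows. This is an illustration rather than anything from the talk: the daily operation counts are made-up numbers, and each operation is assumed to fail independently.

```python
def expected_events_per_day(operations_per_day, probability):
    """Average number of rare events per day."""
    return operations_per_day * probability


def prob_at_least_one(operations_per_day, probability):
    """Chance of seeing the rare event at least once in a day,
    assuming every operation fails independently."""
    return 1 - (1 - probability) ** operations_per_day


ONE_IN_A_MILLION = 1e-6

# Hypothetical volumes: a small internal service versus a very large one.
for label, volume in (("small site", 10_000), ("large site", 1_000_000_000)):
    expected = expected_events_per_day(volume, ONE_IN_A_MILLION)
    chance = prob_at_least_one(volume, ONE_IN_A_MILLION)
    print(f"{label}: {expected:.2f} expected per day, "
          f"{chance:.1%} chance of at least one")
```

At the small volume the event is a rarity; at the large volume it is routine, which is why the design has to assume it.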

And everything can fail, not just technical things like parts and networks. Code is imperfect, so we have to have new ways of dealing with bugs. We can't wait until the next software release. We're going to hide features behind a flag so we can just flip it off if we find a serious problem with it.
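A minimal sketch, not from the talk, of hiding a feature behind a flag so it can be switched off without a new release. The class, the flag name, and the search function are all hypothetical; a production system would typically read flags from a shared configuration store rather than an in-process dictionary.

```python
class FeatureFlags:
    """Tiny in-memory flag store (hypothetical, for illustration only)."""

    def __init__(self, initial=None):
        self._flags = dict(initial or {})

    def is_enabled(self, name):
        # Unknown flags default to off, so new code stays dark until enabled.
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)


def search(query, flags):
    """Use the new code path only while its flag is on."""
    if flags.is_enabled("new_search_ranking"):
        return f"new ranking results for {query!r}"
    return f"classic results for {query!r}"


flags = FeatureFlags({"new_search_ranking": True})
print(search("cloud", flags))

# A serious bug shows up: flip the flag off instead of waiting for a release.
flags.set("new_search_ranking", False)
print(search("cloud", flags))
```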

And people are imperfect. We need better ways of dealing with people. In other words, we need to learn how to fail better.

So while distributed computing is about just incredible new services and applications and new possibilities, to me, as an operations person, it's all about learning to fail better.

So the book is like 600 pages. I can't do a summary, but I'm going to focus on three particular ways that we can fail better.

By the way, these are not... Well, let me do. So we're going to talk about using cheaper, less reliable hardware; if a process or procedure is risky, do it a lot; and don't punish people for outages.

Now, some of you in the audience might be thinking, "Oh, these are the mistakes that companies make. Tom's going to recommend not doing these things." No, actually, I'm going to recommend that you do these things.

So let's begin.

Using cheaper, less reliable hardware. To explain this, I'm going to use an analogy. I travel a lot, and I often rent a car. And when I rent a car, they try to sell you all these different kinds of insurance, right?

But it would be fiscally irresponsible for me to buy all four kinds of rental insurance, especially since my auto insurance, my personal auto insurance, covers any rental car that I get, plus my homeowner's insurance covers any rental car that I get, and if I rent it on an American Express card, that covers any rental car I get.

It would be fiscally irresponsible to do this duplication of effort. And yet that's often how we design our enterprise systems.

Let me give you an example. A friend of mine recently came to me. He said, "Tom, my company's building out... It's a new company. We're building our website, and I need some advice."

He showed me his design for their new website, and he wanted some feedback. And he said, "So first, we're going to start with a high-end server. We're going to have all the storage on RAID. We're going to get dual power supplies so either power supply can die and it keeps running. We're going to put it on a UPS. Oh, and we're going to buy the gold maintenance package. Four-hour response time, baby.

"And that's not good enough. We're actually going to put that behind a load balancer and buy five of them, so any one of these can die at any time. Oh, and you know our current software doesn't work in this kind of arrangement, so we have to invest in writing code to make it work in a distributed environment.

"Oh, and one more thing. Yeah, we're going to buy a second load balancer so there's no single point of failure."

Okay? So in other words, their plan was to spend money at every level, right?

Now, what we learned in distributed computing is to gain reliability through software when possible instead of hardware. It's less expensive because if you have a hardware device that's going to add $1,000 to each machine, if you're buying a million of those machines, that's a lot of money. But if you can do your resiliency in software, you've paid to write that code once, you can now distribute it to as many machines as you want.

And a lot of these techniques become even more cost-effective as you scale up.

So for example, here we have a load balancer with two machines. You need to run each machine at less than 50% utilization because if one of those machines dies, there has to be enough room on the other one to take up the slack.

But if you have 10 machines, you only need 10% overhead because if one of those machines dies, taking up the slack is going to be spread over those nine machines.
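The headroom arithmetic can be written down as a short sketch (an illustration, not from the talk). It assumes identical machines, perfectly even load, and survivors that absorb a failed machine's share equally; the function name is hypothetical.

```python
def max_safe_utilization(machines, tolerated_failures=1):
    """Highest per-machine utilization at which the surviving machines
    can still carry the full load after `tolerated_failures` machines die."""
    if machines <= tolerated_failures:
        raise ValueError("need more machines than tolerated failures")
    return (machines - tolerated_failures) / machines


for n in (2, 5, 10, 100):
    limit = max_safe_utilization(n)
    print(f"{n:>3} machines: run each below {limit:.0%}, "
          f"keep {1 - limit:.0%} spare")
```

With two machines half of the capacity sits idle as insurance; with ten, only a tenth does, which is the cost advantage described above.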

So this is why, for example, Dropbox can offer your home directory storage for free or more cost-effectively than a lot of enterprises can provide it for themselves.

So the right amount of resiliency is good, but too much is a waste. And in distributed computing, we learned we need to gather those metrics so that we can accurately do this and not be wasting money.

And load balancing is just one of the ways of achieving this kind of resiliency.

When I was at Google, and Google doesn't do this anymore, but it was... and I wasn't directly involved in the project, but they had a project that involved... They needed massive amounts of RAM. And RAM with ECC error correction, the chance of an error is one in 10 trillion or something, but they were buying enough RAM that they were seeing these errors a lot.

And so they found a way of detecting RAM errors in software and working around the problem. Awesome. Well, because they did that in software, they then stopped buying the premium-grade RAM and started buying the cheap consumer-grade RAM.

Saved a lot of money. Then they started buying the factory-reject RAM. And it worked.

When their competition realized that they needed to compete with Google on speed, this was something that accelerated web searches, they also started putting their databases in RAM. But they hadn't done the homework to produce the software resiliency, so they were buying their RAM at the premium price, not the factory-reject price, which was pennies on the dollar.

Okay. So point one, have I convinced you to use cheaper, less reliable hardware?

Yes. Yes. Awesome.

Okay. This next one, a little more controversial. If a process or procedure is risky, I want to encourage you to do it a lot. And the reason I say this is that there's a difference between behavior and procedures.

Risky behavior is inherently risky. We can't improve that, and I'm not talking about risky behavior. So here are some risky behaviors. I can't make any of these things less risky. Risky behavior is just plain risky.

But risky procedures can be improved through practice.

So here are four operational tasks that can be very risky and can keep engineers up at night and give them the cold sweats. But when we practice them more and more, we get better at them, and we learn.

So, a little story from the company I work at, Stack Overflow. We basically run the company out of a data center in New York, where we're based, but we can fail over to a data center in Colorado.

So when I joined the company, I said, "Awesome, I want to learn the failover process. Let's do it sometime."

And they said, "Well, Colorado is really not in the state that we can do a failover right now."

I said, "Well, that's 100% risky, right?"

So a couple of months later, some projects were done. We were ready to do the failover, and the engineers were saying, "This should be no problem. This should take about an hour."

Well, the first time we did the failover after I joined, it took 10 hours. It required hands-on work from three different teams: the ops team and two different development teams.

During those 10 hours, we filed more than 30 bugs. Some were simple, like fix this documentation, there's a missing comma here. Some were like, we need to develop this whole big subsystem.

We also learned that the process couldn't be done if Nick was on vacation. Nick was our single point of failure. That's a risk.

So did we say, "Oh, this is a risky procedure. We should try to never do it"? No, we said, "This is a risky procedure. We need to do this a lot."

So every couple of months, we had a fire drill day. It was on a Saturday. And as I said, the first time we had 30 bugs, 10 hours. The next time took five hours. We got it in half the time. But still, 20 bugs were filed. The next time we did it in two hours, and our most recent attempt, we were able to do it in only one hour.

I hope a year from now I'm able to say that we can do it in one minute, or... Well, I'd like to say we do it in zero minutes. It's just automatic and no one even gets woken up, but that's more like a five-year project. But that's internal.

So why is this good? Because every drill surfaces various areas of improvement. And engineers are motivated to fix things when they know they're broken. A third of those bugs were fixed during those 10 hours, especially since a lot of those 10 hours were sitting around waiting, and we had plenty of time to fix some of those bugs.

But also each member of the team gained experience doing the process. So now everyone on the team has done the process. We just hired two new people, and I can't wait to see the look on their face when they are told they're the next person to do the failover.

Also, because of the small batches principle. We change the infrastructure and code very frequently. And so the number of changes between failovers relates to how risky the process is.

If we do this failover once a year, there might be 10,000 changes to the infrastructure, any one of which could make the failover fail. But if we're doing this more often, there's a smaller batch of changes for each failover.
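To see why the batch size matters, here is a rough sketch that is not from the talk. The 10,000 changes per year come from the example above; the per-change breakage probability of 1 in 10,000 is a made-up number, and changes are assumed to break the failover independently of each other.

```python
def failover_break_risk(changes_since_last_drill, p_break_per_change):
    """Chance that at least one change since the last drill breaks the
    failover, assuming each change breaks it independently."""
    return 1 - (1 - p_break_per_change) ** changes_since_last_drill


CHANGES_PER_YEAR = 10_000   # from the example in the talk
P_BREAK = 1 / 10_000        # hypothetical per-change probability

for label, drills_per_year in (("yearly", 1), ("every two months", 6),
                               ("weekly", 52)):
    batch = CHANGES_PER_YEAR // drills_per_year
    risk = failover_break_risk(batch, P_BREAK)
    print(f"{label}: {batch} changes per drill, {risk:.1%} risk per drill")
```

The smaller batch also shrinks the list of suspects when a drill does fail, which makes the fix faster to find.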

And this holds true for software upgrades in general. So Microsoft is much better now, but remember back in the day, Microsoft shipped Microsoft Office releases every three years.

So talk about risk. It was months of planning. There were tons of incompatibility issues. I worked at a company with a quarter of a million employees that halfway through their Office 97 rollout, things stopped because of a major issue. And for about a month, half the company couldn't share documents with the other half of the company. It's crazy.

Distributed computing can't do that. Google and these other companies were pushing out new releases all the time. Actually, I'm kind of preaching to the choir here. Every talk you heard this morning was about CI. So you get the point.

But the important thing is that the batch size gets smaller. So big-bang releases are inherently risky, and small batches are better because of the fewer changes between each batch.

Also, it's better because there's reduced lead time. Imagine the person who... Well, you write code, and it's in production days or weeks later.

Back when Microsoft was doing three-year releases on Office, I think about, remember when they added real-time spell checking? The squiggly line underneath misspelled words. I imagine the person who wrote that code and was really proud of it, and then he thought, "Oh, now I have to wait like two or three years before I can show this to my friends."

How sad is that? How sad is that?

And what if he changed companies before it was released? Oh my God. What a sad situation for developers.

And in fact, there's been a lot of studies recently by psychologists studying companies that subscribe to the small batch principle, and they find that instant gratification improves morale. You're not waiting three years to see the results of your work. It could be in production that day.

Etsy just said in 2014, they pushed 9,000 times that year. So that's incredible.

So risk is inversely proportional to how recently a process has been used.

A company that's really good at this is Netflix. They have Chaos Monkey, which picks random machines and reboots them and tests that the resiliency infrastructure that they have will fix the problem. And Chaos Monkey runs 9:00 to 5:00, Monday through Friday. So if there is a problem, they learn about these problems when their engineers are in the office, awake, and sober.

They don't run Chaos Monkey on the weekends because those problems will turn up on their own.
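The scheduling idea can be sketched in a few lines. This is an illustration of the behavior described above, not Netflix's implementation: the function names and instance names are hypothetical, and the step that would actually reboot the chosen machine is deliberately left out.

```python
import random
from datetime import datetime


def in_business_hours(now):
    """True Monday through Friday, from 09:00 up to (not including) 17:00."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(instances, now, rng=random):
    """Return a random instance to restart, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


instances = ["app-01", "app-02", "app-03"]  # hypothetical instance names

wednesday_morning = datetime(2015, 6, 3, 10, 30)
saturday_morning = datetime(2015, 6, 6, 10, 30)

print("weekday pick:", pick_victim(instances, wednesday_morning))
print("weekend pick:", pick_victim(instances, saturday_morning))
```

Keeping the time check separate from the random choice makes the "only when engineers are in the office" rule easy to test on its own.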

Okay. So have I convinced you that if a process or procedure is risky, you should do it a lot?

Yes. Awesome.

Okay. Part three: don't punish people for outages.

There will always be outages. There will always be outages. We talked about this already. And in fact, getting angry about outages is equivalent to expecting them to never happen, which is irrational. It's irrational.

And yet, we have all these outdated attitudes about outages. We have managers who say, "I want 100% uptime." Which is amazing because if your customers are all connecting from Starbucks, the Wi-Fi there has a 99% uptime. So you're spending money you don't need to.
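A back-of-the-envelope sketch, not from the talk, of why paying for 100% uptime buys little when the user's own connection is the weak link. The 99% Wi-Fi figure is the one quoted above; the service uptime targets are hypothetical, and the two are assumed to fail independently.

```python
HOURS_PER_YEAR = 365 * 24


def end_to_end_availability(*components):
    """Availability seen by the user when every component in the chain
    must be up, assuming independent failures."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def downtime_hours_per_year(availability):
    return (1 - availability) * HOURS_PER_YEAR


WIFI = 0.99  # figure quoted in the talk

for service in (0.999, 0.9999, 1.0):  # hypothetical service uptime targets
    seen = end_to_end_availability(service, WIFI)
    print(f"service {service:.2%} -> user sees {seen:.2%}, "
          f"about {downtime_hours_per_year(seen):.0f} hours down per year")
```

Even a perfect service cannot lift what the user experiences above the availability of the network in between.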

And we punish exceptions. We say, "I'm going to fire the person responsible for that outage." And when you do that, you create this environment of fear, and employees respond very quickly to that. And they start hiding problems instead of fixing them because they don't want to get blamed. They were touching the problem when it happened. They don't want to get fired.

People stop communicating. It discourages transparency. And in fact, as a result, small problems grow and grow and grow and become big problems.

So this manager that thought they were being such a good manager by saying, "I'm serious about outages. I'm going to fire the person responsible," has created the work environment that creates more outages. Isn't that crazy?

So there's new thinking on outages. Set a realistic uptime goal, a rational uptime goal, and anticipate outages. Design into the system resiliency techniques. Have fire drills that test things. That's called the anti-fragile philosophy.

And as a result, you create an environment that encourages transparency, encourages communication, encourages small problems getting fixed before they turn into big problems, and you have fewer outages.

In the first draft of the book, I had this paragraph that said something like, "In the old days, MBAs were taught that after every outage, you should fire the person responsible, and eventually, you'll have a company that only employs perfect people."

And my co-author deleted the paragraph. It was Christine. She said, "Tom, no MBA program says that."

And then, I swear, the next day, the healthcare.gov problem happened. And what were all the newspapers saying? "Who's Obama going to fire to prove he's serious about fixing this?"

This is ingrained in our culture. We expect this.

And if you saw Mikey Dickerson's talk, which he's given at USENIX LISA and other conferences... He's one of the people from Google and other companies who came in and kind of rescued the site.

He said in his first conference call with all the engineers, the first thing he said is, "I want you to all know none of you are at risk of being fired. I need to keep every one of you. You are the people that are uniquely qualified to fix these problems."

That's modern thinking about outages. Absolutely.

Another piece of modern thinking is that it isn't enough to just stop hiding outages. Bad would be hiding our outages. Good is we're going to boast about them. We're going to talk a lot about them because the more we talk about them, the more we're going to educate everyone in the company.

So we actually write up a postmortem document and share it with the entire company. And these postmortem documents don't focus on blame. They focus on taking responsibility.

At Google, if you caused an outage, not caused, but if there was an outage, because there is no one single cause. But if you're involved in an outage, there's no punishment. In fact, the only punishment is you might be the person that has to... It's not even called punishment. You take the responsibility to write a presentation about what we learned from that outage, and you give that presentation to all the other SRE teams.

So every outage makes Google smarter. Isn't that crazy? That's radical.

There's a new book out. I've read a draft. The official one comes out on November 1st. Dave has been teaching postmortem writing for years. He is brilliant. Get this book and read it right now. Pre-order it while I'm giving this talk and read it because a year from now, everyone else is going to be talking about this book, and you'll be like, "Yeah, I read that back when Limoncelli recommended it to me last fall."

So totally recommend this book, Beyond Blame.

Let me show you an actual postmortem from Stack Overflow. Our postmortem, first we have a template, so it's easy to get started. It has a summary, a summary chart. The background section at the bottom of this slide basically gives enough terminology and background so that anyone in the company will understand the rest of the postmortem, because this does get sent out to the whole company.

The next part of the postmortem is a timeline: what happened, step by step, gathered from logs and chat rooms and emails.

And then the last part has four sections. It's two and two. The first two are what went right and what went wrong. We talk about what went right because we want to celebrate those things, and we talk about what went wrong because transparency is important. That's how we're going to learn.

And the bottom two are what we're going to fix in the short term and what we're going to fix in the long term. Long-term items might require creating a project or a budget. Short-term things are things that we can fix just right away.
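The structure just described (summary, background, timeline, then the two-and-two sections) could be captured as a small template. This is a sketch, not the actual Stack Overflow template: the class, field names, and sample incident are all hypothetical, and the summary chart is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    title: str
    summary: str
    background: str
    timeline: list = field(default_factory=list)        # (time, event) pairs
    went_right: list = field(default_factory=list)
    went_wrong: list = field(default_factory=list)
    short_term_fixes: list = field(default_factory=list)
    long_term_fixes: list = field(default_factory=list)

    def render(self):
        """Return the postmortem as plain text, ready to email."""
        lines = [self.title, "", "Summary", self.summary,
                 "", "Background", self.background, "", "Timeline"]
        lines += [f"  {when}  {what}" for when, what in self.timeline]
        sections = (("What went right", self.went_right),
                    ("What went wrong", self.went_wrong),
                    ("Short-term fixes", self.short_term_fixes),
                    ("Long-term fixes", self.long_term_fixes))
        for heading, items in sections:
            lines += ["", heading]
            lines += [f"  - {item}" for item in items]
        return "\n".join(lines)


# Entirely made-up incident, for illustration only.
pm = Postmortem(
    title="Postmortem: hypothetical search outage",
    summary="Search returned errors for 20 minutes.",
    background="Search runs on a cluster behind the load balancer.",
    timeline=[("14:02", "Alert fired"), ("14:22", "Service restored")],
    went_right=["Alerting caught the problem within a minute"],
    went_wrong=["Runbook pointed at a retired host"],
    short_term_fixes=["Update the runbook"],
    long_term_fixes=["Automate cluster failover"],
)
print(pm.render())
```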

And when I emailed this postmortem out to the company, about 20 minutes later, I got this email. Actually, reply all. It's the whole company.

"I don't know about anyone else, but I really like getting these postmortem reports. Not only is it nice to know what happened, but it's also great to see how you guys handled it in the moment and how you plan on preventing these in the future. Really neato. Thanks for the great work."

I'm so proud to work at a company whose culture around blameless postmortems is that the reply that I get after an outage is, "Great work. Neato." It's pretty cool.

So have I convinced you to not punish people for outages?

Awesome.

Okay. So some take-homes.

Cloud computing: boring marketing term. Distributed computing: yay, computer science saving the day.

Use cheaper, less reliable hardware because when we can, we want to be solving our resiliency problems through software. It's cheaper, and we only want to pay for the resiliency that we use.

If a process or procedure is risky, do it a lot. The scientific term for improvement through rapid iteration is "practice makes perfect." This also works because of the small batches principle.

Also, don't punish people for outages. Focus on accountability and taking responsibility.

Now, I do want to say one more thing. I didn't see everybody raise their hand, so there might be some people that I haven't convinced yet. So these aren't new ideas. These are actually... And the whole DevOps philosophy dates back to the Toyota manufacturing model. This is applying operations science to IT, and the Toyota model led to the lean manufacturing model, which led to just-in-time manufacturing and everything. And so one thing I tell to the DevOps doubters that I talk to is this isn't this new thing.

Also, if you're still not convinced, it's also how we run our home life. If you think about it, we build resiliency where needed, not everywhere. That's how we build houses. We don't put a lock on every door in our house. We put a big, strong front door with a lock there so that we can move freely inside and be vulnerable in our own home.

We manage risk through repetition. If the only big meal that you make every year is Thanksgiving, that's risky. But if you cook big meals throughout the year, then cooking is less risky and the day is less stressful, and it's more enjoyable for you, and believe me, everyone else.

Similarly, if we frequently have those small, difficult conversations with our loved ones, then it's much easier to have those big, difficult conversations because we've developed the tools and techniques and the ability to do those well.

And lastly, we deal with problems when they're small. Because expecting our loved ones to be perfect is irrational, which is why good relationships include healthy portions of forgiveness.

So maybe these aren't the radical ideas that you should take from cloud computing. Oh, that's the forgiveness line.

So maybe these aren't the radical ideas that enterprises can learn from the cloud. Maybe these are the real, the very reasonable ideas. And I hope that you adopt them in your IT organization also.

Thank you very much.

Oh, can we just switch back to slides? And my publisher demanded that I include this slide, which is... Anyway, on InformIT, this is the deal of the day. You could get it for $25.

Okay. Thank you very much.

Q&A

Do we have time for Q&A? We have two minutes. Do we have any questions? We have time for... We have time for one question. One question. Yes.

Q: It starts before lean. It only goes back to Deming.

A: Okay, it goes back to Deming. Never challenge John on a Deming question.

Okay, thank you very much.