Better, Faster, Cheaper: What Does It Mean for Ops?

London 2017

John Willis

Director of Ecosystem Development · Docker

What if you could be 2,000 times faster than your competitors, and 100 times more reliable?

Full transcript

The complete talk, organized by section.

John Willis

So I'm Botchagalupe. Better, faster, cheaper.

I'm not going to tell you about my background. If you actually want to know me, you've got to remember that horribly spelled Botchagalupe thing from there. That's my Gmail, that's my Twitter, and then all the presentations that I've done over the last seven years are there, plus my bio stuff.

The bigger things are the DevOps Handbook, and I've been one of the co-organizers of these events, the DevOps Enterprise Summit, and also one of the core organizers of the original DevOpsDays. A lot of startups. I was the ninth person at Chef, helped build the customer-facing business there. Got lucky and sold the company to Dell and appointed myself the Director of DevOps. And I sold a company called SocketPlane to Docker about two and a half years ago. So, a lot of startups. That's the book.

This is worth mentioning. I spent a year developing this course. It was supposed to be three months. Linux Foundation is really mad at me. But it was a labor of love because I included everything. I mean, everything: Senge, Deming, Conway, and a lot of technology. So it's about 15 hours of videos. It's free. Linux Foundation on edX.

People tell me, "John, can you come train DevOps?" I'm like, "I don't have to. It's free. It's everything I know about DevOps."

All right. So I'm a big shot, right? Hey, man, I'm a big shot. DevOps this, DevOps that. So last year, I'm in China, and I don't know why I do these poses, and apparently I do it a lot, so I'm now trying to do it less. But these knuckleheads over in Seattle, friends of mine actually, decided that they would go have some fun with this thing.

And then I think they caught me with another pose and started another thread.

So the thing is, you can get a PhD, you can start the DevOps movement, but unless you've been memed, you ain't nothing. And then there I am, of course, with Muhammad Ali. So there you go.

All right. Now to get serious, folks.

What if I told you you could be 2,000 times faster than your competitors? What if I told you you could be 100 times more reliable, maybe 200 times, depending on who's counting, than your competitors? What if I told you you could have both?

You've kind of seen this already. Gene talked about it this morning. You're certainly going to see it tomorrow if you see Nicole and Jez's presentation about the DevOps survey. But I've got a little bit of a side story on this faster and more reliable. And I call it the immutable service delivery pattern.

Even though these all certainly do overlap, I would say that DevOps will make you faster. Of course, there's resilience built into DevOps. But I would say that combining that with a container strategy, and my favorite container happens to be Docker, will make it cheaper, and I'll walk you through some of that.

And then all that's great. In fact, okay, you would have a billion containers running somewhere, and they're all over the place. And like, wait a minute, how do I fix things when they go haywire?

And this takes us back to what we learned from Lean, particularly a book called Toyota Supply Chain. And I think if you combine those ideas, then you get the best of everything. You get to be basically very fast. You get to containerize everything, and you can have gazillions of these containers out there.

But just like an auto manufacturer, you can have metadata. You can have a bill of materials. There's a reason why a car company knows all the brakes in a certain car, and all the cars that are out there distributed: they have bills of materials. They have manifests. They know how to recall and which cars need to be recalled. So you can build that into your structure if we learn from some of the things that Toyota and Lean have taught us. I'm going to walk you through that.

So faster. I'm not going to spend... I've got 25 minutes. I have to cut somewhere. Sorry, DevOps conference, I'm going to cut out the DevOps stuff. No.

The truth is, you've heard a lot of this. This is usually an audience that hasn't heard much about DevOps. So I'm going to go a little fast, but I'm around, and there's a longer version. If you wrote down the Botchagalupe thing, you can find the longer version of this. So I'm going to assume there's certain things you've heard already, or you're going to hear over and over.

But I do think this is my favorite all-time picture that represents what DevOps is all about. And one of the things I always tell people is, people get all hung up on the Dev, the Ops, the why is it, could it be more Ops? Could be sales. It's a metaphor. It's a metaphor for two teams that have this wall that needs to be just demolished.

And what you have really on the top is, it's about shortening lead times. We want to be faster. We want to take ideas, and we want to take them. The classic was the aha to ka-ching. How do you get an idea to making money? How do you make that faster? Well, you have to shorten lead times. We learned that from Lean.

How do you do that correctly and resiliently? You get really good at amplifying feedback loops. So this picture tells you everything you kind of need to know.

And early on in the DevOps movement, if you will, Damon, my DevOps Cafe podcast host, we codified this thing accidentally called CAMS: culture, automation, measurement, and sharing. And it's kind of stuck as a loose taxonomy.

And Gene, if you've read The Phoenix Project, and certainly, I hope you've certainly bought the... well, actually, you can get a free copy today, but the DevOps Handbook, we built the book in this idea of three ways. The first way is the left-to-right flow. The second way is amplifying feedback loops, kind of right-to-left telemetry. Get really good at... and we'll also have some other sides. But then the third way is this continuous learning.

And what I realized is these actually really map very well. It's something I figured out along the way. Although CAMS was much earlier than this, culture is culture is culture. Lots of presentations and ideas on how you deal with culture.

But certainly, automation could be attached or aligned with first way, measurement, second way telemetry, and sharing is kind of third way. And so you'll hear more and more about this in the book, but really we break up the three ways, and we do a lot of concentration on case studies on continuous delivery, and then the culture is extremely important.

In fact, one of the things that came out really early on the CAMS thing is, we think, if you can't get the culture right, don't even bother with the... One of the things that one customer said, which is AMS, not CAMS. Or, you can't do AMS. If you do automation, measurement, sharing... I said that wrong, but automation, measurement, sharing without the culture is like going fast, maybe in the wrong direction. You have to get the culture right.

Oops.

And you've seen this, and if you haven't seen it, you're going to get a way better explanation of this tomorrow afternoon by Nicole and Jez, and they'll give you the 2017. This is 2015.

I have people ask me, why 2015? Because in 2015... So I started the DevOps movement. I've been doing IT operations for 35 years now. The DevOps movement was interesting. If there was a stake in the ground, it was around 2009, 2010. You saw this beautiful movement of operations and development and things actually making sense for the first time in my career.

And one of the things that we empirically learned along the way was that you could go fast and be more resilient. But you had to have a case study and say, well, they did it. But then Nicole came in in 2015 and worked on the survey, and what we found from the statistically sound data, and if anybody has read the foreword of the book, "The Science of DevOps," she explains psychometrics.

Basically, we find that based on your culture, the generative cultures, and Nicole and Jez will explain this more tomorrow, are 200 times faster than the pathological cultures. I mean, that's what the survey data says. And that's great, but they're also, depending on which survey and which data, they're 168 times better resolving issues, MTTR. And in '16, it was 2,500. That was Gene's slide this morning.

But what this really means is that the iron triangle, the pick two, or even better yet, "be reliable or be fast, you get two choices, you can't do both," is not true. We see this in pattern after pattern. Now we have statistical data to prove it's not true. You can do both.

And what's the linchpin? You get a free book. What's the linchpin of going fast and resilient? Come on, somebody.

Culture, right? It's culture. That's what glues the two things together. Otherwise, you can't. Then you're into that broken model of when you go fast, everything breaks.

And this is like a 2006 picture. Imagine this. This is 2006. Jez Humble, Dan North, and Chris Read did this at Agile. And this is basically how this feedback loop works. Everything's automated, if possible. Somebody commits a code, and it goes through the gates. It hits one gate red, it goes back. It gets recommitted, hits the next gate, next gate, red, goes back.

And over time, you're creating that amplification of feedback loop, and you're getting really resilient, because it just has to constantly improve itself. It's kind of an antifragile thing.
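
The gate loop described above can be sketched in a few lines of Python. This is only an illustration: the gate names and the commit fields are hypothetical and do not come from the talk or from any particular CI tool.

```python
from typing import Callable, List, Optional, Tuple

# A gate is a (name, check) pair; check returns True for green, False for red.
Gate = Tuple[str, Callable[[dict], bool]]


def run_pipeline(commit: dict, gates: List[Gate]) -> Optional[str]:
    """Return the name of the first red gate, or None if every gate is green."""
    for name, check in gates:
        if not check(commit):
            return name  # red: stop here and send the commit back
    return None


# Hypothetical gates, for illustration only.
GATES: List[Gate] = [
    ("unit-tests", lambda c: c["unit_tests_pass"]),
    ("static-analysis", lambda c: c["lint_warnings"] == 0),
    ("integration-tests", lambda c: c["integration_pass"]),
]

commit = {"unit_tests_pass": True, "lint_warnings": 3, "integration_pass": True}
first_try = run_pipeline(commit, GATES)   # stopped at "static-analysis"

commit["lint_warnings"] = 0               # the developer fixes it and recommits
second_try = run_pipeline(commit, GATES)  # None: all gates green
```

Each red gate is one pull of the cord: the commit goes back, gets fixed, and re-enters from the first gate.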

And so if we look at, from Lean, there's this concept of, not kaizen, it was a thing in Toyota called the andon cord. And the andon cord, when they were manufacturing cars at Toyota, was a rope. And it was a rope that anybody on the line, lowest status, didn't matter, if they didn't like what they saw, they pulled the rope, and the rope stopped the line.

And the first thing that the line manager would say to that person is, "Thank you," before they even investigated. It was a psychologically safe place that you could stop the production line, and the person who would normally yell at you in most organizations, before they even knew what happened, said, "Thank you."

And then there's a story about a Toyota plant in Kentucky that they were producing 2,200 cars a day. And an industry auto manufacturing analyst went and said, "How do you produce 2,200 cars a day? That's amazing." You know what the answer was? We pull the andon cord 5,000 times a day.

If you get that, you get this idea. And if you don't get it, then let me show you what Google does.

And these numbers are like four years old. This is the number that, actually, that I screwed up there, but that's 100 million. It's over 100 million. They run 100 million tests a day. That is that Kentucky plant. That's this on steroids.

Sorry if people are trying to take pictures, I'm bouncing around. But that's the point. The more you build this antifragility into your infrastructure, those gates, that telemetry, and again, there is a newer slide. Some of the Google folk had promised to send it to me, and they haven't. But all these numbers are almost double. And they're all phenomenal, but the one that stands out to me is the 100 million test cases run daily.

How do you produce really resilient software that... You could hate Google, love Google, whatever, but you think about the tools that are offered, they're pretty freaking resilient.

And then there's the legendary Amazon story of 11.6 second mean time between deploys into production. These stories have been told over and over. But it's the same type of mentality. You have to build that level of, A, culture, B, resilience, building that kind of andon cord into your infrastructure and your delivery pipeline.

And so right about now is where I'm looking at faces. Half the people are like, "This guy is so full of shit. He's never been in an enterprise." And that's not true. I have.

And so I shamelessly stole this from a guy named Pete Cheslock, who originally said DevOps, and that was security. But I see a couple of you right now, you're like, "Oh, yeah, he's totally doing the unicorn poop thing, and we're going to have to shovel it because our manager's in the room and he's going to go back and ask us why we don't do this." And you're going to think it can't be done in the enterprise. Absolutely not.

Well, we know that's not true. You're all here. We're going on the fourth year of this summit in San Francisco in November. And so we have stories. These are just a short list. Like Ticketmaster, and following the same principles.

Gene talked this morning about that first DevOps Enterprise Summit. One of the things going into that was we knew the enterprise could do DevOps the same way the web-scale companies did. And the large consulting companies were like, "Oh, no, you guys are a bunch of kids. You don't know what you're doing."

I'm not a kid. I'm 58. But I told Gene, when the customers present at that first DevOps Enterprise Summit, and they talk about, A, how hard it was, but B, they didn't compromise. They did it the way, literally, the web-scales, the Twitters, the Googles were doing it to a certain extent. And they were having success. It was hard.

But Ticketmaster, they did it the way you were all hearing about the stuff this day and the next day. Nordstrom, 20% shorter lead time. Same concepts, value stream mapping. Target, USAA, ING, all these companies, there are videos out there, and there's hundreds of them now, of stories that didn't compromise, that are enterprise. And it was hard.

So don't take anything I'm saying right here to make it sound like, oh, he thinks it's easy. No. It's incredibly hard in the enterprise. The older the enterprise, the harder it is. Doesn't mean... You have to do it. You've got no choice. Your competition is right around the corner.

So that was my DevOps. And so now I'll talk about Docker.

I think most people know about Docker these days, so I'm not giving you an overview of Docker. I will say, if you don't know, IBM had a nice white paper. It's pretty outdated, but it describes the difference between hypervisor-based compute and what we would call OS-level compute.

So with hypervisors, basically you get the whole stack. So you take a bare metal machine, you carve it up, but you're running full stack compute, like everything, the operating system. And the difference with a container is that you share the kernel with the host, so your compute instance is typically an order of magnitude smaller, an order of magnitude faster in its startup and shutdown time, and so you get this incredible density.

So just at a high level, we call this OS-level virtualization. Provisions in milliseconds. It basically almost meets bare metal runtime performance because you don't have a hypervisor brokering the in and out of memory and all the things that happen in hypervisors.

What we find is almost everything can be containerized. Old XP applications. I saw a Fortran application recently containerized. So the idea that it's just greenfield and cloud native, that's not true. You can pretty much containerize almost anything.

They're lightweight because you can now start thinking about only building the minimum of what you need. It changes the paradigm from running an application. Okay, here's your VM, and it's got all this stuff. Eh, I'm kind of lazy. I'm just going to leave all that stuff there.

This paradigm kind of forces you to change the way you think, because you start thinking about minimal. It's easier to start thinking inward out. What is the minimal I need to run this thing? It's a lot easier, and you don't have to be a Linux kernel expert to try to figure out, hey, I can build this just enough operating system, put the application. It even gets better.

Why Docker?

So Linux containers have been around for a while. A lot of people can do it. At the end of the day, we didn't invent it, but we did invent the simplicity in the workflow. And that's why everybody's running around with their heads on fire with Docker. It's because we took a really good technology and made it easy for people to use, in the sense that we added a workflow: pull, push. We emulated Git.

And there's a lot more to that story, but the workflow behind Docker is very malleable, and very malleable to developers. This is very rare in our industry, that there's a solid developer movement. The developers are like, "We want this. Get out of our way." And Docker has definitely been one of those.

All right, moving on.

We've always been hit with this idea of Docker is insecure. And what's interesting is, three years ago, somebody showed a slide. They'd say, "Docker's amazing, it's awesome, but don't run it in production." And then what's unfortunate now is some of those slides, they still have those slides in their deck today. And that's bullshit.

I can make the argument, and I think I would win, that you could run a container more secure than a VM today. People joke to us about our unicorn, all the investment we got, all the money we got. We put a lot of that money into security.

We've built in image scanning. You can sign a container image. The signed image can be checked on both ends. It could be on the push to the repository, or on the deployment. So you can put policy on signed images. So you are guaranteed now, one is that image is... not guaranteed. There's no guarantee in life. But there's a really good shot that any known vulnerabilities are covered and not in the container image, and you know the provenance of that image.

We've got a trusted registry. So we've got LDAP support, so it will connect to... So now you can add policy into your actual internal security. Encrypted, read-only containers. You can build a container that has a user namespace, so you can run a rooted container that won't have root access to the host. That's a big deal.

We have all the LSM support, security support. This is a big deal. DoD, unofficially, but they love this because if you know what you're doing, and I'll actually show you an example a little later, where you could turn off all the syscall kernel opportunities and then work your way back and only add the things. It's not a trivial process, but if you know what you're doing, now you can make the thing incredibly secure. Because now you can say the application only needs these three capabilities, and you can turn everything off by default. So now you've got it to a really locked-down state.
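
The "turn everything off, then add back only what the application needs" approach can be sketched with a small Python helper that assembles a `docker run` command line. It only builds the argument list and does not run anything. The image name and the chosen capability are hypothetical, and the sketch covers Linux capabilities only; syscall filtering via a seccomp profile is a separate mechanism not shown here.

```python
import shlex
from typing import List


def locked_down_run_args(image: str, needed_caps: List[str], port: int) -> List[str]:
    """Build a `docker run` command that drops every capability by default."""
    args = [
        "docker", "run",
        "--read-only",      # mount the container's root file system read-only
        "--cap-drop=ALL",   # start from zero capabilities
    ]
    for cap in needed_caps:
        args.append(f"--cap-add={cap}")  # add back only what the app needs
    args += ["--publish", f"{port}:{port}", image]  # expose a single port
    return args


# Hypothetical image and capability list, for illustration only.
args = locked_down_run_args("example/web:1.0", ["NET_BIND_SERVICE"], 443)
print(shlex.join(args))
```

Working out the minimal capability list for a real application is the non-trivial part; the helper just makes the default-deny shape explicit.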

We've got secrets management, so in your composition, we can go ahead and you can now... We've got a secrets engine that can basically decompose on the fly. So you could put token passwords and token tokens, if you want.

But this is what I'm going to show you at the end. We have, I'll show you a few instances, a thing called immutable operating system. It's not coming soon, it's actually available now. It's called LinuxKit, where the operating system itself is actually a just-in-time. It can actually be built from YAML files. This is a big deal.

I'm going to skip this. This is a great presentation about immutable infrastructure. The only thing I mean by immutable infrastructure is the idea that all your service definitions become artifacts that are basically read-only the minute they leave the developer's desktop.

So imagine that. So you basically, on commit, if it goes green through the pipeline and the developer wears a pager, the binary artifact that they tested in their development environment is the same exact binary artifact that's running in production. That's a big deal.
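
One way to picture "the same exact binary artifact" is a content digest that must match between what was tested and what is deployed. The Python sketch below hashes raw bytes as a stand-in; it is a simplification and not how a registry computes image digests. The byte strings are made up for illustration.

```python
import hashlib


def artifact_digest(data: bytes) -> str:
    """Content digest of an artifact: identical bytes give an identical digest."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


# Stand-in bytes for the artifact that passed the pipeline (hypothetical).
tested_artifact = b"example service build"
deployed_artifact = tested_artifact                      # promoted unchanged
patched_artifact = tested_artifact + b" hotfix on host"  # changed after testing

is_same = artifact_digest(tested_artifact) == artifact_digest(deployed_artifact)
is_patched_same = artifact_digest(tested_artifact) == artifact_digest(patched_artifact)
```

Any change after testing, however small, produces a different digest, so a deploy step can refuse anything whose digest is not the one the pipeline approved.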

And this is something. But I wanted to show you this is... The other thing that's really interesting is this modernization, and I'm just not going to have time to do it. But I wanted to show you this. This is pretty cool.

I was telling you about this thing called LinuxKit. So basically, when people define Docker things, they can create this thing called Compose. They build composition definitions for a service. And what's really interesting here, I'm going to use some buzzwords, but it really is the ability to create a converged infrastructure: your network, compute, and storage in one human-readable definition for a complete service definition.

And if you look at this real quickly, I have microsegmentation. That's VMware's famous word for building multiple interfaces. But I can segment. I can have the front end be pretty open. The front end will never have another database tier in it. Network. So I can build my networks. I can have my networks be encrypted. I can use overlays. I can use VXLAN. I can mix and match at the service level with multiple interfaces.

And I can do the same with storage. I can have multiple storage interfaces. I could be using Redis at one level. I could be using Cassandra at another level. And all this is in a human-readable file. And I commit that, and it builds my infrastructure.
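
As a rough illustration of network, compute, and storage living in one readable definition, here is a Python dictionary loosely shaped like a Compose file, plus a check that the front end and the database share no network. Every service, image, network, and volume name is hypothetical, and the structure has not been validated against the Compose schema.

```python
import json
from typing import Dict, List

# Hypothetical service definition, loosely modeled on a Compose file.
service_definition: Dict = {
    "services": {
        "frontend": {"image": "example/frontend:1.0", "networks": ["front"]},
        "app": {
            "image": "example/app:1.0",
            "networks": ["front", "back"],
            "volumes": ["app-cache:/cache"],
        },
        "db": {
            "image": "example/db:1.0",
            "networks": ["back"],
            "volumes": ["db-data:/var/lib/data"],
        },
    },
    "networks": {
        "front": {"driver": "overlay"},
        # Encrypted overlay for the back-end tier (assumed option name).
        "back": {"driver": "overlay", "driver_opts": {"encrypted": "true"}},
    },
    "volumes": {"app-cache": {}, "db-data": {}},
}


def shared_networks(definition: Dict, a: str, b: str) -> List[str]:
    """Networks that two services are both attached to."""
    services = definition["services"]
    return sorted(set(services[a]["networks"]) & set(services[b]["networks"]))


frontend_db = shared_networks(service_definition, "frontend", "db")  # no overlap
app_db = shared_networks(service_definition, "app", "db")            # back tier only

print(json.dumps(service_definition, indent=2))
```

Because the segmentation is data in a committed file, a check like `shared_networks` can run as a pipeline gate before anything is deployed.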

But what's even better now is with that same commit, I can actually build a just-in-time operating system, YAML-based. And this creates an immutable operating system that basically, same properties. Now my service and composition leave my desktop all immutable, completely defined, a converged infrastructure with an operating system that is built specifically for this service or the services. Where I get to even the kernel level. I want to define how the init structure is.

Not only is it a read-only operating system when it's executing, it's basically anything that runs on it is a container. So you can't run anything on it unless it's a container.

So I'm going to be giving a presentation in Portland later this year, and I'm going to challenge the attendees to compromise. I'm going to run it, and I'm going to challenge the... And I'm going to say, "By the way, I'm not a security wonk." And I'm going to challenge a bunch of security people and say, "Go ahead and compromise this. I dare you." And I don't think they'll be able to do it.

First, they'll have to read up on the operating system. Second is they've got to basically know how to implement... They'd have to implement a container, and then they'd have to compromise another running container that might have capabilities and network segmentation. Going to be really tricky.

Finally, I'm a big fan of Deming. The thing I wanted to point out is this book called Toyota Supply Chain, which works really well. If you want to really go fast, you want all this stuff, the problem is you've got to be good at these principles that Toyota is really good at. And one of them is...

I'm going to skip this because I'm going to run out of time. Sorry. That's the problem with doing a 45-minute presentation in 25 minutes. If I have time, I'll go back.

In that book, Toyota followed these three principles: fewer better suppliers, higher-quality parts, and then track what you use. So it gets pretty simple. And they call it the four V's. You want to decrease variety. You want to increase velocity. You want to decrease variation. You want to increase visibility.

And variety is key because you basically start thinking, Deming would say this, that you want to reduce the number of suppliers that you have. And in fact, that prior chart shows that GM actually had more suppliers than Toyota, but they also did more in-house work. That sounds weird, right? But that's true. The more suppliers...

So the healthcare.gov debacle in the U.S., this is what? Two years it took to develop, maybe three years. They had 17 Java logging frameworks. Google, by the way, if you read the Google SRE book: two kernels, two Java logging. They even have a team that they call the Death Squad, which will actually go into your office.

"Hey, you're running an outdated kernel."

"Yeah, I got it. Can you leave?"

"No, no. My job is just to sit here and wait till..."

Gene Kim is actually a Death Squad. If you're on the phone with Gene, this is why I run out of time. You're on the phone with Gene, and Gene says to do something, and you'll say, "I'll do it, Gene. Just give me a couple of days." He'll say, "No, no. Why don't you do it right now?" And then he'll wait, and like, "All right, I'll do it now."

But Google, across the board, two of everything. And you get these efficiencies. How many Java frameworks do you have? How many operating systems do you have? There's some magic in this consolidation.
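
Answering "how many logging frameworks do you have?" is a simple counting exercise once there is an inventory. The Python sketch below assumes a hypothetical mapping of service name to logging framework and uses the "two of everything" limit mentioned in the talk as the threshold.

```python
from collections import Counter
from typing import Dict

# Hypothetical inventory: service name -> logging framework in use.
services: Dict[str, str] = {
    "billing": "log4j",
    "checkout": "logback",
    "search": "log4j",
    "reports": "java.util.logging",
}


def framework_variety(inventory: Dict[str, str]) -> Counter:
    """Count how many services use each framework."""
    return Counter(inventory.values())


def over_limit(inventory: Dict[str, str], limit: int = 2) -> bool:
    """True when there are more distinct frameworks than the allowed limit."""
    return len(framework_variety(inventory)) > limit


variety = framework_variety(services)
needs_consolidation = over_limit(services)  # three frameworks against a limit of two
```

The least-used frameworks in `variety` are the natural first candidates for consolidation.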

And you say, "Oh, it's hard, John. We run old versions of this and that, and oh, my goodness." Tough. If your CTOs stand up saying, "We want to be more like Google," then point out to them, "Hey, then help me get rid of these 18 versions of a logging framework."

I wrote a paper on this, called "Docker and the Three Ways of DevOps." And again, if you want to look at what does Google do? You learn faster, limited frameworks, you limit vendors, you do small batch. These are the things we just talked about, DevOps Handbook.

And then this is where immutability comes into containers. So the fact that when it leaves a developer's desktop, if everything, including the operating system, doesn't change, there's no opportunity to change.

I love Chef. I'm still a stockholder in Chef. But the bottom line is, if you're infrastructure as code at every level, you're adding variation. There's a great paper called "Order Matters" about how any of these little things, like a script or bad return, there's so many ways, at scale, you will get variation if you're not immutable. So immutability.

And then the visibility is the last part, which is, you can put metadata in containers. Because it's binary, you know it's left the developers... On the commit, you put the commit SHA in it. You can put other metadata in there. You could put in a bill of materials. You could put all that stuff.

And so now if you have thousands or millions of these things running, it's very easy to basically say, "Give me everything in department code four that has this framework," and you get a list. So it's actually more manageable. You've got millions and millions of them, but you have all this metadata, it's included in the binary, and you know it's guaranteed to be in there. There's no mishap. You've got the commit SHA, you've got the bill of materials. You've got everything you need, and anything else you want to add, because you can add metadata to a container image till the day is long.
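
A query like the "department code four" one can be sketched as a filter over image metadata. In this Python illustration the inventory, the label keys, and the commit SHAs are all made up; a real setup would read labels from a registry or a running cluster rather than from a hard-coded list.

```python
from typing import Dict, List

# Hypothetical inventory of running images and their metadata labels.
inventory: List[Dict] = [
    {"name": "billing", "labels": {"department": "4", "commit-sha": "a1b2c3d", "logging": "log4j"}},
    {"name": "search", "labels": {"department": "4", "commit-sha": "d4e5f6a", "logging": "logback"}},
    {"name": "reports", "labels": {"department": "7", "commit-sha": "0f9e8d7", "logging": "log4j"}},
]


def find_images(images: List[Dict], wanted: Dict[str, str]) -> List[str]:
    """Names of images whose labels contain every wanted key/value pair."""
    return [
        image["name"]
        for image in images
        if all(image["labels"].get(key) == value for key, value in wanted.items())
    ]


# "Give me everything in department four that uses this framework."
recall_list = find_images(inventory, {"department": "4", "logging": "log4j"})
```

This is the recall analogy from the car example: the metadata travels inside the artifact, so the list of affected services is a lookup rather than an investigation.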

And this is somewhat R.I. Pienaar, the inventor of MCollective. He did this very early on. If you want the blog article, I'll give it to you.

So again, back to this. The only point now is there's one last overlay that's really important, which is the DevSecOps overlay. So we've got this. That's great. We've been really good at this. We did development, we got operations, we got QA, we really won. And now we're seeing that the security people are not finally getting on board, but getting on board with this idea that they're adding their overlay abstraction into this thing.

And I kind of adapted, stole, adapted. But DJ Schleen gave this presentation at RSA earlier this year. But this was their... So you look up RSA DJ Schleen. You'll see this is an overlay on their already existing Dockerized immutable infrastructure. And you see all the threat modeling and the development static analysis. They're injecting all that stuff into the automation pipeline.

And the thing here is, it was Aetna, because it was DJ Schleen, but here's the thing, and I want to end this up with. I promised you this 2,000 times faster and 100 or 200 times more reliable.

So the Aetna story is really interesting. About three or four years ago, they started down this DevOps path. They would track security defects per 10,000 lines, and they started out with 10. Through DevOps practices, they got it down to four. They applied the Toyota supply chain, minimized suppliers. That was really actually the only principle. The single principle, which is minimize suppliers. Get down to two Java logging frameworks. They got it down to one.

Then they went with a Docker immutable model for delivery in production. These are production applications. They got it down to 0.1. In fact, at RSA, that was a year ago. At RSA, they said that number is actually zero now. And it's a pen test bug fest zero. This is not a made-up zero.

So imagine taking 10 security defects. By the way, a bug is a bug is a bug is a bug. Security is a bug. Let's stop bifurcating security from everything else. And imagine being able to follow these principles, and the principles that I described to you are the ones that they did that took them from basically 10 down to zero. I'm being generous in saying 0.01 or 0.1.
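
The arithmetic behind the "100 times more reliable" claim can be checked directly from the defect rates quoted in the talk. This Python sketch uses only the three figures that are unambiguous in the transcript (10, 4, and 0.1 defects per 10,000 lines); the stage labels are paraphrases, and the final zero is left out because a ratio against zero is undefined.

```python
from typing import Dict, List, Tuple

# Security defects per 10,000 lines, as quoted in the talk.
stages: List[Tuple[str, float]] = [
    ("starting point", 10.0),
    ("after DevOps practices", 4.0),
    ("after Docker immutable delivery", 0.1),
]

baseline = stages[0][1]

# Improvement factor relative to the starting point.
improvement: Dict[str, float] = {name: baseline / rate for name, rate in stages}

for name, rate in stages:
    print(f"{name}: {rate} per 10,000 lines, {improvement[name]:.1f}x vs. start")
```

Going from 10 to 0.1 is a factor of 100, which is where the headline number comes from.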

The last thing. The guy at RSA who gave the presentation, DJ, I never met him before. In the speakers' den, he comes up to me and he says, "Dude, I love Docker." So I'm like, "Dude, I didn't write it. Calm down." And we had this conversation why he loved Docker, and he said, "Here's the thing, John." And this is the part where if this doesn't make you, like... then maybe you're in the wrong business. I don't know.

He said, "With Docker, it's one service, one container, one read-only file system." Here's the one that should get you to jump up and cheer: one port.

I see some smiles in the audience. Imagine that. One port. Imagine how much time you've all wasted on firewall ports and all that. Imagine that. One service. I like to repeat this. It gives me the chills. One service, one container, one read-only file system, one port.

If you've been in this business for more than three years, there's some magic. And anyway, that's my presentation.

Thank you very much.