Keeping Your DevOps Transformation from Crushing Your Ops Capacity

San Francisco 2017

During the early rise of DevOps, Operations was promised a better future with fewer interrupts, less conflict, and more time to focus on what matters. After all, in the popular unicorn success stories we’d hear, that was always the case. However, for many large-scale enterprises things haven’t gone as smoothly.

These enterprises have found themselves in a situation where 1-2 years into their DevOps transformation, Dev is flourishing and Ops is struggling. Operations support costs are trending higher, and the operational support load, work interrupts, and context switching are getting worse. Operations is already stretched thin and there is a fear that continued DevOps acceleration could push labor capacity beyond the breaking point.

Not all enterprises are suffering from this Operations capacity crunch caused by DevOps transformation, so what is the difference? Why are some Operations organizations flourishing and others fearing for the future? This talk will focus on the successful design patterns that high-performing, large-scale operations organizations have applied to reduce the operational burden and support costs across their entire organization.

Specifically, we’ll look at how these high-performing enterprise Operations organizations apply DevOps principles to improve the post-deployment lifecycle, their successful process and tooling design patterns, and how their Developers are playing a key role in reducing the difficulty and cost of operations activity for everyone.

Full transcript

Damon Edwards

So, my name's Damon Edwards.

Why I'm standing up here, how I got to be standing up here, I've gotten to see inside of a lot of companies. I used to be the managing partner of an operations and DevOps consulting company called DTO Solutions. I've been involved in this conference since the very early days, also the DevOpsDays. I do a podcast called DevOps Cafe. Anybody listen to that? Some folks? Wow, my people. John and I need to do more of those, but we will do more, we promise you. You can get it on iTunes, those who haven't heard of it yet.

So the point of this is, I get to see inside a lot of companies: high flyers, low flyers, everybody in between, household names, startups you never heard of. And this talk is really about what I've seen to be patterns that are emerging, especially inside large enterprises. And I especially want to talk about the problem of what's happening to operations. We have all these DevOps transformations going on. They're going great in development, QA, everything else. But a lot of times, enterprise operations, that place where all that stuff has to hang together, suffers under the crunch of all this change.

So this talk is really about what happens after deployment. Deployment is just one slice of what has to happen in the operations world, and this talk is really about everything else. And specifically, we're going to talk about unplanned work, which is the most toxic and is one of the most costly and destructive things that goes on in an organization.

But operations is a place where planned work and unplanned work have to happen together, by design. It's literally the only part of a company where, by design, we're handling planned and unplanned. I guess customer service maybe. But, oh, no, never mind, they're all unplanned. So operations is the only place where planned and unplanned work have to happen together, and that can be a tough thing.

So come with me. I'm going to do a little journey here. The names and some of the details have been changed to protect the not-so-innocent. But this is actually all based on a real and unfortunately typical enterprise incident.

And this company, the funny thing about it is, they're all in on DevOps. They've got continuous delivery going on all over the place. They've got containers. Things seem to be going great, except for when you see what happens when something doesn't go great.

So the story starts at 9:30 in the morning. In the NOC, they're seeing some lights. "Hey, we've been seeing some of these intermittent lights over the past week, but this looks a little more serious. There's something going on here. We're not really sure. We're looking at it."

Next thing you know, the business manager is on the phone saying, "Something's wrong here. Customer problem, blah, blah. Escalate."

So it's now 10:00 a.m., and Bob's the NOC representative who's been assigned to manage this incident. Bob's choice is, "Okay, well, I'm going to escalate this." So that means I got to get all the app-specific SREs together, and I got to set up this bridge call for them to start figuring out what's going on.

The bridge call starts. So you've got all these people on the phone: "Try this. Try that." Now, the SREs don't have access to all the systems, so I got to get a sysadmin that's got special production access. His name's Steve. And everyone's telling Steve on the phone, "Do this. Do that." Of course, the business manager, they found their way onto the bridge call. So they're saying, "Is it done? Is it done? Is it done?"

Next thing you know, it's, "Ah, I think it's the Foo service." So everyone's like, "Whew, not me. I'm done."

The Foo SRE is like, "Can you fix this?"

It's like, "No, because this is a totally new app. I'm not really sure what the deal is yet."

So Bob goes, "Okay, I got to escalate this." And so I'm going to add the lead dev for the Foo to this ticket, and that's Karen.

And Karen's in the home stretch of her sprint here. She's our rockstar developer. She's got the house music playing. She's got her cup of tea. And the email's going off, and she's just like, "I'm going to ignore that, and I'm going to ignore that. I'm going to ignore that." Until suddenly, someone knocks on the door.

"Hey, Karen, did you see that ticket?"

"Yeah, I guess so. So I'll take a look at it. Hmm, I don't know what this means. I'm going to need some more log files here." So now I got to go to the ticket and add the sysadmin team to help get me the right log files out of this app.

And if you notice down here, take a little pause. I call this the context wagon. This is one of the evil things about tickets: I'm pulling everybody along with this. I'm adding them to the ticket, I'm adding them to the ticket, and a little piece of their brain has now got to be dedicated to listening in on this ongoing saga. So you'll see my context wagon get bigger and bigger as I go here.

So Karen's like, "I need these logs." She gets the logs. And you know she got the correct logs on the first time, right? Of course not. So the next comes around, next comes around. Finally, it's like, "Okay, I got enough here. So who restarted these services? Why were they restarted? Whoever restarted them, these containers weren't fed the right environment variables. Therefore, this entire service pool needs to be restarted."

It's 2:00. We still got problems going on, but it's kind of a brownout for certain customers. I'm going to add the middleware team, say, "Hey, I need you to restart this entire app pool with the correct environment variables."

The middleware manager calls and says, "Are you crazy? It's the middle of the day. We're going to have customer problems. You need business approval for this."

So now NOC's got to go, "Boy." So I'm going to add the SVP for the whole line of business in charge of all the customers. And her name's Susan. She's responsible for a lot of things. We got to pull her out of a lunch meeting where she's talking to customers. She asks the VPs, "Hey, is this going to be okay? Is this safe?"

Now, these VPs are five degrees away from the keyboard. They haven't touched the keyboard in a long time, other than for email. And they're like, "Oh, yeah. It's a restart. How bad can it be? Maybe a few minutes of downtime." Restart approved.

It's now 5:00 at night. This all started at 9:30 in the morning. The context wagon, by the way, is getting bigger and bigger.

And, hey, so who knows these production services best? Oh, Ellen. Ellen knows this the best. Well, where's Ellen? We just put her on a plane for Europe. We got to help with the launch in Europe, so Ellen's not here.

So, well, who knows next? Scott's like, "I guess I know this."

So now what's Scott got to do? Scott's got to go to the SharePoint server. He's looking for the right docs. He's dumpster diving, looking for the right docs on this stuff. He said, "All right, I think I know what's going on here. I can figure this out."

Scott does the restart. All the services start coming up. Next thing you know, there's this Bar service. And it's not starting. He looks at the logs, tries to figure out the output, says, "Oh, it's waiting for Acme service. What's the Acme service? I don't even know."

Ten minutes go by. He's sweating now because a key production service is not restarting. Turns out things failed. Scott then says, "Oh, geez, my Bar startup timed out. The error says I can't connect to this Acme service, but if I look on the Acme environment, looks like the service is running. Is this error message correct? I don't know what's going on."

So I put all that in a ticket for the Bar SRE. The Bar SRE finally gets back to me and says, "Hey, look, the environment pre-flight check we added as part of our DevOps best practices is failing because it can't connect to the Acme service."

So got to upgrade the ticket now and say, "Hey, this is an important issue." Get the network SRE person involved. They're not really answering. But the business managers are starting to notice that stuff hasn't come back. Customers are calling. I'm freaking out. Scott's freaking out.

Luckily, Scott knows Carlos, because Scott has beers with Carlos, and says, "Hey buddy, can you help me out here? I need a favor. Do you know somebody who can help me?" So Bob, the network SRE, shows up and says, "Hey, look, I think the firewall is blocking the traffic between the Bar environment and the Acme environment. Take it up with the firewall team."

So page the on-call for the firewall. It's 7:30 at night. "I wanted to go home, couldn't go home." I'm thinking about this person's mindset right now.

And so, finally, Scott goes, "What's going on?" It's Freddy, the firewall engineer. Says, "Hey, what are you talking about? This can't be the firewall. We haven't changed the firewall since last Thursday. Your app just died today. I don't know what you're talking about."

"No, no. There's a rule change, a startup problem. Just check and see if the rule's there."

So Freddy logs in and goes, "Oh yeah, last Thursday we changed some rule. It would block access between those two environments because of the security regulation."

"Well, can you change it back?"

He's like, "Sure. On Thursday we'll change it back for you and..."

"No, wait, wait."

So the chief of staff, of course, luckily, he knows his way into everything, found his way into the bridge call, was like, "Freddy, we've got customers calling. You got to do something about this."

So Freddy goes, "Well, I'll put it into NetSec, this important change."

Of course, the NetSec, Nicole, says, "Whoa, this is production. I need three out of five CAB people to approve it."

At this point, everyone's screaming, "It's a customer outage." And we got the VPs, and then the chief of staff says the magic words, "I'm going to call the SVP, Susan." Next thing you know: ding, emergency firewall rule change approved.

It's 8:30 at night. And so Bob and Freddy and Scott are doing the round and around. "Hey, looks like it's going good here. Going good here. Whew. I think things are starting okay."

Everyone's like, "You think? What do you mean?"

It's like, well, our policy was we can't check our own work: separation of duties. We got to have a customer engagement manager check all the APIs for us because they don't trust us that we've actually done things right.

Well, it's 9:00 at night. I send the ticket out. Varsha is the customer service engagement manager for this line of business. Where's Varsha? It's her birthday, so she's at dinner.

And so finally we get Varsha to come home 10:30 at night. Does whatever magic that Varsha does that all these engineers aren't allowed to do. Says APIs are okay. Services are restarted.

By this time, Bob, the NOC person's gone to sleep. They see a bunch of green lights and go, "I guess this is okay. Looks good." And obviously, they have 15 other incidents just like this they're looking at, so they just give it the thumbs up.

Of course, the next day, Susan calls an emergency meeting. Whose fault is this? Why are we so bad at this? What additional processes and approvals are we going to put in place so we never have this problem again, right? And everybody comes up with something.

So that is a way, way too common of an incident, right? Thank you.

That's it. Thank you for my talk.

Now I'm going to bore you with other stuff.

And now people wonder, like, where does our time go, right? We've got DevOps. We've got containers. We've got continuous deployment. Why aren't we getting more done? What's happening? And the reality is the pressure on us is only getting worse. So we're barely hanging on in this model of working in today's environment.

But if you think about it, these two pressures are just crushing down on operations. One is all this digital and DevOps transformation stuff we're talking about: go faster, open things up. Go, go, go. On the other side, it's all of the current business environment, like, don't be the next hack. Don't be the next breach. Lock things down. Be more sure about what we're doing. Take things slower.

So we're getting these conflicting messages, and in the middle is operations. And what happens then is, because of this, we're having more errors, more delays, less flexibility. That all eats up more capacity. So we're at this breaking point where if we think 2x, 3x, 5x what we're doing today, we just aren't going to be able to do it without rethinking how all that happens.

So I think the number one way to go about this is to look at those post-deployment processes like you would any other kind of procedure. So you look at your delivery procedures. We use a lot of lean analysis and think about the discipline around that. We use that same lean discipline and same lean thinking to analyze our post-production problems.

And I don't have time for it today, but if you want to grab me afterwards, I'll explain. This is a true million-dollar incident that a company didn't even know happened. Nobody realized they lost a million dollars because nobody saw the whole catastrophe. So it's a lot of fun. But the idea is you want to break it down, look for the problems, and look for the commonality and build from there.

So I also want to kind of focus on a couple of key lessons that we could apply to that whole value stream we just looked at.

Number one, the first lesson is empowering those closest to the issue. All those problems in that example happened because we're constantly escalating upstream. We needed somebody else to do it. Somebody else needs to check this for me. Somebody else needs to grab that for me. Somebody else needs to restart that for me. Somebody else needs to validate that for me.

Every time we escalate away from it, we're introducing more opportunity for errors, more opportunities for delays. So the key issue is the shift-left idea we always talk about on the deployment side, about pushing a lot of the operational tasks over to developers. Get them to validate our deployment scripts, get them to validate or to do a lot of the testing early on. Use that same concept, the same mindset for these procedures.

How do we push the ability to take action closest to those who are near the problem? They have the context to solve the problem. They probably know better than anybody what's actually going on, and they're the ones that we should be empowering and figuring out how do we safely empower them to take action.

But what gets in the way, right? In almost every organization, it's this idea of silos. We talk a lot about this in the DevOps world here, of silos. And so we should look at what a silo is.

So I say silos ruin everything. They're just at the root of almost every problem we see. But a silo doesn't just mean an organizational boundary. It's really a way of working.

So when you say you're in a silo, if you imagine you're working with a team or a group of people, and when you have the same backlog, you're working from the same backlog, you have the same set of priorities, you're in the same kind of situational context, you have the same tools, you're using the same priorities. I think I already said that. You're all good and you're together and you can get things done.

The problem is nothing lives in isolation in enterprise. You always need some other team. You always need somebody else, and they have their own backlog, their own priorities, their own context, their own tools. And that's when people start to focus inward. They focus in on what they want to work on, and that's where these breaks in context and bad handoffs start to occur. And that's really what the siloed behavior is all about.

That is a destructive... It's a natural force. It's common for people to want to put like with like and divide things up. But it's in doing that that these siloed ways of working start to happen and all these problems start to occur.

And I say, well, how do you spot silos? I don't see big brick walls in these organizations. And one of the best ways to spot silos in an organization is look for the ticket-driven request queues. Where are the ticket systems? And I guarantee you, almost every time, you're going to find some siloed behavior on any side of that.

It's the classic: I need something, I fill it out. I don't really know what I'm asking for. I put it in this free-form text. I send it over to the other person. They don't look at it because they have 10 other priorities. I get a project manager to go bug them. It's escalated. They look at my ticket. They land and do the best thing they can in their own context and throw it back over to me as not quite right, and we go over and over again.

So these ticket-driven request queues, that is where to look for where the silos are, and they're almost always there. And the problem with these ticket request queues is, number one, they're silo builders. Just explained that.

And number two, they're snowflake makers. And snowflakes meaning that whatever they're giving back, it's often technically correct, but it's this perfect little crafted one-off. Because on either side of these silos, I have these kind of labor pools that are taking the requests off the queue and doing some things. They'll do it a little different this time. They'll parachute in over here and a different request, do it a little different this time. Or maybe I do it one way and Bob does it one way and Nicole does it a different way. And you end up with all these little snowflakes.

And that's where the variability creeps in. That's where the worst handoffs happen. That's where the automation fails. Nothing worse than broken automation. The only thing worse than broken automation is slightly wrong automation. To err is human. To make a great disaster, you need a computer.

That's always snowflakes. And these request queues actually have a huge economic impact.

So I don't know if you guys read Don Reinertsen. He's a famous product management guru, researcher, and breaks down really the math behind these request queues and what goes on. And these queues create longer cycle time. That's obvious. You're waiting. Increased risk with all these breaks in context and these handoff points, and we're farther away from the feedback loops. More variability, more overhead, lower quality, and even less motivation.

When people have to put something in a queue and then wait for it to come back later, you intrinsically lose that connection to the outcome of that thing, and people get less and less motivated by their work.

So we have a whole paper that we wrote on all these ideas. But really, this Don Reinertsen has a great book called The Principles of Product Development Flow and talks a ton about why queues are so bad. And all those queues add up to be really, really expensive.

So another Reinertsen comment is, if you're going to measure one thing, measure the cost of delay. It's something that business speaks to. Everybody knows that however long we delay in getting to market is going to be a delay in revenue that we realize.

So every little cut, all those little things that we do, all those little queues we sit in, all those little incidents we have to go solve, all the escalations that interrupt us, all those things add up to equal more and more cost of delay. That's really the language of all of this that the business would speak.
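To make the "it all adds up" point concrete, here is a minimal sketch that totals the waiting time in the incident story told earlier and prices it. The timestamps are the ones narrated in the story; the cost per hour is a purely hypothetical figure chosen for illustration, and the function names are made up for this example.

```python
from datetime import datetime

# Timestamps are the ones narrated in the incident story earlier in the talk.
TIMELINE = [
    ("09:30", "NOC notices the alerts"),
    ("10:00", "incident escalated, bridge call set up"),
    ("14:00", "cause found, full restart requested"),
    ("17:00", "restart approved by the business"),
    ("19:30", "firewall on-call paged"),
    ("20:30", "emergency firewall change approved"),
    ("21:00", "ticket sent for API verification"),
    ("22:30", "APIs verified, services restored"),
]

# Hypothetical figure, not from the talk: what one hour of this brownout costs.
COST_PER_HOUR = 10_000.0


def minutes_between(start: str, end: str) -> int:
    """Minutes from `start` to `end`, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def segments(timeline):
    """Return (minutes, milestone) for the gap that follows each milestone."""
    out = []
    for (t0, what), (t1, _next) in zip(timeline, timeline[1:]):
        out.append((minutes_between(t0, t1), what))
    return out


def cost_of_delay(timeline, cost_per_hour):
    """Total elapsed hours across all gaps, multiplied by the hourly cost."""
    total_minutes = sum(m for m, _ in segments(timeline))
    return total_minutes / 60 * cost_per_hour


if __name__ == "__main__":
    for minutes, what in segments(TIMELINE):
        print(f"{minutes:4d} min elapsed after: {what}")
    print(f"total cost of delay: {cost_of_delay(TIMELINE, COST_PER_HOUR):,.0f}")
```

Most of the 13 hours sit in the gaps between handoffs rather than in the fixes themselves, which is the kind of thing a per-segment breakdown makes visible.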

All right, so let's get rid of those request queues. Let's get rid of those silos. How are we going to go and do that? So the popular thing you're going to hear a lot about this week is cross-functional teams, or market-aligned teams, I think is what they say in The DevOps Handbook. And the idea is let's take all those different people and let's put them in the same team. Let's give them the same context, the same tools, the same backlog, put them all together, and we'll not have those handoffs. We won't have those breaks in context.

The problem is, again, we're in the enterprise. Nothing lives in isolation. We're going to run out of teams long before we run out of services to assign to those teams. So we're always going to fall back and have all these other things that we just can't slice up enough. We don't have a data center for everybody. We don't have enough environments for everybody. We don't have enough DBAs, enough network folks, enough security, enough NOC. So we're going to keep slicing, trying to... We can't slice and put them on those teams. We're still going to have those other external teams.

And plus, those cross-functional teams need to talk amongst each other as well. So we're right back to where we were before, which is ticket systems, silos, long-living request queues, snowflakes, and that whole big, long incident that we had talked about before. Just checking my time here.

So we got to get rid of those remaining ticket-driven request queues. So if we can't get rid of them by reorging into those cross-functional teams, what do we do?

And that's where this operations-as-a-service design pattern comes in. And the concept seems really simple at first, which is, hey, I'm going to take the things that operations does, turn them into standard procedures that I can then safely delegate to those who need those things, and they can use them on a pull-based basis themselves. So that's the basic concept.

And then what happens to my ticket system? Well, my ticket system becomes what it was supposed to be in the first place, which is it's the place for exceptions. It's the out-of-band communication, or maybe still approvals, those you can't get rid of. But make it so it's not in the flow of work. It's not governing our day-to-day work. All that toxic injection that comes from those queues, we're getting rid of that.

And the interesting thing about this is that it really changes how your organization thinks about automated procedures, when you're thinking in this operations-as-a-service mindset. You start to think that a procedure's not one monolithic thing. It's actually three parts.

One is the ability to define that automated procedure. Who has the ability to actually define this thing? The second is the execution. Who has the ability to execute it, and how do they execute it? And the third part is the governance, which is security, the oversight, the compliance. How do I put controls around that procedure?

And so what's interesting is when we start to think about these things as actually three pieces, you start to realize how you can move them to the part of the organization where it makes the most sense.
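One way to picture the three parts is as three separate objects that different groups can own. The sketch below is an illustration only, not how any particular tool implements it: the class names (`Procedure`, `Policy`, `Runner`), the role names, and the `restart-foo` procedure are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Procedure:
    """Definition: what runs, and who wrote it."""
    name: str
    defined_by: str                  # e.g. the delivery team that owns the app
    action: Callable[[str], str]     # takes an environment name, returns output


@dataclass
class Policy:
    """Governance: which role may run which procedure in which environment."""
    grants: Set[Tuple[str, str, str]] = field(default_factory=set)

    def allow(self, role: str, procedure: str, env: str) -> None:
        self.grants.add((role, procedure, env))

    def permits(self, role: str, procedure: str, env: str) -> bool:
        return (role, procedure, env) in self.grants


@dataclass
class Runner:
    """Execution: runs a procedure only if the policy permits, and records it."""
    policy: Policy
    procedures: Dict[str, Procedure] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def register(self, procedure: Procedure) -> None:
        self.procedures[procedure.name] = procedure

    def execute(self, role: str, name: str, env: str) -> str:
        if not self.policy.permits(role, name, env):
            self.audit_log.append(f"DENIED {role} {name} {env}")
            raise PermissionError(f"{role} may not run {name} in {env}")
        output = self.procedures[name].action(env)
        self.audit_log.append(f"OK {role} {name} {env}")
        return output


def restart_foo(env: str) -> str:
    # Stand-in for a real restart; a real action would call deployment tooling.
    return f"foo restarted in {env}"


policy = Policy()                                    # owned by central ops
policy.allow("noc", "restart-foo", "production")
policy.allow("developer", "restart-foo", "staging")

runner = Runner(policy)
runner.register(Procedure("restart-foo", defined_by="foo-dev-team", action=restart_foo))
```

Here `runner.execute("noc", "restart-foo", "production")` runs the restart, while the same call with the `developer` role raises `PermissionError` and leaves a `DENIED` entry in the audit log. The definition came from the dev team, the execution sits with the NOC, and the policy stays with ops.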

So traditionally I think, oh, self-service. I'm going to give somebody a button to push. And that's good. We're starting with something. We relieve a little bit of pressure with that. But the problem is it ends up generally being a rigid thing. Like, here's a button to deploy this. But it doesn't have a lot of flexibility to it, so therefore it ends up being kind of limited, and I'm right back to having to have my central ops team do a lot of this work.

So then people get the bright idea and say, "Hey, well, what about those handoffs? What if we move the ability to define outside of the traditional operations organization?" Well, then we've got high-velocity handoffs because all these people around ops are making things. They know best how to start it, how to stop it, how to do the health checks, how to refresh the cache, how to reset the database connections, all the stuff you'd possibly do. They know it, so why not have them do it, hand it off to operations where they can vet it, and they can run it?

Well, and when you really hit pay dirt, that's when you're able to move the ability to define and the ability to execute outside of those traditional operations boundaries, but let ops maintain the governance, where ultimately security and policy and compliance stay in their control. And that's really when you've hit the self-service operations nirvana.

And I guess even better than nirvana would be when you're able to actually start to move some of that governance with it. So we see this where companies say, well, different lines of business, I can set up the broader security parameters for them. Let's let them figure out who they want to give access to where. Let's let them decide within their own risk profile what makes the most sense.

And so all of that maps onto this operations-as-a-service design pattern, where you've got the collaboration between the people outside of ops and inside of ops vetting these procedures. The ability to execute them can also be shared as well. I can decide what things do I want to push back to these lines of business or developers to run themselves. Maybe I do need a central NOC team maybe to offload some of that work, so I'll move that ability to them.

I can move it around how I want to. And that's the key idea: move the definition, the execution, and the governance to where in the organization it makes the most sense, both because that's the cheapest use of your labor and because it improves the flow of work.

And it brings up a second lean lesson here, which is the ability to standardize the work. You have this kind of platform you can share on. You can standardize the work. And standardization of work is a very lean concept that says that to enable improvement, you need to standardize the work.

If you standardize the work without enabling improvement, basically people become drones. They're disconnected. They're just bored of the same thing. If you let them improve their work but don't standardize it, it's just going to be chaos. So in order to get the improvement and a raise in quality, you want to standardize the work, and having this kind of platform allows people to collaborate and do just that.

So what do these platforms end up looking like? From a tooling perspective, there are a few requirements. The first requirement is the ability to orchestrate actions across all the different tools we've got. So I'm going to have different silos. I had Puppet, I had Chef. That was the coolest stuff. Now I've got Ansible, that's cool. Oh, wait, containers, that's cool. That's a bunch of APIs. I'm going to have all these different generations of things. Different groups are going to want to use the tools they're going to want to use. How do I standardize and orchestrate procedures across all of those? And then how do I collect the output from all those things and bring it back to this platform?

Also, this platform's going to have to know about my infrastructure. Because everything's coming up, coming down, changing, so how do I soft-code or parameterize all those procedures to work with my infrastructure?
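A rough sketch of what "soft-coding" a procedure against infrastructure can mean: the procedure names its targets by tag, and the hostnames are resolved from an inventory at run time. The inventory contents, tag names, and command template below are hypothetical, and nothing is actually executed.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Node:
    hostname: str
    tags: FrozenSet[str]


# Hypothetical inventory; in practice this would be refreshed from a live source.
INVENTORY: List[Node] = [
    Node("foo-web-01", frozenset({"foo", "web", "production"})),
    Node("foo-web-02", frozenset({"foo", "web", "production"})),
    Node("foo-web-stg-01", frozenset({"foo", "web", "staging"})),
    Node("bar-app-01", frozenset({"bar", "app", "production"})),
]


def select_nodes(inventory: List[Node], required_tags: Set[str]) -> List[Node]:
    """Return nodes carrying every required tag, in inventory order."""
    return [n for n in inventory if required_tags <= n.tags]


def render_commands(template: str, nodes: List[Node], params: Dict[str, str]) -> List[str]:
    """Fill a command template once per node; nothing is executed here."""
    return [template.format(host=n.hostname, **params) for n in nodes]


targets = select_nodes(INVENTORY, {"foo", "production"})
commands = render_commands("restart {service} on {host}", targets, {"service": "foo"})
```

Because the procedure only says "foo, production", it keeps working as nodes come up and go down, as long as the inventory is current.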

And then I got to build security in. How do I make sure that I'm able to define access control and who can do what, where, in what environment? And then, of course, I'm still in the enterprise, so I'm going to have to pull the tickets in and allow people to do approvals that way and add information back and forth.

I got to give people the ability to use a web GUI, to use an API, to use a CLI. They got to work the way they want to work.

And what they're doing through it is kind of three things. One is collaborating on these standard procedures. It becomes a common place for us to do that using whatever languages we want. It's a place for me to manage that access and governance, so it becomes a kind of control layer, control tier, for those that know my history, across all of your organization. And it enables us to both execute actions, but also importantly, share that visibility, so people aren't doing things in their own silo. I can go and see what's going on across the organization.

So got a few minutes left. I just want to kind of run through a couple of examples of companies that have kind of arrived at this design pattern on their own. And working for Rundeck, I also get to see what they're doing.

So Ticketmaster, people have seen me talk before. I love talking about these guys because of how far they've come. They had a problem where they deliver fan experiences. So if the Yankees can't print their playoff tickets, that's not a TechCrunch story. That's a front-page New York Times story.

So in their world, a 47-minute average MTTR is huge. And that was where they were at. And they had a new management team come in: Jody Mulkey, the new CTO, Justin Dean ran ops, and then Mark Mohn had actually been there a while, laying down some cool tooling improvements.

And in 18 months of this Support at the Edge program, they had a 90% reduction in MTTR: 47 minutes average response time down to 3.8 minutes. They have spoken about this here and at some other conferences. Awesome stuff.

But probably what kept people really happy was a 50% reduction in escalations. So all of that work that was getting pushed back at those developers, they cut that in half. So developers were like, "This is awesome." And a 55% reduction in overall support costs. It's cheaper to run their business.
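As a quick check on the MTTR figures quoted above, the drop from 47 minutes to 3.8 minutes can be worked out directly. The helper name `percent_reduction` is just for this example.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


mttr_drop = percent_reduction(47.0, 3.8)   # minutes, as quoted in the talk
print(f"MTTR reduction: {mttr_drop:.1f}%")
```

That works out to a little under 92%, in line with the roughly 90% reduction cited.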

So what they did, let me show you three things. So they had this new org support escalation model. They looked at it and said, "Well, hey, here's how SRE escalations work today, let's map this." They used this emergency room metaphor. The NOC were the EMTs. They had 15 minutes to respond to something. The SREs were the ER team. They had 30 minutes. Production engineering support and scrum teams, they had 60 minutes. And then beyond that, it was the specialist surgeons, the different services.

Well, but the point is it wasn't just they had an escalation model, because they had that before, it's that they focused on creating the ability to push the action as close as possible. How many things can be solved by the NOC? Which at this point, beforehand, was just an escalation team. It just launched tickets. Let's make them operators again.

And I think now it's something like 92%, something like that, of all incidents are handled by either the NOC or even giving some of this tooling to customer support people even further up the chain. So the key was giving people the ability to take action.
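The escalation model described above can be sketched as a small table of tiers and time boxes. The tier names and minutes come from the talk; the assumption that each time box starts when the previous one runs out is an interpretation made for this example, as are the `Tier` and `owning_tier` names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    time_box_min: Optional[int]    # None means there is no further escalation


# Tiers and time boxes as described in the talk (emergency-room metaphor).
TIERS: List[Tier] = [
    Tier("NOC", 15),                                    # the EMTs
    Tier("SRE", 30),                                    # the ER team
    Tier("Production engineering / scrum teams", 60),
    Tier("Service specialists", None),                  # the specialist surgeons
]


def owning_tier(elapsed_min: int, tiers: List[Tier] = TIERS) -> Tier:
    """Return the tier that should own an incident `elapsed_min` after it opened.

    Assumes each time box starts when the previous tier's box runs out.
    """
    if elapsed_min < 0:
        raise ValueError("elapsed_min must be non-negative")
    deadline = 0
    for tier in tiers:
        if tier.time_box_min is None:
            return tier
        deadline += tier.time_box_min
        if elapsed_min < deadline:
            return tier
    return tiers[-1]
```

Under that assumption the NOC owns the first 15 minutes, SRE the next 30, and so on. The point of the talk, though, is less the clock than making sure the first tier has the tooling to actually resolve things before the clock runs out.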

And then, of course, they had this long-term investment in operability going on the whole time. This has really happened after this improvement. Better deployment automation, better configuration, better monitoring, better automated runbooks. That means let's cut down the number of incidents, period. But this incident response model, and the tooling that came with it, was the way that they were able to improve it.

So what did they do? The automated ops procedures were written by the delivery teams as part of their definition of done. They gave them the slots to plug into. Ops maintained full control over the policy and who could run what and where. Decided which things they could give back to the teams to run and which things they had to keep. They made the NOC operators again. They enabled the dev team to take that action as well. And, yeah, it was a really cool outcome.

And then I've got one more here, just a quick one. These people aren't as public about it, but it's a Fortune 100 manufacturing and services company.

And their issue was they were going all DevOps. They had all these different teams, all these different lines of business. Some are government related, some are not government related. Everybody needed to do their own thing. And the idea was that they realized these teams are too tightly coupled, so they want to decouple the teams, decouple their IT underneath them. Except they needed the central operations, for governance reasons and just practical reasons, to maintain control over all environments. And each line of business had their own security policies along the way.

So they used this operations-as-a-service design pattern to say ops would stand in control and define the general shell. They would give a standard set of procedures to each of these teams, and those teams would then turn around and write their own procedures and then pass them back through the ops team in a code review style. They used a Gerrit-style thing with Git. And then say, "Yes, these are good to go," approve them, and then go on their way.

So they were handling 5 to 10x the scale that they previously handled without adding anybody to the central ops organization. They think they could do another double in that current configuration.

So to recap, number one, the capacity crunch is only getting worse. So think about this stuff now before it's too late. Use that lean lens to analyze all that ops activity. If I had hours, I could go through and dissect that previous incident we saw.

Shift left all your control and decision-making. That is the number one key. Beware of those ticket-driven request queues. They're evil. They're ruining people's lives. Leverage the operations-as-a-service design pattern wherever you can to get rid of those queues. And you've got to make an explicit, like Ticketmaster's story, make it an explicit investment in process and tooling. This stuff doesn't fix itself.

I recommend drawing out some of these incidents and showing it to the executives in your organization and say, "We have to invest in this. This is a huge chunk of money and time and just pain that we can get rid of if we invest."

So again, my name's Damon Edwards. There's a longer paper, book, whatever you want to call it, on this at rundeck.com/oas, the acronym. And anytime you want, email me. I love to hear people still raising their questions afterwards.

And that's my talk. I think we're up against the time, so I guess I'll catch people one-on-one. Thank you.