DevOps Kaizen: Practical Steps to Start & Sustain a Transformation
We all love the aspirational DevOps talks about organizations achieving blistering speed and dazzling nimbleness, right? But what can you do when you look internally at your own organization and everything feels complicated, contentious, and stuck? How do you overcome the silos, the legacy, and the entrenched behaviors that are making your DevOps problems seem so intractable?
This talk is about how to start and sustain a DevOps transformation in large and complex organizations using a methodical, and totally reasonable, Kaizen (Continuous Improvement) approach. This talk isn’t about mythical silver bullets or vague philosophies. This talk is about taking a fresh look at proven Lean techniques and empowering teams to find and fix what is getting in the way.
Full transcript
The complete talk, organized by section.
Damon Edwards
This talk is really about applying practical lean techniques in a DevOps context and doing it at enterprise scale.
A little bit about me. Now it's working. Good.
I'm a pretty lucky guy, although you might think that's kind of a weird definition of lucky. But I get to see all kinds of organizations, from the biggest household brands to ones you've probably never heard of. And I get to really see what makes them tick from two different or three different angles, really.
One being my work at DTO Solutions, which I think we're best known for: DevOps consulting, organizational transformations, really applying these organizational management and lean techniques to fix people, process, and tools.
And then a newer thing I'm involved with, SimplifyOps. Some folks might know Rundeck. It's a pure tools company, totally separate from DTO, but gives me a whole different perspective on companies I get to see. If you want to talk Rundeck, Alex Honor is over there with a Rundeck shirt on. You can talk to him. The rest of this is about DTO.
So, I've been doing a lot of stuff with Gene Kim lately. I guess I've kind of become like a warm-up act for him. So we go in these places, and I hear him speak over and over again. And he talks a lot about high performers. He's always talking about cataloging their performance and talks a lot about the artifacts, like the outward behaviors you see them doing.
And after a while, I keep thinking, what is it? What is it about these companies that makes them different? What's their core capability that they have that the low-performing companies don't have?
And I've realized, fundamentally, it's the ability to improve.
And a more specific way to put it: the unique trait of these high-performing companies is that they're good at learning fast, or sometimes we'll say they're good at getting better. That's what separates the high performers from the low performers. Time and time again, it's that internal ability to find and fix what's getting in the way and keep moving forward.
And so this idea of improvement, it's been around for a long time. In fact, the scientific method, it's taught in every grammar school. Well, hopefully grammar school. Maybe junior high school.
Deming was talking about Plan-Do-Check-Act. Now it's Plan-Do-Study-Act. Or the OODA loop. These are all the same process brought to life. It's simple enough. I see a problem, or I have a theory. I'm going to run an experiment. I'm going to see what happens, and then I'm going to come up with a new theory, run another experiment, see what happens, and get in that steady loop of learning.
But this has been so well known, and we've been hearing about it for literally decades. As on that last slide, Deming was popularizing this in Japan in the '50s. It came back over here in the '80s with the lean movement. This is not an uncommon thing.
So why are so many organizations unable to improve? What is it about these companies? It's not that they don't want to get better. What are the fundamental conditions that are causing them this pain?
Number one, the work's not visible. They can't go to a factory floor and see how the cars are being made. Everything's locked in people's minds or in a source repository somewhere, broken up in small pieces. Huge problem.
Number two, the people, because things aren't very visible, they're very siloed, and they're working out of context of each other. Everyone just sees their little piece of the work. They see their little part of the organization, constantly working out of context from each other.
And this inertia is pulling apart the organization at all times. You hear people talking about silo effects. It's really just that people are sitting like with like. You get multiple generations of technologies and people. You buy new companies. You spread throughout the world. And just that inertia of growing larger is going to be constantly pulling your organization apart, and people more siloed and isolated.
So this is why I believe, fundamentally, companies, despite their best intentions, can't get better.
So I want to talk about that visibility part for a second. This is the traditional visibility for technology managers. These are the tools that you've got. You've got your org charts and great plans there. We've got this secret strategy we've been working on for months. We've got our architecture. We've got these documented processes. Thanks to ITIL, we've got more documented processes. We got project plans. We've got release trains. We've got meetings. We've got more meetings, more meetings, more meetings.
And we all know where this is going. It's the illusion of control. We think it's going to work. We know it's going to work if we just try a little bit harder, if we just make some more meetings. This is going to work, right? And it never does.
So the reality is, because we're living in a complex system, we've got all these people. Our people are a complex system, and they're interacting with other complex systems: our systems. So we've got complex systems interacting on complex systems. No wonder things are so hard. No wonder the work's never visible, the people are constantly working out of context, and the inertia is pulling your organization out of alignment.
And something we know about complex systems, thanks to John Allspaw and everybody else who loves to talk about them till they're blue in the face: the only way to sufficiently fix a complex system is to create the conditions for the system to fix itself.
Now, probably Allspaw will tell me that statement's wrong, but that's the general gist of it. We have to create the conditions. We can only hope it fixes itself. We can't actually command and control, go in and say, "This complex system shall work this particular way."
But do we do that? Of course not. What do we do? So imagine we've got this complex system. It's opaque. It's not visible. We can't see in it. And what do we do? People from somewhere in their silos say, "I know the answer. I've got the answer to this." And they say whatever seems great to them.
I can't really see my screen here.
So it's, "This costs too much money, so we've got to outsource this stuff. I've got a proposal for that." Or, "We need results now, so we're going to reorg now, and we're going to keep reorging until we get those results." Or maybe the change management side is like, "If you just were more disciplined, if you just followed my processes I laid out for you, you would do this. You aren't, so I'm going to add more approvals. That's going to fix this problem."
Or the engineers are like, "Hey, I went to this conference. Docker, Docker, Docker, Docker, Docker. We go Docker, no problem." It was Chef, Chef, Chef last year. Or as Alex likes to say, "Shuppet," Chef or Puppet.
And what's everybody living? Living the dream of the Big Bang. It's like, "Hey, here's what we're going to do. We're going to do this thing. We're going to start here. We're going to get there. It's going to be awesome. Give us some money."
But this is the reality. You start here. Just by sheer management force, because you're paying attention to it, it's going to get a little bit better. But then it's going to start to get worse because we're not good at it. And this is where people start to go back to their legacy behaviors.
So the worst thing that'll happen here is we just go back to the way we used to do it and claim victory and say, "Oh, yeah, we got the new thing, but it's just the old thing, and don't worry about it." Reality is we fall into this trough. Everybody freaks out. That's where we all get fired. And then maybe, possibly, we'll end up back where we started.
It's this huge Big Bang idea.
And so what happens? Come on, you know what's happening here. What happens? All these Big Bang dreams, we apply it to this complex system where the work's not visible, people are working out of context from each other, and these silos are pulling us apart. It's a disaster. We've had how many decades of this same problem over and over and over again?
So how do you teach an organization to fix itself? This is the big J, which is the Big Bang, which we're not good at. We're all going to run in fear. Or the small Js. How do we reduce those cycles down to small enough increments to where, yeah, there are going to be little dips, but we're going to get better, and we're going to constantly be teaching ourselves how to improve?
So it's about reducing those Js small enough. Those Js are actually those little Plan-Do-Check-Act cycles, and we're going to PDCA, or PDSA, or OODA, or whatever you want to call it, our way out of this. And if you stand back far enough, it looks like that line that we dreamed of. So continuous improvement, that's where we want to go.
But we've got to do this at enterprise scale. Sounds good. It's a good theory. Sounds nice. So what do we have to do?
We're going to have to keep the improvement efforts aligned. That's huge. We're going to have to scale quickly. We've got to go big. We're going to have to span multiple organizational boundaries. That's scary, or that's just outright political suicide.
We're going to work with a substantial number of legacy technologies. Even if you talk to the folks at Google or Facebook now, they've been around long enough to where they start talking about things as legacy. It goes fast. So we've got to pull the mainframe folks in. I have a whole different deck on bimodal IT and why that's a horrible idea, but that'll be saved for a different presentation.
Got to develop your existing staff en masse. This idea that Netflix went out and hired the best people, so you should too? You can't. Sorry. Only one can win that battle. So how do we develop our people en masse? Or as Adrian says, before they hire your people away from you, how do you develop them and encourage them?
And key, it's got to be self-funding after initial seed investment. It's all about improvements, about fixing ourselves so we can keep investing more and more in improvement. If you expect executives to constantly be funding top line for improvement in IT, they're just going to keep wondering why it's costing more and more. So we've got to figure out how to be self-funded.
So how do we do that? You have to remember, of course, what we're up against. This is the same situation we talked about.
So DevOps Kaizen. What's this idea? Basically, kaizen's a Japanese word for improvement. There's a more specific definition than that, but that'll work for now. In the modern business context, this idea of continuous improvement, really. A systematic, scientific-method-driven approach to improvement, that Plan-Do-Check-Act. It's about total engagement of the workforce. It's not something those techies go and do somewhere. It's about looking at the business as a whole and engaging the whole workforce. And it's about valuing small changes as much as big changes. That what really matters is outcomes.
So if we can find little things, the small suggestions, that'll move us forward, that's just as valuable to us as the big suggestions. And in the DevOps context, it's simply continuously improve the flow of work through the full value stream in order to improve customer outcomes. A little bit chunky, wordy there, but I think you get the point.
So let's take this known concept of kaizen and apply it to this DevOps concept. And everything I'm going to be talking about here 100% comes from people before me, much smarter than me. I think we've just taken it and applied it to this specific field with its own quirks and individuality.
So the three core types of activities. Now, this ends up looking different in each organization you go to, and some organizations do this without even knowing that they're doing it. But to bring this effect out, so you can make the work visible, you can bring people together, you can overcome that inertia, these are the three common types of activities that you need to have.
The first core one is a notion of the right service delivery metrics. You see organizations, they have tons of metrics in these different silos. They don't actually have anything about how we are delivering as an organization. What is it like to go from a business idea to a customer-facing outcome that somebody cares about? How are we doing as a business aligning towards those things?
So lead times, obvious. Not just how much time overall, but lead time versus processing time, meaning how much time are we actually doing value-added things? Are we actually adding value versus meetings, and waiting, and rework, and all those other things?
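The lead time versus processing time comparison can be sketched in a few lines. This is only an illustration: the step names and hour figures are made up, and the ratio of processing time to lead time (often called flow efficiency in Lean writing) is a label the talk itself does not use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str
    processing_hours: float  # time spent on value-added work
    waiting_hours: float     # queues, meetings, approvals, rework


def lead_time(steps: List[Step]) -> float:
    """Total elapsed time across the value stream, working or not."""
    return sum(s.processing_hours + s.waiting_hours for s in steps)


def processing_time(steps: List[Step]) -> float:
    """Only the time in which value was actually being added."""
    return sum(s.processing_hours for s in steps)


def flow_efficiency(steps: List[Step]) -> float:
    """Share of the lead time that was value-added work."""
    total = lead_time(steps)
    return processing_time(steps) / total if total else 0.0


# Hypothetical numbers, for illustration only.
steps = [
    Step("plan", 8, 72),
    Step("build", 40, 40),
    Step("test", 16, 104),
    Step("deploy", 2, 38),
]

print(lead_time(steps), processing_time(steps), flow_efficiency(steps))
```

Even in this toy example, most of the elapsed time is waiting rather than work, which is the kind of gap the metric is meant to expose.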
Mean time to detect, mean time to repair seem like classic ops metrics, but when you put them in front of a development organization, they have a role in this too. How in control are we in terms of being able to manage our systems? How well are things instrumented to detect problems and failure? These are organizational metrics that need to be tracked for how we're doing as an organization delivering, not just operations metrics.
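As a rough sketch of how these two numbers might be computed from incident records: the record layout and timestamps below are hypothetical, and measuring repair time from detection (rather than from the start of the failure) is an assumption, since definitions vary between organizations.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records, for illustration only.
incidents = [
    {
        "started": datetime(2024, 1, 8, 10, 0),
        "detected": datetime(2024, 1, 8, 10, 10),
        "resolved": datetime(2024, 1, 8, 11, 10),
    },
    {
        "started": datetime(2024, 1, 9, 14, 0),
        "detected": datetime(2024, 1, 9, 14, 30),
        "resolved": datetime(2024, 1, 9, 15, 0),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average gap in minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Assumption: detection is measured from failure start,
# repair is measured from detection.
mttd = mean_minutes(incidents, "started", "detected")
mttr = mean_minutes(incidents, "detected", "resolved")

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```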
Then quality at the source might seem like an odd one, but essentially, there's a notion of scrap. In the lean world, they say that if I'm making bumpers, I'm stamping out bumpers, and I hand the bumpers off to you to weld them on the car, and the hole is one centimeter off or five centimeters off, and you have to now either jimmy the bumper on, or you got to re-drill it, or fix it, or send it back to me, that's scrap. I've done something that's not ready for the next person in the line.
So I want to look where that's happening, because those are those rework loops that are hurting us. But more importantly, I want to know where that happened, and how far along the line did I catch that? Because the farther it goes, the more expensive it gets to fix.
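One way to picture "how far along the line did I catch that" is to record, for each piece of rework, the stage where it was introduced and the stage where it was caught. The stage names and defect records below are invented for illustration, as is the "escape distance" label.

```python
# Hypothetical stages of a delivery line, in order.
STAGES = ["requirements", "development", "integration", "qa", "production"]

# Hypothetical rework records: (stage introduced, stage caught).
defects = [
    ("development", "development"),
    ("development", "qa"),
    ("requirements", "integration"),
    ("integration", "production"),
]


def escape_distance(introduced: str, caught: str) -> int:
    """Number of stages a problem travelled before someone caught it."""
    return STAGES.index(caught) - STAGES.index(introduced)


def caught_at_source_ratio(records) -> float:
    """Share of problems caught in the same stage that introduced them."""
    if not records:
        return 0.0
    at_source = sum(1 for introduced, caught in records if introduced == caught)
    return at_source / len(records)


distances = [escape_distance(i, c) for i, c in defects]
print(distances, caught_at_source_ratio(defects))
```

A larger distance stands in for a more expensive fix. Tracking the ratio over time would show whether quality at the source is improving.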
So a whole other presentation on these, but I think you get the point.
The really exciting thing here is the notion of how do you make the work visible? Measure. It's in people's heads. It's in laptops. It's in source repositories. How to make the work of the organization visible? And that's this notion of these retrospectives.
And it's a per-value-stream tool. So a key thing that we're trying to do here is encourage horizontal thinking. Everyone's thinking vertical. I'm the best firewall rule changer west of the Mississippi. Every Thursday at 4:00 PM, I change firewalls. If by Wednesday at 2:00 PM you have the right ServiceNow ticket filled out, I'll fix it for you. If not, you got to wait again. If you're not an expert in my arcane firewall-changing request language, I'm going to kick it back to you.
And so I'm thinking I'm doing a great job, but I'm not thinking horizontally in terms of how am I impacting the end-to-end lifecycle.
So retrospective is really about looking at your organization as a series of horizontal value streams. And what's a value stream? Imagine there's a point of transaction with the customer, and the value stream is everything that has to line up, people, process, and tools, to make that point of transaction happen. So we're thinking about the horizontal slice of our world.
Then what are we going to do? We're going to get the people together who work on that value stream. It's key to be cross-functional, from the product people to the operations folks to QA, whatever you've got in between: security, change management. All these get together and really focus on the flow of information and artifacts through that value stream, and also looking at key metrics as you go.
But the key thing here is that in the lean context, I talk about going to the Gemba, and a lot of people see value stream mapping, they walk around to people's desks and say, "What do you do?" And then walk to someone else's desk, "What do you do?" And you're not making the Gemba or the workplace visible that way. That's their seat. That's not what they're actually doing.
Value stream mapping brings them together. The way we do it, you want to bring them together to look on one whiteboard or large butcher paper and be actually mapping out how they work. And the key thing there is not making a pretty value stream map, but it's graphically facilitating everybody having that same common vision of how things actually work. Again, taking inspiration from lean.
Then the second step is, okay, now we actually have some kind of agreement on how artifacts and information flow through our organization, so let's go back with the red pen and figure out what are all the things that are getting in the way of delivery.
We lean a lot on Mary Poppendieck's seven wastes of software development. We've kind of massaged a few of the names, but you'll recognize them. We've added an extra one. If you listen to our podcast with her, she gave it a stamp of approval, sort of.
And the key thing is, we're focusing on what's getting in the way of the flow of value. Not gripes, not bellyaches, not everything we possibly could fix, but concrete, what are the things that are stopping the organization from delivering, that are causing those lead times to be long, that are causing the scrap and the rework, that are causing our mean time to detect or mean time to repair to be too long.
And then step three there is identifying countermeasures. Okay, let's look for those hotspots and say, what are the actionable, backlog-ready things, the baby steps that we can actually do to go and make this stuff better? Focusing on the small Js, not the big Js. Stuff we can actually envision. Say, "Yeah, we can go do that. What will we do today?"
And it's about empowering people from across the organization to get together and think about that. Often, if I do something different, it'll be better for you. If you do something different, it'll be better for me. We can meet in the middle and figure these things out.
And then step four is taking those countermeasures and getting some agreement on what are we going to go do. And there's this notion of, I'll talk more about them later, the Toyota Kata style. Could be A3. I'll actually mention that later. But it's basically a storyboard. It's a sales technique that you're going to be able to say, "This is what we want to go do. This is what we want to go fix."
It's great for selling horizontally in the organization, getting everyone on the same page. It's great for getting your boss to buy in. But the idea is key actionable short-term baby steps. What are we going to go do next to fix this flow?
And this is what it actually looks like. It's not a pretty thing, but the best thing is, you now have, from across the organization, people can walk up and explain to you, this is how we work. This is actually how things happen in this system, and this is why they break. This is why things take so long. This is why we have the issues that you're having.
And it's a very powerful thing to see people for once on the same page, to say, "Yes, okay, now we know what we're going to go try to fix next, see how that goes, and then meet back and do this exercise again."
So this is another one. It's simpler. It's more high level, but it's a real one that was done with the client. Sorry for the slow build there.
A little bit of an example of the kind of stuff, the kind of baby steps that they were talking about. In this situation, it was a large financial transaction processor. Things were taking like 40 weeks to get a feature out the door. Seventeen weeks of that was spent on planning up in advance.
Things they noticed: with all this planning they were doing, they said 71% of the time, the requirements either were wrong or they changed. So kind of the idea of, well, why are we doing these big, large planning cycles if it's going to be that way all the time? Again, that illusion of control was strong there.
They had this architecture review board that was happening. Sometimes it was the key people. They didn't really have the full context to actually make a judgment on something. So it was either a rubber stamp, or they would kind of get in the way of something that they weren't the ones actually making the change. So there's all kinds of issues and problems there.
So the team said, "Hey, this is what we want to do." Some stuff was already in flight. They want to work in smaller batches. They want to get the ops involvement early, because ops didn't see any of this. They were part of that 71% of the requirements being wrong, and they didn't find out until 20-something weeks down the road.
They're working on a standardized product catalog, but trying to say, "Hey, instead of having this architecture review board that's rubber-stamping things, what can we do to standardize this catalog, build those design standards in?" And then also this notion of the idea of plan and design by those who will do. How do we drive it so the planning and the design is actually being done by the people who are going to do the work? Because they're the ones in the system that know how it works.
And the key thing is, what can we do next? Not what is nirvana.
Let me look at one more little slice here before I move on. Huge problem with environment contentions. We only had two integration environments, so developers were kind of living in their own world for a long time, and then they were finally integrating in a big bang, and it was a mess. And things were constantly breaking.
Fifty-something percent of the time, I can't really read, the environments were wrong or had a problem that was undermining testing. I think 80 or 90% of the time, the data setup was incorrect. So massive amounts of rework and scrap happening in that process.
And so it's like, well, why? Because it's too hard to set up our environments. Well, why? It came down to the data setup.
So the ideas they had is, well, hey, first let's have the dev side provide service verification tests. How do I know if this service is up and is running and is happy? And let's have ops provide environment verification tests. So how do I know that developers and QA are actually developing and testing against something that looks similar or prod-like?
And then they also added self-service test data setup, including for the mainframe, which was kind of amazing when they got the mainframe team actually in on that.
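The talk notes that the real solution was largely shell scripts. Purely as an illustration of the idea, here is a minimal Python sketch in which dev supplies service checks and ops supplies environment checks, and both run through the same small harness. Every check name and value below is a hypothetical stand-in, not something from the client's system.

```python
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def failed_checks(results: Dict[str, bool]) -> List[str]:
    """Names of the checks that did not pass."""
    return [name for name, ok in results.items() if not ok]


# Hypothetical environment description and expectations.
environment = {"schema_version": 12, "test_data_loaded": False}
expected_schema_version = 12

# Provided by dev: is the service up, running, and happy?
# (stand-in that always passes; a real check would probe the service)
service_checks = {"service reports healthy": lambda: True}

# Provided by ops: does the environment look prod-like?
environment_checks = {
    "schema matches prod": lambda: environment["schema_version"]
    == expected_schema_version,
    "test data loaded": lambda: environment["test_data_loaded"],
}

service_results = run_checks(service_checks)
environment_results = run_checks(environment_checks)
print(failed_checks(environment_results))
```

Running the checks before testing starts would surface a bad data setup immediately, instead of as a confusing test failure later.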
So again, baby steps. Things we can do in the coming weeks, if not months, to start really cutting some of these huge chunks of problems out of the system.
Now, before this, they had all kinds of major plans. It was the new cloud was going to solve all of this. The new automation system was going to solve all of this. They wanted to adopt a whole new testing framework, and this was really a bunch of shell scripts and just some people empowered to do some new things and able to cut huge amounts of time out of the lifecycle.
And so then it all kind of comes down to these storyboards. That's how they decide to go and do something. This is very much Toyota Kata-inspired, A3-inspired.
Simple. I've got a process name. What's the key challenge or the business sort of sales pitch that I want to say here? What's my current condition down the bottom left there? What's my target condition? In the middle, what are my improvement metrics? How do I know when I'm getting from my current condition to my target condition? The to-do, which is basically a bunch of baby steps. Again, we tried to find a fancier name, but baby steps was something that people actually understood. And what are my blockers? Who do I got to talk to, and who's going to get in the way?
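The storyboard fields just listed map naturally onto a small record. The sketch below is one hypothetical way to represent it. The field names follow the talk, but the `unanswered` helper and all example values are assumptions added for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Storyboard:
    process_name: str = ""
    challenge: str = ""            # the business "sales pitch"
    current_condition: str = ""
    target_condition: str = ""
    improvement_metrics: List[str] = field(default_factory=list)
    baby_steps: List[str] = field(default_factory=list)  # the to-do list
    blockers: List[str] = field(default_factory=list)

    def unanswered(self) -> List[str]:
        """Fields still left empty, in storyboard order (hypothetical helper)."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical, partially filled storyboard.
board = Storyboard(
    process_name="Integration environment setup",
    challenge="Environment problems keep undermining testing",
    current_condition="Two shared integration environments, manual data setup",
    improvement_metrics=["share of test runs blocked by environment issues"],
    baby_steps=["dev provides service verification tests"],
)

print(board.unanswered())
```

A coach reviewing the board could start from whatever `unanswered` returns, which fits the use of the storyboard as a coaching tool.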
And these slides will be up later. You can read the one on the right. But it's all very much out of the Toyota A3 idea. And it really becomes part sales pitch. You're able to kind of sell horizontally to your colleagues, make sure we're all on the same page. It's great for selling vertically to management to say, "This is why we need to do something, and this is why I can prove that I know what I'm doing."
And it's a great coaching technique, that someone fills that out and answers all the right questions to know, do they really know what they're talking about, and do we have a reasonable approach, and can we all kind of agree that we're on the right track here? Again, very much Toyota Kata-inspired.
And this really becomes a repeatable and scalable coaching pattern. These storyboards backed up by the value stream map, which makes the work visible. People can really say, "This is what we want to do. Let's go do it. Let's get it done." When they make progress or don't make progress, we can meet back together and see what the impact it's had and move on from there.
And this is how an organization learns fast. These things come together. The organization starts learning, and you get better at getting better, which is the key part. And then people start to make bigger and bigger leaps.
And the last piece is the program oversight. This is the executive leadership that needs to be involved because you're really talking about something that's going cross-function, cross-silo, cross-organization.
So what do they do? They're looking at the metrics, hoping they're going in the right direction. They're watching these planning and retrospectives. They're seeing these countermeasures and blockers coming out of these storyboards.
And other than saying, "Good job," and helping with that coaching the Kata, what are they there to do? They're there to do three things.
And actually, Paula Thrasher's keynote tomorrow, Paula Thrasher from CSC is going to talk more about this in depth. But they're there to provide the will to make the change happen, the resources to make the change happen, and the follow-through, and they're there to clear obstacles. And that's it.
They're not there to provide you with the command-and-control plan because then we're back in that infernal illusion of control. They're there to empower the organization to learn how to fix itself.
And if you want to get them inspired too, there's all these great things you can get them involved with: The Phoenix Project, Steven Spear last night, fantastic book. Stanley McChrystal's Team of Teams, Jez's Lean Enterprise, Gary Gruver's new book. There's a lot of proof out there to back you up that, hey, the command-and-control ways are not getting the results that we want. There's ways for us to build this shared consciousness and move fast and reliably and improve ourselves.
So there's a whole other talk about the management mindset to make this happen. But just through this concept of the storyboards, the value stream mapping, you're making the work visible, and you can bring them into this and show it to them like it's basically a supply chain management problem. It's no longer a very technical, specialized thing.
You can go to the COO or the CFO of a company and make your case for why things work the way they do today, what you want to do about it, and what benefits they're going to get from it, and you can back it up with the stuff we've talked about.
So that's kind of the three core pieces. And I know I'm flying through here, but I've got a minute left.
So how do you apply those? Obviously, the improvement program oversight's an obvious thing. But this really becomes an overlay, which you can put over any delivery methodology. You got some teams that are waterfall, some teams that are agile, some teams that are Water-Scrum-Fall, some teams that are who knows what. You can overlay this methodology.
So you're constantly assessing those metrics. You're doing the retrospectives at regular intervals, and then you have the program oversight looking at things, providing the will to change, the funding, and making sure that the blockers, the most important part, are being cleared for you.
So, let's recap. This DevOps Kaizen idea, what's it all about?
Establish these program elements. A lot of them are just the ceremonies. So making sure we're getting the right metrics, the ceremony of the retrospectives, teaching people those techniques. It's amazing when you see them individually, on their own, jump up on a whiteboard in a group of three people and start using that notation to talk.
You want to bake it into the operating model for your organization. Key things: making the work visible. The number one most important thing is that. And number 1B would be focus on the continuous improvement, the small Js, the PDCA, and propel yourself forward.
And if someone says, "Well, this isn't fast enough. I need a bigger bang change," think of all the organizations out there that move blisteringly fast and punish their competition. This is the model that they follow: continuous improvement.
All right. And that's the end of my time. Thank you very much.