Measuring DevOps: The Key Metric That Matters

Las Vegas 2018

How is your DevOps transformation coming along? How do you measure Agility? Reliability? Efficiency? Quality? Culture? Success?!

Having the right goals, asking the right questions, and learning by doing are paramount to achieving success with DevOps. Specific milestones and shared KPIs play a critical role in guiding your DevOps adoption and lead to continuous improvement, toward realizing true agility, improved quality, and faster time to market throughout your organization.

This session will walk you through a practical framework for implementing measurement and tracking of your DevOps efforts and software delivery performance that will provide you with data you can act on!

Anders Wallgren is Chief Technology Officer of Electric Cloud. Anders brings with him over 25 years of in-depth experience designing and building commercial software. Prior to joining Electric Cloud, Anders held executive positions at Aceva, Archistra, and Impresse. Anders also held management positions at Macromedia (MACR), Common Ground Software, and Verity (VRTY), where he played critical technical leadership roles in delivering award-winning technologies such as Macromedia’s Director 7 and various Shockwave products. Anders holds a B.Sc. from MIT.

Full transcript

Anders Wallgren

I'm going to talk about measuring DevOps. I'll talk a lot about what metrics are, why you want to use them, which ones you want to use, and which ones you don't want to use. A lot of this is really about continuous improvement and a little bit of the scientific method, which hopefully underlies a bunch of the things that we do.

But first, an interesting slide. In the last five years, the top five publicly traded companies by market cap kind of split into tech and non-tech. I suppose you could argue a little bit there, but go with me. Clearly, tech has started dominating a little bit there. That's no secret, and we've known for a long time, and we've seen with all the companies that are out there, that software really is the primary driver of disruption and innovation. Whether you're making cars, dishwashers, or rice cookers, software is generally the way that you innovate and distinguish yourself from your competitors. So to stay competitive, we need to deliver better software, safer and faster. All of us need to do that.

And then the question is, do we feel like we can do that? Do you feel like you can release as often as you should, as the business wants? Most of us say, "No. Not so much." Which is one of the reasons why I love this conference, because you get to come hear all the people that went from there to there. Sometimes back down, sometimes back up. It's the stories that are really the interesting part here.

We're dealing with a lot of challenges. We want to get software out on time, we want it to have quality, we want it to provide value for our customers so that they give us money, respect, love, all of those things. We've got manually operated, non-integrated toolchains, lots of silos of automation, even when we do automation. It's difficult to have repeatability and predictability. And for those of you in regulated industries, finance, healthcare, automotive, aerospace, those kinds of things, the lack of traceability and auditability just makes things even more difficult come audit time, or at least uncomfortable. We're not necessarily using our infrastructure to the best of our abilities. We might have low utilization there, which doesn't make the CFO happy, or the CIO shortly thereafter. And we're using a ton of different practices. That in itself isn't necessarily bad. But when we don't know the practices that each one of us is using, that makes it a little bit more difficult to share across. That's a really important thing in any sort of Dev*Ops: DevSecOps, DevQAOps, DevTestOps, Dev-whatever-Ops. Shared visibility and transparency are key.

So let's fix it with metrics. What is it about metrics that we want to use metrics? Do we want to make ourselves look good? Or do we want to make ourselves feel bad? Do we want to make pretty graphs or just understand where we are? Hopefully more of the latter and less of the former, but obviously there are some pitfalls there which we'll talk about a little bit.

Why metrics? Because it's science. It's the scientific method. We make an observation, we have some questions about that, we form a hypothesis, we make a prediction based on that hypothesis, we run a little experiment, we look at the results, measure, and then we go back and we do it all again. Lather, rinse, repeat. When you think about it, this is really continuous improvement, and that whole idea that improving the daily work is more important than doing the daily work itself. We're always learning.

So what does this have to do with DevOps? DevOps is really, I think, applying the scientific method to software innovation. It's visibility, open cultures, using the right tools, doing automation, having humans do the things that they're good at, and running experiments as cheaply as possible. Is that just DevOps? No. That's Agile, that's CD, that's all of the buzzwords that you can throw in there.

How do you want to do metrics? There was a great paper written, I think probably three years ago now, at the DevOps Enterprise Forum up in Portland that talks a lot about metrics in general. It breaks down into three categories. There may be more, but these are the ones that I think are really the interesting ones. The first one is effectiveness: did the thing you build do the thing you intended it to do? Did it provide the value that the customer sought? Are they happy with their purchase? Will they buy more? Efficiency: did it cost me a billion dollars to release my free product that nobody's ever going to pay me for? That's not necessarily the most efficient way to do things. And at the end, but also very important, and I know we talk culture all day long around here, culture: are the teams working well together? Is everything working fine there? It is possible to measure culture. We'll get into that a little bit.

Blameless culture is key. If you're going to go for one of the tenets of DevOps, which is visibility, transparency, and those kinds of things, then you can't have a culture where you shoot the messenger. You just can't. We're going to make things visible. We're going to put things up. We're going to look at them. We're going to make decisions based on them. It's pretty important to do that in a way that doesn't feel like people are being shamed.

Data collection for these kinds of metrics should be automatic and unobtrusive. Why automatic? Because I don't trust people. It's as simple as that. "How long did that thing take?" "Oh, about 12 minutes." Also, it's a subjective thing. Frank is doing the collecting Monday through Wednesday, and Diane does it Thursday through Friday. Maybe they don't look at things the same way. Since we're all doing a bunch of this automation around our CI and CD and DevOps and software pipelines and release and all of those things, a lot of that data, in fact most of it, is already available in there. Take advantage of it. Use it. The metadata around how long it takes you to do a build, how long it takes you to do a test, how often you have a regression, all of those kinds of things are available as data and don't need to be fed to the beast. Even more importantly, they don't need to be massaged on the way into the system.
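As a rough illustration of reading metrics straight out of pipeline metadata instead of asking people, here is a minimal Python sketch. The record layout, field names, and timestamps are assumptions made for the example; a real CI server exposes its own format.

```python
from datetime import datetime
from statistics import mean

# Hypothetical build records, shaped like what a CI server might export.
builds = [
    {"started": "2018-10-22T09:00:00", "finished": "2018-10-22T09:12:00", "passed": True},
    {"started": "2018-10-22T11:30:00", "finished": "2018-10-22T11:48:00", "passed": False},
    {"started": "2018-10-23T10:15:00", "finished": "2018-10-23T10:25:00", "passed": True},
]


def build_minutes(build):
    """Duration of one build in minutes, taken from its own timestamps."""
    started = datetime.fromisoformat(build["started"])
    finished = datetime.fromisoformat(build["finished"])
    return (finished - started).total_seconds() / 60


def summarize(builds):
    """Average build duration and failure rate across the recorded builds."""
    durations = [build_minutes(b) for b in builds]
    failures = sum(1 for b in builds if not b["passed"])
    return {
        "avg_minutes": mean(durations),
        "failure_rate": failures / len(builds),
    }


summary = summarize(builds)
print(summary)
```

Nobody has to estimate "about 12 minutes" here: the numbers come from timestamps the pipeline already recorded, so they are the same no matter who looks at them.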

Unobtrusive is equally important. You don't have to spend a week at the end of each cycle just collecting data so that you can decide where you should be spending your time next time. Hint: not on collecting data. Make it unobtrusive. Just like we want to build quality in, we want to build performance in, we want to build security in, we should also build in visibility and metrics. If you have important parts of your pipeline, or important parts of your product or product life cycle, emit data. You're going to do that when you go into production anyway. Start collecting that data before you go into production. Shift left. I know that's an overused phrase, but for a good reason.

Choose metrics that are measurable, that are objective. If you're going to figure out whether people like your user interface, don't say, "Do you like my user interface?" because only an asshole is going to say no. You're biasing the sample there a little bit. Talk to people and do some surveys. Were they able to accomplish what they came to the website to do? Did it take them about as long as they thought it would, or was it more complicated and more lengthy than they thought it would be? Did they understand the choices that were presented to them, or did they get stuck and have to start over? There are a ton of different ways that you can dig into that other than just, "Did you like it?"

That also leads to things like vanity metrics. Avoid vanity metrics. The number of dollars that you've made, you could probably argue that's slightly a vanity metric, but also kind of an important metric, so I'll make an exception for that. The key thing about vanity metrics often is they're not actionable. "Billions and billions served." So are we going to go to trillions and trillions, or do we go to billions and billions and billions served? It's not really clear where we go from there. A metric that doesn't have an outcome and doesn't have an action that helps you achieve that outcome is generally an uninteresting or maybe even a bad metric. You might want to have your NOC with your fancy graphs and show your customers, but you could just use playbacks from disaster movies and that kind of stuff. It doesn't have to be real at that point. Don't focus on that.

The other thing is focus on teams. Focus on efforts. Don't focus on individuals. Don't shame, don't try to use this as a performance metric for reviews, and those types of things. Focusing on individuals generally doesn't work very well. The individuals will figure out how to game those things, because as humans, we've evolved to be gaming machines. From the second that we looked in the mirror and realized we were self-aware, we've been trying to get one over on the man. That's going to keep going. Don't shame because that leads to gaming, and you definitely don't want those kinds of things.

On the topic of gaming, you want to look out for unintended consequences. This is one of my favorite Dilbert cartoons ever: our goal is to write bug-free software, so we're going to have a bug bounty. Of course, third square down, I'm going to go write me a minivan. Again, this is human behavior. It is absolutely, perfectly, 100% rational for that guy to go off and do that. You have incented him to do that. You've put that banana in front of the monkey, and the monkey is going to want that banana. That's just how we are. You really want to focus on metrics that are objective and relevant to outcomes.

Things like thousands of lines of code written: totally stupid metric. If you can do it in one line of code, by God, do it. That is not something you want to measure things on. But it can get more subtle, too. We fell into this trap ourselves a little bit a few years ago at Electric Cloud, where we started measuring how many support tickets were being closed because somebody noticed that there were a lot of open support tickets. Being the efficient gaming machines that we are, the number of support tickets closed went up and that open rate went down. The problem was the customers didn't really understand and agree with the fact that we were closing some of those tickets. There was an unintended consequence there of, hey, look, our closed number of tickets looks great, but we got some pissed off customers because we didn't solve their problem.

So you have to be careful. What was the outcome we were looking for? We weren't looking for the outcome of we want more tickets closed. So what? If we have four times as many tickets closed this month as last month, does that mean that we have four times as many bugs or four times as many customers? Those metrics in themselves don't necessarily mean anything, much less give you something to action, to do. You want to focus on the quality of the experience that the customer had, the satisfaction. Are they coming back? Are you retaining them? What's the cost of retaining customers? All of those things are more important metrics than how many tickets have we closed.

And this one's really tough: signal-to-noise ratio. Focus on a small number of metrics. If you can, pick one at a time, or two or three or four or five. Don't pick 40, because there's going to be too much going on for people to even absorb it. You have the same sort of danger as with monitoring systems: the boy who cried wolf and false positives. You really have to get to the point where that isn't as much of a problem. Generally, it's by focusing on smaller things. Look at fewer metrics. Pick one or two or three for the next week, month, quarter, whatever the right timeframe is. Decide what the outcome is that you want to achieve and how that metric measures that, and then you can look at the metric and start to get better and better. Hopefully that metric rises and other unintended ones don't rise along with it. The signal-to-noise ratio is a big thing.

Make sure you're communicating the right thing and everybody sees the same thing: what color is the dress? This comes to things like objectivity and all of those kinds of things. If we have four times as many bugs reported this quarter as last quarter, do we have four times as many customers or do we have four times as many bugs in the code? Metrics often have context that they have to be looked at in.
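One way to give a raw count its context is to normalize it. The Python sketch below uses made-up numbers (all figures and names are assumptions for illustration) to show how the same fourfold jump in bug reports can mean two very different things.

```python
def bugs_per_customer(bugs_reported, active_customers):
    """Bug reports per active customer for one period."""
    return bugs_reported / active_customers


# Hypothetical quarters: raw bug reports quadruple in both scenarios.
last_quarter = bugs_per_customer(bugs_reported=50, active_customers=1000)
grew_customers = bugs_per_customer(bugs_reported=200, active_customers=4000)
same_customers = bugs_per_customer(bugs_reported=200, active_customers=1000)

print(f"last quarter:            {last_quarter:.2f} bugs per customer")
print(f"4x bugs, 4x customers:   {grew_customers:.2f} bugs per customer")
print(f"4x bugs, same customers: {same_customers:.2f} bugs per customer")
```

In the first scenario the rate per customer is unchanged, so the growth in reports tracks growth in the business. In the second, the rate itself has quadrupled, which points at the code.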

Take something as simple, quote unquote, as "mean time to recovery." What does that really mean? Does that include time to discovery? How long was it happening before we or someone noticed it was happening? And by the way, who did notice that it was happening? Was it a customer? Was it monitoring? Was it ops? Was it dev? Was it the CEO? Was it the CEO's nephew? Even something as simple as an MTTR comes down to detection time, mitigation time, figuring out why it happened, and making sure it doesn't happen again time. All of those things are important to think about. You may have an incident where your MTTR in terms of from when we started applying the mitigation to when the mitigation was applied was five minutes. That's wonderful. You might look at that and say, well, our MTTR was five minutes. Well, bullshit. If that bug was in there for a month, your MTTR was a month and five minutes. If you then don't put in place processes to prevent that same problem from happening again, or tests or what have you, then you're just adding to that. Definitions are important. Words are important, unfortunately. That's just how it is.
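The "a month and five minutes" arithmetic can be made explicit by splitting an incident into phases. This is only a sketch: the timeline, the field names, and the choice of a 30-day month are assumptions for the example, and your own definition of MTTR may draw the boundaries differently.

```python
from datetime import datetime, timedelta

# Hypothetical incident timeline; the field names are assumptions.
introduced = datetime(2018, 9, 1, 12, 0)        # the bug reaches production
detected = introduced + timedelta(days=30)      # someone finally notices
incident = {
    "introduced": introduced,
    "detected": detected,
    "mitigation_started": detected,
    "mitigated": detected + timedelta(minutes=5),
}


def phases(incident):
    """Split one incident into the intervals that hide inside 'MTTR'."""
    return {
        "time_to_detect": incident["detected"] - incident["introduced"],
        "time_to_mitigate": incident["mitigated"] - incident["mitigation_started"],
        "total_exposure": incident["mitigated"] - incident["introduced"],
    }


result = phases(incident)
for name, duration in result.items():
    print(f"{name}: {duration}")
```

Reporting only `time_to_mitigate` gives the flattering five minutes. Reporting `total_exposure` gives the number the customer actually lived through.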

Metrics that identify patterns that predict impending success or doom are pretty useful. This is the canary, I think. I don't know if that's a canary or not, but it's yellow. It's not in a coal mine, but it ought to be because that's the example, the canary in the coal mine. The sacrificial lamb, so to speak, or the sacrificial canary, I guess. Metrics could tell you, oh, things are about to get wonky. That could be all over the place. If it's a back-end transactional system, if my transaction commit times are slowly creeping up every day this week, what's going on? Are we losing IO capacity? Do we have more business? Again, that metric doesn't necessarily mean it's a bad thing. It might mean we're being more successful and we need to plan and add capacity. Look not just for impending doom, but also impending success. Your impending success may be your doom if you don't have these canaries, if you don't have a way to notice that you're starting to hit the limits of what you've specced out for your application, whether in terms of the architecture of the app itself or the deployment architecture.
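A canary like "commit times are creeping up every day this week" can be approximated with a simple trend line. The Python sketch below fits a least-squares slope to a week of daily values; the sample data and the threshold are assumptions, and a real system would likely want something more robust than a straight-line fit.

```python
def slope(values):
    """Least-squares slope of values against their index (units per sample)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical daily median commit times in milliseconds for one week.
commit_ms = [40, 42, 44, 46, 48, 50, 52]
daily_drift = slope(commit_ms)

CREEP_THRESHOLD_MS_PER_DAY = 1.0  # assumed threshold; tune it to your system
creeping = daily_drift > CREEP_THRESHOLD_MS_PER_DAY
if creeping:
    print(f"commit time rising about {daily_drift:.1f} ms/day: go find out why")
```

The alert deliberately says "find out why" and not "something is broken": a rising line may mean lost IO capacity, or it may mean more business and a need to add capacity.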

These things are going to evolve over time. This is why I think it's okay and good to choose a small number of metrics. Pick a pain point. Pick something that you don't like. Pick something that drives you nuts. Pick something that takes too long. Pick something that fails too often. Just pick something and then figure out a way to measure that and drive a better outcome for that thing, and then move on to the next thing. You don't need 400 metrics. Less is more, for sure, and expect them to change.

You might start looking at, okay, our database transaction times are all over the place. We need to get a handle on that. For the next month, we're going to be measuring that because it's been really wonky and we need to get that better. After the month, and you've applied a bunch of fixes, that line is nice and steady and not growing. You could take that off the front page of whatever display or device or mechanism you're using to radiate that information and just make it an alarm. Now it doesn't get into everybody's faces unless it starts climbing again. Then you put something else on the front page, something else that is now the one big hairy-ass goal that we want to solve and get better at doing. Expect the metrics to change over time. Don't come back six years later and look at the TV screen in the organization and realize, oh my god, we're still looking at the same numbers, because that probably means you're not even paying attention to that screen, or you just suck at improving your numbers. I don't know which. Expect them to evolve and change.

Let's talk a little bit about which metrics. Every coder knows the number of WTFs, or pardon my French, what-the-fucks per line of code, is the only true metric that matters. But I'm going to break it down into four buckets: business value, customer value, team culture examples, and pipeline efficiency examples. You can slice and dice these things in many different ways. These are not the only buckets. The things in the buckets could probably be in other buckets as well. There's definitely some fluidity here.

If you're thinking about business value metrics: customer acquisition cost. What does it cost me to get a customer in the door? What kind of revenue am I getting from that? What sort of market share do we have? What does it cost to keep a customer once we have them? Those are business value metrics, and they're actionable. If we're paying more to acquire customers than we're bringing in in revenue, I'm pretty sure that's not going to lead to a good outcome. We should probably change one of those two numbers to be something else and improve it. If our market share is shrinking, why is that? We have to go figure that out. Similarly, if it's growing and we don't know why, we might want to figure that out too.
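The acquisition-cost comparison is simple arithmetic, sketched here in Python. Every figure is invented for the example, and the function name and inputs are assumptions; the point is only that the two numbers are put side by side so the result is actionable.

```python
def acquisition_cost(sales_and_marketing_spend, new_customers):
    """Average cost to bring one new customer in the door."""
    return sales_and_marketing_spend / new_customers


# Hypothetical quarter: spend and revenue figures are made up.
cac = acquisition_cost(sales_and_marketing_spend=120_000, new_customers=300)
revenue_per_customer = 250.0

sustainable = revenue_per_customer > cac
print(f"cost to acquire: {cac:.2f}, revenue per customer: {revenue_per_customer:.2f}")
print("sustainable" if sustainable else "paying more to acquire than we bring in")
```

With these numbers, each customer costs 400 to acquire and brings in 250, so one of the two figures has to change.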

Customer value metrics: customer sat, pretty big important one, kind of a little bit nebulous, but as Justice Potter... is it Justice Stewart or Potter? I forget which it is... but you know it when you see it, basically. Satisfied customers. Feature lead time: the time from when I code a feature or design a feature to when it's available for my customers. That's a really important metric, because that feature could be a bug fix for a very important customer, or it could be a vulnerability that you're patching. You definitely care about things like lead times. Features delivered: you could do this in points or T-shirt sizes or numbers of features or whatever. The units are probably less important than the fact that they're somewhat uniform and objective, which is difficult to do in this case. And release frequency: how often do we release? Are we monthly, weekly, daily, on-demand, quarterly, yearly? It depends. I'm not sure I want daily firmware releases to my dive computer that I wear on my wrist. I know a couple of bugs I want fixed, but I'm not sure I want them to download every night. Don't fix what isn't broken there, for sure. But release frequency is definitely something that a lot of us care about and will continue to care about.
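Feature lead time and release frequency can both be derived from two dates per feature. The Python sketch below assumes a hypothetical record with a `started` and a `released` date; the feature names and dates are invented, and "started" could just as well be the design date or the first commit, as long as the definition stays uniform.

```python
from datetime import date
from statistics import median

# Hypothetical features: the day work started and the day customers got it.
features = [
    {"name": "export-csv", "started": date(2018, 9, 3), "released": date(2018, 9, 17)},
    {"name": "sso-login", "started": date(2018, 9, 10), "released": date(2018, 10, 1)},
    {"name": "cve-patch", "started": date(2018, 9, 28), "released": date(2018, 10, 1)},
]


def lead_time_days(feature):
    """Days from starting a feature to it being available to customers."""
    return (feature["released"] - feature["started"]).days


lead_times = [lead_time_days(f) for f in features]
median_lead_time = median(lead_times)

# Release frequency: distinct release days over the observed window.
release_days = {f["released"] for f in features}
window_days = (max(release_days) - min(f["started"] for f in features)).days
releases_per_week = len(release_days) / (window_days / 7)

print(f"median lead time: {median_lead_time} days")
print(f"release frequency: {releases_per_week:.2f} per week")
```

The median is used here so that one quick vulnerability patch does not hide the fact that ordinary features take weeks.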

Team culture metrics: employee satisfaction. If you do a net promoter survey of the employees in the company anonymously, how many of them would recommend the company to a friend? What's your retention like compared to the industry average or the geographical average? Are our teams collaborating with each other, or are we still siloed? Are we still in ticket hell where to get anything done I have to submit a ticket, then I have to wait till that person gets back from lunch so they can say that they've accepted the ticket, then they go off and do the work, and eight hours later I get my VM? Or are we collaborating across teams to say, "Hey, look, I can get you that VM in five minutes. I just have to have all the right approvals." What if we set up a self-service catalog where we have all the pre-approved things in there and all the security layered on top of it so that only people who are allowed to do it can do it, and then you get your environment in five minutes instead of five hours or five days or in some cases still six weeks for a VM? Mind you, we're not racking hardware here.
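For the net promoter survey mentioned above, the conventional scoring asks for a 0 to 10 rating, counts 9 and 10 as promoters and 0 through 6 as detractors, and subtracts the two percentages. The Python sketch below applies that convention to invented responses; the talk does not specify a scoring scheme, so treat the cutoffs as the common default, not as the speaker's method.

```python
def net_promoter_score(scores):
    """Percent promoters (9-10) minus percent detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical anonymous answers to "would you recommend working here?"
responses = [10, 9, 9, 8, 7, 7, 6, 5, 10, 3]
enps = net_promoter_score(responses)
print(f"employee NPS: {enps:+.0f}")
```

The score ranges from -100 to +100. As with retention, it is most useful compared against the same survey last quarter or against an industry figure.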

Working across teams to figure out how we can deliver things more quickly while at the same time doing it with governance and with all the auditability and things that we want to do is really important. We all want to feel like we're solving the problem. We don't want to be the team that always says no, or the team that always gets yelled at for being slow. Education and growth is really key. Are we investing in employees in terms of having them learn new skills, relearn old ones, or unlearn bad ones?

One slide, like I promised, on team culture metrics with reference to Ron Westrum, who did some research on the nature of bureaucracies of organizations and classified them into three pieces. The punitive cultures where the bearer of bad news is executed. The bureaucratic culture where, like Robert De Niro in Brazil, the bearer of bad news gets covered in paper until they disappear. Or the generative learning culture where somebody who's the bearer of bad news is supported, and we start an inquiry, and we figure out what happened and why, and should we do something so it doesn't happen again?

Some really great questions to ask in terms of what sort of culture are you in are: on my team, information is actively sought, failures are learning opportunities, messengers are not punished, responsibilities are shared, cross-functional collaboration is encouraged and rewarded. I've seen organizations where cross-functional collaboration is not only not rewarded and encouraged, it's forbidden and it's punished. I don't mean corporal punishment, but career-path punishment. It might sound weird to be reading these off, but there are places where it doesn't work this way. Failures cause inquiry, not finger-pointing, and new ideas are welcomed and implemented if they work, if the data supports it. The more of these that you can say yes to, the better, obviously, in terms of the kind of culture that you're working in.

Five minutes left. I'm going to do a quick deep dive on metrics. A very simple but very linear pipeline I've laid out here with just some examples of what some customers and people out in the industry are doing and the kinds of things they're looking at. In the Dev/CI phase, things like development lead time, rework required by defects, build breakages, downtime. Basically, time not on task is something that's important to look at for your development leads. We did a survey a number of years ago, not a scientific survey, but a few hundred people, and found that the average amount of time that a dev spent waiting for things like builds and test results every week was 12 hours. The average amount of time that a QA person spent waiting for those same things was 20 hours a week, which is kind of scary. Hopefully, they're doing other productive things during that time and not just Facebooking. Unless they work at Facebook, in which case I guess that's okay. Idle time is important. Work in progress and technical debt: have we built a bunch of features that we haven't tested? What is the cycle time?

On the QA side, again, idle time. Are we sitting around waiting for stuff or are we actually working? Are we on task? How many defects were discovered and escaped, and what was the impact of those defects? That starts to get really close to metrics that are a little bit scary. But defects are not something we want, so we're going to find a way to look at those. Really the question isn't so much discovered, because that's kind of your job, but what escaped and why? When something escapes, we don't want to figure out, oh, it was Joe who didn't do that testing. He's an idiot, let's fire him. We want to figure out why there isn't somebody who backs up Joe, or why don't we automate the whole damn thing? Those are the kinds of approaches you want to do there. Mean time to discovery is obviously an important one in terms of testing and QA.

When we're thinking about deployments, and deployments can be not just for production, they can be for testing and QA and even developers, again, what's the lead time? From the time that I decide I need these bits deployed on this type of system or even this specific system, how long till that happens? How often do we deploy? What's the duration of our deployments? Do they take us five minutes or are they 18-hour marathons that are like EST sessions where you don't get to go to the bathroom? What's the change success rate or, flip side, the change failure rate? How often do we have to roll back or roll forward, and how long does that take us? And that whole MTTR thing that I talked about earlier.
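The deployment questions above map onto a few small calculations. This Python sketch uses a hypothetical deployment log; the field names, environments, and numbers are assumptions, and a rollback is used here as the only signal of a failed change.

```python
# Hypothetical deployment log; a rollback stands in for "the change failed".
deployments = [
    {"env": "prod", "minutes": 5, "rolled_back": False},
    {"env": "prod", "minutes": 7, "rolled_back": True},
    {"env": "qa", "minutes": 4, "rolled_back": False},
    {"env": "prod", "minutes": 6, "rolled_back": False},
]


def change_failure_rate(deployments):
    """Fraction of deployments that had to be rolled back."""
    failed = sum(1 for d in deployments if d["rolled_back"])
    return failed / len(deployments)


prod = [d for d in deployments if d["env"] == "prod"]
failure_rate = change_failure_rate(prod)
success_rate = 1 - failure_rate
avg_minutes = sum(d["minutes"] for d in prod) / len(prod)

print(f"prod deployments: {len(prod)}")
print(f"change failure rate: {failure_rate:.0%}, success rate: {success_rate:.0%}")
print(f"average duration: {avg_minutes:.1f} minutes")
```

Filtering by environment matters because deployments to QA or developer systems are deployments too; whether they belong in the same number is a definition worth agreeing on up front.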

On the release side, this is a little bit more efficiency. It's more around release frequency, how much of this stuff is automated, what's the time and cost per release, how predictable are they? Do we hit our targets, whether they be quality targets or feature targets or time targets? Of course, you can hit all of them all the time, because we always get all three. That was sarcasm, in case anybody didn't catch that.

And then operate. Again, mean time to recovery, cost and frequency of outages. A culture thing: how often am I on call after business hours? How often do I have to leave my wife and child at the baseball game and hop into the data center and fix things? That impacts things pretty big. And then, of course, performance, utilization, all of those kinds of things.

What's next? Last slide here, just a tiny little bit of a commercial. If you guys want to play around with this kind of stuff in ElectricFlow, you can go download the community edition. It drags in all kinds of analytics for any and all of the systems that you connect it to, and all of the processes and pipelines and Kanbans and releases and so on that you run through it.

We have 42 seconds for questions, so I timed that perfectly. I'll leave you with that slide up there that has some resources. But I'll hang around here if there are some questions afterwards.