DevOps Metrics We Love

London 2020

Craig Cook

DevOps Coach · IBM

Ann Marie Fred

DevOps and Security Lead · IBM

Our Vice President came to us asking if we could build a dashboard to measure team velocity. We came back to him with a list of metrics we would rather measure instead.

We wanted to improve DevOps practices and behaviors across multiple squads that are building and operating a global marketplace. There are hundreds of things we could measure. We seriously considered about 50 metrics that we could track (many from DevOps Enterprise Summit talks, or Accelerate) and chose to surface a dozen or so that would give us the most insights into agile delivery, the flow of work, availability, and code health.

This talk is about the metrics.

Assumption: you have a culture that will visualize metrics without fear.

How do we justify the value of each metric when challenged?

How do we set our thresholds for "good", "acceptable", and "poor" scores?

Which metrics are related and grouped together on the dashboard?

Which metrics do we intentionally not visualize and why?

Which metrics do we wish we had, that we don't?

Why we made it difficult to compare data across teams.

Full transcript

Ann Marie Fred

Hi, everyone. I'm Ann Marie Fred from IBM. Craig Cook and I will be talking about DevOps metrics we love and our experience with using these metrics to drive measurable change within our teams.

In this presentation, we have two key takeaways. First, why should you care about DevOps metrics? What can they do for you? And second, how can you incentivize the right behaviors without encouraging people to game the system?

But first, a little about ourselves. I'm the DevOps and Security Lead for our commerce platform. That's the part of IBM.com that enables online sales. It includes things like the checkout process, provisioning, My IBM, where you manage your subscription and billing information, and the product catalog.

I have more than 15 years of software development experience, including nine years of working in a DevOps environment where test automation and continuous delivery are a way of life, and sharing the benefits of that with others. I spent the last five years on call for various production applications. I spent three years as a development manager, then switched back to a technical role to start a formal DevSecOps program. And I've been our privacy and security compliance lead as well for the past year.

Over to you, Craig.

Craig Cook

Thanks, Ann Marie. Most of my background has been in operations and infrastructure. The last few years, I've been helping internal IBM teams improve their DevOps practices. Not having direct authority over any squad made things very interesting. I had to understand where their pain was. Every squad is at a different stage on their journey. I then have to deliver the value and explain why they should implement these ideas. Ultimately, my goal is to help squads get value into production faster with higher quality.

Today, I started a new role. A shout-out to all the DevOpsDays organizers worldwide. I help with Raleigh.

Ann Marie and I do not speak for IBM. What you're going to hear today are our own opinions. IBM has some fantastic lawyers. I really do not want to end up here again.

Ann Marie and I are individual contributors. As you can see, there are a few layers of management above us. This talk draws on the last five years of helping build IBM's global marketplace. When we started our organization, we started with the Spotify squad model. We modified it a few times. Our squad was part of the Test and Operations squad. Our goal was to improve the DevOps practices of other squads.

When we first started, one of the first things we did was create an availability dashboard to understand the uptime of the services in the marketplace. You'll hear more about this later.

Around September of 2018, our vice president came to us with a request to build a velocity dashboard. He believed his squads were performing well, and he wanted data to show it. The problem is, we already knew a couple of things from experience. The first is that story point numbers are not consistent across squads, and the second is that if you ask teams to increase their velocity, they'll probably just start padding their story point estimates. What we really wanted were metrics that would help teams improve their agility and efficiency while maintaining quality.

We had a gut feeling that certain things were making future deliveries slower, and we chose an initial set of metrics that focused on code reviews and the code delivery process. The screenshot on this page is of our first prototype. For the record, we eventually threw this version away and started over again, but we got some good feedback from it.

Ann Marie Fred

This talk is in four parts. We'll cover the discovery phase, prioritizing the metrics, the gory technical details of those metrics, and outcomes: what changed on our squads as a result. Craig's going to talk about some of the first decisions we made.

Craig Cook

We really did not want to make our own custom dashboard. We'd already done that and knew it was a lot of work. The year prior to this effort, we had evaluated an open source project that did something similar. It didn't have support for Travis. Travis is a primary CI/CD platform that we use. We spent a few weeks trying to add support and gave up in frustration.

When this initiative came up again, we evaluated that project. Still no Travis support. That got discarded. Our VP had been contacted by a vendor with a commercial product that did something similar. We were encouraged to evaluate the tool. Once we dug in, we discovered there would be a lot of custom work to get our data sent to this tool. It would also be very expensive trying to send data at the scale we wanted to send.

That left us the custom option. We'd done it before. How hard could it be?

Ann Marie Fred

Once we chose the custom option, we then had to decide what we were going to put on it.

We often get pushback on the metrics we measure from teams that get a poor score, so we needed research backing up the predictive value of each metric and how each of the metrics ties back to business results. Our first resource was flow metrics from some of Dominica DeGrandis' talks. We like the theory behind these, but some are easier to measure than others.

Next, we used the "Accelerate" book by Dr. Nicole Forsgren, Jez Humble, and Gene Kim. Accelerate highlights four key metrics: lead time, deployment frequency, mean time to restore, and change fail percentage. Some of the metrics we used have been around for so long that we didn't know where they came from, such as the availability or uptime of a service.

We also considered a couple of metrics we'd heard of from Microsoft. We have a SonarQube server, which gives us scores for security, code quality, and test coverage. And finally, we invented a couple of our own, including deployment stability and the overall DevOps score.

We already had this availability dashboard that we use in quarterly reviews with our general manager. This visibility into availability puts pressure on the low-performing teams, but also gives them the air cover they need to fix availability problems at the cost of feature work.

The goal of our minimum viable product, or MVP, was that it should be small, useful, and fully functional so we could get feedback on our ideas. As you saw from the prototype screenshot earlier, we started with just a few metrics we could pull together in one month and deployed the app to gauge interest.

First, we learned that everyone appreciated the way we had pulled together several different types of metrics into one dashboard. We had availability, build and deployment time, code review time, and batch size in one place, and we had weighted the scores to come up with an overall DevOps score.

However, a majority of the people were afraid that the metrics would be used against them, or that they would get uncomfortable questions from executives about any red tiles on their dashboard. We realized from the beginning that it's critical to optimize the dashboard for squads using it themselves, not for somebody looking over their shoulder and telling them what to do. Before going any further with development, we trained our executives that they were not allowed to question any metrics directly with the squad.

Craig Cook

Now let's talk about how we chose metrics. What metrics do we want to see? We already had a dashboard that showed availability data. That was in. Our SVP wanted to see velocity metrics. There are various ways to interpret that. Quality is important. Do you have good development and operational practices? Agile is good. How do you know you're building the right thing? How do you get those metrics? Our candidate list had 34 metrics.

Just because we could visualize something doesn't mean we should. We wanted our dashboard to drive best practices that will raise the quality of all services. When developers are on call and woken up for their own services, they get very interested in highly available architectures. We knew from the Accelerate data that high-performing squads deploy at least daily. We want to see deployment speed. Work in progress is a Kanban metric. It's better to complete one item than have ten almost done. Test coverage gives you confidence that your code is working as intended. Automated security is hard to do. We want to visualize those metrics as well.

We knew story points could easily be gamed. It's easier to count the number of stories instead. We got that idea from the Everyday Kanban website. Metrics that were not automated or were too hard to get were thrown out. Human-created metrics could create variability and cause pain. We don't want to cause people pain when we're trying to get them to adopt a new service.

How would you visualize lead time with your teams? How many work items are never completed? Our independent squads use different task tracking tools. It became very complicated to try and tap into each of them and get that data out. We did a quick review of our own data and discovered that most stories are either implemented and completed quickly or not at all.

Defects, stories, and unplanned work are all shown on the work completed graph, not as separate goals or ratios. Something like defects per developer could create the wrong impression and let people jump to conclusions. We don't want to encourage bad behavior. Defects outside of SLA are handled through a different process. Lastly, we don't enjoy herding cats, affectionately known as squads. Trying to ask them to change their workflow was going to cause trouble.

Once we had a list, we plotted each item: importance on one axis, feasibility on the other. That made it easy to see where we should start writing stories and executing on gathering these metrics.

Ann Marie Fred

Now we'll get into the technical details of the metrics that we chose. If you're a metrics geek, this is the part of the talk you've been waiting for.

We set our own thresholds for good, acceptable, and low scores based on our experience of what high-performing teams do in our organization. To get a green or good score, you're going to have to be best of breed, not just barely acceptable. We have high standards, and we don't believe in grade inflation. We showed the calculations we used right on the dashboard so people can see them for themselves and to invite debate.

The first metric is availability, and it's measured relative to your service level objective, or SLO. Most of our applications and web services agree to a 99.95% SLO. Our corporate-wide authentication service has a 99.9% SLO, so squads depending on it can't be higher than that, and they commit to a lower 99.85% SLO instead.

The availability score is based on the uptime of the deployed application relative to its service level objective over the last 30 days. A score of 100 is given for 100% uptime, decreasing to 80 for exactly meeting the SLO, and to zero at four times the downtime your SLO allows. So if you're allowed to have 0.05% downtime and you have 0.2% downtime or worse, you will get a score of zero. It's hard to get a high score, but our goal is that the site should never be down.
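As a rough illustration of the scoring just described, here is a minimal sketch. The three anchor points (100, 80, and zero) come from the talk; the straight-line interpolation between them and the function name `availability_score` are assumptions, not the dashboard's actual code.

```python
def availability_score(uptime_pct: float, slo_pct: float) -> float:
    """Score 30-day uptime against an SLO on a 0-100 scale.

    Anchor points from the talk: 100 at 100% uptime, 80 at exactly the SLO,
    0 at four times the allowed downtime. Linear interpolation between the
    anchors is an assumption made for this sketch.
    """
    allowed = 100.0 - slo_pct       # downtime budget, in percentage points
    downtime = 100.0 - uptime_pct   # observed downtime, in percentage points
    if allowed <= 0:
        raise ValueError("SLO must be below 100%")
    if downtime <= 0:
        return 100.0
    if downtime <= allowed:
        # Within budget: slide from 100 down to 80.
        return 100.0 - 20.0 * downtime / allowed
    if downtime >= 4 * allowed:
        return 0.0
    # Over budget: slide from 80 down to 0 at four times the budget.
    return max(0.0, 80.0 * (4 * allowed - downtime) / (3 * allowed))


if __name__ == "__main__":
    for uptime in (100.0, 99.95, 99.9, 99.8):
        print(uptime, round(availability_score(uptime, 99.95), 1))
```

Under these assumptions, a service with a 99.95% SLO that exactly meets it scores 80, and one at 99.8% uptime scores zero, matching the example in the talk.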

The Accelerate book focuses on mean time to restore as a key metric. Availability is a closely related metric, but it's easier for us to measure directly.

Deployment frequency is based on the number of successful deployments over the last 30 days. In short, you must consistently deploy at least once per week to get the highest score.

Teams that do well on this metric make small, frequent, and low-risk changes. They use continuous delivery, they fully automate their tests, and they never change or reconfigure servers running in production. Instead, they change infrastructure code in GitHub and redeploy. They also patch their systems frequently.

Deployment stability is one we invented, and it's the percentage of time when the most recent build was successful. If the build is successful less than half the time, the tile will be red. If 50% to 90% of the builds succeed, the tile will be yellow, and if 90% or more of the builds succeed, it will be green.

Teams that do well on this metric find errors earlier on developer workstations instead of on the build servers. They fix chronic build and deployment issues in order to make their developers more productive. This maps to the change failure rate in Accelerate. When your change process is fully automated, it can be measured directly like this.
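One way to read "the percentage of time when the most recent build was successful" is as a time-weighted measure. The sketch below follows that reading. The input shape (a list of `(finished_at, succeeded)` pairs), the function names, and the choice to ignore time before the first known build are all assumptions. Only the 50% and 90% thresholds come from the talk.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def stability_percent(
    builds: Iterable[Tuple[datetime, bool]],
    start: datetime,
    end: datetime,
) -> Optional[float]:
    """Percentage of the window during which the latest build was green.

    Time before the first known build is ignored (an assumption).
    Returns None when no build status is known inside the window.
    """
    green = timedelta(0)
    known = timedelta(0)
    state: Optional[bool] = None
    cursor = start
    for finished_at, succeeded in sorted(builds):
        if finished_at >= end:
            break
        if finished_at > cursor:
            if state is not None:
                span = finished_at - cursor
                known += span
                if state:
                    green += span
            cursor = finished_at
        state = succeeded
    if state is not None:
        span = end - cursor
        known += span
        if state:
            green += span
    if known == timedelta(0):
        return None
    return 100.0 * (green / known)


def tile_color(percent: float) -> str:
    """Thresholds from the talk: under 50 red, 50 to 90 yellow, 90+ green."""
    if percent < 50:
        return "red"
    if percent < 90:
        return "yellow"
    return "green"
```

For example, a build that breaks on day 4 of a 10-day window and is fixed on day 6 gives 80% under this reading, which would be a yellow tile.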

Deployment speed is based on the amount of time needed to deploy changes to production from the time when a developer merges a change. A perfect score is given for build times under 20 minutes, and the score decreases to zero above 90 minutes.

Teams that do well on this metric use build parallelization to speed up their builds. Their fast deployments improve the mean time to recovery whenever redeployments are needed to fix problems. And because their deployments are easy and fast, these teams also deploy more often and get faster feedback.

The repository speed score is inspired by Dominica DeGrandis' flow metrics. It's based on the time from pull request submission to merge, which is effectively the code review duration of GitHub pull requests over the last 30 days. A perfect score is given when the average life of a pull request is zero to two weekdays, decreasing to a score of zero at five weekdays.

Teams that do well on this metric don't neglect or ignore pull requests. They quickly review pull requests and help other developers who are stuck. Each unmerged pull request represents hours or days of a developer's work that hasn't delivered value yet. These teams also reduce their work in progress by finishing what they've started before moving on to the next feature.
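Measuring pull request life in weekdays means skipping weekends when computing the time from submission to merge. A minimal sketch of one way to do that follows. It assumes naive datetimes in a single time zone and treats only Saturday and Sunday as non-working days. The function names are hypothetical, and fetching the pull request timestamps from GitHub is left out.

```python
from datetime import datetime, time, timedelta
from typing import Iterable, Optional, Tuple


def weekday_duration(opened: datetime, merged: datetime) -> float:
    """Elapsed time in days between two timestamps, skipping weekends."""
    total = timedelta(0)
    cursor = opened
    while cursor < merged:
        # Walk one calendar day at a time so weekend days can be skipped.
        next_midnight = datetime.combine(cursor.date() + timedelta(days=1), time.min)
        segment_end = min(next_midnight, merged)
        if cursor.weekday() < 5:  # Monday=0 ... Friday=4
            total += segment_end - cursor
        cursor = segment_end
    return total / timedelta(days=1)


def average_pr_life(prs: Iterable[Tuple[datetime, datetime]]) -> Optional[float]:
    """Mean weekday life of merged pull requests, or None if there are none."""
    lives = [weekday_duration(opened, merged) for opened, merged in prs]
    return sum(lives) / len(lives) if lives else None
```

A pull request opened at noon on a Friday and merged at noon the following Monday counts as one weekday under this sketch, not three days.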

The repository efficiency score is based on added lines of code in GitHub pull requests over the last 30 days. A score of 100 is given to pull requests with fewer than 150 lines added, and it decreases to zero for pull requests with over 500 lines added. The importance of small batch sizes comes from the Toyota Lean Production System, and its impact was illustrated in books like "The Goal" by Eliyahu M. Goldratt and "The Phoenix Project" by Gene Kim.

Teams that do well on this metric keep their changes small and low risk. They are able to review small changes more carefully, and they approve changes more quickly.
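The deployment speed, repository speed, and repository efficiency scores all share one shape: full marks up to a lower bound, falling to zero at an upper bound. The sketch below captures that shape with the bounds quoted in the talk (20 and 90 minutes, 2 and 5 weekdays, 150 and 500 added lines). The linear fall-off between the bounds, the function names, and scoring a single value instead of a 30-day aggregate are assumptions.

```python
def ramp_score(value: float, full_marks_at: float, zero_at: float) -> float:
    """100 at or below full_marks_at, 0 at or above zero_at, linear between."""
    if value <= full_marks_at:
        return 100.0
    if value >= zero_at:
        return 0.0
    return 100.0 * (zero_at - value) / (zero_at - full_marks_at)


def deployment_speed_score(minutes_from_merge_to_production: float) -> float:
    return ramp_score(minutes_from_merge_to_production, 20, 90)


def repository_speed_score(average_pr_life_weekdays: float) -> float:
    return ramp_score(average_pr_life_weekdays, 2, 5)


def repository_efficiency_score(lines_added: float) -> float:
    return ramp_score(lines_added, 150, 500)


if __name__ == "__main__":
    print(deployment_speed_score(55))        # halfway between 20 and 90 minutes
    print(repository_speed_score(3.5))       # halfway between 2 and 5 weekdays
    print(repository_efficiency_score(325))  # halfway between 150 and 500 lines
```

Keeping the bounds as plain arguments makes the thresholds easy to display next to each tile, in the spirit of showing the calculations on the dashboard.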

Back to you, Craig.

Craig Cook

Thanks, Ann Marie. These metrics come from SonarQube. The security metric is the code vulnerabilities rating. We average the bugs and code smells metrics to get code quality. The test coverage percentage is used in the last column.

The overall score is a weighted average of the others. If data is not available for a score, it is omitted from the calculation. This gives the squads a quick, high-level view of how they're doing.
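A weighted average that skips missing scores can be sketched as follows. The metric names and weights here are purely illustrative; the talk does not state the real weights.

```python
from typing import Dict, Optional


def overall_score(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Optional[float]:
    """Weighted average of the available scores.

    A metric with no data (missing or None) is left out of both the
    numerator and the denominator, so it neither helps nor hurts.
    """
    weighted_total = 0.0
    weight_total = 0.0
    for name, weight in weights.items():
        score = scores.get(name)
        if score is None:
            continue
        weighted_total += weight * score
        weight_total += weight
    if weight_total == 0:
        return None
    return weighted_total / weight_total


if __name__ == "__main__":
    # Hypothetical weights and scores, for illustration only.
    weights = {"availability": 2.0, "deployment_speed": 1.0, "test_coverage": 1.0}
    scores = {"availability": 80.0, "deployment_speed": 50.0, "test_coverage": None}
    print(overall_score(scores, weights))  # test_coverage is omitted
```

Dividing by the sum of the weights actually used, not the sum of all weights, is what keeps a squad from being penalized for a metric that has no data.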

Each squad comes up with at least three epics for the next quarter. These are reviewed with our executives at the start of the quarter. It can be easy to lose track of these over time. We want an easy way to visualize how we're progressing with them.

Instead of changing thresholds for each squad, we created a squad comments feature. It's a way for a squad to document, right on the dashboard, the reasons behind some of its metrics. If you have a red tile, it's not necessarily a bad thing. You need to know why it's red, and the squad comments feature is a way for you to document that.

Most of our squads are using some version of Scrum, Kanban, or Scrumban, which is a combination of the two. These are generic metrics that work for all of them. Work in progress is a count of items in the in-progress or review/QA stage of the workflow. A point-in-time snapshot is taken each week. Work completed is a count of work items finished in a sprint. Blue is planned items, green tasks, red defects, orange unplanned work, gray other.

The aging metric comes from Dominica DeGrandis' flow metrics talks: work-in-progress items with no updates in the last 10 days. If you haven't touched it in 10 days, why is it still there?
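The aging check reduces to a filter over work-in-progress items. In this minimal sketch the item fields (`status`, `updated_at`) and the status names are hypothetical stand-ins for whatever each squad's tracking tool exposes. Only the 10-day cutoff comes from the talk.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical status names for the in-progress and review/QA stages.
WIP_STATUSES = {"in progress", "review/qa"}


def aging_items(items: List[Dict], now: datetime, max_idle_days: int = 10) -> List[Dict]:
    """Work-in-progress items with no updates in the last max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        item
        for item in items
        if item["status"].lower() in WIP_STATUSES and item["updated_at"] < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2020, 6, 30)
    board = [
        {"id": "A-1", "status": "In Progress", "updated_at": datetime(2020, 6, 1)},
        {"id": "A-2", "status": "In Progress", "updated_at": datetime(2020, 6, 28)},
        {"id": "A-3", "status": "Done", "updated_at": datetime(2020, 5, 1)},
    ]
    print([item["id"] for item in aging_items(board, now)])
```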

A VP owns each platform. We created an easy way for them to see how their squads were performing. Some companies can tell you how many deployments per day they can do. I don't know about all of IBM. We can see that metric for our area, though. On average, we do about 200 deployments per day.

Ann Marie Fred

Now, there are several metrics where we see the value of the metric, but we haven't invested the extra work needed to measure them.

If you remember what Craig mentioned earlier, we took a look at the lead time metric for our squads and found that it was not as long or as variable as we would have expected. What we found instead is that most stories that were opened were either added to a sprint plan within a couple of weeks or never implemented at all. Or to put it another way, stories and defects in the top 10 of a squad's backlog are normally implemented quickly, while the rest slowly accumulate until someone decides to clean up the backlog.

So instead of measuring that, we would like to track development lead time, the time from when a story is committed to the development backlog, or that top 10, to the time when it's done and in production.

Flow efficiency is another metric that we like from Dominica DeGrandis. It measures the amount of time the work is waiting for something, whether that's resources or deployment or people. We would love to be able to quantify this, but our squads' existing workflows weren't set up with wait states in them, and we haven't been able to muster support for imposing a new workflow on our squads. Our squads are protective of their workflows once they're happy with them.

We also already collect squad health metrics, as described by Spotify on the website here, for each squad in our business group on a quarterly basis. These are valuable because of the discussions they provoke within each squad, and those discussions usually lead to positive change. We haven't pulled them into this dashboard, though, because we've been collecting the answers using spreadsheets, and they're not in a database that we could pull from. Craig is part of a volunteer team working on a squad health app, so we'll be able to use APIs to get the data once that work is done.

IBM also collects employee engagement metrics on a regular basis. The screenshot on the right is from our employee engagement website, which provides guidance for managers and individuals on how to make use of the survey results to improve employee engagement. It would be nice to make those visible as happiness metrics on our dashboard, but we don't have API access to that data.

Craig Cook

Let's talk about grades, numbers, and colors. Numbers are objective, but without context, they can be confusing. For example, 95% availability is a poor score indeed, but if you got a 95% on a test, it would be an A. Also, letter grades are more powerful than colors. People really hate seeing a D, but they might tolerate an orange tile. Our general manager actually asked us to remove the letter grades to soften the blow. We decided to make that a feature flag to show only colors by default.

Every squad is different. Our squads are autonomous and independent with different goals. We intentionally made it difficult to compare squads using our dashboard. There's no easy way to see the overall score for each squad. You have to drill down into each section.

Feedback is a gift. Some squads were upset with their grades, and they let us know. This led to conversations about great practices and the definition of what is good enough. That also prompted us to add that squad comments feature.

We refuse to set different thresholds for different squads. Everyone needs to be held to the same thresholds. Metrics have to be consistent. That's the reason why we created the squad comments feature.

We have many examples of where the metrics and our conversations around them changed business outcomes and behaviors.

After a set of squads adopted SonarQube, they saw poor metrics for vulnerabilities and code smells. They started work to address these issues. Some of them went further and integrated SonarQube into their IDE to catch issues even sooner, shifting development left.

One squad was upset with the PR creation-to-merge time. After discussing with them, I mentioned mob programming as an experiment. They adopted it with great success for nine months. They created high-quality services and became the happiest squad, as shown in their squad health surveys.

Some squads deployed every two weeks. That was a sign that they might be behind on their patching. After discussing how to do daily deployments, we helped them get their JavaScript repos doing that daily. They now pick up the latest version of their packages every day, and they use a tool called npm audit. In the JavaScript world, that tells you if your packages are vulnerable. We open-sourced a script that helps you to do that throughout your CI/CD pipeline.

Ann Marie Fred

Our own squad used the dashboard every day in our daily standup meeting, so you could consider us expert users. We saw correlations between high work in progress and things not being completed, so we reduced our work in progress as a result. Basically, we committed to fewer stories each sprint.

The pull request list also highlights work in progress. It makes blocked or stuck work in progress more visible and helps us reduce that and deliver value faster.

We added deployment stability when our developers were complaining about spending too much time fixing broken builds. With data and red tiles to back that up, our team agreed to invest more time, a few weeks, in just improving the builds.

And deployment frequency showed us where something hadn't been deployed in a while. So if an app hasn't been updated in a month, it probably needs security patches.

We would review the pull request list at the end of each daily standup meeting. As a result, pull requests were no longer lost or forgotten because nobody thought to check one of the dozens of repos our squad owned. It's especially helpful when we have bots that are creating automated pull requests with our security patches.

This has been a whirlwind tour, but hopefully you've learned why you should care about DevOps metrics and how you can tailor metrics to your own organization and incentivize the behaviors that you care about.

Thank you so much. Feel free to reach out if you'd like to discuss our work in more detail.

Craig, any final thoughts?

Craig Cook

Yes, Ann Marie, thanks. I would like to invite anyone to discuss these metrics with us. It's been challenging to create these and influence squads to adopt them. We've seen some great benefits. Let us know if you try. Thanks.