Measuring for DevOps Success

US 2021

Head of Development · Hermes Germany GmbH

When introducing DevOps for one or two teams, there was no need to provide evidence of the effectiveness or implement organizational measures for improving our DevOps approach.

Ever since rolling out DevOps to our entire IT organization, both changed: we need to make sure that we are on the right track with developing our organization and we need to demonstrate this.

Building on the metrics of the State of DevOps Report 2019, we introduced KPIs without much additional tooling. Working in an environment that is not used to making KPIs visible, it took some time to get accustomed to them. I will show how we did this and how we are now using our metrics to improve our processes.

The question however is if process performance really matters. I will dig into this and correlate it with result orientation.

Full transcript

Stephan Stapel

Hello, my name is Stephan. I'm from Germany, and I'm working for a company called Hermes.

At Hermes, we introduced DevOps about four years ago now, and we are quite happy with the path we took. Today, I'd like to share with you some insights of our journey.

Part of our journey was to measure the success of DevOps. I'd like to start with some short introduction and then discuss what success means, at least in our environment. I'd like to introduce the metrics we are using, and I'd like to conclude with some takeaways, hoping to inspire you for your own journey.

Let's dive into the introduction. Hermes Group is the largest post-independent parcel company in Europe. We have subsidiaries in multiple countries, and Hermes Germany alone delivers about 500 million parcels per year. Besides private customers, our typical clients are medium and large e-retailers.

In this environment, we have a market that is growing by 5% to 10% per year. That means that we need to cope with an ever-increasing number of parcels. This, in turn, means that we need to automate everything possible, and that is key to our business. Our customers and business clients expect digital innovations that make them happier customers or foster their own businesses.

Having worked with Hermes for ten years now, I can clearly say there is no business without technology. And in this environment, the question is: what is success?

The first factor clearly is to be able to focus on bringing value, and that sounds simple and obvious. Those of you working in larger enterprises with lots of different interests competing with each other know what I'm talking about.

Another factor is the ability to provide value faster, to remove technical and organizational burdens, to have everything in place that we need to provide value.

Another success factor is, as a tech organization, to be a reliable partner in the company, to be someone to trust.

I have an example I want to introduce to you. We have two projects on the diversion of parcels, to redirect parcels, if you are not at home, to your garage, to the neighbor, to the parcel shop. Those two projects have similar sizes, similar complexity, similar stakeholders, and even a similar topic to deal with.

The first project took place seven years ago in the old working system. We had Scrum by then, but I would probably not call it agile, at least not today. We had no idea of pipelining, of automation. This project took us nine months, if not longer.

The second project, which we conducted last year with the current working system, with a good understanding of automation, of delivery pipelines, of fast flow of work, with feedback mechanisms, took us four months.

You probably cannot speed it up much further, at least not by that degree. If we spoke again in seven years, it probably would not be two months. But what you can see here is that the effort to enhance the working system pays off really quickly.

It is important to note that enhancing the working system is not only about speed. That is not only because value is more important than speed, but also since delivering tech products is no sprint which is finished after 100 meters, but rather an endless marathon, running kilometer after kilometer after kilometer.

This is why I like this agile principle, which was written more than 20 years ago. It says: keep constant pace indefinitely. Which means listen to yourself, listen to your organization, find out how fast you can work, but don't work faster. Find a pace you can keep, you can best work at.

By saying that, and by aiming at that, we found two problems in our organization that we wanted to work on.

First of all, we didn't know how long a larger piece of software, a larger piece of work, would take. Can we improve our estimates? Can we generate more reliable estimates? Can we improve ourselves?

The second question is: we put lots of effort into introducing continuous delivery, and can we prove that that really paid off?

By asking these two questions, you have to understand that DevOps itself moved from a grassroots movement to a general direction for our tech organization. We are getting frequently asked by our top management if all the effort is really worth it, if it really needs this modern way of working. So we regularly need to prove that we are on the right track, and we sometimes even need to defend.

We then decided to shine a light on the system of work to better understand what is going on. That was really good to make the situation transparent, to share this transparency with everyone in the organization, because that helped to feel the pain together.

To shine the light on the work system, we took a look at these four key metrics. I like these four key metrics as they were introduced in the DevOps Report and the Accelerate book from Nicole Forsgren. We now make use of most of them in a way that is achievable for us.

Let's dive right in: measuring lead time.

To understand what we are measuring here, you first have to understand how we are working in our organization. Our way of working, the way we coordinate the work, is based on the Flight Levels model from Klaus Leopold. This model basically comprises Kanban boards on three levels of management, on three levels of abstraction, and each of these levels is aligned with the others.

On the top, we have the strategic level to manage large company-wide initiatives, making sure that there is strategic fit and that there are valid business cases. On the second level, we take a look at the context. We are aligning teams if they need to work together, for example, on a particular feature. On level one, on the team level, this is where each team plans and manages their work, for example using Scrum or Kanban or whatever might be appropriate for a particular team.

We decided as a first step that we want to measure the lead time on the coordination level, measuring the time it takes to work on a particular feature.

The question is now: what is a feature for you? Sometimes features are referred to as an epic. For us, the effort of such a feature should be two to three months. Not because we are calculating based on time, but because we believe that we want to coordinate something which really has an impact, for example happier customers or higher profit. It might also be an experiment where we are aiming for learning.

Some real-life examples are the introduction of electronic payment to our customer website, the implementation of a new newsletter, or adding certain countries where our customers can send their parcels to.

These features comprise a number of stories, eventually implemented by multiple teams. Those stories then are managed on the level-one boards. Our goal is that all features on the coordination level should have similar effort, similar size. This allows us better estimates and eases coordination and prioritization, because you can compare them at least in effort or size against each other.

The effort that I mentioned is typically called lead time, which is the time from starting to work on a feature until this feature is available to the user, in our case the customer. It might or might not involve multiple teams. In this case, we have team A on the top and team B on the bottom that are involved in this particular feature and this process.

If we have multiple teams, the goal of the coordination board is not pitting teams against each other. Instead, we are observing their collaboration and helping them align, helping them collaborate to get the feature done.

The organization I'm responsible for comprises 14 of such product teams, with approximately 100 people in total. In this organization, we deliver 100 to 120 of such features per year currently.

What we want to avoid is a U-curve-looking statistic, with lots of features finishing quickly, lots of features taking literally forever, and just a few features of the desired size. We are aiming for quite the opposite type of curve.

This is what the lead times currently look like. What you can see here are the median lead times of the features, calculated per month, along with the all-time median of 78 days. I smoothed the values to some degree to dampen outliers. Since the all-time median of 78 days is approximately two months, I used a two-month rolling median to calculate the value for each bar you see here.
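The smoothing described above can be sketched as follows. This is a minimal illustration, not the tooling used in the talk. It assumes each feature is recorded as a hypothetical `(started, finished)` pair of dates, that a feature is assigned to the month in which it finished, and that "two-month rolling median" means the median over features finished in the current and the preceding month. The function name `monthly_rolling_median` is invented for this example.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def monthly_rolling_median(features, window_months=2):
    """Return {(year, month): median lead time in days}.

    features: iterable of (started, finished) date pairs (assumed format).
    Each month's value is the median over all features finished in that
    month and in the preceding window_months - 1 months.
    """
    lead_times_by_month = defaultdict(list)
    for started, finished in features:
        key = (finished.year, finished.month)
        lead_times_by_month[key].append((finished - started).days)

    result = {}
    for year, month in sorted(lead_times_by_month):
        window = []
        for back in range(window_months):
            # Step back `back` months, wrapping across year boundaries.
            index = year * 12 + (month - 1) - back
            window.extend(lead_times_by_month.get((index // 12, index % 12 + 1), []))
        result[(year, month)] = median(window)
    return result


if __name__ == "__main__":
    sample = [
        (date(2021, 1, 1), date(2021, 1, 31)),
        (date(2021, 1, 10), date(2021, 2, 19)),
        (date(2021, 2, 1), date(2021, 3, 23)),
    ]
    for month, value in monthly_rolling_median(sample).items():
        print(month, value)
```

Using the median rather than the mean keeps a single very long-running feature from dominating a month's bar, which matches the stated aim of smoothing out outliers.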

We still have some variation in lead time, so we clearly have room for improvement. We might even someday aim to lower the lead times a bit, but that is not the goal for now. Shortening the feature-level lead time is not a good goal because it would not improve the work system at all. People would just start to cut the features in half. That is just a technical measure, and the features would finish quicker. Goal achieved, but nothing improved. The goal for now is to be more consistent.

One side note: if you take a closer look, you see two sections in this diagram. The explanation is quite simple. Until April of this year, we worked on normal business features to make our customers happier. Then in May, our cloud migration project kicked off. We had completely different topics that we needed to start working on, with little expertise on these topics and additional risk introduced. What we saw then is that the lead times jumped up. I am now eager to see what happens when the migration is finished in October of this year, and whether we will return to the old level of lead time.

Then we are measuring deployment frequency and failure rate.

Let's take a look at the necessary information we needed to calculate those. During an earlier ITIL implementation, a change advisory board was created. When introducing continuous delivery, it was like putting a horse in front of a racing car. The CAB probably never works in companies with lots of software change going on. We changed the game for some time, but when we streamlined our processes during introduction of continuous delivery, we needed to wipe out the CAB. We are happy that we succeeded with that.

But even after wiping out the CAB, we still need to document the changes to give transparency about what is happening. What we did is we automated the change documentation within our deployment pipelines. That was well accepted by the teams. Everyone added it to their deployment pipelines, so we have a comprehensive database of changes. This comprehensive database is now used to generate the metrics.

Measuring the deployment frequency is quite simple. From the change database, we are just counting the number of deployments per time period and per solution. Easy.
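Counting deployments from a change log can be sketched in a few lines. This is a minimal illustration that assumes each change record is a hypothetical `(solution, deployed_at)` pair and that the reporting period is a calendar month. The actual schema of the change database is not described in the talk, and the solution name `label-api` in the usage lines is a placeholder.

```python
from collections import Counter
from datetime import datetime


def deployment_frequency(changes):
    """Count deployments per (solution, "YYYY-MM") from change records.

    changes: iterable of (solution, deployed_at) pairs (assumed format).
    """
    return Counter(
        (solution, deployed_at.strftime("%Y-%m"))
        for solution, deployed_at in changes
    )


if __name__ == "__main__":
    changes = [
        ("label-api", datetime(2020, 11, 3, 10, 15)),
        ("label-api", datetime(2020, 11, 17, 14, 0)),
        ("label-api", datetime(2020, 12, 1, 9, 30)),
    ]
    for key, count in sorted(deployment_frequency(changes).items()):
        print(key, count)
```

Because the count is keyed by solution, the result supports looking at each solution's own trend over time rather than ranking solutions against each other.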

What is really important before showing you some numbers: we are not comparing teams, and we are not comparing their performance. It shall never be a race for number of deployments. What we can do is compare the trends and work with the teams to find a good frequency that fits their skills, the topics they are working on, and also the maturity of their solution.

Thinking about deployment frequency, that directly says something about the health of our continuous delivery pipelines. Those automatic pipelines make the delivery process safe for everyone to use, so there is no anxiety to deploy to production. That is what results from Jez Humble's quote, which I really like: do it more frequently and bring the pain forward if it hurts.

Secondly, deployment frequency is a proxy metric for batch size. This means the higher the deployment frequency, the smaller must be the batch size. A smaller batch size in the context of software delivery means that we have better control of what is going into production. That, in turn, means that we have better control of quality and risk.

As an example, I brought to you the deployment frequency of one of our teams. Team PI is responsible for the API that generates labels for all of our private customers. You see the month along the X-axis and the number of deployments displayed as bars.

I'd like to share with you two observations I see here, at least from the context I have. At Hermes, we are traditionally very cautious with deployments during peak, which is November and December of each year. As you can see here, we just rewrote this rule, and we just continue working and bringing stuff into production with proper pipelines, with proper automatic tests. We see that there is no service degradation. To get back to Jez Humble, we brought the pain forward.

On the other hand, what you can see here is what results from going into holiday during Christmas season, with five deployments in January.

What are my learnings? My observation is that a team with at least five deployments per month, or roughly one deployment per week, is generally doing fine. If a team deploys less than once per week, it is a good opportunity to get into a discussion with each other to find out if we can do anything to help them. For example, the application or the pipeline might need some re-engineering to ease deployment.
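The rule of thumb above can be turned into a simple check on monthly counts. This is a minimal sketch assuming a hypothetical mapping from a `"YYYY-MM"` month to the number of deployments of one solution. The default of five per month is taken from the talk; the function name is invented, and the output is meant as a conversation starter, not a performance rating.

```python
def months_worth_discussing(monthly_counts, minimum=5):
    """Return the months, in order, with fewer than `minimum` deployments.

    monthly_counts: {"YYYY-MM": number of deployments} (assumed format).
    """
    return [
        month
        for month, count in sorted(monthly_counts.items())
        if count < minimum
    ]


if __name__ == "__main__":
    counts = {"2020-11": 12, "2020-12": 9, "2021-01": 5, "2021-02": 3}
    print(months_worth_discussing(counts))
```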

We also found that solutions that get more mature have lower deployment rates. That is because the teams can focus on outcome, on value, instead of delivering a sheer number of features.

Also, taking these measures is good to bring teams together, to let them inspire each other, to help them get better pipelines, enhance them, and find cooler solutions. From a manager's perspective, it is a good tool to better understand and offer support to the teams.

Second metric: failure rate. In contrast to the original metrics, I believe that there is no such thing as a change fail rate, because with automated pipelines, deployments will almost always work. The boundary condition, of course, is that we have comprehensive pipelines containing automated tests, code analysis, and the like.

Obviously, despite all the effort we take, we sometimes send bugs into production. This remains true even with continuous delivery. We asked ourselves: how can we get transparency about such events from the data that we have? We decided to start with a statistical approach, because basically a bug in the context of DevOps is usually detected by monitoring data or feedback from users or customers. Such feedback usually comes in quite quickly if you have internal users. When developing software for customers, for the public, such feedback comes in much slower. In this case, telemetry is king.

With these mechanisms in place, we can assume that we quickly find such bugs. Because the change was small, we can assume that we generally are able to fix it quickly, at least if such bugs occur during office hours.

So we looked at the change data for a second deployment coming in quickly after the initial deployment. By looking at the data and discussing with people, I found that a typical fix is delivered within three hours of rolling out the initial faulty deployment.

Using this statistical approach, we found that Team PI had one fix forward in November and a second fix forward in March. The precondition, of course, is that quality checks prevent the deployment of bad code, and those checks need to be baked into the deployment pipeline.

All in all, we found that the fix-forward rate across all my 14 teams is less than 1%. For me, that is a good proof for the benefits of continuous delivery.
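The statistical approach described above can be sketched as follows. This is a minimal illustration under stated assumptions: deployments are given as a hypothetical mapping from solution name to a list of deployment timestamps, a deployment counts as a suspected fix forward when it follows the previous deployment of the same solution within the threshold, and the rate is the number of suspected fixes divided by all deployments. The three-hour default comes from the talk; the exact definition of the rate is not spelled out there.

```python
from datetime import datetime, timedelta


def fix_forward_rate(deployments, threshold=timedelta(hours=3)):
    """Return the share of deployments that look like quick fixes.

    deployments: {solution: [deployed_at, ...]} (assumed format).
    Only consecutive deployments of the same solution are compared.
    """
    total = 0
    suspected = 0
    for timestamps in deployments.values():
        ordered = sorted(timestamps)
        total += len(ordered)
        for previous, current in zip(ordered, ordered[1:]):
            if current - previous <= threshold:
                suspected += 1
    return suspected / total if total else 0.0


if __name__ == "__main__":
    sample = {
        "solution-a": [
            datetime(2021, 3, 1, 9, 0),
            datetime(2021, 3, 1, 11, 0),  # two hours later: suspected fix
            datetime(2021, 3, 8, 9, 0),
        ],
        "solution-b": [datetime(2021, 3, 2, 10, 0)],
    }
    print(f"{fix_forward_rate(sample):.1%}")
```

A heuristic like this cannot tell a fix from a planned follow-up release, and a team that deploys several times a day would be over-counted, which is why the flagged deployments still need to be verified with the teams.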

What are my observations? We were glad that we didn't have to introduce an additional measure. That allowed us to keep the effort small. Of course, this approach certainly is a compromise, and that compromise comes with some additional work. We need to get into discussion with the teams to verify the approach. We need to observe and adapt the three-hour threshold.

The questions we have are answered. We can prove that the quality mechanisms are working. We can prove that continuous delivery does not yield instability. In the future, it would be interesting to learn about the detection times of bugs now, the reaction times, and the times to fix, to cut the metrics into smaller parts. So there is room to improve.

Let me now conclude with some takeaways. We've discussed some numbers, some statistics, but the question is: how happy are the people? We are not measuring happiness yet. Instead, we are getting into a conversation with each other, getting regular feedback on the measures we are taking.

I'd like to share with you two quotes from two colleagues who have been with us for more than ten years now as well. The first quote is from Kirsten, who is the product owner from myHermes DE, our private customer portal. She said: we are faster, good, fair enough. But even more important, the quality improved tremendously. We bring small changes into production and are monitoring them with the entire team. The entire team looks at what is going on in production.

The second quote is from Michael. Michael is a QA expert in one of our API teams. He said: formerly, I could easily make dev teams sweat by imagining weird test cases. Nowadays, everything is so transparent and streamlined in pipelines, I as a QA engineer can barely even break anything at all.

I'd like to conclude that implementing these measures was really worth it for us, and I think they are worth it for everybody. We are now able to work on consistent lead times, and we can now make sure that we have small batch sizes, introducing little risk to production, if any risk at all.

But to be honest, with most successes comes some vulnerability, and we made one big mistake. We did not collect the information before we started our DevOps transformation. You can only prove your success if you know how bad it was when you started.

I'd like to encourage you to do it better. If you did not start yet with the DevOps approach or with continuous delivery, please collect the before state. Please collect how bad or maybe even how good it is now in your environment.

What I can say is that we now have information on where we can improve. We can now improve from here at least. Besides all technical discussions, besides all discussion about continuous delivery, in my leadership role these analyses now allow me to connect better with my teams and to better offer support where it might be necessary.

We can now connect the teams better. We can now foster collaboration between the teams, encourage them to share their learnings from their own continuous delivery journey.

With this, I'd like to conclude. If you have some feedback or want to get into discussion, this is my contact information. I'd be more than happy if you want to connect.

Thanks a lot for listening.