Why the Dora Metrics and Feature Management are a Brilliant Combination

US 2021

Find out how the four key DORA metrics, as popularized by the Accelerate book, can be enhanced through the use of feature flagging. Michael Gillett, author of the new book Feature Management with LaunchDarkly, will talk about how decoupling deployments from feature releases, testing in production, and adopting trunk-based development will enable you to deploy more frequently. Each of the DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time To Restore) can benefit from these approaches, and through this talk you will discover how you can accelerate your team’s performance through the use of feature management.

This session is presented by LaunchDarkly.

Full transcript

Michael Gillett

Hi, I'm Michael Gillett. I've recently authored a book about feature management with LaunchDarkly, and I'm head of development at a global company based in London.

Today, I want to talk about the ways in which I think the DORA metrics and feature management are a brilliant combination. I've used feature management for a number of years, and today I want to share with you the experiences that I've had and some of the ways in which feature management has really enabled us to accelerate the way that we work and improve the DORA metrics that we follow.

[Shares screen.]

The first thing I want to talk about is the four DORA metrics, just to make sure that we're all on the same page with them, and to refresh your mind if you've forgotten what they are. DORA stands for DevOps Research and Assessment. Over the past few years, the DORA team have looked at how the DevOps landscape looks, and at what a high-performing team versus a low-performing team looks like. From that, they were able to pull together these four metrics.

The first is deployment frequency: how often can deployments be made? The second is mean lead time for changes: how long does a piece of work take to go from a developer's machine, from the idea, writing it, committing it, pull requests, all of that good stuff, to getting to production? The third is the change failure rate: of all of these deployments, how often does a change result in a failure? And then finally there is the mean time to recovery: when something has gone wrong, how long does it take to recover from that?

Let's just first look at deployment frequency. As I said, the DORA team have identified what a high-performing team and what a low-performing team look like. Actually, this is really just a spectrum, and we might be able to place ourselves on this spectrum. It's useful to understand this because it does show the level of our DevOps maturity, and it also gives us something to work towards when we identify where we sit on this spectrum.

A high-performing team for the deployment frequency metric can deploy on demand, whenever they need. That could be multiple times a day, or it might just be one time a day. But the point is, whenever they need to release, they can release without things getting in the way and the processes taking so long. Whereas for a low-performing team, it may be once every one to six months. We're all going to be somewhere on this spectrum.

The second one, mean lead time for changes, again, is a spectrum. This time, for a high-performing team, it's less than a day for that work to be done, to be committed, to go through the various quality gates and code reviews, pull requests, all of that, before it ends up on production. It's less than a day to do that. Whereas for a low-performing team, it's actually one to six months for that change to get to production. It can be a small change, it can be a massive change. But the point is, even when we're dealing with the small changes, we're still looking ideally at less than a day. Big changes, maybe you're doing small iterations of work.

The change failure rate, again, another spectrum. This time, what we're looking at here is that high-performing teams have a change failure rate of only zero to 15%. Of all of their releases, all of their deployments, zero to 15% of them result in a failure in the production product. Whereas for a low-performing team, it's actually much higher. It's between 40 and 65%. Again, I'm sure we can all look at this and get a sense of where we are.

Then the final one, the mean time to recovery. That's how long does it take for us to recover from an issue? For high-performing teams, it's less than an hour. A deployment's gone out, something's not right, let's recover from that. We need to redeploy, we need to roll back, we need to do something. It's less than an hour for a high-performing team to get back to a good state. Whereas for low-performing teams, it's one week to one month.

Knowing these four DORA metrics, understanding where we fit within these, is useful to know how good we are at that whole DevOps methodology. Many of us here, if not all of us here, will be aspiring to be on the high-performing teams and believe in a good DevOps practice, and these DORA metrics are a great way to identify how mature we are with DevOps.
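
As a rough illustration of how a team might place itself on these spectrums, the four metrics can be computed from a log of deployments. This is only a sketch under stated assumptions: the `Deployment` record, its field names, and the sample data are hypothetical, lead time is measured from commit to deployment, and recovery time is measured from the failing deployment to restoration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Summarise the four DORA metrics over a window of period_days days."""
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [hours_between(d.committed_at, d.deployed_at) for d in deployments]
    recoveries = [
        hours_between(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployments_per_day": total / period_days,
        "mean_lead_time_hours": sum(lead_times) / total if total else 0.0,
        "change_failure_rate": len(failures) / total if total else 0.0,
        "mean_time_to_recovery_hours": (
            sum(recoveries) / len(recoveries) if recoveries else 0.0
        ),
    }


# Hypothetical two-day window with four deployments, one of which failed.
history = [
    Deployment(datetime(2021, 1, 1, 9), datetime(2021, 1, 1, 11)),
    Deployment(datetime(2021, 1, 1, 12), datetime(2021, 1, 1, 16),
               failed=True, restored_at=datetime(2021, 1, 1, 16, 30)),
    Deployment(datetime(2021, 1, 2, 9), datetime(2021, 1, 2, 12)),
    Deployment(datetime(2021, 1, 2, 13), datetime(2021, 1, 2, 16)),
]
metrics = dora_metrics(history, period_days=2)
```

With this sample data the team deploys twice a day, with a three-hour mean lead time, a 25% change failure rate, and a half-hour mean time to recovery.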

Now let's take a look at feature management and understand what I mean by feature management, and then I'll share with you some of the things that I've learned, and some of the ways in which we've been able to improve our DORA metrics through it.

The first thing to understand with feature management is it boils down basically to a feature flag. It could also be known as a feature toggle. The idea here is that we've got a piece of code or an experience, and we know what that is, but we want to offer another variation of that as well. It could be a variation of improving that feature. It could be that we didn't even have the feature in the first place, and that we're going to introduce a new feature. The point is we're able to encapsulate bits of logic within our app, within the customer experience, that we can then determine when we want to turn that on.

Really, it's just an if statement where we can return variation A or variation B. The important bit with a feature flagging system is how does the evaluation get determined? Feature flagging isn't a new concept. It's been around for a long time. It's been done via databases, app settings, config files, all of this stuff. But what I'm really looking at today, and what my experience is really about, is actually a modern feature management platform. I'm experienced with LaunchDarkly. There are others out there.
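
The "just an if statement" idea can be sketched in a few lines. This is a minimal illustration, not the API of LaunchDarkly or any other platform: the `is_enabled` helper, the `new-checkout` flag key, and the in-memory dictionary that stands in for the platform's evaluation are all assumptions made for the example.

```python
def is_enabled(flag_key: str, flag_states: dict[str, bool], default: bool = False) -> bool:
    """Return the flag's state, falling back to `default` for unknown flags."""
    return flag_states.get(flag_key, default)


def checkout_experience(flag_states: dict[str, bool]) -> str:
    # The flag is just a branch: variation B when on, variation A otherwise.
    if is_enabled("new-checkout", flag_states):
        return "variation B"
    return "variation A"
```

In a modern feature management platform, the lookup inside `is_enabled` is the part that the platform evaluates on our behalf.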

When I'm talking about feature flagging, feature management, it is this idea of a modern distributed feature management system that actually is doing the evaluation for us. But there's a lot more to the evaluation than it just being done by someone else. Actually, what's going on is that we are targeting a feature to a user or to a session. We can have the idea that we've got a person, a customer, who's using our product. We need to be able to turn on conditionally variation A or variation B, turn this feature on or off for that customer. That's really what we want a feature management platform to be able to do very well for us.

We need to provide it data about the customer. That could be the country they're in, could be the device they're on, but it could actually be more information about the customer themselves. Maybe if they're on a subscription tier, we could use an entitlement pattern. Or it could be to do with how much money they've got in their balance, or even the country that they registered in. We need to provide that, and then within a feature management platform, we can target features, variations, to those users, those sessions, that actually meet those requirements. That is key to the whole point of feature management: have that fine-grained control over who will and who won't receive any of the variations that we have.
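
One way to picture that targeting is a flag that holds an ordered list of rules and is evaluated against the data we supply about the customer. The sketch below is hypothetical throughout: the `Rule` and `Flag` classes, the attribute names, and the flag key are illustrative and do not reflect how any particular platform models targeting.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    attribute: str       # which piece of customer data to inspect, e.g. "tier"
    allowed: set[str]    # values that match this rule
    variation: str       # variation served when the rule matches


@dataclass
class Flag:
    key: str
    default_variation: str
    rules: list[Rule] = field(default_factory=list)

    def evaluate(self, context: dict[str, str]) -> str:
        """Serve the first matching rule's variation, else the default."""
        for rule in self.rules:
            if context.get(rule.attribute) in rule.allowed:
                return rule.variation
        return self.default_variation


# Entitlement pattern: only some subscription tiers, or one country, get the feature.
premium_dashboard = Flag(
    key="premium-dashboard",
    default_variation="off",
    rules=[
        Rule("tier", {"gold", "platinum"}, "on"),
        Rule("country", {"GB"}, "on"),
    ],
)
```

A customer described as `{"tier": "gold", "country": "US"}` would receive the `on` variation here, while `{"tier": "free", "country": "US"}` would stay on the default.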

With that in mind, let's take a look first at switches, which are perhaps the simplest feature management concept. The feature is either on or off for all customers.

That's very powerful because when things aren't going right with our product, it could be that we've got increased demand, loads of traffic coming to our site that's degrading things. Or what if we have some non-essential functionality on our site that we could actually turn off to restore service? Or we could do the opposite even, which is maybe there are some customers who are experiencing a weird edge case, and we can't easily identify what's going on for them. We could turn on extra debug for them. We could serve them an unminified version of JavaScript, which will contain more human-readable names, maybe has additional logging going on within it.

That's really powerful for us to be able to have these levers of either turning something off for everyone or fine-tuning the customers who we want to turn things on for. This helps with the DORA metric of the mean time to recovery. Because if we're suffering, if we're struggling, if we've got a slightly degraded experience for our customers, what we really want to do is get back to the good state. With switches, that does help us. Certainly, if we can turn off expensive but non-essential pieces of functionality on our product, that should free up some resource and allow customers to experience the essential part of the product. Equally, if there are some edge cases, then we can get more information by turning on additional logs for those customers, allowing us to recover faster, move us towards that high-performing end of the spectrum for this metric.
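
Both levers can be sketched together: a switch that turns a non-essential feature off for everyone, and a targeted switch that turns extra diagnostics on for selected customers. The `Switchboard` class, the feature names, and the page sections below are assumptions for illustration only.

```python
class Switchboard:
    """In-memory stand-in for the flag states a platform would hold."""

    def __init__(self) -> None:
        self.disabled: set[str] = set()      # non-essential features switched off for everyone
        self.debug_users: set[str] = set()   # users who get extra diagnostics

    def feature_on(self, feature: str) -> bool:
        return feature not in self.disabled

    def debug_on(self, user_key: str) -> bool:
        return user_key in self.debug_users


def render_page(board: Switchboard, user_key: str) -> list[str]:
    parts = ["product details", "add to basket"]  # essential, never switched off
    if board.feature_on("recommendations"):
        parts.append("recommendations")            # expensive but non-essential
    if board.debug_on(user_key):
        parts.append("verbose logging")            # only for targeted users
    return parts
```

Adding `"recommendations"` to `board.disabled` during an incident drops that section for every customer without a deployment, which is the recovery lever described above.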

The next one is rollouts. This one's a bit more interesting. This isn't about turning something on or off for everyone. It's about turning something on for select parts of our customer base. This allows us to think about how we would roll out, how we would deploy, how we would release features.

The important thing here is that actually every new feature that we develop, we should put in the off state of the feature flag, so that when it's deployed to production, that feature is not exposed to any customer. It is only exposed to those users who we want it to be. In that manner, we can decouple a deployment from a feature release. That gives us a lot more safety when doing deployments, and is something extremely valuable to do.

Let's go down this scenario where we've made our deployment. The code is there. What we can do is roll this feature out first to our QA team, our stakeholders, to sign this thing off. We can select those individual users. They can validate that this works as expected, and then we can think about rolling this out to customers. We've got two ways of doing that. We could roll this out to a percentage of customers, go 5%, 10% of customers. Or we could roll this out to a group of customers and do a ring rollout. Either way, these are progressive rollouts where we're going to take a feature slowly and carefully to more and more customers. Ultimately, we want this to get to 100% of customers, and then we can tidy up that feature flag.
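
A progressive rollout like this is often built on deterministic bucketing, so that a customer who receives the feature at 5% still receives it at 10%. The sketch below shows that general technique, not how LaunchDarkly itself assigns customers; the `bucket` and `is_enabled` functions, the hashing scheme, and the allow-list for QA and stakeholders are all assumptions.

```python
import hashlib


def bucket(flag_key: str, user_key: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(
    flag_key: str,
    user_key: str,
    percentage: int,
    allow_list: frozenset[str] = frozenset(),
) -> bool:
    # Individually selected users (QA, stakeholders) always get the feature.
    if user_key in allow_list:
        return True
    # Everyone else is included once the rollout percentage passes their bucket.
    return bucket(flag_key, user_key) < percentage
```

Raising `percentage` from 0 towards 100 only ever adds customers, and including the flag key in the hash means different flags roll out to different slices of the customer base.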

This is an extremely safe way of releasing that new code to our customers. If we think about it from the DORA metrics point of view, this actually helps with the change failure rate. Because we aren't going to expose new code to customers without it having gone through quality gates, without it having gone through our own QA processes and sign-off. Deployments could happen all of the time, but nothing's changing for the customer. In that manner, the change failure rate can be reduced. Yes, there is still the chance that a deployment could break production. But if we're encapsulating our feature flags correctly, the chance of production actually breaking is seriously reduced through the use of rollouts and through the pattern of all new work being done as feature flags with the default experience, the current code that we've got, being the default flag setting. That's what customers receive, and then we turn on the new implementation when we need to.

Moving on from that, though, is experiments. Maybe with a rollout, we wanted to check that something technically worked as we wanted. Maybe we need to improve some caching. We have an idea of what that new caching experience should do. How is it going to improve performance? That's a technical hypothesis that we think we know what the metrics are going to be like. Let's roll it out. Is it looking good? Yes, it is. Cool. Keep rolling it out till we get to 100%.

But there's another type of experiment, another type of hypothesis, which maybe is more of a business hypothesis, where we want to build some features for our customers, but we don't actually know if the customers are going to like it, or maybe there's better ways of doing it than we were originally thinking. In that regard, we've got a business hypothesis that we want to experiment with.

This becomes interesting. We can use rollouts, we can use percentage rollouts, and we can gather information about how the users are using the product. Are they clicking it more? Are they engaging with it more? Are they spending more? It depends what the metric is of that particular hypothesis that would deem it a success or a failure.
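
To make the measurement side concrete, here is a small sketch that compares a success metric across variations. The `conversion_rates` function, the event shapes, and the sample data are hypothetical, and a real experiment would also need a check for statistical significance, which this example leaves out.

```python
from collections import Counter


def conversion_rates(
    exposures: list[tuple[str, str]],
    conversions: set[str],
) -> dict[str, float]:
    """exposures: (user_key, variation) pairs; conversions: users who met the goal."""
    exposed: Counter[str] = Counter()
    converted: Counter[str] = Counter()
    for user_key, variation in exposures:
        exposed[variation] += 1
        if user_key in conversions:
            converted[variation] += 1
    return {variation: converted[variation] / exposed[variation] for variation in exposed}


# Hypothetical data: four users saw each variation.
exposures = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"), ("u4", "A"),
    ("u5", "B"), ("u6", "B"), ("u7", "B"), ("u8", "B"),
]
conversions = {"u1", "u5", "u6", "u7"}
rates = conversion_rates(exposures, conversions)
```

Here variation A converts 25% of the users who saw it and variation B converts 75%, which would be the kind of signal that deems the hypothesis a success or a failure.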

That's the interesting bit. When we're doing an experiment, it might succeed, but it might also fail. If it fails, we don't want to have spent a long time building this. We want to have spent the minimum amount of time building this. With that opportunity for doing work that might ultimately come back and show us we're on the wrong path, or we need to rethink this, or our customers aren't interested in it, that drives us to do things in small chunks and test those small iterations of a feature or steps towards building out a bigger aspect of a product.

That helps with deployment frequency. The reason is that the changes that we're looking to make, to deploy and then to release to customers, are small. We don't want big changes. We want very small iterations. With that in mind, what we're able to do is deploy more frequently. Our code is going to be smaller each time. The tickets, the work items themselves are going to be smaller each time. We want to be able to deploy more frequently.

Then there's the other bit as well, which is, maybe if we are confident within our release pipeline that we have got good unit testing and we have got automation testing, can we have less testing going on within the pipeline so the actual deployment takes less time? Remember what I said about rollouts, which was we can decouple deployments from releases of features. We can have deployments happening really, really quickly that are doing regression tests and smoke tests. But do we need to run the full suite of tests every time? Maybe not. That's an option to us. Where we're happy to take that approach, then we're in a position where our deployments can speed up.

But there is a way in which we can work differently with feature management, not just about how we're doing things in the product, but actually how we build the software itself, and that's called trunk-based development. The idea here is that we want to be as close to the main branch of our Git source code repository as possible. Some of you might be using GitFlow, where there's the notion of release branches and feature branches. That's all fine. There's a bit of admin, a bit of overhead in managing the branches and keeping them all in sync. But the idea of trunk-based development is that you work a lot closer to the main. You actually get rid of the release branches, and you just make feature branches off of the main branch, work away on it, make a pull request, and put it back in.

That speeds up development because you don't have so many branches to deal with, and there's also slightly less chance of merge conflicts as well because you're actually dealing with fewer changes going on. It's a little bit simpler, a little bit quicker, a little bit easier.

But then there's another opportunity as well within this idea of how can we change the way that we work, which is production is the best test environment that we have. Most people don't like the concept of testing in production. It's seen as a joke. But when we've got feature management and we're using feature flag encapsulation in the way I described it earlier, then we do actually have this opportunity to deploy untested code if you really wanted to, although I might advise against that.

It is possible to deploy untested code to production, and as long as you have a very strong set of tests within your pipeline that will check that nothing has degraded, nothing's regressed, that untested code that is encapsulated in an off state of a feature flag won't be exposed to customers. It's actually very safe to do. You can then go along as a developer and turn it on just for yourself and see what that's like on production. It is possible to skip test environments entirely and go straight from a local machine to production. As long as nothing's been written outside of that feature flag encapsulation, then you're going to be in a pretty safe space to make that kind of change.

That helps with the mean lead time for changes. Combined with what I was saying around experiments and having things be small iterations, changes now can be done really, really quickly. We can iterate really, really fast. That pipeline is rock solid for us. We can deploy on demand, ideally. The work we're doing is small. It takes a very short amount of time to get through that pipeline because we've removed some of the steps to it. Now we can get it live onto production. We can turn it on for ourselves, we can turn it on for stakeholders. We can show things as we're building them. That means that changes can happen really, really quickly. If we want, we can turn these things on quickly for customers as well. It can make bug fixing much better as well. There are fewer steps, fewer gates in that pipeline, which can really help with reducing that mean lead time for changes.

Now let's just take another look at this and approach it more from the DORA metrics point of view. I touched on them there, but let's switch this around a little bit and actually look at how DORA metrics and feature management really do go together.

The first was deployment frequency. In what ways can the deployment frequency metric be improved? If we're making smaller changes, we can release more often, so we can deploy more frequently. That comes around again from this idea, certainly from experimentation, but actually it works for everything really, that we always want to be doing small iterations. You don't want big-bang releases. If we can make smaller changes, especially when it comes to experiments where it might ultimately be a failure, so let's not invest a huge amount of time and resource, then we get to release more often.

Additionally, there's less testing for each release. Because we're not turning this feature on for customers when it gets deployed, we don't have to do as much testing for every deployment. That isn't to say we shouldn't do testing, but we can do less. We can cherry-pick the type of tests that we want to do for each deployment, because it's when the feature is going to be released that the testing becomes most crucial, most important. Deployments don't need all of the testing, but the feature will. We can determine when the feature gets enabled, and we can do that testing just before we want to turn that feature on. Deployment frequency can be improved through the use of feature management and move us towards that high-performing end of that spectrum that we looked at earlier.

The second is the mean lead time for changes. This is how long does it take to go from the idea, from someone working on it, all the way through that pipeline and with all of the quality gates that we've got, to then getting to production. Similar to deployment frequency, if we're doing smaller changes, and in this regard we're doing actually smaller features, that helps. That's reducing the time to dev things. But that's maybe not everything about this metric, because you could have already been doing small features.

What we're really talking about here is a complete mindset where everything can be seen as a small feature. There's also the opportunity that every day's work, even unfinished work, can still be released to production. It's always good practice to commit your code at the end of a day or when the fire alarm goes or whatever, to make sure that all of your work is still there. What if every commit triggered a release? If you're doing it in the correct manner with feature flag encapsulation, it's very safe to do this. It keeps the releases small, and we can constantly be shipping changes.

Equally, we'll have less testing, just like the deployment frequency one. We can skip entire testing environments. Whole parts of the testing framework that you'll be using could be skipped for deployments. Regression and smoke, I would say, need to be there. Unit tests need to be there. But there could be a huge suite of automation tests that you've got. You don't need to run that on every release, and that does mean this mean lead time for changes will reduce.

The third is the change failure rate. Because all changes should be done within a feature flag and encapsulated in that manner, it does mean that there's a fair degree of safety in making releases and deployments because the known state of a feature is what is always going to be served by default to our customers. It's the new work, the new implementation that is being added within the flag where we need to turn it on. The change failure rate should come down with feature management because it is less likely that deployments are going to disrupt that customer experience and actually have a negative impact to our site. That is because the flags are all off by default. As long as you follow that pattern, you can safely release work, as I say, in unfinished ways, maybe in untested ways. But that does allow the change failure rate to come right down.

Then the final metric is the mean time to recovery. How long does it take to recover if things have gone bad? Or perhaps I should say, when things have gone bad, let's assume they will at some point, how quickly can we recover from this? If we've just turned on a new feature and it's not looking very good, even though it went through all of our tests, we can turn it off like that. It's a click of a button to turn a feature off with feature management. In that regard, new features can be turned off immediately, which again is bringing us down to that high-performing end of the spectrum where we can recover in less than an hour.

There are a few other things as well. Deployments are small and rollbacks are easy. What I mean by this is because we're looking to do small changes and small features, if the actual deployment itself breaks production, which you would hope wouldn't happen with really good testing in your pipeline, but it can happen, the change was very small. It's quite quick and easy to identify what has gone wrong, because there should really only be one change per release, per deployment. The rollbacks are easy. We can redeploy. We can work with this. It's not a huge exercise of figuring out which particular change within a large release has caused this problem. It should be quite apparent which release and therefore which change has actually resulted in this problem. That can allow us to recover more quickly as well.

Finally, with the opportunity to use switches, like I mentioned right at the beginning, as a mechanism, as a lever to pull when our site is degraded in some way, we can turn non-essential pieces of functionality off, and that can allow us to recover as well. There might be other work that needs to be done, so we're not completely out of the woods, but maybe we can return to offering our essential functionality to our customers. That's important. Again, these are just options for us to really reduce the mean time to recovery and improve that DORA metric.

To me, that is why the DORA metrics and feature management are a brilliant combination. They complement each other very well, and it feels to me like feature management is an extension of the ways in which the DORA metrics can be improved. It's an extension of CI/CD even, where not only are we looking at deployments being the crucial metric and thing that we measure, but actually what we're able to do with feature management is reduce the risk of deployments. Not completely, but significantly.

That does allow us to really then improve those DORA metrics because the releasing of a new feature, the releasing of new code to our customers, is separated from the deployment, allowing us to deploy far more frequently, far more often. That allows us to make smaller changes. We can recover from that faster. When things have gone wrong, we're in a position that we've got options available to us to improve the product that we're trying to serve to our customers.

For me, it's a great opportunity to look at feature management, and really things like experimentation are valuable as well. But certainly within a conference like this and the focus on DORA metrics, I think it's a fascinating opportunity for us all to really improve those DORA metrics with feature management.

I have recently written a book about feature management through using LaunchDarkly. It is available soon, and it's available on Amazon. If you want to know more about ways in which we can use feature management to not just improve DORA metrics, but a whole host of other things, then do check the book out. Thank you very much for listening. I hope I've given you something to think about.