A Layered Approach to Progressive Delivery

Europe 2021

Progressive Delivery is the practice of decoupling deploy from release, allowing changes to be safely pushed all the way to production and verified there before releasing to users. Selectively dialing up and down the exposure of code in production without a new deploy, rollback, or hotfix is the foundation of Progressive Delivery, but the higher-level benefits of safety and fast feedback come from layering practices on top of that. Whether you are new to Progressive Delivery or are already practicing some aspect of it, you'll learn or refresh the basics and then come away with a powerful model for layering higher-value benefits on top of that foundation.

This session is presented by Split.

Full transcript

Dave Karow

Good afternoon. I hope you're enjoying DevOps Enterprise Summit.

My name is Dave Karow, and I'm a continuous delivery evangelist at Split Software. Today I'm going to ask you three questions, and then we're going to talk about layering data-informed practices on top of progressive delivery.

Let's jump right in. First question: do you do this when your team releases to production? Each time you're going out to release a feature, is there sort of an, "Ooh, hope it goes okay?"

Question number two: can you remember a release night that went something like this? I'll give you a second to take in the meme. I've been through a few releases where we kind of knew partway through, "Oh, no, this is not going as planned. This is very bad." Not a lot of fun.

Finally, how do you respond when someone asks, "How successful was that release?" You might not say it to them, but you're thinking, "Well, I don't know. Haven't really had any incidents. Not sure, but hopefully it went okay." I don't think it has to be that way.

What I've been up to for the last year or two is demystifying progressive delivery, especially the role of automated data attribution early on as you're rolling things out. I got here because I've spent most of my career focused on developer tools, developer communities, and what I call sustainable software delivery practices, where the focus is delivering impact without burning out humans. Early on, I did my share of burning myself out, and then I became a product manager and went on from there. I believe there are practices we can adopt that let us have greater impact without having to chew people up. That's what motivates me.

I'd like to present a layered approach to progressive delivery. How do we build our way up to faster, safer, smarter releases? We'll spend a little time on background: what is progressive delivery, really? Can we define it concisely? We'll look at a couple of role models: who has already been doing this for quite a while, and what do they use it for? Then we'll lay the foundation. In order to use these practices, you first need to figure out how you're going to decouple deploy from release: how can I push software all the way to production but have it effectively be off? Then we'll cover the upper layers: the automated data-informed practices that are what deliver value to your team and to your business.

Progressive delivery: what is it really? Let's talk about the roots, because the roots are important to understand. Is this just about gradually rolling out software, or is there something more going on?

You may know Sam Guckenheimer. He retired recently from Microsoft, where he was head of Azure DevOps, and he was having a conversation with James Governor, who goes by monkchips on Twitter. Sam said to James, "When we're rolling out services, what we do is progressive experimentation, because what really matters is the blast radius. How many people will be affected when we roll out that service, and what can we learn from them?"

I give this example because the wording is important. Sam said progressive experimentation, and he was talking about learning early in the process. He wanted not just to limit the blast radius and go out slowly, but to learn as much as possible. James had this aha moment. He was connecting a lot of dots and seeing practices about changing the way you roll all the way out to production, and he coined the term progressive delivery. He described it as a new basket of skills and technologies concerned with modern software development, testing, and deployment: how do I roll out in an intelligent, effective way where I have smaller lead times, happier teams, and more impact on the business?

Carlos Sanchez, now a software cloud engineer at Adobe, wrote a blog post and described progressive delivery clearly and succinctly. Progressive delivery is the next step after continuous delivery, where new versions are deployed to a subset of users and evaluated in terms of correctness and performance before being rolled out to the totality of users, and rolled back if not matching key metrics. The subset of users is important. I'm evaluating as I go, and I'm not just evaluating QA correctness: does it work or fail? Performance can mean speed and system resources. It can also mean business impact. Is it having the impact I expect before I expose everyone to it? Your culture may determine the criteria for rolling back instead of saying, "Okay, we learned something; let's try again."

Let's switch gears to a couple of role models. The term progressive delivery was coined in 2018, but the practice has been going on for well over a decade.

First, Walmart Expo. Walmart built its own platform for gradually rolling things out and learning as it was happening, because when they built their solution there was not really a market of solutions that could do this. They built it from scratch. What's interesting is that they use it for two reasons: test to learn and test to launch. Test to learn sounds like A/B testing or experimentation, and that assumption is right. Test to launch is more about layering practices on top of progressive delivery: they wanted to run A/B tests on the fly during partial deliveries so they could determine whether they were impacting the business before rolling something all the way out.

The second example is LinkedIn Experimentation. Users make requests from a central service. The central service has a library it can call that tells it what to expose to the user, and the user gets what they're supposed to get. You could decide that students get a 50/50 split, job seekers get 20/80, and everybody else gets none of it. What's interesting is that the example says, "LiX failed on site speed." I can be fairly sure this was not an experiment on what would be faster. It was probably an experiment on what would make people sign up for job offers. But this is an example of a guardrail. LinkedIn is always watching things they've determined are always important, like errors and response time. In this case the alert is firing that the thing being rolled out is slowing things down by 50%, and that is not good. We'll come back to guardrails.
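As a rough illustration of this kind of per-segment targeting (not LinkedIn's actual implementation), the sketch below assigns each user to a stable bucket by hashing and then applies a per-segment exposure percentage. The names `SEGMENT_EXPOSURE`, `bucket`, and `treatment_for`, the flag name, and the reading of "20/80" as 20% exposed are all assumptions made for the example.

```python
import hashlib

# Hypothetical targeting rules: segment -> percent of that segment that gets the new experience.
SEGMENT_EXPOSURE = {"student": 50, "job_seeker": 20}
DEFAULT_EXPOSURE = 0  # everybody else gets none of it


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def treatment_for(user_id: str, segment: str, flag_name: str = "new_experience") -> str:
    """Return "on" or "off" for one user, based on their segment's exposure percentage."""
    percent = SEGMENT_EXPOSURE.get(segment, DEFAULT_EXPOSURE)
    return "on" if bucket(user_id, flag_name) < percent else "off"
```

Because the bucket comes from a hash of the flag name and user ID, the same user keeps the same experience from request to request, which is what lets their later behavior be attributed to the treatment they received.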

Let's move on to the foundation. In order to do any of these practices, we need a way of decoupling deploy from release. There are a number of ways to do that, and how you roll matters. Blue-green deployment, canary releases, and feature flags expose different benefits: avoiding downtime, limiting the blast radius, limiting work in progress and achieving flow, and learning during the process.

Blue-green deployment is having a copy of your production infrastructure. Because you have a copy, and this is easier in the cloud, you can build the next release and take as much time as you want to get it ready. There is no maintenance interval while you upgrade the solution. You do the work on the green deployment and switch network traffic from blue to green. Instantly, people are in, with no downtime. Because you can switch network traffic between blue and green, you can switch back, which is why it gets partial credit for limiting blast radius. If something goes wrong, you can flip it back very fast. You may have exposed all users to the problem, but not for long. It does not really help with rolling out smaller pieces or learning inherently in the process.

With canary releases, you spawn a few extra containers and send maybe 2% or 3% of network traffic through those containers with the new release. Again, you can build those containers on your own time, so there is no downtime. Canary releases get full credit for limiting blast radius because you expose only a small percentage of users and can route traffic back through the other containers quickly. They do not necessarily help with limiting work in progress or achieving flow, because you are still pushing a release out on those canaries, unless you have achieved the microservices holy grail where a feature is on its own server. Most of us do not live in that world. For learning during the process, canaries get partial credit because you can be hypervigilant over the small number of servers and try to notice how it is working, but you are still looking for a needle in the haystack of everything in that release.

Things changed when feature flags became widely adopted. It is not a coincidence that this happened around the same period that the term progressive delivery was coined. Feature flags unlock limiting work in progress and achieving flow. They let you release not just a deployment, but literally a block of code in a deployment. Any subset of your app can be wrapped in a feature flag and controlled remotely. You can push more things out and have more nimble control. If one feature has an issue, you can turn that feature off without reverting the whole release.

But feature flags themselves do not build in learning during the process. You may be releasing any number of things at roughly the same time, so time-based correlation is not going to help you. Again, you are searching for the needle in the haystack.

Then comes the approach of feature flags and data integrated together. That's what you saw with the Walmart and LinkedIn examples. They have telemetry associated with the deployment so they know which users are getting which experience, and they know what the experience is for each group of users. Therefore they can calculate what is happening. Because they have data science built into automation, they can figure this out on the fly. They do not study it after the fact; they know about it as it is happening. That is the practice I'm talking about today: the upper layers of the pyramid.

A quick recap on feature flags: you place a function call in your code. You can create one version of your code that can move through different environments and the pipeline, while exposure can be controlled remotely. It's like having a dimmer switch for changes in the cloud. You might roll all the way to production and have 0% of users exposed, so you can test in production or push a component of an unfinished feature all the way out so it can be validated in production before exposing it to users. Once you're ready, you can dial it up gradually.

For those who like a little code, the deck shows an on/off example asking the flagging subsystem whether this user at this moment should be shown related posts. This is evaluated a user at a time, a session at a time. It is not a line in a config file for the whole server or population. It could be right down to Dave: if the treatment is on, show him related posts; if not, skip it. The multivariate example tries two new search algorithms against the legacy search algorithm. You can divide the population, perhaps 80% through legacy and 10% each through V1 and V2, then move to one-third, one-third, one-third and compare business outcomes and system load.
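The slides themselves are not reproduced in this transcript, so the following is only a minimal sketch of what such calls could look like, not the code from the deck or any vendor's SDK. The function `get_treatment`, the allocation lists, and the flag names are hypothetical.

```python
import hashlib

BUCKETS = 10_000


def bucket(user_id: str, flag_name: str) -> int:
    """Stable per-user, per-flag bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % BUCKETS


def get_treatment(user_id: str, flag_name: str, allocation: list) -> str:
    """Pick a treatment for one user. allocation is a list of (treatment, relative_weight)."""
    total = sum(weight for _, weight in allocation)
    point = bucket(user_id, flag_name) / BUCKETS * total
    cumulative = 0
    for treatment, weight in allocation:
        cumulative += weight
        if point < cumulative:
            return treatment
    return allocation[-1][0]


# On/off example: decide per user whether to show related posts.
RELATED_POSTS = [("on", 5), ("off", 95)]


def render_page(user_id: str) -> list:
    sections = ["article"]
    if get_treatment(user_id, "related_posts", RELATED_POSTS) == "on":
        sections.append("related_posts")
    return sections


# Multivariate example: start cautious, then move to an even three-way split.
SEARCH_INITIAL = [("legacy", 80), ("v1", 10), ("v2", 10)]
SEARCH_EVEN = [("legacy", 1), ("v1", 1), ("v2", 1)]
```

In this sketch the allocation lists live in the code for brevity. In a real flagging system they would be held remotely, so that changing 80/10/10 to an even split needs no new deploy.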

Once you have feature flags to decouple deploy from release, you get incremental feature development for flow. If you build things a little bit at a time and roll them out as smaller packages that are easier to evaluate, you can get more done and achieve flow instead of a logjam. You also get testing in production, which does not mean using users to test. It can mean testing code in the production environment with production data while not exposing the feature to users until you are happy with it. You also get a kill switch, a panic button. If something goes wrong, you do not need a hotfix, rollback, or war room to decide how to fix it. You can turn it off, then have the conversation about what went wrong and what should be done next.

That's the foundation. Now come the data-informed practices, the upper layers. These practices are less known in the wild. Companies that have been doing this for a while know about them, but they are generally very large corporations with very large tooling budgets, and they invested years and millions of dollars. My premise is that this is no longer necessary. There are ways to accomplish this with off-the-shelf services now. I want to talk about introducing these practices into our environments so we can work a different way.

Why automate data-informed practices? Because a different way to ship becomes possible. It starts with deploying with no user exposure. The code is deployed, but no one is seeing the code or being influenced by the outcome. Then we go through error mitigation: testing in production with 0% of users, then perhaps 1%, 2%, or 5%, trying to find bugs, crashes, or things going wrong that we missed earlier. What can I catch before the big Twitter rant from people?

Then we move to measuring release impact. At this step, we're halfway through. Maximum power is achieved by pushing as many people as possible through both sides of the experience, the people getting the new thing and the people not getting it, so you have the greatest statistical power to determine whether it is delivering the impact you want. One last stop before full rollout is scale mitigation. If everything seems fine until peak traffic arrives and then things go wrong, what if you could ramp to 70%, 80%, or 90%, catch a peak period, and know whether the thing is going to behave at scale? Then you can finish the release.
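The claim that an even split gives the most statistical power can be checked with a little arithmetic. The sketch below assumes a simple comparison of two group means with the same per-user standard deviation `sigma` in both groups; the function name and the numbers are illustrative only.

```python
import math


def standard_error(total_users: int, treatment_share: float, sigma: float = 1.0) -> float:
    """Standard error of (treatment mean - control mean) for a given traffic split."""
    n_treatment = total_users * treatment_share
    n_control = total_users * (1 - treatment_share)
    return sigma * math.sqrt(1 / n_treatment + 1 / n_control)


if __name__ == "__main__":
    # A smaller standard error means a smaller real difference can be detected.
    for share in (0.05, 0.25, 0.50, 0.75, 0.95):
        print(f"{share:.0%} in treatment: standard error {standard_error(10_000, share):.4f}")
```

Under these assumptions the standard error is smallest when the two groups are the same size, which is why the measurement step sits at the midpoint of the ramp rather than at 5% or 95%.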

The key is that these practices of error mitigation, measurement, and scale mitigation can be added without asking people to do more work.
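One way to picture the stages just described is as a ramp plan that a pipeline steps through on its own. This is a sketch under assumptions: the stage percentages echo numbers mentioned in the talk, while `RAMP_PLAN`, `next_exposure`, and the idea of a single `guardrails_healthy` signal are hypothetical simplifications.

```python
# Hypothetical ramp plan: (percent of users exposed, purpose of the stage).
RAMP_PLAN = [
    (0, "deploy with no user exposure"),
    (5, "error mitigation"),
    (50, "measure release impact"),
    (90, "scale mitigation"),
    (100, "full release"),
]


def next_exposure(current_percent: int, guardrails_healthy: bool) -> int:
    """Return the exposure percentage to move to after the current stage."""
    if not guardrails_healthy:
        return 0  # kill switch: turn the feature off, then talk about what went wrong
    for percent, _purpose in RAMP_PLAN:
        if percent > current_percent:
            return percent
    return 100  # already fully released
```

The point of encoding the plan is that nobody has to remember or hand-execute the steps; the same sequence runs for every release.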

Can't we just change things and monitor what happens? I came through IT at a time when we had heroes: amazing people who could dig through log files and do amazing feats of diagnosis when things were going wrong. You always wanted that person around if something went wrong. But it is expensive to find these people, and expensive on them as human beings to always be in that mode. You cannot always see what is going on and figure it out quickly, especially when you're doing more changes more often.

The problem is separating signal from noise. Other things are going on in the world while we're doing releases. There may be other product changes happening at the same time. There might be marketing campaigns; your company may be paying money to deliberately change user behavior. You do not want to take credit, good or bad, for a change in user behavior if somebody else is moving the needle through a marketing campaign. Global pandemics change user behavior. Is the change because everyone is working from home, or because my feature is meeting their needs? Even nice weather can influence user behavior, depending on what you sell and who you sell to. We need a way of separating signal from noise.

Imagine rolling out a feature. At 100%, traditional IT sees response times and latency go up and throughput go down. That's bad. But there is another bar: when the feature was rolled out to 5% of users. If you decouple deployment from release and roll out to 5%, then look at your usual graphs, you might see ambient traffic and no anomaly and say, "We're good." The problem is that at 5% exposure the effect is diluted by everyone else's traffic, so the problem would have to be 20 times bigger than normal for you to see it in those aggregate graphs. Progressive delivery inherently breaks the typical way of looking at how we are doing today.
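The factor of 20 follows from simple dilution arithmetic. The sketch below assumes the regression adds a fixed latency to every exposed request and that the graph shows a plain average over all traffic; the function names and millisecond figures are made up for illustration.

```python
def aggregate_shift_ms(exposure_percent: int, per_user_effect_ms: float) -> float:
    """How far the all-traffic average moves when only some users see the change."""
    return per_user_effect_ms * exposure_percent / 100


def effect_needed_ms(exposure_percent: int, visible_shift_ms: float) -> float:
    """Per-user effect required to move the all-traffic average by visible_shift_ms."""
    return visible_shift_ms * 100 / exposure_percent


if __name__ == "__main__":
    # A 100 ms regression seen by 5% of users moves the overall average by only 5 ms.
    print(aggregate_shift_ms(5, 100))
    # To move the overall average by 100 ms at 5% exposure, the regression must be 2000 ms.
    print(effect_needed_ms(5, 100))
```

At 100% exposure the same 100 ms shift needs only a 100 ms regression, so the ratio between the two cases is 100 / 5 = 20.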

The good news is that there is a way around this. Think about noise-canceling headphones, but for your metrics. Noise-canceling headphones listen to ambient noise and inject the inverse into your ears along with the music or podcast, so you hear the thing you want without outside noise. With metrics, take half your users and send them through the new thing, and half through the status quo. A scientist would call those the treatment and the control. Compare the distributions of user behavior and system behavior for those two populations. If they overlap exactly, then your change had no measurable influence. It may be a busier day, and both may have more latency than normal, but if both show the same increase, the change did not cause it. If the distributions are apart, then you know you did something and can see where the metrics differ.
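A very simplified version of that comparison is a two-sample test on the treatment and control groups. The sketch below uses Welch's t-statistic with only the standard library; it is an assumption about how one might do this by hand, not a description of any particular stats engine, and the threshold of 3 is an arbitrary illustrative cutoff.

```python
import math
import statistics


def welch_t(treatment: list, control: list) -> float:
    """Welch's t-statistic for the difference in means between two groups."""
    diff = statistics.fmean(treatment) - statistics.fmean(control)
    se = math.sqrt(
        statistics.variance(treatment) / len(treatment)
        + statistics.variance(control) / len(control)
    )
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def differs(treatment: list, control: list, threshold: float = 3.0) -> bool:
    """True when the two groups are further apart than noise alone would suggest."""
    return abs(welch_t(treatment, control)) > threshold
```

Note what cancels out: if a busy day adds the same latency to every request, both means shift by the same amount and the statistic is unchanged. Only a difference between the two groups moves it. Production systems typically add safeguards this sketch omits, such as corrections for checking results repeatedly during a rollout.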

That unlocks the upper layers of the pyramid. First, automate guardrails. Guardrails are the LinkedIn site-speed idea: alert on exception and see performance hits early in rollout without toil or people being hypervigilant. If a stats engine watches what you generally care about, such as errors, response time, or unsubscribes, and alerts you, you can push 10 or 100 times a day without making people go crazy paying attention. The system does it for you.
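To make the guardrail idea concrete, here is a minimal sketch of a check that runs on every rollout regardless of what the feature is for. The `Guardrail` class, the metric names, and the thresholds are hypothetical, and comparing raw means is a simplification; a real stats engine would also test whether the difference is statistically significant.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    metric: str
    max_relative_increase: float  # 0.10 means: alert if treatment is more than 10% worse


def check_guardrails(guardrails: list, treatment_means: dict, control_means: dict) -> list:
    """Return the names of guardrail metrics where treatment is worse than allowed."""
    alerts = []
    for rail in guardrails:
        baseline = control_means[rail.metric]
        observed = treatment_means[rail.metric]
        if baseline > 0 and (observed - baseline) / baseline > rail.max_relative_increase:
            alerts.append(rail.metric)
    return alerts


if __name__ == "__main__":
    rails = [Guardrail("response_time_ms", 0.10), Guardrail("error_rate", 0.20)]
    treatment = {"response_time_ms": 300.0, "error_rate": 0.010}
    control = {"response_time_ms": 200.0, "error_rate": 0.010}
    # Treatment is 50% slower than control, so the response-time guardrail fires.
    print(check_guardrails(rails, treatment, control))
```

The guardrails are defined once and applied to every flag, so an experiment about sign-ups still gets checked for site speed without anyone asking for it.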

Next, measure release impact. We want to know whether the thing we did makes a difference. If you are iterating often, achieving continuous delivery, using decoupled deploy and release to achieve flow, and shipping repeatedly, but you don't know whether it has an effect, that is demoralizing. It has been called a feature factory. We do not want a world where people move faster and still do not know whether they are having an impact. When you have direct evidence of your efforts, you are more likely to get pride of ownership. Even if something goes wrong, people can say, "Let's fix that," because they know it is not as good as it could be. Direct evidence is better for psychological safety because you actually know whether you are hurting the business. It is not rumors or someone's opinion.

Finally, test to learn, what you might traditionally think of as A/B testing. We want to take bigger risks safely and learn faster with less investment. A common misconception is that A/B testing means building two of every feature. Product managers may say they do not want to use two story points instead of one. That is not what this is about. What you may be A/B testing is the status quo against one new change. That is just one new thing and seeing its impact.

There are also ways to try different new things without creating multiple versions of software. One is dynamic configuration. If you use feature flags to decouple deploy from release, they can carry a payload: parameters specific to the population getting that flag. Speedway Motors, a car parts site, has a recommendation engine that takes input parameters. They wanted to see how different parameters behaved in the real world, so they set up dynamic configuration to send different sets of parameters to different cohorts and observe what happened. They could then iterate without a new release by changing the dynamic configuration values. Instantly, the user population was on a new experiment and data was captured in sync.
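Dynamic configuration can be sketched as a flag whose treatments each carry a payload of parameters. Everything below is hypothetical: the parameter names, the values, and the toy `recommend` function are illustrations, not Speedway Motors' actual setup.

```python
import hashlib

# Hypothetical payloads. In a real system these would be edited remotely, not in code.
FLAG_PAYLOADS = {
    "baseline": {"max_results": 4, "min_similarity": 0.50},
    "wider": {"max_results": 8, "min_similarity": 0.35},
}


def treatment_for(user_id: str, flag_name: str = "recommendation_params") -> str:
    """Assign each user to one of the configured treatments, stably."""
    treatments = sorted(FLAG_PAYLOADS)
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return treatments[int(digest, 16) % len(treatments)]


def recommend(user_id: str, candidates: list) -> list:
    """candidates is a list of (part_name, similarity). The same code runs for every cohort."""
    params = FLAG_PAYLOADS[treatment_for(user_id)]
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    picks = [name for name, score in ranked if score >= params["min_similarity"]]
    return picks[: params["max_results"]]
```

There is only one version of `recommend`. The experiment lives entirely in the payload, so trying a third set of parameters means adding an entry to the configuration rather than shipping new code.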

Another example is painted door testing. It is a usability hack where you build the entrance to a feature without building all the backend complexity. When people click it, you can either throw an error or, better, say, "Thanks for your interest in this new feature. We're working on that. We'll get back to you." This lets you test interest without full investment.
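A painted door needs very little code, which is the point. In this sketch the handler records who clicked and returns the friendly message; the `interested_users` store and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory record of interest: feature name -> set of user IDs who clicked.
interested_users = defaultdict(set)

MESSAGE = "Thanks for your interest in this new feature. We're working on that. We'll get back to you."


def on_painted_door_click(user_id: str, feature: str) -> str:
    """Record the click and tell the user the feature is on its way."""
    interested_users[feature].add(user_id)
    return MESSAGE


def interest_count(feature: str) -> int:
    """Number of distinct users who tried to open the feature."""
    return len(interested_users[feature])
```

Counting distinct users rather than clicks keeps one enthusiastic person from looking like broad demand.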

To take bigger risks safely, imagine a food delivery service. If you ship the right food to people, they are more likely to buy more, like you, and stay. But you need to ask more questions to know what they want. Imperfect Foods had a sign-up flow and proposed adding more questions so they could better cater to customers. The concern was that people would drop off and not finish. They created a traditional flow, a slightly longer flow, and a significantly longer flow, and tested them side by side. The longer one yielded consistently better results, with $7 or $9 more per order from people who went through that flow. So they fired it up. This was a way to take a risk without a big debate over whether to do it and what would happen.

This is what sustainable software looks like. This new way of progressive delivery with automated data science creates a more sustainable flow. Traditionally, we think of deploy and release. The steps in the middle are only possible by automating the statistics as the rollout happens. This is not about studying the data afterward. It is not about begging a favor from someone to look at data. This is something that happens automatically. That is the way we build pipelines: automate things so we can move them as many times as we want, and the system does the same thing every time to make sure we operate consistently, effectively, and efficiently.

This is the only slide where I'll really talk about Split Software. In-house progressive delivery platforms paved the way for Split. Our founders came from multiple shops where they had built large complex systems. When they went to places that did not have them, they missed them and wished they could just buy this. That is what they set out to build.

Many companies have adopted parts of progressive delivery, like feature flags. Some have added sensors and tried to do correlation. Very few have figured out the stats engine, system of record, and alerting. Those that did were spending tens of millions of dollars a year, and they re-up because they love it. It is a competitive advantage. We do not all have $50 million, $30 million, or $25 million a year to allocate, or two to five years to build it. The engineers who built Split set out to make this something you can subscribe to as SaaS.

Let's move on to Q&A with Slack. I'm looking forward to your questions. I also want to invite you to come to our booth. We're doing a fun thing called Confessions and Redemptions in Continuous Delivery, where we usually have great conversations with people at in-person shows about what they're working on and how it has gone. We're going to recreate that here. We're also holding a raffle to give away an Oculus Quest 2 virtual reality headset. I hope you found this talk interesting and informative. I have a lot of vendor-neutral content we can talk about. This is not all about Split. Split is just one company leading the way in making this productized.