A Layered Approach to Progressive Delivery
Progressive Delivery is the practice of decoupling deploy from release, allowing changes to be safely pushed all the way to production and verified there before releasing to users. Selectively dialing up and down the exposure of code in production without a new deploy, rollback, or hotfix is the foundation of Progressive Delivery, but the higher-level benefits of safety and fast feedback come from layering practices on top of that.
Whether you are new to Progressive Delivery or are already practicing some aspect of it, you'll learn or refresh the basics and then come away with a powerful model for layering higher-value benefits on top of that foundation.
This session is presented by Split.
Full transcript
The complete talk, organized by section.
Dave Karow
01Opening questions and motivation
I hope you're enjoying DevOps Enterprise Summit. My name is Dave Karow, and I'm a continuous delivery evangelist at Split Software. Today, I'm going to ask you three questions, and then we're going to talk about layering data-informed practices on top of progressive delivery. Let's jump right in. Okay, so first question, do you do this when your team releases to production?
Each time you're going out to release a feature, is there sort of like a, "Ooh, I hope it goes okay?" Question number two, can you remember a release night that went something like this? And I'll give you a second to take in the meme here. I've been through a few releases where we kind of knew partway through, "Oh, no.
This is not going as planned. This is very bad." Right? Not a lot of fun. And then finally, how do you respond when someone asks, "How successful was that release?" You're kind of like, "Mm..." You might not say that to them, but you're thinking to yourself, "Well, I don't know. Haven't really had any incidents.
Not sure, but hopefully it went okay." Well, I don't think it has to be that way. And what I've been up to for the last year or two is demystifying progressive delivery, especially the role of automated data attribution early on as you're rolling things out, right? And I got here because I've spent most of my career focused on developer tools, developer communities, and what I call sustainable software delivery practices, where the focus is delivering impact without burning out humans. Right? Early on, I did my share of burning myself out, and then I became a product manager and kind of on from there.
And so I believe that there are practices we can adopt which actually let us have greater impact without having to chew people up. Right? And that's kind of what motivates me. So I'd like to present a layered approach to progressive delivery. How do we build our way up to these faster, safer, smarter releases?
We'll spend a little bit of time just on some background. What is progressive delivery? Can we define it in a kind of concise way? We'll look at a couple of role models. Who's already doing this and been doing it for quite a while, and what do they use it for? Then we'll lay the foundation. In order to use these practices, you first need to figure out how you're going to decouple deploy from release.
How can I push software all the way to production but have it be off effectively? And then the upper layers. These are the data-informed practices, the automated data-informed practices that are really what delivers the value to your team and to your business.
02Progressive delivery: what it is
So progressive delivery, what is it really? Let's talk just a second on the roots because I think the roots of progressive delivery are important to understand. Is this just about gradually rolling out software, or is there something more going on here? So you may know Sam Guckenheimer. He retired recently from Microsoft.
He was head of Azure DevOps. And he was having a conversation with James Governor, who goes by the handle @monkchips on Twitter. And this is what Sam said to James. He said, "Well, when we're rolling out services, what we do is progressive experimentation because what really matters is the blast radius.
How many people will be affected when we roll out that service, and what can we learn from them?" And I give this example because I think it's important to focus. Sam said progressive experimentation, and he was talking about learning early in the process. He wanted not just to limit the blast radius and go out slowly, but he actually wanted to learn as much as possible.
And what happened is that James had this aha, and he thought, "You know, I'm connecting a lot of dots, and I'm seeing a number of practices that are about sort of how do I change the way I roll all the way out to production," and he decided to coin this term progressive delivery. And he described it as you see here, "A new basket of skills and technologies concerned with modern software development, testing, and deployment." How do I roll in an intelligent, effective way where I have smaller lead times, happier teams, more impact on the business, right?
So, Carlos Sanchez, who's now a software cloud engineer at Adobe, wrote a blog post, and he described progressive delivery in a really clear, succinct way. I'm going to use this here, which is, "Progressive delivery is the next step after continuous delivery, where new versions are deployed to a subset of users," and that's really important, a subset of users, "and are evaluated in terms of correctness and performance." So I'm evaluating them as I go, and I'm not just evaluating like QA, does it work, does it fail?
That might be what we consider correctness. But performance, is it performant in terms of speed and system resources? Is it also performant in terms of business impact? Is it having the impact I expect before rolling them out to the totality of users? So I'm actually going to learn things before I expose everybody to it, and you roll it back if it doesn't match some key metrics.
Now, your culture may determine what your criteria are for rolling back instead of just saying, "Okay, we learned something. Let's try again." And we'll talk a little bit about how that gets done.
03Role models in the wild
So then let's just switch gears to talk about a couple of role models. Who's doing this already? And the term progressive delivery was coined in 2018, but the practice has been going on for well over a decade. And let's look at these two. First of all, Walmart EXPO. So Walmart built their own platform for gradually rolling things out and learning as it's happening because at the time when they built their solution, there wasn't really a market of solutions that could do this for you.
And so they had to build it from scratch themselves, and you can learn more about how they kind of did the plumbing for that. Right? Big company, bunch of engineers. You can fill in the blanks. Right? But I think what's really interesting is if you look at these two reasons they use it, they call it test to learn and test to launch.
Now, test to learn sounds a lot like A/B testing or experimentation, and you'd be right if that's what your assumption was. Test to launch is actually more about what I'm talking about here in terms of layering practices on top of progressive delivery, and that is that they wanted to be able to effectively run A/B tests on the fly during partial deliveries so that they could determine whether they're impacting the business before they roll something all the way out. Right? The second example is LinkedIn Experimentation. And the stuff on the left shows you kind of roughly how this flows. The users make requests from the central service.
The central service has this library it can call that tells it what it should expose to the user, and the user gets what they're supposed to get. And you could decide, hey, students are going to get 50/50, and job seekers are going to get it 20/80, and everybody else is going to get none of it.
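A minimal sketch of that kind of segment-based targeting, under stated assumptions: the segment names, the percentages (read here as 50% and 20% exposed), and the `exposure` helper are all hypothetical and are not LinkedIn's actual library. Hashing the user ID keeps each user in the same bucket on every request.

```python
import hashlib

# Hypothetical targeting rules: segment -> percent of users who get the new experience.
ROLLOUT_PERCENT = {"student": 50, "job_seeker": 20}
DEFAULT_PERCENT = 0  # everybody else gets none of it


def bucket(user_id: str, experiment: str) -> int:
    """Map a user deterministically to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def exposure(user_id: str, segment: str, experiment: str = "new_feature") -> str:
    """Return "on" or "off" for this user, based on the segment's percentage."""
    percent = ROLLOUT_PERCENT.get(segment, DEFAULT_PERCENT)
    return "on" if bucket(user_id, experiment) < percent else "off"
```

Because the bucket depends only on the experiment name and the user ID, raising a segment's percentage adds users to the exposed group without reshuffling the ones already in it.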
What's interesting about this example, though, if you look on the right side of the screen, what does it say? It says, "LiX failed on site speed." I can be fairly sure this was not an experiment on what's going to be faster. This was probably an experiment on what will make people sign up for job offers, et cetera.
But this is an example of what we call a guardrail, which is that LinkedIn is always watching for certain things they've determined are always important, like errors and response time. And in this case, the alert is firing that, hey, this thing you're rolling out is slowing things down by 50%, and it's not good.
So we'll get back to guardrails in just a bit. Okay. So let's move on to the foundation.
04The foundation: decouple deploy from release
In order to do any of these practices, we need a way of decoupling deploy from release. Now, turns out there are a number of ways to decouple deploy from release, and how you roll does matter. You may have heard of blue-green deployment, canary releases, or feature flags. Whether you've heard of them or not, I'm going to show you how each of these exposes different benefits.
The benefits are down the left side there, avoiding downtime, limiting the blast radius. Again, that's not hurting people for very long and not hurting very many people. There's kind of two aspects of that. And then limiting work in progress and achieving flow. This is how we get shorter lead times.
If we have smaller bits of work that are going through and a smaller number of them, they can go all the way through to safety without everything getting kind of stuck in a log jam of testing and integration troubles. And then finally, learning during the process. This was what Sam Guckenheimer mentioned when he had that conversation.
How do I learn during these partial deployments so I can get value out of them, not just limiting my blast radius? And so blue-green deployment, this is the notion of having a copy of your production infrastructure. And because you have a copy of it, and this is obviously easier in the cloud, if you have a copy of it, you can actually build the next release and take as much time as you want to get it all ready so there's no downtime.
There's no, hey, we're down for a maintenance interval while we upgrade the solution. You can literally do all the work you need to do to get everything ready on the green deployment, and then just switch network traffic over from blue to green. Instantly, people are in, there's no downtime.
And because you have that ability to switch network traffic between the blue and the green, you can switch them back, which is why limit the blast radius gets half credit here, which is that, hey, I can decouple deploy from release, and if something goes wrong, I can basically flip it back really fast.
Now, I might have exposed all my users to the problem, but not for very long, and that's why you get 50%. So 50%: the time, check the box; the scope, not. And then there's nothing really helping here on rolling out smaller pieces or learning inherently in the process. And then along come canary releases, where you use containers to sort of spawn a few extra containers and send maybe 2% or 3% of your network traffic through those containers with the new release on it.
And again, you can build those containers at your own leisure, so there's no downtime needed there. And limiting the blast radius, it gets 100% credit because you are both exposing only a small percentage of your users to it, and if something goes wrong, you can route the traffic right back through the other containers very quickly.
Doesn't really necessarily help you with limiting work in progress or achieving flow because you're still pushing a release out on those canaries, unless you've achieved sort of the microservices holy grail, where a feature is on its own server, but most of us don't live in that world.
So the last one is learn during the process. You'll notice it gets one-quarter credit, and that's because when you roll out this way, you could be sort of hypervigilant of the very small number of servers you're running this on and try to notice how it's working. So you could try to learn quickly because it's all focused in one little place.
But you're still effectively looking for a needle in the haystack of everything that's in that release. Things got a little different when feature flags became widely adopted. And it's not a coincidence that when feature flags became widely adopted, it's the same time period that progressive delivery was coined.
And that's because of that third full circle there, the limiting work in progress and achieving flow. Feature flags let you release not just a deployment, but literally a block of code in a deployment. So any subset of your app can be wrapped in a feature flag and controlled remotely.
And that meant that you could actually push much more stuff out and have much more nimble control of it. And if one feature has an issue, you can turn that feature off without having to revert the whole release. The last circle is empty because learning during the process, there's nothing built into feature flags themselves that lets you learn what's going on, and you might be releasing any number of things at roughly the same time.
And so time-based correlation is not really going to help you. Again, you're kind of searching for the needle in the haystack. Then along comes this approach, which is having feature flags and data integrated together. And that's what you saw with the Walmart example, with the LinkedIn example.
They have telemetry associated with the deployment such that they know which users are getting which experience, and they know what the experience is for each of those groups of users. Therefore, they can calculate this. And because they have a lot of data science built into automation, they can figure this out on the fly. They don't study this after the fact.
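The attribution step described here can be sketched roughly as a join between two streams: which user was served which treatment, and what each user then experienced. The `attribute` function and the record shapes below are assumptions for illustration, not the Walmart or LinkedIn implementation.

```python
from collections import defaultdict
from statistics import mean


def attribute(impressions, events):
    """Average a metric per treatment by joining on user ID.

    impressions: iterable of (user_id, treatment) pairs
    events: iterable of (user_id, metric_value) pairs
    """
    treatment_of = dict(impressions)
    by_treatment = defaultdict(list)
    for user_id, value in events:
        treatment = treatment_of.get(user_id)
        if treatment is not None:  # skip users who were never served the flag
            by_treatment[treatment].append(value)
    return {name: mean(values) for name, values in by_treatment.items()}
```

For example, `attribute([("a", "on"), ("b", "off")], [("a", 120.0), ("b", 100.0)])` groups the first event under "on" and the second under "off", so each group's experience can be compared as the rollout proceeds.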
They know about it as it's happening. And that's the practice that I'm talking about today. That will be our upper layers of the pyramid. So just a quick recap on feature flags. You place a function call in your code. It goes out. You can create one version of your code that could be in different environments and rolling through the pipeline, and your ability to expose that code can be done remotely. It's like having a dimmer switch for changes in the cloud.
And you might roll all the way to production and have, as this first row shows here, 0% of your users exposed. And the reason you would do that is so that you can test in production and because maybe the feature isn't even finished yet, but you wanted to push a component of it all the way out to production so it could be validated in production, and move on to the next component before you ever expose it to users.
And then once you're ready, you can dial it up gradually. And for those that like a little bit of code, this slide and the next one are the only code I'll be showing in today's presentation. So don't panic if code's not your thing. A function call that says, "Hey, I'm asking the flagging subsystem to tell me, should this particular user at this particular moment in time be shown the related posts?" And this is evaluated a user at a time, a session at a time.
This is not a line in a config file for the whole server or for the whole user population. This could be right down to Dave. If the treatment's on, show him the related post. If it's not, skip it. And then here's a multi-example, which is I may be trying out two new search algorithms, and I want to compare them to my legacy search algorithm for their ability to recommend or respond, either time or value of the responses.
And I can divide my user population up, and I might have 80% going through legacy and 10% and 10% going through the V1 and V2, and then run it up to the point where it's one-third, one-third, one-third, and then compare the business outcomes and the system load. So again, that's the end of the code.
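The slides themselves are not part of this transcript, so here is a rough sketch of what the two calls described above might look like. `FlagClient`, `get_treatment`, the flag names, and the percentages are hypothetical stand-ins, not any vendor's SDK.

```python
import hashlib


class FlagClient:
    """Toy stand-in for a feature-flag SDK (hypothetical API)."""

    def __init__(self, allocations):
        # allocations: flag name -> list of (treatment, percent), summing to 100
        self.allocations = allocations

    def get_treatment(self, user_id: str, flag: str) -> str:
        """Decide per user, per call, which treatment this user gets."""
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        upper = 0
        for treatment, percent in self.allocations[flag]:
            upper += percent
            if bucket < upper:
                return treatment
        return "control"  # only reached if the percentages do not cover 100


flags = FlagClient({
    "related_posts": [("on", 10), ("off", 90)],
    "search_algorithm": [("legacy", 80), ("v1", 10), ("v2", 10)],
})


def render_page(user_id: str) -> list:
    """On/off flag: show related posts only if this user's treatment is on."""
    sections = ["article"]
    if flags.get_treatment(user_id, "related_posts") == "on":
        sections.append("related_posts")
    return sections


def search(user_id: str, query: str) -> str:
    """Multi-treatment flag: route each user to one of three algorithms."""
    treatment = flags.get_treatment(user_id, "search_algorithm")
    if treatment == "v1":
        return f"v1 results for {query}"
    if treatment == "v2":
        return f"v2 results for {query}"
    return f"legacy results for {query}"
```

Moving from 80/10/10 to one-third each would be a change to the allocation data only. The code in `search` stays the same, and that is what makes the exposure adjustable without a new deploy.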
So once you have feature flags in place to decouple deploy from release, you get things like the ability to do incremental feature development for flow. So if you can build things a little bit at a time and roll them out as smaller packages that are easier to evaluate, you can actually get more done and achieve flow as opposed to a log jam.
And then testing in production, as I mentioned, and that doesn't mean using your users to test. It literally can mean testing the code in the production environment with the production data, but actually not exposing the feature to users until you're happy with it. And then finally, and this is very popular with developers, there's the kill switch, kind of a panic button, a big red button.
And the distinction here is that because you're using feature flags, if something goes wrong, you don't need a hotfix, you don't need a rollback, you don't need a war room of people deciding how are we going to fix this. You can literally just turn it off and then have your conversations about what went wrong and what should be done next.
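A kill switch can be sketched as a flag whose state lives outside the deployed code. The in-memory `FlagStore` below is an assumption standing in for a remotely managed flag service; the flag name and `checkout` function are illustrative only.

```python
class FlagStore:
    """In-memory stand-in for a remotely managed flag service (hypothetical)."""

    def __init__(self):
        self._killed = set()

    def kill(self, flag: str) -> None:
        """Turn a feature off for everyone, without a deploy or rollback."""
        self._killed.add(flag)

    def is_on(self, flag: str) -> bool:
        return flag not in self._killed


store = FlagStore()


def checkout(cart_total: float) -> str:
    # The legacy path stays in the codebase, so "off" is always a safe state.
    if store.is_on("new_checkout"):
        return f"new checkout: {cart_total:.2f}"
    return f"legacy checkout: {cart_total:.2f}"
```

Calling `store.kill("new_checkout")` sends the very next request down the legacy path. The conversation about what went wrong can then happen after the impact has stopped.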
So that's the foundation, and now come those data-informed practices that I promised that are these upper layers, and these are the practices that are a little less known out in the wild. Obviously, the companies that have been doing this for a while know about this. And we can talk more about what some of those other role models are, but I hesitate to name a lot of role models because they're generally very large corporations with very large tooling budgets who've invested years and millions of dollars to actually accomplish this practice.
And my premise is that that's no longer necessary. There are ways to accomplish this with off-the-shelf stuff now, with services. But I want to talk about how would we, in a sane way, introduce these practices into our environments so we can start working a different way. So first of all, why would we automate data-informed practices?
What's the main motivation? And I believe it's because there's a different way to ship that becomes possible, and I think most of you are here because you're trying to figure out different ways to ship or deliver value, deliver software. So it starts with deploying with no exposure to the users.
The code is deployed, but no one's seeing the code or being influenced by the outcome. And then we go through a step we call error mitigation, which is here I'm testing in production maybe with 0% of my users, but then I might roll it out to 1% or 2% or 5% of my users. And here I'm just trying to find bugs or crashes, things that are going wrong that I missed in my earlier testing. What can I catch before the big Twitter rant from people? How can I actually check for errors?
05The upper layers: automated data-informed practices
Then I want to move over, and this is where it gets a little more exciting, which is I want to be able to measure the impact of the release.
And in this step, we're kind of halfway through, and frankly, when it says your maximum power ramp, maximum power is achieved by pushing as many people through both sides of this experience as possible, the people who are getting a new thing and not, so that you have the greatest statistical power to determine whether it's actually delivering the impact you want.
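One way to see why an even split gives the most statistical power: for a fixed number of users, the standard error of the difference between two group means is smallest when the groups are the same size. The sketch below assumes both groups have the same standard deviation `sigma`, which is a simplification.

```python
import math


def standard_error(total_users: int, treatment_share: float, sigma: float = 1.0) -> float:
    """Standard error of (treatment mean - control mean) for a given split.

    Assumes 0 < treatment_share < 1 and equal variance in both groups.
    """
    n_treatment = total_users * treatment_share
    n_control = total_users * (1.0 - treatment_share)
    return sigma * math.sqrt(1.0 / n_treatment + 1.0 / n_control)
```

With 10,000 users, `standard_error(10000, 0.5)` is smaller than `standard_error(10000, 0.05)`, so the same real effect is easier to distinguish from noise at 50/50 than at a 5% ramp.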
And then one last stop before we go all the way out, which is scale mitigation. Have you ever had a release where everything seemed fine, but then along came your peak period, the big time when people are opening their messages or doing their thing, and then things went wrong? So what if you could ramp to, say, 70% or 80% or 90%, whatever seems more reasonable in your environment, and ride through a peak period? And only then decide, "Hey, yeah, so during our peak period, we saw the usual patterns.
We didn't see any weird spike. We didn't see any race conditions or anything. It's all good." So then we release. So historically, we think of these white ones, deploy and release. And now what we're proposing is that there are these data-informed practices of error mitigation and measure and scale mitigation that we can add without actually asking your people to do more work, and you'll see why that's true.
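The five steps can be written down as a ramp plan. The sketch below is illustrative: the specific percentages are picked from the ranges mentioned in the talk, and the policy of dropping back to 0% when a stage looks unhealthy is an assumption, since what triggers a rollback will depend on your culture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    percent: int  # share of users exposed at this stage
    goal: str


RAMP_PLAN = [
    Stage("deploy", 0, "code is in production, nobody is exposed"),
    Stage("error mitigation", 5, "catch bugs and crashes missed in testing"),
    Stage("measure", 50, "maximum statistical power to measure impact"),
    Stage("scale mitigation", 80, "ride through a peak period"),
    Stage("release", 100, "everyone gets the change"),
]


def next_stage(current: Stage, healthy: bool) -> Stage:
    """Advance one stage if healthy; otherwise fall back to 0% exposure."""
    if not healthy:
        return RAMP_PLAN[0]
    index = RAMP_PLAN.index(current)
    return RAMP_PLAN[min(index + 1, len(RAMP_PLAN) - 1)]
```

The `healthy` input is where the automated statistics come in: if that signal is computed by the system rather than by a person watching dashboards, the extra stages do not add manual work.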
So first of all, can we just change things and monitor what happens? So, I kind of came through IT at a point where we had our heroes, and we had our amazing people that could dig through log files and do sort of amazing feats of diagnosis when things were going wrong, and you always wanted to have that person around if something did go wrong.
But it's expensive to find these people, and it's hard on them as human beings to always be in that mode. So you can't always see what's going on and figure it out quickly, especially when you're doing more changes more often. So this is the problem we need to solve, which is: how do I separate signal from noise? There's other things going on in the world while we're doing our releases. And there may be other product changes happening at the same time.
There might be marketing campaigns. Your company may be paying money to deliberately change the behavior of your users. So you don't want to take credit for good or for bad for a change in the user behavior if somebody's actually moving the needle through a marketing campaign. And then an example, global pandemics, they change user behavior.
So how do I separate that from the fact that I introduced a new feature? Is it because everyone's working from home, or is it because... my feature is really meeting your needs. What is it? And then finally, something as simple as nice weather, depending on what you sell and who you sell it to, can have an influence on user behavior.
And so if the weather changes, do you want that to throw off your ability to determine how well your release works? Do you want to be like, "Meh, I don't know. It was a really sunny weekend." Be the guy at the first part of my deck, right? How do we make it so that those things don't throw us off?
We need a way of separating a signal from the noise. And if you ask yourself, what do you already have in your life that lets you separate signal from noise? We'll come back to that in just a second. Think about when we used to fly on planes a lot, what you wanted to have to avoid the noise.
So imagine this. You're rolling out a feature, and as you go to 100%, the bar on the right is kind of traditional IT, which is we released it, and then, oh my god, response times went up and latency went up and throughput went down. And that's bad. But look, there's another bar to the left of that first one, and it's actually when the feature was rolled out to 5% of the users.
So if you adopt this sort of progressive delivery and you define that as just decoupling deployment from release, and you say, "Yeah, we're going to roll it to 5%," and you're looking at your usual graphs, you might see something like this. And you'd say, "Well, looks like the ambient traffic on the graph.
I don't see an anomaly, so we're good," right? And that's the problem, that if you roll out to 5%, the problem would have to be 20 times bigger than normal for you to even see it. So inherently, it kind of breaks your typical way of looking at how are we doing today. But the good news is there's a way around this.
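The dilution is simple arithmetic. The sketch below uses made-up numbers: a 200 ms baseline and a feature that slows its users down by 50%, to 300 ms.

```python
def blended_mean(baseline: float, treated: float, exposed_share: float) -> float:
    """Average a site-wide graph would show when only some users get the change."""
    return (1.0 - exposed_share) * baseline + exposed_share * treated
```

With 5% exposure, `blended_mean(200.0, 300.0, 0.05)` works out to 0.95 × 200 + 0.05 × 300 = 205 ms. A 50% slowdown for the exposed users shows up as a 2.5% move in the aggregate, which is easily lost in normal day-to-day variation.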
And again, the tip I gave was to think about noise-canceling headphones. So how do we cancel out external influence with a stats engine? It works much like the noise-canceling headphones would, but for your metrics. Noise-canceling headphones have microphones on the outside of the headphones that are listening to the ambient noise.
They then inject into your ears the inverse of that ambient noise, along with the music or the podcast. As a result, you hear the music and the podcast without the outside noise because it's been canceled out. Right? First time I tried that on a plane, I literally laughed when I turned it off again and heard the difference. I was like, "Oh my God." So here's how this works with your metrics. Take half your users and send them through the new thing, and half your users and send them through the status quo.
What a scientist would call the control for the status quo and the treatment for the new thing. And then compare the distributions of their behavior and the system behavior for those two populations. If they overlap exactly, then the thing you did really didn't have any influence. It may be a busier day and there may be more latency than normal, but if they both show the same increase in latency, it's not the change that did it.
Conversely, if these distributions are apart from each other, then you know you did something, and you can actually see where the metrics are different. So that unlocks the upper layers of the pyramid. So how do I automate guardrails? Guardrails are this notion of that site speed thing. How do I find a way to alert on exception and see these performance hits early in my rollout without toil, without actually people having to be hypervigilant?
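The control-versus-treatment comparison can be sketched with a basic two-sample test. This is a simplified stand-in for a real stats engine: it uses a normal approximation (reasonable only for large samples) and ignores issues such as repeated peeking at the data.

```python
import math
from statistics import NormalDist, mean, variance


def compare(control, treatment):
    """Return (difference in means, two-sided p-value), normal approximation."""
    diff = mean(treatment) - mean(control)
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    if se == 0:  # both samples are constant; nothing to estimate
        return diff, (1.0 if diff == 0 else 0.0)
    z = diff / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return diff, p_value
```

If a busy day adds the same latency to both groups, the difference stays near zero and the p-value stays high. Only a shift that affects the treatment group alone produces a small p-value, and that is the noise-canceling effect.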
06Guardrails, release impact, and test-to-learn
So limiting the blast radius without manual heroics. So if you can actually have a stats engine watching the things you generally care about, errors, response time, unsubscribes, whatever, and have it alert you, you could push 10 or 100 times a day, and you're not having people have to go crazy paying attention. The system's doing it for you.
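A guardrail check along these lines might look like the sketch below. The metric names, the `Summary` shape, and the 0.01 threshold are assumptions; it treats every metric as "lower is better" and uses a one-sided normal approximation rather than whatever a production stats engine would do.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class Summary:
    mean: float
    stdev: float
    n: int  # number of observations in the group


def degraded(control: Summary, treatment: Summary, alpha: float = 0.01) -> bool:
    """True if treatment is significantly higher on a lower-is-better metric."""
    se = math.sqrt(control.stdev ** 2 / control.n
                   + treatment.stdev ** 2 / treatment.n)
    z = (treatment.mean - control.mean) / se
    p_one_sided = 1.0 - NormalDist().cdf(z)
    return p_one_sided < alpha


def check_guardrails(metrics: dict) -> list:
    """metrics: name -> (control Summary, treatment Summary).

    Returns the names of the guardrail metrics that should raise an alert.
    """
    return [name for name, (control, treatment) in metrics.items()
            if degraded(control, treatment)]
```

Run on a schedule against every active rollout, a check like this alerts only on exception, so nobody has to watch dashboards for each release.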
Now let's move on to measuring release impact. So here, we want to actually be in a situation where we know whether the thing we did makes a difference. Now, if you're iterating often, if you're achieving continuous delivery and you're using decoupling deploy from release to achieve flow and you're ship, ship, ship, ship, but you don't even know whether it's having an effect, it's very demoralizing.
It's been called a feature factory. And so this is not a world we want to create. We don't want to be in a world where we're just making people move faster, and we don't even know whether we're having an impact. When you have direct evidence of your efforts, you're more likely to get pride of ownership. You're more likely to have people, even if something goes wrong, they're going to be like, "Hey, let's fix that because I know it's not as good as it could be." So direct evidence of your efforts leads to greater pride. It's better for psychological safety, which is that we actually know whether we're hurting business or not. It's not rumors.
It's not someone's opinion. And then finally, test to learn. What you might traditionally think of as A/B testing. And here, what we want to accomplish is we want to be able to take bigger risks but in a safe way. I'll give you an example of that in just a second. And we want to be able to learn faster with less investment.
So one of the common misconceptions here, by the way, would be, oh, A/B testing. Well, I don't want to build two of every feature. More than once, I've had a product manager say to me, "Well, I don't want to use two story points instead of one." This is not what this is about. So first of all, what you might be A/B testing is status quo that you currently have, and should we add this new change?
That's not two versions of the code that you built new. That's just one new thing and seeing what its impact is. Second of all, there are ways to actually try different new things without having to create multiple versions of your software, and one of them is called dynamic configuration. And the idea here is that if you're using feature flags to decouple deploy from release, they can carry along a payload that's parameters that are specific to the population that's getting that flag. So if I've already determined that I'm dividing my users into three populations, I can say, you know what? For this population...
I'll give you a specific example that helps. So Speedway Motors is a car parts site online, and they have a recommendation engine. It's a third-party system that takes input parameters to determine how it comes back with recommendations. And they wanted to see how these would behave in the real world.
And so they set up dynamic config to send different sets of parameters to different cohorts of users and then observe what happens. They could then iterate on that experiment without a new release by just changing the values of those dynamic configuration parameters. Instantly, the user population is now onto a new experiment, which is that new set of parameters, and they're capturing data in sync with that.
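Dynamic configuration can be sketched as a payload attached to each treatment. Everything named below is hypothetical: the treatment names, the parameter names, and the in-memory dictionary standing in for a flag service. None of it is Speedway Motors' actual setup.

```python
import json

# Hypothetical flag definition: each treatment carries a JSON payload.
RECOMMENDER_FLAG = {
    "control": json.dumps({"max_results": 4, "weight_popularity": 0.5}),
    "variant_a": json.dumps({"max_results": 6, "weight_popularity": 0.7}),
    "variant_b": json.dumps({"max_results": 8, "weight_popularity": 0.3}),
}


def recommender_params(treatment: str) -> dict:
    """Parameters to pass to the recommendation engine for this cohort."""
    raw = RECOMMENDER_FLAG.get(treatment, RECOMMENDER_FLAG["control"])
    return json.loads(raw)


def update_payload(treatment: str, **changes) -> None:
    """Start a new experiment by editing parameters; no code is redeployed."""
    params = recommender_params(treatment)
    params.update(changes)
    RECOMMENDER_FLAG[treatment] = json.dumps(params)
```

The application code only ever calls `recommender_params`, so iterating on the experiment means changing data in the flag definition, for example `update_payload("variant_a", max_results=10)`.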
And the other example I want to give is Painted Door. Now, this is kind of a usability hack, which is you can actually build the entrance to a feature without building all the complexity in the back end. And when people click on it, you can either throw an error or better, you can say, "Hey, thanks so much for your interest in this new feature.
Annual plan is of interest to you. We're working on that. We'll get back to you." And the last thing I want to say while we're here, about taking bigger risks safely, is this: imagine you're a food delivery service and you know that if you ship the right food to people, they're more likely to buy more and like you and stay.
But you need to ask them more questions to make sure you know what they want. There's a company called Imperfect Foods in the United States. They had a sign-up flow, and they proposed to actually add more questions to the flow so that they could better cater to their customers. The concern would be that people would drop off and they wouldn't finish the sign-up. And so what they did was create a traditional flow, a slightly longer flow, and a significantly longer flow, and test them out side by side. And they found out that the longer one yielded consistently better results, and they were getting $7 or $9 US more per order from the people that went through that flow.
And so they fired it up. And so this was a great way to take a risk without having a big debate over whether we should do this or not, and what'll happen.
07Sustainable delivery and Split
So this is what sustainable software looks like. I believe that this new way of progressive delivery with automated data science in there actually creates a much more sustainable flow. And if you look here, traditionally, we think of the deploy and the release. And these yellow steps in the middle, those are only made possible by automating the ability to do the statistics as this rolls out. This is not about studying the data afterwards.
It's not about begging a favor from somebody to look at some data. This is something that happens automatically. That's the way we build pipelines. We want to automate things so that we can move it as many times as we want, and it does the same thing every time to make sure that we're operating consistently and effectively and efficiently.
So this is the only slide I'm really going to talk about Split Software in, which is who is Split Software and what's our deal? Well, in-house progressive delivery platforms paved the way for Split. In fact, our founders came from multiple shops where they had built these large complex systems, and when they went to other places where they didn't have them, they missed it, and they wished that they could just buy this, which is what they set out to build. So many companies have adopted parts of this progressive delivery, like feature flags, and some have even added some sort of sensors and try to do correlation, but only very few have figured out how to do the stats engine and kind of the system of record and the alerting. And those that did were spending tens of millions of dollars a year on making that happen, and they do it again, and they re-up because they love it. It's a competitive advantage. We don't all have the advantage of having $50 million or $30 million or $25 million a year to allocate to this, or spending two, three, four, five years to build it, and that is what the engineers that built Split set out to do, was make this just something you can subscribe to as SaaS.
So let's move on to Q&A with Slack. I'm looking forward to your questions. Also want to invite you to come to our booth. Usually, when we have an in-person show, we have great conversations with people about what they're working on and how it's gone, and we're going to kind of recreate that here. We're also holding a raffle to give away an Oculus Quest 2, so virtual reality headset. I hope you found this talk interesting and informative. I have a lot of sort of vendor-neutral content we can talk about. This is not all about Split.
Split is just one company that's kind of leading the way in making this productized.