Reducing Risk by Testing Every Change Where It Matters, in Production

Virtual US 2022

Dave Karow

Sr. Progressive Delivery Advocate · Split

Ariel Pérez

VP of Engineering - Measurement & Experimentation · Split

Testing in staging alone doesn’t give teams the confidence they need to move fast and reduce cycle times. No matter how much you invest there, it will never have the same data, activity, and signals that production has.

Instead, shift your focus towards making a wide range of testing in production strategies safe, efficient, and well-understood by your teams. Testing directly in production gives them a clearer understanding of how a change behaves, whether in infrastructure, the data layer, the front end, or anywhere in between. Pairing this with the ability to limit who sees a change in production means a superior developer experience and a better user experience.

What You'll Learn:

Specific strategies we’ve used to test in production safely with feature flags

Tap compare testing to validate functional parity of new services before cutting away from old

Mirroring/shadowing to exercise new services at full scale with low risk

Canaries to gradually expose real users and learn before full rollout

Config testing to explore optimal combinations of settings

Alpha and beta programs to safely test in production

How feature flags + data change the risk/reward balance in your favor, using real-time metrics, monitors, and alerts to quickly identify and rollback features having a negative impact

This session is presented by Split.

Full transcript

The complete talk, organized by section.

Dave Karow

Good day, everyone. My name is Dave Karow. I've spoken at DevOps Enterprise Summit a few times, and this year I thought I would do something a little different, which is to invite one of my colleagues, Ariel Pérez, who's a VP of Engineering for Measurement and Learning here. Ariel has some very relevant experience that I think you'll find useful. So I'm just going to go right over to Ariel.

Ariel Pérez

Thank you very much for that introduction, Dave, and thanks for having me here at the DOES Summit this year.

Just a little bit of my background: I've spent the last 20 years or so in the technology industry, in product and engineering roles at startups, at large enterprises, and most recently at JPMC, where I helped build out two of their big projects. One was their internal feature flagging platform at scale, and the second was helping launch the Chase UK bank across the pond. And now I'm over here at Split.

What we're going to talk about today, just a quick taste of where we're going: we're going to talk quite a bit about why testing in production is important and how that reduces risk, and dive into some specific strategies that we've used when it comes to testing in production. More importantly, we'll show you deep dives into actual examples of how we tested in production through two migrations. And then lastly, not everyone knows exactly where to start, so we're going to give you some tips on how to start along this feature flagging and testing in production journey.

So why test in production? Here's the hint and the TL;DR: pre-production testing is not enough. And one of the biggest reasons is because staging is not production, and it's not production in many ways.

First, the breadth and depth of staging is just not set up the same way. You have smaller clusters. You often have fewer services running, whether it's for a cost reason, a management reason, or because of integrations with outside partners. And your configurations are different.

When it comes to user behaviors, users in staging behave differently than production users. You have fewer users, they go through fewer scenarios, and you'll have much less complex interactions in staging. So you can't always replicate the user behaviors and the system behaviors that you have in production.

And lastly, your monitoring differs. You're going to get completely different signals in staging than you will in prod. Your thresholds are going to be different. They might be more sensitive than they are in production, and you'll have certain signals, certain alerts, that will never be emitted in staging because they don't get triggered the way they would in production.

One quote that I think makes this easy to take away comes from Cindy Sridharan, who wrote an awesome set of articles on how to test in production the safe way. She said, you know, none of this is meant to suggest that maintaining a staging environment is completely useless, only that, as often as not, it's relied upon to a much greater degree than necessary, to the point where at many organizations it remains the only form of testing that happens before full production rollout.

Dave Karow

People will sometimes say, testing in production, that sounds dangerous. But it's really a 180, which is if you don't find a way to safely test in production, when you go from staging to production, that's dangerous.

Ariel Pérez

Exactly. And that's the thing that we're going to talk about today. I want to start with this misconception of what testing in production is. Many of us over the last 10 years might have seen this meme from the most interesting man in the world, which says, "I don't always test my code, but when I do, I do it in production." And that's not what we mean. Our goal is to take it from this view of the meme to a new view of the meme: no, I always, always test my code, and then I test it again in production.

But here's the slight nuance: you can actually test your code quite extensively in a lower environment. In production, what you're testing for the first time is your system. You're not testing your code, you're testing your system. And that's the big key piece of testing in production. You're never going to be able to replicate that system in a lower environment.

So let's talk about the three distinct stages of production. Production isn't just this one endpoint that you reach, and now you're in production. You actually progress through production in stages. First, there's the deployment, and there are many things that you can do to test during the deployment. After the deployment, you've got your release, and this is when you're actually putting your code in front of your users and enabling it for them. And then there's post-release. That's the life of your software once it's in production and you're running it.

So when it comes to deployment, there are several strategies we can apply: integration tests, tap compare testing, load tests, shadowing, configuration tests, and dark launches. I'm not going to go through all of these. On the release side, a few that stand out to me are canary and monitoring. And then once you're actually running in production and you're a live system, there are so many different kinds of testing. You have profiling. Many of us are a lot more comfortable with the idea of monitoring and tracing, but you also have other things like A/B/n testing and dynamic exploration that come in very, very useful.

But now let's dive into some specific strategies that we've used, and I've used in my experience, when it comes to testing in production, and I'm going to highlight a few of them.

The first one is progressive delivery. Progressive delivery is the ability to roll out to an increasingly larger impact radius, with really fine-grained control over how you manage the risk of things going wrong while also enabling you to test. This goes generally through the progression that you see on here: you start with testing with your dev team, the people who built it and know best what they're looking for, and they're going to tinker around and play with it in production. After you get past that phase, it's probably a good time to start bringing in your QA team, if you have one, to say, now actually let's start testing this in production from a perspective of trying to cover all the cases, trying to cover edge cases, and get a comprehensive view of what you've just delivered.

From there you expand to a larger set of internal users. The important part here is to get a broader set of views, but also people who weren't involved in the development. Ideally, they come at it from a completely fresh perspective. Notice that none of the people I've just talked about are your true end users. So you're still very safely testing in production, but that's not where it stops. Eventually you want to put it in front of customers. You proceed from there to beta testers. These are customers that are very happy to test out things that might not be ready yet, but they want to get a first look. Some of these people are very progressive and really want to test it out, and those are your best users to get early feedback in production. You move from there to increasingly larger percentages of users to get a larger set of feedback. So 5% of users, 50% of users, and eventually you roll it out to everyone. All along the way, you're managing the risk of something going wrong while also optimizing for getting the best feedback and the right feedback.
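The progression above (named groups first, then growing percentages) can be sketched in a few lines. This is an illustrative sketch only, not how Split or any other flagging product implements targeting: the group names, the `ROLLOUT_STAGES` list, and the `bucket` and `is_enabled` helpers are all assumptions made up for this example.

```python
import hashlib

def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def is_enabled(user_id, user_groups, flag_name, allowed_groups, percent):
    """The flag is on if the user belongs to an allowed group,
    or if the user's stable bucket falls below the rollout percentage."""
    if set(user_groups) & set(allowed_groups):
        return True
    return bucket(user_id, flag_name) < percent

# Hypothetical stages: (groups that always see the feature, percent of everyone else).
EVERYONE_INTERNAL = {"dev", "qa", "internal", "beta"}
ROLLOUT_STAGES = [
    ({"dev"}, 0),
    ({"dev", "qa"}, 0),
    ({"dev", "qa", "internal"}, 0),
    (EVERYONE_INTERNAL, 0),
    (EVERYONE_INTERNAL, 5),
    (EVERYONE_INTERNAL, 50),
    (EVERYONE_INTERNAL, 100),
]
```

Because the bucket is derived from a hash of the flag name and user ID, a user who sees the feature at 5% keeps seeing it at 50%, so raising the percentage only adds users and never flips anyone back.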

The next one I'll talk about is what's called a dark launch. Dark launch was a technique pioneered generally by Facebook. I think people talk about Facebook doing this in the early days of social media platforms. This is basically deploying new features without enabling them in production. Just a simple way of thinking about this is you put your code in prod, but no customers are seeing it. That's a dark launch, and there are many, many things that you can do even at that stage of the deployment, and we're going to talk about that in a bit.

Moving on to slightly more advanced strategies: something called tap compare testing. Tap compare is a testing technique that allows you to test the behavior and performance of the new service by comparing the results against the old service. There are many different ways to implement this, and that new service doesn't have to be exposed to customers. What you're primarily doing is: let me send the same requests to the old and the new system, and then let me compare the responses. At the minimum, you're looking for responses to be identical, or to be different only where you expect them to be different. That's what tap compare testing helps you do.

Something else that's often paired with tap compare testing is called traffic mirroring. Traffic mirroring is a strategy where you have your new service and your old service both in production, and you replicate all the traffic that goes to your old service. You send it to your new service, but you fire and forget. The key piece there is the fire-and-forget aspect. You do not want to impact the existing running system, but you do this to gather telemetry from your new proposed version to understand: how does it act under load? How does it act under stress? Is there anything else in the system that potentially goes wrong when you send this traffic to the new version of the service? But you do this all dark. This service is not impacting the actual flow, and that's the key piece. We're going to dive into that some more later.

The last two we're going to talk about: one of them is canary releases. This comes from the concept, way back in the day in mining, of a canary in a coal mine. You roll it out to a small subset of your users and you look for a signal, an early warning signal, that something might be wrong. That's what a canary release is, and that's a big part of the progressive delivery flow I talked about earlier. You release it out to a canary population.

We're also going to talk about configuration testing. This is something that becomes very important because, as I said before, your staging system is never, ever configured like prod. It just can't be. Production is the first time you actually test your configuration. It's important to have the ability to test your configuration and different variations of that configuration to understand how your system behaves with different parameters, with different attributes. That's the key piece: testing different combinations of configuration in production to find the right one. Basically, you can think about it as tuning.

Dave Karow

Yeah, quick point on that before going on, Ariel, which is an important distinction. This isn't: I'm loading the configuration, I'm booting my servers, I'm running for an hour, and then I'm stopping, changing a variation, restarting my servers, and running for another hour. This is so decoupled that literally you've got the same code in there. You're just changing the configuration on the fly. For people that haven't been down this road yet, this is a live change that takes effect at that moment without having to do a deployment.

Ariel Pérez

Exactly. That's the key piece: do it without having to do a deployment. Obviously, it's not lost on us that not every system can behave that way. Not every system can react to that. But again, it's up to you how far you take this, and the more flexibility you give yourself to very quickly update and validate these configurations, the more it allows you to manage that risk in production.
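To make the "live change without a deployment" point concrete, here is a minimal sketch of code that reads its configuration on every request instead of once at boot. The `LiveConfig` class, the `batch_size` setting, and `handle_request` are hypothetical names for illustration; in practice the values would typically be pushed from a feature flag or configuration service rather than updated in process.

```python
import threading

class LiveConfig:
    """A tiny in-memory config holder that can be changed while the service runs."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._values = dict(initial)

    def get(self, key, default=None):
        with self._lock:
            return self._values.get(key, default)

    def update(self, **changes):
        # In a real system this would be triggered by the flag/config service.
        with self._lock:
            self._values.update(changes)

config = LiveConfig({"batch_size": 100})

def handle_request(items):
    """Split work into batches using whatever batch size is configured right now."""
    size = config.get("batch_size", 100)
    return [items[i:i + size] for i in range(0, len(items), size)]

# Same code, no restart: the next request simply picks up the new value.
before = len(handle_request(list(range(250))))  # 3 batches at size 100
config.update(batch_size=50)
after = len(handle_request(list(range(250))))   # 5 batches at size 50
```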

Then I've got two more examples, and this is where we talked about really leveraging your internal users and external users. Alpha testing: that's a really key way of testing in production and getting internal users who aren't the people who are involved in actually building the system to start kicking the tires. It's amazing how much feedback you can get from other folks who aren't involved, who are looking at this in a new way. It's closer to your production users, but not exactly so close that you might be worried that if the system has a problem, you're going to actually have a reputational loss or a brand loss or a customer loss.

Lastly, there's the beta program. You do find some users that are more than happy to be beta testers, and often the implementation we see is just a form saying, I want to opt in to beta testing, and your system adds those people to a beta group so that every time you roll out, you go through the beta group and gather feedback from them directly. A key thing to realize here: it's very important to have a very close relationship with those beta users. There are other pieces that get involved with this, like a much more qualitative, much more hand-holding aspect to a beta testing program. These people need to get something in return for their time, and they expect a lot of responsiveness from the internal team.

But now I want to dive in a little more into specific things that we can do and examples of how we did it. The key part here is how do feature flags and data play a role when it comes to testing in production, because primarily throughout all this talk, we're going to talk about how we use feature flags in particular to enable this kind of testing.

So the first deep dive we're going to go to is our Kinesis to Kafka migration, and our primary goal here was to reduce the overall cost for our data pipelines. We wanted to migrate, however, while meeting the following constraints: we wanted to lose no events, we wanted to see no increase in latency, and we wanted to see no decrease in throughput. Of course, the key thing is we wanted to achieve this with zero downtime, which can be very daunting and challenging.

So we decided to execute this migration primarily through the use of traffic mirroring, which I mentioned earlier. This particular strategy allowed us to compare costs between the new and old implementations while also validating that we weren't losing any events. And more importantly, through the use of feature observability, we could actually confirm that the latency and throughput had indeed not degraded, and again, all while maintaining zero downtime through that migration.

So let's talk a little more about traffic mirroring with feature flags. We'll start with the current flow. It's pretty straightforward. We send an event from service A to Kinesis, which is read by service B and then written to the data store after processing. We need to update both service A and B, however, to understand what treatment a message is in by using a correlation ID as the assignment unit. This piece is core to implement the mirroring. We need to know for every event that passes through: is it in treatment A or is it in treatment B, or in this case, is it in the Kinesis treatment or not?

Now is when the fun starts. We'll have a new treatment called Kinesis with mirroring. It results in the same flow as the Kinesis treatment: post to Kinesis, read by service B, and written to the data store after processing. But now we have to update both services to post to and read from Kinesis for either the Kinesis or the Kinesis with mirroring treatment. To this point, we're still not mirroring. This is where we do the mirroring now. We need to post to Kafka for the Kinesis with mirroring treatment, but we want to make sure that we don't impact the current functionality. So we wrap this in try/catch.
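A rough sketch of the producer side (service A) at this stage might look like the following. It is not the code from the talk: the treatment strings and the `post_to_kinesis` and `post_to_kafka` callables are placeholders passed in by the caller, and the treatment is assumed to have already been resolved from the correlation ID.

```python
import logging

KINESIS = "kinesis"
KINESIS_WITH_MIRRORING = "kinesis_with_mirroring"

def publish(event, treatment, post_to_kinesis, post_to_kafka):
    """Post to Kinesis as before; additionally mirror to Kafka when asked to."""
    # Tag the event so the consumer knows which treatment it belongs to.
    tagged = dict(event, treatment=treatment)

    # Primary path: unchanged, and allowed to fail loudly as it always has.
    post_to_kinesis(tagged)

    if treatment == KINESIS_WITH_MIRRORING:
        try:
            # Mirror path: fire and forget.
            post_to_kafka(tagged)
        except Exception:
            # A Kafka problem must never affect the existing Kinesis flow.
            logging.exception("mirror post to Kafka failed; primary flow unaffected")
    return tagged
```

The design choice that matters is the asymmetry: errors on the existing path still surface, while errors on the mirrored path are logged and swallowed.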

Service B, which now reads the same event from both Kinesis and Kafka, needs to figure out what to do with the Kafka messages. For the Kinesis with mirroring treatment, it just discards the Kafka copy of the event. Now for completeness, we also want to discard any Kafka events that have the Kinesis treatment.

Now once we've done this, we've got the mirroring happening, and we gradually raise this to larger and larger percentages of all the traffic to the point where we can get a clear picture of the Kafka costs so we can compare them against the Kinesis baseline. Once we're satisfied that the flow is performing as expected after mirroring all the traffic, we'll add the Kafka treatment, and that Kafka treatment will only post events to Kafka. When service B reads those, it'll actually write them to the data store instead of discarding them. Now, of course, to close that loop, we also have to get service B to discard any events that it reads from Kinesis with the Kafka treatment. This is how you actually perform the entire migration from Kinesis to Kafka.
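The consumer-side rules described above boil down to a small decision table: for each treatment, exactly one source is allowed to write to the data store, and the copy from the other source is discarded. The sketch below is an assumption-laden illustration; the treatment and source names are invented, and defaulting unknown treatments to the legacy Kinesis path is a choice made here, not something stated in the talk.

```python
# Which source service B should trust for each treatment (hypothetical names).
WRITE_SOURCE = {
    "kinesis": "kinesis",
    "kinesis_with_mirroring": "kinesis",  # the Kafka copy is only there for telemetry
    "kafka": "kafka",
}

def should_write(source: str, treatment: str) -> bool:
    """Return True if service B should persist this event, False to discard it."""
    # Assumption: an unrecognized treatment falls back to the legacy Kinesis path
    # so that events are not silently dropped.
    return WRITE_SOURCE.get(treatment, "kinesis") == source

def handle(event, source, write_to_store):
    """Process one event read from either stream."""
    if should_write(source, event.get("treatment", "kinesis")):
        write_to_store(event)
        return True
    return False
```

With this table, every event is persisted once no matter which stage of the migration it passes through.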

Now, there's the other side of it. We still haven't talked about the latency and throughput. To measure the impact on latency and throughput, we feed telemetry from the services that are writing to and reading from Kafka and Kinesis for each event. The key thing that allows us to do that is adding a timestamp to every event. That way, we know when it was generated. On the service B side, when we process the event, we calculate the latency by subtracting the event's timestamp from the current time, and then we fire a tracking event to Split.

This enables us to create metrics on the Split side that track latency, which is the event latency in milliseconds, and throughput, in events per second, for each of the Kinesis and Kafka implementations. Once we have these metrics in Split, we add monitors to alert if either of these metrics changes beyond a certain threshold. And the most important part is that we do so while determining causality. We know for sure it was the migration that caused the change and not some random variation.
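The timestamp bookkeeping described above can be sketched as follows. The field name `created_at_ms`, the event name `event_latency_ms`, and the `track` callable are placeholders chosen for this example; `track` stands in for whatever client sends tracking events and is not a real SDK signature.

```python
import time

def _now_ms():
    return int(time.time() * 1000)

def stamp(event, now_ms=None):
    """Producer side: record when the event was generated."""
    stamped = dict(event)
    stamped["created_at_ms"] = _now_ms() if now_ms is None else now_ms
    return stamped

def latency_ms(event, now_ms=None):
    """Consumer side: current time minus the event's creation timestamp."""
    now = _now_ms() if now_ms is None else now_ms
    return now - event["created_at_ms"]

def record_latency(event, treatment, track, now_ms=None):
    """Compute the latency and hand it to a tracking callable (hypothetical)."""
    value = latency_ms(event, now_ms)
    track("event_latency_ms", value, {"treatment": treatment})
    return value
```

Note that this assumes the clocks of the producing and consuming services are reasonably synchronized; otherwise the subtraction measures clock skew as well as pipeline latency.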

Now moving on to another migration. In this case, we moved from S3 to Mongo. Our primary goal here was to support more complex querying, but of course we wanted to do that within other constraints. We didn't want to lose any data. We wanted to reduce latency, and at a minimum we wanted to increase write throughput. So if we had a lot of updates, we wanted to make sure that we could write more quickly and more frequently. Again, we wanted to achieve this with zero downtime.

In this case, the most important strategy we wanted to implement was actually tap compare testing via traffic mirroring. So now we're actually taking two strategies and putting them together. This strategy in particular allowed us to validate whether the latency and throughput had improved while also ensuring that we have lost no events. Again, this is through the use of feature observability.

So starting with our current flow, we write to S3 from our service. Just like before, we create a treatment for mirroring: S3 with mirroring. In that treatment, the key piece is to also write to Mongo. If we recall from before, we want to ensure that we don't impact the current flow while mirroring, so we wrap the Mongo write in a try/catch. So far, this is just mirroring.

Tap compare now comes in. We need to actually get both responses back from S3 and from Mongo. So after we write, we read both responses and we compare them in a custom compare function. In this case, the compare function is looking for material differences in the responses. It's going to ignore things like IDs or timestamps, especially if they're not relevant to the business logic. If we detect a difference between them, we log that difference to our logging stack, and we can create an alert on it and see if that alert ever fires. Hint: you don't want to see this alert go off.
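A custom compare function of the kind described above could look roughly like this. It is a simplified sketch that only compares top-level fields of two dictionaries; the ignored field names, `material_diff`, and `tap_compare` are illustrative assumptions, and the `read_old`, `read_new`, and `log_diff` callables stand in for the S3 read, the Mongo read, and the logging stack.

```python
IGNORED_FIELDS = frozenset({"id", "_id", "created_at", "updated_at"})
_MISSING = object()  # marks a field that is absent on one side

def material_diff(old, new, ignored=IGNORED_FIELDS):
    """Return {field: (old_value, new_value)} for fields that differ,
    skipping fields that are not relevant to the business logic."""
    keys = (set(old) | set(new)) - set(ignored)
    diff = {}
    for key in sorted(keys):
        a, b = old.get(key, _MISSING), new.get(key, _MISSING)
        if a != b:
            diff[key] = (a, b)
    return diff

def tap_compare(read_old, read_new, log_diff):
    """Serve the old store's response; compare against the new store on the side."""
    old = read_old()
    try:
        diff = material_diff(old, read_new())
        if diff:
            log_diff(diff)  # an alert can be built on top of this log line
    except Exception as exc:
        # Problems on the new path are reported, never raised to the caller.
        log_diff({"error": repr(exc)})
    return old
```

The caller always gets the old store's response, so a mismatch or an outage on the new store shows up only in the logs.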

We gradually raise this to increasingly higher percentages of all traffic until we're satisfied we've covered enough use cases. At this point, we turn it up to 100% and make sure that no traffic is showing any comparison differences. Once we feel comfortable that there are no differences, we add the Mongo treatment, and that one will just write to Mongo and never write to S3.

So again, we want to specifically measure the impact on latency and throughput. Just like the Kinesis to Kafka migration, we feed telemetry from the services for each request. In each of these, we capture the response time and the type. We want to know whether the operation is a read or a write, and we fire tracking events for those to Split. From each of these, we create a few metrics in Split: DB operations per minute, read latency, and write latency. Split can then tell us whether the movements in those metrics can be attributed to the migration itself. It can tell us if DB operations per minute went up as part of the migration, and if that's actually attributable to Mongo. It can tell us if the read latency went up or down and the write latency went up or down, so we can now confirm that our migration not only improved our querying capabilities, but also actually had an impact on our throughput and our latency for both reads and writes.
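As a rough picture of what those tracking events carry, the sketch below groups per-request records by treatment and operation type and averages the response time. The record fields (`treatment`, `op`, `response_ms`) are assumptions for illustration. This only summarizes the data; it does not perform the statistical attribution that the speakers credit to Split.

```python
from collections import defaultdict
from statistics import mean

def summarize(events):
    """Average response time in ms per (treatment, operation) pair.

    `events` is an iterable of dicts with the hypothetical keys
    'treatment', 'op' ('read' or 'write'), and 'response_ms'.
    """
    grouped = defaultdict(list)
    for event in events:
        grouped[(event["treatment"], event["op"])].append(event["response_ms"])
    return {key: mean(values) for key, values in grouped.items()}

sample = [
    {"treatment": "s3", "op": "read", "response_ms": 40},
    {"treatment": "s3", "op": "read", "response_ms": 60},
    {"treatment": "mongo", "op": "read", "response_ms": 10},
    {"treatment": "mongo", "op": "write", "response_ms": 20},
]
summary = summarize(sample)
```

A difference in these averages is only a starting point; deciding whether it is real or noise requires a significance test over the two distributions collected during the same time window.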

Dave Karow

Let's tease something out of that, Ariel. You made an important point: we're trying to get to causation, not just correlation.

Ariel Pérez

Exactly.

Dave Karow

So if you were partway through this test and there was a spike in your demand on your business, it's going to be going through both pipelines. This is what's so important, because if you run a test on Tuesday and run a test on Thursday and compare how the metrics looked, you're comparing an apple and an orange. But you're running these in parallel, and then Split is doing the math on comparing the distribution of response times between each of the two paths for the same time period.

Ariel Pérez

Right, literally. And that is why you can make that claim that we're actually getting to causation instead of just correlation, because literally these are the same events passing through both pipelines at the same moment in time, at the same business conditions.

Dave Karow

And then we're looking at what's different.

Ariel Pérez

Exactly.

Dave Karow

I think that's important. The thing people always get: it's kind of like an A/B test, but it's not. It's an engineering-driven use case where you're literally subjecting two sets of flows to the same stuff being presented two different ways, and then you're comparing the outcome.

Ariel Pérez

Yeah, and I think the key thing there where you're trying to get at is, it doesn't feel like an A/B test, but underneath, the math, the statistics are exactly the same to determine causality, to say, no, I know for a fact it was the Mongo change. It's a statistically sound conclusion.

Awesome. So hopefully you got a lot out of those two examples of how to actually implement migrations and these particular testing in production strategies with feature flags. Not everyone is ready to get on this journey, but if you feel like you're not ready, you feel like you don't know any of this, and this all sounds so far off, realize that you're not alone and you're in good company.

When we start looking at feature flagging and experimentation and monitoring and alerting, we're all at different stages. It's important to understand and explore which best practice to use at which stage. If you look through this graph, thankfully the industry has gone to a place today where everyone pretty much knows about feature flags, so that's great. But then you start moving up from there: putting just some features behind a toggle, or almost all your features behind a flag. Going even further from there, feature flags with data to then measure impact, and finally full-blown experimentation, monitoring impact, optimizing impact. Different companies are in different places, but there's a lot of hope right now that more people are getting more comfortable with these ideas and these concepts and how to use them.

So we're all on the feature flagging journey. The key thing is how to make informed decisions every step of the way. So many of us start with just basic flags on and off. Engineering teams are quite familiar with this concept of an on and off toggle. This is simply 100% on, 100% off. As you get more progressive and evolve through this journey, you start doing feature flag progressive rollouts. Here we often suggest taking that on/off toggle to the next step and rolling it to increasing percentages to eventually get to 50/50.

The 50/50 percentage is very important, because that's where you really get a statistically rigorous test. But by doing controlled rollouts and adding data events into Split, you can actually start quantifying the impact of every feature on engineering and product metrics without risk and without releasing it to 100% of your users. Eventually, you start growing out to the place of managing this at scale. You start wanting to use the feature flags to increase velocity. You start monitoring code quality and performance when you have new releases, and with rollout boards you can start seeing and managing multiple features at once.

Lastly, as you grow in comfort, you get to the place where you're experimenting with every feature. It's all about ideation and validation. You can smoke test an idea to reduce cost. You can do painted-door tests. You can actually understand whether a user really cares about a feature. Our most advanced users really start doing extensive A/B testing. It's an experimentation process that compares two or more variations in a randomized test. So the more mature you are, the more your experimentation roadmap actually becomes your development and product roadmap. You start testing and validating every idea and confirming what works and what doesn't work. But again, as I said before, we're all on this journey. You might be at different places here, and you figure out the ways to move along this journey.

Dave Karow

One thing that's interesting about that is that these are actually fairly fluid boundaries. Our customer success group came up with a very innovative way to look at this, which is that it's not really a left-to-right progression. We've drawn it left to right on this slide, but in reality there's kind of a maturity continuum that has multiple axes. I guess it's a radar chart, where you are further or not along these different avenues. Have you achieved a sort of decoupling thing? Are you doing that a lot? We can tease that out more with people if they check in with Split people at the event. But I think it's important to know that you might actually be doing measure and learn, but there may be some aspect of governance or something that you really haven't gotten to yet. Maybe you started with a small team and now you're like, this worked great, now I want to go to bigger teams. Well, now you have more issues with governance, and how am I going to manage that? These different parts of the maturity kind of ebb and flow, and the nirvana is you've filled out the whole circle on all the axes. But I think it's very important people don't think that it's strictly a left-to-right thing.

Ariel Pérez

Awesome. Yeah, that's right. You're moving around this continuum, and you might be moving further right on some things and you're still further left on others. This is just a general progression of the things that you start seeing as you get more mature.

And then I think, as Dave just alluded to, you're not on this journey by yourself. The key piece to remember, and this is the thing we always lose when we talk about technology often: technology is the easy part, and the hardest part is people. How do we bring people along this journey? What will differentiate some companies from others is how many resources we provide to those people to actually help them along that experimentation journey, that maturity curve of monitoring and learning and feature flags and data.

At Split in particular, we have our customer success managers that are with you every step of the way, helping you understand whether your particular implementation is fit for your particular organization and your goals, and help bring you along that curve. Our integration advisors ensure that you're integrating correctly, you have the right system set up to really have the right data feeding in and out. The experimentation advisors are amazing resources to help you define and design experiments, ensuring that you're maintaining statistical rigor and actually measuring the thing you want to measure.

But last thing: not everyone has the ability or the time to really sit with other folks, or the desire to sit with other people. For that, we have our own LMS called Split Arcade, where you can learn at your own pace, and anyone can join in and use that as an amazing resource to understand where they are in their journey, and what they need to dive into more deeply to get certified and become much more knowledgeable in each of the best practices. So there are so many ways. It's tailored to you, it's fit to you, but the key piece is you're not alone in this journey.

Dave Karow

Yeah, I would say one really cool thing about Arcade is there's definitely this whole aspect of it. A couple things: one is it's actually built for adult learners. It's a curriculum. It's not just a collection of tips. It's literally a curriculum people can go through. One of the things we hear over again is that the exercises let you safely play with the ideas, not in your product, not in production, not even in staging. The ability to sandbox and try things out in a very low-risk way actually increases people passing through the curriculum because some people are a little afraid of breaking something. That's really cool.

And the other thing was that I was talking with someone the other day, and they said that as they went from one team to 14, you've got some knowledge transfer typically you'd have to do, internal sharing practices. Literally they just shared the login for the Arcade on Slack and watched people pop up and sort of go onboard themselves. We're busy. Everybody's busy, right? I think that the work they did there was pretty cool.

Ariel, I really appreciate your taking us into the deep examples. Hopefully people get a sense that these are malleable tools. There are patterns that repeat, and once people get a handle on these essential patterns, there's a lot of power.

Ariel Pérez

It is indeed. So thank you very much for your time. I hope you all got a lot out of that.

Dave Karow

Hopefully you get to talk to the Split people at the booth later. Have a great day.