Advanced Feature Flagging: It's All About the Data

Europe 2022

Sr. Progressive Delivery Advocate · Split

Progressive Delivery is the practice of decoupling deploy from release, allowing changes to be safely pushed all the way to production and verified there before releasing to users. Selectively dialing up and down the exposure of code in production without a new deploy, rollback, or hotfix is the foundation of Progressive Delivery, but the higher-level benefits of safety and fast feedback come from layering practices on top of that.

Whether you are new to Progressive Delivery or are already practicing some aspect of it, you'll learn or refresh the basics and then come away with a powerful model for layering higher-value benefits on top of that foundation.

Full transcript

Dave Karow

Hey everybody, welcome to Advanced Feature Flagging: It's All About the Data.

I'm Dave Karow, and I'm looking forward to having a conversation with you in about 18 minutes, give or take a few. I'm going to go through this slide presentation pretty quickly. My goal is to get us all on the same page, and then we'll have a really good conversation afterwards.

We'll do three sections. We'll review feature flags to make sure we all understand what they are and how they're used. Then the meat of why this session is important is about the data: how do we make sense of the data? We'll sharpen our tools about why how you measure matters, and why you shouldn't believe everything you see in data. Then we'll talk about how to put the flags and the data together to get something that is really an experimentation platform.

First, a recap on feature flags. Many of you are already using feature flags. They're a tool that empowers your organization to separate code deploy from feature release. For those who aren't familiar, a feature flag is a simple if/else statement. It's usually powered by configuration, a service, or an external tool, so who gets the code it wraps can be changed without doing a new deployment. You can target subsets of your population. Usually flags are targeted at users, but they can key on session, service, request, or anything applicable to the code change you're making.
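
A minimal sketch of that if/else, in Python. The in-memory `FLAGS` store, the flag name `new_checkout`, and the `is_enabled` helper are hypothetical stand-ins for whatever configuration, service, or tool actually powers the flag.

```python
# Hypothetical in-memory flag configuration. In practice this would come from
# a config file, a service, or an external tool and change without a deploy.
FLAGS = {
    "new_checkout": {"enabled": True, "allowed_users": {"alice", "bob"}},
}


def is_enabled(flag_name: str, user_id: str) -> bool:
    """Return True if the flag is on for this user; default to off."""
    flag = FLAGS.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False  # unknown or disabled flags fall back to the old path
    return user_id in flag["allowed_users"]


def checkout(user_id: str) -> str:
    # The flag itself is just an if/else around the new code path.
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    else:
        return "old checkout flow"


print(checkout("alice"))
print(checkout("carol"))
```

Flipping `enabled` to `False` in the configuration sends everyone back to the old path without touching the deployed code.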

Let's look at phases of a rollout. At the beginning, deploy doesn't cause any exposure to users. First we deploy: the feature flag is turned off and nobody is getting the feature. Then we gradually ramp up. At the beginning, we might ramp up only to employees; before that, we would have allowed access to the dev and test team. We might do some dogfooding with employees, and then start rolling out to customers, maybe with as little as 1% or 2% or a little more. Our goal there is error mitigation. We're trying to find bugs and problems we didn't already find in testing by doing testing in prod. We're even doing a little bit of that with actual customers because, no matter how much testing in prod we do, we're probably going to miss something that a real user will run into, and we want to run into that early before we get to a lot of users.

Then we're going to ramp up. Understanding the effect of your change requires lots of data. If you really want to prove whether a change you've made in code is making a difference in how users behave, the more data you can get, the better. That's called maximum power ramp: basically a 50/50 split. Then we might pause along the way to do scale mitigation. We might get to 60%, 70%, or 80%, and hold for a while through a peak traffic period to make sure we can handle scale, that we don't have bugs, and that we're making the user behavior change the way we hoped. Then we're released; then we're out.
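
One way to picture these phases is as a plan written down as data. This is only a sketch: the `Stage` type, the specific percentages, and the rule of dropping back to the flag-off stage when a check fails are all assumptions for illustration, not a prescribed process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    audience: str    # who is eligible at this stage
    percentage: int  # share of that audience exposed
    goal: str


# Illustrative plan; the exact percentages are assumptions, not prescriptions.
ROLLOUT_PLAN = [
    Stage("deploy", "nobody", 0, "ship code with the flag off"),
    Stage("internal", "dev and test team", 100, "test in production"),
    Stage("dogfood", "employees", 100, "find issues before customers do"),
    Stage("error mitigation", "customers", 2, "catch bugs with a small blast radius"),
    Stage("maximum power", "customers", 50, "measure impact with the most data"),
    Stage("scale mitigation", "customers", 80, "hold through peak traffic"),
    Stage("released", "customers", 100, "feature is fully out"),
]


def next_stage(current: Stage, healthy: bool) -> Stage:
    """Advance one stage if checks pass; otherwise return to the flag-off stage."""
    if not healthy:
        return ROLLOUT_PLAN[0]  # assumed policy: turn the flag off on trouble
    index = ROLLOUT_PLAN.index(current)
    return ROLLOUT_PLAN[min(index + 1, len(ROLLOUT_PLAN) - 1)]


for stage in ROLLOUT_PLAN:
    print(f"{stage.name}: {stage.percentage}% of {stage.audience} ({stage.goal})")
```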

Let's get into the meat of the talk, which is how you measure matters. I want to focus on this because it's a stumbling point for a lot of people who are used to using traditional charts and graphs. When you get into a gradual rollout world, a lot of this breaks down. Your metrics don't really work the same way when you're going out to 5% of your population.

Here's a systems dashboard from our own product. We were rolling out some changes, and suddenly we saw huge spikes. We spent a lot of time triaging: what have we changed, why is this happening, and what's going on? We'll come back to this in a minute, but suffice it to say that things weren't quite as they appeared.

Let's look at a more typical situation. Here, when a feature was taken to 100%, there was a huge spike up in something and a spike down in something else, probably latency and throughput. That's obvious to see. One of the first things we did at Split was overlay feature flag changes onto our graphs. When it's 100%, you can kind of figure out what happened. But think about that: we rolled out to 5% first, and you could not see the differences in the graphs. There was no meaningful difference versus the noise before them. That's why, when you're doing gradual rollouts, you can miss useful, important data if you're not looking through the right kind of lens.
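
A little arithmetic shows why the 5% step disappears into the noise. The numbers below are made up for illustration: a feature that adds 20% to a 200 ms baseline latency for the users who get it.

```python
def blended_metric(baseline: float, treated: float, exposure: float) -> float:
    """Aggregate metric when a fraction `exposure` of traffic gets the feature."""
    return (1 - exposure) * baseline + exposure * treated


# Illustrative numbers only: the feature adds 20% to a 200 ms baseline latency.
baseline_ms = 200.0
treated_ms = 240.0

for exposure in (0.05, 0.50, 1.00):
    blended = blended_metric(baseline_ms, treated_ms, exposure)
    shift_pct = 100 * (blended - baseline_ms) / baseline_ms
    print(f"{exposure:.0%} exposed: {blended:.1f} ms ({shift_pct:+.1f}% vs. baseline)")
```

At 5% exposure, a 20% regression for exposed users moves the overall average by only about 1%, which is easy to lose in normal variation. Comparing exposed users against unexposed users directly keeps the full 20% difference visible.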

Correlation is not causation. Ice cream sales and shark attacks both peak in the summer, but they're not related to each other. They don't cause each other. We'll come back to that. It turns out that the dashboard I showed you wasn't a feature change. It was a customer of ours undergoing a distributed denial-of-service attack. We wasted a ton of time trying to figure out what was going on and what was wrong with our system or code. It turned out nothing was. It would have been nice to have clearer information about what was going on.

Just looking at the dashboard is not going to be enough. Especially when other things are happening in the world, whether changes in your product, marketing campaigns, a global pandemic like what we're having now, or just nice weather, it can change what happens. We don't want to just visually look at graphs, or even just deal with arbitrary threshold alerts. We want a smarter way to look at it.

Science gives us a smarter way to look at it: the idea of a randomized controlled trial. If you randomize a population across a treatment and a control and then watch the differences between them, because you're randomly picking people, you'll have East Coast people and West Coast people, young people and old people, and anything else that could cause variation equally distributed. This lets you distribute the effects of outside factors between the two samples evenly, so that the things that are different are actually what you're measuring.
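
To see the balancing effect of randomization, here is a small simulation. The population, its attributes, and the fixed random seed are all invented for the sketch; the point is only that a coin flip per user produces two cohorts that look alike on factors nobody controlled.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: each user has attributes we do not control.
users = [
    {"id": i, "coast": rng.choice(["east", "west"]), "age": rng.randint(18, 80)}
    for i in range(10_000)
]

# Random assignment: a coin flip per user, independent of any attribute.
for user in users:
    user["cohort"] = rng.choice(["treatment", "control"])


def summarize(cohort):
    """Size, East Coast share, and mean age for one cohort."""
    members = [u for u in users if u["cohort"] == cohort]
    return {
        "n": len(members),
        "east_share": sum(u["coast"] == "east" for u in members) / len(members),
        "mean_age": sum(u["age"] for u in members) / len(members),
    }


for cohort in ("treatment", "control"):
    s = summarize(cohort)
    print(f"{cohort}: n={s['n']}, east={s['east_share']:.1%}, mean age={s['mean_age']:.1f}")
```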

When you do this, the first thing you do is take the body of events that are happening and do attribution. You assign them to the different cohorts: people who had it, people who didn't have it, and that kind of thing. When you do, you may see patterns emerge. Our goal is to end up with a distribution of data. If the distributions are mostly overlapped, there's probably no difference. If the distributions are offset, the statistical analysis is telling us there's really a difference.

Stats is kind of weird; it's backwards. Statistical tests basically seek to disprove that nothing happened. They start from the assumption that you likely didn't accomplish anything and ask you to prove otherwise. When you pass a test, you're saying the difference you saw is bigger than random chance alone would be likely to produce. We can talk about that later if you really want to go deep in the stats geekery.

There are many ways for teams in modern apps to achieve comparative analysis. Most modern teams have a lot of telemetry coming in from their systems. They have a lot of data, and dashboarding tools have the ability to tag data from different releases as they're passing through different experiences. You can segment that way, and it's powerful and useful. Some teams have more than that: an in-house analysis warehouse or BI system. If you can pull data into that, you can do ad hoc queries and some pretty sophisticated stuff. The issue is that both of these are kind of ad hoc and not very statistically rigorous.

That's why more and more teams are moving toward an experimentation platform. If you look at what LinkedIn and Facebook and all those companies have been doing forever, they introduced experimentation platforms to make this process more reliable and meaningful.

That gets us to the third section of the talk. If you take feature flags that control who gets what, and you take the right kind of data and parse it in intelligent ways, you effectively end up with an experimentation platform, whether you call it that or not. I want to share some components and guidelines because you may want to assemble what you already have into this, or search out what you can bolt in that will do this.

The first piece is targeting, which powers the feature flags and records who got what. It's not just assigning people to a thing; it's also keeping track of who got what. Telemetry brings in data, whether data you're instrumenting explicitly or data you're already capturing and can gather. The statistical engine makes sense of the data. A management console makes it accessible to anybody who needs to get to it, instead of it being buried in a database somewhere.

Targeting needs to be fast. It cannot be in the way of the user experience. It has to happen super fast and not be a bottleneck. You have to be able to randomize because, to do a controlled study, you need to take whatever population you're choosing and randomly distribute incoming people into two or more cohorts. It also needs to be sticky. This is a paradox: you want to randomly put people in experiences, but you want the same people to end up in the same experience if they pass through the code more than once. You don't want me to log in today and get a blue screen, then tomorrow get a green screen and wonder what's happening. It needs to be randomized and sticky. It also needs to be reliable. It can never go down or cause a single point of failure. People who work in this do clever things to make sure it is never in the way and never slowing things down.

Some of you are probably asking how the system can be both random and sticky. This is done through hashing. Basically, you take an arbitrary piece of data like the user ID or some unique key for the user, and run it through a hashing algorithm with a seed. The same data going in will come out as the same hash value each time, provided you give it the same seed. You want a different seed for each feature, but each time a feature is presented, you go through the same hash, so a given person always gets the same hash value. Hashes distribute people evenly across the output range, so you can normalize that value to a number between zero and 99. Then you've got a bucket that is consistent no matter how many times a person comes through. As you ramp up and down in percentage, you're going to bring people in and out of the experiment.
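
The hashing approach can be sketched in a few lines. SHA-256 is used here only because it ships with Python; the talk does not name an algorithm, and production systems often use a faster non-cryptographic hash. The `bucket` and `in_rollout` names and the seed strings are hypothetical.

```python
import hashlib


def bucket(user_key: str, feature_seed: str) -> int:
    """Map a user key to a stable bucket from 0 to 99 for one feature."""
    digest = hashlib.sha256(f"{feature_seed}:{user_key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and normalize to 0..99.
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(user_key: str, feature_seed: str, percentage: int) -> bool:
    """True if the user's bucket falls below the current rollout percentage."""
    return bucket(user_key, feature_seed) < percentage


users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if in_rollout(u, "new-checkout", 5)}
at_20 = {u for u in users if in_rollout(u, "new-checkout", 20)}

print(f"{len(at_5)} users at 5%, {len(at_20)} users at 20%")
print("everyone exposed at 5% is still exposed at 20%:", at_5 <= at_20)
```

Because a user's bucket never changes for a given seed, raising the percentage only adds people and lowering it only removes them. Using a different seed per feature keeps the same users from always landing in the early buckets of every rollout.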

For telemetry, any modern software already has some telemetry in place. It could be stored internally in the app, sent to a business intelligence tool, or automatically collected by another product. Collecting that is typically as simple as building a wrapper around your existing tracking and sending a copy of those events to the experimentation platform, in addition to whatever destinations it's going to. If you've already got an internal analytics warehouse, you might be able to skip that step because you can extract from that warehouse directly if the data is already getting in there.

It's really important that telemetry data includes a unique key for the user, the same thing used to assign them an experience. That's how you're going to do attribution and tie the data to the features.
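
A wrapper along those lines might look like the following sketch. The two destination functions and their in-memory lists are placeholders for a real analytics tool and a real experimentation platform; swallowing sink errors is a design assumption so that telemetry cannot break the user-facing request.

```python
import time
from typing import Any, Dict, List

Event = Dict[str, Any]

analytics_events: List[Event] = []    # stand-in for the existing destination
experiment_events: List[Event] = []   # stand-in for the experimentation platform


def send_to_analytics(event: Event) -> None:
    analytics_events.append(event)


def send_to_experimentation(event: Event) -> None:
    experiment_events.append(event)


def track(user_key: str, event_type: str, value: float = 1.0) -> None:
    """Wrap existing tracking and send a copy of the event to every destination."""
    event = {
        "user_key": user_key,  # the same key used to assign the experience
        "event_type": event_type,
        "value": value,
        "timestamp": time.time(),
    }
    for sink in (send_to_analytics, send_to_experimentation):
        try:
            sink(dict(event))  # copy so one sink cannot mutate another's event
        except Exception:
            pass  # assumed policy: telemetry failures never reach the user


track("user-42", "page_load_ms", 183.0)
print(len(analytics_events), len(experiment_events))
```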

For the stats engine, we get back to attribution, calculation, and analysis. In attribution, assignment data is combined with telemetry: who got what is combined with what happened. That's how you get the colored dots and figure out the distribution. Then you calculate the distribution, creating the curves. Finally, you compare those curves with a statistical test. Most commonly, the statistical test is a t-test. It returns a p-value: the probability of seeing a difference at least this large if the two samples were the result of the same underlying behavior. In stats, it's kind of backwards: you're trying to disprove that nothing happened, and rejecting that is your evidence that something did happen. This only really works if you're truly assigning people to the different experiences at random.
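
Here is a rough sketch of those three steps using only the Python standard library. The data is simulated, and the test is Welch's form of the t statistic with the p-value taken from a normal approximation, which is reasonable for large samples; a real stats engine would use the t distribution and handle many more details.

```python
import math
import random
from statistics import NormalDist, mean, variance


def attribute(assignments, events):
    """Attribution: group metric values by the treatment each user was assigned."""
    by_treatment = {}
    for user_key, value in events:
        treatment = assignments.get(user_key)
        if treatment is None:
            continue  # no assignment record, so the event cannot be attributed
        by_treatment.setdefault(treatment, []).append(value)
    return by_treatment


def welch_t_test(a, b):
    """Welch's t statistic with a two-sided p-value from a normal approximation."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Simulated data, for illustration only: "on" users are about 10 ms slower.
rng = random.Random(7)
assignments = {f"u{i}": rng.choice(["on", "off"]) for i in range(2000)}
events = [
    (user, rng.gauss(210 if treatment == "on" else 200, 30))
    for user, treatment in assignments.items()
]

samples = attribute(assignments, events)
t, p = welch_t_test(samples["on"], samples["off"])
print(f"mean on = {mean(samples['on']):.1f}, mean off = {mean(samples['off']):.1f}")
print(f"t = {t:.2f}, p = {p:.4g}")
```

A small p-value says a gap this large would be unlikely if the flag made no difference. Note that attribution depends on the user key in the events matching the key in the assignment log.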

Finally, the management console is where your team can manage rollouts and review metric results. Most people's first feature flagging tool is powered by a static configuration file, an entry in a database, or a headless service. The problem is that approach limits who has access to a rollout and requires a lot of technical knowledge to tell what's really happening. You want to build some kind of front end that people can get to, with a consistent way to access things. You can use access management to control who can get to what and do things. But simplifying the approach so it's not a wizard making SQL queries will increase adoption and the credibility of the results, because more people will see it and trust it.

I want to quickly show some examples. Walmart has something called Expo. They have a fancy UI so they can make sense of it without being in the database. They have practices like test to learn and test to launch; you can learn more about that in a blog post I put up recently on our site. They also have a powerful data manipulation pipeline to get data quickly moved from one place to another.

LinkedIn has something called LinkedIn Experimentation, or LiX, and you can learn more about that online. One nice thing they're doing is alerts. If you look at Accelerate, the book recommends that monitoring shouldn't be ad hoc; it should be proactive. This alert would fire if one side of an experiment has a markedly different site speed.

The last thing I wanted to show before discussion is a book called Trustworthy Online Controlled Experiments. It was written by Ron Kohavi, Diane Tang, and Ya Xu. Ron worked at Amazon and Microsoft on sophisticated experimentation. Diane is from Google, and Ya Xu is at LinkedIn. They have the lessons learned. They have written not a textbook, but a sort of recipe book. Imagine having beers with people who have been there before and are sharing what you need to know. It's very readable and well organized. I highly recommend it, and you can get the first chapter free at experimentguide.com.

This is all you'll see about my company in this presentation today: Split makes an off-the-shelf experimentation platform for engineering teams for releasing software. It is geared toward how we build software and release it quickly, limit the blast radius, determine what has an impact, and make every rollout some kind of experiment so we can see how well we're doing and adjust our trajectory.

If you want to learn more about Split, we have a great blog. Fully two-thirds of our content has nothing to do with Split. It's about what's going on in the industry and what you can learn here. I'm happy to point you to some stuff during Q&A.

Thank you very much. I know that was really fast, but my goal was to expand your thinking about what it means to use feature flags with data and to set the bar, frankly, a little bit high for you so you know what's possible. There are teams doing this all day long, and you don't have to leave this to the giant unicorns as a practice that only they have. We can all do this if we just know which boxes to check off.

Cool. I will see you live in just a minute. Thanks for following along. Let's do this.