Unleashing Deploy Velocity with Feature Flags
A lot of development teams have built out fully automated CI/CD pipelines to deliver code to production fast! Then you quickly discover that the new bottleneck in delivering features is their existence in long-lived feature branches and no true CI is actually happening. This problem compounds as you start spinning up microservices and building features across your multi-repo architecture and coordinating some ultra-fancy release schedule so it all deploys together. Feature flags provide you the mechanism to reclaim control of the release of your features and get back to short-lived branches with true CI. However, what you're not told about feature flags in those simple "if/else" getting started demos is that there is an upfront cost to your development time, additional complexities, and some pitfalls to be careful of as you begin expanding feature flag usage to the organization. If you know how to navigate these complexities you will start to unleash true velocity across your teams. In this talk, we'll get started with some of the feature flagging basics before quickly moving into some practical feature flagging examples that demonstrate its usage beyond the basic scenarios. Along the way we will pull in context and discuss how SPS Commerce has used feature flags to impact their day-to-day development patterns. This session is presented by LaunchDarkly.
Full transcript
The complete talk, organized by section.
Travis Gosselin
[00:00:17.200] He's good. We got audio. We got visual. Ready to roll. I hope you had a great lunch, and I'm excited that you came to hang out here and talk feature flags.
[00:00:29.800] A lot of people, when we kind of mention the title of feature flag, think, "Well, are you talking about permissions in an application? What do you actually mean?" The scope of what we're diving into today is really exploring a DevOps journey for SPS Commerce. And this DevOps journey is specifically focusing on an application development pattern using feature flags, and how that can really create a ton of velocity in your deployments and enable so many other features along the way.
[00:00:56.900] So with that intro, if you feel the need to get to another session, feel free to, but that is where we're diving in. A little bit about myself as we get started. My name is Travis and I work as a principal engineer for SPS Commerce. It's probably not an organization you've heard of. SPS Commerce is a B2B organization that focuses on building the world's retail network, so connecting retailers and suppliers with each other and transforming the data to connect with each and every one of them individually.
[00:01:21.500] And in that organization, my focus is on developer experience. This idea of developer experience might be somewhat new to you. I'm betting that with our audience today and at this particular conference that you have a good sense of developer experience, but for all of us, it can mean something slightly different, or you might have a different broad perspective of it.
[00:01:42.200] I like this definition a lot for thinking about what is in scope when we talk about developer experience, which is the activity of studying, improving, and optimizing how developers get their work done. So we're really thinking about the day-to-day quality of life of engineers and their productivity moving code into GitHub, or wherever, and then out to production.
[00:02:00.300] And marrying that together, we take that persona of the developer and we bring it with your organization-specific best practices or your development principles to kind of build this persona of the developer experience space.
[00:02:13.700] And if it's something you're still kind of playing with or toying with on exactly what's in scope, or why this problem exists of developer experience, or why is it new, part of that comes as a result of this quote, which I think explains it in a really simplified way: developers work in rainforests, not planned gardens.
[00:02:34.800] If we pull that apart a bit, we can look back at the past 10 years, 20 years, even beyond for some of your organizations, and understand that a lot of the tools that you're left with as an engineering team, especially as a leader or staff engineer, are just what's available to you. It doesn't necessarily fit together with that other tool or with your entire CI/CD pipeline or your observability.
[00:02:58.200] Especially for us at SPS Commerce over the last 20 years, we have had a plethora of different teams experiencing different levels of productivity and picking and choosing, fending for themselves, which tools they wanted. So we knew we needed to address that and evolve that. In order to remedy it, we think about developer experience in terms of horizontal capabilities across the organization.
[00:03:19.300] Specifically, these are horizontal fast tracks for maximum productivity, maximum efficiency. If we draw the organization like this in your standard kind of set of four verticals, we start layering in some of the capabilities that we are all going to recognize, whether that be building and deploying a new application from scratch. How do we ensure we maximize that capability to stand up those new applications?
[00:03:40.200] We think a lot about API design and APIs as a universal language of development, which is essential in our polyglot ecosystem at SPS. Inner source, how to share code effectively, and what should be shared is a huge quality-of-life perspective and has a ton of friction involved in it.
[00:03:58.100] But of course, today we're talking about how do I build and deploy a new feature to production, and specifically how do I use feature flags in order to accomplish that in a much safer, confident, relaxed manner?
[00:04:09.700] And so I want to start with the problem because this is really what drove us at SPS Commerce down this road initially, of thinking through this way and this pattern of thinking. I'm willing to bet that the same is true for many organizations, or at least you'll be able to relate to it to some degree at a past point in your career.
[00:04:28.300] First is we have a main branch. We work out of GitHub and we carve off feature branches and kind of practice standard trunk-based development. In that trunk-based development, we create these features and we merge them back in, assuming they pass peer reviews, code reviews, status checks on GitHub, all of those great things.
[00:04:47.300] But as it moves forward and we merge that in, we automatically create versions based on Git semantic versioning and the Git commit messages that engineers add in conventionally. That continues down through the standard CI/CD process and is deployed out to dev. This is probably feeling pretty natural and not very unique.
[00:05:09.400] But where we get to the interesting part is when you want to move to production. Here, for different organizations, you're at a different spot in your CI/CD journey. Maybe you have a manual step before moving to production. We did have a manual step, and that manual step for us was a gate. A gate could be different things for your organization. It might be a product owner who says, "That feature that you finished and is out there, I didn't want that to go out until next Wednesday," even though it's Monday and you're releasing it. Or maybe your QA team discovered an issue in the development environment and has to halt or stop that release, and that leaves your pipeline in an inconsistent state.
[00:05:42.200] Maybe some of you, and I know I've definitely done this at times, say, "I'll move on, let's look at the next feature, let's start building it." But that leads to the problem of your feature branches becoming more long-lived, because you start to lack the purity or the confidence in merging back to main, knowing that the pipeline is in an inconsistent state.
[00:06:00.600] At the same time, your fellow developer or another engineer on the team is making a critical bug fix that was just recently discovered in production. So you make that bug fix, merge it back into main, and release it. Then you realize at that point in time, "Oh, wait a minute. I can't actually deploy that critical bug fix to production because I would be taking the feature with it."
[00:06:19.200] All of that at the same time as you work in a team, an enterprise, an environment that has other people waiting on your feature for that particular API or whatever it is that you're deploying. They're waiting on it and they can't access it now as well due to some of the other blocking issues that we're seeing here in the pipeline. This makes it very difficult for you to reason about it, to understand the state of it and that context.
[00:06:40.700] And so the solution, obviously, if we consider it with a feature flag and apply those during our development process, we would encode that right inside the feature branch that we would be building. This isn't in my infrastructure. This is literally in my code as a code block. For the sake of this illustration, think of it as just an if statement. It's an if statement saying, "Should I execute this code or not?"
[00:07:03.900] You merge that back into main and immediately, even before we deploy or release, you're seeing some benefits. One of the benefits you're seeing is that your other engineers, as they rebase or pull the latest now from your main trunk-based development branch, are going to have that feature turned off. It means you can merge features in incomplete if you need to and work towards a more true CI perspective, where you're actually integrating more often, not staying isolated in your long-term feature branches.
[00:07:30.100] But of course, there's still the problem we need to solve. As we move forward and keep pushing this towards deployment through the same standard approach we used in the problem setup, you see that we no longer have a blocking component in moving to production. That's because we're not actually releasing; we're deploying. The code is available there, but the feature is turned off.
[00:07:50.500] The reason is because we've abstracted that. It's behind an if statement that maybe, through an app configuration file, is disabled. Or even better, you're using a feature decision provider, the generic term for this service, this dependency that can sit outside your code base, that can make a decision on when to turn the feature on and off. That's where the gate exists. Now the gate is whether this is enabled inside the decision provider, which would be in production. Is feature A enabled for context?
[00:08:19.900] When we talk about context, we talk about those other services that need to be enabled. Before we get to that, though, that critical bug fix you needed to deploy, that's no problem. That's not a consideration that I have to worry about, that the pipeline would be not deployable. Those other services that did need access can be given access specifically based on recognizing context of a header authorization. It could be anything that you can imagine that you have access to in your system in order to indicate that they can use the feature but nobody else can, or that you can test it in production.
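The header-based targeting described here can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the `TargetingRule` class, the `RULES` table, the flag key `feature-a`, and the `X-Client-Id` header name are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TargetingRule:
    """Hypothetical rule: who may see a flag that is deployed but not released."""

    enabled_for_everyone: bool = False
    allowed_clients: set[str] = field(default_factory=set)


# Illustrative flag state; in practice this would live in the decision provider.
RULES = {"feature-a": TargetingRule(allowed_clients={"partner-service"})}


def is_enabled(flag_key: str, headers: dict[str, str]) -> bool:
    """Decide from request context whether the caller gets the feature."""
    rule = RULES.get(flag_key)
    if rule is None:
        return False  # unknown flags stay off
    if rule.enabled_for_everyone:
        return True
    # X-Client-Id is an assumed header name identifying the calling service.
    return headers.get("X-Client-Id") in rule.allowed_clients
```

With this shape, the code is deployed for everyone, but only the named caller reaches the new behavior until `enabled_for_everyone` is flipped.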
[00:08:52.400] We've come then full circle to see that deploy does not equal release. It shouldn't. These need to be decoupled in separate events that can take place by completely different people if necessary, in different systems.
[00:09:04.100] Thinking more about the theory then on what makes it a feature flag from an application development pattern, I'm a big fan of the quote from Martin Fowler, who defines it simply as a powerful technique allowing teams to modify system behavior without changing code. And without changing code is a pretty essential part of how we'll use feature flags.
[00:09:22.800] There are four types that you can kind of classify your flags into. The first is the release type, and we've just seen an example of that: the standard shipping of incomplete dev features that aren't ready yet. These flags are for the customer.
[00:09:35.300] Second type would be operational. Operational flags are for us, for development, for engineers, for operations. You might be doing some type of traffic shaping or some type of migration. We'll come to a good example of an operational flag in a little bit.
[00:09:50.300] There's also experimental types. These types are your standard A/B style testing, which drive preference. Is this calculation better? Does this algorithm work? I want to test this in a subset of users.
[00:10:00.200] And fourthly are permission toggles. These are more similar to what you might have inside of an application in order to handle alpha tests or beta test rollout, or even specific permissions for particular customers.
[00:10:12.900] The flexibility we gain with feature flags is pretty comprehensive, and all we did was add an if statement with a dependency. In terms of our branching strategy, if you are dealing with long-lived branches, approach it differently. Think about CI differently.
[00:10:30.500] You're shipping faster. For me, the confidence gained in knowing that I can not just ship this quickly to production and not worry about stuff blocking the pipeline, but at the same time I can roll back through the click of a button, gives me immense confidence. Of course, like anything in DevOps, the more we do it, and we're going to do this a lot more often now that it's easy, it just becomes that much better.
[00:10:53.300] Testing in production was pretty big for us. We're not a massive organization, we're about 2,500 employees, and we didn't think necessarily that the benefits of testing in production would be that great. But for us, it's been immensely helpful. Even other environments that we used as staging grounds for certain personas or certain teams in the organization to test with have been eliminated now. Instead of a dev, test, stage, pre-prod, and prod, we can do it with an integration and a production environment for some of the use cases, and that eliminates a lot of friction, a lot of cost.
[00:11:30.300] Culturally, this is different as well, not just in how your development teams are thinking about it because they no longer have to push the features into the main branch or the trunk-based branch in order to have it execute and deploy in the right order. It doesn't matter anymore. You can release it independently of that event.
[00:11:47.700] And of course, when we think culture, we think about other people who now have access to participate in that event with us, especially for the release side. That brings us this idea of progressive delivery. You may have heard of this. This is a newer term as well, and it's defined as a modern software development lifecycle that builds upon the core tenets of CI/CD.
[00:12:05.500] This term was kind of established by some of the folks at RedMonk and at Microsoft Azure DevOps, where they're talking about their ability to release through segments of users at a time, or what they call ringed releases. Working together with LaunchDarkly, one of the SaaS providers for feature management, they kind of delivered this new named pattern of progressive delivery.
[00:12:33.400] I think it's important that we realize that's what it is. This is not like, I'm not here to tell you you're doing CI/CD, now you need to be doing progressive delivery. That's not exactly the next tier to achieve. Instead, progressive delivery is refreshing because it's actually the named pattern or methodology for one way you could approach CI/CD, which is helpful to see it in that way.
[00:12:54.100] Like everyone else, our goals are to produce high-quality software, do it fast, and solve a business problem. In order to hit that velocity and think about how we use feature flags to do that, of course bringing in the State of DevOps Report and the DORA metrics are helpful to see that it enables just about every single one of these, which is incredibly interesting, that it was such a simple pattern to add in some cases.
[00:13:23.700] Of course, our deployment frequency: we can release on demand. We have no blocking components in the pipeline. We're not worried about cycling that with the release time that our product owners want. It's just simpler, and we can literally deploy anytime we want.
[00:13:38.200] Lead time for changes, taking the idea of a concept out of the backlog and getting it into production in less than a day, is pretty achievable. Of course, it depends on the size of the feature that you're coding or putting together, but not really a big concern.
[00:13:53.800] And then of course mean time to restore: can you restore your service as a high performer in less than a day? Well, if I'm rolling back a feature flag, it's probably less than 30 seconds in some cases.
[00:14:05.700] And of course the last one, which is my favorite, is change failure rate, because it never computed or made sense in my mind that I deploy more and my failure rate should go down. To me, it always made sense it should be the same. But of course this report tells us, from the data-driven metrics, that you practice it, do it more often, you get better at it, and your failure rate will be even less than 5%.
[00:14:29.400] I'm excited about these metrics here. These are the metrics from our continuous improvement team, who have been tracking our DevOps journey and our introduction of patterns like this over the last eight years. You can see the data showing exactly the difference, though that might be hard to see at the back. Basically you can see the opposite trends coming through, where we're doing about a thousand changes to production a month now, with less than a 2% failure rate on the latest month of data that we have.
[00:15:02.100] With that being said, I want to dive into a simple feature flag example because I think examples are king. If you're really new to feature flags and I've just been talking about if statements, this and that, it's helpful to see the code. I know it is for me.
[00:15:14.400] Here's a simple feature flag example where we are setting up a new user, and as part of that, we have a create user function, and that user object probably has a first name, last name, and an email associated with it. So we want to add a feature flag. We add an if statement. That's our flag, and it's checking to see, is SendGrid email enabled? In this case, we're going to use SendGrid as a SaaS provider to send a welcome email to the new user in the system.
[00:15:38.600] There's our code, and that's it. That's your first feature flag. But it's not that simple, right? Number one, what is that function actually doing? How is it checking to see if the SendGrid email feature is enabled? We could hard-code it to return true, and at least that's centralized in the domain that I'm deploying. But going back to our original definition, this doesn't qualify as a feature flag because we're not changing or modifying this from outside the code base. It's actually in the code base. That doesn't help us with the value that we want to achieve.
[00:16:08.300] So this could be an app configuration file, an XML or JSON file. But even better, as we described it earlier as a service and as a dependency for yourself, it could also be a separate service that exists that you are calling and asking based on a key, "Is this enabled? Is this ready for me to use?"
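As a rough sketch of the configuration-file variant, the flag check might look like the following. The slide code from the talk is not reproduced in this transcript, so every name here (`CONFIG_TEXT`, `load_flags`, `is_sendgrid_email_enabled`, `create_user`, the `sendgrid-email` key) is an assumption, and the SendGrid call is replaced by appending to an in-memory `outbox` list.

```python
import json

# Hypothetical contents of an app configuration file such as flags.json.
CONFIG_TEXT = '{"sendgrid-email": false}'


def load_flags(config_text: str) -> dict[str, bool]:
    """Parse flag state that lives outside the code itself."""
    return json.loads(config_text)


def is_sendgrid_email_enabled(flags: dict[str, bool]) -> bool:
    # Missing keys default to off, so new code ships dark.
    return flags.get("sendgrid-email", False)


def create_user(first_name: str, last_name: str, email: str,
                flags: dict[str, bool], outbox: list[str]) -> dict[str, str]:
    user = {"first_name": first_name, "last_name": last_name, "email": email}
    if is_sendgrid_email_enabled(flags):
        # Placeholder for the real provider call.
        outbox.append(f"welcome email to {email} via SendGrid")
    return user
```

Changing the file flips the behavior without a code change, though a file bundled with the deployable unit still needs a redeploy, which is part of why an external service is described as the better option.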
[00:16:30.800] With that in mind, a lot of times we're not adding net new code. We're augmenting existing code. In so many cases you might find yourself producing, in the simplest form, an if statement where you can see the old code that is present there, and we need to maintain both of those code paths. That's using a local SMTP email server.
[00:16:48.100] But if I were to deploy this to production, this doesn't gain us a ton of value. It's helpful that I can turn it on and off and unblock some of the pipeline, but in production, if I want to validate that my API key for SendGrid is actually functioning and working, I now need to turn that on for everyone because it's not user-specific. I turn that on for everyone, test it quickly, and then turn it off. Probably not what we were hoping for.
[00:17:10.100] The key there is we do need to pass along a user context, which has information about the user like we said (first name, last name, email), so we can now make more interesting decisions, such as that this should just be enabled for Travis to test in production and verify.
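A minimal sketch of a context-aware check, with both the old and new code paths kept alive, could look like this. The `UserContext` class, the `SENDGRID_TARGETS` set, and the string return values are illustrative assumptions standing in for a real provider and real email calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    first_name: str
    last_name: str
    email: str


# Hypothetical targeting list: only these users get the new path.
SENDGRID_TARGETS = {"travis@example.com"}


def is_sendgrid_email_enabled(user: UserContext) -> bool:
    """Evaluate the flag for one specific user rather than globally."""
    return user.email in SENDGRID_TARGETS


def send_welcome_email(user: UserContext) -> str:
    if is_sendgrid_email_enabled(user):
        return f"sendgrid:{user.email}"  # new path, stand-in for the SaaS call
    return f"smtp:{user.email}"  # old path, stand-in for the local SMTP server
```

Because the decision takes the user as input, the new path can be verified in production for one person while everyone else stays on the old one.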
[00:17:24.200] With that, we have some really important realizations. These are things that are like, okay, now that we've seen the simple example, maybe your feature flagging honeymoon is over. Maybe this isn't as great as it sounds. What are some of these things?
[00:17:35.300] First is shifting left. The idea is that yes, we're taking the complexity that we used to handle in code branches and we are putting it in our code. I'm a fan of that because I can do a lot more interesting things in code, but it does mean the maintainability of our applications is affected, and it can be affected a great deal. By that I mean you're maintaining multiple code paths, which means multiple paths of testing and unit tests that can continue to increase your build time and testing pattern time.
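To make the testing cost concrete, here is a small sketch that enumerates every on/off combination for a set of flags. The flag names and the `welcome_channel` function are hypothetical; the point is only that n Boolean flags give 2**n combinations to consider.

```python
from itertools import product

# Hypothetical flags guarding code in one service.
FLAGS = ["sendgrid-email", "new-signup-form"]


def flag_combinations(flag_keys: list[str]):
    """Yield every on/off assignment for the given flags."""
    for values in product([False, True], repeat=len(flag_keys)):
        yield dict(zip(flag_keys, values))


def welcome_channel(flags: dict[str, bool]) -> str:
    """Code under test: picks the email path based on one flag."""
    return "sendgrid" if flags["sendgrid-email"] else "smtp"


def check_all_paths() -> int:
    """Exercise both code paths under every combination; return how many ran."""
    count = 0
    for flags in flag_combinations(FLAGS):
        expected = "sendgrid" if flags["sendgrid-email"] else "smtp"
        assert welcome_channel(flags) == expected
        count += 1
    return count
```

In practice most teams test each flag's two paths rather than the full cross product, but the growth is worth keeping in mind when flags interact.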
[00:18:05.200] But also the state of the system is interesting. It's no longer as simple as, was this feature deployed to production at that time? Now it's, was this feature deployed to production and was it turned on at this time for this user? If you're not prepared to handle that observability and that context in being able to debug the state of the system, you might get yourself into a bit of trouble.
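One way to keep that state debuggable is to record each flag evaluation alongside who it was for and when. The sketch below assumes a hypothetical in-memory `AuditedFlags` wrapper; a real system would more likely emit these records to its logging or tracing pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Evaluation:
    flag_key: str
    user_id: str
    enabled: bool
    evaluated_at: datetime


class AuditedFlags:
    """Hypothetical wrapper that remembers every decision it hands out."""

    def __init__(self, enabled_users: dict[str, set[str]]) -> None:
        self._enabled_users = enabled_users
        self.history: list[Evaluation] = []

    def is_enabled(self, flag_key: str, user_id: str) -> bool:
        enabled = user_id in self._enabled_users.get(flag_key, set())
        # Record what was decided, for whom, and when (UTC).
        self.history.append(
            Evaluation(flag_key, user_id, enabled, datetime.now(timezone.utc))
        )
        return enabled
```

With a record like this, "was the feature on for this user at that time?" becomes a lookup instead of a guess.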
[00:18:28.400] Just added development time in general is definitely worth noting here. Also, we're changing existing code. While it may seem sometimes that people throw out feature flags as a development pattern that is risk-free, it's not. I've broken production with feature flags. I would probably do it again. But after you practice it, you get good at it, you do it a lot less. You build abstractions around it, and there are great SDKs and tools from different providers out there to help you with that.
[00:18:53.400] And of course, use a feature decision provider. I'm a big fan of not spending time on undifferentiated engineering. So should you build this service yourself, this dependency? Should you not? Ultimately, that's your decision. But I'd encourage you to check out a lot of the great providers that are out there.
[00:19:08.200] This is a G2 quadrant here. You can see one of the premier SaaS companies is LaunchDarkly, that we touched on earlier, as well as CloudBees and Optimizely, and there's a bunch of other options out there that provide just tremendous amounts of value that you want to take advantage of.
[00:19:24.800] At SPS, we're users of LaunchDarkly. They've been a fantastic partner for us and have really helped us mature our practice towards using feature flags effectively. One of our directors of development said this about them, which is, "LaunchDarkly is a game changer in enabling us to rapidly experiment and receive critical feedback and keep the development pipeline moving while other departments prepare for launch activities."
[00:19:48.600] While that's fun, and of course that's a director-level answer, I really like this one from our lead engineer. As someone who focuses on developer experience, when I hear quotes like this, I know that it's a game changer for what we're doing. He says, "It just makes my life simpler and my deployments more relaxing." My whole goal is to provide developer happiness and developer experience kind of benefits, quality of life. And so that's a fantastic quote.
[00:20:14.900] LaunchDarkly has also participated with me in many of my personal events with the T-shirt that I've worn from them. That's me and my wife there. My wife wasn't too happy when I wore the T-shirt to the birth of my third son, but that's an ongoing discussion. And of course, present at all holiday situations too. So that's definitely been fun there.
[00:20:33.700] But diving back in, I wanted to explore two other patterns, the first being migrations. We don't have a lot of time to spend and explore here, but from a migration perspective, you can use feature flags to do more than just that release, shipping incomplete features kind of concept.
[00:20:51.200] In this scenario, we were migrating from a monolithic API with a monolithic database to a separate scheduling API that we had already ripped out. We're using the strangler pattern from a microservice approach to pull that out, and now we're using feature flags in order to reroute that traffic accordingly over to the new API. We were able to do that in portions, obviously: route a percentage of traffic, or even just our internal users, over first. Let them experience it before we degrade anybody external.
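A percentage rollout with internal users first can be sketched as below. This is an illustration under assumptions, not the routing SPS Commerce used: the `scheduling-api` flag key, the hash-based bucketing, and the destination names are all invented for the example.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a user to a stable bucket in the range 0 to 99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, is_internal: bool, rollout_percent: int) -> str:
    """Pick the backend for a request during the migration."""
    # Internal users always go to the new API; everyone else by percentage.
    if is_internal or bucket("scheduling-api", user_id) < rollout_percent:
        return "new-scheduling-api"
    return "monolith"
```

Hashing the flag key together with the user ID keeps each user on the same side between requests, and raising `rollout_percent` only ever moves users from the old path to the new one.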
[00:21:20.700] Another way to handle that, of course, would be to use feature flags in a different way. For us, we had to do bidirectional data synchronization, and there is some fun stuff happening there to do some ETL. But we could have also used a dual feature flag in order to transition the write capability different from the read capability. Depending on your system and how high volume it is, that might be an advantage for you to just transfer initially the writing of the information to both destinations as opposed to just one, and then slowly you can then use a second flag to redirect your read traffic accordingly.
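The dual-flag idea, one flag for writes and a second for reads, might be sketched like this. The flag keys and the use of plain dictionaries as the old and new data stores are assumptions made to keep the example self-contained.

```python
def save_schedule(schedule_id: str, value: str, flags: dict[str, bool],
                  old_store: dict[str, str], new_store: dict[str, str]) -> None:
    """Always write to the old store; also write to the new one when flagged."""
    old_store[schedule_id] = value
    if flags.get("schedule-dual-write", False):
        new_store[schedule_id] = value


def load_schedule(schedule_id: str, flags: dict[str, bool],
                  old_store: dict[str, str], new_store: dict[str, str]) -> str:
    """Read from the new store only once the second flag is turned on."""
    if flags.get("schedule-read-from-new", False):
        return new_store[schedule_id]
    return old_store[schedule_id]
```

Turning on the write flag first lets the new destination fill up while reads stay on the old one; the read flag can then be flipped, and flipped back, independently.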
[00:21:56.600] For us, we had a couple mistakes that we could only realize with production load traffic on it, and that was related to an algorithm specified on the load balancer that just wasn't working right for the type of requests that we're making. So we were able to easily use feature flags to monitor that situation in production, flip it off, fix the problem, flip it on, with very little customer impact.
[00:22:20.200] Coordination is also a consideration. When we talk about coordination, it's this idea of using feature flags across your enterprise and your organization, and there's a whole other talk here that we could stem from. But the idea is this: are you using one feature flag to coordinate a feature across multiple deployable units, or using three different flags in order to coordinate that, depending on how your team structure is set up? There's definitely a strategy that you want to think through for this.
[00:22:47.100] Part of that strategy is also considering exactly what is the user context that you will use across your enterprise. How will you find certain users? User context is something you want to consider, whether it be a user identifier, user details, a first name and last name. What information are you allowed to send to them if you're using an external provider?
[00:23:08.400] For us, we use a lot of information related to organization. Organization within the SPS Commerce network is how everything functions. We're transmitting to org, and so we turn on and enable features almost entirely at the org level.
[00:23:21.400] But if you were to use two different contexts in different services or deployable units, you can see how these are not equal to each other. If I were to turn on a feature flag and say, turn on feature A specifically for user ID 123456, that's going to work across the board. If I were to say turn on the feature for first name equals Travis, which is probably a poor way to do it, but for the sake of the example, you can see how that would only get turned on in the system that's actually providing a first name context. Understanding the integrity of your context usage in a distributed environment is pretty important to consider.
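The mismatch can be shown with a tiny rule matcher. The two service contexts and the rule shapes below are hypothetical; they only illustrate why a rule on an attribute that one service never sends will silently fail to match there.

```python
def matches(rule: dict[str, str], context: dict[str, str]) -> bool:
    """A rule matches only if every attribute it names is present and equal."""
    return all(context.get(attr) == value for attr, value in rule.items())


# Two deployable units sending different context for the same person.
billing_context = {"user_id": "123456"}
profile_context = {"user_id": "123456", "first_name": "Travis"}

rule_by_id = {"user_id": "123456"}
rule_by_name = {"first_name": "Travis"}
```

Here `rule_by_id` matches in both services, while `rule_by_name` matches only where `first_name` is part of the context, so the feature would be on in one service and off in the other for the same user.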
[00:23:59.500] You can have monetary impacts on that as well. In our work with LaunchDarkly, we were using different user contexts that could have been the same context, resulting in a pretty substantial increase in monthly active users, and they were very helpful in helping us identify that.
[00:24:15.800] There are many other scenarios. Log-level verbosity changing to debug-level logs in extreme production environments for a portion of time is helpful. Dynamic configuration, JSON blobs, multivariant values. You don't just have to make Boolean values for your feature flags.
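As a sketch of a non-Boolean flag, the snippet below applies a string-valued `log-level` flag to a standard library logger. The flag key, the `ALLOWED_LEVELS` set, and the fallback to `INFO` are assumptions; only the `logging` calls are standard Python.

```python
import logging

# Level names the flag is allowed to select.
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def apply_log_level(logger: logging.Logger, flag_values: dict[str, str],
                    default: str = "INFO") -> str:
    """Set the logger level from a string-valued flag, falling back safely."""
    level_name = flag_values.get("log-level", default)
    if level_name not in ALLOWED_LEVELS:
        level_name = default  # ignore unexpected values from the provider
    logger.setLevel(getattr(logging, level_name))
    return level_name
```

Calling this whenever the flag value changes would let debug logging be switched on for a window of time in production and switched back off without a deploy.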
[00:24:33.900] Kill switches: we talked a little bit about that one with the migration example, being able to actually kill this and move back to the original operational intent or the old dependency, maybe even degrading a service intentionally by turning it off if you needed to.
[00:24:51.400] Sunsetting features, using this as a capability for your customer success teams to slowly remove a certain feature set and add a new one, just a set of teams, or even just developing a stake in the sand to say, "Hey, all new customers won't get this," and then we can stop them from seeing it and start to eliminate this.
[00:25:07.200] And of course timed features, holiday scenarios and different banners and things that you want to add from a time perspective.
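A timed feature can be reduced to a window check, as in this minimal sketch. The dates and the function name are made up for illustration; a hosted provider would typically hold the schedule rather than constants in code.

```python
from datetime import datetime, timezone

# Hypothetical display window for a holiday banner, in UTC.
BANNER_START = datetime(2024, 12, 20, tzinfo=timezone.utc)
BANNER_END = datetime(2025, 1, 2, tzinfo=timezone.utc)


def holiday_banner_enabled(now: datetime) -> bool:
    """True while 'now' falls inside the window (start inclusive, end exclusive)."""
    return BANNER_START <= now < BANNER_END
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check easy to test at the window edges.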
[00:25:15.200] Unfortunately, that's all the time that we have for today. But I think if you're leaving with anything, the idea of separating your deploy and your release events can have a substantial impact to your org. It definitely has for us. The value that you get as a result of that user-context targeting, testing in production, and risk mitigation is pretty immensely helpful, I can say, in just about every team that we've introduced it to thus far.
[00:25:39.400] And I think if there is an additional, "the world isn't perfect" takeaway, it's not free. This is going to take some time for you to practice and get good at if you haven't done it, but it's totally worth it and totally helpful as a practice for moving towards CI/CD.
[00:25:55.500] So that's, I think we are close to time. We've got about 30 seconds for any questions or final thoughts, so you can find me.