Innovate or Die: Use Experimentation to Eradicate Uncertainty
This session is presented by Split.
Henry Jewkes
Hello, my name is Henry Jewkes, and I'm the experimentation architect at Split Software, where we power engineering teams to build impactful products. I also host the "Adventures in DevOps" podcast. If you are enjoying this conference, you should definitely check us out.
This year, the world changed. The COVID-19 pandemic has impacted every single one of us in small ways, in large ways, in permanent ways. So too has it impacted our businesses.
In some cases, we have seen long-term trends take giant leaps forward. For most of us here, our work has shifted from offices into the home. We even attend our conferences remotely. The already booming e-commerce sector has scaled up dramatically. Those businesses that take advantage of it have thrived, while small businesses have struggled to stay afloat. And where only a year ago, the Academy was voting on whether streaming movies were eligible for Oscars at all, in 2020, almost all major movie releases have occurred online.
On the other end, many businesses that had been seeing rapid growth have all but disappeared. What little travel occurs, local or long distance, has moved to personal vehicles. Airlines were forced to clear flight schedules. And the ride sharing sector has gone from Silicon Valley success story to cutting drivers and employees alike. Visitors to hotels and casinos have similarly dropped dramatically as the world stays home.
To maintain relevance, it has become essential for businesses to respond quickly to market conditions. Our schools have secured suites of online tools, and teachers are learning to run video conferences and use virtual whiteboards. Those same ride sharing apps are doubling down on their food delivery offerings and are finding new ways to engage drivers by being the link in the chain for local personal deliveries. And manufacturing companies of all types are producing personal protective equipment, ventilators, and even sanitation supplies to help our frontline workers.
For us in DevOps, innovation means ensuring our organization is delivering value as fast as possible. By this time, I think almost all of us in DevOps have made it our duty to facilitate the move away from waterfall processes and towards a more flexible way to manage and develop our systems and software.
Elastic scaling, automated testing, and continuous build and deployment are so ingrained in a successful DevOps organization that they are almost automatic. We have incredible tools available to us and have the knowledge and training to spread the culture of continuous releases throughout our engineering departments.
As time goes on, we have facilitated accelerating release schedules from weeks to days to minutes. It is now much easier to provide robust, effective testing environments, to ensure that the build and release process is automated to the point of being truly continuous, and to empower developers to work directly off of trunk and release new code as it is created. So with these pillars already in place, our job is done, right? Release management is solved. Let's all head to the bar. Of course not.
Migrating services and databases is still a very painful process. Developers will always push bugs, some small, some large, and many of those requiring rollbacks. And even if our changes are released perfectly without a hitch, we need to handle the cases where what was changed proves to be unsuccessful for our customers.
Fortunately, as with everything else, we can solve these problems with software. Many of you are likely already using feature flags. They're a tool that empowers your organization to separate code deploy from feature release.
For those not familiar, a feature flag is a simple if/else statement that is powered by a configuration, service, or external tool. It can be modified to target the enclosed code to a specific subset of your population. Usually, feature flags are targeted at users or customers, but they can also modify functionality by session, service, request, or database transaction. Whatever is applicable for the code change.
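As a minimal sketch of that if/else, assuming a hypothetical in-memory configuration (the `FLAGS` dictionary, the `new_checkout` flag, and the function names are invented for illustration and are not any particular product's API), a flag check might look like this:

```python
# Hypothetical in-memory flag configuration. A real system would load this
# from a config file, a database, or a flag service.
FLAGS = {
    "new_checkout": {"enabled": True, "allowed_keys": {"user-42", "internal-qa"}},
}


def is_enabled(flag_name: str, key: str) -> bool:
    """Return True if the flag is switched on and this key is targeted."""
    flag = FLAGS.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False  # unknown or switched-off flags fall back to the old path
    allowed = flag["allowed_keys"]
    # An empty allow-list means "everyone"; otherwise only the listed keys.
    return not allowed or key in allowed


def checkout(user_id: str) -> str:
    # The feature flag itself: a plain if/else around the new code path.
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    else:
        return "old checkout flow"
```

The key passed in does not have to be a user ID. The same check works for a session, service, request, or transaction identifier, as long as the configuration targets the same kind of key.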
Feature flags come in many types. They can be a simple switch that enables the feature, either globally or for a specific portion of traffic. They can be ramped to a random population, allowing you to steadily roll out the functionality. Or a flag may also gate multiple variations of the feature, either to compare their impact or as part of a phased rollout strategy.
At LinkedIn, their team proposed that effective release strategies balance speed, quality, and risk.
Every release is a decision. The decision to provide the change for users, the decision to introduce this change to your code long-term. Speed refers to how quickly you reach this decision. The faster you decide, the faster you start delivering value. Quality is not whether the change is bug-free, but represents whether this is the right decision, whether the change accomplishes what we expect it to.
Risk comes in the form of bugs, performance issues, and security holes, but also in having worse results than what was replaced.
Now, traditional deployments maximize speed. The code is immediately active. But it makes that decision blindly, and it exposes the entire system to any negative ramifications.
On the opposite side of the spectrum, never changing your code minimizes the risk related to change, but also has a velocity of exactly zero. And in some cases, not taking action can be the greatest risk of all.
The ramping process divides your release into phases. Each phase protects and informs the next step in that process.
When deploying, the feature flag ensures the change is not active. That eliminates all risk at that time. Ramping begins by targeting a small percentage of customers, watching for issues while protecting the majority of your traffic.
You can target randomly, or you can use a specific whitelist to target beta customers or even internal users to be able to validate the change. Understanding the effects of the change typically requires data to be collected. This data can be collected most efficiently when the population is evenly divided, increasing the quality of your decision in the fastest way.
If you do decide to launch, you can either release the feature fully or continue to ramp to monitor how your system scales under load.
As a developer, the greatest advantage of a feature flag is the peace of mind it provides. Knowing that should an issue ever occur with a release, rolling it back is just one click away. There's no writing a hot fix at 3:00 AM, and no scrambling to get a rollback deployment approved.
Well, now that we are all on the same page about the capabilities and benefits of feature flags, let's see what this might look like in action. Migrations are one of the most common transitions we manage in DevOps. They're also one of the most challenging scenarios in release management. Whether you're changing data stores, shifting schemas, updating API endpoints, or breaking a monolith into individual microservices, migrations must be designed carefully to be successful.
The traditional model of migrations is a painful one. The service must be disconnected from the systems that rely on it. The data must be copied into the new service, taking minutes, hours, or even days. And only once that copy is completed can you then start reconnecting systems to the new infrastructure. And then watch and hope that everything went according to plan.
This approach inevitably results in downtime. It must be scheduled on off hours, teams work nights and weekends, and customers need to be warned. Then, only once the migration is complete, can you validate that it was successful. And if an issue is discovered, you need to have a recovery plan, because your data isn't reaching that retired service that you wish to roll back to.
Fortunately, there is a much better way. By leveraging feature flags, you can architect the transition ahead of time and then migrate in phases.
Start by sending your writes to both services during the transition. The flag lets you ramp up the load to the new service and keep the latest data in both systems throughout the migration.
Then that migration can be performed while writes are still active, resulting in a final state where both versions of the service have the same set of data.
To validate, you can perform dark reads. This is a pattern where some portion of requests hit both versions of the service and check that the result is the same, reporting issues before customers are ever exposed to them.
Once the migration is proven to be successful, reads can be transitioned to the new service and the old infrastructure can be retired.
This allows even the most complex migrations to be done with little to no risk to the system. You only launch once you confirm the two systems match, and at each step, you are able to progressively ramp and validate, making sure that the service writes work, that the migration is successful, and that the reads work, and that the system scales every step of the way.
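The dual-write and dark-read pattern can be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's implementation: plain dictionaries stand in for the old and new services, and the `MigratingStore` class and its `dual_write_pct` and `dark_read_pct` settings are hypothetical names for the values a feature flag would ramp.

```python
import random


class MigratingStore:
    """Wraps an old and a new key-value store during a phased migration."""

    def __init__(self, old, new, dual_write_pct=0.0, dark_read_pct=0.0, rng=None):
        self.old = old
        self.new = new
        self.dual_write_pct = dual_write_pct  # share of writes mirrored to new
        self.dark_read_pct = dark_read_pct    # share of reads also checked on new
        self.rng = rng or random.Random()
        self.mismatches = []                  # keys where the two stores disagreed

    def write(self, key, value):
        self.old[key] = value  # the old store remains the source of truth
        if self.rng.random() * 100 < self.dual_write_pct:
            self.new[key] = value  # mirrored write, ramped by the flag

    def read(self, key):
        result = self.old.get(key)  # callers are always served from old
        if self.rng.random() * 100 < self.dark_read_pct:
            shadow = self.new.get(key)  # dark read: compared, never returned
            if shadow != result:
                self.mismatches.append(key)
        return result
```

Because callers only ever receive the old store's answer, a disagreement shows up in `mismatches` for the team to investigate and never reaches a customer. Once the mismatch list stays empty at 100%, reads can be switched over.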
All right. Let's pat ourselves on the back. Not only is our deployment pipeline continuous, we have now created independence between those deployments and the release process. Our development teams can ramp those rollouts in stages to minimize risk, and we have that big red button to press if things go wrong.
Time for that beer? Well, no. Sure, we can ramp out those features, but how do we make the decision to roll out? How do we know it's safe or that the feature is successful? Similarly, how are we supposed to know when to hit that big red button? And with many teams running many releases, how do we know which of those releases is responsible for any given issue? Do we just kill them all to be safe?
"Oh, no. Easy," you say. "We have dashboards, so many dashboards." You're probably already measuring server metrics, service metrics, system metrics, requests, and clicks, and views. Today, every company has more dashboards than they know what to do with.
So when it comes to measuring a feature release, we can just turn it on and watch the dashboards, right? Well, if an issue does occur, you can expect to see one of these screenshots sent your way. A metric spikes, and everyone on the team scrambles, trying to identify what changed and how it might be causing the issue.
One of the first tools that we built at Split was a way to overlay feature flag changes onto these dashboards. That empowers the team to correlate the release and the metric change much faster.
Unfortunately, the amount of data flowing to these metrics means issues for a small ramp percentage can be impossible to see.
Also, just because two events occur alongside one another does not mean that they are causally linked.
Every summer, the rate of shark attacks in the United States rises right alongside the sales of ice cream. But that doesn't mean the sharks have changed their diet to rocky road.
So back to that dashboard I showed earlier. This is a real view of an incident that we encountered sometime back. The team spent hours looking for a root cause, only to find that the source of the issue was external to our system and not even related to a release at all.
So unfortunately, just looking for changes on the dashboard often isn't enough to measure your releases. Other effects, whether they be a denial-of-service attack, great beach weather, or a global pandemic, can have a direct impact on your data that needs to be accounted for.
To go beyond correlation and look for causality, science has provided us with the randomized controlled trial.
By enabling the change at random and measuring the behavior of both the exposed and unexposed traffic, we're able to distribute the effects of outside factors between the two samples, thus attributing any behavior difference to the change itself.
This process begins with attribution.
For every data point, we identify what feature variation the traffic received. As the data is plotted based on its exposure, patterns may emerge.
Those patterns can then be analyzed and turned into a distribution of data, allowing us to understand how that metric behaves within that particular sample.
Through the use of statistics, those distributions can then be compared, determining whether a difference exists beyond normal variation in the data.
This statistical analysis is the cornerstone for experimentation and A/B testing, but it has also proven itself to be critical regardless of the type of release.
It can show whether a bug was really fixed, whether a new refactor is more performant, or whether a feature actually provides value to customers.
There are many ways that companies can achieve this comparative analysis.
Most dashboarding tools offer ways to tag and segment data. This lacks the statistical rigor that is key to making decisions, but it is a stepping stone on the path and provides far better data than the overall dashboard view would.
Teams that are already performing internal data analysis can store their feature flag assignments in the same analytics warehouse and process their results manually.
And we are seeing more and more companies either building or procuring an experimentation platform capable of this data collection and analysis, in addition to feature management.
At this point, I've spent a lot of time describing how your team can benefit from better release management. But how might you build such a tool for your very own? The release platform is built of four core parts.
The targeting system, which powers your feature flags and records the assignments performed.
The tracking sensors, which capture key metrics, whether technical (errors, load time, or throughput) or business (retention, engagement, and revenue).
Next, the statistical engine is responsible for the attribution, calculation, and analysis, which we discussed earlier. And finally, the management console through which releases are configured, and analysis is shared.
Your targeting system must be fast to avoid becoming a bottleneck, must be random to remove bias in how users are assigned to each variation, and it must be sticky to assign a user to the same variant no matter how many times an experiment is evaluated. It also must be reliable, as under no circumstances can the targeting engine be down.
In the spirit of DevOps, the targeting system, and indeed the entire release platform, is best isolated into a microservice.
This service contains the logic for assigning a given identifier or key to a feature and for storing those assignments for later processing. This allows any part of your infrastructure to quickly and reliably use feature flags in its code.
Some of you may be asking how a system can be both random and sticky. Targeting systems achieve this through hashing, a process that maps arbitrary data, in this case, the traffic key, to a fixed value in a consistent and reliable way.
The same data will always result in the same hash, provided the algorithm is given the same seed. So to ensure each feature is released independently, a new seed should be generated each time a feature is created.
Modern hashing algorithms assign values uniformly, meaning that a key is equally likely to hash to any value. This behavior allows us to normalize the hash to a percentage and know that the population will be randomly distributed.
If a rollout is targeted at 55% of the population, any key which maps to less than 55 will be assigned the treatment, and the remaining will be assigned the control.
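A minimal sketch of hash-based assignment follows. The talk does not name a hashing algorithm, so SHA-256 from the standard library is an assumption made here for convenience. Production systems often use a faster non-cryptographic hash. The function names and the per-feature `seed` string are also illustrative.

```python
import hashlib


def bucket(key: str, seed: str) -> float:
    """Map a traffic key to a stable value in [0, 100) for a given seed."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # first 64 bits as an integer
    return (value % 10_000) / 100              # 0.00 to 99.99 in 0.01 steps


def assign(key: str, seed: str, rollout_pct: float) -> str:
    """Sticky assignment: the same key and seed always give the same answer."""
    return "treatment" if bucket(key, seed) < rollout_pct else "control"
```

Because the bucket depends only on the key and the seed, a user keeps the same bucket as the rollout percentage rises, so anyone already in the treatment stays there. Giving each feature its own seed keeps one feature's treatment group from lining up with another's.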
Any modern software already has measurement and telemetry in place.
It may be stored internally or sent to a business intelligence tool. It might also be automatically collected by another product.
Collecting data for your release platform typically is as simple as building a wrapper for your existing tracking and sending those events to your release service in addition to its other destinations.
For teams with an internal analytics warehouse, this step can be skipped, and the release service can read from that warehouse directly.
It is important that your telemetry data incorporates the key of the traffic generating the event, as this is needed to tie the data to your features.
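One way to picture such a wrapper, as a sketch only: the `make_tracker` helper, the event fields, and the idea of a "sink" (any callable that accepts an event) are assumptions made for illustration, not an existing tracking API.

```python
import time
from typing import Callable, Optional

Event = dict
Sink = Callable[[Event], None]


def make_tracker(*sinks: Sink) -> Callable[..., Event]:
    """Build a track() function that fans each event out to every sink."""

    def track(key: str, event_type: str, value: float = 1.0,
              timestamp: Optional[float] = None) -> Event:
        event = {
            "key": key,  # the traffic key, needed to tie events to flag assignments
            "event_type": event_type,
            "value": value,
            "timestamp": time.time() if timestamp is None else timestamp,
        }
        for sink in sinks:
            sink(event)  # existing destinations plus the release service
        return event

    return track
```

In this sketch, the existing analytics destination and the release service are simply two sinks, for example `make_tracker(existing_analytics.append, release_service.append)` with two lists standing in for the real destinations.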
For the statistical engine, we return to the attribution, calculation, and analysis steps. In the attribution process, the assignment data is combined with your telemetry to determine what events are relevant to the release.
It is important to note that this process should be limited to the data received during a single phase of your rollout.
If you try to combine data before and after ramping, the experience of returning traffic can change over that period, and the telemetry cannot be attributed properly.
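The attribution step can be sketched as a join between assignments and events, restricted to one phase. The data shapes here are assumptions for illustration: `assignments` maps each traffic key to the variation it received, and each event is a dictionary carrying a `key`, a `value`, and a `timestamp`.

```python
from collections import defaultdict


def attribute(assignments, events, phase_start, phase_end):
    """Group event values by variation for a single rollout phase.

    assignments: dict mapping traffic key -> variation name
    events: iterable of dicts with "key", "value", and "timestamp"
    """
    by_variation = defaultdict(list)
    for event in events:
        # Only count events received during this phase of the rollout.
        if not (phase_start <= event["timestamp"] < phase_end):
            continue
        variation = assignments.get(event["key"])
        if variation is None:
            continue  # traffic that never evaluated the flag is not attributed
        by_variation[variation].append(event["value"])
    return dict(by_variation)
```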
There are many ways to calculate the distribution of your attributed data. With sufficient sample sizes, the most practical approach is to calculate the summary statistics, such as the mean, variance, and size of sample.
Other statistical techniques may require different data collection, but in this process, these statistics can be collected very efficiently.
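As one illustration of why these statistics are cheap to collect, the size, mean, and variance of a sample can be maintained in a single pass without storing the raw events. The sketch below uses Welford's online algorithm, which is a choice made here and not something the talk specifies. The `Summary` class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Running size, mean, and sample variance of a stream of values."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance (n - 1 in the denominator); 0.0 until two values exist.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

One `Summary` per metric and variation is enough to feed a statistical test later.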
The final analysis step compares the two distributions using a statistical test.
There are a wide variety of tests available. Most commonly in release monitoring, analytics teams use a T-test, though I recommend reviewing your options and choosing the technique right for your team.
Statistical tests typically return a probability, or p-value: how likely it is to see a difference at least this large if the two samples were the result of the same underlying behavior.
If that probability is very low, then it can be inferred that a meaningful difference exists between the two samples.
When the samples are randomly assigned to the release, we can conclude that the change is responsible for the difference we observe in the impacted metrics.
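To make the comparison concrete, here is a rough sketch that computes a Welch-style test statistic from the summary statistics of two samples. It swaps in a normal approximation for the p-value, which is only reasonable for large samples and is an assumption of this example. A proper t-test would use the t distribution with the appropriate degrees of freedom, typically through a statistics library. The function name is hypothetical.

```python
import math


def compare_samples(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Return (statistic, two-sided p-value) for the difference in means.

    Uses the unequal-variance (Welch) standard error and a normal
    approximation for the p-value, so it assumes large sample sizes.
    """
    standard_error = math.sqrt(var_a / n_a + var_b / n_b)
    if standard_error == 0:
        raise ValueError("both samples have zero variance; the test is undefined")
    statistic = (mean_b - mean_a) / standard_error
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    return statistic, p_value
```

A common convention is to treat a p-value below 0.05 as evidence of a difference, but the threshold is a decision for your team and should be fixed before looking at the data.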
The last component of a release platform is the management console. This is where your team can manage your rollouts and review metric results.
Many people's first feature flagging tool is powered by either a static configuration file or an entry in a database. These approaches limit who has access to a rollout and require technical knowledge to make a change.
We found that once targeting is made available more intuitively, organizations find value in empowering product, customer support, and even sales teams to control or whitelist features for specific customers.
Access to such changes should be regulated as part of any good security model, but simplifying the system increases the likelihood that it will be used.
Then comes the challenging question of deciding what to build with this new release platform. What does it matter if you're releasing new code 100 times per day, operating in one-week sprints, and crushing your deliverables? If you aren't building the right things, aren't you just making lousy software faster?
An incredibly common way to find your priorities is to conduct customer and market interviews. There are many companies for whom it is sufficient for their customers to be the sole source of feature prioritization.
After all, if you can keep your customers happy, that happiness often spreads.
It is worth noting, though, that customer requests are limited by their current experience.
They can help polish and optimize and identify gaps. However, true innovation often requires a spark of creativity, whose ownership should not live with your customers alone.
Then, obviously, we can look internally. Team members throughout the organization can suggest their best ideas, and those can be combined with customer requests.
To prioritize what should be done first, I'm a strong proponent of the impact-effort matrix. These should be scores filled out by as many team members as possible, helping to identify high-value targets and unproductive time sinks.
It is important to note, however, that humans are notoriously bad at estimating.
Trying to make a good guess at the amount of work required is a skill developed over entire careers, and knowing ahead of time what changes will be successful is surprisingly difficult.
In fact, the team at Microsoft's Bing search engine started off discovering that 80% to 90% of the features they shipped failed to have the success that they were expected to.
Even now, with a decade of experience, they report that more than half of experiments run have surprising results.
In the absence of meaningful evidence, it is common for prioritization to be driven from the top.
If you are just focused on making your boss happy, this isn't necessarily a bad thing, but rarely has any individual, no matter how senior, shown a perfect track record.
The secret, then, is that there is no secret.
True innovation is the result of trying things, often failing, sometimes succeeding. What is essential is to not have those attempts occur in a vacuum.
By collecting data on each attempt, you can inform future steps in a deep and meaningful way.
Successful avenues can be explored further, failures can be learned from and changed in subsequent trials, or abandoned.
But without meaningful data, and without a record of that process, you are left moving on gut instinct, a vague sense of what has happened before.
The real value of continuous delivery is the ability to continuously learn.
The goal is to fail fast, to learn faster, and to use that knowledge to steer your future decisions.
At this point, we have seen how feature flags can streamline the release process. We've reviewed some advanced techniques for releasing features across services, and we've explored how combining feature assignment data with your metrics can provide understanding not available otherwise.
Finally, we've talked about the process to run and prioritize experiments at your organization.
In DevOps, our job is never really done, but each year we can try to automate away the challenges of the year before.
I think tonight you all deserve a beer.
Thank you all for your time and attention, and thank you IT Revolution for inviting me to speak.
If you have any questions, I will be available to chat in the conference Slack channel.
If you've enjoyed this talk and want to learn more about the power of release platforms, please check us out at split.io, where you can learn how to kill the release night, how to automate your deliveries with data, and how to turn every feature release into an experiment.
Thank you, and enjoy the rest of the day.