From Flagging Releases, to Flagged Releases - A Story of Release Acceleration from Vodafone UK

Europe 2021

Robert Greville

Head of Web Engineering · Vodafone UK

David Ward

Head of Product Engineering · Vodafone UK

Natasha Wright

Engineering Senior Manager · Accenture

This session is presented by LaunchDarkly.

Full transcript

Robert Greville

Hey, and welcome to our talk, "From Flagging Releases to Flagged Releases: A Story of Release Acceleration from Vodafone UK." Welcome.

My name is Robert Greville, and I'm responsible for Vodafone UK's web engineering, and I'm going to tell you a story. A story of wonderment, peril, danger, and delight. A story in three parts: a beginning, a middle, and an end that takes us on a hero's journey from flagging releases to flagged releases.

Before we continue, let's first just do some housekeeping. I'll be joined by two other intrepid adventurers. Firstly, David Ward, who's responsible for our platform team, agile practice, automation engineers, and BAs. And I'll also be joined by Natasha Wright, who's been leading our CI/CD team from Accenture.

Please, join us on our quest.

Before we talk about our journey, let me first tell you about our humble beginnings. Vodafone was the first company to make a mobile phone call in the UK, the first company to introduce text messaging, and the first to introduce international roaming.

We're a company of firsts, and today we deliver a plethora of products and services, from mobile devices to home broadband. Our business serves over 600 million customers around the globe, with 18 million of those customers here in the UK. We're a global leader in IoT, serving over 90 million connections. That's more than anybody else.

We help our customers all around the globe stay connected. We passionately believe in the power of new communication networks and technologies to change our society for the better.

Have a look at this video that tells you a little bit about us.

[Video plays.]

This is not where our story starts. Once upon a time, we were fighting to make change. There was a call to adventure. The world was changing around us. The speed and rate of change was accelerating, and we needed to catch up.

We needed some help, something that could help us with our challenge on the road that lay ahead: our ability to release code, but with pace and with quality.

In the beginning, Vodafone didn't really have much of an engineering team. It didn't have developers, guilds, tribes, or communities. We had isolated parts of our business making change and coming together all at once. Releasing was becoming harder and harder each and every time that we tried.

As we began to grow and build our engineering squads, we began to swarm around ideas and tooling that would help us hit our objectives and solve those challenges.

We use Azure DevOps. It helps us with all of our agile needs and our CI/CD practice. It serves us really, really well for storing code, managing workflow, even documentation. And in the short term, we were using Azure DevOps to handle the turning on and off of code functionality through environment variables. But even that had its own challenges.

There were two culprits. All of our heroes needed to bundle work together and only go when all of them were ready. Getting work out became cumbersome: long, arduous, tiresome days and nights for our engineering legends.

Everything had to be done by a person, deploying code, usually late into the night. I fondly remember being gathered around a conference table in our office with fingers hovering over buttons ready to press, and then, once they were pressed, having to check everything manually, even placing orders in our shop ourselves to check that everything was working as we expected.

Outside of engineering, even people that wanted to make changes, such as our product management guild, relied on engineering support to turn those features on and off or run beta tests. There was no interface. The system was really unusable for non-technical users, so it meant that only developers could make changes or release.

Every change needed a release as well. Even something as simple as setting a value to true required a new release, meaning more time being spent on administrative tasks rather than actually delivering customer value.

So as this solution grew, we needed to deploy changes to production more quickly, but we were already getting further and further ahead of our stack. We needed to get more releases out, but maybe, just maybe, without our customers being able to see those changes.

So surely, there must have been a better way. I'm going to hand now over to our middle section with David Ward.

David Ward

So here we were releasing once per quarter. We had a strategy for releasing in place, and that strategy was there to protect us. We'd been scared to release, frightened at the thought of causing an outage and having to spend endless hours in war rooms debating where the issue had occurred, what the root cause was, and ultimately, who was going to get the slap on the wrist this time for being the cause of the issue.

Our strategy was built on blue-green deployments. We had two identical versions of our entire platform in production running, with only one of those receiving traffic from our customers at any point in time. A release would involve pushing changes to the dormant production environment, running sanity tests to check for stability before then performing a flip of website traffic to the newly updated production environment.
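The flip described here can be sketched in a few lines of Python. This is an illustrative model only, not anything shown in the talk: the class name `BlueGreenPlatform`, the version strings, and the `sanity_check` callback are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenPlatform:
    """Two identical production environments; only `live` receives traffic."""

    versions: dict = field(default_factory=lambda: {"blue": "v1", "green": "v1"})
    live: str = "blue"

    @property
    def dormant(self) -> str:
        # The environment that is not currently serving customers.
        return "green" if self.live == "blue" else "blue"

    def release(self, version: str, sanity_check) -> bool:
        """Deploy to the dormant environment, test it, then flip traffic."""
        target = self.dormant
        self.versions[target] = version
        if not sanity_check(target, version):
            return False  # traffic never moved, so customers are unaffected
        self.live = target
        return True

    def flip_back(self) -> None:
        """Roll back by pointing traffic at the previous environment."""
        self.live = self.dormant


platform = BlueGreenPlatform()
platform.release("v2", sanity_check=lambda env, version: True)
print(platform.live, platform.versions[platform.live])  # green v2
platform.flip_back()
print(platform.live, platform.versions[platform.live])  # blue v1
```

The sketch also hints at the limitation discussed later in the talk: the unit of release is the whole environment, so every change from every team rides on the same flip.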

These flips had come to protect us. If we ever had an issue with a release, we would simply flip back, and our chastening for causing a release outage would be significantly less. So the flip was there to protect us, and protect us it did for a time.

There were successes. Successes of decoupling that came with a new microservice architectural design. Successes of improved availability at lower cost from using cloud hosting. Successes of greater throughput and efficiency than ever from an improving agile delivery process. And successes of an improved release cadence. We were now able to release our platform every two weeks.

But these successes would also prove to be the doom of our existing release process. We were delivering business value faster than ever, and as a result, we were being asked to scale more than ever. The flip release process that worked well for 10 teams would not work for 20 teams.

We were now flipping every sprint, so every two weeks. That meant within any given two-week period, if any team wanted to deploy any change to any one of their services, they needed to have completed their development work and deployed that change to a dedicated platform environment for regression testing by day seven of the sprint. Our end-to-end test team would then have a few days to run full regression, triage, and resolve any defects with our development teams in order to ensure that the release was stable and could be flipped.

With more teams, more services, and more demand to deliver, we were pushing more and more into each flip to be released. These flips were fast becoming unwieldy, with many, many changes across many different microservices all going out in a single flip. Our releases had become microlithic.

The more we shoved into each flip, the harder it was for us to pass regression testing in the two-week period. Any issues with releases meant, yes, we could flip back, but this would only exacerbate the problem and build up more changes. With more changes in each release, it became harder and harder to triage issues as well. Our release cadence began to slow, and it became clear that we needed a new solution.

The solution began with something small, a seed laid out in the form of an OKR for our development teams to be challenged by, and most importantly, to own for themselves: each team has to release just once in the next three months independently from any other team.

Some teams succeeded, some teams did not. But the important thing was that each team tried and found out for themselves where the friction in releasing independently existed for them. We really immersed ourselves as a department in this OKR and celebrated each and every independent release. But this isn't a story about OKRs. That's a tale for another time.

We wanted to release smaller and smaller, faster and faster, but soon came to an issue that was common across much of our platform. Our automated testing was relatively immature, and we didn't have our test suites running in an automated fashion that would give us the confidence that we weren't breaking production. We also had a multitude of different feature flagging solutions, making it very difficult to not only coordinate end-to-end integration testing of our larger deliverables, but also release those in an independent manner.

We needed a solution that didn't involve putting the brakes on our business commitments and retrofitting automation everywhere. LaunchDarkly would be that solution for us, and we came up with a strategy to ensure that we weren't just going to add yet another feature flagging solution into the mix. We were going to ensure that this would be one feature flagging solution to rule them all.

The first challenge that we had to overcome was one many would have come across before. We had to deliver the integration and adoption of LaunchDarkly without impacting our current business commitments. In order to achieve this, our strategy was to spin up a standalone team, a fellowship, if you like, that would not only be responsible for being our LaunchDarkly gurus, but they would also be responsible for visiting each of our teams and migrating their existing feature flagging solutions over to LaunchDarkly before pull requesting the migration back into each team for them to own.

This worked really well for us, and as an unintentional benefit, helped us to discover a few of the pain points involved with external contributions to our various services. We continue to use these learnings as we try and drive a culture of inner sourcing.

Before any pull request could even begin to be thought about, our fellowship of LaunchDarkly gurus had to go on an epic journey themselves. They created wrapper libraries as an abstraction layer, ensured everything in LaunchDarkly was configured using Terraform, and discovered and documented how we could organize our feature flags, from our naming conventions to parent-child flags. They aligned and created our environment strategy in LaunchDarkly, and sorted out ACLs, resilience, flag hygiene, and maintenance.
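As a rough illustration of what a wrapper library acting as an abstraction layer might look like, here is a minimal Python sketch. It is not the team's actual library: `FlagProvider`, `InMemoryProvider`, `FeatureFlags`, and the `<team>.<feature>` naming convention are hypothetical names chosen for the example, and a real adapter would delegate to the vendor SDK instead of a dictionary.

```python
from typing import Protocol


class FlagProvider(Protocol):
    """Anything that can evaluate a flag; a vendor SDK adapter would implement this."""

    def variation(self, key: str, default: object) -> object: ...


class InMemoryProvider:
    """Stand-in provider for tests and local development (hypothetical)."""

    def __init__(self, values: dict) -> None:
        self._values = values

    def variation(self, key: str, default: object) -> object:
        return self._values.get(key, default)


class FeatureFlags:
    """The wrapper that service code imports instead of a vendor SDK."""

    def __init__(self, provider: FlagProvider, team: str) -> None:
        self._provider = provider
        self._team = team

    def is_enabled(self, feature: str, default: bool = False) -> bool:
        # Assumed naming convention: "<team>.<feature>".
        key = f"{self._team}.{feature}"
        try:
            return bool(self._provider.variation(key, default))
        except Exception:
            # Resilience: fall back to the safe default if the provider fails.
            return default


flags = FeatureFlags(InMemoryProvider({"checkout.open-banking": True}), team="checkout")
print(flags.is_enabled("open-banking"))  # True
print(flags.is_enabled("new-basket"))    # False
```

Keeping the naming convention and the failure behaviour inside the wrapper means each service gets them for free, and swapping the provider does not touch service code.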

It was quite a journey and one that we're just about to start reaping the benefits from. It's at this point in our story that I hand over to Tash to tell you a bit more about some of that journey, starting with the life of a flag.

Natasha Wright

Hello, everyone. My name is Natasha Wright, and I work as part of an engineering team at Vodafone Digital, looking after the LaunchDarkly platform, which we use as a feature flagging solution.

Here at Vodafone, we adopt an everything-as-code, DevOps-centric approach in everything we do, and the LaunchDarkly platform is no exception. We manage that platform using configuration as code at all layers of the stack, whether that be the projects that we deploy to LaunchDarkly, the environments within them, or even the flags themselves.

So I want to spend a bit of time today talking to you about that approach, how it's used, how we structure it, and then talk a little bit about how we use the LaunchDarkly product here at Vodafone and how it integrates with our development practices and our operational tooling.

So first, I'm going to talk about the configuration that we have for LaunchDarkly. As I mentioned, we're using an everything-as-code approach, and here in this instance, we're using Terraform to manage our LaunchDarkly platform. LaunchDarkly ships with a Terraform provider, so it seemed like the obvious choice for us to manage the product from the ground up using an everything-as-code approach.

We're using Azure DevOps as our DevOps solution, so we use that for Git version control as well as CI and CD. So I'm just going to take you through our Terraform repository now.

From a Terraform perspective, we manage our projects, our environments, user roles and permissions, API and access keys, and even the flags themselves using Terraform configuration.

When we look at our projects, we're using Terraform maps to manage all of that information. Our projects are logical divisions that we have here for development, and it means that we can assign environments to projects, whether those environments are test environments or development environments. Given that this is all in Terraform, we can continuously deploy this without impacting any of the work that's actually going on.

Here, as you can see from my screen, we have a particular project called Vodafone Consumer, and then a number of environments underneath it. Just to flip to my LaunchDarkly console now, just to show you what this looks like. Here on the side of my screen, we see a number of projects that I've got set up, including the environments underneath each of these. And you can see these are all isolated entities within LaunchDarkly. So deploying a flag or turning a flag on and off in one environment in one project doesn't actually impact another one.
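The map-driven layout can be pictured with a small Python sketch. The real configuration is Terraform and is not reproduced in the transcript, so everything here is an assumption for illustration: the project keys, the environment keys such as `dev1`, and the `environment_resources` helper. It shows how a nested map expands into one isolated entry per project and environment pair.

```python
# Hypothetical data, loosely mirroring a map variable of projects.
projects = {
    "vodafone-consumer": {
        "name": "Vodafone Consumer",
        "environments": ["dev1", "test1", "production"],
    },
    "example-project": {  # invented second project for the example
        "name": "Example Project",
        "environments": ["dev1", "production"],
    },
}


def environment_resources(projects: dict) -> dict:
    """Flatten the map into one entry per (project, environment) pair."""
    return {
        f"{project_key}/{env}": {"project": project_key, "environment": env}
        for project_key, project in projects.items()
        for env in project["environments"]
    }


resources = environment_resources(projects)
for address in sorted(resources):
    print(address)
```

Because the address includes the project key, `dev1` in one project and `dev1` in another are separate entries, which matches the isolation described above.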

We're also managing custom roles in LaunchDarkly, and we have a number of roles defined. LaunchDarkly ships with some roles by default, but for our development and, more importantly, our release practices, we needed more fine-grained custom permissions, particularly to get a good separation of who can do what in production.

So we actually created our own RBAC model in Terraform, which we define here in this repository, so we have different permissions for different roles. To take you through what this looks like in LaunchDarkly itself, I'm going to switch back to my LaunchDarkly screen. If I click on Account Settings and then Roles, what we can see are the custom roles that we've created. We have roles for our engineers, and more importantly, we have roles for our operational teams and our release managers. These are the people that can flip the flags on and off for production, meaning that we do have that logical separation of least privilege between an engineer that's developing a new feature and our release manager or our business manager that wants to turn on a new feature in production for our consumers.
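The least-privilege separation can be illustrated with a deny-by-default check in Python. This is a sketch under assumptions: the role names, action names, and environment keys are invented, and the structure is not LaunchDarkly's custom role policy syntax.

```python
# Hypothetical roles and permissions; this is not LaunchDarkly's policy syntax.
ROLE_PERMISSIONS = {
    "engineer": {
        "create-flag": {"dev1", "test1"},
        "toggle-flag": {"dev1", "test1"},
    },
    "release-manager": {
        "toggle-flag": {"dev1", "test1", "production"},
    },
}


def is_allowed(role: str, action: str, environment: str) -> bool:
    """Deny by default: only explicitly granted combinations pass."""
    return environment in ROLE_PERMISSIONS.get(role, {}).get(action, set())


print(is_allowed("engineer", "toggle-flag", "production"))         # False
print(is_allowed("release-manager", "toggle-flag", "production"))  # True
```

Unknown roles or actions fall through to an empty set, so anything not granted is refused.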

Next, I want to talk a little bit about the flags themselves. We're actually managing all of our flags in Terraform. We've broken down our flags into different Terraform files for each of our feature teams, which means that each feature team can build and develop its own flags for its own pipeline of work that it has.

I'm just going to click on one of these now and show you what this looks like. We're actually using a map for all of our Terraform variables that we pass in. Each of our feature flags that we create will have a unique key. It'll also have a variation type. This basically dictates the type of flag that we're going to deploy and if it's going to have a true or false value. This is great when we have a block of code and we want to determine: should it be executed, yes or no?

But there's also other types of variation that we can leverage, and we are doing so here. In particular, when we think about our front-end interfaces, and maybe we want to toggle the look and feel of a particular component or change some inputs going into it. So we actually have an example here of some flags which have a variation type of number. What this means is that when that flag is turned on or off, a different input is passed into it, one that's not true or false.
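To make the two variation types concrete, here is a small Python sketch of a per-team flag map with a validation step. The `FlagDefinition` class and the `carousel-item-count` flag are hypothetical; only the idea of a unique key plus a boolean or number variation type comes from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagDefinition:
    key: str
    variation_type: str  # "boolean" or "number"
    variations: tuple

    def validate(self) -> None:
        expected = {"boolean": bool, "number": (int, float)}[self.variation_type]
        for value in self.variations:
            # bool is a subclass of int, so reject it explicitly for number flags.
            if not isinstance(value, expected) or (
                self.variation_type == "number" and isinstance(value, bool)
            ):
                raise ValueError(f"{self.key}: {value!r} is not a {self.variation_type}")


# One map per feature team, keyed by the unique flag key.
team_flags = {
    "open-banking": FlagDefinition("open-banking", "boolean", (True, False)),
    "carousel-item-count": FlagDefinition("carousel-item-count", "number", (3, 5)),
}

for flag in team_flags.values():
    flag.validate()
print(sorted(team_flags))
```

A boolean flag answers "should this block run?", while a number flag hands the component a different input depending on which variation is served.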

So those are the roles, the flags themselves, and our projects. Now, how do our applications actually pick up these flags for use? That's managed via SDK or API keys, and we handle all of those using Terraform as well. When we run our pipelines to create new environments and projects in Terraform and deploy our flags, unique keys are generated, and these need to be injected into our applications.

Now herein we have a problem, because each key is a piece of secret material that allows our apps to access LaunchDarkly. We use AWS services for all of these items of secret material. In this particular case, we're using AWS SSM Parameter Store to store these valuable tokens.

We actually use Terraform not only to grab those keys directly from LaunchDarkly, but also to persist them to AWS itself. We have a for loop here on the screen in front of us. Basically, we grab every SDK or API key that LaunchDarkly generates and persist it into AWS Parameter Store at a location that our applications can pick it up from. This is great because it means that none of our engineers ever actually have to manually handle secret material, and the pipelines will do it all for us.
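The loop on screen is Terraform and is not reproduced in the transcript, but its shape can be sketched in Python. The parameter path convention, the function names, and the example key are assumptions; a dictionary stands in for Parameter Store so the example runs without AWS credentials.

```python
from typing import Callable


def parameter_name(project: str, environment: str) -> str:
    """Assumed path convention; the talk does not show the real one."""
    return f"/launchdarkly/{project}/{environment}/sdk-key"


def persist_sdk_keys(sdk_keys: dict, put_secret: Callable[[str, str], None]) -> list:
    """Write every generated key to the secret store, one parameter per environment."""
    written = []
    for (project, environment), key in sorted(sdk_keys.items()):
        name = parameter_name(project, environment)
        put_secret(name, key)
        written.append(name)
    return written


# A dictionary stands in for Parameter Store here. With boto3, put_secret could wrap
# ssm.put_parameter(Name=name, Value=value, Type="SecureString", Overwrite=True).
fake_store = {}
names = persist_sdk_keys(
    {("vodafone-consumer", "dev1"): "sdk-example-not-a-real-key"},
    put_secret=fake_store.__setitem__,
)
print(names)  # ['/launchdarkly/vodafone-consumer/dev1/sdk-key']
```

Injecting `put_secret` keeps the loop testable and means no person ever has to copy a key by hand, which is the point being made above.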

So that moves me quite nicely on to our pipelines. While we might have all of our code in Azure DevOps, this doesn't help us when we want to deploy it because it takes multiple Terraform phases that we need to look at here, deploying multiple different types of configuration. Thankfully, we have an automated pipeline, which we leverage to deploy all of these changes to our LaunchDarkly environment.

In the first instance, we have a feature flags Terraform debug pipeline. In short, this is a pipeline that's run anytime anybody makes a pull request or a merge into our master branch so that we can determine the changes that have been made are good changes. This is also a pipeline that can be invoked manually by our engineers when they want to test any changes that they're making.

So what I'm going to do now is I'm just going to click run and run that pipeline, and show you what it does. In the first instance, the pipeline will grab all of our latest Terraform artifacts. It'll bundle them up, and it'll publish them to the pipeline itself. Then the second stage is what we call a debug stage. What this will actually do is execute a Terraform plan using those Terraform artifacts that have been downloaded from Git. Then the pipeline itself will publish a plan file. We can have that as a version-control-tracked artifact that comes out. So if we ever introduce any bad Terraform changes, we can look back through our plan files and understand exactly which debug run it was, or which version control change, or which commit ID actually introduced that bad change.

Given that I'm running this Terraform plan against our master branch, which is already deployed to our LaunchDarkly instance, I should see this run and tell me that no changes will be required. This job normally takes about 20 to 30 seconds to run, so it should be wrapping up any second now. Then we can actually see the output of running this job. Now, as I said, this is something that any of our engineers can run, and it also runs when we want to make merge requests into our master branch to ensure that the code changes are correct.

If I just click on Terraform Plan here and go right down to the bottom, we can see it says, "No changes. Infrastructure is up to date."

So now if I go back to my pipelines, I can see that that's my debug pipeline, which is great, but I also have a DX feature flags Terraform pipeline, which is actually related to the code in my repository. This is the one that's actually going to deploy the code changes that I have there. If I just click Run Pipeline now, this will deploy everything that's in my master branch. If I click Run, this will kick off this pipeline.

In terms of the stages, the first stage is very similar. It grabs all the latest artifacts and creates a build, which it then publishes. Now we have a slightly different second phase. It's a deploy LaunchDarkly phase rather than a plan. What this second phase actually does is connect to our Terraform backend for our particular LaunchDarkly instance. We have a state file, which we keep in an Amazon S3 bucket. What happens is our pipeline will connect to that and essentially do a comparison between what's in the S3 bucket and the Terraform code that we've got, and then execute the delta of anything that's required. This executes the Terraform files, and then it will give us an output on screen to tell us any changes that are made.
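The difference between the debug (plan) pipeline and the deploy pipeline comes down to computing a delta versus executing it. The Python sketch below is a simplified model of that idea, not Terraform's actual algorithm; the `plan` and `apply` functions and the flag data are illustrative assumptions.

```python
def plan(state: dict, desired: dict) -> dict:
    """Compare recorded state with the desired configuration, like a plan step."""
    return {
        "create": sorted(desired.keys() - state.keys()),
        "update": sorted(k for k in desired.keys() & state.keys() if desired[k] != state[k]),
        "delete": sorted(state.keys() - desired.keys()),
    }


def apply(state: dict, desired: dict) -> dict:
    """Execute only the delta and return the new state, like a deploy step."""
    changes = plan(state, desired)
    new_state = dict(state)
    for key in changes["create"] + changes["update"]:
        new_state[key] = desired[key]
    for key in changes["delete"]:
        del new_state[key]
    return new_state


state = {"open-banking": {"type": "boolean"}}
desired = {
    "open-banking": {"type": "boolean"},
    "carousel-item-count": {"type": "number"},  # hypothetical new flag
}

print(plan(state, state))    # nothing to do when code and state already match
print(plan(state, desired))  # one flag to create
state = apply(state, desired)
print(plan(state, desired))  # empty again after the deploy
```

Running the plan against an already-deployed branch yields an empty delta, which corresponds to the "No changes" message seen in the demo.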

So this is great. We've now got two pipelines. The first gives our developers the ability to make changes directly in the Terraform code for any new flags that they would like to create, and to run the debug pipeline first so they can see any changes that might happen. The second is a deployment pipeline, which allows us to push those changes in an automated way, so nobody has to interact manually with our LaunchDarkly instance. This is also great if we ever needed to create a brand new LaunchDarkly instance from scratch. We've got all of our configuration as code, so we can simply execute it against that instance.

So how do we actually use LaunchDarkly day to day from a development and operational perspective? We have a number of automated bots and apps that are deployed to Slack, which we use as a collaboration tool. In particular, we have a LaunchDarkly bot that's deployed to Slack, which allows our engineers and developers to manage not only LaunchDarkly, but the flags deployed to LaunchDarkly through Slack.

If I go to a particular Slack channel here, I can execute `/launchdarkly`, and what that'll do is it'll give me a help message to direct me to the various commands and functionality that LaunchDarkly has through Slack. Our developers will often use this as a way to manage flags in the development environments in a programmatic way to turn flags on and off.

If I wanted to have a look at a particular flag, I've just got a pre-flag to save me having to type it. If I just put that here, here is a flag that I'm looking at. It's in one of our development environments. It has a particular name, this is the environment it relates to, and I can see that this flag is currently off. If I want to have a look at this flag in the LaunchDarkly console, I can basically just switch back to my LaunchDarkly console, and I can switch here to feature flagging, make sure that I've selected the right environment. So if I go to Vodafone Consumer and then dev one, and if I look for that same feature flag, open banking, I can see that it's currently tracking as off.

Using the auto-deployed Slack bot, I can actually turn that flag on. If I click confirm, we can see the targeting for that flag has been changed to on. If I now refresh my LaunchDarkly window, I can see that that flag is now on. This is great because it means that our developers can handle and manage flags all through their collaboration tools. It also helps that lots of our incident management tools plug into Slack, meaning that we can acknowledge incidents and turn flags on and off, all using that single tool, which is both persistent and gives us some traceability over who's made what changes.

To talk a little bit more about Slack and what we use it for, we also have a number of other integrations, things like when flags are deployed across our environments. We've got an example here that shows that LaunchDarkly had a new flag created across all the environments. Similarly, we also have traceability of user changes that are made. We have a deployment alert here that shows me that a user actually turned a flag on in our production environment. This is great because we have this traceability, and we can see when people are making changes.

Additionally, we have other integrations which are set up with the LaunchDarkly platform, in particular with Datadog. We're using Datadog for all of our monitoring, data aggregation, custom metrics, and alerting, and we have this integration set up by default in the LaunchDarkly platform itself.

If I switch over to Datadog, we have a LaunchDarkly dashboard. If I click on that now, I'll be taken to a dashboard which shows me a number of events that LaunchDarkly is automatically sending to Datadog. This is using the default integration, and we've configured a custom policy to only send certain types of events. In particular, we have production events that we can see here, so we can see when people are changing flags on and off in production. Similarly, we can also see when people are changing flags on and off in development environments, and we can see the event here that I just generated by turning a flag on and off.
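A policy that forwards only certain event types can be thought of as a simple filter. The Python sketch below is an assumption-laden illustration: the event kinds, the field names, and the `should_forward` function are invented and do not reflect the real integration's configuration format.

```python
# Hypothetical forwarding policy: which change events leave the flag platform.
POLICY = {
    "environments": {"production", "dev1"},
    "kinds": {"flag-toggled", "flag-deleted", "environment-deleted", "project-deleted"},
}


def should_forward(event: dict, policy: dict = POLICY) -> bool:
    """Forward an event only if both its environment and its kind are listed."""
    return event["environment"] in policy["environments"] and event["kind"] in policy["kinds"]


events = [
    {"kind": "flag-toggled", "environment": "production", "flag": "open-banking"},
    {"kind": "flag-comment-added", "environment": "production", "flag": "open-banking"},
    {"kind": "flag-toggled", "environment": "scratch", "flag": "open-banking"},
]

forwarded = [event for event in events if should_forward(event)]
print(len(forwarded))  # 1
```

Filtering at the source keeps the dashboard focused on the changes that matter operationally, such as toggles in production.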

I'd also like to draw your attention to some custom monitors that we have here. There are certain things about LaunchDarkly where we want to track when and why they're being changed. In particular, we want to know if people are deleting projects or environments, or changing production values. We have a number of monitors here. This is set up in Datadog, so it integrates with all of our existing operational processes, tooling, and incident management processes and tooling. This integration allows us to manage LaunchDarkly in the same way that we manage our applications and production environments.

All of these alerts, again, also pipe through to Slack, where we have some further integration. We have a dedicated channel, which we've set up, called LaunchDarkly Notifications. This channel tells us any time that one of these monitors changes from green to red, i.e., goes from good to bad. So we can see when people are changing things in production. We can see when flags are being deleted. This again allows us to build up a story of when we've got changes to our environment, all integrating with our existing development processes and all of our incident management processes.
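Notifying only when a monitor goes from green to red is an edge-triggered check: compare the previous status with the current one and report only the new alerts. The Python sketch below assumes invented monitor names, status strings, and a channel slug; it is not the Datadog or Slack API.

```python
def newly_alerting(previous: dict, current: dict) -> list:
    """Monitors that were OK (or unseen) last time and are ALERT now."""
    return sorted(
        name
        for name, status in current.items()
        if status == "ALERT" and previous.get(name, "OK") != "ALERT"
    )


previous = {"production-flag-changed": "OK", "project-deleted": "OK", "flag-deleted": "ALERT"}
current = {"production-flag-changed": "ALERT", "project-deleted": "OK", "flag-deleted": "ALERT"}

for name in newly_alerting(previous, current):
    # Channel slug is an assumption based on the channel name mentioned in the talk.
    print(f"#launchdarkly-notifications: {name} went from green to red")
```

A monitor that was already red does not notify again, which keeps the channel readable as a timeline of changes.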

With that being said, that's everything I really wanted to show you today about LaunchDarkly and how we're using it. I look forward to any questions. Thank you very much.