Our Journey Towards Progressive Delivery
This session will talk about how the IBM Kubernetes Service builds and deploys micro-services to Kubernetes clusters across the globe.
Using standard tools and services such as GitHub, Travis CI, Kubernetes, and LaunchDarkly, IBM is able to deploy hundreds of code changes daily to thousands of clusters spread across the globe. See how, by updating our development culture, adopting progressive delivery, and bringing our environmental configuration under control, we transformed our deployment pipeline from a slow monolith to a fast and agile set of micro-services.
This session is presented by LaunchDarkly.
Full transcript
Michael McKay
Hello, my name is Michael McKay. I'm the delivery lead for the IBM Kubernetes Service. I want to thank you for attending my session here at the DevOps Enterprise Summit, and I'd also like to thank the organizers of this event, especially for having it in October, which gives me the opportunity to show off some of my Halloween decorations. You can see my background here.
Before we get started, just a little bit about who I am. I've been with IBM for 24 years. I just verified that this morning on LinkedIn, so I'm pretty sure it's true. I started my career at IBM helping to manage the IBM internal SAP deployments. Then I spent a sizable portion of my career building products for IBM Tivoli. And then finally, for the past seven years, I've been building and helping to operate the IBM Cloud.
What's cool about what I'm doing today is that it's really a combination of what I did early on in my career, helping to manage those SAP environments, and also building code and building products for IBM Tivoli. So basically now, I write code, and I have to run it and operate it as well.
One other tidbit about myself is that I have four kids and three cats. So most of the time, I find managing and juggling delivery at scale to be much, much easier than what I have to deal with at home, especially during the days of COVID here.
So, a bit about the IBM Kubernetes Service. We serve as the basis for most of the services at IBM Cloud. We're currently running in six regions across the globe and 35 data centers. We have over 110 control plane clusters to run our service. And those 110 control plane clusters will support well over 20,000 Kubernetes clusters for our customers. So on each one of these clusters that we operate and manage, and also our control plane clusters, we have various pieces of code, microservices, configurations that we have to ensure get pushed out and managed appropriately.
So just a little bit of background on how we got here. For one, just lots and lots and lots of trial and error. Because of all that trial and error, we learned a lot of things. We learned things that we should do. We learned a lot of things that we shouldn't do. And also, which is equally as important, we learned things that are just not important. Finally, being willing to take some risks and try out something new has been very, very beneficial for us, and it's unfortunately something that can be more difficult in large corporations such as IBM.
So when we started off four years ago, we really just had a small team here in the US, and we had all of our code deployed in one data center in Dallas. At that point, we just wanted to build something. And we just wanted to build something that worked. We didn't really put much thought into scale or how we were going to operate this thing, just basically what was going to be running today, and can we make sure that same thing is still running tomorrow with some added pieces to it.
Most of the team had a development background. And because of that, we still treated what we were building as a product, like a code delivery product, something akin to, I want to take these DVDs, stamp them out, and send them to my customers once a quarter. We weren't really used to running a service. Because of that, we were really used to delivering features over the course of months, not days or weeks like we do today.
I've actually updated this slide quite a few times in the past. I used to talk about culture change, and that's really not appropriate. I really talk about culture improvement, because we're not really changing our culture. What I mean by that: we're not just flat out ripping out what we had in culture and replacing it with something new. We're just really improving upon what we had before, because believe it or not, we still had a lot of things that we did in our culture that are equally as valuable today as they were when we first started this four years ago.
When we talked about culture improvement, I used to have a big, long list of things that we did, but what I find most important is what I've been calling democratizing DevOps. What this means is that, and I'm sure a lot of you have been in a similar situation, you've had a team, and that team was really led by just one or two or just a handful of folks that kind of controlled everything. They would control what we were working on, how it got built, things like what code are we using, which libraries are you using, which code editors, who's working on what and when and how.
So moving more towards a DevOps democracy, as I've been calling it, basically means that every engineer should be equally involved and have a say in how they code, how they build, how they test, and not only that, but they also will all be part of building or deploying code into production. And this does wonders for how we build and operate the IBM Kubernetes Service. For one, it helped us to remove lots and lots of bottlenecks.
For example, in the past, we just had a couple of guys who would solely be responsible for pushing code out to production. This was not only slow and cumbersome, but this also meant that those two guys were then tied up and unable to do anything else. So in today's world, the way we do this is that every engineer is part of a squad, and each squad is now responsible for building, deploying, and delivering their own microservice.
The next step we did was really change the way we think about continuous integration and continuous delivery. We always say that it's really hard to do progressive delivery if you can only deliver once a week. So our previous approach to doing CI and CD was like most enterprises. We had lots of Jenkins jobs and lots and lots more Jenkins jobs, and we kind of had this, I don't really want to call it a Rube Goldberg project, but we had Jenkins jobs calling other Jenkins jobs, calling yet other Jenkins jobs, and believe it or not, I think we still called additional Jenkins jobs from that. So the jobs themselves were these huge monoliths, just like we were building and deploying our code.
Those big monoliths would be responsible for everything after the developer checked in code. That code would get built, that code would get tested, that code would get promoted to an environment, it'd run some more tests, it would get promoted to the next environment, run some additional tests. The whole process for building and deploying the IBM Kubernetes Service as a whole would take about three hours. And because of that, we would only be able to deliver once or twice a week.
So as part of rethinking the CI/CD process, we had to set a few ground rules. For one, it had to be fast. It had to be scalable. Our intent was to reduce friction, and finally, we wanted to get the users involved in the process. So in the past, IBM used to have a term called visibility, control, and automation. I'm a huge fan of the visibility. I'm also a huge fan of the control. I'm not so much of a fan of the automation. I think automation does have its place, but in our process, what we found is that too much automation took our developers out of the equation, and therefore they became less a part of the process, and they didn't understand what had happened or what to fix when something would go wrong.
So part of our new solution, our CI process, works like most other organizations' now: you have your code checked into GitHub. That code, once it's checked in, will get built by Travis CI. Travis will run tests, lint the code, et cetera. Then it builds the images, uploads the images to our image registries, and uploads all of our related Kubernetes artifacts to cloud object storage. And finally, we update a feature flag service we use called LaunchDarkly. And then we use LaunchDarkly to help deliver and push out our code to all the environments.
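The ordering of those CI steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not call Travis CI, an image registry, cloud object storage, or the LaunchDarkly API. The function names `run_pipeline` and `publish_version`, the list `available_versions`, and the commit `abc123` are all hypothetical. The point it shows is that a new version only becomes selectable for deployment after every earlier stage has passed.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the versions selectable on a deployment flag.
available_versions: List[str] = []


def publish_version(commit: str) -> bool:
    """Record a built version so it can later be chosen for rollout."""
    available_versions.append(commit)
    return True


def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, step in stages:
        if not step():
            break
        completed.append(name)
    return completed


# Each lambda is a placeholder for real work done by the CI system.
stages = [
    ("run tests and lint", lambda: True),
    ("build images", lambda: True),
    ("upload images to registry", lambda: True),
    ("upload Kubernetes artifacts to object storage", lambda: True),
    ("register version on deployment flag", lambda: publish_version("abc123")),
]

completed = run_pipeline(stages)
```

If any earlier stage returns `False`, the final stage never runs, so a broken build is never offered as a version to deploy.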
So that moves on to the next step, which is the continuous delivery portion of our overall pipeline. And this is where we have the most new ideas. My team is actually called the Razee Squad. So we've kind of invented some new pieces of technology to help us deliver at scale. I've done some talks on this in the past, but I'll briefly cover what we do here today.
So on each of those over 20,000 clusters, we have a small piece of code, our cluster updater. Our cluster updater code will interact and talk back to LaunchDarkly. In LaunchDarkly, we can set rules which define how and when code is rolled out to our environments. What happens is that users check in code. That code is built, uploaded to the appropriate repositories, LaunchDarkly is updated. Then when the users want to deploy it, they go into LaunchDarkly, they find the deployment flag for their particular microservice, and then we have a set of rules for each of these microservices. Most of the rules are by region.
So then we can update that rule that says, "Hey, for AP South, I want to deploy version XYZ." And as soon as you select that new version, the cluster updater, which contains a LaunchDarkly SDK, will be notified, pull that down, and deploy that to as many clusters as are in AP South. So this process works the same whether we're pushing out code to one, 10, 100, or 10,000 clusters. We use the exact same process.
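To make the per-region rule idea concrete, here is a minimal Python sketch. It is an assumption-laden model, not the actual cluster updater and not the LaunchDarkly SDK: `DeploymentFlag`, `Cluster`, `version_for`, and `reconcile` are hypothetical names, and the region and version strings are made up. It shows each cluster pulling the version that its region's rule selects, so changing one rule affects every cluster in that region and nothing else.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DeploymentFlag:
    """Hypothetical stand-in for one microservice's deployment flag."""
    service: str
    default_version: str
    # Maps a region name to the version that region should run.
    region_rules: Dict[str, str] = field(default_factory=dict)

    def version_for(self, region: str) -> str:
        """Return the version a cluster in `region` should run."""
        return self.region_rules.get(region, self.default_version)


@dataclass
class Cluster:
    cluster_id: str
    region: str
    running: Dict[str, str] = field(default_factory=dict)  # service -> version

    def reconcile(self, flag: DeploymentFlag) -> bool:
        """Pull the desired version and apply it; return True if it changed."""
        desired = flag.version_for(self.region)
        if self.running.get(flag.service) == desired:
            return False
        # A real updater would apply Kubernetes resources here.
        self.running[flag.service] = desired
        return True


flag = DeploymentFlag(service="api", default_version="1.0.0")
clusters = [
    Cluster("c1", "ap-south"),
    Cluster("c2", "ap-south"),
    Cluster("c3", "us-east"),
]

# Initial rollout: every cluster picks up the default version.
for cluster in clusters:
    cluster.reconcile(flag)

# Updating one rule changes the desired version for that region only.
flag.region_rules["ap-south"] = "1.1.0"
changed = [c.cluster_id for c in clusters if c.reconcile(flag)]
```

The loop over `clusters` is the same whether the list holds three entries or thousands, which mirrors the point that the process does not change with the number of clusters.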
And on top of that, once that cluster updater has delivered that code, or delivered those new Kubernetes resources to those clusters, it will also send up basically the current state of the cluster, including which deployments are running, which configurations are applied, up to a service we call RazeeDash. The RazeeDash component then lets us view the current state of the clusters, what's currently deployed. What's important here is that we actually show what's currently deployed on the Kubernetes cluster, not just what we thought we deployed. We've had this trouble in the past where we would have a job which would push code to a cluster, so we assume that version XYZ was running. But then we actually looked on the cluster, we found out version 123 is running.
So for whatever reason, maybe the deployment failed, maybe someone logged directly onto the cluster and updated the version. Whatever the reason, we didn't have an accurate representation of what was running on these clusters. So part of the cluster updater then will provide that visibility and provide an accurate inventory of what's running on all 20,000-plus clusters that we operate.
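The gap between "what we thought we deployed" and "what is actually running" can be expressed as a simple comparison. The sketch below is illustrative only and is not RazeeDash code: `find_drift` and the sample data are hypothetical, with the version names borrowed from the example in the talk.

```python
from typing import Dict, Optional, Tuple


def find_drift(
    intended: Dict[str, str], reported: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Compare intended versions with what a cluster actually reports.

    Returns {service: (intended_version, reported_version)} for every
    mismatch. A service missing on one side appears as None on that side.
    """
    drift = {}
    for service in sorted(set(intended) | set(reported)):
        want = intended.get(service)
        have = reported.get(service)
        if want != have:
            drift[service] = (want, have)
    return drift


# We assumed version "xyz" of the API was running; the cluster reports "123".
intended = {"api": "xyz", "ui": "2.0"}
reported = {"api": "123", "ui": "2.0"}
drift = find_drift(intended, reported)
```

Because the comparison uses the state the cluster itself reports, it catches both failed deployments and manual changes made directly on the cluster.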
So the summary of this is that we basically switched from a push model, where we would have these Jenkins jobs pushing code out to the environments, to a pull model now where we take advantage of technologies like cloud object storage and LaunchDarkly to help deliver our code at scale.
The next step that we took here is, okay, we've got this new toy, we called it at the time, called LaunchDarkly. And it's one of those things where we just kind of think, what other ways could we use LaunchDarkly? We're already kind of using LaunchDarkly not the way it was intended today, using it not as feature flags, but as deployment flags. But as we scaled up the deployments and we're able now to deploy not a couple times a week, but 150 to 200 times a day, we found that operating that environment could become much easier if we could just tweak some knobs here and there.
So the next thing we did was we took LaunchDarkly, and we used that to help us provide some operational controls via these feature flags. One great example that we use LaunchDarkly for is to lock our clusters. So at any time, we can use a LaunchDarkly feature flag to lock deployments to any set or any subset of clusters. We could say that all clusters in US East, we want to prevent any new deployments. So we can just go through, update a rule that says, "Lock all clusters in US East." The second we set that in LaunchDarkly, literally moments later, if not just in real time, those clusters will be prevented from doing any further deployments.
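A lock of this kind can be modeled as a rule that is checked before any deployment is applied. The Python below is a sketch under assumptions, not the real implementation and not a LaunchDarkly API: `is_locked`, `deploy`, the attribute names, and the sample clusters are all hypothetical.

```python
def is_locked(cluster: dict, lock_rules: list) -> bool:
    """Return True if any lock rule matches the cluster.

    A rule is a dict of attribute -> required value, and it matches when
    every pair agrees with the cluster's attributes. Empty rules are
    ignored so that a blank rule cannot lock every cluster by accident.
    """
    return any(
        bool(rule)
        and all(cluster.get(key) == value for key, value in rule.items())
        for rule in lock_rules
    )


def deploy(cluster: dict, service: str, version: str, lock_rules: list) -> bool:
    """Apply a new version unless the cluster is locked; return True if applied."""
    if is_locked(cluster, lock_rules):
        return False
    cluster.setdefault("running", {})[service] = version
    return True


lock_rules = [{"region": "us-east"}]  # "Lock all clusters in US East"
east = {"id": "c1", "region": "us-east"}
south = {"id": "c2", "region": "ap-south"}

east_deployed = deploy(east, "api", "1.1.0", lock_rules)    # blocked by the lock
south_deployed = deploy(south, "api", "1.1.0", lock_rules)  # not affected
```

Removing the rule from `lock_rules` is all it takes to let deployments through again, which is the kind of quick operational knob the talk describes.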
So we have several other examples of how we do this, but this is our next step into how we're moving towards progressive delivery. And part of this is just better understanding and better learning how we can actually take advantage of these feature flags.
The next step: we've kind of gotten to this point where we are just delivering like gangbusters. We're delivering 150 to 200 times per day. Several dozen different teams are currently building and deploying code simultaneously across the environment. Because of that, it felt a little bit like the Wild Wild West. So what we needed to do, and we kind of call this our growing up or maturing phase, was put some controls around, not really how we build, but how we deploy code out to these environments. This also helped us provide some focus and really put more controls around how we update the environments.
So what we did in this case is we basically built an application which sits on top of LaunchDarkly, and this application allows the users to go in, in a controlled environment, select which flag they want to update, pull down the new variant associated with a specific rule in the flag, and then submit a request to make that change. That request will then trigger a ServiceNow ticket to be opened up, and our operations team can then review that. For example, if we open up a change for an EU-managed server or EU-managed cluster, we can have a team in the EU data center actually approve that request.
Once they've approved that request, then it's still up to the developer to go through and click the button to apply the change. As I mentioned before, we still want to have the developers have a hands-on experience. So we do allow the developer to control the rollout, but there's not necessarily this kind of underlying automation which is automatically just driving it through the environments. Again, this is on purpose. This is by design. So once that ticket's been approved, once the developer then clicks the button to start the deployment, they can then monitor logs and they can actually test the application in the environment they deployed this to, to make sure nothing went wrong.
The cool thing about this process is that the application that we're using to update the rules and create these change tickets is intimately knowledgeable about what you're actually changing. It knows the Git commit of the version that's currently in the environment, and it knows the Git commit of the version that you're trying to push to the environment. Because of that, the ServiceNow ticket that we open up has all kinds of data that the operations team can then use to determine how risky this change is. But also additionally, in the future, when we want to go back and see what changes occurred during a specific time frame, we can easily see exactly what changes went into the environment at what time. So not only do we have what changed, but we know who changed it. We can also provide links to things like tests that were run in the previous environment that the code ran in.
So all in all, once that code is deployed to the environment, the user can then click the final button there to tell whether or not it succeeded or failed. And as soon as they do that, as soon as they click Succeed, that will complete that change and close that change ticket record. So again, the developer is still driving the code through it, so they're still able to control the deployment. But we've now kind of put in some guardrails and additional sanity checks just to make sure that we have the appropriate audit trail and controls in place to make sure that everything continues to run smoothly.
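The request, approve, apply, and confirm flow described above behaves like a small state machine with an audit trail. The following Python is a sketch only: it does not talk to ServiceNow or LaunchDarkly, and `ChangeRequest`, `State`, `TRANSITIONS`, and the sample flag, rule, and commit values are hypothetical. Treating a failed change as closed is also an assumption; the talk only says that clicking Succeed closes the ticket.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class State(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    APPLIED = "applied"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions; anything else is rejected.
TRANSITIONS = {
    State.REQUESTED: {State.APPROVED},
    State.APPROVED: {State.APPLIED},
    State.APPLIED: {State.SUCCEEDED, State.FAILED},
}


@dataclass
class ChangeRequest:
    flag: str
    rule: str
    from_commit: str  # Git commit currently in the environment
    to_commit: str    # Git commit being rolled out
    requester: str
    state: State = State.REQUESTED
    history: List[Tuple[str, State]] = field(default_factory=list)  # audit trail

    def advance(self, new_state: State, actor: str) -> None:
        """Move to `new_state` if allowed, recording who made the move."""
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
        self.history.append((actor, new_state))

    @property
    def closed(self) -> bool:
        # Assumption: both outcomes end the change request.
        return self.state in (State.SUCCEEDED, State.FAILED)


cr = ChangeRequest("deploy-api", "eu-de", "abc123", "def456", requester="dev1")
cr.advance(State.APPROVED, actor="eu-ops")  # operations team reviews the ticket
cr.advance(State.APPLIED, actor="dev1")     # developer clicks the button
cr.advance(State.SUCCEEDED, actor="dev1")   # developer confirms the result
```

Note that the developer, not an automated job, drives the `APPLIED` and `SUCCEEDED` steps, while the transition table prevents applying a change that was never approved.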
One of the things that is very interesting about this is that LaunchDarkly themselves have now incorporated this logic directly into their offering. So eventually, we will sunset what we call our Razee Flags application and begin to use just the integrations that are directly built into LaunchDarkly and their integrations with ServiceNow.
The next step of our journey here, I'll call it Feature Flags Act One. So after we've been using feature flags now for about a year or so to control our deployments, and we have been using some feature flags to do some operational changes or operational updates as well, this is where we really started to realize, well, wow, there's a lot of power to these feature flags, and what else can we use this stuff for? Which is ironic because what we started using it for now is kind of what LaunchDarkly was intended for. That's what the whole total product is designed for.
Because of that, the couple dozen teams that we have started using feature flags. So each one, for example, our API, our UI, our billing, all these different microservices would then start to implement their own, quote, features. The problem was that a lot of these features were the same. So we ended up with lots and lots of duplicate flags, which all had to be managed independently from one another. And because of that, there was some confusion about which flags needed to be updated when, who was added to which segments and flags, et cetera.
But having said that, with those drawbacks, we were still able to deliver and update several very large features through a progressive delivery model using these feature flags. This is something that even today, honestly, I'm quite amazed that we can do: at any time you look at our environments, we have beta code, we have even pre-beta code running in our production environments. We continuously deliver new code to these environments and continuously improve on not just existing features, but new features that we haven't yet rolled out to our customers.
And so since we started using this model, we've kind of gone on a roadshow across IBM to say, "Hey, guys. This is really cool," just kind of talking about how I feel we are now doing progressive delivery and rolling out features and actually managing features, rather than just pushing these features out typically as really large code deployments.
So finally, the next step is what I would call feature flags done right. So just like our deployments early on, our feature flags themselves were getting out of control, as I mentioned before. All these teams had various different feature flags. Again, it was kind of the Wild Wild West, but this time with the user flags. So what we ended up doing is, for one, we sat back and really adopted feature management and feature flags into our overall development process. What that means is that at the time that we think of a new feature or think of a new capability that we want for our product, the first thing we do is go out and we create that flag in LaunchDarkly.
For example, we have a new offering called IBM Cloud Satellite. It was announced earlier this year and is currently in beta. Now, the interesting thing about this is that we've had Satellite code running in our production environments since back in May, since we first started thinking and talking about Satellite. And the cool thing about this is that we've had one flag to control access to Satellite, where in the past, we would probably have a half a dozen or more different flags for each of the various components of Satellite to give access to users for those pieces. So now, if we want to have a user get access to our new Satellite offering, there's one place we can go in LaunchDarkly, and we can add them and give them access to that capability across the board.
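The difference between one shared flag and a flag per component can be shown with a small Python model. This is a sketch under assumptions, not how Satellite or LaunchDarkly is actually wired: `FeatureAccess`, `ui_menu`, `api_status`, `cli_commands`, and the user IDs are hypothetical. The key point is that every component asks the same question of the same flag.

```python
from typing import Dict, List, Set


class FeatureAccess:
    """Hypothetical single source of truth for who can use a feature."""

    def __init__(self) -> None:
        self._members: Dict[str, Set[str]] = {}

    def grant(self, feature: str, user_id: str) -> None:
        self._members.setdefault(feature, set()).add(user_id)

    def allowed(self, feature: str, user_id: str) -> bool:
        return user_id in self._members.get(feature, set())


def ui_menu(access: FeatureAccess, user_id: str) -> List[str]:
    """Menu entries the UI would show to this user."""
    items = ["Clusters"]
    if access.allowed("satellite", user_id):
        items.append("Satellite")
    return items


def api_status(access: FeatureAccess, user_id: str) -> int:
    """HTTP-style status code for a call to a Satellite API endpoint."""
    return 200 if access.allowed("satellite", user_id) else 403


def cli_commands(access: FeatureAccess, user_id: str) -> List[str]:
    """Top-level CLI commands visible to this user."""
    commands = ["cluster"]
    if access.allowed("satellite", user_id):
        commands.append("sat")
    return commands


access = FeatureAccess()
access.grant("satellite", "user-a")  # one change, in one place

views_a = (
    ui_menu(access, "user-a"),
    api_status(access, "user-a"),
    cli_commands(access, "user-a"),
)
views_b = (
    ui_menu(access, "user-b"),
    api_status(access, "user-b"),
    cli_commands(access, "user-b"),
)
```

With a flag per component, granting `user-a` access would have meant three separate updates that could drift apart; here a single `grant` call changes the UI, API, and CLI together.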
The next part is, again just like the deliveries, we had to add a little bit of control around how we manage and update these feature flags. Because of that, we've now added a change management requirement to updating these flags. Before this, just random members of the team would go through and update segments, or they would change a flag, which could potentially expose this new feature to customers that we didn't intend to expose it to, or might even change the behavior of certain production environments in ways that, again, we didn't intend. An important part of this is that if we were to ever go back and look at the history of a service, we weren't able to really tell who had access to what and when.
So now we've also integrated our change management process around this, albeit this process is a little bit more manual than our deployment flags. We do just require our users to go through and open up a change ticket manually before they make any changes into LaunchDarkly. This is one of the areas where the new feature that LaunchDarkly has provided with direct integration with ServiceNow will help us tremendously.
And so, final thoughts, I guess closing thoughts here. So just like every other project, we're not even close to being done yet. Again, this is just a journey towards progressive delivery. We learn more and more every day about how to better ourselves, how to deliver the IBM Kubernetes Service more efficiently, quicker, and more reliably.
So a few things that we have looked into that we definitely want to do: for one, how can we give users access to the feature flags themselves? So today, that's a very manual process. If a customer wants access to some feature, they generally have to come to us directly and say, "Hey, can you give me access to this?" At which point, we will go in LaunchDarkly, update that feature, and add that user to that particular capability or new feature. There's a lot of other products and services that have the notion of labs, where folks can go and say, "Oh, I'd really like to try this new feature," and allow the users themselves to opt into these new features and to try them out and to give us some feedback as well. And so this is one of the next steps that we are looking at doing.
And one other piece is that for the most part, most of our services are, I guess, feature-flag-enabled or we're delivering it very progressively. But we still have a few pieces that aren't really in the mix yet. For example, one of those is our documentation. So today, if you go to cloud.ibm.com, you may find some new features that you have access to, but you may not see the documentation for that, or vice versa. This may be even worse: you may see documentation for a particular feature, but you may not see the capabilities in your experience.
So at least my nirvana is that we want to get to the point where we literally have a flag which controls everything. Everything would include access to the UI, access to the API, access to the CLI, and even access to the documentation as well. So when you're visiting cloud.ibm.com, whether you're just browsing the documentation or actually using the tools and capabilities of the platform, it'll be very seamless across the board.
Having said all that, I very much appreciate you listening to my talk here. If you are interested in what we're doing or for more information on what we're doing, I'll provide a link to our open source project. One of the things I failed to note, and I'm going to put my shameless plug in here, is that we've actually open sourced our delivery process and our delivery model, and we call it Razee, and you can go to razee.io and find out information for that.
If you have any more questions about how we build or how we operate, I'm always happy to talk about this. Just reach out to me on Twitter or contact me directly via email at mckaymic@us.ibm.com. Again, thank you very much. Thank you for attending, and enjoy the rest of your day. Thank you.