Migrating a Monolith to Kubernetes


San Francisco 2017


Principal Site Reliability Engineer · GitHub

Last year, a small team at GitHub set out to migrate a large portion of the application that serves GitHub.com to Kubernetes. This application fits the classic definition of a monolith: a large codebase, developed over many years, containing contributions from hundreds of engineers (many of whom have moved on to other things).

In this presentation, we'll cover our motivations for this migration, the factors that led us to choose Kubernetes, the strategies we used to empower a small team to make a change that affected a large engineering organization, and reflect on what we learned in the process.


Full transcript


Jesse Newland

All right. Hey, everyone.

A little bit about me to get started here today: I'm Jesse, and I go by `jnewland` on the internet. Over the past 16 years, I've primarily been working in the field of keeping websites online. For the past six years, I've been focusing on github.com, the website.

Over those six years, I've swung back and forth on the engineering/management pendulum a bit, working as an operations engineer early on, becoming an informal tech lead near the end of GitHub's no-managers phase, then briefly taking a role in management.

In that same period of time, I've moved around a bit, leaving my home office in a 100-year-old Victorian in Savannah, Georgia, for an industrial loft in Potrero Hill and a desk at an office in SoMa.

Recently, I've compromised on both of these fronts, moving into a role focused on technical leadership and moving to Austin, Texas, the land of 1,000 tacos. I'm extremely pleased that GitHub has afforded me the opportunity to evolve my role over time as both I and the company have grown and changed, as well as the opportunity to create my own ideal working environment via their embrace of distributed teams. It's pretty great.

Enough about me. Why am I here today? What do Kubernetes, monoliths, DevOps, and enterprise all have in common?

Like many of you here, my job is to effect change in a technical organization. It might be from a different perspective, but I think it's more similar than you might think. Like many of your organizations, GitHub is growing, maturing, and evolving. As a result, we often find that our solutions don't scale to fit the needs of our growing organization.

Like many of you, we're on a journey of continuous improvement. I fundamentally think that, on a universal level, we're more alike, my friends, than we are unalike, and we're all one big human family with so much to learn from each other. That's why I'm here at this conference today and this week: to learn from all of your experiences and to share one of mine.

We reached a milestone on this journey just recently, and that milestone was the completion of a project to migrate the application that powers all HTTP requests to github.com and api.github.com from a legacy architecture, where application processes ran on long-lived physical hosts, to a Kubernetes-based architecture, where requests were served from application processes running in short-lived containers.

We've been really happy with the results of this migration and are very much enjoying our experience with Kubernetes thus far.

To zoom out a bit just for context, Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications. I'm not going to read this because I copied and pasted it from Kubernetes' website, but it is very much built on years and years of experience from Google and years of experience from the community, and has done an incredible job of encapsulating a bunch of industry best practices in a bit of software.

That all being said, I'm not here to tell you that you should adopt Kubernetes or any technology specifically, and I'm not even really here to go too deep into the technical details of our migration. This blog post on our engineering blog has a ton of details there. If you'd like to go deeper into any of the technical details that I mention in this talk, feel free. Please do hit me up on Twitter or find me at the GitHub booth in the expo hall later if you have any questions I didn't cover today.

But to be extremely reductive, Kubernetes is just a technology, right? Kubernetes is a super-dope technology, to use the Hightower Rating Scale. But it's not a panacea. No tool will fix your or my broken culture, right? Tools may help, but they're one part of the puzzle.

Please take everything I say here today and everything anyone else says at this conference with a grain of salt, and make sure to make a decision that's right for you about what technology and process to adopt. But in the interest of learning and sharing, I'd like to share an anecdote from our ongoing journey and give you a little perspective on how we went about this process.

Today, I'd like to talk about the motivations for this work, how we approached the project, where we are today, what we've learned, and where we're headed next.

To start, why? Why did we choose to migrate a large monolithic application to a container orchestration framework?

As is the case with any decision, context is crucial. Please allow me to set the stage a bit.

The monolith: in this talk, when I refer to the monolith, I'm speaking of the Ruby on Rails application that lives at github.com/github/github, which is kind of a mouthful. Inside our organization, it's been colloquially referred to as github.com, the website, which I find kind of endearing.

It's a 10-year-old Ruby on Rails application, and if you've worked with Ruby on Rails applications of that age, it's exactly what you might expect in a lot of regards. It contains contributions from hundreds of engineers, working across thousands of product deliverables.

The Ruby on Rails framework and the approach we took to development, putting most of our application logic in this one repository, provided huge payoffs over the vast majority of the lifespan of this app in terms of the velocity that it's afforded us. It's really easy to get going and really easy to add functionality to this existing application.

But as time has gone on, the app has become increasingly complex, and it's become harder and harder to understand who might be accountable for any piece of functionality in the application itself.

That covered some context of the application. I'm going to shift a little bit to the production environment in which it's running.

This application has traditionally been deployed on servers running in GitHub's "Metal Cloud," which provides our engineers with an API for provisioning physical servers in our data centers. It feels a lot like the VM provisioning APIs you get from cloud providers today.

This means that our monolithic Rails application and many of its supporting services, like databases and other things, have all been running directly on very high-powered hardware: tons of RAM, tons of cores, high clock speeds, SSDs everywhere. We've over-provisioned, and that helped significantly during certain parts of our growth.

This hardware is also incredibly reliable, which is convenient, but has often unfortunately led to software being designed in an optimistic manner, if you will.

The private cloud also has incredibly low-latency networking and high-throughput networking, which was crucial to us early on as we were the frequent target of denial-of-service attacks. It looks pretty cool, too. All told, it's a pretty great place to run software, but there are some drawbacks.

The primary unit of compute in our Metal Cloud's model is an instance, or one server. An instance setup has been historically tightly coupled with our configuration management system, meaning that the process of provisioning an instance usually required provisioning the instance and then running Puppet on it for the transaction to be complete, if you will.

The result of that tight coupling was that while our provisioning process was API-driven and, as a result, pretty testable, any tests of that kind of provisioning and then configuration management run presented a pretty brutal feedback loop to anyone working on a system where they hope to assemble multiple computers together to provide a service to people, which is what we do.

As a result, many systems and really most instances have been provisioned by humans performing an API call via a ChatOps command, and most load-balancing configuration either needs to be manually updated or an update triggered after an instance has been created or destroyed.

As a result, many of our systems are human-built. And as a result, a high level of effort is required to get a new service into production.

All of those things being said, other context that's important is that GitHub has been rapidly changing and growing over the past several years. I think it's fair to say that over the past few years, the steady state of GitHub has been one of change, and change in many forms.

Our customer base is growing. There are a lot of development shops out there that don't use GitHub, and new software businesses are being formed each day, right?

Our customers are growing. Our existing customers use GitHub more and more as their businesses grow and expand and as they hire engineers.

Our ecosystem is growing. Our users are building new and exciting ways to interact with code on GitHub and are creating thriving businesses built on top of our APIs.

And our organization is growing. We're hiring engineers, forming new teams, and adopting new techniques and practices around the company.

On top of all of that, we're shipping new products and improving existing products. Pull requests are not in their final form, I'll have you know.

Along with all of that, our customers are very reasonably expecting increasing speed and reliability. Every new feature that we launch, every new product that we launch, is held to the same standard of GitHub, which you all hold to a very high standard because it's a crucial part of getting your job done every day.

At various times over the past few years, we've seen indicators that our current approach was struggling to deal with all of these forces. Around the time that we started this work, a build-and-run approach had been successful in a few small groups, but it hadn't yet become commonplace around the org.

SRE's tools and practices for running services had not yet evolved to match what our engineering organization demanded and wanted. Our tools provided an interface for engineers to build and run their services, but with more care and feeding and handholding than either party desired.

As a result, engineers found that adding new functionality to an existing service was easier than setting up a new service that encapsulated their new functionality. I'm sure this sounds familiar to some of you.

Unsurprisingly, the monolith kept growing. Along with that growth came slower tests. And along with those slower tests came longer deploys.

The human-managed nature of most of our compute instances and systems resulted in an infrastructure that didn't change that much from day to day, which meant that regular growth and occasional changes in traffic trends required manual work when an auto-scaling group might have been able to solve the problem instead.

That static infrastructure was incredibly inefficient. It was often provisioned to handle peak load when there are peaks and valleys throughout the day, of course. Our non-Kubernetes fleet is underutilized in a way that makes me sad to this day. I'm working to address that.

On top of all of that, a few efforts to leverage some of the more advanced functionality available from cloud providers ran aground after we realized how tightly coupled some of our applications were to their environment, to the high-performance compute, the high-performance networking, and to our specific APIs that we had built inside the company.

As a result of all of these forces pushing together and combined, we noticed some downward trends, both in the GitHub user and the GitHub developer experiences. Needless to say, we were not happy with these and aimed to move mountains to change that.

Fortuitously, late last year, the planets aligned in a way that made all of these problems visible at once, both to us and also to some folks in leadership. Oddly enough, this was as a result of a Hack Week proposed by a new engineering lead in the organization.

Considering all of the context that I've just given you about the deployment and production environment at the time, let's think about how a Hack Week might go. What would an engineer do if they're asked to ship a new and innovative feature in a week?

Would they use 20% of that week to learn Puppet if they hadn't been familiar with it already, work on a Puppet manifest, provision their instance, configure a load balancer? Or might they reach out to us on Thursday and ask nicely for help?

Or they'll do what most of the engineers actually did during this first Hack Week: they chose option three and built their Hack Week feature as a PR against the monolith.

After this Hack Week, we reflected a little bit and realized that what we'd observed was a microcosm of the larger problems with our approach to running software in production.

We realized that our incentives weren't aligned with the outcomes we desired. Engineers are motivated to ship new features, improvements to existing features, and bug fixes. And we had given them an on-ramp into production to do that. But that on-ramp went in the exact wrong direction. It went towards a future that none of us desired.

The path we wanted engineers to take to get something into production, the build-and-run model with the tools that we had at the time, was so high-effort that it may as well have been a wall. We knew at that point that we had reached a fork in the road and that some decisions had to be made.

Around this time, between the first and second upcoming Hack Week, we started talking about this more seriously, and we decided we needed to make an investment in our tools, in our processes, and our technology.

We decided that to support the other ongoing changes in the organization, we would work to level the playing field. To support the decomposition of the monolith, we decided that we would work to provide a better experience for new services. And to enable SRE to spend more time on interesting services, we decided to work to reduce the amount of time we needed to spend on boring services.

At that stage, we had to help with most all services that got into production. They needed a little bit of assistance or product from our org. To reduce the time we spent on boring services, we decided to set a goal to make the service provisioning process entirely self-service.

And to bring the infrastructure-building feedback loop down from the really soul-crushing feedback loop required to test any VM-based infrastructure, we decided to base this new future on a container orchestration platform.

And to build on the experience of Google and the strength of the community we observed with Kubernetes, we decided to build this new approach on top of Kubernetes using its platform.

If this were a movie, we'd cut from this last scene, a boring meeting where we made a bunch of decisions, into a montage of Red Bulls, close-ups of keyboards, stand-ups, high fives, and then we'd fast-forward to the last 10% of the project or something. But this isn't a movie. It's an experience report.

So if you'll indulge me, I'd like to take you through a little bit more of this journey and reflect on the approach we took to this cross-team project.

Upon reflection, this project went through a handful of discrete phases. We formed what we have been kind of colloquially calling passion teams. I'll explain that more in a second. We built a prototype. We picked a big target. We put together a product vision and a product plan. We did the work. Everything else started with a P, sorry. And then we paused and regrouped a bit.

Our first step in this project was to assemble an ad hoc team of engineers from a few different parts of the org. We've written about this approach a little bit previously, and in fact, someone from GitHub spoke about it at this event last year, but we found it incredibly useful when approaching projects like this.

The formation of an ad hoc team that consists of members from several parts of the org is an intentional effort to curate a diverse set of skills, experience, knowledge, and perspectives that a small team that works together all the time might not have on their own.

For this project, we chose members from SRE, our developer experience team, and our platform engineering team to ensure we had a nice set of perspectives around the company.

We informally decided that this team would exist for the duration of the project, and we created the great trifecta of project resources that we use frequently: a GitHub team, a GitHub repo, and a Slack channel.

From there, the first thing on this new team's roadmap was a prototype, or a strategy for not crying under the bed during the next Hack Week.

To keep ourselves focused, we chose a few goals for this prototype. By the next Hack Week, we'd provide engineers with everything they needed to deploy a new prototype onto a Kubernetes cluster and view it in their browser. We'd give them a bare-bones Kubernetes cluster, some integration with our load-balancing tech, a deployment strategy or enhancements to our internal deployment tool to support Kubernetes, and some basic docs and a handbook to get started.

One of our goals was to intentionally leverage the loose standard of quality that comes with Hack Weeks. We built a prototype for other people's prototypes, and that was a great standard to have.

We also wanted to use this experience to validate our perceptions about the level of effort required to adopt a container orchestration framework for some workloads. We had played around with this technology a little bit before, really casually, and felt like we could integrate it and solve some problems that we had today with not too much effort.

We also wanted to validate a hypothesis that if we presented another option for getting things into production, getting software into production, that our engineers would flock to it. Given the situation at the time, we expected the level of interest to be pretty high.

We'd also use this experience to learn more about Kubernetes itself and to seek feedback from engineers that used this new approach. If engineers took this path, we'd use their experience to help shape the next iteration of this work.

And we also wanted to make sure that this project got some attention. Building this as a part of Hack Week, where people would be demoing the results, seemed like a great way to leverage it for marketing.

I'm happy to report that this prototype, this part of the project, was a wild success. We built out the functionality and documentation we needed with a little bit of time to spare. We had a handful of projects launched during Hack Week with almost no SRE involvement.

We received tons of positive feedback from the engineers that used this platform, even in its infancy. We learned a lot about the rough edges that existed and parts that we thought were rough that some of our application engineers didn't really mind at all.

In fact, several of the projects built during this Hack Week and launched on this cluster live on today in different iterations of this compute platform. I'm pretty happy that our prototype's functionality has been preserved, and that people are able to keep using the thing they deployed and have evolved it over time to run in our new environment today.

With this successful prototype under our belt, we paused briefly to determine our next step. As I had mentioned in the title of this, I guess, we decided to migrate the monolith. We decided to migrate github.com, the website itself.

Of course, this is a decision that we considered really heavily. There are a handful of significant pros and cons. I'll start with some of the things in favor of this.

Our experience with Hack Week showed already that small applications and prototypes were a great fit for Kubernetes, but we didn't want to commit to this as a tech to run most of our infrastructure if we couldn't prove it to work for a high-throughput and complex application. If we were able to make Kubernetes work for this environment, we knew that we'd inspire confidence in others, both inside the organization and externally.

Another reason in favor of choosing this app to migrate is that the github/github application, while pretty complex, as I've mentioned, is well known to many engineers at GitHub. A few of the engineers working on this project had been working on this application for some time and had considerable experience with some of the rough edges of it. We knew that we could leverage their experience to smooth out any bumps we ran into during the process.

We'd also made significant changes of this sort to the monolith before and had some pretty well-worn testing strategies available to us. We knew we'd be able to route traffic from staff specifically to a new platform, which has been useful for us in testing previously, and then route small percentages of public traffic to gain confidence along the way.

Around the same time that this all started happening, we also had an overlapping need for a little bit more velocity in another area. We had previously had these three statically configured lab environments that some people may call staging, some call smoke-test environments. These are environments where developers could easily test their changes before deploying them to any customer traffic at all, even a small percentage.

These three statically configured environments soon became rather busy. During East Coast or West Coast business hours, they were frequently fully booked and often drifted into failure states as a result of the feature being tested. I'm sure this sounds familiar to some of you as well.

We also had a need to provide engineers with the ability to respond to changes in traffic and other capacity events, and realized that Kubernetes would give us basically exactly the knob we were looking for for this one type of problem we were aiming to solve.

While there were a lot of factors influencing us to make the decision to migrate this monolith, there were a few negative trade-offs to consider as well.

The first one is pretty straightforward: it might not work. A significant trade-off we considered was that by choosing this highly visible application, we ran a real risk of betraying internal trust in the platform and external confidence in our system at large if something went wrong.

The wealth of testing strategies we had kind of counterbalanced this trade-off a bit. They gave us the opportunity to fail small before we failed big.

Another trade-off we considered is that the migration to Kubernetes might, in fact, make some of the problems that we face today worse. Development could get harder. The site could become slower or less reliable.

Considering all of those factors, we made a choice to put together a project plan outlining all these trade-offs and to seek feedback from others around the world, see what they thought. We took a brief detour to articulate our vision for the future and our plan to turn that vision into reality.

In the scope of larger organizational transformations that often happen when people are hoping to adopt DevOps practices, projects like this provide an opportunity. They're high-impact and extremely visible to others around the organization.

As a result, we knew that communication was crucial. We didn't want to waste this opportunity to impact how people developed and ran software at GitHub. We wanted to communicate very strongly internally why we were making this choice and what we were aiming to get as a result.

Communicating change is really hard, and I imagine it's a huge part of what you all think about as a part of your DevOps transformations. Over the years, myself and a few others at GitHub have observed a few characteristics of well-communicated changes in our environment, and I want to run through a couple of those with you today.

One of the big things we've observed is that projects that are proposed in a tech-first manner, "Migrate to Kubernetes," "Adopt a container orchestration platform," or something of that nature, often end up failing because they aren't focused on a particular goal or outcome, right? They're often implicitly focusing on some side effect of adopting that technology that's not well stated or well communicated.

Determining what our goal is and stating it strongly in all project communication was something that was important, and we did throughout the project.

We've also casually observed that direct, to-the-point communication is associated with project success at GitHub. Meandering around the point hasn't served us well or served projects well in the past at GitHub. But even while being direct, it's important to know that humans are at the other end of any communication.

We found it's pretty important to write in a manner that you might use when speaking with a friend, not in a manner that you might use when submitting a journal publication internally.

Another element here: in project kickoff communication, we found it to be super helpful to be upfront about the alternatives you've considered when adopting a certain technology, or going down any path at all. It helps others understand why one direction may have been chosen over the others.

It's also important to remember that in the scope of any project plan, doing nothing is an alternative. Doing nothing often comes with serious negative trade-offs, right? But it's worth mentioning those to help others better understand your motivation. Sometimes when framing your motivation in that form, "What happens if we do nothing?" it helps people understand and buy into your decision.

In the spirit of heading off questions by providing info up front, we've also found that including some information about how this change will make it into prod helps reduce others' perceptions of risk significantly.

Once you're confident in writing all this down, one of the key elements for us has been giving this project plan a URL, making it available to as many of your coworkers as possible. For us at GitHub, that means opening a pull request, adding the project plan to the project repo, which is generally open to our entire technical organization. In your case, it might be a wiki, or that might be a proposal in some other format.

This last point is really crucial and I think was very important to me and others as a part of this effort. Simply thinking about the way you'd like to change something isn't enough, nor is telling someone in chat or in person at a conference, or even speaking about it at a conference, or opening a thoughtful pull request like I've just described.

If you want to change the opinions and actions of others in an org, active communication and repetition are key. For us, that meant repeating this message at engineering all-hands and one-on-ones, in our internal social network thing that's kind of like Yammer, holding office hours, and curating a frequently asked questions page.

All that being said, the communication finally had the impact we wanted. We had some vocal support from execs, plus additional engineering and project management resources, which were definitely appreciated.

At this point, as one of the other engineers on the project said to me pretty frequently, "All we have to do now is not be wrong." Seems okay.

So how'd it go?

It went pretty dang well. I'm here speaking about it. I wouldn't be speaking up here if it didn't go decently well.

As a result, the application that powers web requests to github.com and api.github.com is now running in containers on one of a handful of Kubernetes clusters.

One of the key decisions we made early on was not to split out too many things from the monolith as a part of this migration. Given that this application had been developed over many years, it had a few dependencies, if you will, and it resulted in a pretty big container.

A lot of people internally were unsure about this and had read lots of best practices that encouraged them to optimize their container size and make small containers, and I think that's totally reasonable. But luckily, most of the changes to our application only really touched one layer, which resulted in a median build time of around 100 seconds. That's not exactly what I'd like. I'd like it to be a little bit better. But other than that, we haven't observed any other real negative trade-offs of that big image. So I want to let you know that that's a thing that's okay to do.
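To illustrate why a large image can still build quickly when changes touch only one layer, here is a minimal sketch of layer caching in general. It is not GitHub's build pipeline: the layer names and the `layer_keys` and `layers_to_rebuild` helpers are hypothetical, and the hashing scheme is a simplified stand-in for how image builders key their caches. The idea it shows is that each layer's cache key depends on everything beneath it, so a change to the top layer leaves the layers below reusable.

```python
import hashlib


def layer_keys(layers):
    """Return one cache key per layer; each key depends on all layers below it."""
    keys = []
    parent = ""
    for content in layers:
        parent = hashlib.sha256((parent + "\n" + content).encode()).hexdigest()
        keys.append(parent)
    return keys


def layers_to_rebuild(old_layers, new_layers):
    """Count layers in the new build whose key is not already in the cache."""
    cached = set(layer_keys(old_layers))
    return sum(1 for key in layer_keys(new_layers) if key not in cached)


# Hypothetical image: heavy, rarely changing layers first, application code last.
before = ["base OS packages", "system libraries", "gem dependencies", "app code @ commit A"]
after = ["base OS packages", "system libraries", "gem dependencies", "app code @ commit B"]

print(layers_to_rebuild(before, after))  # 1
```

Changing the first entry instead would invalidate all four layers, which is why ordering the rarely changing layers first matters more than the total image size.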

Once we had this container built, we built a really bare-bones Kubernetes cluster and used it to build the dynamic staging environments I mentioned earlier.

If I were to go to the GitHub source code right now, change the style of the footer to be red or something, and push that up to GitHub, as soon as all tests pass, I'd be able to deploy that to a thing called Review Lab. That means that I and others around GitHub can go to a URL and see that version of the site, and see my changes against production data. Engineers are able to do that in just over a minute after their tests pass, which has turned out to be pretty great.

This is a big highlight of the project in terms of engineering velocity. Engineers gave us great feedback about this. They really enjoyed working with it, and it did wonders in terms of internally marketing this effort.

These Review Labs have been seeing steady and increasing usage. In recent weeks, on an average working day, engineers test about 50 changes using these environments.

Alongside the work to spin up experimentation environments, we'd also been working on a larger deployment of GitHub on Kubernetes. To gain confidence in that environment, we built a little switcher in our staff-only toolbar that let other GitHubbers opt in to our Kubernetes-backed environment.

This was really great and helped us gather crucial data about performance and error rates and find some edge cases that we hadn't identified in testing previously. It helped us build tons of confidence in this new system.

Once we had gained enough confidence there, we switched this on by default and made it opt out. This was a really big effort. Again, the blue logo in the header helped with internal marketing as well. People really liked that. And it helped us fix a lot of bugs that we hadn't caught originally.
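The opt-in and opt-out phases described above can be sketched roughly as follows. This is an illustration only, not the actual toolbar code: the `Request` type, its fields, and the `choose_backend` function are hypothetical names, and the sketch assumes that non-staff traffic stays on the legacy architecture during this phase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    is_staff: bool
    # None means the staff member has not touched the toolbar switch.
    kubernetes_preference: Optional[bool] = None


def choose_backend(request: Request, staff_default: bool) -> str:
    """Pick a backend for one request.

    staff_default=False models the opt-in phase;
    staff_default=True models the later opt-out phase.
    """
    if not request.is_staff:
        return "legacy"
    if request.kubernetes_preference is not None:
        use_kubernetes = request.kubernetes_preference
    else:
        use_kubernetes = staff_default
    return "kubernetes" if use_kubernetes else "legacy"


# Opt-in phase: only staff who flipped the switch reach Kubernetes.
print(choose_backend(Request(is_staff=True), staff_default=False))  # legacy
# Opt-out phase: the same request now lands on Kubernetes by default.
print(choose_backend(Request(is_staff=True), staff_default=True))  # kubernetes
```

Flipping a single default moves every undecided staff member at once, while an explicit preference still wins in either phase.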

Following this period of testing, we started running some controlled experiments where we would route small percentages of traffic, sometimes even to a specific URL inside the app, to the new architecture to gather more data about performance and error rates relative to the previous version, along with a lot of additional data.
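A percentage-based experiment of this kind, optionally scoped to one URL, might look something like the sketch below. The talk does not say how requests were selected, so the hashing approach, the `bucket` and `route` functions, and the example path are all assumptions made for illustration.

```python
import hashlib
from typing import Optional


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(request_id: str, path: str, percent: int,
          path_prefix: Optional[str] = None) -> str:
    """Send roughly `percent` of matching requests to the new architecture."""
    # An experiment scoped to one URL ignores every other path.
    if path_prefix is not None and not path.startswith(path_prefix):
        return "legacy"
    return "kubernetes" if bucket(request_id) < percent else "legacy"


# Hypothetical experiment: 5% of requests under one path.
print(route("req-123", "/some/path", percent=5, path_prefix="/some"))
```

Because the bucket is derived from the request ID rather than drawn at random, raising `percent` only ever adds requests to the experiment, which keeps results comparable as the experiment grows.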

That was not the button I meant to press.

Cool. All right. Wow.

From there, these experiments continued to grow in size as our confidence increased. I guess in mid-July, we got to 100% of requests served from Kubernetes.

As of today, all of the requests that you make from your browser or from your API client to GitHub are going through a Kubernetes cluster. It's not running all of the workloads of the application. Much of our data storage and other supporting services are running in our traditional architecture. But the application processes that serve the website itself and talk to those supporting services are running in that cluster.

Along the way, through this effort, we made sure to keep our goal of supporting the eventual decomposition of the monolith in mind. That meant that all the functionality we built to support the deployment of the monolith was available to other services.

That means that every service that migrated to Kubernetes internally at GitHub would be able to use a canary deployment to test its changes at a small scale before rolling them out to a larger percentage of production traffic. Every service would be able to have a maintenance mode. All these little bits of functionality that we needed to build to support how we were deploying our most crucial application were extracted and made available to the smallest application inside GitHub.

This has done a ton to sweeten the pot, if you will. So much so that engineers are presently starting to default to Kubernetes when building a new service. We've been successful on that front. As a result, at last count, 20% of all of our services are running on Kubernetes clusters. We're tracking this number as a KPI for the SRE organization and aim to push it as close to 100% as we can get over time.

I'd like to quickly summarize some of the things that we learned as a result of this change and the outcomes that we've observed.

First, the positive outcomes. Let's start on a high note.

Setting up services now is considerably easier than it was before as a result of this effort. We're extremely excited about this outcome and how we hope it will shape our software over time.

Since the required effort is so low, new services are regularly deployed with little to no SRE involvement. That's the idea, right? We'd like people to get things into production on their own time and run it themselves. In situations where SRE is involved, it's often just as a casual approval on a pull request to give some confidence to the engineers that are working on the project.

As a result of adopting Kubernetes, we now have access to its really incredible suite of APIs, which allow service maintainers and cluster maintainers to inspect the running state of their application and cluster in a way that was really hard for us before. We'd spent a bunch of engineering effort writing APIs similar to the ones that Kubernetes provides us, but these came for free as a result of this adoption.

In addition, these APIs can be used to mutate the state of the system, of course. Applications can make an API call to increase their own capacity or to clean up a stale lab environment, for example. This is a really big win.
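
To make the capacity example concrete, here is a minimal sketch of how a request to resize a Deployment through the Kubernetes API's `scale` subresource could be built. It only constructs the request and does not send it. The API server URL, namespace, Deployment name, and token are placeholders, and the helper name `build_scale_request` is an assumption for this example rather than anything from the talk.

```python
import json
import urllib.request


def build_scale_request(api_server: str, namespace: str, deployment: str,
                        replicas: int, token: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH against a Deployment's scale subresource."""
    url = (
        f"{api_server}/apis/apps/v1/namespaces/{namespace}"
        f"/deployments/{deployment}/scale"
    )
    # A JSON merge patch that changes only the desired replica count.
    body = json.dumps({"spec": {"replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/merge-patch+json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it. In a real cluster that also requires trusting the API server's certificate and a service account that is allowed to patch the scale subresource.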

Combined, these APIs present us with a cloud-native platform to build against. In my opinion, Kubernetes is a really interesting and exciting cloud-native platform. It may be a little bit different than how you might think about building against, say, a cloud provider's infrastructure API. But Kubernetes is a cloud-native platform, if you will, that's built in the open, and it's really emerging as a standard in the community for application runtime.

This open cloud-native platform has allowed us to reduce our architectural lock-in in ways that we didn't really even fully understand when we were starting out on this project. By modifying our monolith to run on Kubernetes and by adopting Kubernetes as our standard compute platform, we've made it easier to run our applications on any other cloud provider that offers a Kubernetes-based product.

We're super excited about this future, and I think this is the greatest trick Google has ever pulled. It's a nice marketing move.

Another benefit that we've observed is that the Kubernetes platform feels more open-source-friendly than some of the other technologies we've previously used to build systems.

The burden has historically been pretty high when considering how to open source infrastructure automation. A lot of the automation for our systems involved building against a Debian operating system, our Puppet infrastructure, and our Metal Cloud APIs, and it wasn't very exciting to share.

Even if you were building against public cloud APIs, the matrix of OS versions and package derivatives and configuration management providers that you might have to choose to make your open source infrastructure automation palatable to anyone else in the room seems pretty daunting.

The problem space to consider when building systems on Kubernetes is much smaller, in my opinion, and many of those details are abstracted away in container images that, well, contain all that chaos in a single unit.

That's led to the creation of some awesome tools like Helm, a package manager for Kubernetes that lets engineers build templates to support repeated and testable installation of whole systems, not just individual pieces of software, on any Kubernetes cluster that conforms to a set of requirements.

As a result, the community is flush with charts, which is what Helm calls its packages, that you can use to install commonly used systems on your Kubernetes cluster today. These aren't ordinary system packages that install a daemon and a systemd unit file or anything like that. They'll set up several MongoDB databases and turn them into a replica set if you'd like, for example.

I'm super interested and excited to see how open source systems building evolves over the next few years. We're in for exciting times here, and I'm really excited to change how our organization is working to be a part of that.

Now that we've talked about some of the outcomes we saw from this work, I'd love to reflect on some of the things that didn't go smoothly.

Challenges that affected SRE primarily: as is the case when you adopt any software, getting comfortable with it takes time, right? Kubernetes was kind of challenging in this regard due to its complexity. This required a bit more time and effort than we initially expected.

Early in the phases of this project, we were slightly sidetracked by some Docker instability issues. Nodes would occasionally panic during periods of high rates of container churn, or in certain situations, the Docker daemon would just hang, where all operations would time out.

We've been able to work around this with a DaemonSet that runs on our cluster that identifies, drains, and reboots affected nodes, but we're excited to explore other OCI-compliant runtimes soon.
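
The detection half of a workaround like this can be sketched as a probe with a timeout: if a command that should return quickly does not, the daemon is treated as hung. This is an illustrative sketch, not the DaemonSet described in the talk. The probe command, the failure threshold, and the action names are assumptions, and the drain and reboot steps themselves are left out.

```python
import subprocess


def command_is_hung(cmd: list[str], timeout_seconds: float) -> bool:
    """Return True if `cmd` does not finish within `timeout_seconds`."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return True
    return False


def next_action(probe_hung: bool, consecutive_failures: int,
                threshold: int = 3) -> tuple[str, int]:
    """Decide what to do after one probe; returns (action, updated failure count).

    The action names are hypothetical labels for this sketch.
    """
    if not probe_hung:
        return ("ok", 0)
    failures = consecutive_failures + 1
    if failures >= threshold:
        return ("drain_and_reboot", failures)
    return ("wait", failures)
```

A node-local agent might call `command_is_hung(["docker", "ps"], 10.0)` on a loop, assuming the `docker` CLI is installed on the node. Requiring several consecutive failures before acting avoids rebooting a node over a single slow response.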

Another significant challenge we continue to face is the change that we're asking application engineers to make, both in terms of the way that they get something into production and how it works once it's there. We'd spent a considerable amount of time on communication to head this off, but anytime an engineer asks for a server, maybe two, when bringing up a new system, we know that we still have more work to do here.

From the perspective of application engineers, the biggest challenge has been change. Again, in the past year, we've asked them to take a completely different approach to getting software into production, which has required that all of them spend time and effort to incorporate that into their plans. It hasn't been much change, but it's been some.

Kubernetes has changed the language with which we speak about systems in production, introducing terms like pods and such, and that comes with a bit of a learning curve. Along with that, there's a new way of describing production systems, be it YAML, JSON, Jsonnet, or whatever, and a new way of deploying them. All things we've had to help engineers learn along the way.
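
For readers who have not seen this style of description, the sketch below builds a minimal Deployment manifest as plain data and serializes it to JSON, one of the formats mentioned above. The field names follow the Kubernetes `apps/v1` Deployment schema. The application name, image reference, and helper names are placeholders invented for the example.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int) -> dict:
    """Return a minimal Deployment manifest as a plain dictionary."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template below.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def to_json(manifest: dict) -> str:
    """Serialize a manifest to indented JSON text."""
    return json.dumps(manifest, indent=2)
```

The manifest states the desired end state, here a number of pod replicas running one image, rather than the steps needed to reach it, which is much of what makes the vocabulary feel new.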

A technical challenge I've observed application engineers facing is dealing with shorter process lifetimes. Our previous deployment approach consisted of running systems on this unfortunately reliable hardware, if you will. As a result, an engineer could expect a process to live for a long time and be co-located with a temp directory.

That's no longer the case, which forces engineers to think about things like what happens on process shutdown and what happens if that process shutdown is ungraceful. These are good challenges to have, as it's forcing us to build better software as a result, but they're challenges just the same.
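
One common way to handle the graceful case is to catch `SIGTERM`, which Kubernetes sends to a container before forcibly killing it once the grace period expires, then finish the current unit of work and clean up. The sketch below is a generic pattern under that assumption, not code from GitHub's application. The `Worker` class and its job list are hypothetical.

```python
import shutil
import signal
import tempfile


class Worker:
    """Runs jobs one at a time and stops cleanly once shutdown is requested."""

    def __init__(self) -> None:
        # Scratch space that lives only as long as this process does.
        self.scratch_dir = tempfile.mkdtemp(prefix="worker-")
        self.shutting_down = False

    def handle_sigterm(self, signum, frame) -> None:
        # Only set a flag here; the main loop decides when it is safe to stop.
        self.shutting_down = True

    def run(self, jobs) -> list:
        """Run each job until shutdown is requested; return the finished results."""
        done = []
        try:
            for job in jobs:
                if self.shutting_down:
                    break
                done.append(job())
        finally:
            # Remove scratch files whether the loop finished or stopped early.
            shutil.rmtree(self.scratch_dir, ignore_errors=True)
        return done
```

A process would register the handler with `signal.signal(signal.SIGTERM, worker.handle_sigterm)` before calling `run`. The ungraceful case cannot be handled this way because `SIGKILL` cannot be caught, so anything that must survive should not live only in the scratch directory.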

After considering all of these outcomes and applying a healthy amount of hindsight bias, there are a few elements of this project I'd definitely repeat next time.

I really enjoyed the wide range of perspectives that we had in the problem space early on. Including folks from all around the organization really helped us communicate effectively as a result as well.

At certain points of this project, I felt like we were belaboring our communication and spending maybe too much time on it. But after reflecting a bit, I feel like the strong communication that we prioritized was a key to achieving the wide-reaching impact we desired. I'd definitely do that again.

I also enjoyed picking a big target for transformative work like this. Choosing to migrate the monolith amplified the impact of the project, and it increased the relevancy of the work in the eyes of others around the org. We weren't just migrating any application to this new environment. We were migrating one that they worked on, so folks paid attention.

This almost goes without saying, but as a way of balancing the impact of selecting a critical target for this project, this gradual rollout strategy that we took was incredibly effective, and I'd definitely do that again.

As is the case with any project, there are a few things I would do differently, though.

One thing I'd do differently is more consciously consider how a passion team like this winds down. Where do maintenance responsibilities go? How do we combine what's left of that team's roadmap with the roadmaps of the teams its members return to? These are all good questions, and we're working through a lot of them right now, in fact.

In that vein, I'd like to have had an internal blueprint to work from for a cross-team project like this, so it felt more regular and less novel. We were definitely working to establish some of these practices as we went along. I hope to turn this talk into something like that internally to encourage more things like this.

And as I was talking about earlier, I wish that we had more fully understood how friendly and conducive to open source the approach to building systems that Kubernetes provides is. We have several projects that I feel we could have developed in an open-source-first manner that are now just waiting for time to be allocated to open source them.

I hope that in the future, we'll start defaulting some of these projects to be open source first, and we'll allocate the time to open source the ones that exist today.

Following this work, what's next on GitHub's DevOps journey?

It's really more of the same, to be honest. Like always, we're going to continue to actively seek feedback from all the stakeholders of our runtime environments. We're going to ask engineers if it gives them what they need to run things in production. We'll ask SREs about the experience of running the compute platform, and we'll seek feedback from leadership to see if they're seeing the outcomes that they'd like for our engineering org as a result.

We'll continue to relentlessly focus on automating work that scales with traffic or organizational size, because we'd rather be solving interesting problems than boring ones, frankly.

And we'll continue to build services that leverage this platform and encourage application engineers to do the same when it makes sense.

Using this common platform, we're going to continue to focus SRE's efforts on building improvements that benefit all services we run, not just the one that we're working on at the time. And using the platform's open source nature, we hope to open source more of what we build so that our efforts have an impact on the community beyond GitHub.

But most of all, like all of you, we're committed to this journey of continuous improvement. I feel incredibly humbled to be a part of a community that's so focused on improvement and on the common good, and I'm really humbled to have had the opportunity to speak with all of you today.

Thank you so much for your time and attention. As I mentioned earlier, I'm `jnewland` on Twitter. If you have any questions about our migration or our use of Kubernetes, or you want to talk about computers, tacos, whatever, please hit me up. I'll be at the GitHub booth following this if you'd like to talk there instead.

Thank you.