The Nth Region Project: An Open Retrospective
For the past year, a small team of engineers and I have had one job: allow New Relic to run an independent European region for data sovereignty reasons. That means taking around 500 services written by around 50 teams that have historically been assumed to run in just one deployment and changing them to work anywhere. And, at the end of the process, we needed to be able to spin up new regions quickly and sustainably operate them with our existing staff.
The talk will be in two parts, because a project like this isn't purely technical or organizational. We needed to choose technical changes that turned building out a new region from a many-month-long process for all teams into a project for one small team. We decided that the key was to move all services to run in containers, and have them all do service discovery via dependency injection. The reality of working at a medium-sized organization meant we had to have a lot of coordination and buy-in. I'll talk about how our roadmapping process both hindered and enabled this project to work at all, and how we used test buildouts and teardowns to integrate early and often.
This wouldn't be an open retrospective without talking about what didn't work well, which was primarily organizational rather than technical. We've learned some lessons on how to run large-scale projects that will hopefully help us on our next one, so I hope that we can provide some hard-earned lessons.
Andrew has worked on a wide range of projects, including the NRDB distributed event database, charting, the autocompleting NRQL query editor, bare metal hardware provisioning, and supporting multiple regions. He lives in Pittsburgh, Pennsylvania, USA, where he also sings classically in the Mendelssohn Choir of Pittsburgh.
Full transcript
Andrew Bloomgarden
In this talk, I'm presenting an open retrospective, talking both about the technical and organizational aspects of the project and what we learned along the way. So think of this as a case study in the way one medium-sized company deals with the challenges of operating at scale.
Most of you have probably heard of New Relic, and some of you are probably customers of ours, but let me introduce you to what we do from the perspective of an engineer at the company. New Relic is an observability company that builds software that our customers use to analyze how their software actually works in the wild. That means that our customers send us a ton of data that we process and store, and then we present it back to them in the form of alerts, curated UIs, and ad hoc queries, all for a variety of different products: Browser, Mobile, APM, and a few more.
Last year, as part of our partnership with IBM, we agreed to build out a new region in Europe in IBM's data centers in Frankfurt so that European customers could keep their data in the EU. And note that this isn't the same as some kinds of multi-region projects where it's just purely for redundancy. We needed to actually keep the data in a certain place. And this was a pretty scary proposition to us in engineering, despite our fairly well-run organization, so let me explain why.
When I started at New Relic eight years ago, we had just one application. It was the UI and the data collection tier all in one, a true Rails monolith. The company was two years old at the time, hadn't really had the time to build out a whole ton of technical debt. Shortly after I started, one of our engineers split out the collection tier into the aptly named Collector. It was written in Java, and it was a couple orders of magnitude faster and more efficient than the initial Rails implementation. If we had stopped here and said, "Great, we're going to build out a European region," technically and organizationally, we would have been totally fine. This wouldn't have been a huge project. It might not have been the right decision for the business, and given that we didn't do it, it probably wasn't. But changing a couple code bases, two-year-old company, not a huge deal.
But we didn't do that. Instead, in eight years, we've done a lot of things. We've scaled a lot, introduced new product features, and to handle that growth, we've had to continuously re-architect our software. Today, we're handling around 30 gigabits per second of data inbound, 15 million Kafka messages a second, and writing around 600 million events per minute. I think this is now up to around a billion events per minute, actually; the slides are a bit out of date. We have around 50 engineering teams and hundreds of engineers working on that.
Along the way, one transition we made was putting all of our teams on call for their own services, because there's no way that this can work with a central operations team. And this is what our architecture looks like today. Like most service-oriented architectures, this reflects both technical requirements, like actual products doing actually different things, and organizational structure, like this team worked on that, and this other one worked on that thing. The architecture is at the point where a single easily understandable diagram can't faithfully represent it in all its detail, and there's no way that our original one- or two-app architecture would have scaled this far.
So we made the right choice, but we never really considered that we'd ever have to run in more than one region. That was always just a potential future that never seemed to arrive, so when the business said, "It's time to build an EU region," we knew that this was going to be a very painful exercise. And when we confirmed that there was a really high chance that this wouldn't be the last region the business wanted to build, we said, "Okay, this is going to be a slightly different project. We're going to focus on building tools for building regions." Even though we know there's a lot of manual work that's going to go into this round, we want to make that an automated process the next time. The aspiration is that one small team can be in charge of clicking a few buttons and making some changes and supporting a whole new region. We called this Project Backpack as we mostly Americans were finally getting to go on a trip to Europe.
I'd said that we ran in one region, and we didn't really have experience building out multiple regions, but that's not technically true. Every year, we do a disaster recovery exercise where we prove to ourselves and to our customers that we can successfully rebuild the entire New Relic stack in a new environment. So we did actually have experience building out new regions, and we knew that it was just incredibly painful. We did prove that in the event, in a real emergency, we could drop everything and we could recover, but the exercise took a lot of effort across the entire engineering organization. So we knew when we started this project, that's what we'd have to look into. Why was that so painful?
In the eight years I've been at the company, we've had to solve a bunch of problems. This is something that's very familiar to organizations that have tried to adopt DevOps. We had to figure out how to deploy many services, support a polyglot environment, have some kind of service discovery system, some sane secret management, and eventually, more recently, having some better container orchestration. So we were just going to tack a new thing onto the list, and we were going to leverage all of our expertise in all of those earlier things, and we would be fine. That was at least the theory.
But the reality is a little different. Will Larson, who's worked at Digg, Uber, and now Stripe, wrote an article earlier this year about migrations. He said, "Migrations are the only mechanism to effectively manage technical debt as your company and code grows. And if you don't get effective at software and system migrations, you'll just end up languishing in technical debt."
It turned out that we just weren't really very good at large-scale infrastructure migrations. They just kind of seemed successful because we were so fast-growing. We would do a bunch of new things. They would use the new practices. Those would be great. The old thing didn't really come along for the ride, and it wasn't really a problem until it maybe blew up in production, and then we'd realize, oh, we have a problem here.
Now, a variety of vintages might be nice in a wine cellar, but it's not what you want for large-scale software systems. Our many vintages included, dating back to 2010, applications deployed via Capistrano and Puppet. In 2013, we realized that Docker was key to us scaling to way more services than we had before. That was the very early days of Docker. It was kind of on fire a lot of the time, but it was crucial for us. We wrote an internal tool to deploy that, Centurion, which kind of worked. We had an in-house service discovery system that we abandoned after a couple of years, but it was still used in a bunch of production software. We started using Vault in 2016, and in 2017, we started building out a new container orchestration platform based on Mesos and our own internal tooling on top of that.
The reality was that for disaster recovery, we just had to account for all of these different things. We had to copy-paste configuration, tweak it, find all the places where there are little edge cases in the code, and it got so bad that for our most recent disaster recovery exercise when we started the project, we just closed out an entire pull request for our Puppet environment because we couldn't trust ourselves to move forward with it. All of this is to say that if you have a DevOps environment like we did, you do it because you think you can move faster. But it's very easy to let yourself sort of mask over the problems that you even still do have, and then come a large project request like this, find yourself in the position where you're not actually able to execute on it.
That said, we had the actual requirement. We have to build this region, and we've got to make it so the next one's not going to be bad. So we had to look for what were the high-leverage interfaces, what were the things that we could implement now that maybe wouldn't check all the boxes, but would put ourselves in a better place for it the next time and make this one better as well.
First, we needed to solve service discovery. As I mentioned, we had an internal system that we didn't really use well, and we needed some kind of system. But another way we were thinking about this problem was in terms of static analysis. Excuse me. Let me get some water.
If you're building and deploying a new region, you're deploying all of the hundreds of services that you already have, in our case, and you don't really know everything about them. You know that these kinds of things all exist in production today, and therefore they're all necessary for production to work, but you don't really necessarily know the relationships between them or little subtleties about them. So if you're deploying Service Alice to your new production environment, it would sure be nice to know that it actually depends on Service Bob without deploying Service Alice and maybe a few layers depending on top of that before you realize that this bottom-layer thing was never deployed or was broken in some way and didn't actually work. With static analysis, with the ability to say, "I know for a fact that Alice depends on Bob," you can just deploy Bob first, deploy Alice, and you're in great shape.
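The deploy-ordering idea above can be sketched in a few lines. This is a minimal illustration, not the tooling described in the talk: it assumes the dependency relationships are already known as a simple map, and the service names (`alice`, `bob`, `carol`) and the `deploy_order` helper are hypothetical. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def deploy_order(dependencies):
    """Return service names ordered so that every service appears
    after all of the services it depends on.

    `dependencies` maps a service name to the set of services it needs.
    Raises graphlib.CycleError if the declared dependencies form a cycle.
    """
    # TopologicalSorter takes node -> predecessors, which is exactly
    # "service -> things that must already be deployed".
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency map: Carol needs Alice, Alice needs Bob.
declared = {
    "alice": {"bob"},
    "carol": {"alice"},
    "bob": set(),
}

order = deploy_order(declared)
print(order)

# A cycle cannot be deployed from scratch in any order, so it is
# better to find out before the build-out starts.
try:
    deploy_order({"x": {"y"}, "y": {"x"}})
except CycleError as err:
    print("cannot order:", err.args[0])
```

The point of the sketch is only that once dependencies are data rather than something buried in code, the ordering (and the detection of impossible orderings) is mechanical.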
Our original service discovery system just didn't have that property. It was buried in code. You could guess that something depended on something else, but you couldn't be confident. It could be buried many libraries deep, which wasn't really helpful when you were trying to do a from-scratch build-out. So we wanted to figure out a way to encode that information in our software so that if one service depended on another, we knew for a fact that that was the case.
We had a configuration system for our deployments that we usually passed hard-coded information in, but we introduced an abstraction layer. We could say, okay, we're normally passing in these hard-coded values. We're going to, say, ask for something that says where Bob is, and we can replace that for you. Then we can sniff out that information at deploy time to understand what was actually going on, that Alice depends on Bob.
This also let us solve a pretty related problem, which is how do you provision credentials? We had a solution for where you put credentials: Vault. It's a really useful tool. But how do you get credentials there in the first place? Again, we have hundreds of services, hundreds of databases. These dependencies are somewhat known, somewhat not. We needed a way to say, okay, I'm going to take this kind of information, I can encode it in a URL, and then I can get that same kind of information extraction from this abstraction layer, which lets me know that this service actually has an authenticated dependency on my DB. This is service discovery as dependency injection. Services declare their dependencies in a standard format. You can put credentials in there, and static analysis is actually possible.
Next, containers everywhere. This was key for us. They're a great interface, most importantly, between teams and machines. Originally, when we had a small number of services, we would have to go and make all sorts of configuration management changes on machines to deploy new services that had slightly different dependencies. This was killing us. So we knew since 2013 that containers in the form of Docker were crucial for us.
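As a rough sketch of what "declare dependencies in a standard format" could look like, the example below encodes each dependency as a URL in a service's deploy configuration and then reads the dependencies back out at deploy time. The talk does not show the real format, so everything here is an assumption made for illustration: the `discovery+` scheme prefix, the `Dependency` type, the config keys, and the host names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical marker: values whose scheme starts with this prefix are
# declared dependencies rather than hard-coded values.
SCHEME_PREFIX = "discovery+"


@dataclass(frozen=True)
class Dependency:
    target: str                # logical name of the service or database
    protocol: str              # protocol to use once resolved
    credential: Optional[str]  # credential name to provision, if any


def extract_dependencies(config):
    """Statically list the dependencies declared in a deploy config."""
    deps = []
    for value in config.values():
        if not isinstance(value, str):
            continue
        parts = urlsplit(value)
        if not parts.scheme.startswith(SCHEME_PREFIX):
            continue  # an ordinary hard-coded value
        deps.append(Dependency(
            target=parts.hostname,
            protocol=parts.scheme[len(SCHEME_PREFIX):],
            credential=parts.username,
        ))
    return deps


def resolve(value, locations):
    """Swap a declared dependency for its real address in this region."""
    parts = urlsplit(value)
    if not parts.scheme.startswith(SCHEME_PREFIX):
        return value
    protocol = parts.scheme[len(SCHEME_PREFIX):]
    return f"{protocol}://{locations[parts.hostname]}{parts.path}"


# Hypothetical config for Service Alice.
alice_config = {
    "BOB_URL": "discovery+https://bob/v1",
    "ACCOUNTS_DB": "discovery+mysql://alice_rw@accounts-db/accounts",
    "LOG_LEVEL": "info",
}

deps = extract_dependencies(alice_config)
for dep in deps:
    print(dep)

eu_locations = {"bob": "bob.eu.example.internal:8443"}
print(resolve(alice_config["BOB_URL"], eu_locations))
```

Because the declaration is plain data, the same pass that resolves addresses can also report that Alice depends on Bob and that a credential named `alice_rw` needs to exist for `accounts-db` before Alice is deployed.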
But we could push that forward. We can make it better for the new region, push that into our container fabric, into our Mesos platform. We could say, okay, we're setting ourselves up even better for the future. We're going to put stateful services there as well, so that even if we can't necessarily orchestrate them today, we do have some experience running some stateful services, like our Cassandra clusters in this. We're going to push more services into that. In fact, all of our services are going to be in containers, and eventually, we're going to orchestrate all of them, just not right now.
Next, one important thing is to standardize on a better operating system for a modern use case: CoreOS, not CentOS. CentOS was great for us when we managed via Puppet, but we had sort of aged past that. It wasn't really helping us today. CoreOS allowed us to encourage the behavior we wanted to see. Configuration is very limited. You don't have a package manager, and that means that anything you want to do, you want to do it in a container, which is the behavior we want to see. It has a first-boot configuration system that's really well matched to the configuration it does support, and that means we can stop using Puppet, which we just haven't managed very well. We were even able to use the Cloud Config Transpiler that they published to make assertions in our machine provisioning code that things were happening. So we can assert here that our New Relic infrastructure agent is installed on every machine in our clusters, just as part of our provisioning process.
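The kind of provisioning-time assertion described above might look roughly like the following. This sketch does not use the Config Transpiler itself; it assumes each machine's configuration has already been parsed into a plain dictionary with a `systemd.units` list, and that shape, the unit name `newrelic-infra.service`, and the helper names are all assumptions made for illustration.

```python
# Assumed unit name for the infrastructure agent; adjust to whatever
# the real machine configuration uses.
REQUIRED_UNITS = frozenset({"newrelic-infra.service"})


def missing_units(machine_config, required=REQUIRED_UNITS):
    """Return the required units that are absent or not enabled."""
    units = machine_config.get("systemd", {}).get("units", [])
    enabled = {u["name"] for u in units if u.get("enabled", False)}
    return set(required) - enabled


def assert_all_provisionable(configs):
    """Fail provisioning early if any machine lacks a required unit.

    `configs` maps a machine name to its parsed configuration.
    """
    problems = {}
    for machine, config in configs.items():
        missing = missing_units(config)
        if missing:
            problems[machine] = sorted(missing)
    if problems:
        raise ValueError(f"machines missing required units: {problems}")


# Hypothetical parsed configurations for two machines.
configs = {
    "worker-1": {"systemd": {"units": [
        {"name": "newrelic-infra.service", "enabled": True},
        {"name": "docker.service", "enabled": True},
    ]}},
    "worker-2": {"systemd": {"units": [
        {"name": "docker.service", "enabled": True},
    ]}},
}

try:
    assert_all_provisionable(configs)
except ValueError as err:
    print(err)
```

Running a check like this inside the provisioning code means a machine without the agent never boots into the cluster, rather than being noticed later as a gap in monitoring.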
Finally, we use Terraform because some infrastructure is just sort of fiddly customization. We needed a way to make that repeatable. We needed to know that, okay, yes, this S3 bucket is different from that one, but the next time we go and build this out, we're going to build them out in the same fiddly different way instead of accidentally making them the same and running into problems at runtime. Importantly, if you use Terraform, maybe you don't, maybe you do, you can develop your own providers, and it's relatively easy. We found that the investment in doing that just paid off repeatedly.
I've just spouted off about all the requirements that we had just tacked on to our initially simpler project of EU region. You may be thinking that this is kind of like a second-system level of project. You've just turned this smaller project into this huge, large project, which is a great way to have a project fail. And you're right, this was risky. But we did try to ameliorate some risk. We tried not to take on things that weren't actually necessary. For example, we have our load balancing infrastructure via F5 hardware load balancers in our US region. We couldn't give those to IBM, but we could run virtual F5s in containers on CoreOS, and that let us say, "Cool, F5s are just the same as in the US, not part of this project."
The reason that we did this was, okay, let's say we built out the European region, and we just didn't make the infrastructure investments we needed to, and just business as usual. We might have ended up with twice the ongoing operational load. That might have been killer, especially if you spend the same time to build out a new region, you do that a couple more times, you have burned months and months and months and months of time, and you now have five regions to support that are each eating the same amount of time. This is just lighting money on fire. We might drag the company down under endless operational toil or just be forced to turn down business opportunities by not building those regions in the first place. So we just had to strike the Goldilocks balance. What is the right work that we need to do to make it so that future regions were going to be better and so that we were set on the right path for success in the future?
That's all the technical stuff we wanted to do, and we knew that pretty early on in the project. We hoped that we would do that discovery, then we'd fan out all the work that we had to do because every team was going to have to do something to their own software. They'd all do that. We'd integrate it, build out the regions a few times, we'd test it, and then we'd release it to our customers.
That was the hope, but here's the reality. Now it's time for the retro. What actually happened? I wouldn't be standing up here if the project had gone perfectly because I honestly think that if this went perfectly, it's just a boring talk. Also more importantly, none of the things we did technically are all that unusual these days. This is the kind of buzzword bingo that you'll see at many conferences. So what's more interesting to me is that we made these changes at some real scale. I'm going to go over some lessons we learned along the way and sum each one up with a set of things that we'd start, stop, and continue in future large-scale projects, whether at an infrastructure layer or a product level. I hope that these are all lessons that can be relevant to you, even if you're working at a smaller or bigger company, as you consider large-scale projects of your own.
First, quick ramp-ups. How do you prioritize work? The Backpack project needed work done by every team at the company, which meant that we needed some way for all those teams to agree to do the work. Let's talk about how roadmapping works at New Relic.
First, all of our engineering teams are autonomous. They have their own roadmaps set by the team and their product manager, and those roadmaps are supposed to meet the team's own goals, as well as to meet the broader requirements of the business. Those broader requirements come from a group called the Product Council, which meets quarterly, or more often if necessary (hopefully not), and publishes a list of up to five high-priority projects that are going on across the company in order of priority.
Teams are supposed to contribute to those projects in order if they can. Some of those might be something that only one team can effectively work on, like Team A needs to get this better. Cool. What that means for everybody else is just make sure that Team A can do everything they possibly can. Make sure they're not blocked. If they're blocked and you can unblock them, do it. That's the most important thing you can do. Then there are other projects like Backpack, which is just a high-priority cross-cutting project. And then we have other features that we want to make a press splash with, that kind of thing.
This process actually does work, but there's a catch. What happens if the project needs resources to move forward and the Product Council just hasn't prioritized it yet? That's what happened with Backpack. There was a month or so of delay past when the project should have been prioritized but wasn't, and then suddenly it was all systems go. We were prioritized and almost all teams at the company were knocking on our door trying to figure out what they actually had to do.
In retrospect, this sudden swing from no support to near-total support really wasn't good for the project. We suddenly had to transition from operating relatively independently, solving some problems in the IBM infrastructure, doing some strategizing, whatever, to, oh, wow, there are, like, 40 teams talking to me right now. We'd had some organized discovery work, helping teams figure out what they had to do, and that was really well organized, but even still, it left the central engineering team scrambling with our sudden success and just getting attention. We weren't ready. We weren't ready with documentation, service discovery, or other core tooling, which were critical to the success of the project.
Worst was that we just didn't have an easily digestible philosophy to help people make decisions. If you're going to embark on something like this and you want everybody to be able to make their own decisions local to themselves and have it be the right kind of decision, you have to tell them what they should be thinking of, how they should be trading things off, so that they can prioritize the right things locally. We just didn't have that philosophy available. We talked some internally, but it wasn't good enough. So in the future, we're going to start preparing for what happens when you get this high priority, going to produce a project philosophy document to try to help people make those decisions, and we're going to continue prioritizing important work across the company.
Very related problem: the problem of moving goalposts. I mentioned that we had this big discovery process and we were going to specify this work up front: containerization, move to our Mesos platform, service discovery. But there was also later work that we kind of knew was coming, and we just didn't talk about. For example, our initial instructions for teams asked them to make their services ready to receive URLs in the service discovery format, but not actually use the tooling because the tooling wasn't ready yet. We didn't mention that the tooling wasn't ready yet because we didn't want them to delay starting the work until it was ready. This logic made some kind of sense at the time.
We thought, okay, changing that last step of using the central tooling was going to be pretty trivial. It's just changing a couple configuration files, but it's still work that we didn't really specify. We thought that this was a good idea for a couple reasons. First, each team has a hero role that passes around, typically via the on-call rotation. They're supposed to handle small requests from other teams, so we thought that the smaller stuff to come could be handled by whoever was the team's hero. Second, we didn't want to exhaustively list work that couldn't be done yet, especially since we didn't know what the work precisely looked like yet. So we didn't want teams to say, "Oh, okay, I can't start it until you know exactly what I'm supposed to do." We'd rather teams make mistakes and follow up than wait.
In retrospect, this was just a mistake, most especially in not communicating that there would be follow-up work, even if we didn't know the full extent of it. And the fact that each team's hero rotated just meant that there was a lot of context switching necessary, since the person that did the work initially might not be the person who was responding to it this week.
That was one kind of moving goalpost. There was another. We did three test build-outs as part of this, where we tore the environment down and brought it back up again. Everything was changing basically every time. Our goal here was that we were going to iteratively improve the infrastructure side of the build-outs and figure out some changes we needed to make while it wasn't yet a production environment.
If you're working on an infrastructure project, it's great to be able to say, "Oh, that's not actually really production. I can make a cross-cutting decision and just have it come into effect without having to realize, okay, now I need to manage a really slow rollout." But the reality of the build-out then was that teams would do work. We encouraged them to do the work in the US first, because we were making the same improvements there. Then they'd wait a few weeks or days, depending, because they couldn't test it in the EU yet. Then the Backpack team would try to deploy their work, and then we'd realize something was broken. We'd go to them. They'd say, "Oh, no, everybody's blocking the project. This is a mess."
In our minds, the goalposts weren't really moving here because the goal was always your software works in the EU, but at the same time, that's a really ambiguous statement. And given that things on the ground were changing, it's not surprising that things just weren't always working correctly. So in the future, the most important thing that we could do to fix this is to use a steel-thread approach, validating the design using a sub-project that tests it thoroughly.
For example, if we could find a slice of our system that is top to bottom, some product to data storage and authentication and everything we need, but it's, say, 20% of the services of the company, we could have tested things with those 20% instead of making the 80% come along for the ride and then realizing that we screwed something up. Now, this might have actually been the wrong decision if we'd gone this way because it could have lengthened the wall-clock time of the project. If we had said, "Okay, we're going to spend three months with the 20% and then six months with the 80%," maybe those 80% teams wouldn't be done in time. Maybe we did get a benefit by starting everybody at the same time. But there's probably a better balance to strike than what we actually did.
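One way to pick such a slice, assuming service dependencies are available as data, is to start from a single product entry point and take everything it transitively depends on. The sketch below is illustrative only: the service names and the dependency map are made up, and `steel_thread` is a hypothetical helper, not something described in the talk.

```python
def steel_thread(entry_point, dependencies):
    """Return the entry point plus everything it transitively depends on.

    `dependencies` maps a service name to the set of services it needs.
    """
    seen = set()
    stack = [entry_point]
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        # Services with no entry in the map are treated as leaves.
        stack.extend(dependencies.get(service, ()))
    return seen


# Hypothetical dependency map for a ten-service system.
dependencies = {
    "browser-ui": {"query-api", "auth"},
    "query-api": {"event-db", "auth"},
    "auth": {"accounts-db"},
    "event-db": set(),
    "accounts-db": set(),
    "mobile-ui": {"query-api", "crash-store"},
    "crash-store": set(),
    "alerts": {"event-db", "notifier"},
    "notifier": set(),
    "billing": {"accounts-db"},
}

thread = steel_thread("browser-ui", dependencies)
share = len(thread) / len(dependencies)
print(sorted(thread), f"{share:.0%} of services")
```

A slice chosen this way is top to bottom by construction: it includes a user-facing product, its storage, and its authentication path, while leaving unrelated services out of the first round of testing.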
In the future, we're going to start having that steel-thread test case. Going to be more honest and transparent. We're going to stop hidden work, even if only by acknowledging that some unknown future work exists. And that's because we are going to continue to avoid complete waterfall planning. Agile's really important. We have to be able to react to changing conditions, changing realizations. We don't know everything up front.
Next, communication is hard. I know this is surprising to everyone in the room. New Relic has a strong culture of internal blogging as a means of broadcasting ideas and posting updates on projects. If you want to influence the company, this is the way you do it. Every step of the way, the Backpack team wrote documents about the ideas behind the project, the changes we needed made in software, plans for build-outs, and what we just did each week.
But the trouble with an internal blogging culture is that everyone's doing this, so there's a ton to read and a ton to discuss, and you as an individual basically have no way of knowing exactly what you should be reading. You can read the things that you think you need to read, but you don't know what you should have been reading all that time. This is true across the board. It's really hard to know how to keep up on everything.
We had town hall events to broadcast updates, but people would be on vacation and they'd miss them. We tried having a checklist application, but either it wasn't looked at or, when it was, people would have to notice whenever we made a change to it. We tried to have some automated linting so that we could sniff out problems in code before teams ran into them in production. That wasn't used super well because it was kind of out of band. Wasn't great. And emails, people just don't read them.
And I say, blog posts don't get read, emails don't get read. Everybody's always reading these, just you can't count on it. You can't count on an individual having read any individual thing. So the most important thing we could have done was have some kind of centralized documentation, and this is something we realized later in the project and implemented. This probably would have helped teams and individuals catch up on what they missed without having to hunt through the blog post history or watch recorded events, which is something that no one is going to do.
So we'll have some kind of centralized documentation. We're going to have a user-readable change log of requirements, and Git commits are not good enough. Nobody's going to look at those. If you want whitespace changes to show up, people are probably going to ignore that. Going to have some kind of better linting. We're going to continue to blog internally. It's really useful for us, and we're going to continue to just communicate using as many channels as possible because we know that one is not enough.
Next, local maximums. I'd mentioned that we'd gone through all these phases of incrementally improving our infrastructure, but we didn't really have a standardization at any point along the way. One consequence of that was that teams were heavily incentivized to make the system better locally. What I mean by that is, let's perform a thought experiment. Travel back, say, three years in time. You're a team at New Relic. You own 20 services. You deploy them all frequently, but the tooling's not that good. You kind of want to be able to deploy all 20 at once or say, "Okay, when I deploy this one, these other ones have to be deployed too," or, "I want to move from staging to production automatically." So rather than just twiddle your thumbs, you just built that.
This is great. You get three years of productivity benefits. You're able to move faster. It was the right decision for you. So maybe you throw in a couple other features too, like service discovery. Three years later, you've had that benefit, but you don't have the standardized platform that was built in the meantime, and now someone comes along and says, "Cool, if you want to get on our European region project," which, by the way, you have to because we need to ship this thing, "you need to adopt the standardized tooling." So then you're kind of in a little bit of a problem. You have some kind of transition pain, but there's a promise of, oh, that standardized tooling is great. We have a build and deploy tools team working on that. They have your interests in mind. This is going to be better for you.
But reality is actually a little different. That standardized tooling is going to be worse for some teams in some ways. It's going to not quite match what they want to see, but the benefit is for the company as a whole, not them. So it's really like there's this future tooling benefit that everybody wants, but we need everybody to move to it now, yesterday, not when it checks all the boxes. This really isn't a good way to make friends, to basically force all of your engineering teams to simultaneously adopt shared tooling, but it's kind of critical.
I don't have a great answer here other than just having more empathy for teams stuck in this situation. Communicating well in advance could help, too. If the tools team had known that there were a couple gaps that 20 teams had that they could have closed to make this easier, that might have helped. And also, frankly, we just need to stop making assumptions in general about how teams or individuals will react. So, I said, okay, this team built out this tooling and they've had it for three years, and you might assume, oh, they're going to be really reluctant to give it up because they built it and it's working so well for them, when in reality, they may just hate it. The team's membership may have completely turned over in that time, and they may just be stuck with some legacy stuff they don't like. Whatever the reason, maybe they just want to get rid of it. So just stopping those assumptions is critical. And we are going to continue to make standard tooling better.
The best decision we made was leaning on what we have. Excuse me. We had our in-flight projects already. We had our container fabric, which was our Mesos platform. We had our Grand Central build and deploy tool system. I mentioned, or had on a slide earlier, our containerized database platform. All of these were crucial in making this project possible. So by saying, okay, everyone move to this, we were able to basically get a lot of bang for our buck. There was a huge uptick in adoption rate as part of this, which wasn't necessarily good for the teams involved supporting those central things.
On the other hand, part of our design goal here is to make it so that platform teams are the ones bearing the brunt of the work, not spreading it across a bunch of teams that aren't actually equipped to do that work. In the future, we're going to start making clear which priorities are highest for infrastructure teams so that they know that this is coming. And we're going to look for high-leverage work a small number of teams can do because it's really, really useful when you have those platform teams that are able to contribute to large projects like this.
Finally, the last lesson learned: the importance of a pilot phase. Our original plan was discovery, fan out, test ourselves, and release it. We realized that there were just way too many unknown unknowns in this environment to live up to our own expectations for reliability. So we changed it. We delayed GA significantly and opted instead to run a pilot phase for a very limited number of customers. That allowed our teams to be on the hook for reliability without the same consequences when things just inevitably went wrong, whether in the underlying cloud or in our software, whatever. We knew things were going to go wrong. We wanted to be able to learn how to deal with that before we were live in production for everybody. This was just absolutely the right decision. So in the future, honestly, we've just got to stop the magical thinking. I have no idea how we thought that this was okay in the first place.
So, to sum up, where are we now? First off, the project did work. It was painful at times, but we do have an EU region. If you're interested in it, contact us. We can hook you up. Our disaster recovery exercise has seen a ton of benefits. We used an order of magnitude fewer engineering hours this year to do it. We've had less busy work in general service operations. A lot of the boilerplate configuration is gone, and we've laid the groundwork for future improvements. We have 95% of our services in our container fabric.
There's a meta benefit as well. We're an observability company serving the modern market, and so the more that we are on the bleeding edge of things, the more that we are able to see the gaps in our own product and cover them before our customers encounter them. It's a really nice thing to have. We've learned a lot. We now know how to run large-scale projects better. So in the future, we're going to continue just a little bit of magical thinking and trying bold things like projects like this.
Thank you to all of you for listening and to all the hundreds of people who worked on this project. Thank you.