Systems Architecture at Massive Scale: The Unreasonable Effectiveness of Simplicity
Systems Architecture at Massive Scale: The Unreasonable Effectiveness of Simplicity
Chapters
Full transcript
The complete talk, organized by section.
Randy Shoup
[00:00:17.200] I'm Randy Shoup, and I want to talk today about large-scale architecture, the unreasonable effectiveness of simplicity, and a little bit about my background. I used to be an engineering leader at eBay in the early days. I've been an engineering leader at eBay again for the last two years as Chief Architect and VP of Engineering. I spent a bunch of time in the early days of Google Cloud running engineering for Google App Engine at Google. I led engineering at Stitch Fix up to and through the IPO, and then I went to WeWork and led a part of engineering up to and through the not-IPO. That's a conversation for adult drinks at another time.
[00:00:58.200] For those of you who were in the plenary session that I did the other day, there is a little bit of this where I'm going to tell some of these stories again, but I promise that we're going to learn different lessons from them as we talk through them again.
[00:01:08.300] eBay has been around for 27 years, and it's had 27 years of software. eBay's architectural journey began back in 1995. The founder, Pierre Omidyar, was playing around with this new cool thing called the web over the Labor Day weekend, and he built the first thing that became eBay much later. It was in Perl. Every item was a file. It lived on a little 486 tower underneath his desk, and that was the beginning of what became eBay.
[00:01:42.200] V2 of eBay's architecture was a monolithic C++ monolith inside a DLL in Microsoft Internet Information Server. It grew, at its worst, to 3.4 million lines of code in that single DLL. They were hitting compiler limits on the number of methods per class, which was 16K for Microsoft C++ at the time. This is a pretty unpleasant thing to work in and to experience as a developer.
[00:02:14.400] They completely retooled the architecture starting about 2002 to what they cleverly called V3. That was a migration, not really to microservices quite yet, but to what we would now maybe call many applications in Java. There were about 220 different clusters of servers, each of which would serve a set of pages that were some particular domain of eBay. Search pages were served by one application, buying pages by another application, payment and billing and selling, etc., times 220. Those things lived over a sea of shared databases. In 2012, they started a migration to V4, and that was what we now call microservices, and there has been a subsequent evolution since then.
[00:03:03.300] Amazon has gone through a similar evolution. It started almost exactly the same time as eBay, 1995. It also started with a monolithic monster. This was a Perl Mason front end over a C back end. They were a four-gigabyte application in a four-gigabyte address space, a 32-bit address space, so the entire application didn't even fit in main memory at the same time. They were regularly breaking the linker as they were trying to build the software, and they had so much memory leakage that they were restarting about every 100 to 200 requests. Because it was so painful, they were only releasing maybe once a quarter.
[00:03:42.500] Then they did a major service migration from 2001 to 2005, built services in modern languages of C++ and Java, and they did not do any shared databases. Every service was owning the domain and owning the data at the same time. Then it's no surprise that in 2006 they started expanding into AWS and taking over the world's online retail.
[00:04:06.500] A couple of lessons that we can learn from this is that no one starts with microservices. I strongly recommend that you do not do that. But past a certain scale, almost everybody co-evolves at large scale to something that we would now call microservices, and I think that is not a coincidence. We're going to talk mostly in this talk about all the reasons and benefits of migrating in that direction.
[00:04:30.100] The other lesson I like to say here is: if you don't end up regretting your early technology decisions, you probably over-engineered. You could imagine some other mythical company in 1995 that, instead of building a monolith like eBay and Amazon did at the time, they built some beautiful distributed system, but at the same time they didn't meet customer needs or provide value. I think it's perfectly legitimate, and as you can see from these internet leaders, to expect to evolve your architecture many times over a long lifecycle.
[00:05:00.100] At large scale, I want to talk about lots of aspects of simplicity and contend that every time we do a good re-architecture, we're making things simpler, not more complicated. First I want to talk about simple components. Then I want to talk about simple interactions between those components. I want to talk about driving for simple changes, and then I want to talk about doing evolution in a simple way as well.
[00:05:21.400] Let's start with simple components. The first architectural tool we have in our toolbox is modular services. Just as Amazon did back in that 2001 timeframe, we want those service boundaries to match the problem domain, and we also want those service boundaries to encapsulate both business logic and data at the same time. We want all interactions to those services to go through the published service interface. There's no secret back door to your data storage, and that interface bounds the formal contract between services.
[00:05:55.600] Similarly, from an architectural perspective, those service boundaries are going to encapsulate architectural -ilities like fault isolation and performance, and they also, most importantly right now, are a security boundary as well. Within those services, we want to disconnect our domain logic from the stuff that isn't our domain logic, like I/O. We want to have stateless domain logic, ideally written as a pure function from the functional programming style, that matches the domain problem as directly as possible. That means it's locally testable and testable in isolation of other things. It's deterministic. If we do domain stuff at that level, that code tends to be more robust to change over time because it directly aligns with the domain that we're trying to build.
[00:06:46.500] From a performance perspective, we want that processing to be straight-line processing: no fanciness, not a lot of branching if you can avoid it, just straightforward synchronous processing on a single thread. We want to separate domain logic from I/O, like I mentioned. This is like hexagonal architecture, the ports and adapters pattern, and from functional programming the phrase is a functional core with an imperative shell around it.
[00:07:15.900] This is slicing the domain in a vertical sense, but I think we need to slice it in every dimension as we grow larger. Sharding is a way of slicing our problem domain in a depth sense. Shards partition the services by the data space. One shard is going to be a certain set of users, another shard will be another set of users, something like that. Those shards provide value because they also encapsulate architectural -ilities like resource isolation and fault isolation, and they provide availability and performance benefits. In the modern cloud world, those shards are autoscaled, so we either create new shards or expand and contract those shards as our data and processing needs grow.
[00:08:02.100] That's two of the three dimensions. Let's talk about the third dimension, which is horizontal slicing. Here I want to talk about several ideas, the first of which is service layering. In one of these large-scale service architectures, you don't just have a single service vertically and that's it. You have a graph of services where there are top user-facing services which depend on other services, which in turn depend on other services, etc. Those common services are abstracting out and providing widely used shared capabilities. That service ecosystem is actually more like a graph than a layer, but I like the mental model of layering.
[00:08:38.300] Those services are going to grow and evolve over time. Over time we're going to find new common things, factor those out into their own services, and my mental model of how teams and services grow over time is that they divide like cellular mitosis. We start with a domain, it gets more complicated, and it subdivides into two daughter domains, etc.
[00:09:01.300] The other vertical idea or tool that we have is a common platform, and this has been a big theme at this conference. At large scale, we definitely want to be providing a paved road of shared infrastructure, where we can provide standard frameworks, where it's easy to get things going, that prioritize the developer experience. Lots of places do this well, but I think particularly Netflix and Google are really good at providing this paved-road common platform.
[00:09:28.200] Why are we doing this? We're doing this to separate out concerns. We're trying to reduce the cognitive load on the individual development teams and bound their decisions: give them a limit on the number of decisions they have to make in order to make forward progress. One of the things I find super interesting is that large-scale organizations are often investing more than 50% of their development effort in platform capabilities. As you get larger and larger, the ROI from investing in these common shared platforms becomes larger and larger over time. You'll see the big leaders spending what you might think is an outsized amount of effort and resources in these platform capabilities exactly because, if you just do the nuts and bolts, that's the place where your investment pays off. If I can make an investment in one team that pays off in 10 teams, 100 teams, 1,000 teams, that definitely makes sense.
[00:10:23.100] That was simple components. Now let's start talking about simple interactions. The mental model that really inspires me in this area is the Reactive Manifesto. That's a way of expressing, at a high level, functional programming as expressed in distributed systems. The first idea in terms of intercomponent interactions that I want to mention is being event-driven. Rather than communicating changes synchronously between components, we want to prefer, when we can, to communicate state changes as a stream of events. An event is a statement that something interesting occurred in service A, and that gets consumed by service B. Ideally, that event represents something semantic within the domain.
[00:11:16.400] Why are we doing this? We're doing this to decouple domains and teams. We've talked so much at this conference around coupling and coordination, and this is a way of breaking that coupling by abstracting out the communication through this common event interface. Also, as I'll talk about in more detail, it allows service A and service B to be asynchronous from each other. This simplifies not only the interaction between the components, but it strongly simplifies, in my experience, what it takes to build either of those components. If my mental model as service B is that I receive events from service A, I do some processing, and then I produce a set of events that are consumed by C or D or E, that's a simple and straightforward mental model for me and my team.
[00:12:00.300] The next idea is collecting the data storage that we have in the form of an immutable log. Store state as an immutable log of events rather than a set of mutable state that just keeps changing all the time and we can't go back. This often matches the domain we're doing straightforwardly. When I was at Stitch Fix, we would send packages to our clients, and we had a very clear immutable log of all the steps that would happen along the picking, packing, shipping model. That was a lot better than a single record in the database that said what the current state of the package was.
[00:12:40.700] This log encapsulates architectural -ilities as well, in terms of durability. It allows us to be traceable and auditable in the things that we do. It's replayable if we do it right, and it's very explicit, understandable, and comprehensible for the teams that are managing it. Of course, if we're having a log that grows without bounds, there are standard techniques about compacting those logs into snapshots for efficient storage and efficient processing.
[00:13:09.700] The next idea is about embracing asynchrony. We want to decouple the operations between part A and part B of our system in time. That allows us to decouple availability: system A can be up while system B is down. It allows more complex processing than we'd be willing or able to do in a synchronous way, and it allows us to make independent changes on both sides of that asynchronous boundary in a way that doesn't affect the other side. Again, just like the event-driven model, I think this strongly simplifies the mental model and the cognitive load for somebody who's implementing one of these components.
[00:13:45.800] The interesting idea here is that now that we have this asynchronous idea, we can actually invert what we might initially think of as a synchronous call graph, calling from A to B to C to D in a direct synchronous flow, and have D produce events which are consumed by C, which produces events consumed by B, which produces events consumed by A. I'm going to give some specific industry examples that illustrate this point really well.
[00:14:15.600] The first is a recent change to the Netflix viewing history. There was a great article about a year ago about what they did there. The purpose here is storing and processing playback data for members. They get about a million requests a second at peak, so it's a pretty heavily used system for lots of different use cases at Netflix. The original synchronous architecture was synchronously writing to storage, straight through from your browser or TiVo or whatever. If there was any back pressure or delays in the system, that flowed all the way back to the consumer side, which is pretty unpleasant. They re-architected in an asynchronous way. Instead, they're writing to a durable queue. There's an async pipeline to process that queue in lots of different ways for lots of different use cases, and they materialize views along the way to get more efficient reads.
[00:15:08.900] The best example is one that Scott Havens from Walmart, or ex-Walmart at the time, gave in the 2019 DevOps Enterprise. I'm going to replay that a little bit for you because it's just so amazing. The problem statement they were trying to solve was item availability: is this particular item available for sale to this particular person? You can imagine there is a big set of complex logic involving lots of different teams and domains across Walmart. The original synchronous architecture, and I'll show you Scott's diagram in a moment, was a graph of 23 synchronous service calls in the hot path, any one of the failure of any one of which would completely invalidate the results. It would give the wrong answer to the user.
[00:15:53.200] As a consequence of their requirement of 99.98% uptime for the end customer and 300 milliseconds, each of those individual 23 services had about 50 milliseconds to get their job done and needed five nines of availability. It was pretty expensive to build and maintain a system that looked like this. As Scott said in his great presentation, the point isn't to look at the details here. The point is the complexity and how much red is on there, because that describes the hot path.
[00:16:23.400] The rearchitecture that Scott and his team led was inverting each of those services one by one to use asynchronous events rather than a synchronous call graph, idempotent processing, so if they got events multiple times they would still give the same results. Each of them would store their state in an event-sourced immutable log, and they would materialize views of data along the way so you didn't have to descend the chain all the time when you wanted to learn stuff. They turned that horrible diagram into something where there were only two synchronous systems in the hot path, each of which had much easier availability and latency tolerances than the original system. I think this is a great model for building large-scale systems.
[00:17:15.500] The next thing I want to talk about is making changes in a simple way. The first idea here is incremental changes. We want to take everything that seems in our head to be a large change and turn each one of those things into a small set of incremental steps. Each one of those steps, we want to maintain forward and backward compatibility of all the things we might be changing: data, interfaces, etc. It's super common for those of us that live in this world to have multiple service versions of the same version existing in parallel at the same time. Those transitional states are the norm rather than the exception.
[00:17:51.100] The next idea, which I think is going to be preaching to the converted in a place like this, is continuous testing. The pitch I like to make to my teams is that the reason why we do testing is not to slow us down, but to help us go faster. This picture, which I took in Budapest, is of a funicular. It's one of those trains that go up super steep slopes, and they have a rack rail in the center that helps the train not to slip back. My mental model of tests is every time we add a new test, it's one more chunk forward on that rack rail that allows us to move forward but never slip back. Tests help us go faster because they provide the solid ground upon which we move forward.
[00:18:36.400] Tests help us make better code because they give us the confidence to potentially break things and refactor in a merciless way. Tests also make better systems because the earlier we can catch issues, the earlier we can resolve them. I also strongly believe that tests make better designs: by thinking about testability and designing for testability up front, it helps us make better software. When we say something's testable, it means it's modular, it has separation of concerns, it encapsulates stuff. The things that are testable are exactly the things that are good software.
[00:19:11.200] I got that idea from Dave Farley and his excellent book Modern Software Engineering. Michael Feathers has another great thing to say about testing, which is testability in good design. I love this quote: all of the pain that we feel when writing unit tests points to underlying design problems. Again, as software developers and engineering leaders, the more that we can encourage our teams to test, the faster and better we're going to move.
[00:19:41.400] The best example that I've seen of that in my career was at Stitch Fix, where they had been doing test-driven development from the beginning of the software, and basically as a consequence we didn't have a bug tracking system. It's not that we didn't have bugs. We definitely produced bugs, but we had such a low level of outstanding bugs that when new issues would come up in production or along the way, we just fixed them because we didn't have this big long backlog of stuff. Almost every other place that I've worked, we have this backlog of thousands of other bugs that we're never going to get to. This kind of inbox zero is the place that you can get to if you practice test-driven development along the way.
[00:20:22.600] The last thing I want to say in this section is about continuous delivery. By deploying our services, or at least having the ability to deploy our services often, we're able to make lots of changes and make much more solid systems. We're able to release smaller units of work, easily code-review them, roll them forward, roll them back, and increase the rate of change that we can produce in our systems while, at the same time, substantially reducing the risk of the changes that we're making.
[00:20:56.200] I talked a little bit about the continuous delivery journey I led at eBay in the plenary session. I'll briefly review that. When I came to eBay about two years ago, I surveyed all the teams and tried to figure out what we could do to improve software development at eBay in general. It became super clear that despite having issues in testing, architecture, monitoring, issues all over the place, the best place where we could put our efforts was on delivering smaller units of work with lower lead time and a faster number of deployments. As a consequence, we were able to double the productivity of the engineering teams we worked with. That's measured in Flow Velocity: doubling the number of features and bug fixes that a given team, holding the team size and team composition constant, was able to achieve.
[00:21:50.300] Again, I was Chief Architect, so you'd think I would make architectural changes as the first thing, but this was actually a prerequisite for us to be able to make larger-scale changes. If we can't make small changes safely, then we probably shouldn't be doing really big changes and assuming they could be done safely.
[00:22:06.500] In the last couple of minutes, I want to talk about simple evolution. As we saw in both examples from eBay and Amazon in the beginning, it's common. In fact, I've never heard of a counterexample of a system that lives for a long time: it does need to be re-architected over time. Why? Because things are changing over time. The landscape is changing. Maybe our users have grown 10x, 100x, 1,000x, a million x. We need to be able to do evolution in a safe and simple way. I want to talk about evolving individual services and then evolving the system overall.
[00:22:48.300] First, I'll give examples from when I was at Google. Service evolution at Google is very fluid, and it's something that I really like about the ecosystem at Google. It's very common to be creating and extracting new services as we recognize the need for new capabilities, but services only justify their continued existence at Google by being used. As soon as a service stops being used, we shift resources to something else, redeploy the people on that team, and migrate people away to the next thing. We deprecate services pretty aggressively when they're no longer used. That also happens externally, I'm afraid. Those domains are growing and dividing over time, like I mentioned, in the form of cellular mitosis. The lovely engineering quip that people say at Google is it feels like every service at Google is either deprecated or not ready yet.
[00:23:46.200] The last thing I want to talk about is migrations from one kind of architecture to another. I love this quote from Martin Fowler: the only thing a big-bang migration guarantees is a big bang. I don't recommend that. I'll quickly walk through my mental model, and I've done this a bunch of times, about how to do an incremental migration of a large-scale system.
[00:24:10.500] Step zero is pretty obvious from a DevOps perspective: pilot the new thing with something that actually matters. We heard this from Airbnb and several other examples in the conference today. Choose some initial end-to-end vertical experience that actually matters and build that, or rebuild that, in the new way. Use that as an opportunity to learn about the new architecture, whether it's moving to cloud, moving to containers, moving to serverless, moving to Haskell, whatever. This initial step is actually the hardest step because you're learning how to do something new and unlearning the old ways that you are used to in your old architecture.
[00:24:49.700] Now that we've got a successful pilot and learned from that, now and only now should we start rolling out the migration across the system and doing a much larger migration. The maybe counterintuitive approach that we can do here is by prioritizing business value. When eBay moved from the V2 to the V3 system, once it did the pilot, it started with the highest-ROI pages first. You might think that sounds really risky and stupid, but no: if you have a limited budget, and you always have a limited budget whether you know it or not, for that migration it guarantees that the things we did were the highest bang-for-the-buck things we could do. It allowed us to front-load a bunch of the work in terms of discovering potential issues that were problems.
[00:25:35.400] Of course, as you're doing this, you never want to have your company stop doing new work. You're always going to have to think about how to do new feature development in parallel with this migration. But my tip, because I've done it the wrong way many times, is any individual step for any individual component should either be a migration step or a feature step. Never, in one step, try to combine rebuilding a thing in the new way and also adding new features. If you can separate those two ideas, you can take small steps.
[00:26:08.100] The last idea, which maybe should be obvious, but if you've never lived in something like this maybe it isn't, is that there's going to be this residual thing that's left. If you've done it in this ROI-driven importance order, that residual thing that's left in your monolith or old-style system is almost by definition the lowest business value. It probably doesn't change very often, otherwise you would have migrated it, and you can decide to migrate or not as you decide.
[00:26:37.300] That was a quick tour of some of my thoughts around simplicity and architecture. We talked about simple components. We talked about simple interactions. We talked about simple changes, and we talked about simple evolution. Thank you very much. I look forward to seeing you in the rest of the conference.