Fabulous Fortunes, Fewer Failures, and Faster Fixes from Functional Fundamentals

Las Vegas 2019

Fabulous Fortunes, Fewer Failures, and Faster Fixes from Functional Fundamentals - Scott Havens

Senior Director, Head of Supply Chain Technology · Moda Operandi

Learn how real-world enterprises that adopted functional programming principles and adapted them to their line-of-business systems achieved greater resiliency, faster time-to-delivery, and lower total cost of ownership.

Scott is the former Director of Software Engineering at Jet.com and Walmart Labs for supply chain technology. He specializes in data-intensive systems.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

All right. To motivate the next talk, I want to tell you just a little bit about something that has influenced me a lot. About three years ago, I learned a language called Clojure, and it changed my life. It was probably one of the most difficult things I've learned professionally, but it's also been one of the most rewarding. It brought the joy of programming back into my life. For the first time in my career, as I'm nearing 50 years old, I'm finally able to write programs that do what I want them to do, and I'm able to build upon them for years without them falling over like a house of cards, as has been my experience for nearly 30 years.

The famous French philosopher Claude Levi-Strauss would say of certain tools, "Is it good to think with?" And for reasons that I will try to explain in the next five minutes, I believe functional programming and things like immutability are truly better tools to think with, and it has really taught me how to prevent myself from constantly sabotaging my code, which I have been doing for decades. I'm going to make the astonishing claim that these things have eliminated 90% of the errors I used to make. So I'm going to try to motivate why. So about a year ago, I found this amazing graphic on Twitter that describes the difference between passing variables by value versus passing variables by reference. So when I was in graduate school in 1993, most mainstream languages supported only passing things by value.

Which meant that if you passed a variable to a function and you changed it within the function, you would only change your local copy. So often this means that you would have to return the new state, and if this was a structure or a large object, it means you would have to do a lot of copying and pasting. This is tedious, error-prone, and very time-consuming. I often found myself complaining about this, wishing there were a better way. And it turns out you could eliminate this by using pointers. But actually, pointers are now considered so dangerous that few languages besides C, C++, and assembly even let you use them. In 1995, I got introduced to a huge innovation in programming languages that was called passing values by reference.

This showed up in C++, Java, Modula-3, which allowed you to change the value that was passed to you as a parameter, and it would change the reference that you passed it in from the caller. And this seemed really great. I loved it because it was such a time saver, because it lets you write less code. But three years ago, I changed my mind. So Clojure is one of a category of languages called functional programming. Haskell, F#, they're all part of these, have the same sensibility. They don't let you change variables. Functions need to be pure. The functions always return the same output given the same inputs, and there are never any side effects. You're not allowed to change the world around you. Now, you're not allowed to read or write from disk. Even reading from disk is not allowed because it's not always the same.

And so this is one of the biggest aha moments of programming for me, because it taught me how terrifying passing variables by reference should be. Because when you see this, what you really should be seeing is this. It's like, why is my coffee cup changing? Who is messing with my coffee cup, and how do I make them stop? The point here is that it's very difficult to understand your code and to reason about what is happening when anyone can change your internal state. You may have heard of Heisenbugs, where even the mere act of observation changes the result, and these are the hallmarks of multithreading errors, which is considered to be one of the most difficult problems in distributed systems. I'm fixing my coffee cup, and I can't figure out how to get it to fill up again, all right? And because I need to replicate the problem.
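The coffee cup problem can be sketched in a few lines of Python. This is an illustration added to the transcript, not code from the talk; the function and variable names are hypothetical. Python passes references to objects, so a callee can change the caller's dictionary in place, while a pure function returns a new value and a read-only view rejects the write outright.

```python
from types import MappingProxyType


def top_up(cup):
    """Impure: mutates the caller's object in place (a hidden side effect)."""
    cup["contents"] = "decaf"


def topped_up(cup):
    """Pure alternative: leaves the input alone and returns a new dict."""
    return dict(cup, contents="decaf")


# Shared mutable state: the caller's cup changes behind its back.
my_cup = {"owner": "Gene", "contents": "espresso"}
top_up(my_cup)
mutated_contents = my_cup["contents"]

# A read-only view of the same kind of data cannot be changed in place.
safe_cup = MappingProxyType({"owner": "Gene", "contents": "espresso"})
new_cup = topped_up(safe_cup)  # the original is untouched
try:
    top_up(safe_cup)
    rejected = False
except TypeError:
    rejected = True
```

With the read-only view, the only way to get a different cup is to build a new one, which is the discipline functional languages enforce by default.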

So in the real world, uncontrolled mutation makes things extraordinarily difficult to reason about because other people can put anything they want in your coffee cup. John Carmack, he wrote Wolfenstein 3D, Doom, Quake. He gave this amazing keynote at the QuakeCon conference in 2013 saying, "A large fraction of the flaws in software development are due to programmers not fully understanding all the possible states their code may execute in. In a multithreaded environment, the lack of understanding and the resulting problems are greatly amplified to the point of panic if you're actually paying attention." And so the point here is that in the real world, it's not just your coffee cup. You are operating in a universe of coffee cups, and if you zoom out, there are many, many more coffee cups around that. And if anyone can change your state because they have a reference to it, it becomes almost impossible to reason about.

Under these conditions, it's almost impossible to understand what is actually happening and how to make things truly deterministic. And this is one of the beliefs that functional programming truly taught me: that uncontrolled state mutation is at the very limits of what humans can reasonably understand, test, and run in production. And so programming languages that pioneered functional programming techniques like this, such as Haskell, OCaml, Clojure, Scala, Erlang, Elm, Agda, Coq, and ReasonML, are becoming increasingly popular. And what I find so exciting is that these concepts are now showing up in infrastructure as well. Docker is immutable, right? You can't change containers. If you really want to make a change that persists, you have to make a new container. Kubernetes uses this concept, not in the small, but in the large for systems of systems.

If you see Apache Kafka, chances are they're using it for an immutable data model that says you're not allowed to rewrite the past. It turns out version control is immutable, right? You get yelled at if you actually rewrite history. So I'm going to introduce the next speaker, who is Scott Havens. And as we were talking about this slide, he said everyone knows now, as Dr. Dijkstra said, that goto statements are considered harmful to program flow. And he said that, without a doubt, uncontrolled state mutation will surely, within our generation, be considered the next goto. So one is for code, one is for data. So the next speaker is Scott Havens. Until very recently, he was Director of Software Engineering at Jet.com and Walmart Labs. His remit was to rebuild the entire inventory management systems at Walmart, the world's largest company.

He earned this right by the amazing work he did building the incredible systems that powered Jet.com, a company that Walmart then acquired. It powered the inventory management systems, order management, transportation, available to promise, available to ship, and tons of other critical processes that must all operate correctly to compete effectively as an online retailer. He is now senior director, head of supply chain technology at Moda Operandi, an upscale fashion retailer. And I hope what he presents will blow your mind as it blew my mind, showing that functional programming principles apply not just in the small, within a program, but can be applied at the most vast scales, such as a Walmart enterprise. With that, Scott Havens.

Scott Havens

Supersonic. Magnifique. Très, très cool and très, très chic. Supersonic. Good morning, DevOps Enterprise Summit. I'm really excited to be here today and talk about something that's really near and dear to my heart. My name is Scott Havens. I'm a senior director and head of supply chain tech at Moda Operandi. It's a fashion e-commerce company that was founded in 2010. Our mission is to make it easy for fashion designers to grow their business and for consumers to recognize their personal style. I joined Moda because fashion supply chains are notoriously challenging. And I'm excited about how we can use technology to improve time to market, lower costs, and even help designers predict next season's fashion trends before the season starts. However, I just joined two weeks ago, so I'm going to supplement a lot of my discussion today with my experiences prior to Moda.

Before Moda Operandi, I was an architect at Walmart, the largest company in the world by revenue, over half a trillion dollars a year, and by number of employees, 2.3 million, to be precise. I was responsible for designing and building supply chain systems, like inventory management for Walmart, including the 4,500 stores in the US, e-commerce, like walmart.com and owned brands like jet.com, and international markets. I joined Walmart via the acquisition of jet.com three years ago for $3.3 billion. At the time, the largest e-commerce acquisition to date. One of the reasons that Walmart had bought Jet is because the Jet tech stack looked transformative. It was cloud-native, microservice-based, event sourced, and fundamentally, it was based on functional programming principles. It looked cool. But not everyone is convinced by just cool.

They didn't know if Jet's techniques were just the latest buzzwords or if they provided real-world benefits. Well, it wasn't long before we were fortunate enough to get the chance to demonstrate these benefits. And when I say fortunate, what I really mean is that disaster struck. About three years ago, in the middle of the night, I got paged for a system alert. I woke up, hopped on the phone bridge and our PagerDuty Slack channel, and started looking into it. Almost immediately, I was joined by coworkers from several other teams. It turned out our production Kafka cluster was down. And if you're not familiar with Kafka, it's a very scalable pub/sub messaging system. We use it as the primary method of communication among all of our back-end services. Before too long, we realized that the cluster wasn't just down, but it was dead. It was an ex-Kafka cluster. Every single message in flight was gone.

Customer orders, replenishment requests, catalog changes, inventory updates, warehouse replenishment notifications, pricing updates, every single one just gone. We were going to have to rebuild the cluster from the ground up. Now, this could have been catastrophic. This could have been the end of the grand Jet experiment, enough to convince our new Walmart compatriots that Jet's technical tenets sound good on paper but don't work in a real enterprise compared to tried and true systems. So what happened? Well, first, we rebuilt the cluster. New brokers deployed in minutes via Ansible scripts. While this was happening, we coordinated with all the teams who managed the edge systems, the systems that are exposed to the outside world, like merchant API inputs and customer order inputs. These edge systems, like all the others, are event sourced.

Each of these teams reset the checkpoints in their event streams to a point in time just prior to the outage. All of the events after that point were re-emitted to all the downstream consumers. And when these checkpoints were set back, there was some overlap on messages that had already been sent and processed downstream. But the downstream systems were all designed and had been fully tested to handle duplicates and act idempotently even though they were out of order. These downstream systems were hit by a flood of messages, but we were able to just scale them out in seconds, some automatically, some manually, to handle the throughput and stay entirely within our SLAs. In the end, this potentially catastrophic event was little more than a minor annoyance. No data was lost, and not a single customer order was delayed.
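One way a downstream consumer can tolerate duplicates and out-of-order delivery is sketched below in Python. This is an illustrative assumption, not Jet's actual implementation: the `event_id` and `version` fields, and the rule that the highest version per SKU wins, are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InventoryUpdate:
    event_id: str  # unique per event (assumed)
    sku: str
    version: int   # increases with each update to a SKU (assumed)
    quantity: int


@dataclass
class InventoryView:
    quantity: dict = field(default_factory=dict)
    version: dict = field(default_factory=dict)
    seen: set = field(default_factory=set)

    def apply(self, event):
        if event.event_id in self.seen:
            return  # duplicate delivery: already handled
        self.seen.add(event.event_id)
        if event.version <= self.version.get(event.sku, -1):
            return  # arrived late: a newer update is already applied
        self.version[event.sku] = event.version
        self.quantity[event.sku] = event.quantity


first = InventoryUpdate("e1", "dress-8", 1, 5)
second = InventoryUpdate("e2", "dress-8", 2, 3)

in_order = InventoryView()
for event in (first, second):
    in_order.apply(event)

# A replay after a checkpoint reset: duplicated and out of order.
replayed = InventoryView()
for event in (second, first, second, first):
    replayed.apply(event)
```

Both views end in the same state, which is what lets the edge systems re-emit overlapping events without corrupting anything downstream.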

Walmart was happy that their $3 billion wasn't wasted on worthless tech, and it afforded us, coming from the Jet side, the opportunity to examine the rest of the Walmart technology ecosystem and see where we could provide value. Now, Jet, as a startup, had the advantage of being completely greenfield and focused in their business. Walmart, on the other hand, had built their incredibly successful and wide-ranging business over many decades, requiring a number of different stacks and technologies. What we found was an organization and an architecture of enormous complexity and cost. Now, I'm not going to attempt to capture the entire mammoth business that is Walmart or even any other e-commerce company here. Instead, I'm going to dig into just one common small piece of e-commerce website functionality. Our customer, Jane, wants to buy a cocktail dress for an upcoming party. She wants to know if it's available in her size.

It doesn't have to be a store or a warehouse nearby. It can be anywhere as long as it can be shipped to her. This item availability is served via an API. When she checks her favorite e-commerce site, it can't be down or take too long to load because competitors' websites are only a click away. So our item availability API has an SLA of, let's say, 99.98% uptime. That's just shy of two hours a year of permissible downtime at, say, 300 milliseconds latency. And what factors will go into this item availability? The first ones that may come to mind are the inventory in the warehouse and any reservations that may exist from existing orders. But there's a lot more to it than that. In addition to the warehouse inventory, you might have the store inventory on the floor or the inventory in the back room of the stores. If you are a marketplace, you might have a lot of third parties, could be thousands of different third parties' inventory.

You have to look at the item and see if you're even eligible to sell it on this site. Just because it's sitting in a warehouse doesn't mean that you're permitted to sell it. And there is warehouse eligibility. Perhaps a certain warehouse that has your item isn't permitted to ship to a certain area or is not allowed to sell on a particular website. There are sales caps where you might have limits for a particular timeframe of how many you're just permitted to sell, like maybe a cap of 1,000 during some kind of discounted special. And there are back orders from all the orders that already exist that weren't able to be filled originally, but the customer still wants them. And for every single one of these factors at a large enough organization, there are going to be legacy systems that have duplicates of all this information that you need to consider as well. So how do we add all of these things together to give Jane her answer?

A common model is via service-oriented architecture, or SOA, in which we decompose each of these factors into a service. You call each of these services on demand in real-time to get the information you need. What does that look like here? Well, now I have the pleasure of showing you one of the ugliest diagrams I have ever made. And don't worry, I don't expect you to memorize this or even be able to read it. The complexity is the point. You can still see at the top the website calling the item availability API. Each of the item availability factors that I listed is represented somewhere on here by a service which may depend on other services. To give you a sense of scope, each one of these boxes is a whole system or multiple systems, each maintained by one or more whole teams. So let's walk through what happens when Jane looks for her dress. At the top, highlighted in red, the customer-facing website calls the item availability API.

That general API calls the global item availability API, which checks its cache, doesn't find it, and falls back to other services, which call other services, and more until we can finally compute the answer for Jane. So let me save you some time on the math. To get the dress availability in under 300 milliseconds, 99.98% of the time requires 23 service calls, each of which has five nines of uptime and a 50-millisecond marginal service level objective. Without every single one of these services working correctly, it is impossible to know if an item is available. You're better off not even guessing with partial information. It's better than risking telling the customer the wrong answer. To be blunt, an outage in any one of these services takes down the entire availability API. Because each of these systems has business logic that is so tightly coupled to so many other systems, it's extremely difficult to properly test them.
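The arithmetic behind those numbers can be checked with a short Python sketch. It assumes, as a simplification not stated in the talk, that the services fail independently and that every call is required, so the availabilities multiply.

```python
HOURS_PER_YEAR = 365 * 24


def serial_availability(per_call, calls):
    """Availability of a request that needs every one of `calls` services."""
    return per_call ** calls


def downtime_hours_per_year(availability):
    """Permissible downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


sla_downtime = downtime_hours_per_year(0.9998)         # about 1.75 hours
five_nines_chain = serial_availability(0.99999, 23)    # about 0.99977
four_nines_chain = serial_availability(0.9999, 23)     # about 0.9977

print(f"SLA downtime budget: {sla_downtime:.2f} h/year")
print(f"23 calls at five nines: {five_nines_chain:.5f}")
print(f"23 calls at four nines: {four_nines_chain:.5f}")
```

Under these assumptions, 23 five-nines services land at roughly the 99.98% target, while the same chain at four nines each falls well short of it.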

Unit testing covers a tiny fraction of the space of potential errors, and relying on integration tests to fully vet something this complex is absurdly costly and absurdly ineffective. Further, each of these systems was fundamentally designed internally in a traditional manner. As changes happen, the current state, usually stored in a relational database, is mutated in place. And there's an expectation, correct or not, that servers are reliable and will only be shut down or restarted with permission, and we all know how well that works. So how do we go about tackling these problems? Can we take what we learned at Jet and extract lessons? Further, these problems probably aren't unique. Can we ensure that these lessons are broadly useful to anyone or any company that might suffer similar problems? Well, the jet.com way of approaching these problems was to look at them through the lens of functional programming.

So let's walk through these principles and learn what the implications are for system design. There are many principles, but I'm going to focus on just a handful today. I'm going to start with immutability, the idea that the inputs don't change. The functions that take these inputs produce outputs that are also immutable. State is not directly mutated. We embrace purity. We avoid writing functions that produce side effects. No writing to disk or network until the last possible moment, and we strictly control those side effects when we do. This makes it easier to reason about the code and test the code. The external world outside the function can't affect the results, and the function won't affect the external world. This makes the function very predictable and repeatable. Given input, the output will be the same every time. And that repeatability unlocks a principle called the duality of code and data.

It's a fancy way of saying that the code and the data are interchangeable. A function that accepts parameter A and computes output B could be replaced with a lookup table with a key of A and a value of B. Conversely, a really big lookup table that takes up gigabytes of space and maps A to B could be compressed into a function that computes B from A. And you can go back and forth between the two. Gene did a great job introducing some of these principles and showing how they work in the small when you're writing code. We took these same principles and applied them in the large, changing how we design systems and systems of systems. Let's walk through some of these results. Starting with immutability, you get message-based, log-driven communication. The first part of this, the message-based part, is pretty ubiquitous. Systems communicate with each other via messages synchronously over HTTP or asynchronously over some kind of queue.
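The duality of code and data can be shown with a small Python sketch. The function and its input range are hypothetical, chosen only to keep the table small; the point is that a pure function and a precomputed table of its results are interchangeable.

```python
def available_units(on_hand, reserved):
    """Pure function: the result depends only on the arguments."""
    return max(on_hand - reserved, 0)


# Code to data: precompute every result over a known, finite input domain.
DOMAIN = range(0, 50)
AVAILABLE_TABLE = {
    (on_hand, reserved): available_units(on_hand, reserved)
    for on_hand in DOMAIN
    for reserved in DOMAIN
}


def available_units_lookup(on_hand, reserved):
    """Same answers as available_units, served from data instead of compute."""
    return AVAILABLE_TABLE[(on_hand, reserved)]
```

Going the other way, the 2,500-entry table compresses back into the one-line function. The swap is only safe because the function is pure: an impure function could return a different answer than the table captured.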

Log-based pub/sub systems like Kafka, AWS Kinesis, and Azure Event Hubs take this a step further. Not only do the messages themselves not change, but they are ordered and retained for an extended period of time, even after you've consumed the message. The consuming services keep track of their own progress via checkpoints into the log. So what does this mean? Imagine you suffer an outage that causes you to lose the last day's worth of transactions, or even worse, you've introduced a bug in your code that corrupts data. You can deploy your fix and reset the checkpoint to the point in time before the bug was introduced. This will force your consumer to replay all of the subsequent messages, re-consuming them with corrected code and fixing your corrupt data. This approach drastically improves your mean time to recovery on an entire category of production errors.
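The replay idea can be sketched with an in-memory stand-in for a log, written in Python. This is not the Kafka or Kinesis API; the `Log` and `Consumer` classes and both handlers are hypothetical. For simplicity the consumer rewinds to offset 0 with empty state, whereas a real system would restore state as of the chosen checkpoint.

```python
class Log:
    """Append-only, ordered log; entries are retained after being read."""

    def __init__(self):
        self._entries = []

    def append(self, message):
        self._entries.append(message)

    def read_from(self, offset):
        return tuple(self._entries[offset:])


class Consumer:
    """Tracks its own progress through the log with a checkpoint."""

    def __init__(self, log, handler):
        self.log = log
        self.handler = handler
        self.checkpoint = 0
        self.totals = {}

    def poll(self):
        for message in self.log.read_from(self.checkpoint):
            self.handler(self.totals, message)
            self.checkpoint += 1  # advance only after the message is handled

    def reset(self, handler):
        """Deploy a fixed handler and rewind to the start of the log."""
        self.handler = handler
        self.checkpoint = 0
        self.totals = {}


def buggy_handler(totals, message):
    sku, quantity = message
    totals[sku] = quantity  # bug: overwrites instead of accumulating


def fixed_handler(totals, message):
    sku, quantity = message
    totals[sku] = totals.get(sku, 0) + quantity


log = Log()
for message in [("dress", 2), ("dress", 3), ("hat", 1)]:
    log.append(message)

consumer = Consumer(log, buggy_handler)
consumer.poll()
corrupted = dict(consumer.totals)

consumer.reset(fixed_handler)
consumer.poll()
repaired = dict(consumer.totals)
```

Because the log still holds every message, fixing the corrupted totals needs no data patching: the corrected code simply consumes the same messages again.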

At Walmart, we replaced HTTP calls, queues, and even enterprise service buses with Kafka. And at Moda, we're using AWS Kinesis for the same end. With immutability, you also derive event sourcing. Events are facts about something that happened in the world. Once an event occurs, it always will have occurred. It doesn't change because, by definition, it's already happened. In an event-sourced system, events are first-class citizens. The canonical data store consists of ordered streams of events. The current state is secondary, a consequence of the events. You use the stream of events to build the current state by aggregating over all of them. Bank accounts are an obvious example of this approach. Your account balance, your current state, is the result of summing over every deposit and withdrawal that has ever happened. Event sourcing, storing the events this way, is extremely powerful. It effectively gives you a time machine.
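The bank account example can be sketched in Python as a fold over immutable events. The event types and the integer timestamps are hypothetical, added only for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Deposited:
    at: int      # logical timestamp (assumed)
    amount: int


@dataclass(frozen=True)
class Withdrew:
    at: int
    amount: int


def apply(balance, event):
    """Pure transition: previous state plus one event gives the next state."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrew):
        return balance - event.amount
    raise TypeError(f"unknown event: {event!r}")


def balance_as_of(events, at):
    """The 'time machine': fold only the events up to a point in time."""
    return reduce(apply, (e for e in events if e.at <= at), 0)


EVENTS = (Deposited(1, 100), Withdrew(2, 30), Deposited(3, 50))

current_balance = reduce(apply, EVENTS, 0)
balance_at_2 = balance_as_of(EVENTS, 2)
```

The balance is never stored as the source of truth here. It is derived, so the state at any earlier moment is just a shorter fold over the same stream.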

You can see the state for any point in time, and you can walk step by step through everything that's ever happened. This is fantastic for troubleshooting. You can validate behaviors that people are observing. People may report that they saw a problem at a specific time. It could be days, weeks, even months after the fact, and we can go back in time, re-observe it, and perform a root cause analysis. Further, event sourcing unlocks entirely new areas of analytics. We've found that our marketing teams love having this kind of data of everything that's happened over time, and our operations and audit people love knowing exactly everything that's ever happened. Our goal of purity means that we isolate computations from the real world. We write all business logic as stateless functions with zero external dependencies. That means zero I/O.

Instead, collect all state you need upfront and pass it into your business logic as parameters. That statelessness, that isolation of the computation, gives it predictability, and it gives it atomicity. There are no random outcomes, so-called Heisenbugs, and there are no partial results. Real-world failures, and in the cloud, you are constantly dealing with real-world failures, may keep your code from running, but they will never affect correctness or consistency. Now, because the business logic is pure and doesn't have side effects, it means that 100% of the domain logic is unit testable. You can provably identify every single path through the business code and write unit tests for it. Not just write unit tests, but create executable specifications. You can define invariants from your specification explicitly as properties. For example, we say that inventory counts should never be negative. That is an invariant.
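A minimal Python sketch of checking that invariant against a pure function with randomized inputs follows. The `reserve` function is hypothetical, and the hand-rolled loop using the standard `random` module is a stand-in for a real property-based testing framework (Hypothesis is one such library for Python), which would also shrink failing cases.

```python
import random


def reserve(on_hand, requested):
    """Pure business logic: all state comes in as parameters, no I/O.

    Returns (new_on_hand, accepted).
    """
    if requested <= 0 or requested > on_hand:
        return on_hand, False
    return on_hand - requested, True


def check_inventory_never_negative(trials=10_000, seed=0):
    """Check the invariant against many randomized inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        on_hand = rng.randint(0, 1_000)
        requested = rng.randint(-10, 2_000)
        remaining, accepted = reserve(on_hand, requested)
        # Property 1: inventory counts are never negative.
        assert remaining >= 0, (on_hand, requested)
        # Property 2: a rejected reservation leaves inventory unchanged.
        expected = on_hand - requested if accepted else on_hand
        assert remaining == expected, (on_hand, requested)
    return trials


checked = check_inventory_never_negative()
```

Ten thousand cases run in well under a second because nothing touches a database or the network, which is the practical payoff of keeping the logic stateless.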

These properties can be checked automatically with large numbers of randomized inputs extremely quickly. Spec-based and property-based testing frameworks that do this are available for most languages, but to work well, they depend on your code being stateless. And if you do this well, integration tests are only needed for establishing basic connectivity between services. You can test much more thoroughly in less time and for less cost. Now, you can't remain pure forever. Once your business logic is complete and you have a result, you have to do something with it. But don't do any more changes than you absolutely have to in this process. Write it to one and only one place. You may be tempted, and this happens all the time, in the same process to write to a database and then notify a downstream consumer about that change, maybe via a queue. Don't. This is called a dual write.

In a distributed environment like the cloud, failures can and will happen at any point. As soon as one of those writes succeeds and the other fails, your system is now in an inconsistent state. Dual writes take all the hard work you did to get guaranteed outcomes and toss it out the window. Instead, the safe way to accomplish this is via change data capture. The result is that an event is published downstream if and only if it's been committed to the database. This ensures eventual consistency. In failure scenarios, you may fall behind publishing, but you'll never lose events. You'll never lose events in your data store, and you'll never fail to tell your downstream consumer. Different databases support this in different ways. Walmart now uses the Azure Cosmos DB change feed for this, and at Moda, we use Kinesis Streams. So by applying these principles, we've established a pattern for designing systems that looks like this.
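The single-write idea can be sketched in Python with an in-memory stand-in for a database that exposes a change feed. This is not the Cosmos DB or Kinesis API; `Store`, `FeedPublisher`, and the flaky broker are hypothetical. The business process writes to one place only, and a separate publisher tails the committed changes.

```python
class Store:
    """The one and only write target; committed events are never altered."""

    def __init__(self):
        self._committed = []

    def commit(self, event):
        self._committed.append(event)

    def changes_since(self, position):
        return tuple(self._committed[position:])


class FeedPublisher:
    """Publishes committed changes downstream, tracking its own position."""

    def __init__(self, store, send):
        self.store = store
        self.send = send
        self.position = 0

    def run_once(self):
        for event in self.store.changes_since(self.position):
            self.send(event)    # may raise if the broker is unavailable
            self.position += 1  # advance only after a successful send


delivered = []
failures = {"remaining": 1}


def flaky_send(event):
    """Simulated broker that fails once while publishing 'order-2'."""
    if event == "order-2" and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise ConnectionError("broker unavailable")
    delivered.append(event)


store = Store()
for event in ("order-1", "order-2", "order-3"):
    store.commit(event)

publisher = FeedPublisher(store, flaky_send)
try:
    publisher.run_once()  # falls behind when the broker fails
except ConnectionError:
    pass
position_after_failure = publisher.position
publisher.run_once()      # catches up; nothing is lost
```

If the publisher crashed after a send but before advancing its position, the event would be sent twice, so this approach gives at-least-once delivery and relies on consumers handling duplicates idempotently.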

We receive immutable messages over Kafka that are consumed by a microservice running stateless domain logic that emits these immutable events into data streams. The events are then published downstream to any consumers over Kafka again. But we're not done yet. When we're employing immutability and purity, we can take advantage of the third principle and replace real-time compute with data lookup wherever feasible. When you know the set of possible inputs in advance, or you've seen specific inputs before, you can replace the often expensive runtime computation with a precomputed cache of the result. For instance, in event sourcing, if you try summing over the first 1,000 events in a stream more than once, you'll get the same result every time.

Particularly for long-running streams that are millions of events long, it makes sense to save a snapshot and use that as your starting point next time, instead of retrieving and summing the entire stream. This costs you a very small amount of storage for the snapshot. And congratulations, you've just exchanged a computation for data. That gives us a final pattern for system design that looks like this. What has changed from the previous diagram is that we've added a service that consumes the events from the Kafka feed, builds updated stream snapshots, and then updates the cache. Further, we're publishing all of those snapshots via change feed to Kafka as well. Downstream consumers will have a choice. They can consume all of the events as they happen, or if they only care about the latest state, they can consume that feed instead. One of the first teams to use this pattern at Walmart was called Panther.
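The snapshot trade described above can be sketched in Python. The `Snapshot` shape is a hypothetical illustration: it records how many events have been folded in, so the next computation only has to process the tail of the stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int  # number of events already folded into `total`
    total: int


EMPTY = Snapshot(version=0, total=0)


def fold(snapshot, events):
    """Fold only the events after snapshot.version into a new snapshot."""
    tail = events[snapshot.version:]
    return Snapshot(snapshot.version + len(tail), snapshot.total + sum(tail))


events = list(range(1, 1001))    # the first 1,000 events in a stream
saved = fold(EMPTY, events)      # computed once, then stored

events += [10, -4]               # two new events arrive
latest = fold(saved, events)     # sums 2 events instead of 1,002
from_scratch = fold(EMPTY, events)
```

Starting from the snapshot and starting from the beginning give identical results, because summing the same immutable prefix always produces the same value.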

Panther is an inventory tracking and reservation management system. On the supply side, it aggregates and tracks all sources of inventory. It includes the Walmart and Jet-owned warehouses, and all partner merchants and their warehouses. And on the demand side, it acts as the source of truth for reservations against the available inventory at those warehouses. When a customer is checking out, the contents of their cart are reserved to make sure that no one else will order them. If there's only one left, that's pretty important. If the inventory is not available at that point, the reservation fails, and the items must either be resourced from a different location, or different items must be selected. The primary goals of Panther were to maximize on-site availability while minimizing order reject rates due to lack of inventory.

There were a lot of secondary goals as well to improve the customer experience by reserving inventory early in the order pipeline. We wanted to enhance insights for the marketing and operations teams by providing more historical data and better analytics. And we wanted to unify inventory management responsibilities typically spread across multiple systems. Of course, along with these business goals, our solution had a lot of non-functional goals, like high availability, geo-redundancy, and fast performance backed up by SLAs. We found a lot of success with this architecture. The entire team, started by a single engineer in July 2016, had only three team members when Panther went into production by Black Friday that same year. That's only five months later. After one year, the team still only needed five engineers. Once in production, we found it very easy to add features. With inventory tracking, staleness of data is an issue.

Simply put, if a merchant last told us their inventory months ago, there's no way we would trust that. So we wanted to implement a feature that expires the merchant updates after a certain amount of time. Just zero it out. The results were immediate. We cut our third-party reject rate in half from 0.8% to 0.4%. And what's great about this is that this was done by a single engineer who was new to the company with light F# training and no cloud microservice background. It went from design to production in three weeks. So with the success of Panther, we started rebuilding a number of our supply chain systems following the same principles and patterns. But we didn't stop there. I want to revisit my ugly diagram from earlier. This is the one with all the nested synchronous API calls. We looked at this mess of dependent services. There may be dozens of teams and a lot of deployments on heterogeneous stacks on thousands of servers.

But as far as the front-end shopping site is concerned, looking up the availability of Jane's dress may as well be a function. It calls other functions that call other functions and eventually returns a single end result. This is a call graph. It's code. It's distributed, unreliable, stupidly expensive code, but it's code. And if we remember the duality of code and data, there's a way to exchange that code for data. Maybe that data will be more reliable and less expensive than this monstrosity. It turns out it is. The systems modeled after the Panther architecture, all stream events and state changes as messages over Kafka. We can use these message streams to invert the dependencies. Instead of the dependent service pulling its needed inputs in real time, the source system can push the data changes. The dependent service consumes these changes as it happens, updates its own state accordingly, and pushes its own changes downstream.
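The inverted dependency can be sketched in Python as a service that keeps its own precomputed view up to date from incoming change events. The class, its two input streams, and the simple availability formula are hypothetical simplifications of the many factors listed earlier.

```python
class AvailabilityView:
    """Consumes upstream changes and keeps a precomputed answer per SKU."""

    def __init__(self):
        self._on_hand = {}
        self._reserved = {}
        self._available = {}
        self.outbox = []  # changes this service would push downstream

    def on_inventory_changed(self, sku, on_hand):
        self._on_hand[sku] = on_hand
        self._recompute(sku)

    def on_reservation_changed(self, sku, reserved):
        self._reserved[sku] = reserved
        self._recompute(sku)

    def _recompute(self, sku):
        available = max(
            self._on_hand.get(sku, 0) - self._reserved.get(sku, 0), 0
        )
        if self._available.get(sku) != available:
            self._available[sku] = available
            self.outbox.append((sku, available))

    def is_available(self, sku):
        """Hot path: a local lookup, with no calls to other services."""
        return self._available.get(sku, 0) > 0


view = AvailabilityView()
view.on_inventory_changed("dress-8", 2)
view.on_reservation_changed("dress-8", 1)
in_stock_before = view.is_available("dress-8")
view.on_reservation_changed("dress-8", 2)
in_stock_after = view.is_available("dress-8")
```

All the work happens when upstream data changes. When the customer asks, the answer is already sitting in the view.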

We can convert from a primarily synchronous service-oriented architecture to a primarily event-driven architecture. All of the same item availability factors are represented in this diagram, but now almost all of them are hooked up asynchronously. Messages are flowing in this diagram from left to the right. We're trading the real-time computations, the real-time calls, for precomputed data throughout the supply chain systems. How does that affect the hot path, the moment that Jane looks for her dress? Well, a moment ago, like I showed in the SOA model, I highlighted that hot path in red. So let's look carefully to see what it looks like here. That's it. To achieve the same SLA, we need only two service calls, not 23, both of which have only four nines of uptime, not five, and only 150 millisecond SLO, not 50.
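The comparison can be worked backwards in Python: given the 99.98% target, how available must each call in the hot path be? As a simplifying assumption not stated in the talk, calls are treated as independent and all required.

```python
import math


def required_per_call(target, calls):
    """Per-call availability needed so `calls` serial calls meet `target`."""
    return target ** (1 / calls)


def nines(availability):
    """Express availability as a number of nines, e.g. 0.9999 -> 4.0."""
    return -math.log10(1 - availability)


soa_requirement = required_per_call(0.9998, 23)   # roughly five nines
event_requirement = required_per_call(0.9998, 2)  # roughly four nines

print(f"23 calls: {soa_requirement:.7f} ({nines(soa_requirement):.2f} nines)")
print(f" 2 calls: {event_requirement:.7f} ({nines(event_requirement):.2f} nines)")
```

Cutting the hot path from 23 calls to 2 relaxes the per-service requirement by about a full nine, which is where most of the cost difference comes from.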

All of the event-driven systems still need uptime and processing time SLOs, but they're no longer in any customer hot paths. They are completely asynchronous. Three nines uptime and end-to-end processing in seconds or even minutes is sufficient. So how does all of this affect cost? This is going to vary among organizations, but we can ballpark it. First, an event-driven system with three nines of uptime and minute-level latency is about as cheap to operate as any system we're likely to see. If we increase our uptime by 10X to four nines and drop our latency by 400X to 150 milliseconds, for a lot of orgs, you're looking at an order of magnitude higher cost. To push your uptime to five nines while tightening the latency even more, for most organizations, that is an obscene amount of money. How do the total operational costs compare once you've replaced all of these things? Well, with the functional event-driven approach versus... Yeah.

Now, I'm not allowed to give you precise numbers, but I can tell you that for walmart.com, this difference is millions of dollars per year. And notes just went off the screen. Can we get those back on, please? You may have a lot of objections to this. You may be thinking, "Wow, this sounds really great, but there's no way we can do that." Well, let's talk through some of the more common reasons. My dev team isn't skilled enough. Well, I've trained up not just senior devs, but junior, mid, and senior-level engineers from all kinds of backgrounds, Java, C#, JavaScript, Ruby, Python, and they've all succeeded at this. We don't have the technology to do this. Well, you can follow these principles in any language. And if you're talking about the infrastructure that you need, every cloud provider has some kind of results available, or some kind of infrastructure available for messaging. It'll make your app too complex.

Well, that might be true for some systems, only if you're talking about the most basic ones. Gene talked about a system that he built that was really just a toy system, but wanted to see what he could do to simplify it. We're running a little low on time, so I'm going to just walk through this really quick. He found that it turned out to be a lot more practical than he expected. The last ones are it'll cost too much, it'll take too long, it's too dangerous, and it's just too much. I have this enormous, creaky, baling-wire-and-duct-tape spaghetti code monstrosity. It grew uncontrolled over the years, if not decades. It has dozens, hundreds, or thousands of people trying to keep it working. Well, I recognize this is a pretty big shift in mindset, but there's an old joke about this. How do you eat an elephant? The answer is one bite at a time.

There are small steps you can do right now to take these principles and apply them regardless of what your systems look like now. You can identify just one dual write somewhere in all of your systems and figure out a way to eliminate it. Consider using change data capture to do so. You can encourage property-based testing in just one system. Most of your devs won't find it that different from regular unit testing. And you can switch one web service to also publish events. You don't have to fully commit to event sourcing, just publish your changes as they happen. Then switch one consumer to read the events rather than make HTTP calls at runtime. This is a very easy way to bite off a small piece and ensure the safety of the system while you do it.

So if you have an architecture that looks like this and you don't have someone, an architect, who is talking about how to move to something like this, you're doing your organization a grave injustice. My mission in life is to reduce the amount of entropy in the universe, or at least our little corner of it. So if you want to help me in this journey, if you want to replicate what we've done or you have new ideas, here's how to reach me. I'm scott.havens@operandi.com or scotthavens on Twitter. And I'd be remiss if I didn't say that we are hiring. So thank you very much. Have a great day.