Forging a Functional Enterprise: How thinking Functionally Transforms Line-of-Business Applications

Log in to watch

London 2019

Forging a Functional Enterprise: How thinking Functionally Transforms Line-of-Business Applications

Director of Software Engineering · Jet.com and Walmart Labs

Scott is Director of Software Engineering at Jet.com and Walmart Labs for supply chain technology. He specializes in data-intensive systems.

Chapters

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

The next speaker up is Scott Havens.

Claude Levi-Strauss has spoken extensively about functional programming, how it is a better tool to think with. And I think one of the best examples of this, not just as an academic exercise, but in an industrial setting, is Scott Havens.

He's a director of software engineering at Jet.com and at Walmart Labs. His current remit is to rebuild the entire inventory management system for Walmart, the world's largest company. He earned this right by the amazing work he did building the incredible systems that powered Jet.com, a company that Walmart had acquired. It powered the inventory management systems, the order management systems, transportation, available-to-promise, available-to-ship, and tons of other critical processes that must go right for any online retailer.

And he built this using F#, a functional programming language, and utilizing all of these functional programming techniques. And what he will show you will blow your mind. Scott Havens.

Scott Havens

Go back one.

Good morning. Scott Havens. I'm a director of engineering at Jet.com and Walmart Labs. Today, I want to talk with you about functional programming principles and the value you get from applying these to whole systems and even whole organizations.

First, some quick background. Jet.com is an e-commerce retail company based out of Hoboken, New Jersey, launched publicly about four years ago with the plan to take on Amazon.com head-on. Roughly a year later, almost three years ago, Walmart acquired Jet.com. Walmart is the largest company in the world, both by revenue, at over half a trillion dollars a year, and by employees, with 2.3 million.

Walmart.com now focuses on the mass market, while Jet.com has been refocused for the urban millennial customer, where Walmart has made fewer inroads. I'm an architect at Walmart Labs, the tech org for Walmart. I'm responsible for designing and building supply chain systems like inventory management for Walmart, including Walmart.com, Jet.com, and all of the Walmart stores.

One of the reasons that Walmart had bought Jet is because the Jet tech stack looked transformative. It looked cool. But not everyone is convinced by just cool. They didn't know if our techniques provided any real-world benefits.

Well, it wasn't long before we were fortunate enough to get the chance to demonstrate these benefits. And when I say fortunate, what I really mean is that disaster struck.

The middle of March, two years ago, in the middle of the night, I got paged for some system alert. I woke up, hopped on the phone bridge and our Slack PagerDuty channel, and I started looking into it. Almost immediately, I was joined by coworkers from several other teams. It turned out that our production Kafka cluster was down.

If you're not familiar with Kafka, it's basically a very scalable pub/sub messaging system. We use it as the primary method of communication among all of our back-end services. Before too long, we realized that the cluster wasn't just down, but it was dead. Like the proverbial parrot, it was an ex-Kafka cluster.

Every single message in flight was gone. Customer orders, replenishment requests, catalog changes, inventory updates, pricing updates. Every single one was gone. We were going to have to rebuild the cluster from the ground up.

Now, this could have been catastrophic. This could have been the end of the grand Jet experiment, enough to convince our new Walmart compatriots that Jet's technical tenets sound good on paper but don't work in a real enterprise compared to tried-and-true systems.

So what happened? To answer that, I want to go back and walk you through those Jet.com software tenets. Not just what Jet built, but why. Much of it is founded in principles of functional programming. Lessons learned from programming functionally can be used to design systems, and lessons learned from that can be scaled up for the architecture across your entire company.

Then we'll get to the most important part, the real-world benefits to building and operating these systems. How will it make your DevOps teams faster, better, and cheaper?

So what are the principles of functional programming? There are many, but I want to focus on just a handful today. First, immutability. Inputs don't change. Functions that take these inputs produce outputs that are also immutable. State is not directly mutated.

Second, actions over objects. Think verbs before you think about nouns. And finally, embrace purity. We avoid writing functions that produce any side effects. No writing to disk or network until the last possible moment, and we strictly control those side effects when we do.

This makes it easier to reason about the code and test the code. The external world outside the function can't affect the results, and the results can't affect the outside world. This makes it very easy to think about and reason about your code.

So what does this mean for designing systems? First, message-based communication. Luckily, nowadays, that's pretty ubiquitous. Systems communicate with each other via messages synchronously over HTTP or asynchronously over some kind of queue.

Not nearly as ubiquitous is event sourcing. Events are facts about something that has happened in the world. Once an event occurs, it always will have occurred. They don't change because, by definition, they have already happened.

In an event-sourced system, events are first-class citizens. The canonical data store consists of an ordered stream of events. The current state is secondary, a consequence of the events. You can use the stream of events to build the current state by aggregating over all of the events. Bank accounts are an obvious example of this approach. Your account balance, your current state, is the result of summing over every deposit and every withdrawal event.

Next, we have verbs before we think about nouns. We don't build entity services. There isn't a monolithic customer account service like you might expect if you're coming from an object-oriented mindset. Instead, with a functional design, you start with the thing that you're trying to do, write a function to accomplish that, and host that function in a microservice named for it.

Then we isolate computations from the real world. We write all business logic as stateless functions with zero external dependencies. That means zero I/O. Instead, you collect all the state that you need up front and you pass it into your business logic as parameters.

That statelessness, that isolation of the computation, gives it predictability and it gives it atomicity. There are no random outcomes, so-called Heisenbugs, and there are no partial results. Real-world failures may keep your code from running, but they will never affect the correctness or the consistency.

The benefit of that is that if you know all of the set of possible inputs in advance, or you've seen specific inputs before, you can often replace the entire, probably expensive, real-time computation with a precomputed cache of the result. For instance, in event sourcing, if you repeatedly try summing over the first 1,000 events in the stream, you'll get the same result every time.

Particularly for long-running streams full of millions of events, it makes sense to save that snapshot and use that for your starting point next time. The cost of that is only a small amount of storage. So congratulations, you've just exchanged a computation for data.

Finally, no dual writes. You might be tempted in this same process when you write to the database to then notify downstream consumers that you've made that change. Don't do it. There are too many problems in distributed systems, especially in the cloud, where if you get a failure in one of the writes and a success on the other, you're now in an inconsistent state. There are much safer ways to do this, and I'll get to those techniques in a second.

So let's dig into a system in the real world that was designed using some of these principles. At Jet, most of our systems are named after superheroes. So let me introduce Panther. Now, I couldn't get the rights to this from Disney or from Marvel, so Jason Cox, if you're out there, maybe we can work something out for the next time.

Panther's responsibility is for inventory tracking and reservation management. Our primary goals with Panther are to maximize on-site availability while minimizing reject rates. Of course, we also have goals of improving customer experience through making the orders earlier in the process, enhancing insights through advanced analytics, and unifying inventory management responsibilities that traditionally have been spread over multiple systems.

Of course, along with these business goals, our solution had a lot of non-functional goals like high availability, geo-redundancy, and fast performance backed by SLAs.

So I want to look at the core flow of a system. This is the basic building block of systems at Jet. I'm going to walk through each of these pieces in turn.

The first part is executing commands to produce events. This has five steps. First, you ingest the message. This could be asynchronously over a queue, or it could be synchronously via HTTP. Then you deserialize the message to a strongly typed command. Now, if your language doesn't support strong typing, like JavaScript or Python, well, I'm sorry. No, seriously, just parse the message, make sure that it has everything that you'd expect from a command before you move on.

A command is a message that expresses intent to change state. It's named with an imperative verb and usually a direct object. For example, update inventory or reserve inventory or expire reservation. It has a straightforward flat structure. The name of the command, in this case, reserve inventory, expresses our intent. It then has an ID or a set of IDs that help you identify what you're working on. In this case, the item ID and the warehouse ID. Then it has extra information for the command. In this case, we need the quantity and the order on which we're working. And then finally, it includes some metadata, such as who is trying to perform the action or when it happened.

Now that we have our command, we retrieve the current state from the database. Using the identifier or identifiers from the command, we need to retrieve the aggregate from the data store. Now, we can do this by retrieving the entire event stream and summing over it, or we can retrieve that aggregate from the cache that we mentioned earlier.

Once we have that state and we have the command that expresses our intent, we need to execute the command against the current state. At the execution of the command, usually we'll do some kind of validation. In our case, reserve inventory will need to be validated to decide, hey, do we have enough quantity in stock to actually do the reservation?

The execution of the command will produce some kind of event. An event indicates that something has happened. It usually flips the syntax, where it switches from reserve inventory to inventory reserved. Validation failures are events too, with the format of command name failed. So in this case, if we couldn't reserve inventory, we would emit an event, reserve inventory failed.

Now, with the output event in hand, we need to commit the event to the event stream. Remember, because our business logic is pure, everything up to this point has been done entirely in memory. Nothing has been saved. The event is only provisional until it's committed.

So we need to attempt to append the new provisional event to the stream. If the write succeeds, then the event has officially happened. There are a number of reasons, though, that the write could fail. For example, there might be a conflicting write. We probably have more than one command executor running in parallel, so we'll usually have optimistic concurrency control on our event stream writes. The compute nodes could crash. The data store could have an outage. These are all fine. Until the event is committed, we have zero side effects. We're safe at any point to completely restart processing from the beginning and try executing that command again.

So once we have an event and we've committed it to the event store, how do we fill that cache that we mentioned earlier? How do we build that state? Well, the first step is to publish the events that you just wrote to the event store. I noted a few minutes ago how dual writes are one way to try this, but they are just terrible for consistency. Instead, the safe way to accomplish this is via change data capture.

Different databases support this in different ways, but the idea is that when the event is written, if and only if it's been written, does it get published downstream. Probably to something like Kafka or some other queue. Once it's been published, you can consume it with a snapshot event stream service. This has four steps. First, you receive the latest batch of events for the stream. Second, we need to get the latest snapshot from the stream, if such a snapshot exists from the data store. You apply the sequence of events to that snapshot. Now, if you're a practitioner of functional programming, you'll recognize this as the perfect opportunity for a left fold. And then finally, you just save the snapshot back to the data store.

And now that we understand the fundamental command execution and event sourcing pattern, let's zoom out to the whole Panther system. What we see is the core flow in the middle and some pieces at the beginning and the end that are piping information in and taking information out. At the beginning, we have our inputs from other systems that are being statelessly mapped into domain commands. Whereas on the output side, we have our events, or we have our snapshots, that are similarly being mapped and filtered for our downstream consumers to consume.

This kind of architecture has a number of benefits, and I'm going to try to walk through a few of them.

First, we have resiliency. Now, what does resiliency mean to me? Well, it's a few different things. One, you want to make sure that the system will, even in the case of partial outage, continue to serve customers. And two, if the outage is due to some kind of problem, that you're able to recover either trivially or it's even self-correcting.

So what kinds of problems do we expect to see with a system like Panther? Well, one, we have to deal with a lot of out-of-order messages. We might see from a lot of different sources that our messages are coming in and out of order. There are a lot of different sources, and we might even see that a warehouse has chosen to cancel before we've even seen that the order has been received.

We have several downstream consumers of inventory data, like pricing and search engines. They might lose our data because of their own bugs or because their data store blows up. Event-driven designs make it easy to replay. Even if those messages are no longer in Kafka, no big deal. Panther can republish from our own event store any of the changes that have happened even since the beginning of time.

Then third, we might have to deal with random restarts. In the cloud, this happens all the time naturally, but you might even have one more level where you have Chaos Monkey in there just restarting your systems randomly. Again, all of our services are stateless microservices. So they can be restarted at any point, and they'll just pick up from where they left off with no problem.

We find that it's very easy to add features in this system. We had as a goal to minimize our reject rates, and we found that a lot of merchants were sending in inventory updates and then leaving them for months. We cannot trust data that's come in months ago. So we just want to zero out that inventory. We cannot rely on it. So we built a feature to do so.

This had an immediate effect on our results. We went from a 0.8% reject rate down to a 0.4% reject rate. The best part was this was written by a single engineer completely new to F#, completely new to cloud-based microservices in under three weeks from design to production. In fact, the entire Panther system started in July 2016 by one engineer, went to production by Black Friday of that just a few months later. After the first year, it still only had five engineers.

With these kinds of systems, you get a time machine. With your event streams, you can go back to any point in time with replay and find exactly what happened. This is great for any kind of troubleshooting or auditing capabilities you need. If someone has reported an event that happened days, weeks, even months ago, you can go back to that exact point in time, re-observe the problem, and perform an RCA. Your customers will love you for this.

Functional systems are really, really good at scaling. Scaling in practice happens along two dimensions, compute and storage. For compute, our microservices are deployed and scheduled in Nomad and Kubernetes clusters, all independently deployable, all independently scalable. For the storage side, event sourcing is very compatible with cloud-based NoSQL data stores. Our own implementation is built on Azure Cosmos DB. We can scale the throughput within seconds to 10 or 20 times normal, and when we're done, scale it back down.

We've had to use this in practice. Early on, a third-party merchant made a configuration change and started flooding us with updates. Overnight, our entire system went to three times our normal throughput due to this one vendor. As soon as we detected this, we were able to scale up and deal with the problem for the next day or two until they fixed their problem. We found this to be so useful that we have since introduced automatic scaling for both our compute and our storage.

Testing is super important with functional systems. 100% of our domain logic is unit testable. The function that executes commands and produces events has no external dependencies and has no side effects. Similarly, the function that applies events to state and produces a new state has no external dependencies and has no side effects. These are pure functions.

We can provably identify every single path through the business code and unit test it. In fact, not just write unit tests, but create executable specifications. We can define invariance from our specification explicitly as properties. For example, we may want to say that inventory counts should never be negative under any circumstances. These properties can be checked automatically every time you run through your CI/CD process with large numbers of randomized inputs that are extremely quickly processed.

Spec-based and property-based testing frameworks are available for most languages, but to work well, they depend on your code being stateless.

Another huge benefit from the functional approach, especially at the entire architectural level, is in operational costs. Let's look at an example to see why.

So we have a customer, Jane, who wants to buy a cocktail dress for an upcoming party. She needs to know if it's available in her size. It doesn't have to be in a store or warehouse anywhere close to her. It can be anywhere as long as it can be shipped to her. When she checks walmart.com for the dress, the site can't be down or take too long to load. Competitors' websites are just a click away.

So our item availability API needs to have an SLA in the ballpark of 99.98% uptime. That's just shy of two hours a year permissible downtime at 300-millisecond latency.

So what factors go into item availability? Well, I already talked about Panther, how it tracks warehouse inventory and reservations. But there's a lot more than that. We need to know what kind of inventory is in the store, on the floor, or in the back rooms. We need to know if the item is eligible to sell, if the manufacturer is allowing us to sell it, if the warehouses are open, maybe they're on vacation, or if there are any other kinds of sales caps or promotions happening. In addition to all of that, there are legacy systems that might be providing all of this data, but from an older system. We have to reconcile those.

So how do we add all these together to give Jane her answer? A common model is the service-oriented architecture, SOA, in which we decompose all of these different factors into services. You call each of these services on demand to get the information that you need.

But what does that look like here? Well, I have the pleasure now of showing you one of the ugliest diagrams I've ever made. And don't worry, I don't expect you to memorize this or even be able to read it. The complexity is the point. I will point out you can still see walmart.com at the top calling the item availability API. Each of the item availability factors I listed on the previous page is represented somewhere in this diagram by a service. Each of those services may depend on other services. And to give you a sense of scope of this entire diagram, at the very bottom row in the middle, you can see the warehouse inventory. That single box represents the entire Panther system that I described earlier.

So let's walk through what happens when Jane needs to look for her dress. At the top, the customer-facing website calls the item availability API. That general API calls the global item availability API. It checks its cache and doesn't find it, so it falls back to other services, which call other services, and more, until we can finally compute the answer for Jane.

So let me save you some time on the math. To get that dress availability in under 300 milliseconds, 99.98% of the time requires 23 service calls. Each of those needs to have five nines of uptime with a 50-millisecond marginal latency SLO.

So let's look at a different approach. Based on the success of Panther, we started rebuilding all of these systems following the same principles and patterns, allowing us to convert from a primarily service-oriented architecture to a primarily event-driven architecture. All the same availability factors are represented, but now almost all of them are hooked up asynchronously. In this diagram, messages are flowing from the left to the right. We're trading real-time computations for precomputed data throughout the entire supply chain architecture.

How does that affect the hot path the moment that Jane is looking for her dress? Well, like I showed a moment ago in the SOA model, I'll highlight the hot path of web service calls in red. Watch carefully. That's it.

To achieve the same SLA, we need only two service calls, not 23, both of which have only four nines of uptime, not five, and only 150-millisecond SLO, not 50. All the event-driven systems still need uptime and processing time SLOs, but they're no longer in any of the customer hot paths. They are completely asynchronous. Three nines of uptime and end-to-end processing in seconds or even minutes is sufficient.

So how does all of this affect cost? Well, I'm not at liberty to share our exact numbers, and it's going to vary among organizations, but we can ballpark it. First, for the event-driven system, three nines and a minute latency is about as cheap to operate as any system we're likely to see. If we increase our uptime by 10x and we drop our latency by 400x, for a lot of organizations, you're looking at an order of magnitude higher cost. To push your uptime to five nines while tightening the latency even more, for most orgs, that's an obscene amount of money.

So how do the total operational costs compare? Again, without giving any exact numbers, with the functional event-driven approach versus... Yeah. It's a lot of money. Come on. Yeah. Thank you.

So now we've covered functional principles, the designs of systems based on those principles, and the kind of benefits that get conferred on these systems. This is a good time to revisit what happened when Jet's Kafka cluster exploded.

So first, we rebuilt the cluster. New brokers deployed in minutes via Ansible scripts. While this was happening, we coordinated with all the teams who manage edge systems. These are systems that are exposed to the outside world, our merchant API inputs, customer order inputs, and the like. These edge systems, like all the others, are event sourced.

The teams reset checkpoints on their event streams into a point in time just prior to the outage. All the events after that point were re-emitted to the downstream consumers. Now, there was some overlap in the messages that had been sent earlier and what messages sent later, but all the systems had been designed and tested for idempotency. Receiving those extra messages, those duplicate messages, caused no problems.

Oh, another point, and that's very important. All these systems, there was a flood of messages that came through, but they were all designed this way, so they could all be scaled up trivially to deal with that flood of throughput. So we were able to maintain customer SLAs without any problems. What could have been a complete catastrophic disaster ended up being nothing more than a minor annoyance.

Now, I recognize that everything I described here is probably a pretty big shift in both mindset and approach for a lot of people. However, there are still small steps you can do right now to take these principles and apply them regardless of what your systems look like now.

First, you can identify just one dual write somewhere in all of your systems and figure out a plan to eliminate it. Consider using change data capture to do so.

If you have a SOA, identify one web service you could switch to event-driven. You don't have to fully commit to event sourcing. Just publish your changes as they happen. Then switch one consumer of that service to read the events rather than make web service calls at runtime.

Try property-based testing. Encourage some of your devs to use it. It'll turn out most of your devs won't find it that different from regular unit testing, but you'll get much better coverage and much better results.

And finally, after you try these, or even before you try these, if you want to talk more about this or learn more about this, please contact me. I'm not just at the conference, but I'm on Slack, and you can reach me by email at scott.havens@jet.com or via Twitter, @scotthavens. Thank you very much.

Host Outro (Gene Kim)

Thank you, Scott.

So I think it's just a phenomenal case study of how incredible these principles are. So I want to share with you what I had found on Twitter a couple of weeks ago. So this is a graphic that shows the difference between passing variables by value versus passing them by reference.

And so what I think is interesting is that when I was studying programming languages in graduate school, most mainstream languages only supported passing variables by value, which meant if you passed a variable to a function and you changed it, it only changed a copy, right? So it only filled its version of the coffee mug.

But now what it meant is that you have to return a variable, you have to do a lot of copy and pasting, and if you have large objects or structures, that could lead to a lot of typing. It's tedious, manual, and error-prone, and very time-consuming.

So back in 1995, I got introduced to passing things by reference. And it was just showing up in mainstream languages like C++ and Java. And the magic was that you pass the variable to a function, and you can change the variable, and it will also change the value that you had passed to the function. And to me, I thought this was incredible. It was such a time saver. It saved time. You didn't have to type as much code, and it seemed really great.

But three years ago, after learning about functional programming through Clojure, one of my biggest aha moments was just how utterly terrifying this is. In fact, when you see this, what you should really see is this. You should be thinking, "Who's putting stuff in my coffee mug? I didn't touch the coffee mug, and someone is filling them up, and how do I make it stop?"

So this is actually--and Scott mentioned the notion of Heisenbugs. I'm trying to stop it from happening, and I need someone to fill my coffee mug, and I don't know how to make that happen. So these are conditions where it is very difficult to reason about.

So the notion about immutability: no one should be able to touch your coffee mug. Pure functions: you never get to touch a coffee mug. What you get back is always a function of its inputs. No side effects: that means no one can touch your coffee cup. In fact, you can't even touch your own coffee cup. If you want to fill it up, you get a new coffee cup. And the notion of side effects, you push them to the edges.

And in my mind, I cannot overstate how much better of a tool this is to think with. And so these are the concepts that are at the very core of things like Docker and Kubernetes. You can't change a container once you create them. If you want to change it, you have to make a new container. Git: no one's allowed to rewrite the past. You get yelled at if you try to rewrite the commit history.

And so I think what's so remarkable about Scott Havens' story is that you can do amazing things, even at the vast complexity of the inventory management systems at Walmart, and do it with a smaller team, do them faster, have it be more resilient, and be safer. So that's why I think Scott's story is just utterly amazing.