San Francisco 2017

The Making of Amazon Prime Now

What Amazon Prime Now is, and the challenges faced building it. Presented by former Director of Engineering. Listen to this story about DevOps at Amazon.

Full transcript

The complete talk, organized by section.

Tisson Mathew

This is Tisson Mathew, CTO of Alignment Healthcare. We are one of the fastest-growing healthcare companies in the United States. We focus on healthcare for seniors. I know healthcare is trending towards almost 20% of our GDP, so it's a big problem there.

Prior to Alignment, I was at Amazon as the director of engineering for Prime Now. What I want to talk about today is what Prime Now is and the challenges we faced building Prime Now.

When you think of Amazon, DevOps has been part of the culture for many years. There are thousands of DevOps teams that have built and operated services for over a decade. So it's quite a lot of history with the company around engineering culture and how to build software.

Prime Now gives you free two-hour delivery of household goods, as you can see there, and you get one-hour delivery if you pay us some money. All right? As simple as that.

The team launched the first Prime Now in L.A. For us to scale this to an entire country or even go international, we had to do a lot of work across multiple different teams.

I'll show you the Amazon virtuous cycle. It's right here. This is Jeff Bezos's napkin Amazon business concept. You can Google this and find it.

First, it starts with a set of sellers who want to sell products online. When Amazon started, it was just Amazon.com, their own retail. It started as a book company. So you have a selection of products that you can buy from. Then, if you deliver products on time, very, very important, customers will buy more products. That will create more traffic. That will drive growth.

While you're growing the company, you drive that growth and the profits that you potentially get into your infrastructure. Then you drive improved efficiency and also drive better delivery speed.

The speed of delivery matters a lot for consumer experience, or customer experience. If you think of Amazon's flywheel, this has grown tremendously over the last 15 years. As Gene was showing Amazon's trajectory and how it transformed retail, the essence is this. The second is DevOps, I can tell you that. The reason is the company went through massive decoupling in 2006 through 2008, decoupling the entire company to things that folks at Capital One were talking about.

Decoupling the company from departmental silos to product teams. There are literally thousands of these product teams in the company, and they can continuously innovate. And so Amazon doesn't need an innovation department. All right? That's very important. If you're trying to create an innovation department, this means that your company itself is very difficult to innovate on.

This flywheel is the concept that drives everything within the company. Prime Now primarily focused on one aspect of it, which is faster, more reliable delivery. Faster means really, really fast.

This is the problem we faced. The problem statement is this: order products using the Prime Now app, the website, or Alexa, and get your order in an hour. Actually, we were initially targeting 20 minutes. Twenty minutes, 40 minutes, like you order food in a hotel room. All right? It's a cloud 7-Eleven store. Think of it that way.

Prime Now focuses on higher-density areas, which means cities with a certain level of population and demand. It's inherently local, so you can order food and items from local stores as well as Amazon.com.

If you think of the problem statement itself, online ordering has three major functions. The first one is ordering. You go to the website and place an order. The second is fulfillment, where a distribution center picks and packs your stuff up and then sends it off to UPS or FedEx. The third is delivery, where they deliver that to your home.

Those three were fairly segmented and usually built to handle things in two to five days. Think about it: the entire industry was built on people placing an order that sits around in a queue, goes to a fulfillment center, takes a couple of days to get all packed up, and then gets sent off. And this delivery time needed to be consolidated.

The second is, if you think of Amazon, the first two parts, ordering and fulfillment, traditionally are done by Amazon itself. The company takes the orders, fulfills the orders, and the delivery is done by UPS or FedEx, right? Or USPS and other means.

So we had another challenge there: how do we fulfill? We cannot ask USPS to get your orders to your home within two hours. I don't think that will happen ever. Maybe one day. But that didn't exist. And the same didn't happen with UPS or FedEx either. So we really had to figure out how we can take orders and get it fulfilled and handled and get it to customers' homes within two hours.

Our challenge is this. One is, do we build everything on our own? That means take our problem statement and start building software end to end. Start ordering, fulfillment, and delivery all on our own.

The second, obviously, we have Amazon.com, quite a large company, a lot of assets there. How can we leverage existing services? Or should we do both? It's quite a big challenge for us to start with something that is existing when that is built for a different set of problems, right?

I'll show you a picture of services and DevOps in Amazon.com. It's been service-oriented for a very long time. A lot of microservices talking to a lot of services via APIs. A lot.

It's run by end-to-end intact teams. It's similar to what folks at Capital One were talking about. It's a two-pizza team, which is a concept that Jeff Bezos came up with. If a team cannot be fed by two pizzas, it's too large. Communications break down, things like that.

That is the two-pizza team. It's an end-to-end team with developers, operators, testers, all of that. You build it, you run it, you operate it. If things go wrong, your pagers are going to go off in the night and you have to go fix it. There is nobody else to help you, okay? Other than AWS now, right?

The next portion is the team owns one or more services. No service is bigger than what a two-pizza team can actually run. You don't take a service and put 500 people on it, right? That's too big.

Teams control their own destiny and whatnot. So things are awesome here. There is mostly no coordination between teams at all. Really, I don't tell you when I'm going to release software. I continuously release and deploy software, and my pipelines continuously run. As you know, Amazon takes a lot of orders. Things are happening every minute. We cannot turn off the engine and stop the money flowing to the company, so it keeps going.

Each team releases on its own schedule. There is no enterprise-wide cadence of any sort. You can have your services running at a one-week cadence or two-week or three-week in terms of sprint cycles. It'll continue to happen.

A lot of telemetry. That's something that is inherently built. We built a lot of telemetry into all services. So you could literally go to LiveWatch and actually go see what is happening with your service, what APIs it supports, and all of that. So it's awesome.

But if you look at the "mess," quote-unquote, how these are connected, and there are literally thousands of these services and service teams within the company when it is in a hypergrowth mode, right? How do you deal with a massive enterprise-level change when you have hundreds of services like this?

When you're touching two or three services and you're making some changes, it's much simpler. When you are to do 300 services, or 400, or 500, how are you going to deal with it? That's the context.

Now we get to the solution space. We had a very small team when we had to get started with these things. As a concept within Amazon, whenever you're starting a new initiative, you don't get a lot. You get a one-pizza team. That's all you get. Six to nine people, that's where you always start. We start very small.

There isn't anything like, "Okay, I'm going to deploy 1,000 people into this organization, and let's do a strategy slide deck." We don't have McKinsey coming in and giving us strategy on any of this stuff. So we got to build it all from the ground up and start very small.

Our solution to the problem that we were facing was, why can't we just do one thing? Just deliver super fast, right? We can do it ourselves. Actually, I did deliveries myself. Get in the driver's seat and take your package from the fulfillment center, drive it to people's homes. Pretty simple, right?

We built a bunch of new delivery functions, built a bunch of services that take orders and do fulfillment and delivery. It's basically our own delivery pizza team. We got some trucks and started delivering things.

But the problem is we don't have much control over the ordering or fulfillment. In the first, when we were talking about deliver really fast, we were taking small orders that were routed to us, to our services directly. We were not taking orders from the main pipeline.

This is a very small, very nimble team, but it is trying to do something very small. It is not scalable, right?

Now, how do we go about dealing with hundreds of service teams that are all operating with full autonomy, each with a backlog of things they have to do, and they're not stopping for you? And how do you convince them that your little team that was thinking of delivering in two hours is even worthwhile?

Think about it. Jeff Bezos always talks about it publicly, that people hear about Amazon stories of success, but nobody hears about things that have failed. A lot of things in Amazon have failed. Whenever you are trying a new thing in Amazon and talking to other teams, they think you have a higher chance of failure than success. So they're probably going to tell you, "No, go try something else. You come back in a year or maybe six months from now."

You're not approached as a success story, because this is very new. Nobody has tried it, really, so they're not even sure it's going to work. Think about trying to make a bunch of code changes for a random team who's trying to do something and is not even proven.

What do we do? An army of program managers? I have nothing against program managers, but if you hire them, they can go talk to people and say, "Hey, what is your schedule going to look like? Can I have a coffee with you? We can talk about this new thing." And they will say, "Well, you're a good guy. We have stuff going on," things like that.

The second is steering committees. We could do that. Create a lot of steering committees within the company and talk to more people about it and say, "Hey, is it going to work or not?"

Third is lots of emails. I shot a lot of emails to different people. Can we do this? Can we do that? Or Slack. Now we have Slack channels, right? You can create new channels and add people into it, and nothing is going to happen there again, right?

Or be good friends with product managers. If you have amazing product managers, great MBAs, they will do amazing slides and Excel spreadsheets with a forecast that billions of dollars are going to come to Amazon Prime Now, so get on board, right? This is going to revolutionize e-commerce. And five years ago, this was like, probably not.

Or you escalate. You go straight to the senior teams and go to Bezos or whatever and escalate. They're like, "This needs to happen."

So these are our options. What do we do?

Our option number one was getting our own little ordering system and getting to our fulfillment. The second we looked at is, ordering is not fast enough. Because in a traditional e-commerce sense, you have ordering that is done in one or two days, and fulfillment goes to another one or two days, and the shipment happens after. Or if you do a two-day shipping, it's again shortened down to one or two days. Still, it's not fast enough for us.

For us to get orders to ship, we need 15 minutes. Within 15 minutes, we need our orders in our fulfillment centers. If it happens in the same minute, that is even better.

Think of Amazon.com. There are literally thousands and thousands of orders going on every minute. It's a massive machine. You have to cherry-pick what are the orders that are coming into your Prime Now versus others. Also, there are millions and millions of other sellers that are selling through Amazon too. It's all coming through the same pipeline.

It's not easy for us to tell the service team who operates billions of transactions every minute to stop and make changes to the code. Right? We are cutting across the two most powerful business units within the company, ordering and fulfillment. These are massive organizations dealing with a massively scalable global organization, running pipelines continuously.

So how do we actually drive change into it?

We had a couple of interesting solutions that our senior engineers came up with. One is we make no changes to ordering, which means that we tap the pipeline at a certain flag and then route the orders to us without changing the pipeline or the team touching the pipeline itself.

The second is we integrate our fulfillment back end with our delivery function. And then we actually got a steering committee and a rock-star TPM, a technical program manager. She was amazing. She knows the company really well, knows the teams very well, has the credibility within the company to talk to these teams that this is real, and we need their help.

The steering committee's decision was to remove blockers between different teams and limit the fulfillment changes to happy path only. The happy path is where you have a delivery pipeline with no exceptions. If exceptions happen, you just put that order on the side. Happy path only means everything goes right: your credit card is actually working, your address is correct, your stuff can get shipped.

If there is an order that comes in from a customer that doesn't meet the happy path, we actually have to do something else about it, because we cannot push that much change to the service teams.
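As a rough illustration of the two ideas above (tapping orders by a flag without touching the pipeline, then keeping only happy-path orders), here is a minimal Python sketch. It is not Amazon's implementation. The `Order` fields, the `PRIME_NOW` flag name, and the `tap` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    flags: frozenset      # hypothetical: tags already present on orders in the main pipeline
    payment_ok: bool      # credit card is working
    address_valid: bool   # address is correct
    shippable: bool       # items can actually be shipped


def is_happy_path(order):
    """Happy path means everything goes right for this order."""
    return order.payment_ok and order.address_valid and order.shippable


def tap(orders, flag="PRIME_NOW"):
    """Read-only tap: pick out flagged orders and leave every other order alone.

    Returns (fast_lane, set_aside). Flagged orders that hit an exception
    are put on the side instead of asking upstream teams to change code.
    """
    fast_lane, set_aside = [], []
    for order in orders:
        if flag not in order.flags:
            continue  # not ours; it keeps flowing through the main pipeline
        if is_happy_path(order):
            fast_lane.append(order)
        else:
            set_aside.append(order)
    return fast_lane, set_aside
```

The point of the sketch is that the tap only reads from the stream. The ordering pipeline and the team that owns it do not need to change anything.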

And we built something called a facade, which is a composition service in the delivery function itself, which is somewhat unique that we had to do it. But now, if you look at the microservices world, there are more facades and composition services that are coming up because this is needed for integration at the enterprise level.

I'll talk a little bit more about the facade.

The evolution of the approach starts with creating the facade that cuts across hundreds of services, which means that we can handle things ourselves to a great extent. For example, if a payment has failed, what do we do? Send an email to the customer. That's a good thing, right? So we can handle that. We can do things that people don't need to deal with in their service, but it's still happy path only.
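To make the payment-failure example concrete, here is a small Python sketch of a facade that composes several services and handles the exception itself. Everything in it is an assumption for illustration: the `OrderFacade` class, the `PaymentDeclined` exception, and the client methods (`get_order`, `charge`, `release`, `send_email`) are hypothetical and do not describe real Amazon APIs.

```python
class PaymentDeclined(Exception):
    """Raised by the (hypothetical) payment client when a charge fails."""


class OrderFacade:
    """Composition service sitting in front of ordering, payments, and fulfillment.

    Each collaborator is an existing service client owned by another team.
    The facade calls them as they are and absorbs the failure handling.
    """

    def __init__(self, ordering, payments, fulfillment, notifier):
        self.ordering = ordering
        self.payments = payments
        self.fulfillment = fulfillment
        self.notifier = notifier

    def place_fast_order(self, order_id):
        order = self.ordering.get_order(order_id)
        try:
            self.payments.charge(order["customer"], order["total"])
        except PaymentDeclined:
            # Off the happy path: handle it here, not in the upstream services.
            self.notifier.send_email(order["customer"], "Your payment failed")
            return "set_aside"
        self.fulfillment.release(order_id)
        return "released"
```

Because the clients are passed in, the same facade can be exercised with fake clients in a test war room before anything physical ships.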

Then we can do a lot of monitoring and alerting ourselves. We don't have to deal with the service teams themselves. Also, we can do test war rooms. I think Capital One folks talked about the clean room, software clean room. But this is more intense when you actually have to launch something to the field, physical goods are getting shipped and whatnot.

We can actually start a test war room. When something is going wrong, we are actually recording it and then we have to handle it later on.

The facade that we built became a very mainstream service. The first, it gives you the decoupling in the business. Think of the three business units. You have ordering, fulfillment, and delivery. This facade gives me the decoupling capability between the different business units. And it's a service by itself.

It provides a lot of caching, and so it gives you better performance eventually when you have more customers ordering stuff. And it's a dummy data provider. I can make up anything I want in this, right? So I control its destiny.

Also, it acted as the request-response gateway to ordering, fulfillment, and transportation, which is a big thing. Think about delivery by drones, all kinds of stuff like that. This is where we can actually start doing request-response, and we can transfer it over to crowdsourced transportation.
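The three roles just described (caching, dummy data provider, and request-response gateway to transportation) can be sketched in a few lines of Python. This is only an illustration under assumptions: `DeliveryGateway`, its method names, and the carrier names `own_fleet` and `crowdsourced` are made up for the example.

```python
class DeliveryGateway:
    """Toy facade showing caching, dummy data, and carrier routing."""

    def __init__(self, catalog_lookup, carriers, dummy_data=None):
        self._lookup = catalog_lookup    # callable: item_id -> dict (slow backend call)
        self._carriers = carriers        # dict: carrier name -> callable(order_id) -> str
        self._dummy = dummy_data or {}   # made-up records served without any backend
        self._cache = {}

    def item(self, item_id):
        if item_id in self._dummy:
            return self._dummy[item_id]  # dummy data provider
        if item_id not in self._cache:
            self._cache[item_id] = self._lookup(item_id)  # call the backend only once
        return self._cache[item_id]

    def dispatch(self, order_id, carrier="own_fleet"):
        # Request-response gateway: callers do not know which carrier service runs.
        return self._carriers[carrier](order_id)
```

Swapping in a new transportation option, such as a crowdsourced one, then means registering one more entry in `carriers` rather than changing ordering or fulfillment.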

There is something called Amazon Flex, which is allowing people, like an Uber-like model, where you can get delivery passed on to a private driver who can actually deliver packages for us, and that also can be decoupled.

Again, think of that as a separate set of services. We can decouple our orders to a decoupled service as well. So the facade became a very mainstream service.

My point here is why we had to create service compositions. When you think of products or DevOps within an enterprise, you're decoupling the entire organization to product teams, and then you're building a lot of services. As your organizations mature, you're going to have hundreds of services, sometimes thousands of services.

But when you have to drive an enterprise-level initiative, you have to cut across multiple services. Think of it as if we had a monolithic application. Most of the functionality resides in the monolithic application. You can actually go drive change there. But what happens if there is a service composition? Too many fine-grained services.

If you think of Amazon.com, or a larger enterprise, as a decoupled enterprise, you have a lot of microservices to deal with. That inherently becomes a problem, right? And how are you going to deal with that?

The second is greenfields in enterprises are often limited, which is you cannot think of, "Oh, this is a new thing, and now you go start a new team, but it doesn't touch anything else." It's not going to happen. It's going to be very rare.

Even think of the drone delivery. Even then, that has to connect to our pipeline, right? So it's not by itself. It still has to integrate to our existing infrastructure.

Business functionalities are often required to interact with more than one service. You have to interact with more than one service for the business applications to work. And business really doesn't care. They just want things to work.

As a customer, you really don't care. You're going to pop open the Amazon.com website, or an app, or the Amazon Prime Now app, and you can place an order. You don't care about what is happening in the back end. You don't care if it is DevOps, or you don't care if it is monolithic. You don't care if it is microservices or whatever fun things are in the back end, right? You just want your stuff delivered to your home on time, and you get proper notifications.

Reality requires integration. That is very important.

And then hybrid usage of microservice and monolithic architecture. Even companies as large, as nimble, as modern as Amazon, we do have legacy stuff we had to deal with. How do you integrate with this legacy stuff? How do you decouple legacy applications? We can't just throw it away. You got things that have been built a long time ago, and it is actually, quote-unquote, "working," and it does its thing. So how do you integrate with this?

And also existing applications that are SaaS providers and other tooling that we had to integrate with, too.

There are other industry examples. Netflix has an API. Uber, ride services. PayPal's facade. There are a few larger companies actually have a very service-oriented... I also saw Spotify has something similar.

Larger services, they have decoupled the organization, kind of have gone through the journey that you're going through, probably, in most of the enterprises. They've decoupled the company into separate services. Now they have to integrate. So how are they going to deal with that?

Also, they have to communicate with external entities. So you can interact with Uber services as an outside entity, right? It serves both purposes. It serves a purpose for external integration as well as internal integration.

API gateways also provide a lot of additional advantage. When we were dealing with our facades, AWS didn't have the API gateway to the level of maturity that we wanted. For example, Apigee has a product. Azure, if you're using that, or Google has another product. If you're using public cloud infrastructures, you get the API gateways to probably give you some help, but it doesn't solve your problem entirely. That is the service composition.

Our facade that we built really drove a lot of service compositions, as well as it allowed us to do decoupling.

These are our key learnings from building Prime Now. Enterprise-wide initiative. We didn't know where it was going. It cut across the company's main business capabilities. We had to deal with a lot of service teams. We created a facade, a service composition. We composed services. A lot of engineering awesomeness in it. And we are lucky that it is successful. I will talk to you towards the end where we are right now.

What are the pros and cons of decoupled services? DevOps is awesome, but how do you deal with integration? End-to-end intact teams and DevOps are pretty amazing.

Enterprise programs will cut across major business lines and have major engineering challenges. If you talk to your senior leadership and say, "Okay, we want to do a new line of business, and this new line of business needs to deal with 200 services," and if you tell them, "Everybody has their own pipelines, and we don't know how to intercept all of this," they'll say, "Well, you guys said this is going to be faster. You're going to make our business faster. Now you want to slow it down?"

How do you deal with a situation like that and give more reliable answers to your business?

And how to minimize redundancy. That is another inherent problem with the independent services. Since it is decoupled, it's very hard to spot two services that are kind of doing the same thing. Sometimes that happens, and if you create such technical debt, how do you remove those?

How to use the right-sized teams of TPMs, or technical program managers, and enterprise-wide committees or centers of excellence to kind of drive the business challenges around it.

I think Gartner came up with this new term called mini-services. I don't know how many of you guys have heard about this thing. This is new. I found out and it's like, wow, there's a new name for this now. Apparently, it is a composition of microservices that is delivering a higher level of business capability. That in itself is a service, but it kind of plays a key part of MSA, or the microservices architecture.

Understanding the service granularity, what level of focus you want in a service versus not, and technologies for building integration microservices. Now they are coming up. API gateways are one of them. You have the ability to create technology that can actually create compositions without you doing a lot of work.

The last and most important thing is a single-threaded owner, which in Amazon terms is the two-pizza team lead, or someone in a leadership role who has direct access to the CEO for an enterprise-wide initiative. It can be any delegate. Someone who says, okay, this is important to the company, and the product team has direct access to the CEO to make very important decisions.

Because when you have 300 teams, if things don't come from the CEO, it's not going to happen. You really need the CEO's buy-in. So how you, as a technologist or as service teams that own products in the enterprise, get direct access to the CEO is very important.

All right. This is the status of Prime Now.

As you know, with any large company, we don't take things for granted. It's the work of many, many engineers, and different teams have tried this and made it happen. It is not that we were the first team to start this and make it successful. That is not true. There's been quite a lot of work that was done in the past to try to do something like this, to try to minimize the delivery times.

You're probably seeing a lot of these effects on delivery itself. Sometimes when you order products on Amazon.com, you get your products the next day without you even asking for it. That is also driven by a lot of the innovation that was done by the Prime Now team to influence the major line of business.

Essentially, this little sideline project became mainstream, and then it has started to influence the business of the company. If you think about five years from now or 10 years from now, I don't see any other way. I don't see a place where you have to wait. The offline-online world is going to merge together where you click an order and stuff is going to be at your doorstep within an hour, without you even paying anything extra for it, without you waiting for it. The connected nature of online and offline world is going to come together.

For that to happen, think of when Amazon is going to redefine the experience for delivery, and speed, and selection, what are your enterprises going to do? Your customer is going to expect stuff they order at their doorstep really the same day.

In the insurance industry, people are expecting claims to get paid within an hour or two of them making a claim, or the same day. Earlier, they had to wait for 30 days. The delivery speed is actually coming closer and closer, and in five, 10 years from now, things are going to change.

If you look back 10 years, getting stuff to your door within five days was okay. Now you're expecting it in two to three days or one day, and now within one hour.

Right now, Prime Now is available in 30-plus U.S. cities and in the U.K., and it's growing really, really fast.

It's a launchpad for new businesses: for Amazon Flex, which is a crowdsourced delivery program, and Amazon Logistics, which is a delivery engine. It kind of became a launchpad for new services.

It's 25,000 items over 25 categories, and when we started, we hardly had anything. We actually struggled finding items people were ready to give to us to ship. It was that bad. It's like, "Okay, are you guys going to really deliver this? Because you're going to impact our main business, and you're going to not deliver the packages on time, or you're going to probably deliver to the wrong address. How can we even trust you guys?"

To now, there are thousands of categories and also other sellers. There are a lot of sellers on Amazon Prime Now who sell through the service, which means that we can influence the third party. But we got there with a lot, a lot of hard work.

Most of the items are, as you can see, groceries. I never thought grocery delivery was going to be that exciting, to be honest with you, when I started. And now it is the most awesome thing that you can ever get into, right?

I thought, well, it's pretty hard trying to get milk and make sure it is in a chilled container and delivering it to the home. People are still saying, "Oh, it's not cold enough," or, "Oh, it is in the wrong place on my doorstep," and whatnot.

And you have electronics, and gifts, and seasonal items. For holidays, things like that. It's an amazing service. It's now available, and the two-hour delivery became free. That's even better. There is no extra payment for the two-hour delivery. We still have one-hour deliveries, and it's hyper-convenient for customers.

And the last is, I think it looks very futuristic to me. We didn't see that coming, even when we started. We had a lot of hope that things are going to work out, but we can really connect the dots and create the future, and we believe in it.

But this looks very futuristic with drones and maybe quantum delivery, I don't know. Things kind of show up to your door in fourth dimension, I don't know. But this looks very, very futuristic.

If you think of this, this will redefine experience of items getting to people. There's just nothing like it. If we actually played a part in that journey for the world to get better in terms of experience, it's an awesome thing to be part of it.

I want to do the same thing for healthcare. That's my passion now, is to really deliver healthcare experience to people similar to Prime Now.

All right. Thank you.