Driving a Tech-led Reimagination of eBay Through DevOps

US 2021

Randy Shoup

VP Engineering and Chief Architect · eBay

Mark Weinberg

VP of Core Product Engineering · eBay

One of the original hyperscalers, eBay helped pioneer modern operational techniques like sharding, circuit breakers, feature flags, and distributed tracing, and in 2020, eBay’s data centers turned in record cost effectiveness and availability. Also in 2020, however, the company needed to confront waterfall technical practices, years of legacy software and technical debt, slow product delivery, and a challenging competitive environment. This session outlines eBay’s technology-led reimagination begun in 2020, specifically the cross-organizational Velocity initiative to improve eBay’s ability to deliver value to customers.

We started by characterizing the breadth of the problem, which ultimately spans culture, organization, people, and technology, as well as every stage in the product development lifecycle. Using value stream mapping to identify constraints to flow, we decided to focus our initial efforts on improving software delivery across the board, because we recognized that eBay’s ability to deliver software rapidly, safely, and repeatably was a prerequisite for every other improvement. In addition, since one particular area of the site was a bottleneck for numerous business initiatives, we also focused on modularizing and modernizing its architecture.

Because our focus was on software delivery, we adopted the Accelerate metrics to measure success and look for opportunities. We launched ~10 independent tracks -- from build time to continuous integration to automated deployment -- and hand-selected pilot teams to work with in tight feedback loops. We have also embedded subject matter experts directly in product teams. We further use team-of-teams weekly meetings to share learnings and reinforce continuous improvement.

Explicitly to break down silos, the presenters are co-leaders of the initiative -- one from the product engineering side and the other from the tools and infrastructure side -- with entirely shared goals and metrics.

The initial results of this initiative are already bearing fruit, and we identify -- and bank -- new wins every week. With support from the top, and excitement from the grassroots, the Velocity initiative has galvanized teams to question the status quo and look for opportunities for improvement. Equally importantly, working collaboratively and breaking down silos are paying second-order cultural dividends. We have a long way to go, but this session will provide actionable insights for other organizations going through similar journeys.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Welcome to the afternoon plenary sessions.

I have known and admired the work of Randy Shoup for a decade. I met him at Jez Humble's FlowCon conference, and maybe the best piece of evidence of just how much I love his work is that Randy is one of the most cited people in The DevOps Handbook. It included the work that he did as engineering director of App Engine at Google and as chief engineer at eBay over a decade ago.

I am so excited about this presentation. After nearly a decade, Randy Shoup is again at eBay, this time as VP of Engineering and Chief Architect. I've always loved how Randy thinks about solving problems. To see this problem-solving dynamic put in service of increasing productivity at an engineering organization that has thousands of developers is truly awe-inspiring.

I'm so delighted that he will be presenting with his colleague, Mark Weinberg, VP of Core Product Engineering, who is a technology leader of the entire product engineering organization. Between the two of them, their areas cover almost everything that gets done at eBay engineering. So here are Mark and Randy.

Mark Weinberg

Okay. Hi, I'm Mark Weinberg from eBay. Welcome. Randy Shoup and I are here today to talk to you about how we're using DevOps at eBay to transform our engineering.

First, I thought I would just start off with the problem statement. What are we trying to solve? Well, frankly, we are too slow as a company and we lag industry leaders in terms of our engineering velocity. Why does this matter? Engineering velocity leads to better customer experiences, stronger business results, and more engaged and happier employees, which we all want.

If you're a developer, you want to write code. You don't want to sit in a wait state. You want to work on the code and work on the product. If you're a product owner, you want more features and value for your customers. If you're an executive at the company, you want to stay ahead and beat the competition. The bottom line is we need to move faster.

Our mission is to turn this around and make engineering velocity a competitive advantage for the company. For us, getting a little faster just won't be good enough. There are competitive threats everywhere. We have threats from bigger companies like Amazon and Walmart, but there are also new startups like Shopify and StockX and others. Bottom line: we just have to get faster.

How did we get here? Randy and I spent the first three months at eBay surveying the state of engineering velocity. We looked at the entire landscape. We talked with engineers and engineering leaders across the company. At a high level, what we found is that there are systemic challenges that have accumulated over many years at the company. This ranges from code and architecture, with lots of tech debt and monolithic code, to missing tools and infrastructure. We had a low-quality staging environment, some poor processes, lots of PRs submitted with too many changes involved, long-lived feature branches, slow code reviews, only yearly site-wide upgrades of frameworks and platform elements, and lots and lots of major team dependencies that effectively created distributed monoliths in the code.

As a result, there's no silver bullet. Fixing these issues is going to require improvement across many different areas and span many different disciplines in the company, from engineering, finance, customer support, even at the executive level. Luckily, these are well-known problems with well-established solution patterns. Many companies share these challenges, perhaps many of you at your companies have these same issues, which is great. We get to learn from each other, apply best practices, and effectively help each other get better. Today, Randy and I are here to share what we've done and what we've learned.

Most of you probably know about eBay. eBay is a global e-commerce leader. We connect millions and millions of buyers and sellers across the globe. In 2020, we are a $10 billion-plus-a-year revenue company. There are 159 million active buyers on the platform and 19 million active sellers. It's a very vibrant, high-traffic, high-scale site and application. We have roughly 12,000 employees across the entire company.

Introducing myself a little bit better: I work in the core product organization. I'm responsible for leading the technology across many different parts of the product organization. I also run a few teams. I have a team doing a big feature for eBay called Stores, I run a small mobile team, and I run our planning team that drives our quarterly and yearly planning for the entire product organization.

Let me introduce Randy Shoup, who will give you a little bit more information about himself.

Randy Shoup

Yeah. Cool. Thanks, Mark. I'm Mark's partner in crime for all this stuff. I'm the VP and Chief Architect for the company. My areas are platform and developer experience, architecture standards across the frameworks in mobile and all sorts of places at eBay. I have a stable of enabling architects that we deploy to individual teams, which we'll talk about later. I also own the cross-functional product management organization.

Together, the two of us are able to apply our leadership in reframing and reprioritizing a bunch of priorities of the company. We're connecting teams from Mark's organization and mine to help each other and unblock each other. We have the capability to encourage teams and permit them to take more risks than they otherwise would. We have both the ability to suggest to teams and also to implement mandates when that's required.

The next thing we did was an assessment of the situation at eBay. We used a standard value stream mapping approach with a couple of teams that are a cross-section across all of eBay, looking at all of the phases that go from idea to customer value. Planning is idea to project; development is project to committed code; delivery is committed code to deployment on the site; and once we've deployed, we need to iterate to make sure that we continue to drive customer value.

We found issues at every stage. From the planning perspective, there is lots of coordination, lots of inter-team dependencies, and almost every team at the company has too much work in progress. In development, developers were suffering from really slow build and test time. They were doing lots of context switching and being blocked and queued up in wait states. We have 26 years of architecture behind us and three generations of infrastructure, so there is lots of highly coupled architecture. We don't yet have a tradition of service contracts between services, even though we have many services, and there is lots of hidden work.

From a delivery perspective, not every team has a good end-to-end automated deployment pipeline. There are lots of issues, as Mark mentioned, in the staging environment. Lots of teams do manual testing without automated rollout. We're now introducing canary deployments and better usage of feature flags, but there are lots of opportunities there as well. Finally, in the iteration phase, we found that teams don't necessarily have end-to-end monitoring of their systems, or of business metrics and customer experience. There are lots of issues in tracking what customers see, and what we would frame as a dysfunctional experimentation capability. We actually have a great experimentation platform. Some teams don't use it at all; other teams overuse it.

There are lots of targets in a target-rich environment. What Mark and I decided to do is focus intently on software delivery and a little bit on software development. Why? Because this is the bottleneck. Thinking of the Theory of Constraints idea, software delivery is currently the bottleneck at eBay in our ability to make improvements in all these areas. If we can unblock and unlock software delivery at eBay, it makes everything else we want to do possible by enabling faster change and reducing the cost of change. If I can deploy multiple times a day instead of a couple of times a month, now I can make architecture improvements, more experimentation, and more changes for customers.

In terms of measurement, we're trying to improve software delivery, so we're using the DORA or Accelerate metrics. As my friends at eBay see all the time, I'm showing the Accelerate book almost constantly. Here is how we measure ourselves, the standard metrics. If you look at eBay overall before we started doing this velocity initiative, we were basically a medium performer across the board. Just to give you a sense of where we've been able to go this year in the velocity work, the pilot teams we're working with have moved into the high-performing area. We've been able to triple our deployment frequency for those teams, reduce the lead time for change two and a half times, keep time to restore service essentially the same, and improve the change failure rate three times.
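As an illustration of how the four key metrics could be derived from deployment records, here is a minimal Python sketch. The `Deployment` record, its field names, and the choice of medians are assumptions made for this example; they are not eBay's actual schema or dashboard logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # whether it caused a production failure
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deploys: list[Deployment], window_days: int) -> dict:
    """Summarize the four Accelerate metrics over a reporting window."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        # None when nothing failed (or nothing has been restored yet)
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Tracking the same summary per app and per month would make a change such as "deployment frequency tripled" directly visible as a ratio between two windows.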

Mark Weinberg

Great. Thanks a lot, Randy. How are we going to get faster? What we decided to do was focus our efforts on a select number of what we call pilot domains, applications within those domains, and platform tracks. The reason is we're learning; we didn't want to start with everything all at once. Roughly 10% of eBay's active applications are in this pilot.

For each of those, we wanted to focus on both short-term wins and longer-term capabilities, because they both matter. On the short-term side, if we can save five minutes on a task for every engineer every single day across the entire engineering organization, that turns into a big win. We want to do lots of those things. There are also larger changes like refactoring code and rearchitecture. These have bigger impact and carry more risk, but there are a few of those that we do need to focus on.
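To make the "five minutes a day" argument concrete, here is a small Python sketch of the arithmetic. The headcount and number of working days used in the example call are hypothetical figures chosen for illustration; the talk does not give them.

```python
def hours_saved_per_year(minutes_per_day: float, engineers: int, working_days: int) -> float:
    """Total engineering hours recovered per year from a small daily saving."""
    return minutes_per_day * engineers * working_days / 60


# Hypothetical inputs: 3,000 engineers and 230 working days per year.
saved = hours_saved_per_year(minutes_per_day=5, engineers=3000, working_days=230)
print(f"{saved:,.0f} engineering hours per year")
```

Under those assumed inputs, a five-minute daily saving adds up to tens of thousands of engineering hours a year, which is why many small wins are worth pursuing alongside the larger rearchitecture work.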

Driving improvements is the core of where we're focused, to Randy's comments. Developer productivity is the daily build, debug, test cycle that developers are in most days. Improving our build times, server startup, and PR validation times is a huge focus for us.

In software delivery, improvements are going to make everything faster. We have a ruthless focus on process automation. As a company, we still did lots of manual testing and lots of manual steps, so we're trying to automate things like load and performance testing, site speed tests, and partner sign-off. When one team is dependent on another team, the other team will run tasks, and a lot of that was manual back and forth through emails. We're trying to automate those processes. Security, localization, and accessibility testing: automating those things is a real focus.

We're also focusing on faster deployment. Canary deployments and techniques like traffic mirroring really help us get deployments out quicker. As I mentioned, instrumenting the code, focusing more on alerting, and improving observability help us be confident to roll out faster knowing we can roll back fast as well. eBay is now managing payments for customers, and when you're talking about customers' money, you have to be able to detect and resolve issues extremely quickly. Now we're much better able to do that.
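One way to picture the canary idea is as a gate that compares a small canary cohort against the existing baseline before widening the rollout. The sketch below is a simplified illustration in Python; the `Cohort` type, the thresholds, and the three-way decision are assumptions for this example and do not describe eBay's deployment tooling.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """Hypothetical request and error counts for one group of servers."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Cohort,
    canary: Cohort,
    min_requests: int = 1000,       # assumed minimum traffic before judging
    max_ratio: float = 1.5,         # assumed tolerated ratio of canary to baseline error rate
    absolute_floor: float = 0.001,  # assumed error rate that is always acceptable
) -> str:
    """Return 'wait', 'promote', or 'rollback' for the canary."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to compare the cohorts
    allowed = max(baseline.error_rate * max_ratio, absolute_floor)
    return "promote" if canary.error_rate <= allowed else "rollback"
```

A real gate would typically look at more signals than error rate, such as latency and business metrics, but the shape is the same: limited exposure first, an automated comparison, and a fast path back if the new version looks worse.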

On the architecture side, we focused on areas that have high active development and high customer usage. The call-out here is that we have a page on the site and in our app called View Item. It's the primary experience for the product, with hundreds of millions of views a day on that page. If we can make velocity improvements there, it allows us to improve the product much more rapidly. These areas also have lots of team dependencies, so unblocking those dependencies is a big focus for us. The old code was also very brittle and hard to test, so that's where we're putting our focus.

The way we think about this is in terms of more horizontal platform tracks. These are tools and infrastructure to better support our builds, CI, staging environment, and even educating and training our engineers on techniques and patterns. Then we have pilot domains that cover big areas of the product, such as selling, search, and ads. These teams work very closely together to come up with ideas for improvements, implement those improvements, and share across all the domains to make everybody faster.

This has truly been a collaborative effort across our technology platform teams and our application teams. These teams work extremely well together, very iteratively. This didn't use to happen at the company. There were mostly walls between the two. We've done a good job driving collaboration across them. As Randy mentioned earlier, we embed senior architects within these teams, and they've helped to coach, advise, and even write some critical code. That's made a big difference.

Communication is key. Randy and I meet every single day to talk about issues, come up with solutions, and work on our plans. We drive a weekly Scrum of Scrums meeting where individuals and teams learn from each other on things that have been working and things they could use help on. Randy and I do weekly deep dives with each of these teams to coach, push them, find gaps, and come up with solutions to problems. We also regularly do a monthly operating review with our executive team to keep them up to date, engaged, interested, and curious about what we're doing, because we need their support.

Randy Shoup

From a measurement perspective, we're using the four key metrics as the way to measure our progress. One of the things we did almost immediately was add to our existing development dashboard the four key metrics for every app. It turns out we had all the basic data, but we hadn't put it together in that form. We've continued to iterate on that. We added granular visibility to the entire delivery pipeline. For every deployment by any app, we can look at how long it took to build, how long it took to roll out to the first machine and the last machine, whether it was a rollback, and whether canary deployment was used. We keep adding more capabilities to this, and it's been super helpful for us to see overall visibility, but also super helpful for individual teams to debug and optimize their processes.

For each of the platform tracks Mark mentioned, we have lower-level input metrics: goals for improving P95 build time, P95 startup time, end-to-end PR validation time, and so on. We track ourselves at the individual level and at the overall program level.
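For readers unfamiliar with the P95 input metrics mentioned here, the following Python sketch shows one common way to compute a percentile, the nearest-rank method. The function name and the sample build times are hypothetical; the talk does not say which percentile definition eBay's dashboards use.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical build durations in minutes: mostly fast, with one very slow outlier.
build_minutes = [4, 5, 6, 7, 60]
p50 = percentile(build_minutes, 50)
p95 = percentile(build_minutes, 95)
```

A tail metric such as P95 surfaces the slow builds that an average or median can hide, which is presumably why it is used as the target for build and startup times.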

In terms of iteration, this has been the huge unlock. When we went to teams and said, "Hey, we see you're deploying once a month, twice a month. What if we asked you to deploy every day?" they'd be like, "Oh my God." We say, "Well, tell us why you can't." They give us a big, long list of 10 or 20 different things that they can't do. We're like, "Great. We're going to knock down the first three, and then we're going to knock down the second three. Oh, by the way, this other team has already solved seven, eight, and nine." Having the conversation from the perspective of the goal we want to achieve, and what we can do to help remove impediments, has been the huge unlock. That's been the driver for all this collaboration.

For each thing we're doing, we do a very tight Deming cycle of plan, do, check, act. We try a thing with one team. If it works, great, we start rolling it out to everybody else in the Velocity program, but also outside. If it doesn't work, we stop it and try something else. As Mark mentioned, we have a pretty big focus on education because there is a lot of opportunity to help educate the thousands of developers at eBay on modern ways of doing things with domain-driven design, TDD, automated testing, and all that.

Here are some quick results. This is the big thing that really matters to the execs. For the pilot teams involved in this initiative, we have been able to double their productivity. We've moved the DORA metrics, but it has shown actual results in doubling the productivity of those teams. Holding team size constant, the same teams with the same members are delivering two times or more the number of features, bug fixes, et cetera. Those pilot teams roughly represent 10% of the actively developed apps and services at eBay. We've improved their deployment frequency by 3X, lead time by two and a half X, and change failure rate by 3X.

Mark Weinberg

It's really been fun to see the results. The first thing, as we've talked about, is putting these metrics in place. I couldn't recommend doing this more, using these industry-standard metrics. It really stops the debating that happens over whether these are the right metrics to use and whether we're sure they tell us if it's working. We hear that a lot. It's been great to point to those metrics and the research behind them, and show the correlation of how improving these metrics leads to better productivity.

We focused on removing bottlenecks, as Randy mentioned. We keep tearing these down with the teams. We knock one off and ask them, "Okay, what's the next problem that you have?" We keep doing that, and it's made a big difference.

We spent a lot of time focusing on nuts-and-bolts stuff: build, startup, and PR validation times. As a quick story, I'm a developer myself, and when I started at eBay I asked to do a build of one of our larger components. I synced the code, did a build, and found it took an hour. I was shocked by how long it took and started talking to people. Through a lot of work in our tools, paying attention to these things, and just doing the work, that 60-minute build now takes four minutes. We've done that a number of times. It's amazing what improvements we've been able to make in these core areas.

PR validation time is another case. We had a team that was taking an hour and 45 minutes to validate a PR. That's now down to 10 minutes. If you add these things up, it really makes a huge difference.

As we've talked about, getting our staging environment healthy, making sure we have good quality data that is privacy clean, good components that are reliable and trusted, and having engineers treat staging with care like it's a production environment, means we can trust that when we run something in staging, we know it's a good representation of how it will work in production. That's been a big difference.

Automation includes automating site-wide upgrades, testing, and deployments. We still have work to do in this area. We've made a ton of progress, but our aspiration is to get to one-click automated deployments for all of our active applications. We're not quite there yet, but we've put a big focus on that.

Other things include definition-of-done processes around partner sign-offs, doing proper code reviews, paying attention to these things, not letting code reviews sit around for a couple of days, and having a regular recurring rhythm of jumping on code reviews. Things like that have made a big difference.

Lastly, about a year ago, we were doing our mobile releases just once a month, and that led to all kinds of problems. Now we're down to weekly mobile releases, and that's improved our quality. People don't rush to get a change in because they're worried about the next train not leaving for 30 days. They do things right. It also allows us to do faster hot fixing of a problem. It results in smaller batch sizes, and that has been a huge improvement.

A lot of this has changed our culture. We look at metrics regularly. Teams are inspired to work with each other and help each other. Teams don't want to go back to the old way of working. They feel like it's working, and we hear that a lot. We have teams that aren't in the pilot asking, "Can we participate?" There is a lot of demand for some of the improvements we're making.

On the collaboration side, creating an enabling culture has really paid off and made the environment more fun to work in. People find a problem and want to work together to fix it. We do a lot of partnering with teams where maybe we didn't have close collaboration before, like our security team, SOX compliance, accessibility, and localization teams. It's more fun to work in that kind of environment.

Randy Shoup

From a community and sharing perspective, this was something I was not expecting at all. The standard eBay model is that my teams on the core technology side produce tools and infrastructure, which are then consumed by the product engineering side that Mark works in. Part of that is still true, but we've also found that teams in the product organization are automating their own workflows. They're writing little tools to help remind them to finish PRs and code reviews, and to do performance testing and accessibility testing that they need to do. Teams are automating their own stuff.

We give them space, now multiple times a week, to demo new tools they put in place and new practices. It's a big opportunity to share what they've done with other teams, and there is a lot of team pride associated with that. That sharing is something that never happened laterally between those teams. It's been a great injection of community and collaboration into the organization.

Last, but very much not least, we've really benefited from a lot of executive support and engagement. Just as I'm a returnee to eBay, so is our CEO. He constantly highlights at company all-hands, exec meetings, and operating reviews how important this velocity initiative is. He says many times, "This is the most important initiative at the company," and then the uh-oh for Mark and Randy is that he's like, "Go faster." It cuts both ways.

Here are some of our current challenges. From a program outcome perspective, we've been really good at improving the team-level metrics. What we haven't yet been able to do, and we're hoping to do in the next phase, is improve overall eBay outcomes. We have ideas about how we can leverage Mik Kersten's Flow Framework to think about those outcomes and measure them in a good way, but that's remaining work.

A challenge for us as the initiative team is that there is never enough time and resources to do what we want. We have a big, long list of impediments we want to help teams unblock, and we're under-resourced from the platform side and frankly also under-resourced on the product engineering side. We need to make sure the teams involved continue to retain their commitments to improving their velocity in addition to all the other things they're asked to do, like security patches and, by the way, features and customer value.

Another thing we found relatively recently is that a lot of the team representatives in this program tend to come from the QE side of the organization rather than from development. That's been great because those people are about quality and pipelines. But when we're trying to move a little bit more upstream, toward trunk-based development and changing development-team practices, we're finding that we need to bring in the engagement of development leaders too. That's a learning for us.

The last thing, for Mark and me, is that we're pretty overtaxed. At the moment, Mark and I are doing a lot of high-level stuff and a lot of deep dives with individual teams. We need to figure out how to scale ourselves and delegate a little bit more.

Mark Weinberg

We see a lot of people focusing too much on the metrics and losing sight of what we're actually trying to achieve. There is also a little bit of gaming of the system, people trying to make sure they're hitting these metrics. We actually want more pilot apps in the program, but when we add new apps, maybe our deployment frequency numbers go down or our lead time for change goes up because those apps are just starting out. We hear that a lot about the metrics.

There is this notion of fear of failure and the consequences. If we go faster, are we going to create quality issues? If we create quality issues, are we going to get in trouble? We're trying to teach people that going faster doesn't necessarily mean quality goes down. In fact, we're seeing it get better. We're also creating an environment where people understand that going slower has consequences that are probably greater than us having a deployment that we had to roll back. Getting people over that fear of failure is really important.

We still have some people who don't necessarily believe in the approach. There is engagement with those folks on how we're doing things, why we're doing things, and teaching them why this works, how it has worked at other companies, and how it has led to success. It's a big company with a lot of people, so education is an important area where we need to get better.

Randy Shoup

I'm going to close with what we hope the future will look like. By contrast to the current state of our lifecycle, here is our North Star. From a planning perspective, we'd like to be doing rolling planning with many small, cheap experiments all the time. If, and only if, we find value from that experiment, only then do we double down with a big, massive cross-functional project.

From a development perspective, we'd like teams to be doing small batch sizes, maybe single-piece flow: one single PR, one single commit all the way to production. We want fast build and test iteration, at least daily merges and deploys, and a more decoupled architecture as we go forward.

From a delivery perspective, we'd like a fully automated test and deployment pipeline with one-hour commit to deploy. We'd like to do a lot more iteration in production using feature flags. Last, but very much not least, on the iteration side, we'd like to do way more end-to-end monitoring. We'd like to have tracking everywhere, again with many small, cheap experiments and rapid feedback on those results.

The help we're looking for is that we're going to be scaling this initiative from 10% of applications to hopefully 50% really soon. We'd love people's experience around taking an initiative like this and scaling it much more broadly. As we mentioned in our challenges, we'd like help inspiring and motivating engineering managers. It's easy to motivate the developers; they're self-motivated for this work. It's actually easy to motivate the execs. The middle is a little bit of the challenge. We also have great executive commitment and sponsorship at the moment, but we'd love people's thoughts on how one sustains that commitment to this kind of initiative over the long term.

As we mentioned, there's a huge amount of executive sponsorship for this work. We're lucky to have our CEO, Jamie Iannone, to tell us about how this matters for eBay.

Jamie Iannone

Hi, everyone. I'm honored to be here today among this incredible group of tech leaders. When I came back to eBay in 2020, I shared my vision for a tech-led reimagination for the company, one that would propel us forward and help us set the stage for the decade ahead.

The cross-functional velocity initiative that we kicked off this year, led by Mark Weinberg and Randy Shoup, is a key part of our strategy, underscoring our work to create an environment where both engineers and innovation can thrive. We've already seen this initiative double the productivity of the teams involved and directly improve our ability to deliver value to sellers and buyers around the world. It's truly incredible.

I think it's helpful to explain how this success has been achieved. First, we started with clear goals and metrics and a collaborative approach across the product, engineering, and technology platform organizations. Throughout the process, the teams have zeroed in on the impediments and bottlenecks in our software development system, working to eliminate them one by one. We've also embedded strong architects into our teams to help improve the team's technology, their practices, and their ways of working overall.

It's great to see how excited and inspired all the engineering teams are to be part of this initiative and how they're continually working to improve our velocity across the entire technology organization. I've asked Mark and Randy to scale this initiative even further next year, and I can't wait to see what they and their teams are able to achieve. Thank you.

Randy Shoup

Thank you very much. This was a lot of fun for us, and I hope it was valuable for you all. Thank you.

Host Outro (Gene Kim)

Thank you, Mark and Randy, and my heartiest congratulations to you and all of the eBay engineering teams whose work led to that amazing message from eBay CEO Jamie Iannone. I think it is a phenomenal testament to what you and the teams have done and shows just how important the work of this community is.

Here is my request for all people interested in speaking at future DevOps Enterprise Summits: if you are in a position where you can get your CEO or COO to share a similar message, I definitely want to hear from you. The team from Nationwide Building Society was first to pull this off, followed by American Airlines, Fannie Mae, and eBay is now the fourth. I am confident that they will not be the last.