Dawning of a New Era

Las Vegas 2019

Head of Customer Operations · AutoTrader

AutoTrader is in the middle of migrating all its apps and services from two dedicated datacentres to the public cloud. This talk gives a brief synopsis of what we do, explains the need to move to the public cloud, and gives 'warts and all' coverage of what worked and what didn't.

Full transcript

Dave Whyte and Andrew Humphrey

Dave Whyte: It's an honor coming here. Basically, me and Andy come from Auto Trader, Auto Trader UK. There's no connection to autotrader.com, but it's quite a similar company, and we've flown all the way from not-so-sunny Manchester. So it's a real honor to be stood here talking.

Introductions: my name's Dave Whyte. I'm an operations lead at Auto Trader. I've been there for 15 years. I'm also a co-organizer of our DevOps Manchester meetup group, so quite proud of that.

Andrew Humphrey: And I'm Andy Humphrey. I work in customer operations at Auto Trader. I also co-organize DevOps Manchester, but I thought it'd be weird if we came in matching outfits, so I let Dave have the T-shirt. We've worked together a lot, for maybe 10 years at Auto Trader, and it's really nice to be able to tell you a story about how we work.

Dave Whyte: The talk's "Dawning of a New Era at Auto Trader." Essentially, it's about the migration-to-public-cloud journey that we're on. To set the scene, we've currently got two data centers, physical VMware servers, and that sort of setup. We want to kill all of that and move everything to the appropriate public cloud. We'll mostly be talking about GCP. It's quite ambitious for us to move lock, stock, and barrel to the public cloud, and we want to take on board the benefits we're going to get from that.

01 Auto Trader's Business And Eras

Andrew Humphrey: If you know autotrader.com, you probably understand our business as well. We run the biggest online automotive marketplace in the UK. It's a really, really busy platform. We connect vehicle buyers with vehicle sellers, as you'd expect, and we've got tens of millions of consumers coming to our platform every month.

We've got a really big influence over the automotive industry in the UK because so many car dealers use our platform and so many consumers know our brand. We're really lucky from that point of view. From an organizational point of view, we floated on the London Stock Exchange in 2015, and we've doubled our value since then. We're one of the top 100 companies by valuation in the UK. We do that with 800 people, mostly based in Manchester in the UK, including over 200 developers and a centralized infrastructure and operations team of 27 people.

Dave Whyte: Eras. At Auto Trader, we've gone through a few eras. Our first era was as a magazine company. From 1977 to 2013, we were a very successful magazine company. In 1996, we launched our first website, and very quickly it became apparent that the internet was the thing of the future, so we built out services utilizing the internet.

In 2013, we published our last magazine and made a really successful transition to being 100% digital. In that period, that meant a traditional website and a traditional database. You come to our website, you want to search for a vehicle, and it's all done in our infrastructure.

From 2019 to now, I believe we're more of a technology company. I'm saying this because we've gone from a traditional website setup to utilizing our data insights to enhance and extend our platform capabilities. That's making APIs available, taking on board technology from GCP, and building really quite cool products for our customers that help them with selling vehicles.

Andrew Humphrey: I'll talk to you a bit about our applications and services. One of the things we're most known for is connecting vehicle buyers with vehicle sellers. Over 90% of people in the UK know what Auto Trader does, and this is what they think of us. But actually, we have loads more services than that. It's not just one application.

From the consumer side, people want ever more choice. We're really well known for selling used cars, but people want to go to one place and see a new car, a lease option, different finance options, and a used car all in the same place. We need to offer people more choice, and we're continually trying to evolve our applications to deliver that.

The car-buying process is really stressful. You have to argue over the price. You have to fill out thousands of forms to get finance. It takes a long time, and we just want to make that really convenient. The next car you buy might be an online purchase. The electric-vehicle vending machine in London that Auto Trader set up might not be how it materializes, but online transactions are coming, and already in the US, Amazon and other players are trying to do this as well.

From our customers' point of view, most of our customers are car dealers, and they're going through this digital transformation themselves. They're faced with more competition, needing to use data more and more to drive their businesses. We see all the information about the automotive marketplace in the UK, and we provide them with data-driven products to know what vehicles to buy in their area, what's in demand, what price to buy things for, what price to sell things for, and help them run their business efficiently.

As we progress and build our platform out, we're now interacting with different kinds of customers, like banks and insurance companies, who need our data, insight, and services and are connecting to us through APIs. This is a whole new world. We're not just a single-application company. We've got hundreds of applications that are continually evolving, and the pressure to move quicker grows every year.

02 Why Public Cloud

Andrew Humphrey: This is where we are. We still have two physical data centers, which is traditionally how we've hosted all our applications. More and more, we're starting to use different cloud platforms for different workloads, depending on what suits each workload best.

At the moment, we're in the middle of an all-out migration of all our applications from data centers to GCP, Google Cloud Platform. We've managed to migrate over 300 applications in the last year, and we're hoping in the next few months that'll be complete. Part of what we're going to talk about is our story of how we're doing that.

Dave Whyte: The question is, why public cloud? Public cloud is quite cool. It can be a good thing to do, but you don't have to move if you don't really need to. Andy's mentioned that performance is pretty good, and the previous slide showed really good stability. For us, it's all about increasing organizational agility and increasing velocity in releases.

Our first step was to build our private cloud platform. That was built on CloudStack. Very quickly, it became quite hard for us to maintain that and look after the underlying infrastructure. It was open source, so when we had issues it was hard to get the right support.

Lastly, there was GDPR. We had a customer who wanted end-to-end encryption for an application, and I think we spiked out a bit of work for six weeks. Three months later, we still couldn't build it on that platform. One of our colleagues spiked out a bit of work in GCP. In two days, he had a working solution for us. For us, it was very clear that moving forward meant moving to the public cloud. Just last week, actually, we completely killed off our private cloud. All servers powered off. It was quite a proud moment.

For our migration, I believe three areas are really helping us do it well. First is culture. Second is organizational agility. Third is people.

03 Culture

Andrew Humphrey: Culture is really important to us. Our culture is our brand at Auto Trader, and we really believe that a great organizational culture should give you highly motivated teams who are energized and delivering great value. A great organizational culture should give you a place of work where you can be yourself and do your best work.

What we were finding maybe five or six years ago was that wasn't the case. Some of the things we'd inherited from being a print organization were that we were really fragmented. We had 15 different offices all over our country, which is probably about the size of Nevada. It was difficult to communicate between different teams, and all our work areas were quite siloed. We built up different product teams who were competing with each other. We had no real ongoing vision or mission about what we could align behind. We had hierarchical layers of management, so lots of senior managers would sit in their offices away from all of their teams, and communication between different layers of that hierarchy was quite poor.

At that time, about five or six years ago, we got a new CEO who'd had experience running a digital consultancy. One of the first things he did with his leadership team was agree a mission for Auto Trader. This was the first time we had an ongoing mission around a digital objective, making a clear demarcation between a print organization and a digital one. This is still our mission: to lead the digital future of the UK automotive marketplace. That's how we align all of our work towards that goal.

The other thing we did was introduce principles about how we would work: values. These are our six values today, which we still talk about a lot. These values are not something that you just see in a presentation. This is how we measure performance, construct interview questions when we're hiring people, construct our induction, and evaluate evidence when you're looking for a promotion.

A couple have been really crucial. We have a massive market share in the UK, and that means it can be easy to be complacent or arrogant. We have to continually strive to deliver customer value and make our services better and better. Being humble is important to remind us of that. Being courageous is important as well. If we're going to lead the digital future of anything, it's important we make bold decisions, do new things, disrupt our industry, and take it in a new direction. It's important that we challenge each other to do that and don't play it too safe.

One other thing we did was talk about how we were going to work, what work we value, and we agreed at organizational level on operating principles. It's a similar format to the Agile Manifesto: principles on the left-hand side override the ones on the right-hand side. It doesn't mean we don't do the things on the right-hand side. We're clearly stating we value the things on the left. This was a line in the sand to say this is how we're going to work, and to help us make decisions and work together clearly.

Two things are useful in this context. We've talked a lot in this conference so far about products versus projects. Product evolution has been a focus of ours for five or six years now. We used to do everything as a short-lived project. We'd bring in contractors from outside the organization, deliver a new product in three to six months, and on the day of going live all those contractors would leave. Working in operations, I would be left to clear up whatever happened next. Five or six years ago, we decided we would have long-lived, multidisciplinary teams that owned applications and evolved them continually.

Another is people development. We used to hire a lot of people from outside the organization, but we made a conscious decision: if technology is our core competency, we should focus on developing the skills and experience of our own people. We called that out clearly to say that people development is one of our operating principles.

Other things we've done to promote a healthy culture are around our working environment. We were fragmented in lots of offices. We brought everyone together into two offices, a smaller office in London and a bigger office in Manchester, plus a small office in Dublin. That means collaboration is a lot easier. These offices are built for collaboration. We hot desk. Even the legs underneath the desks are recessed slightly so that when you're sitting next to a colleague or pairing with someone, you don't bash your legs on the desk. They're designed so that you can be fluid and flexible in your team structure.

Other changes were around reporting lines. We used to have an IT department, which seems crazy for a technology company. I don't believe IT departments should exist anymore. That was causing friction between product owners who reported to a different executive and technical leads who reported to the CIO. We don't have that anymore. Our product and technical teams were brought together into multidisciplinary teams with a mission that aligned to our overall organizational goal. Technical voices are heard alongside product voices when we make any decision around our products.

04 Organizational Agility

Dave Whyte: The second area is organizational agility. For us, organizational agility is a company's capability to rapidly adapt to change. We're talking about a top 100 company moving all their stuff into public cloud. That's a massive change. The same goes for GDPR or pressure from competitors: if a competitor brings a product online, we have to adapt and bring out something better.

To embrace organizational agility, you've got to be an enabler and not a blocker. We definitely were blockers in the past. Operationally, stuff like building physical servers or VMware servers might take six or seven weeks. Raised tickets went into a ticket black hole. We used to have a CAB process. We don't anymore. It used to be two or three CABs a week, taking two or three hours at a time. For us now, that's not a thing. By embracing new, flexible processes and new technology, operationally we are now much more enablers than blockers.

Andrew Humphrey: The story of our organizational agility is probably best told with this graph: seven or eight years of releases to our live environment. Lots of people at Auto Trader have worked on this because getting these things right takes people from infrastructure, product teams, and all kinds of skills working together.

Five, six, seven, or eight years ago, we had people doing manual deployments, logging onto servers all day, every day to try to get releases out the door. They'd be operations staff, and development teams would have to plan releases a month or six weeks in advance. As we moved to being a digital company, we started focusing on release automation, extending continuous integration pipelines out to the live environment, automating things, getting rid of manual errors, and increasing speed. Releases came down from a few hours to maybe an hour each, and we got better repeatability.

Squads started learning to get into a cadence of regular releases: don't save up all your work for a big release every month. As frequently as you can in your team, try to get that value out the door. As we've moved on, a private cloud environment doubled the releases we were doing each year because we were using infrastructure as code and releasing infrastructure changes tested through continuous integration pipelines just like applications. Infrastructure and applications were promoted at the same time.

As of last year, we moved to a public cloud environment, and release numbers have trebled again. Now we believe we're in that continuous delivery mindset where people don't save up work. Product teams have worked to get slick at releasing value early, and our platform allows that to the point where this year we think we're going to do 40,000 releases. Our biggest day this year was 455 releases.

This matters because we can react quickly to any threats that come along, and we can react to any opportunities quicker than anyone else. In terms of the chaos this might cause, the number of failed releases is reducing year on year. As we release things in smaller component packages, they're easier to test, quicker to get out there, fail less, and even if they do fail, we can back them out more quickly.

Dave Whyte: The left-hand graph shows a massive increase in release velocity, which is great. But operationally it's, oh my God, how do I know what's broken and what's caused issues? Historically, we used to have an Outlook calendar as our forward schedule of change. You had to manually put in that you were going to do a release. With velocity, it's hard to manage that. If you're trying to see three or four releases during a day, did that release actually tie in with the Outlook calendar? It's a nightmare.

We have some clever people at Auto Trader, and we like building some of our own products. This is an in-house product called Lighthouse. Effectively, it's our release dashboard. When a release starts in a pipeline, it sends a timestamp to Lighthouse, and it sends a timestamp once the release is finished. Operationally, if monitoring tells me there's a problem related to Consumer Gateway, I can work out that this was due to a release because it started erroring as that release went in. It gives us more information.
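The correlation described here, matching the moment an app starts erroring against release start and finish timestamps, could be sketched roughly as follows. This is not Lighthouse's code: the `Release` record, the `releases_near` function, the example data, and the five-minute grace window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Release:
    """Hypothetical release record: one pipeline run for one app."""
    app: str
    started: datetime
    finished: datetime


def releases_near(releases, app, error_start, grace=timedelta(minutes=5)):
    """Return releases of `app` whose window (plus a grace period after
    finishing) contains the time the errors began."""
    return [
        r for r in releases
        if r.app == app and r.started <= error_start <= r.finished + grace
    ]


# Made-up example data.
releases = [
    Release("consumer-gateway", datetime(2019, 10, 28, 10, 0), datetime(2019, 10, 28, 10, 4)),
    Release("consumer-gateway", datetime(2019, 10, 28, 14, 30), datetime(2019, 10, 28, 14, 33)),
    Release("forecourt-service", datetime(2019, 10, 28, 14, 31), datetime(2019, 10, 28, 14, 35)),
]

# Monitoring says consumer-gateway started erroring at 14:34.
suspects = releases_near(releases, "consumer-gateway", datetime(2019, 10, 28, 14, 34))
for r in suspects:
    print(f"{r.app}: release {r.started:%H:%M} to {r.finished:%H:%M} overlaps error onset")
```

Recording both a start and a finish timestamp matters because errors can begin partway through a rollout, not only after it completes.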

05 Platform Visibility, Incidents, And Cost

Dave Whyte: When we moved to Google Cloud, the idea was to embrace technology. Our main digital apps are on GCP utilizing Kubernetes and Istio service mesh. This dashboard is from a Node app we built taking on board all that information, most of it from code. From here we can see the number of Docker containers, live CPU and memory being used, and the number of applications in this environment. I can do the same for prod, non-prod, and dev.

Next is our app directory. This is a service catalog of all our applications. Every single application that's live and being used is available for anyone to see and get the data from. You haven't got to search a wiki for error information. It's all in one place for dev and ops. This is truly a DevOps dashboard that we built ourselves. This is not a vendor product.

For an application like AB Test Allocator, Consumer Experience is the product squad responsible. You've got the owner, Andy Riley, a developer owner for that application. It ties in so if he leaves the company tomorrow, he'll get taken out of AD and we'll get an alert telling us that the owner for the app has left the company and asking us to find a new owner.
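The ownership check described here, flagging an app whose owner is no longer in the directory, can be illustrated with a minimal sketch. The catalog structure, the field names, the second app's squad, and the idea of passing in a plain set of active usernames are assumptions; the talk does not describe how the real integration with AD works.

```python
def find_orphaned_apps(catalog, active_users):
    """Return names of apps whose listed owner is not an active user."""
    return sorted(
        name for name, entry in catalog.items()
        if entry["owner"] not in active_users
    )


# Hypothetical catalog entries; the field names are illustrative.
catalog = {
    "ab-test-allocator": {"squad": "Consumer Experience", "owner": "andy.riley", "tier": 1},
    "forecourt-service": {"squad": "Example Squad", "owner": "paul", "tier": 1},
}

# Pretend this set came from a directory export after one owner has left.
active_users = {"paul"}

for app in find_orphaned_apps(catalog, active_users):
    print(f"ALERT: owner of {app} has left; please assign a new owner")
```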

Because it's a tier one app, we've got various BCP and DR strategies. Tier one is zone and region failover, so I know what DR strategy this app has. We've got a buddy: an operations engineer, a squad buddy. Even though we're a centralized operations team, one operations person will get a squad and go to their stand-ups and meetings, be a liaison, and give them a person they can speak to within operations.

There are bookmarks for the app URL, admin URL, deployment pipelines, source code, Kibana logs, metrics, service mesh, and tracing. At Auto Trader, all logs are shared. Developers and operations all see the same logs. There should be nothing hidden. Inbuilt within Istio service mesh, you can load Jaeger tracing, so I can investigate slowness in an application. It's really helpful for troubleshooting issues.

We've had a problem for a long time with understanding how everything fits together. We used to have a big wall where someone drew how all the apps fit together, but it got outdated very quickly. We built a dependency graph using D3, Kubernetes network policies, and Istio service mesh. The size represents how many apps connect to an app or API. You can see flow, direction, and the squad responsible by color. If any were in a state of alert, it would tell me there was an alert there. You get a true blast radius if there's an issue with an application.
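Two of the ideas in this graph, sizing a node by how many apps call it and working out the blast radius of a failure, reduce to simple graph computations. The sketch below assumes the call edges have already been extracted into (caller, callee) pairs. The app names are invented, and the real tool derives its edges from Kubernetes network policies and Istio rather than a hard-coded list.

```python
from collections import defaultdict, deque


def inbound_counts(edges):
    """Count distinct callers per app (what node size could represent)."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    return {app: len(c) for app, c in callers.items()}


def blast_radius(edges, failed):
    """Return every app that directly or transitively calls `failed`."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    affected, queue = set(), deque([failed])
    while queue:
        app = queue.popleft()
        for caller in callers[app]:
            if caller not in affected and caller != failed:
                affected.add(caller)
                queue.append(caller)
    return affected


# Invented example edges: (caller, callee).
edges = [
    ("website", "search-api"),
    ("website", "pricing-api"),
    ("mobile-app", "search-api"),
    ("search-api", "stock-service"),
    ("pricing-api", "stock-service"),
]

print(inbound_counts(edges))
print(sorted(blast_radius(edges, "stock-service")))
```

A breadth-first walk over reversed edges is used here because the question being asked is "who depends on this app", not "what does this app depend on".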

As part of the Istio service mesh, we have dashboards. We used to have millions of Grafana graphs showing everything. This is all in one place in the platform. If an app in the new environment has an issue, it very clearly shows within graphs, and we can investigate. On the right-hand side, I can see timestamps of when releases are going through a system. If something changes, it's easy to see whether a release occurred and track down what the issue might be.

The next one is cost management. There are two myths with public cloud. One is that if you go to public cloud, you lose control and visibility of applications and it all goes out of control. We've shown that isn't the case. The other is about cost: that costs will spiral. We have true cost management. I can tell a squad how much their monitoring costs and, if they turn debug off, how much money they'll save. When we build containers, we don't say each one has to be three CPUs and six gig of memory. We change it depending on the application. If the application is underutilized, operations or dev can change that in code.
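The point about sizing containers per application rather than using a fixed allocation can be made concrete with a small calculation. The headroom factor, the rounding step, the sample numbers, and the function name are assumptions for illustration only; the talk does not say how Auto Trader chooses its values.

```python
import math


def suggest_cpu_request(peak_usage_millicores, headroom=1.3, step=50):
    """Suggest a CPU request: peak observed usage plus headroom,
    rounded up to the nearest `step` millicores."""
    target = peak_usage_millicores * headroom
    return int(math.ceil(target / step) * step)


current_request = 3000  # a fixed "three CPUs" allocation, in millicores
observed_peak = 420     # made-up peak usage for an underutilized app

suggested = suggest_cpu_request(observed_peak)
print(f"request {current_request}m -> {suggested}m "
      f"(frees {current_request - suggested}m)")
```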

We've got Skipper. Skipper's a bot. It's basically a webhook for Prometheus data, and it prettifies Prometheus data. In the example, the Forecourt Service app is returning a load of 500 errors. We see clearly what that might mean and which container is causing the issue. It's a critical alert, so the ops engineers and Paul, who's a developer, get an alert in Slack. At the bottom is a link to a runbook. It's very important for any alert to have a really clear runbook, aimed at the engineer who might get alerted at 3:00 a.m. and needs to know how to investigate.

Paul is the developer who took ownership of this issue, so there aren't two or three people investigating the same issue. Skipper realized that this app had been deployed over the last half hour and gave a link straight through to the pipeline and the last bit of code that was pushed out. Paul's response in Slack was that he was rolling back the issue.
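A bot like this typically sits behind an Alertmanager webhook and turns the JSON payload into a readable chat message. The sketch below is not Skipper. It assumes the usual Alertmanager webhook shape (an `alerts` list whose items carry `labels` and `annotations`), and the `app` label, the `runbook_url` annotation, the `last_deploys` lookup, the example URL, and the 30-minute window are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def format_alert(alert, last_deploys, now, window=timedelta(minutes=30)):
    """Turn one alert dict into a short, human-readable message."""
    labels = alert.get("labels", {})
    notes = alert.get("annotations", {})
    app = labels.get("app", "unknown-app")
    lines = [
        f"[{labels.get('severity', 'unknown').upper()}] "
        f"{labels.get('alertname', 'alert')} on {app}",
        notes.get("summary", "(no summary)"),
    ]
    # Mention a recent deploy so the responder can check it first.
    deployed = last_deploys.get(app)
    if deployed is not None and now - deployed <= window:
        minutes = int((now - deployed).total_seconds() // 60)
        lines.append(f"Note: {app} was deployed {minutes} min ago")
    # Every alert should point at a runbook when one is configured.
    if "runbook_url" in notes:
        lines.append(f"Runbook: {notes['runbook_url']}")
    return "\n".join(lines)


now = datetime(2019, 10, 28, 15, 0, tzinfo=timezone.utc)
payload = {
    "status": "firing",
    "alerts": [{
        "labels": {"alertname": "High5xxRate", "severity": "critical",
                   "app": "forecourt-service"},
        "annotations": {"summary": "5xx responses above threshold",
                        "runbook_url": "https://runbooks.example/high-5xx"},
    }],
}
last_deploys = {"forecourt-service": now - timedelta(minutes=12)}

for alert in payload["alerts"]:
    print(format_alert(alert, last_deploys, now))
```

Posting the resulting text to Slack is left out on purpose, since that depends on how the workspace and its webhooks are set up.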

A developer pushed out a release. Within minutes, the system noticed an increase in 500 errors and sent an alert to ops and dev. That developer picked it up and rolled it back, all in seven minutes, which is quite a short timeline.

06 People And Skills

Andrew Humphrey: We've just talked about culture and organizational agility. The last of the three things we wanted to touch on is how people can enable our cloud platform migration. We're really lucky to have great, talented, highly motivated people at Auto Trader. But the problem was that we had a big cloud migration we needed to embark on, and we didn't have the skills and experience to do that.

We had to decide how to bridge that gap. We've got great people and a desire to learn new skills in our engineering community, but no experience of doing it, and this is a big new problem. We wanted to use our existing team. We went to different conferences and vendors and spoke to different companies to understand how they'd dealt with this problem. Most of the feedback was quite depressing. Lots of people outsource this problem to a different company, or they hire a new engineering team to build a cloud platform, and then the old teams sit there looking after legacy kit until that dwindles away and they leave the organization. We wanted to think about how we could do that differently.

The way we approached it was to have really open conversations with our teams, to be clear about the reasons we're moving to the cloud, clear about what we thought it would mean for people's roles, and open about the fact that we might not have all the answers. The reaction from quite a few people was concern and skepticism. It's difficult when planning big moves to trust that an organization is going to support you through that. Coming out of some meetings, we had network engineers, including Chris, who'd worked at Auto Trader most of his life, asking, "Have I got a role here anymore?"

We looked outside the organization to hire a couple of people who had experience building public cloud infrastructure and integrated those people into our operations and infrastructure teams. Using brown bag sessions, knowledge-sharing sessions, and boot camps, we started to embed that knowledge and expertise into our teams. People could see how their roles would transition through this period and how their roles could grow. The impact on the team has been really good. People can now see how their role is changing, and they can feel part of something really special.

A couple of quotes: Chris, our network engineer who was worried about his career at Auto Trader, said, "I was really worried about it, but actually this move has enabled me to massively expand my skills and be a more effective network engineer." Another quote is from Shaun, one of our customers, a tech leader of lots of engineers who needs to deliver new products and services quickly all the time. He was happy-ish before, but now he says he can get new products and services delivered in minutes, and all the instrumentation, information, and monitoring he needs comes out of the box, so he knows how his applications are performing.

07 Summary And Ask

Dave Whyte: In summary, the three areas. Culture: you're going to hear it a million times at this conference. The right culture is really important and key to success. It needs to be bought into at all levels, right from the top down; otherwise it's not going to work.

Organizational agility: go from being blockers to enablers. Be bold and courageous around your quest for continuous improvement.

Lastly, people: your employees are your most important assets in a company. Empower them to be part of the journey.

The last slide that Gene wants us to do is what we need your help with. I think we say it as collaboration. This conference is awesome, but there are other little conferences and meetups all over the world. Get involved with them and support your local DevOps meetups. It's amazing the amount of stuff you can learn from those meetups. Lastly, we're here telling a story. Everyone here has got a story, so we're asking you to go to these meetups and share your stories. Thank you.

Andrew Humphrey: Thanks.