The Dawning of a New Era

London 2019

Dave Whyte

Operations Squad Lead · Auto Trader

Russell Warman

Head of Infrastructure and Operations · Auto Trader

Auto Trader is currently in the middle of a journey migrating all apps and services from two dedicated data centres to the public cloud. This talk will cover a brief synopsis of what we do and the need to move to the public cloud, and will give 'warts n all' coverage of what worked and what didn't.

Dave Whyte is an Operations Squad Lead at Auto Trader UK. Auto Trader is the largest digital automotive marketplace in the UK. Over the last 13 years, Dave has played a key part in continually evolving the IT Operations department, which has resulted in enviable levels of stability, availability and performance. This has been achieved through the practical implementation of ITIL processes and adapting to ever-increasing organizational agility.

Russell Warman is Head of Infrastructure and Operations at Auto Trader, the UK's largest digital automotive marketplace. He leads the team charged with the capacity and performance of one of the UK's busiest websites, autotrader.co.uk, which sits at the heart of the UK's vehicle buying process. The team have a cloud native approach to enable our developers to make more frequent changes safely and are increasingly looking at public cloud services to continuously improve.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Gene Kim: Hello. I was so delighted to meet the operations leadership team from Auto Trader. Russell Warman is head of infrastructure and operations, and Dave Whyte is operations squad lead. They will be talking about how, as a company, they're in their third era, which will depend upon a successful migration to the public cloud. But what is so amazing to me is that this is only one aspect of what they view their mission to be. They'll be talking about how, as an infrastructure team, they've been helping elevate developer productivity, enabling faster and more frequent software releases while also enabling safety and happiness, and how they're doing this with their existing ops staff, to help ensure that their ops staff are relevant today. They operate in an incredibly competitive industry, and I think the most visible evidence of how well they've done is Auto Trader's wildly successful 2015 IPO. So please welcome Russell and Dave.

Dave Whyte and Russell Warman

Dave Whyte: Good afternoon, everyone. We're on the final straight, so we'll get cracking on this. Thanks to Gene for the awesome introduction. We're really honored to be representing Auto Trader and speaking at an awesome conference like this. Quick introductions. My name is Dave Whyte. I'm an ops lead at Auto Trader. I've been at Auto Trader for around 14 years. I'm also a co-organizer of the DevOps Manchester meetup. Go, guys. Icebreaker, because I think this is unscripted. We're trying to be very open and honest as a company and tell horror stories as well as good stories. 10 years ago, I brought down our whole website during peak time for an hour. vi and BIND don't go very well together.

Sometimes you make one mistake, and you don't make that mistake again. But we're a very open company and want to tell open and honest stories. I'm joined on stage by...

Russell Warman: I'm Russell Warman. I'm head of our infrastructure and operations squad, and I've been at Auto Trader for about 21 years.

Dave Whyte: So then, we're talking about the dawning of a new era at Auto Trader. It sounds very dramatic. We're making a lot of changes, but the biggest change really is our migration to public cloud. We're currently running in two data centers, with lots of legacy kit. Ambitiously, we're moving everything onto public cloud, or the appropriate public cloud. And we're also not doing lift and shift. We want to take on the benefits that we can get from the public cloud. There'll be literally nothing left in our data centers. That includes killing off our old Oracle databases and everything. So, pretty ambitious. Before we get into the detail of the talk, just a little bit of background on who Auto Trader are.

So we operate the UK's largest automotive marketplace, and at any one time, we've got around 55 million cross-platform visits. We're a FTSE 100 company. As Gene said, we floated in 2015, and late last year we got promoted to the FTSE 100. We've got a market cap of about five and a half billion, and we've got revenues of around 355 million each year. And we do this with 830 employees, and about 220 of those are developers working in multi-discipline squads in the Spotify Tribes and Squads model. And we've got 27 infrastructure and operations people that work in a centralized multi-discipline squad. To give a flavor of our business, basically the core business is connecting vehicle buyers with vehicle sellers.

This is primarily done via our vehicle search platforms. You can search for cars, which are shown in this picture. You can search for bikes. You can search for caravans, and you can search for diggers. Secondly, really important to us is vehicle stock. Lots of vehicle stock with accurate data is crucial. Put simply, if there's no stock, there's no search, there's no business. And the final stage is traditionally we've been known for used cars, and over the last few years, we've done an awful lot to try and give people more choice about their next car, and in particular, showing deals on new cars and giving people the chance to configure and buy their next car online.

So this conference is all about collaboration. So I'm really proud to announce that yesterday Tom from ITV spoke, and he was talking about "Love Island," and we've agreed we're going to show a clip for the next... No, I'm only kidding. Here's a brief video showing some of our history. So I'm really proud of that video. It definitely shows our heritage and history, and the fact that we're not a startup. We do get accused sometimes of being a startup. I'm also really glad it was on before Disney and not after because yeah, that might not have gone as great. This slide shows our eras. We talked at the start about different eras. 1987, magazine era, a very successful magazine company.

We both joined when we were a magazine company. Digital was pretty much the... Yeah, digital. Very successful transition across to a digital company. 2013, 100% digital. No more magazine on the shelves. And we're saying then from 2019, it's our third era. We're becoming, we believe, we've evolved from a digital company to a technology company. The reason for this is we're using data and insight to extend and enhance our platform capabilities.

Russell Warman: And this just gives you a little bit of an indication of where we are. So today, we're still running two physical data centers, but we're across three public clouds, and as Dave said earlier, we choose the right cloud for the right workload. We've got about 375 apps in total, and about 226 of those have already been moved up to the public cloud. Why did we choose to move to the public cloud? Historically, we've run our data centers for about 18 years, and we've always had a great record of performance and stability. But the thing that pushed us to start thinking about cloud was about increasing developers' velocity to release applications. And the first step that we took on that journey was building a private cloud because we felt it was the shortest and most incremental step to get us to becoming cloud native.

And we asked our developers to start re-architecting their applications to think about the cloud, planning for failure and building in that tolerance within their applications. From an operations and infrastructure point of view, one of the things that we worked out while running this private cloud was just the level of complexity and the overhead in managing upgrades and the dependencies between different components. We had chosen an open source product, and increasingly there was less and less support for it in the community, and we didn't have the skills and the ability to be able to do the pull requests, make the changes, and then fix those issues.

And what really brought this to the fore was about 18 months ago, we had a customer requirement to help them meet their GDPR needs to provide end-to-end encryption. So our delivery platform team took this little requirement, and they estimated it was going to take them about six weeks to be able to deliver this. Three months later, they were nowhere. They'd not managed to crack it at all. And we work with a guy that gets super frustrated when things aren't done within a week. And he took the initiative and said, "Look, I think I can fix this using Kubernetes and Istio." The quickest way for us to be able to test that was in the public cloud. So he spiked this, and he got it up and running within two days.

And what that demonstrated to us was just the capabilities that we could get and take advantage of by moving things into the public cloud. So we got agreement last August that we could begin migrating our applications into the cloud. And over the next few months, we got the infrastructure ready, and in December last year, we migrated our first application. And as we stand today, we've got 226 of those applications moved up into the public cloud. And a lot of that has just simply been possible because of the work that we did taking that interim step of building out our private cloud. So we're over halfway on migration and feeling really positive about the progress that we're making.

We believe this progress has been down to three important factors. First one being culture, second one is org agility, and the third is people. All very common themes that we've been hearing in various talks in this conference. So we're going to go through each of these in a bit more detail. So we'll start with culture. And I really like this picture because I think it portrays a strong message. Your culture is your brand. And a great company culture leads to energized and highly motivated teams. Around six years ago, we weren't in that place. It was quite doom and gloom. We were disorganized with no clear brand identity, with no long-term mission or vision, and we'd got multiple teams siloed in multiple offices.

And our structure was quite hierarchical, and our decisions were made about short-term financial benefits rather than thinking about what the long-term benefit for the company should be. And we were also starting to see new competitors springing up. We had a CEO start with us during this period, and he came with a wealth of experience of running and being involved with other digital businesses. And on his first day, he made a real clear statement that there was a change coming when somebody showed him where his office was and he said, "I'm not sitting in that. I'm going to sit in that with those teams." And very quickly, all those managers that had offices within our organization gave them up.

And essentially what he joined was a company that was digital. We didn't print magazines anymore, but we were still acting as a print and publishing business. But we were a business that embraced change, and we weren't afraid of an ambitious deadline or two. After he joined, he formed a new senior leadership team, and they went away for a bit and came up with a clear vision and a mission, and that is to lead the digital future of the UK's automotive marketplace, which is bold and ambitious. And to be able to help us achieve that mission, we knew we had to make some major changes, and one of these was through the introduction of our core company values.

And what was really pleasing about the way that the values were introduced was that they didn't feel like they were just words put on a paper and stuck on a wall. They were all things that we could all relate to, and they were also things that people were already demonstrating or could aspire to. So these are our values, and I'm just going to pick out a couple which really are quite meaningful to me. I think humble is about being open, honest, and approachable, recognizing success in others, but also admitting and learning from our own mistakes. And the second one is about being community-minded. So looking after each other, and thinking of others before we think of ourselves, respecting diversity and advocating inclusion.

And we want to make a difference to the communities in which we operate. And to support some of our decision-making, we came up with a set of guiding principles, and I'm sure all of you are familiar with this format. But those on the left are the things that we favor over those on the right. And in particular, this thing about achieving our mission takes preference over short-term financials. And we extended this within the infrastructure and operations teams. And we added our own. One of those that stood out for me was about fact over opinion, using data, not just assumptions, to inform our thinking. And the final piece of the cultural change that we undertook was moving from all these multiple small offices into two locations in Manchester and London.

And we moved all 700 people from these separate offices into two locations in just a year. In parallel with that, we ran a wholesale IT refresh project. We moved everybody to choose your own device. We integrated tools for collaboration, touch screens with video conferencing, and we went wireless first. And there was lots of consideration that was given to creating the most optimal open plan space for collaboration, even down to the desks that we chose. And I'm not sure if you can see it on here, but we took away dividers between desks. Everybody only got a single monitor. We removed pedestals. Everybody got a locker. And even the desk legs are slightly set back to enable people to pair because we really, really wanted to make a big difference in terms of how people collaborated by breaking down those silos.

Dave Whyte: So the second factor was org agility. And org agility could be a buzz phrase, a buzzword. What does it actually mean? So this is the rather long, full, lengthy definition. I think, put quickly, for us it really means the capability of a company to rapidly adapt to change. We've got a lot of experience rapidly adapting to change. This has been in the form of competitor threats that we deal with rapidly by swarming and taking on competitors. And unreasonable third parties, third parties that believe they can increase their prices to us year on year. Again, we've taken them out of the equation as well when we can. GDPR, rapidly adapting to change. So we believe that we are in a perfect place for this massive cloud migration because we do this day in, day out.

We are quick at embracing change. This is a phrase we use quite a lot. Blockers to enablers. You need to be an enabler, not a blocker. Operations, the area we are in, definitely used to be known as blockers. This was due to taking weeks to build servers, poor workflow management, ticket black holes, an unwieldy change process, all these things that block progress for a company. Over the years, we've turned this around, and this has been achieved by embracing new technology and evolving our processes. A brief example of how we've gone from being blockers to enablers: in the past, if there was an issue with an application and you wanted something quickly, it would take a while to do.

Literally a few weeks ago, we had an application that was going to go live, quite a big app, quite a big rollout. It was booked in. Dealers were aware, and we had people lined up for support. But we realized a day before that the app was causing an issue with the database. So the product team came across and spoke to us, and they were distraught. They thought they'd have to go and communicate to the dealers that it was not going to go ahead. Everything they'd planned was going to go pear-shaped. One of our leads, the lead that Russ had mentioned before, he got involved in the conversation and he's like, "What's the issue?" He explained it very clearly, and all he needed really was a caching layer.

And he said, "I can fix that in an hour." He built it in an hour, spun it up in live, tested it, and there you go. And the product team was like, "Wow." Their faces showed amazement that we'd turned it around so quickly. But truly, we believe we are now enablers.

Russell Warman: This slide really shows how we've achieved this step change in our deployments. So the first stage was, the first couple of years, we were very manual. We had two people executing those releases. The second stage, we moved to auto deployment, so enabling our developers to push their applications into live. You can see where we implemented our private cloud, in about 2017, and we doubled the number of releases again. And then as we've moved towards more continuous delivery and finally in the last year when we've started moving things into public cloud, we've just been able to accelerate that release. And we hit our record earlier this year. We hit 455 releases in a single day.

And this year we're anticipating that we're going to get to 40,000 live releases. So you might think that with a massive increase in releases, there's surely going to be a massive increase in failures and issues, and that's just not the case. On the graph here, a failed release is one where there's a customer impact. The graph clearly shows that the number of customer-impacting releases has actually declined over the years. So this supports our thinking that more frequent releases with less change reduce the likelihood of a customer impact. It also means that if you're making small changes and you come across an issue, it's very easy to either forward fix, which is preferred, or back out very quickly.

Dave Whyte: Now because we're increasing change, we need to have the data from releases. We used to have a forward schedule of change, which was an Outlook calendar, which was a nightmare because you had to book it in. Did the release go? Didn't it? Where's the communication? It just didn't work. So we built an in-house app, which we call Lighthouse. And basically on the left-hand side, you can see the applications. Effectively, every time an application goes through a release pipeline, a signal is sent to Lighthouse for a start, and a signal is sent for a stop. So we can see the application name, you can see a start time, and you can see the duration. This screenshot was taken around 16:42 there.
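As an illustration of the start and stop signals described here, the following is a minimal sketch of how such a release log could be modelled. It is not Lighthouse's actual code, which the talk does not show; the `Release` and `ReleaseLog` names, the in-memory storage, and the sample timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Release:
    """One pass of an application through a release pipeline (hypothetical model)."""
    app: str
    started_at: datetime
    finished_at: Optional[datetime] = None

    @property
    def duration(self) -> Optional[timedelta]:
        # The duration is only known once the stop signal has arrived.
        if self.finished_at is None:
            return None
        return self.finished_at - self.started_at


class ReleaseLog:
    """Collects start/stop signals; an in-memory stand-in for a real store."""

    def __init__(self) -> None:
        self._in_progress: Dict[str, Release] = {}
        self._completed: List[Release] = []

    def start(self, app: str, at: datetime) -> None:
        # Called when a pipeline sends its "start" signal.
        self._in_progress[app] = Release(app=app, started_at=at)

    def stop(self, app: str, at: datetime) -> Release:
        # Called when a pipeline sends its "stop" signal.
        # Raises KeyError if no start signal was seen for this app.
        release = self._in_progress.pop(app)
        release.finished_at = at
        self._completed.append(release)
        return release

    def in_progress(self) -> List[str]:
        # Apps that have started releasing but not yet finished.
        return sorted(self._in_progress)

    def completed_on(self, day: date) -> List[Release]:
        # Finished releases that started on the given day.
        return [r for r in self._completed if r.started_at.date() == day]


# Example with made-up timestamps.
log = ReleaseLog()
log.start("search-service", datetime(2019, 1, 1, 16, 40, 0))
finished = log.stop("search-service", datetime(2019, 1, 1, 16, 41, 31))
print(finished.app, finished.duration)  # search-service 0:01:31
```

Keeping start and stop as separate signals means a release that has started but not finished is visible while it is still running, which is what makes it possible to line up a monitoring issue with whatever is going through the system at that moment.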

That's 153 releases that have gone through for the day. For those with eagle eyes amongst you, the Search 1 service at the bottom, that's taken 52 minutes to deploy. That's on our old platform. But we're actually doing some dual deployments. The sister service, which is the same code, is actually taking 1 minute 31 seconds to deploy to the new platform. So a massive reduction in time. Right-hand side. So our developers all deploy to live. We don't put any blockers in place. All we ask is that they look at this dashboard and check this widget. This widget ties into our monitoring, and effectively if there's an issue with monitoring, it will say Ask.

All we ask them to do if it says Ask is to Slack the ops team and say, "By the way, I'm trying to release this. Can I go ahead?" We won't block you. We'll then work out what it is you're trying to release. If that's got no connection to any of this, yeah, go ahead, do it. And they push a button and release. The bit at the bottom is In Progress, so we can see what releases are going through the system. So basically, if we see any monitoring issues, we can tie it in with what's going through, what's related to the release, when did that start, when did that finish, and we can basically diagnose issues with all this great information. This is a screenshot of our service discovery dashboard.

So this is an example of us rapidly adapting and embracing the new platform's capabilities. It's essentially a service catalog of all our apps in GCP. The way I think about it, it's really a DevOps portal. And I honestly believe this is my holy grail, really, from an ops point of view because it's a frigging awesome tool. Going into a bit more detail. Hopefully everyone can see from the back. There's a list of the applications. I can clearly see the app name. I can see there's an owner because we have a named owner for each application. The reason why we do that is I don't want to be in a situation where you try and track down an app owner because there's an issue.

Oh, they left the company six months ago. Great. Who's responsible now? But what the system does is, we've got a script that checks AD, so if that person doesn't exist anymore, we will get an alert saying, "They're not here anymore. Please update this app to have an owner." Secondly, it's got the squad. It's got a responsible squad as well. So basically anything that happens to this application, I can track down that squad. Thirdly, it's got a tier. We've got three tiers that we're just starting to roll out. A lot of apps are tier three, which means we don't need to be alerting on-call. We're trying to get the balance right between our engineers' work-life balance and what's actually critical to our business.
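The owner check and the tiering described here could be sketched roughly as follows. This is only an illustration under stated assumptions: the `CatalogEntry` fields mirror what the dashboard is said to show (app, owner, squad, tier), the directory lookup is stubbed with a plain set of usernames rather than a real AD query, and treating tiers one and two as the ones that page on-call is an assumption, since the talk only says that tier three does not.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class CatalogEntry:
    """One row of a service catalog (hypothetical shape)."""
    app: str
    owner: str  # a named individual
    squad: str  # the responsible squad
    tier: int   # 1, 2 or 3


def apps_with_missing_owner(entries: Iterable[CatalogEntry],
                            directory_users: Set[str]) -> List[str]:
    # Flag apps whose named owner no longer exists in the directory.
    # directory_users stands in for the result of a real directory query.
    return [e.app for e in entries if e.owner not in directory_users]


def pages_on_call(entry: CatalogEntry) -> bool:
    # Tier three apps do not alert on-call (from the talk).
    # Assumption: tiers one and two do.
    return entry.tier < 3


catalog = [
    CatalogEntry(app="forecourt-service", owner="paul", squad="customer-tools", tier=1),
    CatalogEntry(app="internal-report", owner="someone-who-left", squad="data", tier=3),
]
current_staff = {"paul", "dave", "russell"}

for app in apps_with_missing_owner(catalog, current_staff):
    print(f"{app}: owner is not in the directory, please assign a new owner")
```

Running a check like this on a schedule means a missing owner is caught when the person leaves, not months later in the middle of an incident.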

The right-hand side, we can see tags. So we can see tags on what the application is written in. We're pretty much a Java house, so a lot of Java applications. And if I click on the Jump To, I can see the URLs. I've been in a situation where a developer's come up to me and said, "What's the URL for this app? So-and-so has left. I don't know what it is." And how am I meant to know what that is? Here, it's all in there automatically. We've got all the URLs, links to pipelines, link to source code, link to service metrics, links to logs. DevOps, everyone sees the same logs, no exceptions. Metrics, JVM metrics, tracing, all built in from ground up.

The second thing is, I think this is really cool as well. Andy yesterday showed us Sky Betting's Monkey Bot. This is our bot called Skipper. We quite like the maritime theme, as in Docker, Skipper, shipper. We go for that sort of theme. Skipper, essentially a webhook receiver for Prometheus alerts. It enriches the data and forwards them on. It's another example of us rapidly adapting to a new platform. All alerts get fired into an ops-managed channel in Slack, and critical alerts go to PagerDuty. Go for a bit more detail. So you can see at the top here, this app is forecourt service, which is returning an increase in 500 errors. So good information there.
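The routing rule just described (every alert to Slack, critical alerts also to PagerDuty) could be sketched as below. This is not Skipper's code. It assumes the incoming payload follows the general shape of an Alertmanager webhook body, with a top-level `alerts` list whose items carry `labels` and `annotations`; the `app` and `severity` label names, the `owners` lookup, and the destination strings are assumptions for the example, and nothing is actually sent anywhere.

```python
from typing import Any, Dict, List, Tuple

Alert = Dict[str, Any]


def route_alerts(payload: Dict[str, Any],
                 owners: Dict[str, str]) -> List[Tuple[str, Alert]]:
    """Enrich each alert and decide where it should go.

    Returns (destination, enriched_alert) pairs instead of sending anything,
    so the routing decision can be tested on its own.
    """
    routed: List[Tuple[str, Alert]] = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        app = labels.get("app", "unknown")
        enriched = {
            "app": app,
            "severity": labels.get("severity", "warning"),
            "summary": annotations.get("summary", ""),
            # Enrichment: attach the named owner from the catalog, if known.
            "owner": owners.get(app),
        }
        # Every alert goes to the ops-managed Slack channel.
        routed.append(("slack", enriched))
        # Critical alerts additionally page on-call.
        if enriched["severity"] == "critical":
            routed.append(("pagerduty", enriched))
    return routed


example_payload = {
    "alerts": [
        {
            "labels": {"app": "forecourt-service", "severity": "critical"},
            "annotations": {"summary": "Increase in 500 errors"},
        },
        {
            "labels": {"app": "internal-report", "severity": "warning"},
            "annotations": {"summary": "Slow responses"},
        },
    ]
}

for destination, alert in route_alerts(example_payload, {"forecourt-service": "paul"}):
    print(destination, alert["app"], alert["severity"], alert["owner"])
```

Separating the routing decision from the delivery keeps the rule easy to test without touching Slack or PagerDuty.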

It's a critical alert, so it alerts the ops engineers in the Slack ops channel, and the service owner. So Paul, normally, is a developer in the customer tools squad. There's a link to a runbook. It's very good having a runbook, especially for the engineer that's trying to fix something at 3:00 a.m. They've got very clear instructions on what to do. I can see this alert's actually already owned by Paul. It'll normally say Click to own. So Paul's acknowledged this alert. If we want to, we can silence it. We rarely do that. Lastly, you can see at the bottom there are two replies. So the first reply is from Skipper again. Skipper's noticed that there's been a deployment to forecourt service in the last 30 minutes, adding the pondering emoji.

It also gave us the link to the pipeline and source code. The last reply is from Paul, who's rolling it back. Paul is a developer. So another angle, the timeline for this incident. This is effectively what it's going through. The incident starts just after 3:31 and it's fully resolved by 3:38. Seven minutes. Bear in mind, from what you've seen before, on our old platform it's about 50 minutes to deploy. This is deploy, spot an issue automatically, and the developer backs out, all within seven minutes. And for me, this incident was owned and resolved by a developer. No ops involvement. Caveat, this should not have gone live in the first place. Clearly, the developer hasn't put the right checks in place because it shouldn't have passed, shouldn't have gone live.

But I still think it shows a great example of a really quick response.
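The reply in which the bot points out a deployment to the alerting service in the last 30 minutes amounts to a time-window lookup against the release data. A minimal sketch, assuming a simple list of finished deployments is available; the `Deployment` type, the function name, and the sample times are hypothetical, and only the 30-minute window comes from the talk:

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Deployment(NamedTuple):
    app: str
    finished_at: datetime


def recent_deployments(app: str,
                       alert_time: datetime,
                       deployments: List[Deployment],
                       window: timedelta = timedelta(minutes=30)) -> List[Deployment]:
    # Deployments of the same app that finished no more than `window`
    # before the alert fired (and not after it).
    return [
        d for d in deployments
        if d.app == app and timedelta(0) <= alert_time - d.finished_at <= window
    ]


# Made-up history: one recent deploy, one old deploy, one unrelated app.
history = [
    Deployment("forecourt-service", datetime(2019, 1, 1, 3, 25)),
    Deployment("forecourt-service", datetime(2019, 1, 1, 1, 0)),
    Deployment("search-service", datetime(2019, 1, 1, 3, 20)),
]
suspects = recent_deployments("forecourt-service", datetime(2019, 1, 1, 3, 31), history)
print(suspects)
```

A match does not prove the deployment caused the alert; it only gives whoever owns the alert an obvious first thing to check.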

Russell Warman: And this gives us an indication of how we manage our costs in the cloud. So often you hear people talk about app sprawl, and Dave's already demonstrated how we tie an app to an owner. And basically, this gives us great insight into what it's costing us to run those applications. You can see where we've underutilized CPU, and where we've got our RAM configured just right. So we're able to right-size resources based on this, and each squad can do that. So the last factor is people. Our people are our greatest assets. We have a lot of happy, smiley, talented people who embrace change. However, we knew that we were missing certain skills that are needed to be able to complete the migration and, importantly, to support it once we were actually using it.

When we wanted to make this change, we wanted, as Gene said, to use the people that we'd already got within our teams. And I'm sure this is the same for many of you, that attracting and recruiting talent is super difficult. All of our engineers had the desire, but what they lacked was the skills and experience. So we wanted to come up with a way to bridge the gap. And we started engaging with similar companies, our suppliers. We attended conferences to try and find the answers to how we were going to do it, but we just couldn't find anything that inspired us. There was plenty of depressing stories around companies that had spun up a new infrastructure team to build out new cloud capabilities and then they'd let go of the legacy teams, and this wasn't the approach that we wanted to take.

So our approach was to have honest conversations with our squads. We were clear about the reasons behind moving everything to the public cloud, and open about the fact we didn't have all the answers. Initially, some people were skeptical and frustrated about the decision. A lot of this was due to concerns about the future, and one of our network engineers, we do say jokingly, literally said to us, "When am I going to get my P45? Because clearly I haven't got a future here." One of the key things that we did was supplement the team with a few key hires, bringing in experience of building on public cloud platforms. And then with their help, we ran bootcamps and brown bag sessions to help those engineers learn new skills.

And over time, they've organically seen where their role will transition. And if anyone does raise any risks or concerns, we'll deal with them as they come up. So far, nobody has left us due to the migration. If anything, they've really embraced the change because they feel that they're part of something special. So Gene's asked us for a few quotes. This first quote is from that network engineer who was literally asking about his P45. He writes quite a lot, but to focus in, he was initially worried there'd be no job for him, but quickly realized there was a massive opportunity to expand his skill set and become a more rounded network engineer. And as Tom said yesterday, there's no point in having a great platform if it doesn't meet the needs of your customers.

So here's some feedback from one of our customers, and he's saying that it's simple and super fast to be able to deliver his app. It's moved from an hour and a half down to two minutes in one click. And out of the box, he's getting 90% of all the monitoring that he needs. So in summary. Here's just a few takeaways. Having the right culture is an important key to your success. Your culture is your brand, and it requires buy-in at all levels. Everyone here has seen how important it is to play a part in a healthy DevOps culture, and we see and hear of companies that have seen the DevOps seed sprouting, but it's not allowed to grow and succeed. Around org agility, go from being blockers to enablers.

Are you aware of a task, a process, or something that you know is blocking agility? Do something about it. Become an enabler. Also, be bold and courageous around your quest for continuous improvement. And lastly, your people. They are your most important assets. Listen to them and empower them to be part of the journey.

Dave Whyte: So Gene's asked for what we need your help with. I think for me, when I thought about it, it was simply collaboration. We are a very open company. If anyone wants to get in touch with us and discuss our story or share anything or speak to us, please feel free to get in touch. Secondly, definitely representing Manchester meetups. There's loads of local DevOps meetups all wanting to hear your stories, so please reach out to them and share your stories. Thank you.