DevOps Journey at adidas III: Exploring Data in the Cloud

London 2020

Fernando Cornago

VP, Platform Engineering · adidas

Daniel Eichten

VP, Enterprise Architecture · adidas

Team adidas comes back this year to describe the consolidation of DevOps and SRE practices across the whole IT department, now renamed Technology.

Fernando Cornago, who has taken on more responsibility and whose scope is extending towards Cloud and Connectivity, brings Daniel Eichten with him this year. Heading up Enterprise Architecture, Daniel will talk about adidas’ Cloud and Data Strategy.

Last but not least, gamification is always a topic at adidas, and we’ll see how it gets implemented all over business and tech.

Full transcript

Host Intro (Gene Kim)

And now it's time for our first talk. Three years ago, the team from adidas attended this conference and left inspired by the presentation from Jason Cox of Disney. The next year, Fernando Cornago presented on the plenary stage with his VP, Markus Rautert. Last year, he presented with Benjamin Grimm, who drives the product vision for their entire multi-billion euro e-commerce channel.

And I'm so delighted that Fernando is presenting again because the last year has been very exciting for him. He was promoted to VP of Platform Engineering, and he moved his family from Spain to the adidas headquarters in Germany.

This year, he's presenting with Daniel Eichten, VP of Enterprise Architecture. They will be describing the continuation of their journey, including what it's been like for Fernando to be given responsibility for all of infrastructure and operations, even as a career dev leader, and how they're transforming the data architecture for the entire company. Please welcome Fernando and Daniel.

Fernando Cornago

So thanks a lot, Gene, and hello, everyone. My name is Fernando Cornago, head of the platform engineering team for adidas, and it's an honor to be here for the third year in a row, this time remotely.

Today is all about the consolidation of our model. At adidas, we started our DevOps transformation very early, and now it's all about spreading it globally. For this, I brought with me my colleague and brother from different parents, and to be honest, I'm not lying if I say that he's the most technical, brilliant mind that I've ever met, Daniel Eichten.

Daniel Eichten

Thank you. Well, it's an honor to be here, although it's just virtually. And I was joking to Fernando, if he doesn't take me the third year in a row, I will not be friends with him anymore, and I will not help him with the move. Without any further ado, to Fernando.

Fernando Cornago

Okay. Thank you, Daniel. So let's get started. Okay, you know our logo, our motto. We have more than 400 million lines of code at adidas. But you also know the fancy videos that we typically use to start our presentations. This year, we don't have one. Black Lives Matter, period. It's so obvious in 2020 that we don't need to say anything else.

So let's get started. As a company, you can see some of our figures. Last year we did more than 23 billion in revenue. We operate in every market, in every region, and we do this with a total of almost 60,000 employees around the globe.

And at adidas, it's all about our products. Everything is about our products as assets, including technology. Since the end of last year, we have started our global product-led transformation for IT, even renaming IT itself: we are now called Tech, Technology.

As a company, we are a complex company. We cover the full product life cycle of our physical products, from ideation, design, planning, manufacturing, supply, selling, B2B, B2C channels, and this is reflected in our product domain map that you see in the picture.

This transformation allowed us to push more responsibility, ownership, and partnership for technology into the business, and we are convinced that this will create empowered, accountable teams focused on value, while at the same time driving simplification and innovation into the company.

Does this bottom part resemble some of the five unicorn ideals, Gene? I'm sure it does, right? On top, this operating model allows us to measure products by happiness, value, quality, reliability, and flow. Thanks to flow, we are convinced that we will be able to detect the bottlenecks and drive our investments better as technology. For this, I want to thank Mik Kersten personally, and all his team, for all the conversations around implementing flow that we have had within the last year.

You saw all the products. We also categorize these products into three different types, three different levels. You see on the top the experiences and touchpoints, where the key is really speed, fast reaction, innovation. Then you see in the middle layer our core products, where the outcome is basically data. With these products, we abstract experiences from the complexity of a company that is 70 years old and has more than 1,500 IT systems. The key factors when working on these products are quality and scalability. 80% of the job is happening behind the curtains. Experiences come and go, but the work that you do in the core and the platforms stays.

And at the bottom, which is my field, are the platform products. Yes, we really encourage all of you to manage your foundations as products too. IT for IT should take care of their customers the same way and apply all the practices that the rest of the products are applying.

And how are the platform teams engaging with our users? The picture that you see here may look familiar to everyone who has read the fantastic Team Topologies book by Matthew Skelton and Manuel Pais. I would say, how should you manage? It depends.

So some teams, as you see in the picture, cover critical capabilities that we decided to decentralize for the sake of speed. These teams typically engage with the users through enabling or collaboration models. You see the API teams, the CI/CD, the fast data, the streaming teams. They help others to build their own experiences and master the capability themselves. Other teams still provide centralized end-to-end services, for the sake of efficiency, because demand is small, or simply because there's a big technical complexity below. The cognitive load, or the amount of information that a team can handle, is a critical factor to always consider when you are designing your organization.

And last but not least, do we have a playbook? Do we really tell our teams how to function? We give our teams a couple of suggestions and a couple of basic rules. On the left, you can see the rule for how they should spend their time. We always tell them to continuously decrease, through automation, the time that they spend on manual operations. And the rest of the time, spent on value creation, we tell them to split 50-50 into the next two buckets: first, evolving the platform so that it doesn't become obsolete, and second, consulting, working with the users so they can see first-hand how the platform feels from the outside and keep the community alive, which will drive standardization and speed.

And the other basic rule that we give to the platform teams, which you can see on the bottom, is how to measure themselves. We are a product, so we need to measure our value. The platform value that we measure is our adoption, so the number of people and products that are using our technology, and the NPS, how happy they are with the service that we are providing.
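The two platform-value measures mentioned here, adoption and NPS, can be sketched in a few lines of Python. This is only an illustration, not how adidas computes them: the function names are hypothetical, NPS uses the usual convention of promoters (ratings 9 to 10) minus detractors (ratings 0 to 6) on a 0 to 10 survey scale, and expressing adoption as a share of all products is an assumption, since the talk only mentions counting people and products.

```python
from typing import Iterable


def net_promoter_score(ratings: Iterable[int]) -> float:
    """Return the NPS (-100 to 100) for a set of 0-10 survey ratings."""
    scores = list(ratings)
    if not scores:
        raise ValueError("at least one rating is required")
    if any(r < 0 or r > 10 for r in scores):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(1 for r in scores if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in scores if r <= 6)  # ratings of 0 to 6
    return 100.0 * (promoters - detractors) / len(scores)


def adoption_rate(products_on_platform: int, total_products: int) -> float:
    """Return the share of products using the platform (assumed definition)."""
    if total_products <= 0:
        raise ValueError("total_products must be positive")
    return products_on_platform / total_products
```

With made-up survey data, `net_promoter_score([10, 10, 9, 8])` counts three promoters and no detractors out of four answers.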

IT for IT is hardly ever measured by direct business value. Where you really need to measure the business value is in the evolutions that you make to the platform. Once you reach a certain level of users, every single evolution that you make in the platform creates a huge impact on a lot of teams. And of course, on the right, you can see that we apply the same operational metrics as the rest of the teams. There's flow time, because you don't want to be the bottleneck, and availability. And you will see more later about availability and about the business loss that issues create.

So this is the overall framework. And then corona came to our lives. And how did our way of working help us to go through the crisis? First thing was focus, cutting cost and avoiding cash out wherever possible due to the big uncertainty that we had three months ago.

We really used our product domain map to visualize our key drivers and move our resources and investment to where we were creating business. In this case, that was only our digital e-commerce ecosystem. It was almost our single open store. As my boss, Markus, always said, our investment model looked pretty much like Thor's hammer. So very thick on the top and very thin on the bottom, which is okay, but we all know that it's only sustainable for a short period of time.

By the way, by moving all this focus to one area, we really verified that our platform strategy and common technology toolset was the right one, because it allowed us to move teams from one area to another, or to move scope to teams, which we always prefer over moving teams to scope, and with really short ramp-up times.

Next thing after corona. So let's look into efficiency levers that take a little bit more time and more effort. Can we really do the same for less? Can we be more efficient? We looked at the product domain map from a different angle, in this case, for the run cost. So while the build cost that you saw earlier resembled a hammer, the run budget really looked more like a kettlebell that we use for exercising, really very, very heavy at the bottom.

And essentially that was because of the way we have traditionally managed our infrastructure cost, where the products that are driving usage have really low visibility of the cost that they are creating, because the budget has always been managed centrally. We are challenging this by implementing technology business management, which will help us to visualize the total cost of ownership per product, the real cost that each product is creating. We are convinced that it will boost a cost-awareness culture across the company. Imagine all the things that we can do in the IT management team with value, cost, and flow really measured at the product level.

But it's all cost, cost, cost, and we know that DevOps is not only about cost, right? We truly believe that DevOps also comes with value generation. As I told you, our digital channels have been our single open store for almost three months due to coronavirus. So the company doubled down on e-com, and we even increased the initial challenging target of four billion in revenue a year on this channel to 4.5, or now we are even talking about 4.7, which will be an increase of around 30 to 40% year over year.

And the biggest pain for our digital ecosystem is outages. Apart from damaging the brand and our relationship with the consumer, they really create revenue loss. So we checked with our digital analytics team, and in the last months before this initiative, we were losing almost a million euros a month in revenue because of outages.

This revenue loss, together with the percentage of defects leaked to production, became the killer KPIs that we chose when we launched the initiative that we call Digital Experience Excellence, where for the last quarter we put our platform engineering teams, focused on engineering enablement and developer productivity, to work together with the e-com team.

So we wanted to be in the elite. So we measure against the best, against the elite. We need to be there. And our time to restore, as you can see here, lately 4.5 hours, was really not looking right or not looking as expected. So we decided to attack this metric. For this, we created a framework with four streams, each led by one e-com person and shadowed by one of our platform experts.

We implemented site reliability engineering practices at scale. We revisited our QA strategy and our end-to-end experience testing. And let's not forget, we are talking about a big area here. It's more than 1,000 engineers working in our e-com ecosystem on a daily basis. And last but not least, we also reviewed the release management practices followed in the area, where we may have gone too far into the DevOps stream. We found out that 62% of the outages were really caused by changes.

Of course, like everything that we do, all these streams were actionizing KPIs that were contributing to the two killer KPIs: net sales loss and defect leakage to production.

To do all these collaborations, we decided to move from our digital ecosystem to our dojo, our space dedicated to learning and experimenting. In platform engineering, you saw our collaboration models before. We had struggles in the past with defining the scope and duration of our collaborations. So we sometimes found our people stuck in a project instead of developing platforms or spreading platform usage across different areas.

So now with dojos, we always start with a problem statement, the KPI you want to actionize, and how the capabilities of both teams getting into the dojo can help each other. Our learnings: this is a peers game. The platform engineering team is not the smarty-pants that comes to tell a product that everything is wrong. Dojos help to combine the strength of the platform team, with its deep knowledge of a technical matter, with that of the team that is owning and living the product on a daily basis. And of course, last but not least, they need to be time-bounded and value-based. Please set clear expectations and a time limit in which to achieve them.

The results so far are amazing. So as you can see in the graphs, we have decreased drastically both mean time to restore and mean time to detect, and also the mean time between service incidents. And even if you see on the top right that the revenue loss has a small spike in May, this is because of two outliers that were caused by external services, in our case, by payment providers. Because let's face it, nowadays with corona, every single digital online service is struggling with increased demand. Our learning there, anyway, is that we need to protect ourselves better and react better to outages from third parties.
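As a rough illustration of the three operational metrics named here, the sketch below derives them from a list of incident records. The `Incident` fields and function names are hypothetical, and measuring both detection and restoration from the start of the outage is an assumption; the talk does not say how adidas defines the intervals.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    started: datetime    # when the outage began
    detected: datetime   # when monitoring or a person noticed it
    restored: datetime   # when service was back to normal


def _mean_delta(deltas: Sequence[timedelta]) -> timedelta:
    """Average a non-empty sequence of timedeltas (raises StatisticsError if empty)."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mean_time_to_detect(incidents: Sequence[Incident]) -> timedelta:
    return _mean_delta([i.detected - i.started for i in incidents])


def mean_time_to_restore(incidents: Sequence[Incident]) -> timedelta:
    return _mean_delta([i.restored - i.started for i in incidents])


def mean_time_between_incidents(incidents: Sequence[Incident]) -> timedelta:
    """Average gap between the start times of consecutive incidents."""
    starts = sorted(i.started for i in incidents)
    if len(starts) < 2:
        raise ValueError("at least two incidents are required")
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return _mean_delta(gaps)
```

Tracked per month, a shrinking time to detect and time to restore is the kind of trend the graphs in the talk describe.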

And with that, I'd better stop talking, and I leave you with Daniel and the adidas architectural vision. None of the things I've told you would have been possible without it.

Daniel Eichten

Yeah. Thank you, Fernando. I would now say for something completely different, but actually it's not different because we are just talking about how can we underpin everything that we heard earlier with our updated cloud strategy.

And when I was looking around for our cloud strategy 2.0, and looking for a picture that can represent it, I found that nice picture, which I really like. But then looking at it for some time, I said, "Hmm, that's maybe a little bit too intimidating." So we shouldn't really use it. And actually, also, our cloud strategy 2.0 is not actually 2.0, it's more like cloud strategy 2.1.4 or whatever our wiki gives us as a versioning scheme. And it looks already quite nicer, like a warm summer day.

So what are the building principles for our updated cloud strategy? Our first cloud strategy was really oriented around avoiding lock-in, preparing for a potential Clexit, or cloud exit, and giving our engineers the same developer experience in all of our areas. Although all of these were good principles and really meaningful, we found that there is a problem with them.

Because when you look into it, out of our great ambition to avoid vendor lock-in, we created another lock-in, which is our tech stack choice lock-in. So we locked ourselves into containers, we locked ourselves into Kubernetes, we locked ourselves into Jenkins, et cetera. That's something that we can still perfectly live with today, but it may not make very much sense, looking back at all of these metrics, when we say we should focus on creating value.

So what is our secret key to avoid the vendor lock-in? Well, it's pretty simple for us. There isn't any. So we deal with it differently, because now we really pick lock-ins that we love, I would call it, like the love locks you see in this picture.

So what do I mean by lock-ins that we love? You might be asking yourself what can be beneficial about a lock-in. Well, there are a couple of benefits. They bring in standards, even though they might be just de facto standards. Knowledge is usually broadly available on the market. So if we bring in new talent, there is a good chance that they already master the kind of skill sets that we are looking for.

And just thinking about de facto standards. If I tell you now photo editing, you might have the one or two choices in your mind that everyone else has in mind. The one is about the magician, the other one is from this big company starting with an A, but you get the idea.

So in our new cloud strategy, we simply accept the fact that it's not that super easy to move workloads from left to right, but we gain a lot of the benefits of making use of higher-level services. So we are not wasting our time anymore with spinning up basic systems. We really go directly deep in and create value.

And, actually, I was lying to you a little bit, because there is still a secret weapon to avoid vendor lock-in, which looks a little bit like this. The only way we do it, if there is really a good reason to move things from left to right, is to destroy it and redo it from scratch in the newer stack or the newer technology.

Talking about technology: our technology vendors. So obviously we also work with the big hyperscalers, with AWS, with Azure, with GCP. And by our history, the AWS usage is more on the consumer-facing one. That's where we are very prominent. Azure is something that we can make use of for our employee-facing applications, managed business or employee productivity. And GCP is just a new member to the group. At the moment, it's very thin and special purpose, so maybe we talk about this next year. But I'm very excited to have them on board as well.

So this is what we call a multi-cloud strategy. And obviously we also still have our own data centers on-prem, as Gaia-X is at the moment nothing really more than an architectural idea or a concept. There is still need for that one because we still have people in the company who say, "You are not supposed to put everything onto the public cloud."

So how do we now decide what goes on public cloud, what goes on-premise, and what are the typical workloads that we have? Let me start with this quadrant. It might look a little bit like some other quadrants that you are aware of, where the leaders are also in the upper-right corner, and this is also where our key areas are. That's the big e-com, that's our big data, where we really leverage the benefits of modern cloud platforms. Right now, that makes up 25% of all workloads.

The second area is all of the new cloud-native workloads that we have on-prem and that we keep there, where we have some data gravity or where data privacy principles don't even allow us to move them to a public cloud provider. That's roughly 10%. Then there is still a huge area of legacy workloads where we stay on-prem, for a good reason as well. This is more the classic infrastructure setup, for example our warehouse management solutions.

And then we are also making use of the cloud for some legacy workloads, specifically in areas where we don't have our own data center and it would be more expensive to build one than to just pick these services.
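The four quadrants described above can be read as a small decision rule. The sketch below is one possible encoding, with hypothetical names and simplified yes/no inputs; real placement decisions would weigh cost and other factors that the talk only hints at.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    CLOUD_NATIVE_PUBLIC = "cloud-native on public cloud"
    CLOUD_NATIVE_ON_PREM = "cloud-native on-prem"
    LEGACY_ON_PREM = "legacy on-prem"
    LEGACY_PUBLIC = "legacy on public cloud"


@dataclass(frozen=True)
class Workload:
    name: str
    cloud_native: bool
    data_gravity_on_prem: bool = False    # large datasets already live on-prem
    privacy_forbids_public: bool = False  # data privacy rules block public cloud
    own_data_center_nearby: bool = True   # is there an own data center to host it?


def place(workload: Workload) -> Placement:
    """Map a workload to one of the four quadrants (simplified assumption)."""
    if workload.cloud_native:
        if workload.data_gravity_on_prem or workload.privacy_forbids_public:
            return Placement.CLOUD_NATIVE_ON_PREM
        return Placement.CLOUD_NATIVE_PUBLIC
    if workload.own_data_center_nearby:
        return Placement.LEGACY_ON_PREM
    # Legacy workload in a region without an own data center.
    return Placement.LEGACY_PUBLIC
```

For example, `place(Workload("e-com", cloud_native=True))` lands in the public cloud quadrant, while a warehouse management system modelled as a legacy workload with a data center nearby stays on-prem.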

And how do we make that available to our developers? It's pretty simple. We create a nice landing zone. So it starts with very locked-down cloud defaults, which are secure, which are compliant, but we also make it as easy and as convenient as possible for everyone to use.

And for this one, we actually stole, okay, let's say adopted some other people's nice ideas. And we're doing everything through Git. So GitOps is anyhow something that we practiced before. But when we create new cloud accounts, or even asking for them, everything is now also run through Git. If you now want to have a new cloud, a linked account, you go to Git, you fork it, you send in a pull request, it's getting reviewed, it's getting merged, and if the merge is accepted, it's also directly being published into the cloud and you get your keys.

For the less experienced users, and we have some, we just now have to build a UI which is integrating to Git rather than using Git and configuration files directly. And that's also pretty simple and easily doable. So the request form is now directly integrating with Git, and then the rest of the process stays as is.
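In a Git-based request flow like this, an automated check typically runs on the pull request before a human reviews it. The sketch below shows what such a check might look like. The file format (JSON), the field names, the allowed environments, and the naming rule are all invented for illustration; the talk does not describe the actual request schema.

```python
import json
import re
from typing import List

# Hypothetical schema for an account request file submitted via pull request.
REQUIRED_FIELDS = ("account_name", "owner_email", "cost_center", "environment")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}
ACCOUNT_NAME_PATTERN = re.compile(r"[a-z][a-z0-9-]{2,30}")


def validate_account_request(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(request, dict):
        return ["request must be a JSON object"]

    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in request]

    name = request.get("account_name")
    if name is not None and not (isinstance(name, str) and ACCOUNT_NAME_PATTERN.fullmatch(name)):
        problems.append("account_name must be 3-31 lowercase letters, digits, or hyphens")

    env = request.get("environment")
    if env is not None and (not isinstance(env, str) or env not in ALLOWED_ENVIRONMENTS):
        problems.append(f"environment must be one of {sorted(ALLOWED_ENVIRONMENTS)}")
    return problems
```

A UI for less experienced users could generate the same file and open the pull request on their behalf, so both paths go through the same validation and review.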

Now I have to say, all this is valid for our setup in AWS because it's rather big. For the other cloud providers, like Azure and GCP, we took another strategy. We just say, "Leave the doors open. You can have whatever you want. Here are the keys. But be aware there will be a watchdog looking after what you are doing." And this watchdog actually is also helping us to nail down compliance to a metric. And there we are again on measuring everything in a metric.
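One simple way to turn watchdog findings into a compliance metric is the share of rules an account passes. The sketch below assumes that approach. The rule names and account settings are hypothetical and do not refer to any real cloud provider API or to the tool adidas uses.

```python
from typing import Callable, Dict, List, Mapping

# Hypothetical rules: each takes an account's settings and returns True if compliant.
Rule = Callable[[Mapping[str, object]], bool]

RULES: Dict[str, Rule] = {
    "storage_encrypted": lambda acct: acct.get("storage_encrypted") is True,
    "no_public_buckets": lambda acct: acct.get("public_buckets") == 0,
    "mfa_enforced": lambda acct: acct.get("mfa_enforced") is True,
}


def compliance_score(account: Mapping[str, object], rules: Mapping[str, Rule] = RULES) -> float:
    """Return the percentage of rules that the account passes."""
    if not rules:
        raise ValueError("at least one rule is required")
    passed = sum(1 for check in rules.values() if check(account))
    return 100.0 * passed / len(rules)


def failed_rules(account: Mapping[str, object], rules: Mapping[str, Rule] = RULES) -> List[str]:
    """Return the names of the rules the account violates, sorted alphabetically."""
    return sorted(name for name, check in rules.items() if not check(account))
```

Missing settings count as violations here, so an account that reports nothing scores zero rather than passing by default.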

And this brings us to the next thing that we are doing from an architectural point of view. We are implementing architectural fitness functions across the different products and across the different product domains.

So in general, what does that mean? If you attended also last year's conference, you saw this nice presentation of our improvements that we did with the site speed. And you see on the left-hand side how it looked afterwards, and on the right-hand side how it looked before. So it's a good indicator already for some improvement.

You heard earlier from Fernando that we are measuring our mean time between failures, our mean time to detect, our mean time to recover. I just hope this bus driver had a good disaster recovery strategy. But we are also measuring, where we can, financials. We are measuring the cost of outages, we are measuring the cost of running the service, and we are measuring, where possible, also the benefit of deploying a feature. And we make all of that available in our central dashboard, our global metrics portal, which we have already had for quite some years. I was trying to find a picture without a brand name, but I think I was quite unsuccessful, because most people will recognize what brand that is.
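An architectural fitness function is, in essence, an automated check of one measurable characteristic against a target. The sketch below shows a minimal version under that reading. The metric names and thresholds are made up for illustration and are not adidas targets.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class FitnessFunction:
    metric: str             # key to look up in a product's metrics
    threshold: float        # target value for that metric
    higher_is_better: bool  # direction in which the metric should move

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Hypothetical targets, chosen only to make the example concrete.
FITNESS_FUNCTIONS = (
    FitnessFunction("availability_percent", 99.9, higher_is_better=True),
    FitnessFunction("time_to_restore_hours", 1.0, higher_is_better=False),
    FitnessFunction("page_load_seconds", 2.0, higher_is_better=False),
)


def evaluate(metrics: Dict[str, float],
             functions: Sequence[FitnessFunction] = FITNESS_FUNCTIONS) -> Dict[str, bool]:
    """Evaluate each fitness function; a missing metric counts as a failure."""
    return {
        f.metric: f.metric in metrics and f.passes(metrics[f.metric])
        for f in functions
    }
```

Running the same set of functions against every product gives comparable pass or fail results that can feed a central dashboard.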

But let's also talk about one tiny failure that we had, and that is our data lake. Looks beautiful, huh? Well, I picked the picture of a golf course on purpose. Why? Because not only are they always super beautiful, but everything there is quite artificial, though created to look natural. And you have a couple of greenkeepers, usually a small group, who are taking care of that golf course. And you also allow access only in a very limited way.

And this is actually where I say we failed, because if I now translate that picture, we had our greenkeepers trying to fill that lake with the cleanest and purest water that you can find and always ensure that it has drinking quality. Well, guess what? It became a bottleneck, because they were trying to fill that lake with a garden hose. Their bandwidth was very limited. It's just a single team, right?

And the result of that one is that other teams who really had the demand were creating data buckets left and right, as well as some very playful puddles of data used by the data scientists. But obviously, that was creating some other issues again, as we had no central visibility on all of the data that is available, and also no way to make that available to everyone, right? So it could be that we have some very meaningful data already available, but some others recreated it.

And this is how we came to make use of what we now call a data mesh, which some other people refer to as DATSIS, referring back to that Martin Fowler article, which says a data product has to be discoverable, addressable, trustworthy, self-describing, interoperable, and secure.
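The six DATSIS properties can be turned into a checklist that each data product is tested against. The sketch below is one hypothetical mapping from properties to concrete fields. The field names, the list of formats, and the rule for what counts as trustworthy are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed set of widely readable formats; a real list would be agreed centrally.
OPEN_FORMATS = {"parquet", "avro", "json", "csv"}


@dataclass
class DataProduct:
    name: str
    catalog_entry: Optional[str] = None      # discoverable: listed in a catalog
    address: Optional[str] = None            # addressable: stable location or URI
    owner: Optional[str] = None              # trustworthy: someone is accountable
    quality_checks_passing: bool = False     # trustworthy: quality is monitored
    schema: Optional[Dict[str, str]] = None  # self-describing: column name to type
    data_format: Optional[str] = None        # interoperable: uses a shared format
    access_policy: Optional[str] = None      # secure: access is governed


def unmet_properties(product: DataProduct) -> List[str]:
    """Return the DATSIS properties that the data product does not yet satisfy."""
    checks = {
        "discoverable": bool(product.catalog_entry),
        "addressable": bool(product.address),
        "trustworthy": bool(product.owner) and product.quality_checks_passing,
        "self-describing": bool(product.schema),
        "interoperable": product.data_format in OPEN_FORMATS,
        "secure": bool(product.access_policy),
    }
    return [name for name, ok in checks.items() if not ok]
```

A check like this lets many teams publish data products independently while a central view still shows which ones meet the shared bar.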

But the biggest learning that we took out of that one is that it wasn't really a failure in terms of technical setup, in terms of architecture. It was a failure in terms of organizational setup, because we just gave one team that one single target. And now we are using that to our benefit. We are reversing Conway's Law and changing some organizational setups with different incentives to achieve the target that we actually want, which is that the data is available to everyone. And with that change, it's pretty clear for us as well: we know that change is a team sport. And with that, I give back to Fernando.

Fernando Cornago

Thanks a lot, Daniel. As always, amazing. And you can say change is a team sport. And for us and in our DNA, you saw last year, all is around gamification. Right? It's in the DNA of our company. So let me finish up with a couple of things that we are doing right now on how we are using gamification across the whole IT.

Things that we started in engineering last year with the DevOps Cup are now expanding across IT, like Cash is Queen. We spoke about how to save money for the company, how to do the same for less. We launched this campaign, Cash is Queen, where we give a copy of The Unicorn Project to the team that saves the most money in a week. Thanks to this campaign, we have saved 10%, that is, we have switched off 10% of the machines in our on-prem data center, more than 500 VMs and databases.

DevOps and gamification is also everywhere. So we have the game of technical debt. One team, so one product area, is around 100 engineers. They are playing one technical debt sprint, working with the flow framework, every six sprints. And they realize that after the cleanup sprint, the velocity completely goes up and starts decreasing on the next five sprints until they clean up again.

Operational analytics. So our operational analytics team is following Accelerate by the book, step by step, about decentralizing operational analytics to the market teams. And what I am most proud of this year is the network and the identity teams, teams that come from a former infrastructure way of working, and they are applying little by little more DevSecOps principles. So not throwing anything over the fence to security. They are working in sprints, they visualize the backlog. Security can chip in, and they now feel more secure that the network team and the identity team are really following a secure way in their backlogs, and they can really influence these backlogs for the better of the company.

The biggest success of the year, no matter if we reach the 4.5 billion, 4.7 billion, is that from day one of corona, all our 60,000 employees were able to work from home without us having had to risk their lives for this.

Last but not least, this gamification is coming to our business. Our robotic process automation team launched this 008 License to Automate, where they are opening up our stack to the business, because basically it's not difficult to use. We have 360 cases that people have applied with. We are training them, and I can tell you 60 or 70 of them are really promising for saving more manual work in the company. Last year, only at the beginning of the creation of the platform, we saved 62 FTEs of manual work with a team of just five to 10 people.

Okay, and just like every year, I just want to close by asking for help. You've seen our product domain map, you've seen our products. In our plans, Daniel and I really want to define all the data flows around entities throughout the different product areas, so that we are really strict in governing all this data interaction. We want to put a data catalog, data virtualization, and service virtualization in there, so that we are really able to manage and test this data, because this will make the products internally faster. So if any of you have done something like this in the past in a company of our size or complexity, your help will be really appreciated.

And without further ado, Daniel.

Daniel Eichten

Thank you for the time. Hope to see you then next time in person or-

Fernando Cornago

In Vegas.

Daniel Eichten

In Vegas, yes. Vegas, baby.

Fernando Cornago

Yeah.

Daniel Eichten

Thank you. See you. Take care. Ciao.