Scaling Data Analytics Delivery Model With DevOps Practices
As a data-driven company, adidas has a strong demand for a reliable, scalable and fast Data Analytics Platform.
This presentation describes how we applied DevOps principles, along with a number of capabilities from the Accelerate book (such as Continuous Delivery and Architecture for scaling), in the Data Warehouse domain. Along the way, we convinced more than 10 teams around the world that DevOps principles bring them speed, high quality and empowerment in the creation of data models and data pipelines.
Now that the practices are adopted, 10+ teams are delivering data for analytics demands:
- independently from each other
- faster and with higher quality
- in smaller batches
- deploying changes to complex data models and pipelines on demand, multiple times a day, instead of the old fixed biweekly cycle.
Some details:
- Our purpose is to increase speed and quality in delivering Data for Analytics use cases by:
  - increasing quality of delivery in Data Warehouse environments
  - increasing speed of changes
  - improving reliability of complex Data Models and Data Pipelines
  - improving reliability of deployments
Top challenges:
- Merging skills and ways of working of experts coming from Data Warehousing and Software Engineering backgrounds
- Finding Continuous Integration and Delivery patterns acceptable for Data models and pipelines
- Some tools in the data stack cannot be managed as code
- Code (definition of the Database object) is tightly coupled with the data
- High complexity of atomic objects (e.g. a database view harmonizing a KPI calculation from data points in multiple systems can contain several thousand lines of code)
Full transcript
Dmitry Luchnik
Dozens of productive deployments per day. Many independent teams doing that. It sounds familiar, right? But what if this is happening in the enterprise data warehouse at a certain company?
Hi, my name is Dmitry Luchnik. At Adidas, I'm a solution architect in the data and analytics department. I'm excited to be here and to share with you how we scaled the delivery model in data analytics at Adidas. Thank you for spending 30 minutes of your time with me on this topic.
Let me start with a short intro of what data and analytics at Adidas means. Adidas is a global company running businesses on all continents, grouped into five markets. And that means our demands for analytics are also distributed across the globe. We have five markets, headquarters, marketing, product creation, supply chain, sales, finance, and so on. It is clear that every function requires data to be efficient.
Let me show what that means in numbers. On a global scale, we are talking about many teams in every function of the company who have data analytics at the core of every decision they make. It's hundreds of people in 15 or even more teams, who create reports, connect datasets, build prediction models, and so on. And it's thousands of those who consume those reports and prediction models, or analyze what is happening right now. From a data volume perspective, we are on a petabyte scale. This data is used 100,000 times every week, and it's needed for day-by-day decision-making.
Where is data coming from? A big footprint, of course, is our global platforms supporting core business processes like SAP FMS, or S/4HANA in the future. These are covering core processes like supply chain, sales, or finance. Other systems support e-com, like Adobe or Google Analytics. Such systems are managed and governed centrally. Plus, on top, we have a fair amount of local sources as well. In general, we collect and process the data from more than 40 various source systems. So what do we have? A global company with distributed analytics demands and usage, multiplied by centralized data sources and platforms. So what is our approach to offer speed in analytics?
Here you see an overview of the data and analytics department. Our role: to drive the digital transformation towards data-driven decision-making. How we do it: we provide reliable data, help with insights, and automate decision-making. We have several offerings to interact with the data, addressed to different types of personas: consumers of standardized reports and KPIs, data explorers, or data scientists. We also take care of collecting data from global sources and making it available for analysis.
Now, we are going to go deeper into our operational analytics offering. Why? Because data is an area which requires scaling and flexibility to cover market needs for insights; to enable and empower those 15 teams across the globe of content creators; and to serve those 3,000 explorers. So what is operational analytics, or how we call it, OA?
OA is a data exploration, reporting, and visualization platform. It can be used for flexible browsing through available data, finding answers to the question of the hour, and getting insights from combining local and global data together. Some of the sample queries or questions of OA are presented here. So we talk about how to rebalance our stock between European countries, or how to balance load in the factories, or under what circumstances it was right to go with an air shipment as opposed to land shipments. To answer these questions, you need to consider multiple data points: demand, supply, availability of the stock. So, a big area for supply chain.
OA is structured in what we call a hub-and-spoke model. Hub for central functions like onboarding, providing platforms, tools, shaping ways of working, and of course provisioning the global data from the global sources. Spokes are empowered business intelligence teams located close to analytical consumers. It can be a specialized technological BI team or a business analytics team in the market. They know the needs, they can react faster, and should be unblocked to do their job.
This structure is based on several learnings we have made over the years. Local teams are much better aware of immediate needs, of what is really hot and what requires an action right now. Plus, creation of visualizations for that: spokes are also super at this. They also usually are very competent in data preparation for the use cases because, if not for such platforms, they would be doing data preparation manually with Excel, Access, or who knows what else. However, the heavy lifting is centralized: connecting to global sources, getting the data out, harmonizing dictionaries, running data quality checks, and so on.
We shaped the architecture around personal areas, giving the spokes full empowerment to manage their spoke. This applies on the schema level, on the data transformation or ETL level, to user management, all of that. You could say it goes exactly according to Conway's Law, but actually we shape spoke teams around the loosely coupled architecture principle. You'll see more loosely coupled concepts here. This is one of the main ideas which helped to scale up our delivery model.
By the way, this is not the first attempt to do something like this at Adidas. Earlier variations were not commonly accepted, as they had a limited level of empowerment for the spoke teams. The idea of distribution was right, but the implementation had too tight central governance and, therefore, was blocking rather than enabling for the spokes. The current implementation is very different. The foundational architecture and, very importantly, the change management processes are now based on DevOps and the continuous delivery principles from the Accelerate book.
So how to deal with empowering spokes while not compromising the integrity of the overall solution, and how do we hold such a data analytics environment together? It is fair to say that there are some challenges in this question, like how central teams are supposed to bring global changes that are not disruptive for the spokes; how local teams can exchange know-how and help each other; or, for instance, how reporting key figures implemented locally stay aligned with global logic. If we are not addressing these technical questions, they easily become business issues of wrong numbers, misaligned reporting figures in markets and headquarters, or super long lead times. That means no way to act operationally based on up-to-date data.
To address these challenges, we applied an architectural formula. We took the data warehouse experience we had internally, actually decades of experience combined; looked at what was happening on the market; luckily for us, the Accelerate book was published and presented a condensed overview of what should be done, ought to be done; checked architecture patterns from the software engineering domain; and tried to fit it all into the shape of the data domain. The result is what we call BLENDA. On the one hand, it gives speed and empowerment to spokes. On the other hand, it ensures governance and quality. And multiplied by automation, it helps to accelerate the data journey at Adidas. This is, by the way, again, one of our core principles of the fast analytics ecosystem: empowerment loosely coupled with governance. The name BLENDA itself shows that it's blended know-how of several teams, blended domain experience from data warehouse and software engineering domains, and the main purpose is to blend data. Now let's go deeper into how it works.
This is a normal continuous integration framework. We have a big bucket in the center: continuous integration, continuous delivery orchestration with Jenkins, virtual disposable environments, and so on. But wait, this is only nearly normal, because what we have here is a stateful productive application. That means we can't really redeploy. It's a box with many, many terabytes of data. The changes to this box should be incremental.
The agents for doing changes are data warehouse experts. They know inside out the data modeling techniques. They can, in the middle of the night, answer where star schema deviates from snow normal form, but they might not be that fluent with Git, continuous delivery, Jenkins, and so on. That means in Git, we need to store artifacts understandable by data warehouse experts: SQL objects, ETL jobs, inserts, selects, shell scripts, something like this. The incremental changes which I was talking about, they also should be in the native SQL. With this, the teams are working with known concepts and components. We minimize intrusion into the regular ways of working. BLENDA is tailored to the skills of the data warehouse community.
We also need to test incremental statements, incremental deployments, and ideally on every change. This makes the full data model reliable at any moment. And one last thing: working with a sequence of changes is a good idea when those changes are introduced, but it's not very descriptive when you try to find out what the actual structure of the table is. Browsing through tens of add column, remove column, add column, remove column commands is not really rewarding. Therefore, we need to regularly condense all these ALTER TABLE statements into a single definition that shows the real, up-to-date structure of the table. So that's an overview of our continuous delivery cycle.
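As an illustration of the condensing step described above, here is a minimal sketch in Python. It is not BLENDA's actual implementation: the change-log format, the `sales` table, and the `condense` and `render_create_table` helpers are all hypothetical, and real incremental changes in the talk are native SQL statements rather than tuples.

```python
# Hypothetical change log for one table, in the order the changes were merged.
# Each entry is (action, column_name[, column_type]). This format is an
# assumption made for the sketch, not something taken from the talk.
CHANGES = [
    ("add", "order_id", "INTEGER"),
    ("add", "amount", "DECIMAL(18,2)"),
    ("add", "legacy_flag", "CHAR(1)"),
    ("drop", "legacy_flag"),
    ("add", "currency", "VARCHAR(3)"),
    ("add", "promo_code", "VARCHAR(20)"),
    ("drop", "promo_code"),
    ("add", "market", "VARCHAR(10)"),
]


def condense(changes):
    """Replay add/drop changes and return the final {column: type} mapping."""
    columns = {}  # dicts keep insertion order, so column order is preserved
    for change in changes:
        action, name = change[0], change[1]
        if action == "add":
            if name in columns:
                raise ValueError(f"column already exists: {name}")
            columns[name] = change[2]
        elif action == "drop":
            if name not in columns:
                raise ValueError(f"cannot drop unknown column: {name}")
            del columns[name]
        else:
            raise ValueError(f"unsupported action: {action}")
    return columns


def render_create_table(table, columns):
    """Render the condensed structure as a single CREATE TABLE statement."""
    body = ",\n".join(f"    {name} {ctype}" for name, ctype in columns.items())
    return f"CREATE TABLE {table} (\n{body}\n);"


if __name__ == "__main__":
    print(render_create_table("sales", condense(CHANGES)))
```

Eight incremental changes collapse into one four-column definition, which is far easier to read than the history that produced it.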
Some words about the testing framework. We really wanted to make it fast, to ensure teams are getting meaningful feedback on a proposed change within five minutes. What do we test? I'll give you some examples of our testing scope. I said before that what we test is incremental deployment. We test whether this incremental deployment would work well when applied to a productive database. That means we first create a copy of production in a virtual environment, then apply an ALTER SQL command from the developer. If everything is fine, the test is green, and then we dispose of the test database.
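The copy, apply, dispose cycle can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with in-memory databases as a stand-in for the real warehouse and its virtual environments, and `make_fake_production` and `check_incremental_change` are hypothetical names.

```python
import sqlite3


def make_fake_production():
    """Create a tiny stand-in for the productive database (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("INSERT INTO sales VALUES (1, 99.5)")
    conn.commit()
    return conn


def check_incremental_change(production, statements):
    """Apply incremental SQL to a disposable copy of production.

    Returns (True, "ok") if every statement applies cleanly, otherwise
    (False, error message). The production connection is never modified.
    """
    scratch = sqlite3.connect(":memory:")
    try:
        production.backup(scratch)  # copy schema and data into the scratch DB
        for statement in statements:
            scratch.execute(statement)
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        scratch.close()  # dispose of the test database in every case


if __name__ == "__main__":
    prod = make_fake_production()
    print(check_incremental_change(prod, ["ALTER TABLE sales ADD COLUMN currency TEXT"]))
    print(check_incremental_change(prod, ["ALTER TABLE no_such_table ADD COLUMN x TEXT"]))
```

Because the change only ever touches the scratch copy, a red result costs nothing, and the production structure stays exactly as it was.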
We're also checking if the data model exposed to the front end is still working as expected, because we have the full data model in the virtual environment. Last, we can also check whether reporting key figures implemented locally are still aligned with the global logic.
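One way such a key-figure alignment check could look is to run a locally implemented key figure and a central definition against the same data and compare the results. The sketch below assumes an `orders` table, a "net sales" figure, and a helper named `kpi_is_aligned`; none of these come from the talk, and `sqlite3` again stands in for the real warehouse.

```python
import sqlite3

# Hypothetical central definition of a "net sales" key figure.
GLOBAL_NET_SALES = (
    "SELECT market, SUM(amount - discount) FROM orders "
    "GROUP BY market ORDER BY market"
)


def build_test_data():
    """Create a small in-memory data set for the comparison (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (market TEXT, amount REAL, discount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("EU", 100.0, 10.0), ("EU", 50.0, 0.0), ("NA", 80.0, 5.0)],
    )
    conn.commit()
    return conn


def kpi_is_aligned(conn, local_sql, global_sql, tolerance=1e-9):
    """Return True if both queries yield the same (key, value) rows."""
    local_rows = conn.execute(local_sql).fetchall()
    global_rows = conn.execute(global_sql).fetchall()
    if len(local_rows) != len(global_rows):
        return False
    for (local_key, local_value), (global_key, global_value) in zip(local_rows, global_rows):
        if local_key != global_key or abs(local_value - global_value) > tolerance:
            return False
    return True


if __name__ == "__main__":
    conn = build_test_data()
    aligned = "SELECT market, SUM(amount) - SUM(discount) FROM orders GROUP BY market ORDER BY market"
    drifted = "SELECT market, SUM(amount) FROM orders GROUP BY market ORDER BY market"
    print(kpi_is_aligned(conn, aligned, GLOBAL_NET_SALES))
    print(kpi_is_aligned(conn, drifted, GLOBAL_NET_SALES))
```

The first local query is written differently but computes the same figure, so it passes; the second forgets the discount and is flagged as misaligned.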
Maybe looking at this picture, some of you recognize your own past thoughts about what works and does not work well together. I myself definitely used to think like this: that speed in delivery is compromising quality, that empowerment fails with central governance, that quality can be ensured by tighter controls. Luckily, this changes.
Well, now we introduced BLENDA, our continuous delivery framework, to the data warehouse. And what we saw: some of the reds turned into greens, and I dare to say green gets even greener. As we follow a continuous integration approach, the spokes know that an incremental change is not going to break their data models or reporting front ends. Empowerment plays together with quality. As we follow continuous delivery practices, quality is supported by CI and review culture, so speed is not compromised. This is a big contrast to the approach we used to have some years ago: quality checks done by multiple people coming into the weekly hub meeting to prepare a once-a-week production deployment.
And, as we follow a loosely coupled principle, we can support central governance with a clear test report. The local teams are not blocked from making an urgent change. Rather, teams are fully empowered to decide if they want to proceed despite a failed central test. The fix can be applied later, right? Or maybe such a failed test is a real blocking reason. The decision is fully with the spoke.
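The idea of a test report that informs rather than blocks can be sketched like this. The `CheckResult` type, the `build_report` and `may_deploy` helpers, and the check names are all hypothetical; the talk does not describe how the report or the spoke's decision is represented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one central governance check (hypothetical structure)."""
    name: str
    passed: bool


def build_report(results: List[CheckResult]) -> dict:
    """Summarize check results into a report the spoke can act on."""
    failed = [result.name for result in results if not result.passed]
    return {"total": len(results), "failed": failed, "green": not failed}


def may_deploy(report: dict, spoke_accepts_risk: bool) -> bool:
    """A red report never blocks on its own; it requires a spoke decision."""
    return report["green"] or spoke_accepts_risk


if __name__ == "__main__":
    report = build_report([
        CheckResult("naming_convention", True),
        CheckResult("global_kpi_alignment", False),
    ])
    print(report)
    print(may_deploy(report, spoke_accepts_risk=False))
    print(may_deploy(report, spoke_accepts_risk=True))
```

The design choice shown here is that governance produces information and the spoke owns the decision, which matches the loosely coupled principle described above.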
Let's look inside BLENDA. It consists of three layers, clearly separated by the Git repo structure. The first layer is for data warehouse content: data models, tables, views, ETLs with imports, inserts, stored procedures, and so on. Using BLENDA means contributing to the data warehouse via layer one. All spokes are using at least the first layer. While designing this layer one, we considered common skills in the data warehouse teams. And because of that, it's a rather quick start: one-hour onboarding, and continuous integration and continuous delivery are enabled.
The second layer is our testing framework. Most of the content here is created by the BLENDA core team. However, spokes can create their own local tests for their own purposes. This requires a bit more than a data warehouse background: a bit of Python knowledge, a bit of knowledge of test data management, going a bit deeper into continuous integration. The core team maintains contribution templates. This also helps to onboard spokes with relatively low effort.
And now the third layer: it's BLENDA itself. Ideas behind architecture, design principles, continuous integration processes. Also, all functionality like deployment protocols. Maintaining this layer requires deeper understanding of both data warehouse and software engineering concepts. This layer is maintained and evolved by the core BLENDA team. Within Adidas, it's open source, so the spokes also can contribute via pull request.
This structure helps to drive speed and scalability in the data analytics area. Speed at onboarding with an understandable first layer which uses only domain objects from the data warehouse domain, SQL and ETL concepts. Speed in applying changes to data warehouse architecture as BLENDA enables a continuous delivery working model. Speed in foundational development: I will talk about this on the next slide. As the BLENDA template is decoupled from each spoke area, foundation and all functionality can grow independently at its own pace. Adoption of the new features by spokes actually emerges on demand.
Some words on this decoupling and how the core is decoupled, or better said, loosely coupled with the spokes' content. Basically, every spoke has its own BLENDA instance. This is a separate repo in Bitbucket, fully owned by the spoke: users, permissions, approvers, and so on. With the first layer of BLENDA, the teams are managing their own data models and ETL processes. Just to repeat, each spoke has its own copy of BLENDA, all three layers, including the core, something similar to a Linux distribution: open-source centrally maintained kernel along with user content in each installation. We do not have some central running BLENDA core. Each spoke is fully independent from this perspective.
What we do have is a BLENDA template repo, and here all central components are being developed: our centralized governance checks, whatever we want to have running in all spoke repositories, deployment protocols, and so on. This means all BLENDAs are loosely coupled with the central template. Even though the basic principles and processes are aligned across all spokes, all developments can proceed completely independently. Adoption of new features is also an internal spoke-by-spoke decision. Rollout of the central functionalities (the STs on the boxes, like here, for instance) is automated but does not interfere with the functions owned locally by the spoke.
Now let's see: does it help actually to follow the guidance from the Accelerate book? Here I try to map what capabilities we do have with BLENDA, some are done, some not, and let's see actually how they help with managing our data warehouse.
So first, a diff-based review creates full transparency of what exactly is going to be deployed. No changes are hidden inside some transport packages, as some data warehouse management tools do it. It's fully evident content handling. We can combine a model modification together with a loading or ETL process dealing with this model modification. It's also an evident principle. We test the whole model, so we really have continuous integration of the complete data warehouse spoke. Reliable tests plus multi-eye review and a lean approval process all make productive changes faster.
We saw an interesting effect last year when COVID hit. It impacted our priorities. We had to reshuffle focus on some of the use cases, as you can imagine. Certain activities were put on hold, and the spoke developers had to switch context, sometimes contributing to a completely different spoke and a completely different business function. But because similar changes from the past were evident, and because change processes are harmonized with BLENDA, new team members were onboarded quite fast. We hardly noticed any reduction of the delivery speed, even under such circumstances.
So now I'd like to show you how we quantified the outcomes. I'm a data guy at the end of the day, so let's zoom into some of the KPIs. As for the KPIs, why not again take the Accelerate book as guidance here as well? What you see here is the data from the 2019 DevOps report. And an ambition of Adidas is to get to the elite group of software delivery performance based on the four KPIs from the Accelerate book: deployment frequency, lead time for changes, mean time to recovery, and change failure rate. From the data analytics side, we also contribute to this ambition. What we saw: with the right architecture, tools, and processes, we actually can play in this elite league. I will go deeper into the top two KPIs.
Deployment frequency. We can measure it very precisely. What you see here on the slide is an overview of all deployments from our first one two years ago to the current pace of changes. Every vertical bar is one day, and the height of the bar is the number of deployments during the day. Every color within the bar is a separate spoke. The picture actually speaks for itself. The spokes are in on-demand, several-times-a-day deployment culture. Quite interesting that COVID year also created a demand to speed up in analytics, and this is what you can see: rise from the second half of 2020.
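The chart described above (one bar per day, one color per spoke) boils down to counting deployments per day and per spoke. A minimal sketch, assuming a deployment log of ISO timestamps and spoke names; the log format, the sample entries, and the `deployments_per_day` helper are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical deployment log: (ISO timestamp of a production deployment, spoke).
DEPLOYMENTS = [
    ("2020-09-01T09:15:00", "spoke_a"),
    ("2020-09-01T11:40:00", "spoke_a"),
    ("2020-09-01T14:05:00", "spoke_b"),
    ("2020-09-02T10:30:00", "spoke_b"),
]


def deployments_per_day(log):
    """Return {date: Counter({spoke: number of deployments on that date})}."""
    per_day = defaultdict(Counter)
    for timestamp, spoke in log:
        day = datetime.fromisoformat(timestamp).date()
        per_day[day][spoke] += 1
    return dict(per_day)


if __name__ == "__main__":
    for day, counts in sorted(deployments_per_day(DEPLOYMENTS).items()):
        # The bar height is the daily total; each spoke is one segment of it.
        print(day, sum(counts.values()), dict(counts))
```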
Next KPI is the lead time. We can also measure it rather precisely, at least if we can see the kind of first-commit-to-production lead times. How to read the charts here: this one and this one. One, two, and three are hours, and then we show them separated. Everything below a single day but more than three hours is grouped under a day. More than a day, less than a week, that we group under a week, and so on.
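The grouping used in the lead-time charts can be expressed as a small bucketing function. This sketch measures lead time from first commit to production deployment, as described above; the bucket labels, the inclusive upper bounds, and the final "longer" bucket for anything beyond a week are assumptions.

```python
from datetime import datetime, timedelta

# Upper bounds of the buckets: 1, 2 and 3 hours are shown separately,
# then everything up to a day, then everything up to a week.
BUCKETS = [
    (timedelta(hours=1), "1 hour"),
    (timedelta(hours=2), "2 hours"),
    (timedelta(hours=3), "3 hours"),
    (timedelta(days=1), "day"),
    (timedelta(weeks=1), "week"),
]


def lead_time_bucket(first_commit, deployed):
    """Map a first-commit-to-production lead time to its chart bucket."""
    lead_time = deployed - first_commit
    if lead_time < timedelta(0):
        raise ValueError("deployment cannot precede the first commit")
    for upper_bound, label in BUCKETS:
        if lead_time <= upper_bound:
            return label
    return "longer"  # assumption: the source only says "and so on"


if __name__ == "__main__":
    commit = datetime(2021, 3, 1, 9, 0)
    for delta in (timedelta(minutes=25), timedelta(hours=5), timedelta(days=3)):
        print(delta, "->", lead_time_bucket(commit, commit + delta))
```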
So what we can see here is that for most teams, a less-than-an-hour flow from commit to production is a reality. All in all, the less-than-an-hour flow accounts for about 25% of all changes. An example on top here is a typical small change taking less than half an hour end-to-end: development, testing, CI feedback, review approval, merge, and deployment to production. About 75% of changes, those areas combined, are changes which are done within a single day.
For sure, we still have some heavy lifting topics. They're still there, and they can take a bit more time than a day. By the way, heavy lifting is visible on this 3D chart here. Two teams are tending to do more day-to-week cycles. So we see this one and this one, that blue and this brown one. Those are the central teams who take care of the global data and more foundational topics.
Now, harvest time. After two years on the journey, we have some interesting outcomes. We have 14 BI teams around the globe working in parallel independently, yet loosely coupled with headquarters. They are serving important needs of the 3,000 analysts with about 100,000 report executions per week. We manage about 80 terabytes of uncompressed BI data inside this environment. Developments can be isolated in small batches. Lead times, you saw yourself, they went down in some cases to less than 20 minutes for deployment. Deployments are happening on demand when needed, multiple times a day, instead of old biweekly or weekly deployment cycles. Complete content of a local data warehouse can be tested over a cup of coffee.
It's also fair to talk about the learnings we had along the way. Learning number one: abstraction layers like Liquibase or SQLAlchemy do not really fit our needs. To ensure adoption of the overall approach, we had to go with native SQL over the Liquibases of the world. Of course, it has its own challenges, but it also has some positive effects on adoption. To achieve a high level of adoption of the deployment protocol, we had to make it super reliable, considering all potential bells and whistles. And actually, we had to rewrite it seven times within the first nine months. And then, of course, fast CI is a must. The idea of running full tests over a cup of coffee, that also made a difference.
Now, before I close, here is the help that I'm looking for. This journey here shows that we can scale up data consumption. So once the data is already available and collected and unlocked, fast insights creation is possible. The next step, obviously, is how to further speed up the process of bringing new data into the analytics landscape. We started a journey toward a data mesh, as described on Martin Fowler's blog. Its purpose: to make data production easy and fast so business domains can participate. This is an active stream, and any ideas on how to make it work are super welcome and super relevant. I would be grateful for that.
This being said, let me say a big thank you for staying with me and for your attention. I would be happy to answer your questions.