Industrialised Data - The Key to AI Success

Europe Virtual 2024

The DORA research concluded that there is an orders-of-magnitude difference in delivery KPIs between leaders and the incumbents. In this presentation, we will describe the corresponding "data divide" in data engineering capabilities, and how the leading companies have adopted an industrial approach to data management, enabling them to leap so far ahead. We will explain why "data industrialisation" is a key factor in going from AI prototypes to sustainable value from AI in production. We will also describe a path for companies outside the technology elite to cross the data divide into the industrialised data realm and share some very honest learnings from helping companies take that path.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

So I saw the next speaker, Lars Albertsson, when he gave a fantastic presentation at the Sweden Functional Programming meetup, which I thought was absolutely amazing.

Among other things, he spent time at Spotify and Google Engineering, where he got to see how large-scale data was being generated and, more importantly, how it was used to create amazing business outcomes.

What struck me about this presentation was how he points out some easily observable ways that data teams and organizations work, and how you can use them to make some very accurate predictions about an organization's ability to innovate and learn. Trust me, once you see it, you will not be able to unsee it.

Up next is Lars Albertsson, founder of a company called Scling that does data engineering, to teach us something that is relevant to every organization.

Lars Albertsson

I'm Lars. I run a small data engineering company named Scling. I will tell you at the very end what we do, because the rest of the presentation is context in a way. Our marketing strategy is very simple: we simply share the knowledge that we have found for everyone to use, in the hope that it will be useful for a wider audience.

As you heard, I worked at Google and Spotify before, which are amazing companies and are really good at some things. I found that there is a myth in the IT industry that all companies are quite okay at most things, and if you look at the leaders, they are just a little ahead of us, just twice or at most five times as effective. So if we just join the latest hype, if we just have some AI, we'll be just like them.

This is not at all what I have found. Rather, I have found that the distribution is more like this: most companies are so-so, in the widest span, but a few companies in some areas are way, way ahead of the vast majority of companies. I didn't realize this until I had been at both Google and Spotify, which are among the very elite in different areas of execution. To be honest, what one is good at, the other one is kind of mediocre at, so it was interesting to see this difference.

You can trust me on this, or you can look at some very wise people like Nicole Forsgren, Jez Humble, and our host Gene Kim, who did some research a number of years ago. They took a number of metrics and measured capabilities to deliver: the delay from idea to production, change failure rate, and so forth, and found the same kind of distributions, with a small elite being way ahead of most companies.

In my field, data engineering, if I look at data and the capabilities of companies and look at numbers, I see a similar pattern. There is a small elite that is several orders of magnitude ahead of the incumbents and the followers. One example is a highly unscientific poll I made on Twitter asking how long it takes to make a trivial change to a pipeline, end to end. For the majority of companies, this time is measured in months. Then there are a few companies who do this in hours.

One other relevant metric that we sometimes measure is how many datasets are produced. While producing a dataset is not per se business value, it is some kind of proxy for business value because you have this machinery that produces datasets, and if they were not useful, you would be shutting them off and taking the pipelines offline.

By dataset, I mean some kind of data artifact of intermediate or direct value: a report, a graph that you show on the dashboard, a recommendation index, and so forth; or an intermediate dataset such as the master view of all users that you have, or the master view of all orders.

I've been asking a lot of companies. If you ask around retail, banking, telecom, and traditional industries, they usually respond that they have this BI analytics division with 50 or 80 people, and they produce perhaps 100 datasets, perhaps 200, at most 1,000 or something. If you look at the most data-mature companies, you'll find several orders of magnitude more. They are not always public with these numbers, but you can deduce the numbers. In the transition when Spotify went from Hadoop to the cloud, you could deduce that they were producing on the order of hundreds of thousands of datasets per day. This is now six years ago, so they are probably in the millions now if you extrapolate. Google was already in the billions about a decade back.

This is a proxy for the cost to produce datasets. If you produce them by the billions, they cannot be very expensive to produce. If it is expensive for you to produce datasets, to have pipelines that run and build value artifacts, then you can only use it for the most important things, the most valuable things, which are typically financial reporting, maybe some product insights, and so forth. Higher up in the tree, you have machine learning and AI features that are more expensive to create, so you really have to have pushed your cost down in order to get return on investment here, at least sustainable return on investment.

Why does this matter? Here is the best example I have of how capabilities to quickly try data and lower the cost enable product innovation. Spotify has a feature called Discover Weekly, where every week you get a new playlist that is tailor-made for you. I saw a presentation about the people who made it. I do not know them; I have never met them. One of the key takeaways was that they said that in a normal company this would have been a strategic effort planned on a yearly basis, a major effort with a hundred engineers. They did the first functional implementation for internal users with three people in three weeks. They were able to do this because the company had enabled bottom-up innovation.

I became very proud because 18 months before this, I was in the team that decided we needed to enable the company to work with data. We made a conscious effort to democratize the use of data, pushing down the cost of producing new datasets, the cost of changing pipelines, and so on. This was one of the fruits that came out of it.

Interestingly, the CEO said, "I never saw the beauty. I would've killed it if it was up to me, but the company doesn't work like that," so they shipped it anyway. One year after shipping, they had 40 million monthly active users on this feature.

Not a lot of companies can innovate like this. How many good ideas go wasted out there because somebody has an idea which would have made it, but the leadership did not like it?

At the other end of the scale, we have most companies. I picked on Volvo here because I drove a Volvo for seven years. I sold it just a few days ago. This was the XC90 plug-in hybrid, their flagship model, and it had an annoying habit of sometimes rolling down the windows if you happened to press the open key for too long. One evening it did that, and then there was a snowstorm during the night. In the morning there was snow in the car.

The car knew that the windows were down. I could see that in the app. The car knew that it was snowing and that it was freezing, and it could have asked the weather service if it was unsure. The mobile app notified me about a lot of things, but not that there was snow coming into the car. This would have been really easy if data was just flowing, if the people building the mobile app had access to that data.

This is not a one-time thing. The car is full of these disappointments: if you just picked this data and did some counting and combined it with that data, you could have made a much better product. The navigation once led me out to a road that was actually dangerous to drive, where I could only drive at about five kilometers per hour. If they had measured that nobody chooses to drive on that road, or that the people who do drive there do five kilometers per hour, they might have guided me to another route. The same thing as with the snow happened with rain once when the rooftop was open. It knew that the rooftop was open. It had a rain sensor, but it didn't tell me. When I went to repair the car, the radar in the car had broken. The mechanic couldn't get information from Volvo about how the radar was doing, so they had to calibrate it by trial and error for several weeks.

I pick on Volvo here, but this is the same for almost all Swedish companies. It could have been any company. I wrote a similar blog post about IKEA once. If you are able to count things, act on what you count, and put that into your product, you are actually way ahead of most companies in terms of data engineering.

What is the difference between the incumbents and the ones that really do this well? We could summarize it as industrialization. We have been through this process for other types of software engineering. We have turned the craft into an industrial process. We no longer go around and install computers in racks in data centers. Instead, we work with a process that defines a cloud architecture, and when the cloud isn't what we want it to be, we change the process. We change the blueprint and respin things. In the same way, we don't compile things by hand anymore and patch configuration files on a running server. We build a new container from a container recipe. We have turned things into a process where the humans don't work with the things; they work with a process that builds the things.

We are halfway there with data. This was essentially what the big data transformation was about. We moved from database-oriented data warehouse methods to data pipelines and immutable pipelines, where we store the raw data and create new recipes to build data artifacts out of that raw data. We started working with the process instead of the data itself.
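The idea of keeping raw data untouched and rebuilding artifacts from recipes can be sketched in a few lines. This is only an illustration, not how any of the companies mentioned implement it; the `Play` record, the sample data, and the `plays_per_track` recipe are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)  # frozen: a raw record cannot be modified once created
class Play:
    user: str
    track: str
    seconds: int


# Hypothetical raw data. It is stored as-is and never patched in place.
RAW_PLAYS: tuple[Play, ...] = (
    Play("ann", "song-a", 200),
    Play("ann", "song-b", 12),
    Play("bob", "song-a", 180),
)


def plays_per_track(raw: Iterable[Play], min_seconds: int = 30) -> dict[str, int]:
    """Recipe: count plays per track, ignoring plays shorter than min_seconds."""
    counts: dict[str, int] = {}
    for play in raw:
        if play.seconds >= min_seconds:
            counts[play.track] = counts.get(play.track, 0) + 1
    return counts


# Building an artifact means running a recipe over the raw data.
v1 = plays_per_track(RAW_PLAYS)

# Changing the logic means changing the recipe and rebuilding from the
# same raw data, rather than editing the previous output.
v2 = plays_per_track(RAW_PLAYS, min_seconds=10)
```

Because the recipe is a pure function of the raw data, an old artifact can be thrown away and rebuilt at any time, which is what makes it cheap to change a pipeline.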

This is how we build artifacts today. The main way to build artifacts is to start with raw data, massage it in Excel or in the database, and then you build something and think that you have something of value. But for the ones that do that well, if they really care about this value, there is more: you actually build on top of this process and add supplementary processes.

I used to work in financial engineering at Spotify for a while, where we had to calculate the amount of money that was sent to different stakeholders. The raw data wasn't enough. We had to combine that with user data from many different sources. Some of it was wrong, some of it should be ignored and filtered out. Sometimes the mobile apps would actually record the wrong song being played, so we had to look at complementary information to go in and patch that and improve the process. For the processes that we really cared about, we added quality measurements and patched up these processes. That is why you see the hundreds of thousands of datasets being produced rather than a few ones. If you do not have an industrial way of working, this is beyond your capability.

The same goes for machine learning. The naive idea is that you build a model, you throw the model out, and then you're done. That is what you read in blog posts. The reality is similar to the financial engineering: you actually need to measure the incoming data so you don't have drift, measure the outcomes, have an ensemble of models, compare them, and so on. This is a quote from Erik Bernhardsson, who created the first recommendation services at Spotify, who realized that all of the work goes down into the plumbing. What I have clipped in there is now an infamous Google picture that says the same. The little black box in the middle is the data science. Everything around is all of the engineering and the plumbing and making things work.
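As a rough illustration of what "measure the incoming data so you don't have drift" can mean, here is a minimal sketch. It assumes a single numeric feature and uses a deliberately simple measure (how far the incoming mean has moved, in reference standard deviations); the function names, the threshold, and the sample values are made up for the example, and a real system would likely use more robust statistics.

```python
from statistics import mean, stdev


def drift_score(reference, incoming):
    """Shift of the incoming mean, measured in reference standard deviations."""
    spread = stdev(reference)
    shift = abs(mean(incoming) - mean(reference))
    if spread == 0:
        # A constant reference: any shift at all counts as infinite drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def has_drifted(reference, incoming, threshold=3.0):
    """Flag the batch when the mean has moved more than `threshold` deviations."""
    return drift_score(reference, incoming) > threshold


# Hypothetical feature values seen during training and in two new batches.
reference = [10.0, 11.0, 9.0, 10.5, 9.5]
steady_batch = [10.2, 9.8, 10.1]
shifted_batch = [25.0, 26.0, 24.0]

steady_alarm = has_drifted(reference, steady_batch)
shifted_alarm = has_drifted(reference, shifted_batch)
```

A check like this would run as its own step in the pipeline, producing yet another dataset (the drift measurements) next to the model's inputs and outputs.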

This will become even more important now that we have the generative AI hype. It looks so simple and it is so simple to play with. You can download an LLM from somewhere and it can do amazing things for you. But in order to actually be useful in real use cases, you might need more things. You need it to be relevant, to have correct facts, to not hallucinate and not lie to you, and perhaps not recommend your competitor's products, as we saw in the Chevrolet chatbot, which recommended people buy Tesla instead.

If you want to add, for example, relevance, there is a hack that we have invented called retrieval-augmented generation, where we help the LLM a bit by using a standard search engine to search our enterprise document corpus if we want to speak about our own documents. We concatenate the search results to the prompt, and then the LLM will spit out something that is generically useful but also influenced by the documents that we really care about.
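The retrieve-then-concatenate step can be sketched as follows. This is a toy under stated assumptions: the "search engine" is plain word overlap rather than a real one, the corpus sentences are invented, and no LLM is called; the sketch only builds the prompt that would be sent to one.

```python
import re


def tokens(text):
    """Lowercased set of words and numbers in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, corpus, top_k=2):
    """Toy search: rank documents by how many words they share with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(doc)), doc) for doc in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]


def build_prompt(question, corpus, top_k=2):
    """Concatenate the retrieved documents and the question into one prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, corpus, top_k))
    return f"Use these documents when answering:\n{context}\n\nQuestion: {question}"


# Hypothetical document corpus.
CORPUS = [
    "Cotton program: wash at 60 degrees for heavily soiled cotton.",
    "Wool program: wash at 30 degrees with a gentle spin.",
    "Clean the lint filter after every drying cycle.",
]

prompt = build_prompt("How do I wash wool with a gentle spin?", CORPUS, top_k=1)
```

Even in this toy there are already several knobs (the tokenizer, the ranking, `top_k`, the prompt wording), each of which changes what the model gets to see.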

How to do this looks easy at a glance, but there are many degrees of freedom and it is not so obvious how to do this. It is not so obvious how to chunk up these documents and feed them to the LLM, because you cannot feed everything to the LLM. It turns out that you need to do it wisely. If you have a document that says, "By the way, this product is dangerous if you do any of these things," and then a list of 10 things, and you just cut out thing number 10 without the context, then you are fooling the LLM and it will recommend dangerous things.

Here is an example of an experiment that we ran for household appliances, where we threw the manuals to the machine so that perhaps we could talk to our washing machine. Rather than running the same three programs every time, which we always do, if we augment it with LLMs, we could perhaps talk to it and say, "I have a stained shirt here. It has blueberry and chocolate stains. What should I do with it?" You could get advice back, both from general knowledge from the internet and also specific knowledge from the manual of the machine.

In this case, the manual had tables, and the RAG was doing the chunking wrong. It was mixing the different rows in the tables so that it would give the wrong instructions. The conclusion here is that we need domain-specific data engineering in order for these things to actually be valuable. In the future, we might hope that LLMs will understand tables and everything, but right now you can gain an advantage if you do some domain-specific engineering on top of it.
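One possible form of such domain-specific engineering is to chunk tables row by row and repeat the column headers in every chunk, so that no chunk mixes values from different rows. The sketch below assumes the table has already been parsed into a header and rows; the `chunk_table` helper and the table contents are invented for illustration and do not come from the experiment or from any real manual.

```python
def chunk_table(header, rows, title=""):
    """Return one self-contained text chunk per table row.

    Each chunk repeats the column names, so a row is never separated
    from the headers that give its cells their meaning.
    """
    chunks = []
    for row in rows:
        cells = ", ".join(f"{name}: {value}" for name, value in zip(header, row))
        chunks.append(f"{title} | {cells}" if title else cells)
    return chunks


# Hypothetical stain table from a washing machine manual.
HEADER = ["Stain", "Program", "Temperature"]
ROWS = [
    ["Blueberry", "Cotton", "60 C"],
    ["Chocolate", "Delicates", "30 C"],
]

chunks = chunk_table(HEADER, ROWS, title="Stain guide")
```

The same idea applies to the warning list mentioned earlier: each list item would be chunked together with the sentence that introduces the list, so the context travels with it.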

These are examples of pictures found on the internet on how to do the other things that I mentioned: how to prevent LLMs from hallucinating, how to provide guardrails, and so on. You see all of these boxes. All of these boxes amount to things that have degrees of freedom, and you don't know which ones will be useful. So you need a whole bunch of data engineering to figure out which ones are good, compare them, maybe use multiple ones, and then select the right one.

I have been through a bunch of hypes now. In the big data hype, we saw a lot of companies aiming to do what the cool kids were doing, installing Hadoop and so forth. Most of the enterprise data lake efforts or big data efforts failed, and the companies then went back to old-style data warehouse methodologies. We saw a whole bunch of new iterations of the data warehouse, like the modern data stack or the modern data warehouses, and also a whole bunch of low-code and no-code tools, which are easier to use than the clunky industrial Hadoop ecosystem tools.

This will fade. This will all go away. The reason I say that with such certainty is that we have seen this before. We saw this in the nineties, when there were not enough software engineers. We invented 4GL and UML-based coding tools so that everybody could be a software engineer, so it would be so easy: you could just move boxes around and connect them. One of these boxes in the picture is an Informix 4GL tool from the nineties, and the other one is Apache NiFi, and they look the same.

These 4GL tools are all extinct because we realized, yes, they were good for building proofs of concept, but they were not actually good for building real, high-quality products. Instead, we educated software engineers. In the same way, we need to educate data engineers to get out of this 4GL phase of data engineering.

Most companies that I meet want to do big projects. They say, "We should be AI first now," and do a great effort. I don't think there is such a thing as a successful big data or AI project, because the companies that do succeed do not work in projects. They work with products. They iteratively work on one problem at a time, and they are completely focused on improving the actual products and the problems at hand. Then the technology comes as a side effect, driven by the product needs.

In order to be successful with these, I find that organizational process changes are much more important than technology changes. Align teams along the value chains, for example, which is what the book Team Topologies also suggests. Use automation and data for the current challenges that you have at hand, and focus on the feedback cycles. Remember the six-month cycle for changing a pipeline. Those feedback cycles are typically really fast in companies that do succeed.

These realizations have formed my career. I have observed that the things that we do to spread these fantastic capabilities to the wider range of companies are not really enough, and there are not enough educated people so that everybody can take part in these fantastic data and AI capabilities. I believe that we must find new ways to collaborate. I formed a company that tries one of these possible ways of collaboration. We do data factory as a service, where we develop pipelines and host them and operate them as an external data team.

This is not so common in IT, but we see this in other areas. Foxconn is a good example: they make phones for Google and Apple, who do innovation at a very high level outside their own companies, in collaboration with Foxconn, who are experts at what they do.

What we have learned here is that, yes, it is possible to do this. We have had a couple of successes where things go really well, and we match Spotify's numbers in terms of iteration speed and datasets per person and capabilities, but without the hundreds of person-years that they put into investing in the platform, just very lightweight investments.

But it is also challenging because we can come with capabilities to work with data, but going from there to capabilities to improve the customer's product and business, that is quite a gap and it is domain-specific, so we cannot cover all of it.

Another challenge is that many customers say, "How hard can it be?" That is an even harder challenge, because once they are in that mode, they will go about and do their own things. Usually they don't have the expertise, and they cannot hire the expertise on the market, so they fall back into traditional data.

Gene asked me to say at the end, what help do you need, what do you ask for? One thing that is obvious from these learnings is that in order for us to be able to help people, we need clients that are competent within their domain, but also humble with respect to data.

I believe there are also other types of business models and collaboration models that one could use. We are explicitly looking for partner companies that are specialized in some vertical and helping clients with a particular vertical, for example analytics for energy or digitalization for lawyers, whatever. I hope that we can find such companies and collaborate with them to do data factory engineering, but for a specific vertical. So if you are in one of those companies, please reach out to me if you are interested.

I am also interested in hearing about other ways to slice these challenges, because I don't think the ones that we have work particularly well.

Host Outro (Gene Kim)

Fantastic. Thank you so much, Lars. If you're in the Slack channel, put your contact information. If not, I'll just make sure that either Ann or I will get your information in there.

Lars, thank you for teaching us and hopefully showing something that once you see, you cannot unsee.