Data - The Land DevOps Forgot

Las Vegas 2024

Michael Nygard

VP, Data Engineering · Nubank

Host Intro (Gene Kim)

The next speaker is Mike Nygard. He's one of my favorite authorities on all things architecture-related, especially when it comes to resilience and enabling developer productivity. He is the author of the fantastic book Release It!, and I'll be forever grateful that he introduced me to, and taught me, Clojure, my favorite functional programming language, which either runs on the JVM or gets transpiled into JavaScript.

He is now the VP of Data Engineering at Nubank, the largest banking platform outside of Asia, serving 100 million-plus customers, where he was asked to solve an increasingly urgent problem around data and batch jobs — a technology that runs every bank on the planet. He calls it 'data — the land that DevOps forgot.' So here to talk about the problem, how he helps solve it, is Mike.

Michael Nygard

Well, good morning everybody. About two years ago I joined Nubank, and it's not a name that many people in North America know, so I'm going to take just a minute to introduce who we are.

Nubank was started about 10 years ago. It's a digital-native financial services company. We started with the mission to fight against complexity, to return power to the people. And this has been a pretty well received message. I have some of the usual sort of numbers to back that up, but we also fight against complexity in our own organization. So among our engineering principles, we have canonical approaches, consistently applied, to try to avoid the heterogeneity that can create drag and friction, and leverage through platforms: we want to centralize some things in order to decentralize other things.

Over the 10 years-plus that Nubank has been around, we started in Brazil, we entered Mexico, we entered Colombia. At this point, about half of all Brazilians and half of Colombians are Nubank customers, and about 78% of Mexicans have a bank account [with us]. We recently passed a milestone of reaching a hundred million customers — 92-some million in Brazil, and the rest in Mexico and Colombia. And that's a lot of customers. Not a lot of companies reach that level.

A lot of customers making a lot of transactions. Many people use this as their primary, or indeed only, method of payment, method of collection — and that produces a whole lot of data.

01 Where we started: cloud-native on the transactional side, monolithic batch on the analytic side

Now, the platforms that we developed were mostly on the transactional side. We were cloud-native, we were DevOps from day one. The end point of most companies' digital transformations is where we began. But that's all on the operational side of things — transactions, microservices, long-lived databases.

On the analytics side, we developed a batch system, and it's gotten pretty big. It served us very well and grew to pretty large scale. It was centralized, homogeneous, somewhat monolithic, and it had characteristics that maybe worked a lot better in the early days than they did [later].

When I joined, in the operational world your work unit is microseconds — one request, one response, done. Over in the analytical world, we're talking about jobs that run for hours: some 14, 18, 20 hours.

The deployment time on the analytic side got very large because when you have a high-consequence system with a long lead time, you have to be very careful about changes that are introduced. So you create more and more and more testing and upfront validation. Eventually, that takes long enough that you get a merge queue. Merge queues are a big warning sign. If you have a merge queue, watch out.

And so we reached a situation where we had these long lead times, a centralized operations group, unpredictable results, variable reliability, low accountability, accumulating technical debt. Probably a lot of this sounds familiar.

There was one challenge we had that was not that common, because we had a single code base. We had great visibility, great leverage — but it also got big. Seven million lines of Scala code, more than any human could keep track of. A hundred pull requests a day being merged. And just building took six hours. Most people didn't have a machine powerful enough to run a build; you had to get a cloud instance to run a build.

02 A new paradigm is required

So we reached a point where it was clear that we needed a new paradigm for how we were going to do analytics. Nubank started as a data-driven company. We're still very data-driven. We want to empower every Nubanker to make the most out of the data that we've got — but it was getting increasingly difficult, costly, and unreliable.

So we said, what would the new paradigm be? Well, we want teams to move independently and autonomously at their own pace. We want a self-service platform to enable them and empower them. We want high discoverability, high visibility.

And so we started thinking, well — actually, we've heard this before. This is what the microservices transformation was about in the application or operational layer. And it turns out that when you look for the microservices of data, it's already been given a name. It's referred to as data mesh.

Now, like a lot of terms in our industry, data mesh has a certain amount of cachet that various vendors are trying to get in on. But this time we've actually got a definition, in the book by Zhamak Dehghani called Data Mesh, where she defined what data mesh is based on these four key principles:

- Domain ownership of the data. This is decentralizing things, but also taking a high degree of responsibility for the data. Many organizations treat data as exhaust: it's kind of a waste product that maybe you can recycle and get some value out of. They don't view it as something that intrinsically has value and can have value added. But that's where the next principle comes in.
- Data as a product. It's really bringing that product thinking, product approach, to the data, and making sure that each data product is valuable, viable, well-documented, has quality guarantees, and so on.
- Self-service platform. Now we're decentralizing some things, but not everything. We still believe that a self-service platform is best provided by a central group with high leverage.
- Federated governance. We're also using that platform to help create the mechanisms for federated governance. So responsibility gets decentralized, but the mechanisms are still high-leverage and centralized so that everybody gets the benefit of the governance.

03 Good fences make good neighbors

When we talk about decentralizing ownership, part of what we're doing is trying to make sure that we have good boundaries, clear fences. Good fences make good neighbors.

So we need some things to remain private, while some things are public. And what we're choosing to keep private are:

- The definitions of the transformations: what are the functions you're applying to go from source to processed data?
- Any intermediates: temporary tables, partial products, things that you might want to change without having other people dependent on them.
- Storage.

Storage being private is actually a big change. In the data world, people don't rely on APIs and interfaces. They rely on files. And almost the first question you ask about a data environment is: Parquet, Delta, Iceberg, Hudi? Because everything meets at the file format. We're actually moving to keep storage private so that it is an implementation choice of the data product.

Now, the things that we're making public are the things you would like to couple to:

- Schema, of course
- Data classification for personal data protection
- Lineage. This is actually important to make public so that people understand and trust where the data comes from. So even though the transformation is private, the lineage and sources are public and visible.
- SLOs and performance
- Quality rules. These are also important to make public and visible because we are trying to create a high-trust environment.
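The public/private split described above can be pictured as a small data structure. This is only an illustrative sketch, not Nubank's implementation: the class names, field names, and the example product (`card_spend_daily`, `cards-analytics`, the column names) are all hypothetical, and the SLO is reduced to a single freshness number for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class PublicContract:
    """The parts of a data product that consumers may couple to."""
    schema: Dict[str, str]            # column name -> type
    classification: Dict[str, str]    # column name -> data classification label
    lineage: Tuple[str, ...]          # names of upstream sources
    slo_freshness_hours: int          # hypothetical SLO: maximum age of the data
    quality_rules: Tuple[str, ...]    # human-readable quality assertions


@dataclass(frozen=True)
class PrivateImplementation:
    """The parts the owning team may change without telling anyone."""
    transformation: str               # reference to the query or job
    intermediates: Tuple[str, ...]    # temporary tables, partial products
    storage_format: str               # an implementation choice, not an interface


@dataclass(frozen=True)
class DataProduct:
    name: str
    owner: str
    public: PublicContract
    _private: PrivateImplementation = field(repr=False)

    def describe(self) -> dict:
        """Return only the public face, as a catalog entry might show it."""
        return {
            "name": self.name,
            "owner": self.owner,
            "schema": dict(self.public.schema),
            "classification": dict(self.public.classification),
            "lineage": list(self.public.lineage),
            "slo_freshness_hours": self.public.slo_freshness_hours,
            "quality_rules": list(self.public.quality_rules),
        }


# Hypothetical example product, used only to exercise the sketch.
card_spend = DataProduct(
    name="card_spend_daily",
    owner="cards-analytics",
    public=PublicContract(
        schema={"customer_id": "string", "spend_date": "date", "amount": "decimal"},
        classification={"customer_id": "personal", "spend_date": "non-personal",
                        "amount": "non-personal"},
        lineage=("card_transactions", "merchant_categories"),
        slo_freshness_hours=24,
        quality_rules=("amount >= 0", "customer_id is never null"),
    ),
    _private=PrivateImplementation(
        transformation="jobs/card_spend_daily.sql",
        intermediates=("tmp_card_spend_by_merchant",),
        storage_format="Iceberg",
    ),
)
```

Calling `card_spend.describe()` returns the schema, classification, lineage, SLO, and quality rules, and nothing about the transformation, intermediates, or storage format. That is the point of the fence: the owner can switch file formats without breaking a consumer.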

04 Simple idea, not so easy to implement

This sounds great, but obviously the devil is in the details, and there are some implementation challenges that we have to deal with.

Of course there are technical hurdles. The data landscape is highly fragmented. If you look at the vendor map, it's got thousands of companies on it, each trying to create a fractally defined niche where they can extract some value from your company. I believe we're going to see a lot of aggregation, but it hasn't happened yet. So there are interoperability challenges. There are just big implementation challenges, because the data itself is large, processing time is long, volume is high, and a tiny percentage change in efficiency of your environment can cost tens of thousands or hundreds of thousands of dollars before you know it.

There is organizational change. And of course, anytime we start talking about organizational change, a whole bunch of other dynamics come into play. We have a question, because we're distributing some responsibilities to people who've never had those responsibilities before. Data scientists often work in a kind of one-shot, report-and-deliver type of project methodology. They don't often expect to sustain something and go back and refactor and keep it up to date. So we're changing their responsibilities.

And actually, it turns out sometimes we don't even know who is responsible for the data. Somebody did a project three years ago, they're gone. Reorgs have happened. Squads have changed, or merged, or split. Lines of business have been created or shut down. So even identifying who's responsible is itself a project.

05 Data as a product: the quantum of sharing

As we do this, we are trying to create this notion of product thinking around data. So the product is the quantum of sharing, and this is what we want to get people to sign up to and care about. In order to be a good product, it needs to be:

- Self-describing: published in a catalog
- Discoverable: discoverability can be a big challenge in many organizations. Not that you don't have the information available, but it's in too many places, and people don't know how to find it. So Slack is sort of the ultimate discovery mechanism.
- Understandable: if you have something like 'customer ID,' that tells you almost nothing. Many fields are duplicated with similar names but different definitions in many different places. So creating that understanding of the data product is also important to make the product viable and valuable.

There are other things along the way about the technical implementation. The data product is also where we apply access control policies. It's where we put SLO assertions. But those are sort of implementation or downstream effects.

06 The self-service platform: managing cognitive load

In terms of our self-service platform, the idea, like every platform, is to manage cognitive load. The data scientists, machine learning engineers, AI engineers — they all have a huge domain of responsibility. So interacting with the technology is not something that we really want them to worry about until they need to worry about it.

We want to make the simple things easy, of course, and we want to provide flexibility. But these automated guardrails, again, are important because of the high cost of inefficiency or errors.

We recently had an example where somebody changed their query — they removed an `OR` in their query — and that changed it from taking 900 minutes to one minute. And everybody who works in data has a story like this — these three-orders-of-magnitude changes. So providing a way to understand what's going to happen when you run this for real is crucial in this area. We want to give users a lot of power, but help them see what's going to happen.
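A guardrail of this kind could be as simple as comparing a pre-run estimate against the job's usual runtime and flagging order-of-magnitude shifts. The sketch below assumes the platform can produce some runtime estimate before the job runs for real (from a query planner or a sampled run; the talk does not say which). The function name, thresholds, and verdict labels are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DryRunReport:
    estimated_minutes: float
    baseline_minutes: float
    orders_of_magnitude: float  # log10 of estimate / baseline
    verdict: str                # "ok", "warn", or "block"


def review_estimate(estimated_minutes: float,
                    baseline_minutes: float,
                    warn_at: float = 1.0,
                    block_at: float = 2.0) -> DryRunReport:
    """Compare a pre-run estimate with the job's usual runtime.

    warn_at and block_at are thresholds in orders of magnitude
    (1.0 means ten times slower than the baseline).
    """
    if estimated_minutes <= 0 or baseline_minutes <= 0:
        raise ValueError("runtimes must be positive")

    shift = math.log10(estimated_minutes / baseline_minutes)
    if shift >= block_at:
        verdict = "block"
    elif shift >= warn_at:
        verdict = "warn"
    else:
        verdict = "ok"
    return DryRunReport(estimated_minutes, baseline_minutes, shift, verdict)


# The case from the talk, read in reverse: a one-minute query that a small
# edit turns into a 900-minute query would be stopped before it runs.
regression = review_estimate(estimated_minutes=900, baseline_minutes=1)
improvement = review_estimate(estimated_minutes=1, baseline_minutes=900)
```

Here `regression.verdict` is `"block"` and `improvement.verdict` is `"ok"`. The user keeps the power to run anything, but sees the consequence first.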

07 Supporting the data owner

In terms of supporting the data owner, this was a key pillar for us — a key initiative early on. We knew that we needed to help them with observability of the platform.

In the centralized-operations days, we had developed a methodology, if you want to call it that, in-house, where our teams knew how to figure out exactly what was going on in the data platform at any time. And they did it with a combination of CLI tools and Databricks notebooks that they'd kept in their private folder forever. There were BigQuery queries that could tell you what had been happening over the days. But if you tried to teach this to one of the users, they would throw their hands up and say, 'You know, forget it. You're going to run this for me, I'm not going to do this.'

So we needed to create observability. We needed to improve automation, particularly the pre-production automation. We wanted to give them optimization options — I generally refer to it as the knob to turn of: do you want to spend more to get it done faster, do you want to optimize the costs, do you want to run frequently or infrequently? And of course, the guardrails.

08 Supporting the data consumer

We also found out, though, along the way, that we needed to support the data consumers as well, because consumers had developed their own methodology. A lot of it was based on hitting Command-Space and finding a name somewhere in the seven-million-line code base. If it sounded like the right attribute, they would plug it into their query. Maybe it was the right one, maybe it wasn't.

So we needed to provide them mechanisms to discover, trust, understand, and use the data products.

09 How to get there: incrementally

It was a tall order. And of course the question is always, how do you get there? How do you make a big change like that in a complex organization? Even a 10-year-old organization has some legacy to grapple with. And of course, as always, the answer is: you get there incrementally.

We had architecture increments along the way — introducing different options, introducing new elements of visibility that we could add on without tearing up the foundations of what was currently running.

PR bots — pull-request bots, not public relations. But we did do a lot of public relations, and that was the change-management aspect.

10 Change management: communicate relentlessly

Explaining the why of this change is vital, and you have to explain the why many times and many places. I talk about communicating relentlessly, and I've sometimes said that by the time you're sick of saying something, they're just beginning to hear it. If you say something once and then assume everybody's got it, six weeks later, they'll figure it's dead because they haven't heard about it in six weeks.

So you have to keep saying it over and over again. And you have to keep saying it in multiple places and multiple levels. People ask, do you start a change top-down or bottom-up? The answer is, of course, both. You have to build support at the top. You have to build understanding and support at the bottom. And don't forget the middle layers — the management layers — that also need to be brought along the journey.

11 Feedback loops and cost attribution

A big part of what we looked at was the feedback loops. What are the incentive structures that are sort of pinning the behavior in place, and how can we change those incentive structures?

One of the first ones was cost attribution. When one team has the budget for running the entire data environment, and somebody else gets to add as much workload to that as they like without having to pay anything for it — of course you're going to have exploding workload. Once you close the feedback loop so that the costs are attributed back to the business owner that set it up, then they start making decisions about the value of the work relative to their business. And that's exactly the kind of decision we want them to make.
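Closing that loop is, mechanically, a roll-up of job costs by the owner who set each job up. A minimal sketch, assuming each job run records an owner and a compute-hours figure and that a flat rate converts hours to cost; the record shape, team names, and rate are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JobRun:
    job: str
    owner: str            # the business owner that set the job up
    compute_hours: float


def attribute_costs(runs: Iterable[JobRun],
                    rate_per_compute_hour: float) -> Dict[str, float]:
    """Sum the cost of every run under the owner who created the workload."""
    totals: Dict[str, float] = defaultdict(float)
    for run in runs:
        totals[run.owner] += run.compute_hours * rate_per_compute_hour
    return dict(totals)


# Hypothetical runs and rate.
runs = [
    JobRun(job="credit_risk_features", owner="lending", compute_hours=10.0),
    JobRun(job="card_spend_daily", owner="cards", compute_hours=4.0),
    JobRun(job="collections_scoring", owner="lending", compute_hours=6.0),
]
costs = attribute_costs(runs, rate_per_compute_hour=2.5)
```

With these numbers `costs` is `{"lending": 40.0, "cards": 10.0}`. Each owner now sees a figure they can weigh against the value of the work, instead of one team carrying the whole bill.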

We want to use data. We're not about just driving costs down; we're about driving value up. So we want people who understand both the value and the cost, and as much as possible we want that to be the same person.

12 Tracking, visibility, and messages for each layer

Tracking and visibility were key. We have all kinds of dashboards and spreadsheets and burn charts to show, business unit by business unit, 'your data domain migration is at 86%. We thought at this time of the year it would be at 80%, you're doing great' or 'we thought you'd be at 80% and you're at 57%. Let's talk about how we can help accelerate that.'
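The comparison behind those messages is just actual progress against the planned figure for the date. A small sketch using the percentages quoted above; the class name and the business-unit names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationStatus:
    domain: str
    actual_pct: int     # share of the data domain migrated so far
    expected_pct: int   # where the plan said it would be by now

    @property
    def gap(self) -> int:
        """Positive means ahead of plan, negative means behind."""
        return self.actual_pct - self.expected_pct

    def message(self) -> str:
        summary = (f"{self.domain}: migration is at {self.actual_pct}%, "
                   f"expected {self.expected_pct}%.")
        if self.gap >= 0:
            return summary + " You're doing great."
        return summary + " Let's talk about how we can help accelerate that."


ahead = MigrationStatus(domain="lending", actual_pct=86, expected_pct=80)
behind = MigrationStatus(domain="cards", actual_pct=57, expected_pct=80)
```

`ahead.gap` is 6 and `behind.gap` is -23, so the first unit gets the encouraging message and the second gets the offer of help.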

We created messages for multiple layers in the organization. Executive-level messages at the monthly operations review. Squad-level messages for the implementers. And management dashboards.

13 Don't forget your own team: put your own oxygen mask on first

As you're doing any kind of change, don't forget about your own team. I love how Lauren said, 'put your own oxygen mask on first.' It's very hard to drive change into the rest of the organization if your own team isn't fully bought in.

And one of the big things that can cause loss of buy-in is if you're creating a legacy world and the promised land — and some people feel like they're going to be left behind, supporting the legacy and never getting to work on the new stuff or live in the better environment.

So make sure you bring everybody along. And you probably need to do some investments — not just in transforming things, but in reducing the operational burden for the team that's working on the current environment, to create the slack for the change to happen. Change happens in the white space in your organization, so make sure that you're building that as well.

14 Summing up

So summing up: we are moving toward data mesh. We think it's a promising paradigm. It has advantages in decentralizing the production and consumption of data, but still allows us to create high leverage and high efficiency and remove complexity through the platform and governance. And we can federate the execution of the governance across the organization.

This is the journey we're on. We're about two years into it. Things are going well. This is where I should have the slide with the 'after' picture and all the great numbers, but we're still midway, we're still evolving. I'll come back in a year or two to let you know how it turns out.

15 Seeking help: the 'Swagger for data' problem

In the meantime, in terms of things that I'm looking for, help that I could use from the seniors: when you get to the world of data, there is no such specification as 'the Swagger for data.' When you're in the API world, everybody has pretty much settled on Swagger, a specification that tooling vendors support and frameworks support. In the data world, there are a couple of competing specifications, but I wouldn't say either of them has great traction.

The data world is dominated by large vendors that don't have a huge incentive for interoperability. So the lock-in is high. I'd love to talk to people who have experience with either avoiding the data-vendor lock-in and focusing on the data product as the vehicle for interop, or anyone who's got experience with one of these data contract specifications.

Thank you, and good morning.