Cost Versus Stability in a Cloud Environment (Nubank)

Las Vegas 2024

In a cloud environment, where horizontal and vertical scaling are relatively easy and low friction, engineers often feel the need to choose between cost and stability. This talk posits that the cost-versus-stability trade-off is frequently imaginary. In many situations, stable systems are cost-effective systems. Join Cat to explore real-world examples of stability yielding cost savings in a cloud environment. These examples will also show how to frame engineering decisions in a way that accurately weighs cost and stability concerns, among others.

Full transcript

Cat Swetel

I am Cat Swetel. I work at Nubank. I don't really know what else to say about that. Welcome. I wanted to say thank you right off the bat because this is not known for being a super exciting topic, and I'm glad that you joined me here today. I am finding my people in this room. Yes. So thanks for being here. I hope it pays off for you.

Nubank — The Tech

As with my colleague Mike Nygard's talk earlier, if you were around for it: not everyone here is familiar with Nubank, so I thought I would give a little bit of a technical overview. Who are we in terms of technology?

We are a Clojure shop primarily. So we have over 2,000 Clojure microservices in our transactional environment. We deploy kind of a lot — more than 200 deploys a day for backend services. We have another fun, quirky thing: we use Datomic as our canonical database. The important thing about that for the purposes of this talk is that it is an append-only database. Yes, pretty interesting in terms of cloud costs when you accumulate data forever. If you want to know more about it, we did just have a Jepsen test recently, and I think it's a nice overview.

We are AWS native. We were founded in 2013, right after the São Paulo AWS region opened. I don't even know if I would be standing here today talking about this if it weren't for that fun fact. And we have a roughly cell-based architecture. We have lots of groupings of services based on when the customer was acquired.

You may know us from such hits as last year's AWS re:Invent keynote, where we got a nice shoutout for practicing this idea of the Frugal Architect.

Business context

Here are some other things about our business. We covered the technology side; here's the business side. The TL;DR is that we grew very, very quickly. We were founded in 2013 with zero customers, and now it is the year of our Lord 2024, and we have 105 million customers. I took this from the earnings report from Q2 so that no one in comms could yell at me about saying the wrong number. So yes, with many, many customers, we had to scale up very, very quickly. That is an important fact that will come back later in the talk.

Everyone always loves — they look at this, they say, "Wow, you acquired all those customers and you managed to keep costs so low. Cost to serve remains under a dollar per customer." No, you are wrong. We didn't do that. It's not, "Oh, you got all those customers and you still managed to keep costs low." No. We got all of those customers because we keep costs low.

The whole idea of the business is that we are able to extend financial services to customers who may or may not already have had access to those things. Not all of those customers are the flashiest, high-dollar customers. The operating cost of the actual platform has to remain low, or we cannot democratize financial services. We can't democratize much of anything if the cost per customer is going up and up. So this is a key enabler of our business model. It's not a "Wow, you also managed to do that." Does that make sense to everyone? Lots of head nods. Okay. Alright.

The Pix shock

So that low-cost operating model certainly got put to the test in 2020. The Central Bank of Brazil — our headquarters is in Brazil — they approached us and said, "We have a super cool idea. We would like to do truly instant transfers between banks." Not like — this is a very interesting concept for Americans. How many of you are American? Oh, a lot of people. Okay. So it's not like Venmo, where you put money in a separate account and then you transfer it back to your bank, because people always say that's instant. "I instantly get the money." No, you don't — you have to transfer it to your bank.

So this is between banks. I do this transfer, and it is instantly available in your real bank account. You could pay a bill, you could pay anyone anywhere. So it's actually instant and liquid. You have the funds in your normal bank account available immediately. So not like Venmo. And also, people love to say, "Well, what about Zelle?" It is also not like Zelle, because it's the Central Bank of Brazil. So all of the transactions are mediated by the central bank, not by some conglomerate of other banks. The central bank is mediating all of the transactions.

I don't see any hand up to say "But what about…" — so I think we're good. They were like, "Yeah, we have this rad idea." And everyone's like, "Pumped, that sounds so cool, can't wait." And then they were like, "And we're going to do it in six months." And everyone was like, "Oh — we have to build new payment rails in six months." And the answer to that is yes, if you want to participate, you have to be done in six months. Then a couple months in, they say, "Hey, guess what? Also, we're going to have this really aggressive SLA on here. It has to settle instantly. And we are not messing around with that. You must settle instantly." So then we get these latency requirements late in the game. But it's all good. It all happens.

And then — beautiful success story, right? Within a year, the number of Pix transactions (Pix is the protocol) has eclipsed the combined total of debit and credit card transactions. So now plastic is going by the wayside. Everyone has their smartphones, they're doing these instant transfers. In a year, completely eclipsed transactions with plastic.

What does that mean for our lovely app? We are a digital bank, my friends. So this is all going through our app. No one is marching into a branch and going, "Oh gee whiz, I want to transfer something." No, that is not happening. It's all through our app. So there is an unprecedented level of load on our app. We also ship one app. All the products we have — whatever, dozens of products, I can't even keep track — we ship one app. They're all available through this one app. So when Pix drives so much load on the app, all the other products are impacted by that. Not a wonderful story when it comes to blast radius.

The temptation: throw money at it

So what is the solution to this? You should spend money, just so much money. You should throw money at that problem. Scale up all the things. That's a viable strategy. Don't lie to me, you all have done it.

Well, then there was a little bit of added pressure, because in 2021 we decided to go public. And remember how I said we have an append-only database? So people were naturally very curious: how are you going to keep costs under control when you are throwing money at this Pix problem and accumulating data forever? So that was not a fun time for me personally.

So we have this tradeoff now. Because people are saying to me, "We're about to go public. We've got to be super stable, don't make the news, all of this stuff. And also keep costs down. We have to impress these analysts." Do both of these simultaneously. And that seemed impossible at the time. I also didn't really know which one to choose, because it seemed like a lose-lose situation to me.

The intuitive moves

So we started trying the normal things that smart people try. We further decomposed capabilities so that we can really scale each one of them independently from the others. Common patterns for, "Okay, how do you save some money here? Okay, I'm going to decompose. I got this. Decouple. And I'm just going to scale this one little baby." That did not go far enough, as you might imagine. So there were all of these intuitive changes that we made, and then there were some counterintuitive changes that we made. And I think that's the most exciting part.

Counterintuitive change 1: add NVMe SSDs to Datomic

So the first one — get ready, nerd alert here. We have Datomic, our canonical database. This is the database that most of those thousands of services use. If any of you use Datomic, you're like, "Oh, well there's all these layers of cache" — and yes, but this is a 30-minute talk, I'm not going to go into that. Let's just say there's an in-memory cache, and there's an external cache, ElastiCache in our case.

As you scale up and you have more and more transactions, more and more is going out to the external cache, which makes sense, but that's a problem for us. So what do we do? We try to beef up those machines to get more in-memory cache capability. Spoiler alert: we threw away a lot of money on that. It was very expensive and it didn't deliver the performance gains that we were hoping for.

So we decided to add disks, which are expensive. They are, that is true. So we add disks, and now we have more cache capacity there, and better querying, lower latency, higher throughput, all of those things.
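The layering described here (an in-memory cache, added local disks, and an external cache) can be pictured as a read-through lookup that tries the fastest tier first. The sketch below is only an illustration of that general pattern, not Datomic's actual implementation. The names `LruTier`, `TieredCache`, and `load_from_storage`, the tier order, and the capacities are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Optional


class LruTier:
    """One cache tier with a fixed capacity and least-recently-used eviction."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


class TieredCache:
    """Read-through cache: try tiers fastest-first, then fall back to storage."""

    def __init__(self, tiers: List[LruTier],
                 load_from_storage: Callable[[str], bytes]) -> None:
        self.tiers = tiers
        self.load_from_storage = load_from_storage  # hypothetical backing store
        self.hits: Dict[str, int] = {tier.name: 0 for tier in tiers}
        self.hits["storage"] = 0

    def get(self, key: str) -> bytes:
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                self.hits[tier.name] += 1
                for faster in self.tiers[:i]:  # promote into faster tiers
                    faster.put(key, value)
                return value
        value = self.load_from_storage(key)
        self.hits["storage"] += 1
        for tier in self.tiers:
            tier.put(key, value)
        return value


if __name__ == "__main__":
    # Hypothetical sizes: a small memory tier in front of a much larger disk tier.
    cache = TieredCache(
        [LruTier("memory", 2), LruTier("disk", 100)],
        load_from_storage=lambda key: key.encode(),
    )
    for key in ["a", "b", "c", "a", "a"]:
        cache.get(key)
    print(cache.hits)
```

In this toy run the small memory tier evicts `a` before it is read again, and the larger disk tier absorbs that read instead of the slower store behind it. That is the kind of effect extra cache capacity on disk is meant to have.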

If you look at the cost for a specific service — so we have thousands of these microservices, and you might look at the cost of the specific service and say, "You added disks, that's expensive. Bad job, Cat." But if you look at the cost of the flow overall, the cost has gone down. The total carrying cost for the flow overall, because you're getting higher throughput and lower latency from that service, has gone down overall.

And in one case, we have a service that tracks the transfer out for those Pix transactions. For every one extra dollar we spent on disks, we avoided spending $3,500 in scaling costs across that flow. So again, if you zoom way in on the one service, it looks like, "This is stupid, you made this much more expensive." If you zoom out to the flow, this is great. You saved a lot of money. When you get to the scale of 105 million customers and all those many, many Pix transactions a day, this turns into a huge number. So that was the first counterintuitive change that we made. And again, the idea of zooming out was the most important thing.
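The "zoom out" arithmetic can be written down as a tiny calculation that compares the cost change of a single service with the cost change of the whole flow. This is a sketch under stated assumptions: only the 1-to-3,500 ratio comes from the talk. The service names and dollar figures are hypothetical and were chosen just to reproduce that ratio.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceCost:
    """Cost of one service in a flow, before and after a change."""
    name: str
    before: float
    after: float

    @property
    def delta(self) -> float:
        return self.after - self.before


def flow_delta(services: List[ServiceCost]) -> float:
    """Net cost change across every service in the flow."""
    return sum(s.delta for s in services)


def avoided_per_dollar(changed: ServiceCost, services: List[ServiceCost]) -> float:
    """Dollars avoided elsewhere in the flow per extra dollar spent on `changed`."""
    avoided = -sum(s.delta for s in services if s.name != changed.name)
    return avoided / changed.delta


if __name__ == "__main__":
    # Hypothetical figures; only the resulting ratio matches the talk.
    transfer_out = ServiceCost("pix-transfer-out", before=100.0, after=101.0)
    flow = [
        transfer_out,
        ServiceCost("upstream-a", before=2000.0, after=500.0),
        ServiceCost("upstream-b", before=3000.0, after=1000.0),
    ]
    print("service view:", transfer_out.delta)   # this one service got more expensive
    print("flow view:", flow_delta(flow))        # the flow as a whole got cheaper
    print("avoided per dollar:", avoided_per_dollar(transfer_out, flow))
```

Looking only at `transfer_out.delta` gives the "bad job" reading. Summing the deltas across the flow gives the reading that mattered.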

Counterintuitive change 2: switch to ZGC

Then the next counterintuitive — who is a garbage collection enthusiast in this room? Two people. Hello, welcome. Yes, as Gene mentioned earlier in his opening remarks, Clojure runs in the JVM, and we're out there thinking about garbage collection, and it's a super-fun time.

So we dug in on this. What we were noticing is, we used the G1 garbage collector. Fine, doing great, but we would have these long stop-the-world GC pauses. Well, what does that cause the rest of that flow to do? So this service has a long stop-the-world GC pause. All the other services are like, "What are you doing over there?" Right? And it kind of causes panic across the flow.

So we decide, let's tinker with this. Let's try ZGC, or the Z Garbage Collector. (When I said this to AWS, they're like, "You have to say the whole thing — no one knows what ZGC is." Yeah, anyway.) So it's more expensive theoretically, because it's CPU-intensive, but you get shorter GC pauses. You don't have that long stop-the-world pause.

If you're looking at the one service, you might say, "Oh, I don't know if this is a great tradeoff." But if you're looking across the flow, all those other services aren't sitting there waiting for the end of that long GC pause. So across the flow, you end up being much cheaper. So for us, this was a great tradeoff — but only if you zoomed out enough to see it.

Does that make sense to anyone in this room? Even the people who are not GC enthusiasts, raise their hands. Alright, I'm doing a good job. Yes. So we're starting to see a trend here: it's cheaper because it's more stable. We smoothed that out a little bit in that one service, and the whole flow got cheaper. So more stable, cheaper. We don't have that thrashing in the demand. It is cheaper overall.
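One way to reason about why a pause in one service costs its callers is Little's Law: the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The sketch below is a back-of-the-envelope model, not Nubank's analysis. The request rate, pause lengths, slot counts, and function names are all hypothetical.

```python
import math


def in_flight(arrival_rate_per_s: int, latency_ms: int) -> float:
    """Little's Law: average requests in flight = arrival rate x latency."""
    return arrival_rate_per_s * latency_ms / 1000


def backlog_during_pause(arrival_rate_per_s: int, pause_ms: int) -> float:
    """Requests that pile up at the callers while the callee is paused."""
    return arrival_rate_per_s * pause_ms / 1000


def caller_instances_needed(peak_in_flight: float, slots_per_instance: int) -> int:
    """Instances the callers need to hold the peak number of open requests."""
    return math.ceil(peak_in_flight / slots_per_instance)


if __name__ == "__main__":
    rate = 2000   # requests per second (hypothetical)
    steady = in_flight(rate, latency_ms=10)

    # Hypothetical pause lengths: one long stop-the-world pause, one short pause.
    long_pause_peak = steady + backlog_during_pause(rate, pause_ms=500)
    short_pause_peak = steady + backlog_during_pause(rate, pause_ms=1)

    print(caller_instances_needed(long_pause_peak, slots_per_instance=50))
    print(caller_instances_needed(short_pause_peak, slots_per_instance=50))
```

Under these made-up numbers the callers must be sized for the backlog created by the long pause, even though the paused service itself is the cheaper one to run. That is the flow-level effect the per-service view hides.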

Counterintuitive change 3: culture — cloud cost champions

Okay. Next — of course, here, I feel like you always have to have one culture element, but for us, it ended up being true. So the last counterintuitive thing is that we decided we needed to change the culture around cloud costs at Nubank.

Every established business unit — I admit, I don't know how many there are anymore, but there are dozens — they all have an AWS cost champion. (Yes, I tried to make that sound exciting and no one cheered. Very sad for me.) Each of the business units is classified according to their total AWS spend and how established their product is. In other words: can you predict at all? Do you have any capability to predict how much you're going to spend?

We have this idea that there should be AWS cost champions who are responsible just for thinking about AWS costs from the point of view of the business unit. So they should be able to think about the flow — the customer flow overall — as opposed to just looking service by service by service. They should be informed by the business context and be able to see that. Does that make sense? Terrific.

We started off with a cloud-efficiency group, which looked more at the opportunities from the platform up. So I'm down here in my sad little infrastructure corner in the transactional environment, and we would look up from the infrastructure into the business units to identify opportunities. And then we transitioned to cloud financial management, which looks from the business down.

Charge-back patterns and CloudZero

Now not just AWS cost, but lots of our pay-for-use software and tooling is charged back to the business units. I will say, some things should not be charged back. We will never, ever charge back CI/CD, because we don't want people saying, "Oh yeah, fewer deployments, let's do that." That's an anti-pattern. We track the cost of our CI/CD platform, and we can see when it's doing weird stuff, but we will never charge back that cost. And there are some other things that fall into that pattern. But most things are charged back to the business units.
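The charge-back rule described here (charge most costs to the business unit, but only track a few categories such as CI/CD) can be sketched as a small allocation function. This is an illustration under assumptions, not Nubank's tooling or CloudZero's API. The category names, business-unit names, and amounts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

# Categories that are tracked centrally but never billed to a business unit.
# "ci-cd" mirrors the example in the talk; the label itself is hypothetical.
NEVER_CHARGED_BACK: FrozenSet[str] = frozenset({"ci-cd"})

LineItem = Tuple[str, str, float]  # (business unit, category, cost)


def charge_back(
    line_items: Iterable[LineItem],
    excluded: FrozenSet[str] = NEVER_CHARGED_BACK,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split costs into per-unit charges and centrally tracked totals."""
    per_unit: Dict[str, float] = defaultdict(float)
    tracked_only: Dict[str, float] = defaultdict(float)
    for unit, category, cost in line_items:
        if category in excluded:
            tracked_only[category] += cost  # visible, but nobody is billed
        else:
            per_unit[unit] += cost
    return dict(per_unit), dict(tracked_only)


if __name__ == "__main__":
    items = [
        ("payments", "compute", 120.0),
        ("payments", "ci-cd", 30.0),
        ("lending", "observability", 45.5),
        ("lending", "ci-cd", 10.0),
    ]
    charged, tracked = charge_back(items)
    print(charged)
    print(tracked)
```

Keeping the excluded categories in a separate total preserves the ability to spot anomalies in them without giving any team an incentive to deploy less often.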

We use this software that's called CloudZero. They are not paying me or anything to be here. It's just I feel this is an emerging space and they are way ahead of others that I have experienced. And the cool thing is that they allow us to charge back other software in CloudZero. So the business units can say, "What is my footprint overall for cloud costs and software costs together?"

Also — why are they able to do this in a way that is different? I think it's because their CTO has this idea that every engineering decision is a purchasing decision. So engineers need to be constantly informed and have that feedback loop on cloud costs, so that they can start to develop an intuition about cloud costs. You don't just pop out of the womb having some idea of what is going to impact your cloud costs. That's an intuition that we have to grow in people. Even super-smart people.

Why-not-both, and the era of Stable Efficiency

So when I was faced with this terrible decision, "What should I choose, cost or stability?", I chose both. Because it turns out that it's not a real tradeoff. People think you can't be both stable and low-cost, because in order to be stable, in order to scale in a way that is stable, you need to just throw more machines and beefier machines at the problem forever. And I don't think that that's true.

So now at Nubank we talk about being in the era of Stable Efficiency, which is to say that when systems are stable, costs are more predictable and things are cheaper overall in many, many, many cases.

What's next: continuous resilience and the one-app dilemma

So that's kind of what we got going on. Now, what is next for us? We have these engineering principles. They're kind of in the Amazon tradition of having some tension between them. There's this idea of radical ownership, but also we want you to use canonical approaches. So some interesting things there. And one of them is about technical resilience.

I think next year when I'm here, this is potentially one of the things that I would be talking about: how to achieve resilience and still keep our low-cost operating platform, and how to make it continuous. Even a couple of years ago, all of our disaster-recovery exercises were a thing that everyone was a little bit uptight about. Now we've introduced a lot of automation, which helps keep the cost down and also means that we have many, many, many of them happening all the time. And a lot of times I don't even know now. I just get a Slack notification, "Hey, this is happening." So I think a lot of where we will go next is not just about stability, but about resilience.

And then the other thing is: how do we live into these principles when we are many, many, many products shipping into one app? What does that mean? Because different principles are more or less meaningful for different business units, but we are shipping to one app, so we have to be good stewards of that app. It's an interesting dilemma that I don't think a lot of institutions face, especially in the US, because there are different regulations about what can be co-located in a financial services app. It is quite different in other places in the world. And so I think this is a unique challenge for us.

So, you know — tell all your friends how awesome this was and now you're really enthused about garbage collectors. And maybe I'll be back next year.