POST No AWS Bills: Cloud Cost Optimization Without APIs

Las Vegas 2019

Corey Quinn

Cloud Economist · The Duckbill Group

Full transcript

The complete talk, organized by section.

Corey Quinn

We begin with a simple, non-controversial statement that no one will take personally. Specifically, that all of the tools that help you optimize your cloud bills kind of suck.

Let me back up for a minute. Who here has no idea who I am until today? Oh, wow. You really should have gone to a talk that was actually good. But we're going to see why this happens. I'm a cloud economist, which is like a regular economist, except I dress far better. I also rant incoherently about clouds, and my first language is, as you're probably picking up by now from the accent, sarcasm. And I wind up doing an awful lot, mostly crapping all over the conference hashtag on Twitter. I'm hilarious in my own mind, if nowhere else. It's fine.

But what I do for a day job is I go into large companies, definitionally, and I fix the horrifying AWS bill, a small problem experienced by pretty much everyone. And this has led to some interesting insights as I've gone down this path, and I wanted to share some of them with you today.

So, we go back to my original non-controversial statement, and I could name companies here, but I won't, because first, some would be very annoyed that I named them, and then others would be annoyed that I didn't name them. And that's how this whole thing works. So I'm just going to have sharp elbows regardless. It's part of my milieu. It's fine. The problem is that all these tools are more or less equivalent, despite people's urge to tell you otherwise.

For example, they'll always tell you to buy some reserved instances, prepay for some capacity on a thing that you're about to turn off. Just set more money on fire. Why not?

Or they'll tell you to turn off those idle instances labeled the DR site. Fun story, by the way, slight detour for those of you with engineering mindsets. You want to have a hot DR site up and running, because if there's an availability zone or a regional outage and you're going to go fail over to another region, I hate to burst your bubble, you're not the only person with that plan. And the control plane often becomes saturated, so you're waiting an hour or two for things to spin up. If it needs to be there quickly, it needs to exist before the disaster happens. Don't turn it off to save money. Picture it like an insurance policy. Hopefully, you'll never need it. Here in reality, we play with matches an awful lot.

And of course, they're not going to tell you other things as you go down this path of, yeah, there are things we could tell you programmatically, but for some God-forsaken reason, we're not going to, because, well, it's not a big problem in our environment. The quality is uneven across all of these things, and it's not their fault.

This is a DevOps summit, and one of the things we do in DevOpses and summits and DevOps summiteses is we do blameless postmortems, because it's not about blame. It's about fixing systemic problems. So we conducted a blameless postmortem and found it was all VCs' fault. I didn't say I was good at it. I just said I knew how it worked.

And delving into this a little bit, it turns out that a lot of this comes down to venture capitalists having incentives that may or may not align with companies that they're working with or, importantly, their customers. For example, if you're a VC and you're investing in something, everything has to be massively scalable. If you think that perhaps consulting can be massively scalable, I urge you to talk to someone who started a consulting firm. Yeah. It's not. And fundamentally, since everything they want to fund has a low touch element to it, they're going for a plan of everything should be able to be self-service done with software as a service, which is totally going to solve things. And because they're VCs, of course, bonus points if they could work Uber for something in it. In this case, cloud billing, because Uber was the last really disruptive, innovative thing, and everything's waiting for the next thing. Before this, believe it or not, it was the next Yahoo. Sorry if anyone works at what used to be Yahoo. I know. It's a sad time for all of us.

I'm going to digress briefly into the problem with the entire approach for VC funding in this space. At the beginning, you may have been told that there would be no math. That was a lie. Let's do some math.

Let's assume for the sake of argument, based on last week's earnings call, that AWS makes about $35 billion a year. That may not be exactly correct, but it's at least directionally correct. Again, we're not checking your math work on this. Assume that a company goes out and charges a percentage of bill, as they all seem to want to do. Okay, great. Doing a little math there, we figure out that the total addressable market is a bit over a billion dollars. That assumes that one company could capture all of that. They won't. They have competitors. It further assumes that everyone is going to go out and jump on that. They're not. A lot of companies want or need to build their own. And it winds up, in turn, as a direct result, looking an awful lot like a pretty shitty unicorn. Incidentally, "The Shitty Unicorn Project" would be a great sequel or parody to Gene Kim's new book. I bet we can strike a deal if we position it right to him. He's not in here, is he? Yeah.
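A rough version of that back-of-the-envelope math, as a sketch. The $35 billion figure comes from the talk; the 3% fee, the adoption rate, and the market share below are illustrative assumptions chosen only to show how quickly "a bit over a billion" shrinks.

```python
# Back-of-the-envelope total addressable market (TAM) for tools that
# charge a percentage of the cloud bill.
AWS_ANNUAL_REVENUE = 35_000_000_000  # from the talk: roughly $35B a year


def addressable_market(revenue, fee_rate, adoption=1.0, market_share=1.0):
    """Revenue a vendor collects by charging fee_rate of customers' bills.

    adoption: fraction of spend whose owners buy a tool rather than build one.
    market_share: fraction of those buyers a single vendor captures.
    """
    return revenue * fee_rate * adoption * market_share


# ASSUMPTION: a 3% fee, with one vendor capturing every customer.
tam = addressable_market(AWS_ANNUAL_REVENUE, fee_rate=0.03)

# ASSUMPTION: half the market buys a tool, and one vendor wins a fifth of it.
realistic = addressable_market(
    AWS_ANNUAL_REVENUE, fee_rate=0.03, adoption=0.5, market_share=0.2
)

print(f"Best case TAM:      ${tam:,.0f}")
print(f"With competition:   ${realistic:,.0f}")
```

Under those assumed numbers, the best case is a bit over a billion dollars, and a single vendor's realistic slice is about a tenth of that.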

And there are some counterarguments to this, like cloud revenues are growing. True, they are. That's the fun thing about bills. You ever notice that? They don't tend to get smaller.

And there is more to cloud than AWS. I should call out that I bias towards talking in an AWS context because when I started this company a few years ago, it's where the expensive problems were, and by and large, that continues to be true. I'm not saying it's a one-horse race, but I found that specializing worked out well. This is not a comment on Azure, GCP, or IBM. It is on Oracle Cloud, but I digress. We'll get there.

So let's think some of these things through here.

First, the positioning that they're taking: oh, we're going to have a single pane of glass so you can look at all of your different cloud estates. Yeah, multi-cloud done that way is terrible, and no one is really asking for it. I'm not saying you don't have different workloads in different clouds, and I'm not saying that having different lines of business with different providers is a bad thing. I am saying that, absent a compelling argument to the contrary, having a workload you can magically deploy anywhere you want it to be onto any cloud provider forces you down to lowest common denominator APIs, and you spend an awful lot of time solving global problems locally. No one wants to do it. That's a separate talk where I spit a lot more when I talk.

And of course, it also presupposes, as this continues to grow, that the cloud providers themselves are not in turn going to make the tools to slice and dice the bills more accessible and better featured. Because, if there's one thing we know about cloud products, it's that they get worse with time.

And wait a minute, why am I asking VCs for business advice in the first place? I seem to recall you having a bit of a swing and a miss recently. It feels on some level, not all, but some, that being a venture capitalist is mostly about having won a lottery once, and now you're going to teach other people to do the same thing. Forgive me if that's not the most compelling sales pitch I can imagine.

And all of this is, of course, rather beside the point because none of the tools in question will actually fix the problems that you have.

What problems do you have? Let me tell you.

To be clear, I'm going to tell you what your problems are strictly in the context of cloud billing, because it's very hard to start a conversation with, "You know what your problem is?" and remain at a conference. Ask me how I know. You can probably guess, but eh. So invariably, you need to explain what's going on in a cloud environment in a financial sense to someone who is not directly involved with it. It turns out that people in finance, for example, generally don't spend a lot of time logging in to various cloud consoles and poking around in various services.

It also points out that the bill itself is vast and deep, and of course, it is not structured in a way that is going to answer business questions. Because it answers things like, how much did you spend on storage? And how much did you spend on data transfer? But not things like, how much did we spend in development versus production? Or how much did that sub-service cost once it broke? These are the things that companies want to know, and nothing out there from a tooling perspective aligns with answering them.
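To make the gap concrete, here is a minimal sketch of the same spend cut two ways. The line items, the field names, and the idea that each item carries an "environment" label are all hypothetical; nothing here reflects the actual bill format.

```python
from collections import defaultdict

# HYPOTHETICAL line items. The "environment" label is an assumption:
# it only exists if someone attached it to the resource up front.
line_items = [
    {"service": "storage", "environment": "production", "cost": 1200.0},
    {"service": "storage", "environment": "development", "cost": 300.0},
    {"service": "data transfer", "environment": "production", "cost": 450.0},
    {"service": "compute", "environment": "development", "cost": 800.0},
    {"service": "compute", "environment": None, "cost": 250.0},  # never labeled
]


def total_by(items, key):
    """Sum cost per distinct value of key; missing values become 'untagged'."""
    totals = defaultdict(float)
    for item in items:
        totals[item[key] or "untagged"] += item["cost"]
    return dict(totals)


# The question the bill answers: how much did we spend on storage?
by_service = total_by(line_items, "service")

# The question the business asks: development versus production?
by_environment = total_by(line_items, "environment")

print(by_service)
print(by_environment)
```

The second view is only as good as the labeling behind it: anything nobody labeled lands in "untagged", which is exactly the spend that is hard to explain to finance.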

It also runs into other problems, specifically that not everyone who touches a cloud environment in an engineering sense is going to be a responsible steward. And I mean that in both directions. Sure, there's someone who spins up the biggest instances all the time because bigger obviously means best and never turns anything off. But we also have people who love to spend weeks golfing a couple of hundred bucks off of their developer spend as if they have no idea what they actually cost. At some point, that adds no business value. There's a bigger problem to focus on. Context matters.

So what does finance tend to care about in the context of a cloud bill? Mostly, it's about allocation and prediction, but after you play the corporate game of telephone, that's not what anyone hears. I've traced this back countless times. It's true.

For example, do you know the AWS bill does in fact have tax consequences? In the United States at least, research and development tax credits are available to companies. That includes, as engineers think about it, pre-production environments. If you're in a position to be able to divide out what is pre-production and what is not, that has meaningful impact. Of course, if you just guess and give wrong numbers that you can't back up, it turns out some auditors would like a word, and it's not the fun, happy conversation. Earlier this morning when they said, "The auditors are here," and everyone clapped, that's never happened before. No one has ever been thrilled that auditors show up, I promise.

And more importantly, as they start doing predictions based upon where money is going, how do we wind up calculating out the cost of the goods or services we're providing? And when that winds up being a function of something else, it's a difficult conversation. But they wind up talking to engineering about these problems, and what is heard is, "You're spending too much on the cloud. Spend less." And sometimes that's even what they think they mean, but it isn't.

The problem is that visibility and being able to predict what your spend is going to look like in the future matter to businesses. They inform strategic decision-making, presumably. So I would call some of these decisions less than strategic, but we'll be charitable because that is, again, a separate talk.

And of course, finance doesn't have the context engineering does. They see a big Amazon bill, they think, "A whole lot of books, and I didn't see that many boxes being dropped off at the office this past week. And wait, how do they have time to read all of it anyway? Don't they have full-time jobs?" It's not that they're dumb. They're not. They just have a completely different skill set, and historically, we haven't done a great job of bridging the gap between finance and the engineering world. Similar to a decade or so ago, between development and operations. Please don't call it FinOps. That doesn't work. It just doesn't work.

Meanwhile, on the other side of the fence, what does engineering care about? Very often, whether they mean it or not, it comes down to feature delivery and how quickly they can get something out, and companies wholeheartedly encourage this, and that's fair. It's reasonable.

But I've been in conversations with engineers who were dinged on their annual review because they spent a couple of weeks unprompted to optimize the cloud bill. And in one case, they optimized for a couple of hundred bucks, and I get it. Yeah, that was probably not the best move. And in the other case, they knocked $10 million off the company's cloud spend, but that's not the feature they were supposed to be working on. Context matters. Communication matters.

And of course, the other half of what engineering does that we don't generally talk about at a high level is problems with the computers that break things. And if this doesn't resonate, I urge you to go troubleshoot an app for about eight hours, where it turns out the problem is either a comma or a white space character that isn't UTF encoded quite the way that the thing that's reading it expects it to be, and then tell me you're not table flipping.

The challenge here is that as you start building out governance and control, and yes, cost optimization is a part of governance, but as soon as you use that word, half the audience tunes out. I don't blame you. I'm in that half of the audience.

As you build these guardrails, it fundamentally has to be easier than not doing things the right way. Now, some of you are fortunate enough that you work in regulated industries, where it turns out that just disregarding the compliance requirements doesn't just mean you're fired, it means you're going to prison. Most companies, for better or worse, don't have quite that strong of a guardrail around this, and increasingly, companies do governance wrong. It comes out as trying to be this impenetrable gate rather than a constructive filter. You have to make it easier to do things the right way, or things break down.

Simple example. If it takes six weeks to provision a physical server, and we're in the cloud now, so we're saving time, it only takes four weeks to provision an instance, you're going to have someone at your desk every 20 minutes asking you to spin a thing up. They're never going to turn it off because it takes four weeks to spin up a new one, so they're just going to leave it around forever. And people don't remember to turn things off. It's never as exciting to clean up after yourself as it is to build anything. Source: any child with a pile of Legos.

In fact, someone on Twitter said the bill is not even about what you use; it's instead your bill for what you forget to turn off. Who here has not left something running in a cloud environment and found out because of the bill? Yeah, a few people. Yeah, most of us have been in that painful place.

And the reason you can't do any of this with tooling is that there's no API for business insight. The thing that might be the right move for one company could be completely disastrous for another because there's no good way to get information from people programmatically while remaining within the bounds of the law. Installing an API without consent is a problem when it comes to people.

Lastly, of course, you can't be a real economist unless you slap your name on a theory or a law or something else equally self-aggrandizing. And I don't have the attention span most days to write a tweet, let alone a book. So instead, I've noticed something that I wanted to bring up for your consideration based upon conversations I've had with fascinating people who are, in turn, doing fascinating things.

This is a data center, for those who don't know what one looks like. Ooh and ah at its majesty. Come on. There we go.

And if you're building out a data center for your application, and once the data center has finished construction, which, first, way more money than you thought, way longer than you thought, but once it's up and running, you can tell to a very high degree of fidelity what it's going to cost to run this for the next three years. Almost to the penny, which is awesome.

That's one end of a spectrum. Let's go to the other end. This is the actual architecture of a serverless application (and yes, I get paid a dollar every time I use the word serverless on a conference stage) that builds my ridiculous email newsletter and sends it out every Monday.

And there's a whole bunch of different things going on in here as it transforms my ridiculous nonsense into something that is still ridiculous, but now it's prettier and goes out to a whole bunch of people.

And I can trace, as Simon Wardley says, the flow of capital throughout this application. And let's say that now it's immaterial because we're talking a couple of cents a month. It does not matter to the business. But if I were sending out 10,000 email newsletters a day, which no one wants, I assure you. But if I were, I could start tracking, where is the expensive part? Where is it bottlenecking? And start focusing on that. And if someone in finance has a question, I can get into a very detailed, granular discussion about where that bill is.

This really winds up on a spectrum that I like to call cloudiness, and we'll get to that in a minute. On the one side, you have the story of data centers, then we move into instances or virtual machines, and then we move into autoscaling, the idea of scaling up and down. Yes, that is the AWS diagram icon for an autoscaling group. Their art is about as good as some of their service names, as far as being clearly understood. But roll with it. Take it on faith. That's what it's for. And then Docker or Kubernetes, a word I hate using on stage because I lose $5 every time I say it. Again, I digress. And then into the serverless world at the end of it.

To be clear, these are indicative. These are not prescriptive because it is possible to do it exactly wrong across the board. If you take what's running in your data center and shove it into a bunch of VMs and just run them instead in a cloud provider, first, it runs on money. Secondly, you haven't really solved anything unless the problem you're trying to solve is, you know what this company sucks at? Running data centers. And yeah, hey, I suck at running data centers. There's no shame in that. But I've never yet seen that as the stated rationale for doing a cloud migration. "Well, we suck at running data centers." Cool. Some folks will say now that you've done that, you're finished. VMware. Sorry, something caught in my nose there. But it's a transitional step. It doesn't get you far enough down the line for that to work.

And this isn't necessarily just me saying this. It turns out that if you say something dubious, you can cite it. Works out super well. As we learned from the State of DevOps Report this year, Dr. Nicole Forsgren, who I believe is giving a talk next, so she could not be here for me to call out in person, highlighted something foundational that NIST has come out with. If for nothing else, it was because it was the first time NIST had done something that wasn't 400 pages long, so people could actually internalize it and do something useful with it. And there's a high degree of correlation between how well people align with these cloud characteristics in their environment and how high-performing the team is. There's data that backs this up. So it's not just about doing things because the thought leader on stage said you should. There are measurable impacts that come out of this, and that's valuable and that's important, and that is something that opens up doors. But it also unlocks something else. Forgive my amazing skills as an artist. As you might tell, this is not my first skill set.

As something gets cloudier, it becomes inherently more cost efficient. And this shouldn't be a tremendous surprise to anyone because, with things like autoscaling, if you turn off things you're not using, you don't pay for them. It turns out it's super hard to go and sell a server on eBay, and then a day later when you need capacity, buy that server back on eBay and do that in an effective way. But with cloud providers, and only paying for what you care about, you can become far more cost efficient, all the way up to the end of serverless, where it is pay for invocation, pay for consumption-based billing. So there is no idle that you're paying for. That becomes fantastic, but it's certainly not easy to get there, and I'm not agitating that anyone should attempt to in one fell swoop. That's, again, a different talk by other people who are way more idealistic than I am.

But as things get more cost effective, the ability to predict what that's going to cost in absolute dollars and cents becomes almost nil. And finance doesn't like hearing the answer to a question of, "What is this going to cost us to run during the holiday rush?" being, "Eh." They want you to say it authoritatively. And it still doesn't help because this is over Slack, and how can you even express tone that way?
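A small sketch of that trade-off, under made-up numbers. The monthly request counts, the fixed monthly cost, and the per-request price are all hypothetical; the point is only the shape of the result, not the figures.

```python
from statistics import mean, pstdev

# HYPOTHETICAL traffic: five quiet months and one holiday rush.
monthly_requests = [2_000_000, 2_500_000, 1_800_000, 9_000_000, 2_200_000, 3_000_000]

# ASSUMED prices, for illustration only.
FIXED_MONTHLY_COST = 5_000.0       # capacity sized for the peak, always on
COST_PER_MILLION_REQUESTS = 400.0  # consumption-based billing, no idle cost

# Data-center end of the spectrum: the same bill every month.
fixed_costs = [FIXED_MONTHLY_COST for _ in monthly_requests]

# Serverless end of the spectrum: the bill follows the traffic.
usage_costs = [
    requests / 1_000_000 * COST_PER_MILLION_REQUESTS
    for requests in monthly_requests
]

for label, costs in (("fixed", fixed_costs), ("usage-based", usage_costs)):
    print(f"{label:12s} mean ${mean(costs):>9,.2f}   std dev ${pstdev(costs):>9,.2f}")
```

With these assumed inputs, the usage-based bill averages far less than the fixed one, but it is the only one of the two that moves from month to month. The fixed bill is perfectly predictable and mostly pays for idle capacity.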

But they're asking a question that can be answered in dollars and cents, and instead, they're getting an answer that, while accurate, isn't helpful to them. And one of the ways to start expressing this in a more helpful way is to view it as a function of a KPI that the business cares about.

When you tie it back to business metrics, a popular one in the SaaS world is monthly active users, or MAU. If every thousand monthly active users winds up costing $X to service, plus a fixed fee of all the things that don't spin up or spin down, like the Jenkins box or a whole bunch of supporting infrastructure, that's something that they can at least work with. And then it goes on to the BI folks to try and figure out exactly what the real numbers are going to look like. It's like reading tea leaves.
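Written out, that model is a fixed fee plus a per-thousand-MAU rate. A minimal sketch, where the function name and both dollar figures are hypothetical placeholders rather than benchmarks:

```python
def monthly_cost(mau, cost_per_thousand_mau, fixed_cost):
    """Cloud spend modeled as a function of monthly active users (MAU).

    fixed_cost covers what does not scale with users (the Jenkins box,
    supporting infrastructure); the rest scales per thousand MAU.
    """
    return fixed_cost + (mau / 1000) * cost_per_thousand_mau


# ASSUMED inputs, for illustration only: $40 per thousand MAU, $2,000 fixed.
COST_PER_THOUSAND_MAU = 40.0
FIXED_COST = 2_000.0

for mau in (50_000, 100_000, 250_000):
    cost = monthly_cost(mau, COST_PER_THOUSAND_MAU, FIXED_COST)
    print(f"{mau:>7,} MAU -> ${cost:,.2f} per month")
```

Finance still can't get a single dollar figure out of this, but given a user forecast from the BI folks, they can get a number they can defend.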

The challenge though, too, is that people often take the wrong message from that, which is why having the right conversation with people is valuable.

I'm asked consistently by people for industry benchmarks around monthly active users. What should it cost?

And it's a fair question. It's not a bad one at all, but there's no reasonable way to answer it. Because in my spare time, I run twitterforpets.com. It's like regular Twitter, but 80 times less racist. And each user is just more or less tweeting. For other companies, a monthly active user represents, or rather references, a bank. They're going to be doing giant machine learning projects on it. Those cost a little bit differently.

But I'm tired of answering the question, and now that I'm on stage and I have a microphone and no one else does, I'm going to say for the record that the monthly benchmark for monthly active users is 32 cents each. It's founded in absolutely nothing, and ops teams will absolutely hate me for what I just said, but 32 cents seems about right.

Because the answer is anywhere from a penny or less to multiple millions of dollars, and it's hard to get an industry benchmark, even within a sector, because everyone builds things fundamentally differently.

All of which is to say that my working thesis, and I would love to hear people's thoughts on this, directed loudly at me, tied to a brick, through my window, is that the more cost optimized a cloud environment becomes, the harder its cost is to predict. There's an inverse correlation between the two.

And this isn't inherently a bad thing as long as the costs are tied to something the business understands.

But having those conversations requires a bit of empathy. It requires getting to a point where finance and engineering can have a conversation together, and everyone walks out of the room content with the conversation and not thinking that they're being condescended to or being dragged into a whole engineering exercise that makes no sense for them.

And that's been my working theory. And when I tell people this in various contexts, roughly half of them say, "Well, yeah, that's obvious. What about it?" And the other half just sort of stare at me with an, "Oh my God, I hadn't considered that" look. And that tells me I'm either onto something or I've got the basis for a terrific scam. And I'm not entirely sure which way it goes.

So if I have one ask for folks here, it's to tell me what I missed. What am I not thinking about when it comes to being able to talk about optimized environments being harder to predict in absolute terms? Because I work with a certain subset of the industry, obviously. I don't work with every company yet. So there are things that I'm missing. There are clearly missing pieces of context for me. And I would be very interested to hear from you folks what it is you think those are.

My name's Corey Quinn. I am a cloud economist, whatever the hell that means. Thanks for listening to me. You can follow my exploits at LastWeekInAWS.com, and I'll be haunting around the conference until they hurl me out of here tomorrow afternoon. Thanks.