Top Ten Mistakes in Managing Cloud Costs

London 2020

Corey Quinn

Cloud Economist · The Duckbill Group

"(Number four will blow your feet off!)"

Full transcript

Corey Quinn

Hi there. I'm Corey Quinn, and I'm a cloud economist.

You may have little to no idea what that means, but that's okay, because I have no idea what it means.

Fundamentally, I took an engineering background and applied it to an expensive business problem that nobody would wake me up about at 3:00 in the morning, namely, the horrifying AWS bill. Four years later, we're seven employees. Our customers spend over a billion dollars a year on cloud services, and we have angry opinions that we back with data.

I also write the "Last Week in AWS" newsletter, which gathers the news from AWS's cloud ecosystem, gently and lovingly makes fun of it, and then goes out every Monday and Wednesday to over 20,000 readers.

I also host a pair of podcasts, "Screaming in the Cloud," which is a serious interview show about the business of cloud, and "The AWS Morning Brief," which allows me to indulge my ongoing love affair with the sound of my own voice and what I imagine to be humor.

Last year, Onalytica ran an analysis that determined I was the greatest cloud influencer in the world. I've been completely insufferable ever since.

Now, the reason I bring all of that up is to validate that while you are going to see some of what passes for humor in this talk, I do know what I'm talking about. And today, that's what I'm here to do, talk to you about the top 10 mistakes we see in the world of managing cloud costs.

We start with the usual stuff that you'll see in every talk about cloud economics. If this is the kind of thing you're into, great.

Allow me to refer you to every single talk about cloud cost optimization other than this one.

It's always the same tired advice, and it doesn't matter if someone's giving you this talk in 2020 or in 2012, because it doesn't really change. This is proof that the advice given here doesn't actually work for crap.

Instead, I'm going to talk to you about the top 10 terrible mistakes that I see companies making around cost.

Let's get started. We begin, of course, with running Kubernetes.

You might chuckle at this and think that I'm being either intentionally antagonistic or setting up to make a clever point, but I'm not doing either of those things. We had a large enterprise client who had their cloud billing divided into Kubernetes and everything else.

Kubernetes was a giant, expensive question mark.

From the perspective of a cloud provider, you can spin up a whole bunch of instances and run all of your workloads inside of Kubernetes, and then get yourself into billing hell.

That's because to that provider, you're really just running a single workload, Kubernetes. There's no visibility at all into what workloads are going on inside of that environment.

Scaling your clusters up or down is a ridiculous fantasy that everyone talks about, but effectively nobody actually does.

So in practice, it's a bunch of big instances sitting around that cost you the same every hour of every day.

Those instances talk to each other in weird ways.

In AWS, transferring data from one availability zone to another in the same region costs the same as transferring it from one region to another: two cents per gigabyte in most cases, although there are exceptions that are egregiously high. Kubernetes also has no sense of zone affinity.

So that weird workload that the cloud providers are seeing, it spends an inordinate amount of time not only talking to itself, but racking up the bill as it does so. Worst of all, you can't really attribute those costs to workloads within those Kubernetes clusters other than by what basically amounts to dead reckoning.

You squint, you figure that 70% of that cluster is for workload A, the rest is for workload B, and that's how you allocate it.
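The dead-reckoning split described above amounts to a proportional allocation. Here is a minimal Python sketch of it; the function name `allocate_cluster_cost` and the $10,000 monthly cluster bill are hypothetical, and only the 70/30 split comes from the talk.

```python
def allocate_cluster_cost(total_cost, shares):
    """Split a shared cluster bill across workloads by estimated share."""
    total_share = sum(shares.values())
    if total_share <= 0:
        raise ValueError("shares must sum to a positive number")
    return {name: total_cost * share / total_share
            for name, share in shares.items()}

# Hypothetical monthly cluster bill; the 70/30 split is the eyeballed
# estimate from the talk, not a measured figure.
allocation = allocate_cluster_cost(10_000.0, {"workload_a": 70, "workload_b": 30})
for workload, cost in allocation.items():
    print(f"{workload}: ${cost:,.2f}")
```

The weakness is visible in the code: the result is only as good as the shares you squinted at, which is the point being made about attribution inside a cluster.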

Now, namespaces do kind of work to solve this in a somewhat passable way, but now where do you wind up putting the AWS primitives that Kube needs, regardless of workload or capacity? How do they get attributed?

They often don't or can't be because there's no tagging mechanism for any of these things that actually freaking works.

When the world isn't melting down into a recession fueled by a pandemic, you generally care a lot more about allocating where your spend is going than shaving dollars and cents off of the bill.

This client went with Kubernetes originally because they were hybrid.

They had workloads in data centers, and they had workloads in cloud.

They wanted to move workloads between those two environments seamlessly, and the middle layer that did that was, of course, called Kubernetes.

What they were doing was improving their data center at the expense of their cloud environment.

Every year, until this one at least, the CEO of AWS gets on stage at re:Invent in Las Vegas and unleashes a torrent of product and service announcements. They're bizarre.

Two years ago, they announced something called Ground Station, a service that's used to talk to satellites in orbit around Earth.

This is a legitimate service that exists, but at least a third of you watching this are reasonably sure I'm making it up for the sake of a joke. I'm not.

AWS has over 200 services, and yet over 80% of spend on AWS comes down to just five: EC2, RDS, S3, EBS, and data transfer. That all sounds like a bunch of letters that I'm throwing at you, but roll with me here. The rest of the spend is either long-term strategic bets, interesting technologies that customers asked for, or something else.

Remember that every AWS service is for someone, but no AWS service is for everyone. Just because your cloud provider has built a thing does not mean that you should use it, or frankly, even that you can. If you're using something that your cloud provider has built as soon as it launches, you're going to run into tricky edge cases.

A company that was very excited to use Amazon MSK, AWS's managed Kafka service, instead of running their own Kafka, jumped aboard as soon as it came out. Now, every time I talk to that company about a new feature that comes to Amazon's offering, their response is, "Well, that sure would've been nice to have at release time instead of this ugly, hacky workaround that we spent a month building, since that thing is a core feature of Kafka that we still can't believe that Amazon forgot." There have been no fewer than six of these after-the-fact releases that would've made their jobs easier.

Sometimes Amazon releases features that you'd swear I was making up to insult Amazon unfairly. My favorite was "Amazon Neptune now supports TLS." Now, as I said in my sarcastic newsletter, the far bigger story was that a service launched in the past few years somehow shipped without supporting TLS.

We've long since passed the point where I can talk incredibly convincingly about AWS services that don't really exist and not get called out by AWS employees. There are over 200 of them.

Who's to know which ones are real or not?

Just because a visionary from your cloud provider shows up to tell you what the future is going to look like on stage, doesn't mean that you need to be the first person or company to embrace that future in your production environment.

You presumably have a cloud strategy.

Don't let the flashy announcements distract you from that.

Similarly, AWS announces all kinds of different services.

You have a limited set of things that you're going to be able to innovate on.

Choose wisely.

On to the third mistake that we see companies making.

I talked to a company recently that was receiving vast quantities of data from their customers. Then they were transferring that data internally between availability zones over and over and over as they sliced, diced, and restructured the data, and ran and reran different queries.

It turned out that, this is not an exaggeration, for every gigabyte of data that they received from customers, they were transferring over 50 gigabytes of data internally. This isn't exactly the best architectural approach you can take, so we dug into it a little bit further.

They did have valid reasons for doing it. They were slicing data apart.

They were doing useful transformations.

They weren't just doing this out of ignorance or to be silly, but it had a very real cost. Now, in a data center, that means your switching fabric is going to be fairly congested, and you might have to spend a bit more on your network equipment. In AWS, it manifests very differently. Every time you move data between availability zones or regions, it costs the same as storing that data inside of S3 for three weeks. You're a slight, simple, easy-to-make misconfiguration away from that number exploding to just shy of four months. Once you get data into your environment, absolutely do not pass it back and forth between EC2 instances in different availability zones.

Please, if you're going to be doing that much data processing, store multiple copies of the data instead.
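To make the arithmetic above concrete, here is a rough Python sketch. It uses the two-cents-per-gigabyte transfer figure and the 50x amplification from the talk, plus the 2.9 cents per gigabyte-month S3 price quoted later in the talk; the 30-day month and the function names are simplifying assumptions, and real pricing varies by region and over time.

```python
TRANSFER_PER_GB = 0.02    # USD per GB moved between AZs (figure from the talk)
S3_PER_GB_MONTH = 0.029   # USD per GB-month in S3 (figure quoted later in the talk)
DAYS_PER_MONTH = 30       # simplifying assumption

def internal_transfer_cost(ingested_gb, amplification):
    """Transfer spend caused by shuffling each ingested GB around internally."""
    return ingested_gb * amplification * TRANSFER_PER_GB

def equivalent_s3_days():
    """How many days of S3 storage one cross-AZ transfer of the same data buys."""
    return TRANSFER_PER_GB / S3_PER_GB_MONTH * DAYS_PER_MONTH

cost_per_ingested_gb = internal_transfer_cost(1, 50)
days = equivalent_s3_days()
print(f"${cost_per_ingested_gb:.2f} of transfer per GB received")
print(f"one transfer costs about {days:.1f} days of S3 storage")
```

Under these assumptions a single transfer works out to roughly three weeks of storage, which lines up with the comparison in the talk.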

Which brings us to our next point. A lot of companies constantly fall into the trap of not storing data in the right location, and that stems from not fully understanding how the data life cycle works in cloud. It goes well beyond, do I put that on SSD or spinning disk? Well, what do you need from a durability perspective versus a latency perspective? There are a lot of options here.

One of our reference customers, and it's rare that we get to name names when we're talking about our clients, but we can here. They're Honeycomb.

They had a Kafka cluster running with local EBS volumes as a backing store.

By changing to instances with NVMe volumes, they were able to save a lot on cost, increase throughput, and address the durability concern of having those volumes tied to specific instances by offsetting it with Kafka's built-in replication factor.

If they lost an instance or two, there are multiple copies of that data, but it's replicated intelligently. A different company has a bunch of data that lives in Splunk, by which I mean pretty much every company.

Imagine that, an expensive story that features Splunk. Who would've thought?

They're currently in the process of moving all that Splunk data to S3.

Why S3? Well, the clouds, all of them, long ago took a collective vote about what the data storage model of the future was going to look like, and object storage won hands down.

Typical gp2 volumes, which are SSD on AWS, cost you 10 cents per gigabyte per month, and you'll need multiple copies of that for redundancy. Let's say three availability zones, so that's 30 cents per month per gigabyte. You're also not a dangerous lunatic, so you're not going to be completely filling up your disk volumes.

So let's assume they're all 75% full, aggressive, but doable.

So now each gigabyte of data you're storing is costing you 40 cents per month. Or you could store that same data in S3 for 2.9 cents a month instead. But wait, there's more.

S3 offers a sarcastic durability guarantee.

They say their design goal is 11 nines of durability, which puts losing data in the realm of "win the lottery while getting struck by a meteorite at the same time" level of likely. When you access that data from different availability zones in the same region, there are no data transfer charges.

There are request charges which ballpark to 1,000 requests to S3 costing you a penny, and that adds up. So while you want to be sensible about it, the economic wins here are so incredibly massive that it's a slam dunk, but if and only if your applications can speak to object stores instead of disk volumes. If you try to treat S3 like a file system, you'll make it worse.

No one's going to be happy with that if you do it.
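The storage comparison above can be written out as a short Python sketch. The prices, the three copies, and the 75% fill ratio are the figures from the talk; the request price is the talk's "a penny per 1,000" ballpark, not an exact rate, and the function names are illustrative.

```python
GP2_PER_GB_MONTH = 0.10       # USD, SSD volume price from the talk
AZ_COPIES = 3                 # one copy per availability zone
FILL_RATIO = 0.75             # volumes assumed 75% full
S3_PER_GB_MONTH = 0.029       # USD, S3 price from the talk
S3_PER_1000_REQUESTS = 0.01   # USD, ballpark from the talk

def ebs_cost_per_stored_gb():
    """Monthly cost of one useful GB on replicated, partly empty volumes."""
    return GP2_PER_GB_MONTH * AZ_COPIES / FILL_RATIO

def s3_cost(gb_stored, requests):
    """Monthly S3 cost: storage plus request charges."""
    return gb_stored * S3_PER_GB_MONTH + requests / 1000 * S3_PER_1000_REQUESTS

ebs = ebs_cost_per_stored_gb()
s3_storage_only = s3_cost(1, 0)
# Requests per stored GB per month at which S3 would cost as much as the volumes.
break_even_requests = (ebs - s3_storage_only) / S3_PER_1000_REQUESTS * 1000

print(f"Volumes: ${ebs:.2f} per stored GB-month")
print(f"S3:      ${s3_storage_only:.3f} per stored GB-month before requests")
print(f"Break-even: {break_even_requests:,.0f} requests per GB-month")
```

The break-even line is why being sensible about request volume matters: an application that hammers S3 like a file system can eat the whole saving.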

Another common mistake is chasing the multi-cloud dragon.

Everyone knows that lock-in is to be avoided, so the right answer is obviously to build everything you can in a completely cloud-agnostic way, so you can deploy your entire stack to different cloud providers on a moment's notice.

To quote my friend Ben Kehoe, think of multi-cloud like cow tipping. Cow tipping is an urban myth.

And how do we know it's a myth? Simple.

There are no videos of anyone ever having successfully tipped a cow on YouTube. Similarly, we know that building in a cloud-agnostic way for multi-cloud is also a myth, because if anyone had actually done such a thing in a way that wasn't completely horrifying, we would never hear the end of it from every multi-cloud vendor's keynote stage.

Instead, we're treated to things like last year's VMworld keynote, where their vision was so horribly complicated that to use it all, they had to invent a fake T-shirt company called Tanzu Tees to theoretically use all of the ridiculous nonsense they were talking about that was a part of their platform. Multi-cloud like that doesn't exist.

Stop trying to chase the impossible dream.

By doing so, you're really turning your back on a bunch of higher-level differentiated services that cloud providers offer to massively improve your business, and in return, getting basically nothing of value.

You're paying for an optionality that you're not cashing in, and the coin you're using to buy that is your own feature velocity.

As a rule of thumb, pick a provider per workload and go all in. I am not an AWS partner. I don't care if you use AWS or Azure. If I really dislike you, I'll suggest you use Oracle Cloud. But pick a provider and go all in until you're forced not to. To that end, we don't have clients that are actually doing multi-cloud, at least by this definition of multi-cloud. Instead, we see a number of clients refusing to commit to a single vendor out of a misplaced fear of lock-in, so they build everything to be provider-agnostic.

One company we've spoken to is spending over $100 million a year almost entirely on just EC2 instances.

They run a whole bunch of databases on top of those instances, but nothing managed. Their single concession to anything cloudy is the object store, but they're careful to only use S3 functionality that's widely replicated in other providers.

As a result, they once spent three months of an engineer's time trying to get VPC peering over IPSec working between Google Cloud and AWS using Terraform. Now, you'll note that I said they spent three months trying, not that they succeeded. The idea doesn't pan out in the real world, and for all of the effort they've put into trying to maintain this, they are still a single cloud shop.

Let's talk a bit more about analysis paralysis and what it means to finance.

Let's hypothetically say that you've bought in heavily on the idea of data centers.

You didn't build your companies on top of public cloud because you're responsible, grown-up companies instead of Twitter for pets, and back when you were building out your technology capability, the cloud didn't really exist the way that it does today. Now, you're hearing lots of good things about the cloud.

You're also hearing a lot of bad things about the cloud that usually masquerade as digital transformation, but in reality are the sound of a vacuum cleaner that's run by a giant consulting company being fired up and aimed at your wallet.

So here's the big question: Will you save money by migrating to the cloud? You can spend the next 18 months doing a TCO analysis to answer that question, and at the end of it, the results will be wishy-washy or incredibly one-sided, depending upon internal corporate politics. Let me save you some time on these.

If you're going to save money by doing a cloud migration, it's going to happen on a five-year time horizon. Let me be even more clear to you.

You can assume that you will not save money by moving to the cloud.

What you're going to gain is capability and a level of rigor that your data centers will not be able to match.

If you leave an application running unmaintained in a cloud provider for years on end, it gets better. The instances it runs on grow more durable.

The network gets faster. If you try that in a data center, you're going to discover, much to your chagrin, that raccoons have carried your servers off somewhere around the two-year mark.

At some point, you have to stop measuring and make a decision.

There's always risk, but we've reached a point where the cloud has been effectively de-risked for most workloads. It's time to grow up and get off the fence.

We had a conversation with a company that needed to get approval to begin planning moving out of the data centers onto AWS.

They got approval on that plan last month, and they started that plan in November. That's not a migration plan.

That was a, "Can we even do it at all?

Is it feasible for our business?" They have not built a migration plan yet, and they're seven months in, for anyone who's watching this video later in time trying to do calendar math. Seven months to do an analysis as to whether moving to the cloud was feasible. Now that they've decided it is, they get to build their migration plan, and as anyone who's ever done a data center to cloud migration knows, the planning is going to take another six months if you rush it.

Their AWS bill, if they migrate everything, which isn't guaranteed, will be roughly $120 million before discounting.

Whether cloud is worth it or not for you is a question you have to answer.

Your path to get there is not going to be via financial analysis. Instead, it has to be a capability story.

Similar to analysis paralysis is letting the accountants believe that the world hasn't changed. If you tell me to build a small WordPress website in AWS, I could not tell you what it's going to cost to run for the first month to within any closer than 20%.

We need to first figure out how much traffic does it get, what does that look like, and then we can adjust our financial forecasting to match reality.

This is not how it used to work. If you go back to data centers, you pretty much knew what the servers you bought in that data center were going to cost you for the next three years to a pretty accurate degree.

If you build a pure serverless environment on the other end of the extreme, you're going to spend way, way, way less money than building a data center, but the answer to what it's going to cost is going to be a lot more variable.

In fact, it's going to be a function of how many users you have on the system over a given time span. You're going to have to bring finance people who understand the concept of unit economics into these conversations sooner than later, but stop them before they lead you down a different path to utter madness.

This isn't cost accounting. It's the cloud. It's elastic.

You aren't going to get accurate models to the penny.

Getting to within 10% will optimistically take you months, and most of the time, it's not directly worth it. If you over-index on this too soon, you're going to become like a different company we ran into.

We'd spoken with someone who requested our help in building out this model for them. They're prepared to devote six engineers to building that model over eight months, which is more than their entire cloud bill costs.

This isn't an early-stage startup trying to dial in unit economics.

They're already in steady state. They're trying to allocate almost down to the penny, and they're spending far more than they're ever going to recapture in value to do it. So let's pretend for a second that you're in a data center environment. Your CoreSite bill shows up.

Your CFO sees it, has a mid-sized heart attack, and your month goes on.

That's the end of the story, more or less. It's a data center.

What are you realistically going to do about it? Breach contract and leave?

Turn the servers off? Now the AWS bill shows up. Well, that's a different story.

That can be broken down into different business units, but that's not how the bill's aligned. It's not, "I spent $2 million on EC2."

It's "I spent $1.4 million on our search service, $300,000 on displaying very impressive Norton Antivirus badges to website visitors," et cetera.

You start allocating it to business functions.
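The difference between the two versions of that bill is just which key you group by. A minimal Python sketch, assuming every line item carries a business-function tag: the line items and tag names are hypothetical, chosen so the totals match the figures in the talk, with `everything-else` standing in for the "et cetera".

```python
from collections import defaultdict

# Hypothetical line items: (AWS service, business-function tag, USD).
line_items = [
    ("EC2", "search", 1_400_000),
    ("EC2", "antivirus-badges", 300_000),
    ("EC2", "everything-else", 300_000),
]

def total_by(items, key_index):
    """Sum the USD column, grouped by the column at key_index."""
    totals = defaultdict(int)
    for item in items:
        totals[item[key_index]] += item[2]
    return dict(totals)

by_service = total_by(line_items, 0)    # the bill as the provider presents it
by_function = total_by(line_items, 1)   # the bill as the business reads it

print(by_service)
print(by_function)
```

Grouping by tag only works if the tags exist and are applied consistently, which is the hard part in practice.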

The latter version of that story can then be used to communicate with the business in a way that carries weight and context beyond, "It turns out that upon careful analysis, computers are super expensive." In order to get there, you first have to slice the bill into resources that align with that business. Finance doesn't really care how much you spend on AWS. Believe it or not, they don't particularly care if it's $1 million or $5 million. If you come to them with a, "Well, cloud is just expensive. It does that. Deal with it," that doesn't help them any.

They're not there to hold the purse strings.

They're there to help the business grow.

Give them information that helps inform their forecasts, and they in turn can help the business make better decisions.

This month, your cloud bill is $1 million, and next month it's $2 million, and the month after that, it's $3 million.

The CFO is probably going to be pretty upset.

What's not as well understood is if next month it's $2 million, the month after that it's $1 million, and the third month out it's $300,000, the CFO is going to be just as upset.

It's not about the cost. It's about blindsiding the business with what it costs to provide your goods or services to your customers.

This ties into making the mistake of underestimating the power of prediction. We spoke with a company who was a bit undersold on the value of being able to allocate their spend.

They knew that it was important to have a vague idea of where the money was going, but they weren't so sold on the idea of why that actually mattered.

Then their Amazon contract came up for renewal.

So, here's an open secret in the industry: if you commit to longer-term spend at certain dollar figures, you'll get discounts off of retail pricing. There are a lot of complexities to this, but that is the baseline universal truth of the situation.

The trick is, what is the right number that you should commit to?

If you commit to spend too little, you're leaving money on the table.

If you commit to spend too much, and you haven't hit your commitment, then you'll have a shortfall. Complicating this whole thing is that Amazon has their own math for how to calculate out what they think your commitment should be.

And it's okay-ish. It's a tool. It has no context. It doesn't recognize that the holiday season just ended and your company sells Christmas decorations.

But what if you knew more about your growth than they did and had the data to back it up? What if you were able to predict your spend better than Amazon was?

This company happened to be in precisely that position, and it let them take the discount offer that they received and in turn realize half a million dollars in additional savings on a $5.7 million commitment, just because they had the better predictive model and the data to back it up. That's the value here.

When you're able to predict your spend, you're able to negotiate better deals.
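The too-little versus too-much trade-off described above can be illustrated with a deliberately simplified Python sketch. Everything here is hypothetical: the offer tiers, the discount rates, the forecast, and the model itself, which assumes the discount applies to all spend and any shortfall below the commitment is still owed. Real commitment contracts are more complicated than this.

```python
def annual_cost(actual_spend, commitment, discount):
    """Simplified model: pay for whichever is larger, actual spend or the
    commitment, at the discounted rate."""
    billable = max(actual_spend, commitment)
    return billable * (1 - discount)

# Hypothetical offers: commitment in USD -> discount off retail pricing.
OFFERS = {4_000_000: 0.05, 6_000_000: 0.10, 8_000_000: 0.14}
forecast_spend = 6_500_000  # hypothetical forecast at retail prices

costs = {c: annual_cost(forecast_spend, c, d) for c, d in OFFERS.items()}
best_commitment = min(costs, key=costs.get)

for commitment, cost in sorted(costs.items()):
    print(f"commit ${commitment:,} -> pay ${cost:,.0f}")
print(f"cheapest option for this forecast: commit ${best_commitment:,}")
```

The answer depends entirely on `forecast_spend`, which is why a better prediction than your provider's is worth real money at the negotiating table.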

The same was true back in the data center days.

If you knew how many racks you were going to need by the end of year five, you could've negotiated for all of that upfront.

We're really, in technology, reinventing the futures market in a nutshell.

The last thing that I want to talk about is the biggest strategic blunder that I see, and everything's been tying into this, and that is thinking of the cloud like it's just another data center. You can do that, but it's expensive, fragile, and basically VMware's entire business model.

They are the payday lender of technical debt.

Here's the easiest way for you to figure out if you've fallen into this trap.

Look at your architecture in the cloud.

Are you basically only using EC2 instances and disk volumes?

If so, I've got some bad news for you.

Another way to tell whether you're stuck in data center thinking is a bit more nuanced. Take a look at the hour-by-hour costs in your environment.

Do they decrease significantly in your business's off-hours?

Does the spend ramp up as your user traffic increases?

Or are you seeing what a lot of folks did over the past few months as the pandemic hit? 80% of your user traffic evaporates, but your cloud bill doesn't really move at all. The entire value proposition of the cloud is the ability to scale up and down to meet your workload's needs, which is what everyone tells themselves so they can feel better about not actually doing it.

If the premise of cloud is that you can dynamically scale to meet demand, then this is the part everyone forgets, you can also scale the hell back down again. Further, you can use higher-level managed services that remove the operational toil from your environment.
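The hour-by-hour check described above can be sketched in Python as a single ratio: the cheapest hour of the day divided by the most expensive. The two sample days and the 0.8 threshold are made-up illustrations, not benchmarks, and the function name is hypothetical.

```python
def trough_to_peak_ratio(hourly_costs):
    """Cheapest hour divided by most expensive hour (1.0 means totally flat)."""
    peak = max(hourly_costs)
    if peak <= 0:
        raise ValueError("need at least one hour with positive cost")
    return min(hourly_costs) / peak

# Hypothetical days of hourly spend in USD.
flat_day = [100.0] * 24
elastic_day = [20.0] * 8 + [100.0] * 10 + [40.0] * 6

FLAT_THRESHOLD = 0.8  # hypothetical cutoff for "the bill doesn't really move"

for label, day in (("flat", flat_day), ("elastic", elastic_day)):
    ratio = trough_to_peak_ratio(day)
    verdict = "looks like a data center" if ratio >= FLAT_THRESHOLD else "scales with demand"
    print(f"{label}: trough/peak = {ratio:.2f} ({verdict})")
```

A ratio near 1.0 while user traffic swings widely is the signature of data center thinking described in the talk.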

I know I've talked a lot about AWS in this talk, but every mistake I've mentioned applies to every cloud provider. There's nothing here that doesn't apply to Azure if you add in a whole bunch of licensing issues, or GCP if you sprinkle in a bit of healthy fear of them turning your production environment off because they got bored and distracted by something shiny.

So you've sat through the slings and arrows I've hurled at a bunch of cloud decisions and come out the other side still intact. Good for you.

Now, go to lastweekinaws.com/does, D-O-E-S. I've put together a few things for you.

First, a PDF on how you can cut your AWS bill right now.

It's gorgeous. It has a platypus on it.

It's something you can use immediately to start the cloud cost conversation internally. It's full of actionable things you can do yourself today. There's a newsletter signup box on that page as well.

I gather all of Amazon's cloud ecosystem news every week, strip out the things I don't care about, and then I make fun of them because I have serious problems with my personality. You're going to want to sign up for that, too.

Lastly, I'll be hosting a free-form Q&A as a workshop here on cloud cost management. Ask questions. I'll make jokes like the ones you've just sat through while I answer them. It'll be a grand old time as we're trapped inside during a global pandemic. Again, I'm Corey Quinn, Cloud Economist at The Duckbill Group. We help companies fix their horrifying AWS bills. Thanks very much.