You Suck at Cloud and It's [Not] All Your Fault


Europe 2021


The Duckbill Group · Chief Cloud Economist

Corey is the Chief Cloud Economist at The Duckbill Group, where he specializes in helping companies improve their AWS bills by making them smaller and less horrifying. He also hosts the "Screaming in the Cloud" and "AWS Morning Brief" podcasts; and curates "Last Week in AWS," a weekly newsletter summarizing the latest in AWS news, blogs, and tools, sprinkled with snark and thoughtful analysis in roughly equal measure.


Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Thank you, Nora.

Our next speaker is Corey Quinn, who you may know from the snark he delivers in his newsletter, podcast, and of course, Twitter. I found him to be an incredible source of insights, and I'm personally grateful for the help and the critiques he gave me on an early draft of "The Unicorn Project."

He has a keen eye for the absurd, and there's plenty of that in how organizations are using and misusing the public cloud. He'll be presenting some startling observations on what we tend to do wrong, and he provides some equally startling advice on what we can do about it. Here's Corey.

Corey Quinn

Hi there. I'm Corey Quinn, and I'm The Duckbill Group's Chief Cloud Economist. You might have little to no idea what that means, but that's okay because I have absolutely no idea what it means. Basically, I took an engineering background and applied it to an expensive business problem that, and this was key, nobody would wake me up about at 3:00 a.m.: the horrifying AWS bill. Four years later, we're about 10 employees, our customers spend billions a year on cloud services, and we have angry opinions backed by data.

How did we get here? Well, I got yelled at a lot when I used to run ops teams about the AWS bill. I wanted someone that I could just give money to, and they would make that problem go away. I couldn't find that person, so I started consulting. Soon, I wound up creating my newsletter, "Last Week in AWS," that gathered all the information from Amazon's ecosystem that had an economic impact, which, let's face it, was pretty much everything. And then I shared my idea with Mike Julian, and we partnered to create The Duckbill Group.

So we now do AWS cost optimization because it's a growing problem experienced by approximately everybody, and we host three podcasts and write two newsletters covering all things AWS because focusing on something is not really something we're great at. Oh, and our mascot is a venomous platypus named Billy, unless he's doing consulting work; then he wears a tie and goes by William.

That's where you folks come in. I gave a version of this talk a year ago or so at a DevOps Enterprise Summit event back when the pandemic was just getting started, and Gene Kim apparently lost a bet or something because he wanted me to bring the better version of that talk here as a plenary.

So this version of the talk is very reasonably called, "You Suck at Cloud and It's All Your Fault," just to make sure that we set the proper tone and context for the nonsense I'm about to hit you with, because everyone feels on some level like everyone else has figured this stuff out and somehow we're the ones who are missing the bigger picture that those other companies have managed to get right. I'm sorry. Everyone is secretly ashamed of how they're working in the cloud. You're not alone, and it's not really your fault. So get comfortable. Let's chat. I'm here to talk to you about the plethora of mistakes we see in the world of managing cloud costs.

Now, when I say managing costs, you're probably going to expect a talk that covers points that are a lot like the ones right here, albeit with slightly less inferred violence against your account manager's furry friends, because this is the usual stuff you'll see in every talk about cloud bills, and they usually end with a rousing call to action: either go forth and tag everything better, or buy some company's product or service that won't solve your actual problems.

If this is the kind of thing you're into, great. Allow me to suggest every single talk about cloud cost optimization that isn't this one. It's always the same tired advice, and it doesn't matter if someone's giving it to you in 2021 or 2012 because it doesn't really change. I consider this proof that all of the advice on this slide doesn't actually work for crap when it comes to achieving outcomes that even slightly resemble lasting change. So if the usual suspects aren't what you should be focusing on, what are the worst mistakes I see companies making around cost?

Let's begin, of course, with running Kubernetes. You might chuckle at this and think I'm being intentionally antagonistic or setting up to make some clever point. Rest assured, I'm not. We have a large enterprise client who had their cloud bill divided into Kubernetes and everything else. Kubernetes was a giant expensive question mark. It's maybe where real work happened, maybe it was all waste, but nobody had a clue. This is because running Kubernetes is and remains a giant mistake from the bill side of the world.

From the perspective of a cloud provider, you can spin up a whole bunch of instances, and then on top of that, run all of your workloads inside of Kubernetes and get yourself into billing hell, because to that provider you're running a single workload called Kubernetes. There's no visibility at all into what workloads are going on inside of that environment. Scaling your clusters up and down is a ridiculous fantasy that everyone talks about, but effectively nobody actually does because autoscaling is often not worth the expense of implementing it properly and requires you to accurately predict the future. So, in the real world of enterprise, Kubernetes looks like a lot of big instances sitting around costing you the same every hour of every day.

Then those instances talk to each other in weird ways. In AWS, transferring data from one availability zone to another costs the same as it does to transfer it from one region to another: two cents per gigabyte, in most cases. Kubernetes has no sense of zone affinity, so that weird workload that the cloud providers see spends an inordinate amount of time talking to itself, and not in the fun way that I do, and it racks up the bills as you go.

Think about that for a second. Something inside of Kubernetes wants to talk to something else, and it'll frequently ignore the thing that's right next door that it can talk to for free and opt instead to shove a few petabytes a month at something that's charging two cents per gigabyte.

Worst of all, you can't attribute those costs, be they for data transfer, compute, or RAM, to workloads within those Kubernetes clusters other than what amounts to basically dead reckoning. You squint, you figure out that 70% of that cluster is for workload A, the rest is for workload B, and that's how you allocate it. Namespaces kind of work to solve this and do an okay-ish job, but now where do you put the AWS primitives that Kubernetes needs regardless of workload or capacity? The control plane itself, the snapshot storage, the backups, the stuff that doesn't cleanly allocate to one particular workload. How do those get attributed? They often don't or can't be because there's no tagging mechanism that actually freaking works for this. And the sad fact of the matter is, as an enterprise, you invariably care a lot more about allocating where your spend is going than shaving dollars and cents off of the bill.
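The "squint and split" allocation described above can be sketched in a few lines. This is only an illustration: the dollar amounts, the workload names, and the choice to spread shared costs (control plane, snapshots, backups) in the same proportions as usage are all assumptions, not figures from the client's bill.

```python
def allocate(cluster_cost, shared_cost, usage_share):
    """Split cluster_cost by estimated usage share, then spread the
    shared_cost across workloads in the same proportions."""
    total_share = sum(usage_share.values())
    return {
        name: (cluster_cost + shared_cost) * share / total_share
        for name, share in usage_share.items()
    }

# The 70/30 split is the estimate from the talk; the dollar amounts are made up.
allocation = allocate(
    cluster_cost=100_000,   # hypothetical monthly compute for the cluster
    shared_cost=8_000,      # hypothetical control plane, snapshots, backups
    usage_share={"workload_a": 70, "workload_b": 30},
)
print(allocation)
```

The weak point is the usage share itself: it is an estimate, so the output is only as trustworthy as the squinting that produced it.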

So, this client of ours, like so many of them do, went with Kubernetes because they were doing a hybrid environment. That is, a data center and a cloud environment. They wanted to move workloads between data centers and cloud seamlessly, and that middle layer was, of course, called Kubernetes. What they were doing, in fact, was improving their data center at the expense of their cloud environment. Look, before I get angry letters, I'm not denying that Kubernetes offers advantages. If you want to learn more about that, go ahead and talk to anyone who has the word Kubernetes in their talk title at any event ever, and they'll be thrilled to evangelize a technology that nobody fully understands for any problem you have and several you don't. I'm pretty sure that's a job requirement at this point for some folks.

I talked to a company recently who was receiving vast quantities of data from their customers. That doesn't really narrow it down in 2021 at all because we live in a data economy, and we want more data faster across the board. Image and video files have gotten larger, and as we've increased bandwidth available on the network, companies have rushed to fill it with a whole bunch of telemetry. Now, no judgment here. That's a different talk where I get into privacy.

So, I digress. That client was receiving scads of data from their customers in the multi-petabyte range. Then, they were transferring that data internally between availability zones over and over as they sliced, diced, and restructured that data. They ran some queries, ran other queries on the results of those queries, and it turned out that for every gigabyte of data that they received from customers, they were transferring 50 gigabytes internally. That's not an exaggeration that I made up to prove a point. It's true.
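To put a rough number on that amplification, here is a minimal sketch. It uses the two-cents-per-gigabyte ballpark and the 50x factor from the talk; the one petabyte per month of ingested data is an assumed figure for illustration, and a petabyte is treated as 1,000,000 GB.

```python
CROSS_AZ_CENTS_PER_GB = 2   # the talk's ballpark rate for cross-AZ transfer
AMPLIFICATION = 50          # GB moved internally per GB received (from the talk)

def monthly_transfer_cost(ingested_gb):
    """Dollars spent on internal transfer for a month's ingested data."""
    return ingested_gb * AMPLIFICATION * CROSS_AZ_CENTS_PER_GB / 100

ingested_gb = 1_000_000     # assumed: 1 PB received per month
print(monthly_transfer_cost(ingested_gb))
```

Under these assumptions every gigabyte received carries about a dollar of internal transfer charges before anything is stored or computed.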

Now, this isn't exactly the best approach to take, so we dug into it a bit further. They had valid reasons for doing it. They were slicing the data apart, taking the results, transforming it further. They weren't just being silly with it, but it had a very real cost. In a data center, doing this just means that your switching fabric is going to be pretty congested, and you might have to spend a bit more money on your network. In AWS, it manifests differently.

I mentioned previously that it costs two cents per gigabyte to move data between most regions and availability zones. Try thinking about it this way instead. Every time you move data between availability zones or regions, it costs the same as storing that data in S3 for three weeks. You're a slight, simple misconfiguration away from that number exploding to just shy of four months. If you want to achieve that high score, your search term is managed NAT gateway data processing fees. I know it sounds like I'm getting into the weeds of how data transfer is billed in AWS, right? Well, let's see what it looks like.

I'm not joking. This is how the billing works. Now, don't worry, I'm going to give you a link to the high-res version of this image at the end of the talk because it's probably going to be useful for you. I find myself consulting this constantly. And the only winning move with data transfer charges is simply not to play.

The lesson here can be distilled down into a basic truism. Once you get data into your environment, absolutely do not pass it back and forth between EC2 instances in different availability zones. Please. If you're going to be doing that much data processing, store multiple copies of the data, and we'll all be happier. To that end, data should live on the cheapest storage possible.

A lot of companies fall into this trap constantly, and it stems from not fully understanding how the data life cycle works. It's well beyond SSD versus spinning disk in this era. What do you need from a durability perspective versus a latency perspective?

One of our reference customers, and it's rare that we get to name names when talking about our clients, but we can here: they're Honeycomb. They had a Kafka cluster running with local EBS volumes as a backing store. By changing to instances with NVMe volumes, they were able to save on cost and increase throughput, and the durability concerns of having the volumes tied to specific instances were offset by Kafka's built-in replication factor.

A different company has a whole bunch of data that lives in Splunk because data always finds its way to Splunk. Imagine that, an expensive story in which Splunk is the main character. They're currently in the process of moving that data to S3. Why S3? It turns out that the cloud long ago took a collective vote about what the data storage model of the future was going to look like, and object storage won, hands down.

Let's do a little math here. Typical GP3 SSD storage in AWS costs $0.08 per gigabyte per month. You need multiple copies of that for redundancy, let's say three availability zones, so that's $0.24 per month per gigabyte. You're also not a dangerous lunatic, so you're not going to be completely filling your disk volumes. Let's assume your disks are all 80% full, which is aggressive but doable. So now each gigabyte of data that you're storing, not including replication cost, is costing you $0.30 a month per gigabyte. Or you could store it in S3 for 2.3 cents per gigabyte per month.
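The arithmetic above, written out as a small sketch. The prices, the three copies, and the 80% fill level are the figures quoted in the talk; actual prices vary by region and over time, so treat the constants as assumptions.

```python
GP3_PER_GB_MONTH = 0.08    # dollars, as quoted in the talk
COPIES = 3                 # one copy per availability zone
FILL_FRACTION = 0.80       # disks kept 80% full
S3_PER_GB_MONTH = 0.023    # dollars, as quoted in the talk

# Cost per gigabyte of actual data: pay for every copy, and for the empty headroom.
ebs_effective = GP3_PER_GB_MONTH * COPIES / FILL_FRACTION
ratio = ebs_effective / S3_PER_GB_MONTH

print(f"EBS: ${ebs_effective:.2f} per GB-month")
print(f"S3:  ${S3_PER_GB_MONTH:.3f} per GB-month")
print(f"EBS costs roughly {ratio:.0f} times as much")
```

With these inputs the disk-based approach comes out at roughly 13 times the S3 price per stored gigabyte, before any replication traffic is counted.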

But wait, there's more. S3 offers a sarcastic number of nines in its durability, 11 of them, which is win-the-lottery-while-simultaneously-getting-struck-by-a-meteorite level of likely. And when you access that data from different availability zones in the same region, there are no data transfer charges. There are request charges. Ballpark that, 1,000 requests cost you a penny. It adds up, so be sensible. The economic wins here are absolutely massive, but if and only if your applications know how to speak to object storage instead of disk volumes. Please don't try to treat S3 like it's a file system. Absolutely nobody likes what happens when you do, and it ends in tears before bedtime.
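A quick sketch of how request charges add up, using the talk's ballpark of a penny per 1,000 requests. Real S3 request pricing differs by request type and region, so the rate here is an assumption taken from the talk and the request volume is made up.

```python
REQUESTS_PER_CENT = 1000   # the talk's ballpark: 1,000 requests cost a penny

def request_cost_dollars(requests):
    """Dollars charged for a given number of requests at the ballpark rate."""
    return requests / REQUESTS_PER_CENT / 100

# Hypothetical volume: a billion requests in a month.
print(request_cost_dollars(1_000_000_000))
```

At that rate a billion requests a month is a five-figure line item, which is why an application that reads objects in tiny pieces can erase much of the storage saving.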

So this is where we start to get to the idea of modernizing your applications to speak to object storage. It's not easy. Please don't think I'm saying it is. And for some workloads, it may as well be impossible. I get it. It's hard. But there are serious durability and scaling wins if you can pull it off.

Ah, but what about not wanting to get locked into one provider? You want to have disks. Disks are available everywhere. Let's talk about multi-cloud and ignore for a minute the fact that everyone has an object store. Over the past year, ranting about multi-cloud has become one of my signature talking points because sometimes you set up a straw man, and then it comes to life almost like Frosty the Snowman. My position on this is often misconstrued to be understood as never use multi-cloud for anything. Some folks also like to use it as proof positive that I'm a shill for AWS, or that I hate them, or frequently both at the same time, however the hell that's supposed to work. I want to be clear here: I have no partnerships with any vendor in this space. So let me first state the point that I'm trying to make, and then I'll shoot down the various ways it'll be misinterpreted. This is very important. Don't email me until you've listened to this part.

Now then. Everyone knows that lock-in is to be avoided, so the right play is obviously to build everything in a cloud-agnostic way, so you can deploy your entire stack to different cloud providers on a moment's notice. To quote my friend Ben Kehoe of iRobot, think of multi-cloud as being a lot like cow tipping, which is a myth. How do we know that cow tipping is a myth? Easy. There are no videos of anyone successfully tipping a cow on YouTube. That's right. If it's not on YouTube, it doesn't exist.

Similarly, we know that building everything in a cloud-agnostic way is also a myth because if someone had actually built their entire application stack in a way that wasn't completely horrifying, we would never hear the end of it from a whole bunch of different vendors' keynote stages. Instead, we're treated to things like 2019's VMworld keynote, where their vision of the future was so horribly complicated, they could never find a real customer who would do this. So they had to invent a fake T-shirt company called Tanzu Tees to theoretically use all of the ridiculous nonsense they were talking about that was a part of their Tanzu platform. It doesn't exist. I like VMware. Please don't think I don't. But I'm serious. After all that nonsense, they didn't even have the decency to give me a way to go to that website and buy an actual T-shirt.

Please stop trying to chase the impossible dream. You're turning your back on a whole bunch of differentiated higher-level services that cloud providers offer that can massively improve your business, and you're getting basically nothing of value in return. You're paying for an optionality you're not cashing in, and the coin you're using to buy it is your own feature velocity. So as a rule of thumb, pick a provider per workload and go all in.

I am not an AWS partner. I don't care if you use AWS or Azure or GCP or Oracle Cloud. Hell, if I really dislike you, I'll suggest you use IBM Cloud. But pick a provider and go all in until you're forced not to on a per-workload basis. And that last point is key because I personally, at The Duckbill Group, use AWS for my infrastructure, GitHub or JifHub, as it is properly pronounced, for my code storage because AWS CodeCommit is really a sad joke, and G Suite for my email because I don't want to run mail servers anymore this decade. But each workload lives in a distinct provider. There's no workload that has to seamlessly flow between different providers because that's not sensible for most use cases. You'll spend more time getting that workload to speak different cloud providers' various dialects than you will actually running the blasted thing.

And if you're not running it in multiple providers, it's like a DR plan. You update it and get a binder and everything's set, and then three months go by and it's time to test your DR plan again, if you're one of those forward-looking shops that believes in testing your DR plan, and it breaks. Then you keep iterating forward and you finally get it to work, and the next commit breaks the whole thing again. If you're not active-active, you're not really multi-cloud.

Now, there are exceptions for workloads where multi-cloud makes sense, but they're usually stateless workloads that often fit inside of containers. We see them, but it's infrequent, and it's certainly not nearly happening often enough to suggest this is somehow some kind of best practice. It's an edge case, but one that's talked up to be way more common than it is by vendors who will have absolutely nothing left to sell you if you go all in on your cloud provider, and by crappy cloud providers who know that if you're going all in on a cloud, it will certainly not be them or anything their dirty little hands have ever touched.

So to that end, we don't have a whole lot of clients that are actually doing multi-cloud for this definition of the term. Instead, we see a number of clients who don't want to commit to a single vendor out of a misplaced fear of lock-in, so they build everything to be provider-agnostic.

One company we've spoken to is spending over $100 million a year almost entirely on EC2. They run a whole bunch of databases on top of EC2. Their single concession to the cloud is an object store, but they're careful to only use S3 functionality that's widely replicated elsewhere. They once spent three months of an engineer's time trying to get VPC peering over IPSec working between Google Cloud and AWS in Terraform. You'll note that I said they spent three months trying, not that they succeeded. The idea doesn't pan out in the real world.

Avoiding lock-in? Please, you're already locked in by virtue of how identity gets managed, how networking fails to interact, and by the people you've hired that are more expensive than your cloud bill. They're good at your cloud provider of choice, presumably. Tell them to learn another cloud, and an awful lot of them are going to opt to just move down the street instead to continue working with the thing they're good at. So pick a provider per workload and go all in. If your company is all in on cloud provider A and you acquire a company on cloud provider B, cool. Leave them alone. There's not a lot of business value in upsetting that apple cart in most scenarios. There really isn't.

So let's say you're in a data center environment. Your CoreSite bill shows up. Your CFO sees it, has a heart attack, and your month goes on. The end, more or less, presuming they recover. It's a data center. What are you realistically going to do about it? Arson and claim insurance money? Not recommended.

Now imagine a cloud environment. The AWS bill shows up. Well, that can be broken down to different business units. It's not, "I spent $2 million on EC2," it's, "I spent $1.4 million on Search, $300,000 on displaying impressive Norton Antivirus badges to our website visitors," et cetera. That latter version can be used to communicate with the business in a way that carries weight and context beyond, "It turns out computers are super expensive."

But to get there, you have to first slice the bills into resources that align with the business. Finance doesn't really care how much you spend on AWS. They don't care particularly if it's $1 million or $5 million. If you come to them with a, "Well, cloud is just expensive. Deal with it," that doesn't help them. They're not there to hold the purse strings. They're there to help the business grow. Give them information that informs their forecasts, and then they in turn can help the business make better decisions.

If one month it's $1 million and the next it's $2 million and the month after that it's $3 million, the CFO is going to be pretty pissed off. What's not well understood is that if this month it's $2 million, next month it's $1 million, and the third month it's 300K, the CFO is going to be just as upset. It's not about the cost; it's about blindsiding the business with what it costs to provide your goods or services to your customers, presuming you have some. Pro tip from someone who's been there all too frequently: you almost certainly don't want to surprise your business leadership because they will surprise you in return, and those scars last.

We spoke with a company who was a bit undersold on the value of being able to allocate their spend, which in this example was a bit on the smaller side. They knew that it was important to have a vague idea of where the money was going, but they weren't so sold on the idea of why it really mattered. Then their Amazon contract came up for renewal. Fun story. If you Google for AWS contract negotiation, I'm results one and two. I kind of do this a lot.

Now, details are of course buried under deep secret levels of NDA, but here's an open secret in the industry. If you commit to longer-term spend at certain dollar figures, you get discounts off of retail pricing. There are a lot of complexities to this on a service-by-service basis, and spend causes different pricing structures to unlock, but that's the baseline situation. Now the trick is, what's the right number that you should commit to? Too low and you leave money on the table. Too high, you don't hit your commit and have a shortfall. And complicating this whole thing is that Amazon has their own math on how to calculate what they think your commitment should be. It's okay-ish. It's a tool. It has no context. It doesn't recognize that the holiday season just ended and you sell Christmas decorations.

But what if you knew more about your growth than they did and had the data to back that up? What if you were able to predict your spend better than Amazon can? This company was in exactly that position. It let them take the discount offer they received, and in turn, realize another half million dollars of additional savings on a $5.7 million commit, solely because they had the better predictive model and the data to back it up. When you're able to predict your spend, you're able to negotiate better deals. The same was true back in the data center days. If you knew how many racks you were going to need by the end of year five, you could've negotiated it all up front. That's the futures market in a nutshell, and it's come to cloud, whether we want it to or not.
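The "too low versus too high" trade-off can be illustrated with a deliberately simplified model. Everything here is hypothetical: the two offers, the discount percentages, and the rule that you pay at least the committed amount are assumptions for the sketch, not the terms of any real AWS contract.

```python
def annual_cost(retail_spend, commit, discount_pct):
    """Pay the discounted spend, but never less than the commitment."""
    discounted = retail_spend * (100 - discount_pct) / 100
    return max(discounted, commit)

# Two made-up offers: (commit in dollars, discount percentage).
low_offer = (3_000_000, 5)
high_offer = (5_000_000, 8)

# Compare both offers under flat spend and under forecast growth.
for retail_spend in (5_000_000, 6_000_000):
    low = annual_cost(retail_spend, *low_offer)
    high = annual_cost(retail_spend, *high_offer)
    print(f"retail ${retail_spend:,}: low commit ${low:,.0f}, high commit ${high:,.0f}")
```

In this toy model the higher commitment only wins if spend actually grows, which is the point of having a better forecast than the vendor's: it tells you which row you will land in.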

All of this is building up to a mistake that's the source behind a lot of the things that we encounter in the wild. They collectively all speak to the single biggest strategic blunder that we see, and that's thinking of the cloud like it's just another data center. You can do that, but it's expensive, fragile, and basically the entire historical business model. They're the payday lender of technical debt.

Here's the easiest way to figure out if you've fallen into this trap. Look at your architecture. Now look at this image. Now back to your architecture. Now back to me. Are you basically only using EC2 instances and disk volumes? If so, I've got some bad news for you. You're pretty much a data center over on the left side.

Another way to tell whether you're stuck in data center thinking is to take a look at the hour-by-hour cost metrics in your environment. Do they decrease significantly in the off-hours for your business? Does the spend ramp up as your user traffic increases? Or are you seeing what a lot of folks did over the past year: 80% of your user traffic evaporates, but your cloud bill doesn't budge? The entire value proposition of the cloud is the ability to scale up and down to meet your workload's needs, which is what everyone tells themselves so they can feel better about not actually doing it.
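One rough way to run that check is to compare how far traffic falls from its peak with how far cost falls from its peak. The hourly numbers below are invented for illustration, and "swing" is a hypothetical measure defined here rather than a standard metric.

```python
def swing(series):
    """Fraction by which the series falls from its peak to its trough."""
    peak = max(series)
    return (peak - min(series)) / peak

# Made-up hourly samples for one day (a few representative hours).
hourly_traffic = [1000, 800, 400, 200, 200, 600]   # requests per hour
hourly_cost = [50, 50, 50, 48, 48, 50]             # dollars per hour

traffic_swing = swing(hourly_traffic)
cost_swing = swing(hourly_cost)

print(f"traffic falls {traffic_swing:.0%} off-peak, cost falls {cost_swing:.0%}")
```

A large traffic swing next to a near-zero cost swing is the data center pattern described above: the instances sit there costing the same every hour regardless of demand.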

The entire premise of the cloud is that you can dynamically scale up to meet demand, then, and this is the part everyone forgets, you can scale the hell back down again. Further, you can use higher-level managed services that remove the operational toil from your environment. I know I've talked a lot about AWS in this talk because that's what my business does, but every mistake I've described here applies to any cloud provider. There's nothing here that doesn't apply to Azure if you add in a pile of licensing issues, or GCP if you sprinkle in a bit of healthy fear of them turning your production environment off because they got distracted by something shiny.

So you're more efficient as you become cloudier. But there's a dark counterpoint here. It's getting more efficient the cloudier it gets, but it's also getting way, way, way less predictable. The unfortunate reality here is that the cloudier you become, if you start embracing the idea of paying per consumption on a per-request basis, it's way harder to predict your spend in terms of dollars and cents.

The best way you can get to a positive outcome here is to identify your baseline costs of things that are going to charge you regardless of what else you do, and then take a look at your workloads for the super cloudy environments. Figure out what it costs to serve a monthly active customer or 1,000 of them, whatever metrics make sense for your business. Then you can turn it back around on the business planning folks in your organization. They're having to predict the future, too, and you can sort of draft along behind them as they do it.
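A minimal sketch of that approach: separate the baseline from the variable spend, derive a cost per 1,000 monthly active customers, and apply it to the business's own forecast. All of the figures are invented, and the linear relationship between customers and spend is an assumption that will not hold for every workload.

```python
def cost_per_thousand(total_spend, baseline, active_customers):
    """Variable spend per 1,000 monthly active customers."""
    return (total_spend - baseline) * 1000 / active_customers

def forecast_spend(baseline, unit_cost, forecast_customers):
    """Baseline plus variable spend at the forecast customer count."""
    return baseline + unit_cost * forecast_customers / 1000

# Hypothetical month: $250k total, $100k of it fixed, 500k active customers.
unit_cost = cost_per_thousand(250_000, 100_000, 500_000)

# The business planning team forecasts 650k active customers next quarter.
projected = forecast_spend(100_000, unit_cost, 650_000)
print(f"${unit_cost:.0f} per 1,000 customers, projected spend ${projected:,.0f}")
```

The forecast is then expressed in the same terms the business already uses for its own planning, so a change in the bill can be explained by a change in customers rather than a shrug.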

The business is used to variability in their user metrics or whatever KPIs they're using as a lens through which they view their business. They're not used to IT spend being treated the same way. If you can tie the IT cost directly to the metrics used to ascertain the health of a business and the growth of its offerings, suddenly people will accept that very differently than they will you just shrugging when they ask you what the cloud bill is going to be next month.

Congratulations. You've sat through the slings and arrows I've hurled at a bunch of cloud decisions and come out the other side, still intact. Good for you. Maybe it's not all your fault after all.

Now, go to LastWeekInAWS.com/DOES, where I've put together resources for you. First, a PDF on how to cut your AWS bill, if you care. It's gorgeous. It has a platypus on it. It's something you can use to immediately start the cloud cost conversation internally. It has actionable things you can do yourself today.

Secondly, a high-resolution download of the AWS data transfer cost diagram that I talked about earlier. Oh, and before I forget, I also write the Last Week in AWS newsletter, which gathers the news from Amazon's ecosystem, gently and lovingly makes fun of it, and then it goes out every Monday and Wednesday to over 26,000 people.

I also host a pair of podcasts, "Screaming in the Cloud," which is a serious interview show about the business of cloud, and the "AWS Morning Brief," which allows me to indulge my ongoing love affair with the sound of my own voice. Again, I'm Corey Quinn, Chief Cloud Economist at The Duckbill Group. We help companies fix their horrifying AWS bills. Thank you very much, and enjoy the rest of the conference.