Honeycomb: All the World's a Platform

Las Vegas 2018

Software platforms: they are everywhere. Gone are the days of simply building a service for users; now everything is a platform. (What does that mean?) We use dozens if not hundreds of platforms every day in the process of building and shipping software.

This talk is about how you can be an intelligent, mindful, judicious user of platforms; how to read between the lines of what a vendor is willing to say about their platform; and how to be a good platform for others.

Charity is a cofounder and engineer at Honeycomb.io, a startup that blends the speed of time series with the raw power of rich events to give you interactive, iterative debugging of complex systems. She has worked at companies like Facebook, Parse, and Linden Lab, as a systems engineer and engineering manager, but always seems to end up responsible for the databases too. She loves free speech, free software and a nice peaty single malt.

Full transcript

Charity Majors

I've worked on platforms like Second Life at Linden Lab, Facebook, Parse. Parse most recently, and most damagingly to me, I mean.

And I'm currently the CEO and co-founder of Honeycomb, which was started in no small part to cope with the intractable problems that I had at Parse and Facebook.

I actually find it really hard to work on anything but platforms anymore. Why? Because it's really hard to go back to solving a problem just for one user when you have tasted the heady fruit of solving a category problem. It's so much more interesting and fun.

So last night I was up late and I asked some people, "What's a platform?" And boy, did we get some answers. Some of these were good. Some of them were like Mike Fiedler's, "It's a thing that you put other things on." Not wrong.

I was kind of surprised because I thought that maybe there was some canonical definition of platform that I just didn't know. Nope. Turns out, not really. Oh, there's more. There's way more. You probably can't read them.

But a couple good ones were... Let's see. Where was that one that I really liked? Oh, "A thing that we give people to speak from." No. Okay, so I can't find the one, but the one that I really liked was where they said it's not a platform if other developers can't make money off of it — if they can't monetize for themselves. And that's a rough way of assessing value, but if other people can't derive value for themselves, selling to other users or giving other users value, then it's not really a platform, it's just a fancy feature. And I thought about that for a while, and I feel like we could have a really great argument over whiskey at night, and I would be happy to take either side of that argument. It's an interesting question.

At Parse, we were a mobile backend as a service — basically Heroku for mobile apps. You're a developer, you want to write a mobile app, you just type, type, type, and you've got one. We handle all the push notifications, we handle all the database stuff, all the data stores. If you're a software engineer and you want to write a mobile app, we take care of everything. Queries — you can write any query that you want on your laptop and upload it, and I make it magic for you on MongoDB, no less. Also, you write JavaScript, upload it to the cloud, and I just make it work. It's magic. This is a terrible idea. Never do this. But for developers, it's fantastic. They don't have to think about it — until suddenly they have to think about it, but we won't talk about that.

What I think most of these definitions are getting at is this: you build and maintain a thing, others build and maintain a thing on top of your thing. So you kind of often are their outsourced ops team. That's an evil cackle, in case you weren't clear on that.

Why are platforms interesting? Well, it's kind of the same reason you could ask why problems are hard. You have no control over what people are going to do unless you anticipate what they might do that you don't like and stop them first. You can't go over and bonk someone on the head when they ship bad code. You can't even nudge them kindly or create an automated trigger. You don't have control. In fact, as an infrastructure person, the way I think of platforms is basically you're inviting their chaos to come sit on your servers for you to deal with. They have no chaos.

And as long as they're paying you, this can be fantastic. But not only do you have less control over the inputs, less visibility into what's happening, no ability to predict what's going to happen — this is like an N-squared problem, because they're all interacting with each other.

Raise your hand if you have no co-tenancy problems. Either you're asleep or you're lying. Either is fine.

Yeah. Oh God, I could tell so many war stories about this. Co-tenancy is probably at the heart of what makes platforms interesting and hard.

We were running MongoDB for over a million apps by the time I left. And when you're talking about storage, this is where — not separate the men from the boys, that's a terrible thing to say, I would never use that saying, but you get what I'm saying — this is where things get real.

This is an enterprise conference. I'm not used to enterprise conferences. I'm sweating about this.

This is where everything gets hard, though. Everything's all fine and good when you can tear anything down and spin another one back up without having to think about state. State is where things get interesting. And this is true over and over again — you see this in architectural patterns.

You can choose between two extremes. For perfect isolation, you would spin up a new replica set for every single application of every single user. A, expensive. B, that's a lot of coordination and orchestration problems. C, spin-up time is non-negligible. Basically, the trade-offs are going to be pretty bad upfront, possibly better down the road. This is kind of the Heroku model. They invested a lot of time and energy into the dynos. They were doing containers before containers were cool.

On the other side, you can do what we did, and you can just drop everyone into the same big replica set and be like, "This seems easy." And it is. It's super easy, until it isn't.

This is the right way to start, I think, usually — because most startups fail, and not because they didn't optimize enough upfront. Most startups fail for reasons of product-market fit, failing to attract users. Very rarely do you go, "Wow, I really wish that two years ago I had taken six months and really gotten this right." It was a great rough draft.

And most users who start using your platform aren't big users. It's that really sharp drop-off curve — where 20% of your users are easily 80% of your traffic. And you can't predict in advance who they're going to be. If you ask them, they will all tell you that they're going to be great successes. So you don't ask them. Don't even try.

You have less control over the inputs. You have far less visibility. You cannot trust your users. You can't trust anyone. You have the same amount of responsibility for their uptime as you would if they were your only customer. They don't care that you have 900,000 other customers that you have to keep happy as well as them. They care about their experience. They care about their users' experience. That is the only one that they care about.

Which leads to some really super fun observability problems.

Some particular lessons that we learned. So Parse was started back in the glory days of 2011, back when Ruby still seemed like a good idea. Like I said, usually the language you pick doesn't kill you as a company. In our case, it came really damn close, because it had the model of one process per request. And this works fine when you have basically a stack: web, app, database, and if you're proxying a request to the database, fine. Something gets slow, it's getting slow for everyone. Okay. But it's not fine when you have dozens of databases behind this one pool of Ruby application servers. Because now, say you have 36 database replica sets. Well, times five nodes, that's 180 nodes. Something's going to be going down every week. And any time that application code is getting slow talking to anything on the back end (whether it's an HTTP request out, whether it's a request to the database, anything) it's going to start to consume all of the available workers. And before your monitoring can even catch it, the entire site goes down for everyone, all of your hundreds of thousands of customers, just because one person decided to run unit tests in production on your... Yeah, can't stop them from doing it. It's very fragile.
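The worker exhaustion described here can be put into rough numbers with Little's law (busy workers = arrival rate x time each request holds a worker). The sketch below is illustrative only: the pool size, request rate, traffic share, and latencies are made-up assumptions, not figures from Parse.

```python
# A back-of-the-envelope model of a one-process-per-request worker pool.
# All constants are hypothetical values chosen for illustration.

POOL_SIZE = 500      # assumed number of single-request worker processes
TOTAL_RPS = 3000.0   # assumed total request rate across all apps
SLOW_SHARE = 0.03    # assumed fraction of traffic that hits the one slow backend


def workers_in_use(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: average number of workers held at any moment."""
    return requests_per_second * latency_seconds


def pool_demand(healthy_latency: float, slow_latency: float) -> float:
    """Total workers needed when a small slice of traffic has a different latency."""
    healthy = workers_in_use(TOTAL_RPS * (1 - SLOW_SHARE), healthy_latency)
    slow = workers_in_use(TOTAL_RPS * SLOW_SHARE, slow_latency)
    return healthy + slow


# Every backend answers in 50 ms: demand is about 150 workers, well under the pool.
normal = pool_demand(healthy_latency=0.05, slow_latency=0.05)

# One backend stalls to 10 s per request: 3% of traffic now needs roughly
# 900 workers on its own, so the shared pool is exhausted for every tenant.
degraded = pool_demand(healthy_latency=0.05, slow_latency=10.0)

print(f"normal demand: {normal:.0f} of {POOL_SIZE} workers")
print(f"degraded demand: {degraded:.0f} of {POOL_SIZE} workers")
```

Under these assumptions a stall affecting 3% of requests is enough to starve the other 97%, which is the failure mode the talk attributes to a shared pool of blocking workers.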

We undertook the most painful and yet most justified rewrite of my career, rewriting the whole goddamn thing from Ruby and Rails to Go. This was in 2013, 2014, and I wrote a long blog post about it. It's traumatic for me to read, but it might be fun for you to read.

This thing took us almost two years to rewrite, because we had to do it endpoint by endpoint. And I don't know if you know this, but Ruby has some things that are implicit — just a few. It guesses about things a lot. It's like, "Well, this seems like a zero. Well, this seems like an empty string. Well, this seems like either one, depending on something else that's in the blob." And when you're moving to Go, every single one of those things becomes explicit, and you find them all the hard way.

Don't use a language that isn't multi-threaded, is I guess what I'm saying. You'll save yourself a lot of pain.

Also, start investing in throttles early, and make them very fine-grained: ones that will let you blacklist, partially or entirely, any single application, any user and all of their applications, or every application that's on a particular shard. This is somewhat customized to the scenario that we had, but you get the picture.
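As a minimal sketch of what a fine-grained throttle along these lines might look like (the `Throttles` class, its method names, and the scope labels are hypothetical, not Parse's implementation), each rule rejects some fraction of requests for one app, one user, or one shard:

```python
import random
from dataclasses import dataclass, field


@dataclass
class Throttles:
    """Hypothetical throttle table: (scope, identifier) -> fraction of requests to reject."""

    rules: dict = field(default_factory=dict)

    def block(self, scope: str, ident: str, fraction: float = 1.0) -> None:
        """Reject `fraction` of matching requests; 1.0 is a total block."""
        self.rules[(scope, ident)] = max(0.0, min(1.0, fraction))

    def unblock(self, scope: str, ident: str) -> None:
        self.rules.pop((scope, ident), None)

    def allow(self, app_id: str, user_id: str, shard: str, rng=random.random) -> bool:
        """Return False if any rule matching this request decides to reject it."""
        for key in (("app", app_id), ("user", user_id), ("shard", shard)):
            fraction = self.rules.get(key, 0.0)
            # rng() is in [0, 1), so a fraction of 1.0 always rejects.
            if fraction > 0.0 and rng() < fraction:
                return False
        return True


throttles = Throttles()
throttles.block("shard", "shard-7")             # total: pretend the shard doesn't exist
throttles.block("app", "noisy-app", fraction=0.5)  # partial: shed about half its requests
print(throttles.allow("quiet-app", "user-1", "shard-7"))  # False
print(throttles.allow("quiet-app", "user-1", "shard-2"))  # True
```

Keeping the rules in a plain lookup table means a new scope can be added without touching the request path beyond one more key to check.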

You can slice and dice distributed systems by so many edge cases and different ways, and every single one of them can probably take you down. And while you're figuring out what's wrong with it, you want to be able to just pretend it doesn't exist. You really want this.

You also want to think about velvet ropes for when you are doing that to a large section of your traffic and you want to start letting people back in, because I guarantee you that a lot of them have overwritten the perfectly nice, graceful retry logic that you gave them in their SDKs, and instead they will start pummeling you five times as fast.
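One way to picture a velvet rope is a ramp that admits a growing, stable subset of apps after a block is lifted. The following is a sketch under assumptions: the ten-minute ramp, the 5% starting fraction, and the function names are all invented for illustration.

```python
import zlib


def admitted_fraction(seconds_since_reopen: float,
                      ramp_seconds: float = 600.0,
                      floor: float = 0.05) -> float:
    """Fraction of apps let back in; grows linearly from `floor` to 1.0 over the ramp."""
    progress = min(1.0, max(0.0, seconds_since_reopen / ramp_seconds))
    return floor + (1.0 - floor) * progress


def admit(app_id: str, seconds_since_reopen: float) -> bool:
    """Decide whether this app is inside the rope yet.

    The bucket comes from a stable hash of the app ID, so an app that has been
    admitted stays admitted as the fraction widens, instead of flapping.
    """
    bucket = (zlib.crc32(app_id.encode("utf-8")) % 10_000) / 10_000
    return bucket < admitted_fraction(seconds_since_reopen)


apps = [f"app-{i}" for i in range(1000)]
for elapsed in (0, 150, 300, 600):
    inside = sum(admit(app, elapsed) for app in apps)
    print(f"t={elapsed:>3}s admitted {inside} of {len(apps)} apps")
```

Gating by app rather than by individual request means a client that retries aggressively cannot retry its way past the rope early.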

So if you don't have a platform problem, why should you care about this? Well, I feel like it's interesting to think about platforms and tell these stories because you can extrapolate so many of the architectural lessons. And in fact, because of the nature that they're N to the Nth power — the fact that you have to care about exponentially more points of view — I feel like platforms tend to run into the interesting problems about two to ten years earlier than single-user stacks.

And there are a lot of reasons for this. I could talk about this for a very long time. It's really nice to have that mental model.

Now, it's no secret — I'll say this all the time — we are all distributed systems engineers now. You should all ask for a raise. It's true.

The model that we've had, the mental model that we have had for most of our careers, is failing us quicker and quicker for more and more people. The model of the LAMP stack, basically — where you're used to a world where you can take a pane of glass, take a look at your dashboards, intuit where the problem is coming from, jump in, and like magic, fix it. Now I love being a hero. It's my favorite thing. But there comes a point when you cannot fit all of the running components of the system in your head. You guys are enterprise folks, you should know this. There comes a point where nobody can fit it all into their brain, and you shouldn't try. You're doing yourself and everyone else a disservice if you try to keep it all in your head and reason about it there instead of putting it in a tool where you and your teammates can look at it together.

I feel like the correct model for thinking about systems in the near and distant future is not that of a LAMP stack, but that of the national electrical grid. Where you can't keep track of it all. You'd be dumb to try. You kind of have to unclench, release it off to the universe.

Because so many of the problems that you need to debug are hyper-local. Like, a tree fell over on Main Street in some small town in Iowa — I'm from near there, I can say that. You couldn't predict it. None of your unit tests were going to find that. Capture-replay isn't going to find that. And yet you need to swiftly isolate it, figure out why power's out in Iowa, and move on.

So those are hyper-local. And then there's ones where you have to zoom way out to understand what's going on. The example I often use is: if every bolt that was manufactured in 1982 is rusting twice as fast, and you're like, "What? What do these things have in common?" You could never have predicted it. There's no point in having a postmortem where you write a monitoring check for the next time one bolt is rusting twice as fast as every other bolt, so you can make a dashboard of all the different bolt-rusting rates. Come on. No.

But instead of that model, which has served us so well for 20-odd years, we need to be investing in tools and training ourselves to think about our systems with an exploratory approach, with an approach of curiosity, with an open-ended mind. You don't know if it's hyper-local, or the opposite of local, or right in the middle, or if it's some combination of five impossible things before midnight. Because these are the problems that we have in distributed systems. They're not the ones where you get paged and you're like, "Oh, that again." Toddle off, find my runbook, follow the instructions. We're automating those problems out of existence, I hope. And we have to, because they're increasingly replaced by every time you get paged: "Ugh, what on earth is that?"

Every time you get paged, you should have no idea what is wrong. But then what do you do? Well, you don't know. You haven't seen it before, therefore all the dashboards in the world are not going to do you any good. I really prefer to think of dashboards as technical debt. Every dashboard is just an artifact of some past failure. "What do we do? Every time the site goes down, postmortem. Let's make a dashboard so we can find this immediately the next time." Go and make a dashboard. Okay, fast-forward a year. How many dashboards do you have? Can you find the right one when you need it?

No, because the problem is not answering questions. We all have tools that can answer questions immediately. The problem is you don't know what the question is. If you knew what the question was, you could answer it. The problem is that you don't have a nice gift-wrapped bug report. You have some vague reports from unreliable narrators, and maybe an intuition that something might be kind of weird, and something that is always flaky but now just seems extra flaky. Or worse yet, has stopped being flaky and no one knows why. It's a different approach. You need different tooling.

And platforms are running into this before everyone else. And alright, origin story here. This is not a pitch for Honeycomb — but I mean, I'm the CEO, I kind of have to. Whatever.

So around the time that Parse got acquired by Facebook, I was coming to the horrified conclusion that we had built a system that was basically undebuggable by some of the best engineers in the world doing all the right things. It just couldn't be done. Someone would write in every day, "Parse is down." I'd be like, "Parse is not down. Behold my wall full of beautiful dashboards. They are all green. Go away." Arguing with your customers always works.

So they'd be complaining, and I'd be like, "Oh, this is Disney. Okay, I have to go figure it out." Now Disney might be doing four requests per second — mobile app traffic is not large. But I'm doing 100,000 requests per second. Never shows up in my time series aggregates. Literally isn't there. So I have to go, or I have to dispatch an engineer. Someone has to go track it down and figure out what's wrong. And this would take — because of the complexity of our system...
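The arithmetic behind "never shows up in my time series aggregates" is worth spelling out. Using the two rates quoted above (4 and 100,000 requests per second), and assuming for illustration that every one of the small tenant's requests fails, the helper below (a hypothetical name) computes what the platform-wide error rate would show:

```python
def aggregate_error_rate(total_rps: float,
                         tenant_rps: float,
                         tenant_failure_ratio: float,
                         baseline_error_ratio: float = 0.0) -> float:
    """Platform-wide error ratio when one tenant fails at `tenant_failure_ratio`."""
    other_rps = total_rps - tenant_rps
    errors_per_second = other_rps * baseline_error_ratio + tenant_rps * tenant_failure_ratio
    return errors_per_second / total_rps


# One tenant at 4 rps is failing 100% of the time; everyone else is healthy.
rate = aggregate_error_rate(total_rps=100_000, tenant_rps=4, tenant_failure_ratio=1.0)
print(f"platform-wide error rate: {rate:.4%}")  # 4 in 100,000 requests
```

A total outage for that tenant moves the aggregate by four thousandths of a percent, far below the noise on any dashboard, which is why the per-tenant breakdown matters.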

Okay, architecture diagram. So you can see the humble LAMP stack, very easy to reason about. You can see Parse a few years ago, very not. That cloud in the middle is like a few hundred MongoDB replica sets. There's a bunch of containers up there somewhere, and developers all over the world just writing and uploading queries that do a 5x full table scan because they can. And now I just have to make it work.

So I would spend hours, if not days, trying to track down why does this user think that we're down? Sometimes it's their fault, sometimes it's my fault. Let's be honest — usually it's their fault, but sometimes it's my fault. And I can't get too cocky about it, or I won't be doing my job.

And here's the other side that I often say: there's the national electrical grid — this is what should be in your head when you're thinking about the systems that you build.

So I tried everything out there in the market. And when I was leaving Facebook, the one thing that finally helped me get a handle on our problems was this aggressively hostile little tool at Facebook called Scuba. It hates you, it wants you to go away. It was hacked together overnight by someone ten years ago to help them get a handle on their MySQL problems. All it does is let you slice and dice in near real time on arbitrary dimensions and arbitrary high cardinality.

High cardinality means: say you have a collection of 100 million users. The highest-cardinality dimension would be anything unique, like Social Security number or request ID. Still quite high, but lower, would be last name or first name. Very low would be gender, and lowest of all, presumably, is species equals human. So this tool just let us break down by high-cardinality dimensions, so we could break down to one user in ten million, and then by every combination of everything else.
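To make the idea concrete, here is a toy sketch of breaking down raw request events by an arbitrary dimension. It is not how Scuba or Honeycomb work internally; the event fields, the sample values, and the helper names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical raw events: one record per request, with every dimension kept.
events = [
    {"request_id": "r1", "app_id": "disney", "endpoint": "/query", "status": 504, "duration_ms": 61000},
    {"request_id": "r2", "app_id": "disney", "endpoint": "/query", "status": 504, "duration_ms": 61000},
    {"request_id": "r3", "app_id": "acme", "endpoint": "/query", "status": 200, "duration_ms": 40},
    {"request_id": "r4", "app_id": "acme", "endpoint": "/push", "status": 200, "duration_ms": 15},
    {"request_id": "r5", "app_id": "zeta", "endpoint": "/query", "status": 200, "duration_ms": 55},
]


def cardinality(records, field_name):
    """Number of distinct values a dimension takes across the events."""
    return len({record[field_name] for record in records})


def break_down(records, by, where=lambda record: True):
    """Group matching events by any dimension and summarize each group."""
    groups = defaultdict(list)
    for record in records:
        if where(record):
            groups[record[by]].append(record["duration_ms"])
    return {key: {"count": len(values), "max_ms": max(values)}
            for key, values in groups.items()}


# request_id is unique per event (highest cardinality); status takes only a few values.
print(cardinality(events, "request_id"), cardinality(events, "app_id"), cardinality(events, "status"))

# Ask an ad hoc question: which apps are seeing server errors, and how slow are they?
print(break_down(events, by="app_id", where=lambda record: record["status"] >= 500))
```

Because nothing is pre-aggregated, the same events can be regrouped by `endpoint`, `status`, or any combination without deciding in advance which questions will be asked.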

Now, if you work with platforms, you have probably tried to do this terrible thing, which is pre-generate all of your dashboards, but just for a particular user who's very important, or just for your top ten users. And everybody's just like, "Ow." I know, it's disgusting. But we have to do that because we can't just ask the question.

Anyway, so that didn't exist, so I decided to go build it. But I was really thinking it was just a platform problem. And over the course of the year of building it, I've realized that it's not a platform problem. This is a pure function of complexity. It's just the range of possible outcomes, possible answers that could be contributing to the symptoms that you're seeing.

And that's when I started to realize that platforms are exciting because they hit these problems first. It's not just one system that you're trying to understand. You have to wash your hands of the idea that you can understand the systems. They're not your systems. You're just here to observe them as best you can. So you have N systems, one for every user who's on your platform, plus the intersections of all of them, plus the giant system that they're all sitting on. How awesome is that? Super fun.

All right, I have two and a half minutes left. Let me get back to my other slides to see if there is anything useful left here at all.

Anytime you have to think about a single user, you have failed in some way. Having to think about any individual user is the kiss of death for a platform. Now, it's okay — you're going to have to do this every day. But every time, it's like SSH-ing into a server. It's an anti-pattern. Make a note of it. Remind yourself, "This is bad. I did something bad, or I wouldn't be here doing this."

Platform commandment two: keep your critical path as small and isolated as possible. You can't care about everything equally. At Parse, our API was of paramount importance, and we were very cautious about what we would let into that API service because it had to be up all the time. Push notifications — lower priority. Website — didn't care. So when the world goes down, you know what to prioritize. Everybody knows what matters. And you have to keep that set as small as possible, because that's the only way that you can actually commit to guaranteeing a high level of quality to those things if you're restrained by engineering hours, which I assume you all are.

Number three: it is the job of the platform to protect itself at all costs, including at the expense of any app. Above all, stay alive. This is important to remember when you are buying a platform, when you're consuming a platform. It is the platform's job to kill you to protect itself. It's fine. It's your job to care about your reliability and make sure that it's not the end of the world.

Expose useful stuff to your users — and by that, I mean useful to you. See the world through your user's eyes. And by this, I mean you must have high-cardinality tooling that will let you slice and dice and see exactly what... Is Disney — do they think that they're down because all of their requests are timing out at 61 seconds? Do they think that they're down because they uploaded a snippet of code that means they're returning a null response? Anytime that you can work to give the users some visibility into what's happening in your magic — it's a black box to them. They don't know, and you can't blame them. Once I started telling people at Parse that we were using MongoDB, here's what a query looks like, here's how not to do a full table scan — they got better, and they thanked me. And I was just like, "Well, I thought this was obvious." It's not obvious.

The most valuable thing in your arsenal: if you move from monitoring checks to explorable ad hoc tooling where you can ask any question and interrogate your systems, you can delete basically all of your monitoring checks except for request errors, latency, and some end-to-end checks that traverse the most critical code paths — the ones that make you money, the ones that your users care about and depend on.

Now, trick question: as soon as you put those in place, they're going to start flapping, and you're going to go, "Oh, this is a lousy check." No — you have a lousy infrastructure. You just don't know it yet. So don't disable them. Fix it. That is the pain that your users are living through. They'll be really happy with you if you fix it.

You must have — and this is kind of a pitch for Honeycomb, but also just: build better tools. We all need to be building tools that are native to the land of distributed systems, that let you explore, that don't have this ancient, big... All right, sorry. I'm done ranting.

Invest in every kind of throttle that you can — partial and total. And dogfood. Dogfood. Know what your users are feeling.

And lastly: all co-tenancy isolation guarantees are complete and utter trash. And this is just a restating of some of them. If you're buying, here are some questions that you can ask your platform vendor to drive them nuts, but also get a better sense for the level of quality that they will deliver and what you could expect when things go wrong.

All right. Thank you for having me.