Introducing Latency Squeeze Injection (LSI), the Next Generation of Resilience at Capital One

Las Vegas 2024

As an industry we’ve grown accustomed to a black-and-white definition of system status: it’s either “Up” or “Down.” But what about the vast domain of insidious gray failures in between that go undetected during “normal” operations? In this session, we share our experience building ‘PRECog,’ a tool that enables engineering teams to proactively identify high-severity gray failures that had previously been overlooked, before they result in degraded customer experiences. ‘PRECog’ uses an advanced form of chaos experimentation we developed called Latency Squeeze Injection (LSI). We explore why we built PRECog at Capital One and how we use it to verify Service Level Objectives (SLOs) continuously, proactively identify service degradation points, and improve key aspects of system resilience such as retries, fallbacks, timeouts, circuit breakers, and more. Aaron Rinehart, Distinguished Engineer at Capital One, will share how they have used this new approach to confidently explore system safety boundaries and build more resilient and reliable distributed systems for their products and services.

Full transcript

The complete talk, organized by section.

Aaron Rinehart

So let's go ahead and sort of dive into it. The methodology is called Squeeze Injection — that's why the title — and we're going to go through what that is, what that means. Naming things is hard, you know, I mean — okay. Next.

I actually stole these slides from James Wickett over here in the corner. I love O'Reilly. I am an O'Reilly author. I've written several books on chaos engineering and security chaos engineering. So I love to kind of poke at it with the memes. But feel free to half-listen, like you do at conference talks. I'll be around the rest of the day. If you really want to chat and dive deeper, talk about tooling engineering, get nerdy, I like to do that. So feel free, just nod along, check your email. I will keep this as entertaining as possible.

A little bit about me. I am currently a Senior Distinguished Engineer at Capital One. Basically I own the strategy for SRE, chaos engineering, resilience, and reliability. I am the kind of person who builds tooling and capabilities so things don't go bump in the night, or to prepare you for that event. I am also, like I said, an O'Reilly author on the topic.

Oh, also, if you guys have interesting questions, I have three O'Reilly books to give away for good questions.

All right, so let's dig into it. Fundamentally, the problem that we're trying to address is that our systems have evolved in complexity, but our human ability to mentally grasp what's going on has not kept up. There are too many things going on inside a distributed system. Even in the three-tier era of building applications, it was complex. We never actually see lines of code execute. The system and its complexity are very difficult for us to even keep up with mentally. So it's very difficult to understand what the system is and what it is doing right now.

So there are many definitions of resilience. There's actually a formal definition of resilience engineering from Erik Hollnagel. But I kind of like this one better. I saw it recently at QCon. Resilience is adapting to the unexpected. That's what it's about. The two pillars of resilience engineering are adaptive capacity and graceful extensibility. Those are the two terms. Having the ability to respond, having the capacity to respond, is what's important.

So what does all this have to do with chaos engineering? I'm getting to it, okay, I'm building. First of all, who here is familiar with chaos engineering? Okay, everybody. Wow. It's interesting, because early on, when I gave talks on chaos engineering, I don't know, six years ago, I'd ask "who understands chaos engineering?" and I'd get like one hand. But then I'd ask "who understands Chaos Monkey?" and get like 80% of the audience. It's the same thing, you know? So it is interesting.

So chaos engineering essentially is the idea of proactively introducing the conditions under which you expect the system to operate, to verify that under turbulent or adverse conditions in the system environment your system can actually respond. So, think about a failover, think about fallbacks, think about retries. Think of these things we put in place for "when X occurs, Y will happen." These are things that are put in place to keep us safe when unexpected things happen in the system. The problem is that the logic we build for the failovers and other mechanisms rarely gets exercised until the problem occurs. By then the system has evolved and changed many times, and often it doesn't work as gracefully as you intended.

So, to try to paint a new picture of chaos engineering: if you think about it, it is really about proactively verifying that those mechanisms are still effective, instead of finding out when the problem occurs.

So it's not about breaking things. This slide should be in every chaos engineering conversation. We're not breaking stuff. You see articles on LinkedIn or people posting things about monkeys in data centers pulling cables. No, it is not about that. Our systems are fundamentally chaotic, right? Like I said before, there are so many pieces. They're moving, they're changing. It's very difficult to know what's going on. And it is chaos.

So what we're trying to do is proactively create order through understanding and instrumentation. All of engineering, all of science, revolves around two major pillars: instrumentation and measurement. You have to know what works and what doesn't, and know what to iterate on and fix.

So we're trying to fix it on purpose, okay? Because, you know, as we move to software eating the world (we're not there yet, but we've moved significantly in that direction, right?), software never decreases in complexity, it only increases. Let's say you have a complex system, right? And you want to make it simple. What do you have to do? You have to change something, right? Well, there's an inherent relationship between making a change and adding more complexity. Somebody's got to read and understand and know and test that code, right? You're really just moving the complexity around. You can't really reduce it. But the important thing with chaos engineering, what makes it valuable, is that you can navigate the complexity by creating understanding and order.

So there are challenges. Chaos engineering is a tremendously valuable exercise. I don't know how many people here are doing it now. Interesting. Ah, wow. That's pretty good. Talk to me afterwards and let me know. But there are challenges with doing it the traditional way. It's still valuable. Don't hear me wrong. This is still valuable, right?

So Capital One has been doing chaos engineering for, I think, eight years. Topo [Pal] was there before. They fail over the entire company, all 2,500 applications, four times a year. East to west, west to east. You know, it's major, actually. But they find things, significant things: things that were talking across regions that shouldn't have been. You do find things, but there are some issues with the traditional approach.

So, limitations. One of the limitations is that the approach is not continuous. We love to think, and I think this is partly a human thing, of failure in terms of a hundred percent failure. Because that's what we see. We see that thing break. It broke, right? What no one thinks about is partial failure, right? Because it didn't fail yet. But it's there. They call that gray failure. We're going to talk about that.

So it's not continuous. Yes, doing it via the CI/CD pipeline, post-release, is kind of continuous, but not exactly. I'm a big believer that chaos engineering should move beyond the pipeline, because lots of things change inside your environments outside of the pipeline. Network experimentation would be an example.

So, number two: when you introduce a failure, let's say you force a failover, and let's say it didn't go right, that's a lot of toil and friction for an engineering team under delivery pressure, right? They have to stop everything. What went wrong? Try to fix it, right? Yes, they found something. But is that value worth the loss of productivity and delivery? It's hard to quantify, it really is. So there are ways to do this. There's a balance between doing it the traditional way and what I'm about to show you.

So we typically think of the world in terms of "the system is up, the system is down." This is a Gaussian curve, a bell curve, right? Well, we actually operate in a vast area of gray, of partial failure modes. The problem is that there are very few mechanisms for instrumenting gray failure. One you've probably heard of is SLOs, right? An SLO is not really a way of instrumenting it. It says, "I have calibrated my application, my service, to handle 200 milliseconds of latency degradation," right? It's like you're drawing a line in the sand. If you cross it, it could be an indicator that something bad is happening, right? Your errors are rising, something is happening. It's a way to divert your attention so you go look and see what's happening.

What about latency squeeze injection? What I'm about to show you is a way to instrument and test for gray failure.

So, gray failures. I think Microsoft was there in the early days and sort of identified and talked about gray failure. There's a paper from 2017 written by some fellows at Microsoft talking about this problem of gray failure. It's becoming more and more important. It's actually a really interesting problem space in terms of system operations.

But I like to describe it in terms of product. At Capital One, let's say, for example, I log in to my app and I can't check my balance. I'm probably just going to log in later and see if it works. But what if every time I get into the application there's an inconsistent experience, right? Or I'm getting latent behavior. I'm not sure if it's my phone, the wifi, the app. What's going on here? I'm more likely to take my money elsewhere, right? So the idea is that these gray failures are often the ones that impact customers the most. This is why they're very important, and why we need to get a handle on this.

So what is latency squeeze injection? I came up with this idea with Casey Rosenthal, the guy who created chaos engineering at Netflix. It was originally inspired by Netflix's Chaos Automation Platform (ChAP). Netflix runs hundreds of thousands of experiments a day using this platform. The idea was to give engineers feedback on implicit SLOs. You have your calculated SLO, but there is also an implicit SLO in the system: the amount of degradation you can tolerate right now, right? So this methodology is actually a way to test and verify SLOs, whether you're out at Jupiter or close to the moon in terms of where you're at.

The reason I was able to do this at Capital One is that Capital One is very homogeneous, very much AWS, and the ability to inject latency exists in AWS's Fault Injection Service.

Engineers really gravitate towards this tooling because they don't have feedback on retries. People kind of guess at retries and timeouts and, you know, things like connection pooling, even some resourcing. Best guess, right? Usually they find out that the logic wasn't tuned correctly during an outage, because there's no other way to instrument it and find that out. So engineers are heavily using it for that reason. But you can use this methodology for more than just that.
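To make the retry and timeout guesswork concrete, here is a minimal sketch of the worst-case time a caller can spend on one request. It assumes sequential retries with a fixed backoff. The function name, the parameter names, and the numbers are hypothetical illustrations, not anything taken from Capital One's tooling.

```python
def worst_case_ms(timeout_ms: int, retries: int, backoff_ms: int) -> int:
    """Longest time one logical call can take if every attempt times out.

    Assumes (hypothetically) sequential attempts and a fixed backoff
    between attempts; real clients often add jitter or exponential backoff.
    """
    attempts = retries + 1  # the first try plus each retry
    return attempts * timeout_ms + retries * backoff_ms


# Hypothetical settings: 100 ms timeout, 2 retries, 25 ms backoff.
budget_ms = 200  # illustrative latency target
worst = worst_case_ms(timeout_ms=100, retries=2, backoff_ms=25)
exceeds_budget = worst > budget_ms
print(worst, exceeds_budget)
```

With these made-up settings the worst case is well past the 200 ms target, which is the kind of mismatch that tends to surface only when latency is actually injected.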

So one of the things I really like about this methodology (I'm going to get into how we actually do it) is that it allows you to identify and map a system safety boundary: the point where your system actually starts to degrade. Like whether you're on solid ground or approaching the edge of a cliff, right? The point where you leave solid ground and start approaching the edge, I call that the safety boundary. You have not failed. The high-severity incident has not happened yet, but you're heading towards it. As an engineer, you have no intuition for where that is until it happens. And then you're scrambling, trying to figure it out and fix it. You're in a war room, on a bridge, right? That's not the best time to start figuring this stuff out.

Okay. So here's an example. Follow me here. With latency squeeze injection, what we're trying to do is instrument for gray failure. Who here is familiar with sonar? The Navy uses sonar. It's an audible "ping, boom." Then the waves come back, and they can see how far away things are, and they can map the ocean floor and things like that. Well, we came to this idea that you could use latency similarly to how sonar works.

Latency is an interesting phenomenon, actually. We typically think of it in terms of the network, but it's not just that. Latency is the host, it's the network, and if you're on Amazon, it's Amazon. Everything contributes to it. It's TCP/IP. I mean, it's fundamentally the internet. It is so unpredictable because all those factors come into play, right? So you can kind of monitor for it, but it's pseudo-random. There's a pattern in the chaos, but it's somewhat random, right?

So the idea is that we introduce controlled waves of latency. Let's say I have an SLO of 200 milliseconds for latency, okay? This is an example. What I will do is proactively introduce 120 milliseconds into the service and look at HTTP 5XX server errors to see if they start to spike. At 120, nothing happens. Break it down. I go to 140. Same thing, nothing happens. I get to 160, and all of a sudden there's a brief spike. I have found the point where you're starting to head towards the edge.

The pattern is in the data. I'm still working on the data analysis to try to figure out what to do with it. But there's always this weird blip-flat-drop. And I call that the safety [boundary]. I'm not a statistician, right? I wish I was, 'cause the data is really interesting. But just to see that pattern every time is an interesting phenomenon.

The idea is that when I hit that 160 and see that spike, we immediately break down the experiment and notify the team: "Hey, under the following conditions, customers could be having a bad time. You believe your service is calibrated to 200 milliseconds. You're heading over the cliff sooner than that, right?" It's a way to proactively give a heads-up that something bad may be on the rise. It could be that your downstream or upstream dependencies are causing a problem, right? There could be a problem with your retry logic, your fallbacks, or your circuit breakers not kicking in, those kinds of things.
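The 120, 140, 160 walk-through above can be sketched as a small loop. This is an illustration under stated assumptions, not PRECog's implementation: `error_rate_at` is a hypothetical callback standing in for "inject this much latency, then read the 5XX rate", and the `spike_factor` threshold is an invented way to decide what counts as a spike.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class SqueezeResult:
    boundary_ms: Optional[int]  # first injected latency that produced a 5XX spike
    slo_ms: int                 # latency the team believes the service tolerates
    waves_run: int              # how many waves ran before stopping

    @property
    def degrades_before_slo(self) -> bool:
        return self.boundary_ms is not None and self.boundary_ms < self.slo_ms


def run_squeeze(
    waves_ms: Sequence[int],
    slo_ms: int,
    error_rate_at: Callable[[int], float],  # hypothetical: inject, then observe
    baseline_error_rate: float,
    spike_factor: float = 3.0,              # assumed spike threshold
) -> SqueezeResult:
    """Step through latency waves and stop at the first 5XX spike."""
    for wave_number, latency_ms in enumerate(waves_ms, start=1):
        rate = error_rate_at(latency_ms)
        if rate > baseline_error_rate * spike_factor:
            # Spike seen: stop the experiment here and report the boundary.
            return SqueezeResult(latency_ms, slo_ms, wave_number)
    return SqueezeResult(None, slo_ms, len(waves_ms))


# Made-up service that starts erroring once added latency reaches 160 ms.
def fake_error_rate(latency_ms: int) -> float:
    return 0.001 if latency_ms < 160 else 0.02


result = run_squeeze(
    [120, 140, 160, 180, 200],
    slo_ms=200,
    error_rate_at=fake_error_rate,
    baseline_error_rate=0.001,
)
print(result.boundary_ms, result.degrades_before_slo)
```

In this toy run the loop stops at the third wave and never injects 180 or 200 ms, which mirrors the "break down the experiment and notify the team" step.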

So there are a couple of different patterns. I'm going to go to the methodology now, how you do this. The goal with this talk was to tell you about a new way of doing chaos engineering, a way to instrument gray failure, but also to give you enough to know what to build yourself. We're working on trying to figure out how to open source something like this so folks can take it and build it themselves. But it's not really terribly difficult to build either.

So we have several different variables that we tweak. There is the amount of latency, and how long its duration is. As part of building this, you know, I found my intuition for latency was not what it actually is. The pattern for latency is much different than you think. It's not a quick spike. Everybody thinks it's this big spike and then you're in trouble, but that's not how it really works.

But we do these stepped waves, stepping up. You can do as many waves as you want, but we step it up and inject increasing levels. The idea is that instead of shooting your instance in Amazon in the head, right, killing the machine, you're slowly making it uncomfortable to see if things start kicking in. Are you starting to horizontally scale? Are your auto-scaling [groups] kicking in? Are your circuit breakers kicking in? Are the connection pools filling up? Right? You can kind of see it unfold by putting it under these conditions and making it uncomfortable.
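A stepped-wave plan like the one described can be generated from a handful of knobs. The sketch below is hypothetical: the `Wave` type, the parameter names, and especially the rest period between waves are assumptions made for illustration, since the talk only names the amount of latency, the duration, and the number of waves.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Wave:
    start_s: int     # offset from the start of the experiment, in seconds
    duration_s: int  # how long the latency is held
    latency_ms: int  # how much latency is injected


def stepped_waves(start_ms: int, step_ms: int, count: int,
                  duration_s: int, rest_s: int) -> List[Wave]:
    """Build `count` waves, each `step_ms` higher than the one before."""
    waves = []
    offset_s = 0
    for i in range(count):
        waves.append(Wave(offset_s, duration_s, start_ms + i * step_ms))
        offset_s += duration_s + rest_s  # assumed recovery gap between waves
    return waves


# Five waves from 120 ms to 200 ms, 60 s each, with 120 s of rest in between.
schedule = stepped_waves(start_ms=120, step_ms=20, count=5,
                         duration_s=60, rest_s=120)
for wave in schedule:
    print(wave)
```

Keeping the plan as plain data makes it easy to hand the same schedule to whatever actually injects the latency.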

Doing chaos this way allows you to do more of this kind of stuff in prod, because you're not killing it. You're not causing it to go down. There's absolutely no reason why that should fail. It also allows you to go from doing hundreds to millions of chaos experiments.

So we came up with these different patterns. Different bursts of latency trigger different kinds of safety mechanisms. The longer burst is better for circuit breakers. If you have circuit breakers in your environment, or you're trying to trigger your connection pool, use a long burst, because a short burst won't really trigger those as well. Short bursts are better for retries, stuff like that.
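One way to see why burst length matters is a toy circuit breaker that opens after a run of consecutive timeouts. This is a deliberately simple model chosen for illustration (real breakers often use failure-rate windows), and every name and number in it is hypothetical.

```python
from typing import List


def burst(length: int, total: int = 20, start: int = 5) -> List[bool]:
    """Timeline of calls, one per tick; True means the call timed out."""
    return [start <= tick < start + length for tick in range(total)]


def breaker_trips(timeouts: List[bool], trip_after: int) -> bool:
    """True if `trip_after` consecutive timeouts ever occur."""
    streak = 0
    for timed_out in timeouts:
        streak = streak + 1 if timed_out else 0
        if streak >= trip_after:
            return True
    return False


short_burst_trips = breaker_trips(burst(2), trip_after=5)
long_burst_trips = breaker_trips(burst(8), trip_after=5)
print(short_burst_trips, long_burst_trips)
```

In this model the two-tick burst ends before the breaker's threshold is reached, so only the longer burst exercises it.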

And then we came up with this 24-hour test. The fundamental precursor we needed to get started with applications across Capital One was kind of an SLO, you know, and we're in the middle of our journey of rolling that out across Capital One. And if I don't have an SLO, I don't know where to start. I don't know where to start the injections. So that's what we came up with: the Le Mans test, named after the famous 24-hour car race that Porsche always seems to win. We run 24 to 48 hours of continuous waves to profile the system, to understand where your latency is. The idea is that you see your spike in the 5XX errors, right? Say I saw it between wave three and four. I go back, recalibrate, make changes, and start again. That's kind of what this slide is about.
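Finding "between wave three and four" in profiling data can be done with a simple scan over per-wave 5XX rates. The sketch below is an assumption about how one might do that, not the method used in the talk; the `jump` and `floor` thresholds and the sample rates are invented.

```python
from typing import List, Optional, Tuple


def first_spike(rates: List[float], jump: float = 2.0,
                floor: float = 0.005) -> Optional[Tuple[int, int]]:
    """Return (previous_wave, spiking_wave), 1-indexed, or None if no spike.

    A wave counts as a spike when its 5XX rate is at least `floor` and more
    than `jump` times the previous wave's rate (both thresholds are assumed).
    """
    for i in range(1, len(rates)):
        if rates[i] >= floor and rates[i] > rates[i - 1] * jump:
            return (i, i + 1)
    return None


# Made-up 5XX rates for five consecutive waves of a long profiling run.
wave_rates = [0.001, 0.001, 0.002, 0.030, 0.045]
spike = first_spike(wave_rates)
print(spike)
```

The latency injected in the wave just before the spike is then a reasonable place to start the next, recalibrated run.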

So we do this preliminary assessment. If they have an SLO, it's much easier. If they don't, we have to do the Le Mans test to figure it out. And the first time you run any chaos experiment, whether it's this methodology or the traditional one, it usually doesn't work the way you think it will. Your system needs recalibration. So you run the test, you recalibrate, and you continue that process until you're good.

And recently we built sort of a Jenkins capability. We still use Jenkins. Uh, yeah. Anyway. After every release in QA, we run the last known good test, the one that came out good. That's a new thing we're experimenting with.

So the first pilot application we did this with was probably the most critical application for Capital One, which was our login, right? It's very high volume. I think it's like 10,000 TPS, something like that. And it's critical: if you can't log in, you can't do a whole lot with our applications, right? They were experiencing all kinds of failures. They're very concerned about, you know, floods of people logging in during shopping seasons, things like that. It's very business-critical.

So when we started doing some experimentation with them, we started finding things out. Actually, this is a common finding: engineers don't really conceptualize circuit breakers correctly. People think about it like a network thing, and that's not right. And they put it in the wrong place. That's the first thing we found: the circuit breaker never kicked in because it was in the wrong place. You can't just put them in randomly. They have to be strategically implemented, right?

We learned that we were able to improve our health checks. We found out, and obviously I've already talked about this, that the retries were not calibrated correctly. And every time you make a change, you have to go back and keep running this. But the idea is that after you're done calibrating with this tooling, you have a much healthier response and series of mechanisms to keep you up and running.

So at Capital One, it's funny that a lot of incidents, high-severity incidents, come through latency, right? So it's been interesting to continue to build this. But for me, I know I've built something valuable when engineers come to me wanting to use it, right? Engineers predominantly use this for the retries, the fallbacks, the failovers, those kinds of mechanisms, because they don't have any other way of getting feedback. Connection pooling is also something that has recently been interesting to use this for.

So anyway, we're on track to go from hundreds to millions of chaos experiments with this methodology. We still do the traditional approach to chaos engineering, your Chaos Monkey types of things, where you are injecting 5XX errors to force a failover to the other region. We learn lots of things.

A couple of other interesting aspects to this: this may be a valid way of testing third parties, because all I'm introducing is degraded internet. I'm not violating an SLA. The degraded internet is there; it's happening randomly anyway. It's a great way to test third-party SaaS agents running in your environment, and your third-party dependencies, just to make sure that when latency comes from them, you're calibrated correctly to respond and you have the adaptive capacity to do that. That's an area I plan on exploring more with this.

But yeah — in short, that is all she wrote.

This stuff's hard, guys. You know, there's not a lot of tooling in resilience. There are a lot of practices. And my passion is to bring tooling and capabilities to help humans. Our systems are large, they're complex. It's not getting any better with generative AI. It's getting even larger. We kind of skipped over this complexity issue. I think generative AI obviously has a role somewhere in here to help.

One of the issues I'm struggling with too: I don't know if John Willis is here. John was at my talk in Europe a couple of months ago, and he said, "Aaron, there's value in this data you're generating." I just don't know how to abstract the insights, because right now it's a bunch of 5XX and 2XX charts for regions and AZs. It takes an SRE or a full-stack kind of senior engineer to read it. The idea is to take that data and give people feedback on those mechanisms we talked about, as insights, which can be tough. So.
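As one small step from raw 5XX and 2XX charts towards something easier to read, per-zone status codes can be reduced to a single error rate each. This is only a sketch of that reduction under assumed inputs; the sample tuple layout, the function name, and the data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sample = Tuple[str, str, int]  # (region, availability zone, HTTP status)


def error_rate_by_zone(samples: Iterable[Sample]) -> Dict[Tuple[str, str], float]:
    """Fraction of responses per (region, AZ) that were 5XX."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for region, az, status in samples:
        key = (region, az)
        totals[key] += 1
        if 500 <= status <= 599:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Made-up responses observed during one latency wave.
samples = [
    ("us-east-1", "us-east-1a", 200),
    ("us-east-1", "us-east-1a", 503),
    ("us-east-1", "us-east-1a", 200),
    ("us-east-1", "us-east-1a", 200),
    ("us-west-2", "us-west-2b", 200),
    ("us-west-2", "us-west-2b", 200),
]
zone_rates = error_rate_by_zone(samples)
worst_zone = max(zone_rates, key=zone_rates.get)
print(worst_zone, zone_rates[worst_zone])
```

A summary like this does not replace an engineer reading the charts, but it gives a starting point for saying which zone degraded first under a given wave.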