Chaos and Reliability: A Surprising Friendship in the Enterprise
Chaos Engineering is often characterized as “breaking things in production,” which lends it an air of something only feasible for technologically elite or sophisticated organizations. In practice, it’s been a key element in digital transformation from the ground up for a number of companies ranging from pre-streaming Netflix to those in highly regulated industries like healthcare, telecommunications, and financial services.
Many enterprises are grappling with application modernization at an ever-increasing scale, and leveraging chaos-informed experimentation as a facet of their SRE practices can help them get their arms around the complexity of their systems. Understanding the complexity of distributed systems is foundational and critical to true observability. These practices can lead to clarity in metrics like SLOs, grounded in reality instead of guesswork.
In this talk, Troy Koss (Director of SRE at Capital One) joins Courtney Nash (researcher at Verica) to explore some of the myths of Chaos Engineering, and how he’s put it into practice at multiple enterprise companies to foster a culture focused on reliability. Join them to learn how not chaotic it can be to adopt chaos engineering and how effective it can be at accelerating your SRE journey. You might be surprised to find out how close you already are to getting started...
Full transcript
Courtney Nash
Hi. I'm so excited that everyone from DevOps Enterprise Summit Virtual is joining us today. I'm Courtney Nash. I am a researcher at a company called Verica, and today I'm here joined by Troy, who will introduce himself in a second, to talk about chaos engineering and reliability in the enterprise, two unexpected friends.
And I'll let Troy introduce himself.
Troy Koss
Yeah. Excited to be here. I'm Troy. I'm currently at Capital One, leading one of our SRE organizations here. And we are excited to share with you the experiences that I've personally been through and some of the myths and whatnot that are part of chaos engineering and hopefully get you comfortable with embarking on your own.
Courtney Nash
Cool. So this is when I do the share screen part. Everybody gets to enjoy the awkward presentation stuff.
Okay. That's us. We just talked about ourselves, so we don't have to hang out on that screen for very long. So a couple of myths before we get started because the very name is a bit much. Chaos engineering does not sound like something that most people want to get going with, much less in the enterprise. But really, if you think of it in terms of experimentation, then it becomes a much more approachable thing to be considering. And it's really a practice of experimentation that helps you get your arms around your systems.
Most of us, I would assume, are here because in part, we're building and maintaining and operating very complex systems with high business and production pressures, and no one person can get their arms around how that all works within an organization that is decomposing monolith to microservices and has upstream providers and cloud providers and all kinds of things. It's just too much complexity to be able to hold it all in your head.
And so this process of experimentation tells you ideally what the boundaries are like. Where's the cliff? Where are the cliffs? Multiple cliffs. Are you driving at them 90 miles per hour? Are you slowly wandering towards them? It's really hard to know a lot of the time. And so the goal of this is to just get you comfortable with experimenting within those systems in safe ways.
So myth-wise, that name gives us a couple of things that we wanted to cover. And along the way, Troy and I discovered that we're both giant plant nerds and like flowers and stuff like that, so you get plant metaphors and maybe some goats.
So the first myth of chaos engineering is that it's some kind of mythical advanced capability. Netflix, Amazon, big organizations, you've got to be that scale to be able to do this. And it turns out it's actually the other way around, or that's how they got there. So chaos engineering, the discipline, was born out of Netflix's transformation from the data center to the cloud. And I don't know if people remember this, but there was a period of time when Netflix's availability reputation was not good, and things were falling down a lot, and people were pretty mad they couldn't watch their movies and their other things. And so folks at Netflix started hunting around like, "We've got to do something. How are we going to be able to do this?" And that process of figuring out how to experiment safely on their systems and then expand the size and the scale and the sophistication of those experiments is why Netflix is Netflix now.
And so it's often viewed in the wrong cycle, and so they really started there and got to reliability that they have now by experimenting on their systems. And other companies have started to do this as a part of digital transformation journeys, which is why we're here talking to you all today, including those even in highly regulated industries like healthcare and finance and banking. And sometimes when I say that to people, they don't believe me, which is why Troy is here, because he has, which is super cool. So he's going to talk a little bit about his experience with this first myth, and I'll let him take it away.
Troy Koss
Yeah. Well, you definitely don't have to be a master gardener. And the one industry you didn't mention was telecom. I was at Verizon in the telecommunications industry. And we were growing rapidly and still are growing as a company and modernizing our applications, the monolith to microservice architectures, the data center to cloud journeys.
And one of the things to deal with that complexity was grounding ourselves in a site reliability engineering program. And that was something I was fortunate enough to be a part of kicking off and getting started at Verizon. And we really used it as a practice for us to try and change the way we think and ensure we have the reliability that we were known for. As Verizon ranked the most reliable network, it needed the most reliable culture and most reliable practices to embrace.
And it's really that shift of moving from a reactive state where we're on and everyone's happy to we're off and there's an incident and how many incidents and how fast we get up, and then it goes back to, okay, we're on and working. But really how well are we working? And that's that proactive shift in understanding our systems and getting ahead of that and measuring in a proactive nature, how well are things going?
And as we started embarking on that, if you could kick to the next slide real quick, we noticed that even that was a hard journey, like adopting SLOs and getting going, and we'll talk a little bit about that a little bit later. But chaos engineering became a quick and easy, dare I say, easy way for us to embark on this real system dependency understanding and comprehension.
You probably find out that you don't know all of the edge cases and how your system works and whatnot. So what better way to do that than to run verifications to see how things behave?
When we were in the Kubernetes space and moving to containers and Kubernetes, the thing that we focused on was how can we build a reliable core cluster configuration for teams that meets our standards and needs, and are we actually doing what we think we're doing, and verifying, running tests and verifications and hypotheses on that consistently.
We find that, in some cases, with the pull time for your images, you'd expect a small image to pull faster than a large one. And then when you run verifications to test that, you find that is not the case, and you see that pull times are sporadic and different, and then you find out that there are some network configurations at play, and you're going between different VPCs. So it's such a good learning experience, and it's pretty rich in that nature.
There's also a space to tap into for understanding if you're secure and you're safe and reliable as a part of reliability engineering, right? And looking at images that you're deploying into your cluster and seeing if they're vulnerable, if the vulnerability detection that you have in place is actually working. Deploying known, controlled, vulnerable images into your clusters and seeing that you have the knobs turned right and the thresholds set. And oftentimes we found that we didn't, and that's okay, and that's what we learned, and we got ahead of it, again, in that whole shift of proactive versus reactive nature.
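To make that kind of verification concrete, here is a minimal Python sketch. It is an illustration only, not the tooling used at Verizon: the severity scale, the `ScanResult` type, the `admission_allows` policy, and the image name are all hypothetical stand-ins for whatever scanner and admission control a cluster actually runs. The hypothesis under test is "a known-vulnerable canary image is rejected."

```python
from dataclasses import dataclass

# Hypothetical severity ranking; real scanners define their own scales.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass(frozen=True)
class ScanResult:
    image: str
    worst_severity: str  # worst finding reported by the (assumed) scanner


def admission_allows(scan: ScanResult, block_at: str) -> bool:
    """Return True if the image would be admitted under the given threshold."""
    return SEVERITY_RANK[scan.worst_severity] < SEVERITY_RANK[block_at]


def verify_blocking(known_bad: ScanResult, block_at: str) -> str:
    """Hypothesis: a known, controlled, vulnerable image is rejected."""
    if admission_allows(known_bad, block_at):
        return "FINDING: known-vulnerable image was admitted"
    return "OK: known-vulnerable image was blocked"


# Hypothetical canary image with a known high-severity finding.
canary = ScanResult(image="registry.example/canary-vulnerable:1", worst_severity="high")

print(verify_blocking(canary, block_at="critical"))  # threshold set too loosely
print(verify_blocking(canary, block_at="high"))      # threshold catches the canary
```

The first call models the situation Troy describes, where the threshold was not set the way the team believed. The finding comes from the verification, not from an incident.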
Courtney Nash
And I want to ask you a question, which is, you sort of mentioned you're running Kubernetes at Verizon. What would you say, either individually, team, or organization-wise, the sort of maturity with that particular technology at the time?
Troy Koss
Yeah, I'd say it was probably on the more introductory, novice, intermediate level. Getting to that advanced level in containers and Kubernetes in particular, and orchestrating that, comes with a lot of time and experience and really working on those systems. And it's a skill set that's highly sought after and people are evolving and growing into.
But we were definitely early on, and I think it's understanding how things work and what happens when a node goes down. Does my application scale the way it's supposed to? And when we move to containers, did we cut all the things that are a staple out? And things like that, that you discover in moving. But yeah, I would say to answer your question directly, it was pretty early on in the journey. And I think a lot of places are. Everyone I've seen is pretty early on.
Courtney Nash
Well, it turns out I've seen the same thing. Or at least...
Troy Koss
That's good.
Courtney Nash
... the smaller side. I ran a small survey a few months back of about 50 organizations that we'd had some contact with potentially, or had reached out to, that were using either Kubernetes or Kafka, trying to understand, again, sort of the maturity with which people are dealing with this.
And I was really surprised to find, first of all, that one of the biggest chunks of organizations we talked to were really big 10,000-person, basically enterprise types of organizations across a... We had a pretty good range of roles, but you see folks that you'd expect to see in there.
But the thing that really surprised me was the maturity of experience with these kinds of really complex technologies that people are using in full-scale production systems at 10,000-person companies. I was like, "Wow." So, some people are probably like, "Ah, terrifying." But that's the state of the industry right now. We are trying to grapple with the complexity, and we're using tools that both help us do that and that add to it at the same time.
And so I think that, hopefully, I don't completely destroy this point to where people are sick and tired of it, but you don't have to be a master gardener. Most people who are looking at doing this are really starting at it pretty early on, which I would argue is the better way to go.
And speaking of going, or goats, that's what we get next. That's myth number two, which is that chaos engineering, as I just said, we have pretty complicated, sometimes chaotic systems. Why add more? That seems like a terrible idea. Don't do that. And so this point is really about how chaos engineering isn't about adding chaos, it's just seeing the chaos that's in your systems. It's letting the goats run wild, but in a pen. You can see them, they're in a pen.
So, I'm done with this metaphor. I'm sorry. You all are probably really sick of it. So I will turn it over to Troy to talk about his experience with chaos in systems he worked on.
Troy Koss
Yeah. Definitely. And that notion of we're going to introduce more chaos, we need less chaos. Why do we need more chaos? And it's really like, look at the definition of what we're trying to do. It's like you're preventing the chaos. You're trying to get ahead of chaos. You're trying to get ahead of the unknown.
And one of the methods and means, as a part of an SRE journey that, again, many teams are on in our digital transformations, is adoption of SLOs. Not to be confused with SLAs, but SLOs. And having those as a consistent way to measure our systems and our services, to be able to know the bounds of what we can experiment with or what we can't, and are we meeting the customer expectations?
And we actually didn't even have formal SLOs at the time when I was at Verizon and adopting chaos engineering. And in fact, we used chaos engineering to help us look at SLOs and understand SLOs more closely in running verifications to find out where should things be set to in terms of their SLO. Should it be 200 milliseconds? 250? And that's what everyone tends to, unfortunately, gravitate towards, nice even numbers. But maybe it's 186 or 187.
That's one thing that we discovered in doing that: what are the appropriate ones? Finding SLOs that are, as we were chatting about, grounded in reality, what the SLO needs to be, rather than just guesses and a SWAG, to say the least.
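One way to picture setting a threshold from measurements rather than from a round number is the following Python sketch. The latency samples are invented for illustration, and the nearest-rank percentile is just one reasonable choice of method; nothing in the talk says which statistic was used.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies (ms) observed during a verification run.
latencies_ms = [120, 135, 142, 150, 151, 158, 163, 170, 174, 186]

print(f"observed p90: {percentile(latencies_ms, 90)} ms")

# Compare "nice even number" targets against what was actually observed.
for candidate in (150, 200, 250):
    within = sum(1 for v in latencies_ms if v <= candidate) / len(latencies_ms)
    print(f"target {candidate} ms -> {within:.0%} of requests within target")
```

With real data, the gap between the observed percentile and the nearest round number is what tells you whether 200 or 250 milliseconds is a meaningful target or only a convenient one.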
At Capital One here, we're developing a lot of tooling to have SLOs and put SLOs in place. Agnostic tooling that can handle the ever-changing tool dilemma of what tool are we using today, what flavor of the week for APM, but building tooling so we have SLOs in place so we can start embarking on a lot of these things, like in running chaos verifications and experiments to understand our systems and learn from them. But having consistent measures in place to ensure that myth that you just spoke about, about introducing more chaos, doesn't happen. In fact, that we're within our bounds, we're within a safety net, a responsible place to be. And it's pretty good to start adopting SLOs as a measure to help with that.
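The idea of SLOs acting as a safety net for experiments can be sketched as an error-budget gate. This is a minimal Python illustration under stated assumptions: the function names, the request counts, and the 50% `min_remaining` cutoff are hypothetical and are not a description of Capital One's tooling.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def safe_to_experiment(slo_target, total_requests, failed_requests, min_remaining=0.5):
    """Gate: only run an experiment while enough error budget remains."""
    remaining = error_budget_remaining(slo_target, total_requests, failed_requests)
    return remaining >= min_remaining


# Hypothetical month so far: 99.9% availability SLO over one million requests.
print(safe_to_experiment(0.999, 1_000_000, failed_requests=300))  # budget mostly intact
print(safe_to_experiment(0.999, 1_000_000, failed_requests=800))  # budget nearly spent
```

A gate like this is one way to keep experimentation "within our bounds": when the service has already spent most of its budget, the experiment waits.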
Courtney Nash
Yeah, and I think the phrase you used when we first talked about this was it allows you to experiment safely.
Troy Koss
Yes.
Courtney Nash
Which I really like.
Troy Koss
Yeah.
Courtney Nash
And so to that end, to begin experimenting, we'll stop with the myths and we'll start with what do you really need in place? Because I get asked this a lot, and a lot of the things you've explained to me in your journey, Troy, I feel like really hit these particular points.
So the first one is instrumentation to be able to detect some sort of either degradation or lack thereof in your system. And I think a lot of times this hints to the first myth, which is that people think they need to have really sophisticated sort of observability systems or whatnot. Use what you've got, and even to that point, Troy just said that at Verizon they didn't have SLOs yet. And so you don't have to be at that level, which is maybe a lot of organizations aren't necessarily there yet at having those set. But so just use what you've got. Whatever kind of logging, tracing, what have you, use that and you'll refine that as you go, too. As Troy said, you'll find out what's working there and what's not working there as well.
So the next few prerequisites, we're going to do a little more of this back and forth again. And the second one, we get to have a little bit of a chat and more goats, because really who doesn't love goats? So beyond instrumentation, you need social awareness, which this one particular goat definitely lacks.
It's really important to be explicit with everyone who might be involved in terms of what you're doing, to what end, the expectations and the outcomes. Chaos engineering sounds scary. If you're already bought in on this, let's say (you might not be; Troy wasn't), then you're going to run into some resistance like this. That happens with a lot of changes, but this one sounds particularly nerve-wracking.
And so not telling people is sometimes tempting, right? Like, "I'll just go run some experiments and then it'll be great, and then I'll show people the results." Except it might not be great because you don't know. That's the point.
Troy Koss
Right.
Courtney Nash
So, you really do have to sort of build the beginnings of people willing to go on this journey with you. And it's really easy to talk about that in the abstract and for me to be like, "Yay, do this." But people are often like, "No way. I don't get it. How?" And so I really want to hear from you, Troy, about, I know you had your own personal sort of trepidation about this, but then organizationally, how did you all actually take that first step? What did that look like for you?
Troy Koss
Yeah, definitely. And I echo the sentiment that you're giving off, which is you don't want to be that goat that you're showing on the screen there. You don't want to be that bad one that's just nuking things and talking about how great it is. But yeah, there's a few things to keep in mind.
One is you don't have to start in production. Understand that that's an evolution you get to. You don't start there and start just doing things there, running your verifications there and your hypotheses there. You want to keep your scope small. You want to be able to keep it in a limited fashion. As I mentioned, we focused on the Kubernetes platform itself and the underlying infrastructure and how that orchestration of the clusters works, and that was a small enough scope where we had dependent parties involved, but it was smaller, and we were able to articulate the blast radius and contain the blast radius as well.
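Containing the blast radius can be as simple as constraining which targets an experiment is allowed to touch. The Python sketch below assumes a hypothetical inventory that maps instance names to environments; `pick_targets`, the instance names, and the 25% cap are illustrative choices, not anything described in the talk.

```python
import random


def pick_targets(instances, environment, max_fraction, seed=None):
    """Select a limited random subset of instances from a single environment."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    candidates = sorted(name for name, env in instances.items() if env == environment)
    if not candidates:
        return []
    # Always allow at least one target, never more than the configured fraction.
    limit = max(1, int(len(candidates) * max_fraction))
    return random.Random(seed).sample(candidates, limit)


# Hypothetical inventory: eight staging nodes and four production nodes.
instances = {f"stg-{i}": "staging" for i in range(1, 9)}
instances.update({f"prod-{i}": "production" for i in range(1, 5)})

targets = pick_targets(instances, environment="staging", max_fraction=0.25, seed=7)
print(targets)
```

Because the environment and the fraction are explicit arguments, the scope of the experiment is something you can state to other teams before anything runs.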
Run simulations on testing some of your hypotheses out with fake data and other fake systems just to prove it out and to understand it. And one thing that I do want to hit on though is while I say keep things small and contained, you definitely want to hit things that are effective, and Courtney's going to talk about that in a second here, but make sure that the work that you are doing is definitely something that's meaningful and you're working on systems that definitely matter.
I think, to get that buy-in and to get that value, the low-hanging fruit is fun and easy to get, and that's nice, but make sure that you definitely address some things that matter most to the enterprise, things that have that value tied to them.
Courtney Nash
Yeah. That's a perfect segue. Thanks, sir.
Troy Koss
No problem.
Courtney Nash
So I wanted to take a minute to talk about hypotheses because it's easy to throw these words around. We've done this. We talk about verifications or hypotheses or all these words. But there's some science to this, and the notion of experimentation is grounded in that longstanding tradition. And so in my opinion, the really key part is what that hypothesis is, right? So you have a control state, you have something that you're going to, some perturbation you're going to introduce, and you have a hypothesis about what's going to happen.
If your hypothesis is that broken things are going to break, then that doesn't really help you. And also, the point of this is to understand your systems better. If you already understand that about your system, then you just spent a bunch of people's time confirming something you already knew. I understand sometimes you want to do that so you can get buy-in or budget or whatever to fix it. I totally can relate to that. You'll get there, but I think as Troy was saying, you'll get there by showing people things they didn't know about how their systems work.
And so those hypotheses should be things that uphold expectations, because then when those aren't right, then it's like the light bulb goes off for people, right? So if you do have SLOs, you might have a hypothesis or statement along the lines of this service will meet XYZ SLO even under conditions of high latency in the data layer. Whatever. And that should be contextual to your business, to your customers. That should make sense, right?
And then if it does, great, and if it doesn't, then you're now really... And that's why if you have SLOs, it's a great space to play because SLOs are directly about those kinds of business-critical outcomes, right? And so I think that's a really nice alignment is if you're playing around in SLO space, then you're really doing something, like Troy said, that's actually meaningful to the business. So have really well-formed hypotheses that are meaningful and contextually relevant.
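The structure described here (a control state, a perturbation, and a hypothesis tied to an SLO) can be sketched in a few lines of Python. Everything below is a stand-in: `simulated_service` fakes the measurements a real experiment would collect, and the 200 ms objective, the 80% ratio, and the injected delays are invented values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str
    measure: Callable[[float], List[float]]  # injected delay (ms) -> request latencies (ms)
    slo_ms: float             # latency objective per request
    slo_ratio: float          # fraction of requests that must meet slo_ms
    injected_delay_ms: float  # the perturbation

    def _meets_slo(self, latencies: List[float]) -> bool:
        good = sum(1 for v in latencies if v <= self.slo_ms)
        return good / len(latencies) >= self.slo_ratio

    def run(self) -> str:
        # Control: confirm steady state before perturbing anything.
        if not self._meets_slo(self.measure(0.0)):
            return "ABORTED: steady state already violates the SLO"
        # Perturbation: take the same measurement with the fault injected.
        if self._meets_slo(self.measure(self.injected_delay_ms)):
            return "HYPOTHESIS HELD: " + self.hypothesis
        return "HYPOTHESIS REFUTED: " + self.hypothesis


def simulated_service(data_layer_delay_ms: float) -> List[float]:
    """Stand-in for real measurement: each request makes two data-layer calls."""
    base = [40.0, 55.0, 60.0, 72.0, 90.0]
    return [b + 2 * data_layer_delay_ms for b in base]


statement = "80% of requests finish within 200 ms under data-layer latency"
mild = Experiment(statement, simulated_service, 200.0, 0.8, injected_delay_ms=50.0)
severe = Experiment(statement, simulated_service, 200.0, 0.8, injected_delay_ms=75.0)

print(mild.run())
print(severe.run())
```

The steady-state check comes first on purpose: if the system is already out of bounds, the experiment would only confirm that broken things break.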
So, I want to go back to the don't break, fix broken things bit, and finding things you didn't know about your system. So, Troy has a good story about that, so I'm going to pass it back over to him now.
Troy Koss
Yeah, definitely. And just to reiterate one more time, like for the third time's the charm, is you don't need the SLOs to get started. In fact, like I said, chaos engineering can help you get there. But they're definitely a good enabler, to Courtney's point.
And another point that you made earlier, Courtney, was you really don't need a lot. You have metrics, you have logs, you have alerts in place. Teams are trying to adopt some sense of observability for their systems, respectively. But sometimes when you run the hypothesis, like I mentioned earlier, about a vulnerable image and you're like, "Well, no, we know we stop vulnerable images." And then you put a vulnerable image out and it actually deploys and you're like, "Actually, we didn't." So you find these things and there's true value, especially in the security domain, too.
But one thing that also can come as a byproduct of it is you run your hypothesis. You think that you have the necessary alerting and safeguards in place and instrumentation that you've always had. Like, we have our alert policy and it will go off when bad things happen to our system. And when you start running chaos experiments and verifications, you learn soon that sometimes your alerts weren't set correctly and that they don't go off like you think that they go off when things happen as you're running these verifications. And it's a good thing.
It's a good thing to find out that those things are out of place in the controlled environment where you're running these experiments, rather than when you actually have a production outage and you don't know you have a production outage and the alerts don't go off, and then your MTTD and your MTTR become chaotic, and then everyone's scrambling to get a resolution.
Finding them out in these controlled environments is a super great place to be. You get two takeaways. A, you learn about how your system actually responds during that verification, whether it's, like you mentioned, injecting latency into your requests, or taking down nodes and seeing how things respond and how long it takes for applications to redeploy, et cetera. But you find that, and then you also find out that your alerts weren't good. So your observability as a byproduct becomes enhanced and enriched.
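Checking that an alert actually fires during an experiment can itself be treated as a hypothesis. The Python sketch below is a simplified model under assumptions: `AlertRule`, the error-rate samples, and the thresholds are hypothetical, and real alerting systems evaluate rules in their own ways. It reports whether a rule would have fired after the fault began and how long detection took.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRule:
    name: str
    threshold: float  # fire when the metric exceeds this value...
    for_samples: int  # ...for this many consecutive samples


def first_firing_index(rule: AlertRule, metric: List[float]) -> Optional[int]:
    """Index of the sample at which the rule would fire, or None if it never does."""
    streak = 0
    for i, value in enumerate(metric):
        streak = streak + 1 if value > rule.threshold else 0
        if streak >= rule.for_samples:
            return i
    return None


def verify_alert(rule: AlertRule, metric: List[float], fault_start: int, sample_seconds: int) -> str:
    """Report whether the alert fired after the fault began, and the time to detect."""
    fired_at = first_firing_index(rule, metric[fault_start:])
    if fired_at is None:
        return f"FINDING: '{rule.name}' never fired during the fault"
    return f"OK: '{rule.name}' fired {fired_at * sample_seconds}s after fault start"


# Hypothetical error rate sampled every 10 seconds; the fault begins at sample 2.
error_rate = [0.01, 0.01, 0.08, 0.09, 0.12, 0.11, 0.10]

tuned = AlertRule("error-rate-high", threshold=0.05, for_samples=3)
loose = AlertRule("error-rate-high", threshold=0.15, for_samples=3)

print(verify_alert(tuned, error_rate, fault_start=2, sample_seconds=10))
print(verify_alert(loose, error_rate, fault_start=2, sample_seconds=10))
```

The second rule models the case Troy describes: the alert everyone assumed would go off stays silent, and the experiment surfaces that before a real outage does.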
So it kind of all ties in to that whole culture of reevaluating and constantly being able to assess your system. And it's all, as I mentioned at the very beginning, that shift from reactive to proactive and being able to get ahead of when that thing happens, when that event happens that we are so fearful of. But yeah. I echo your remarks.
Courtney Nash
So this is that reactive/proactive dynamic that you're talking about, that shift. The last prerequisite for chaos engineering is being able to... Ideally, you're proactive, but at some point you've got to react to the experiments and the results of those. And this is the flip side of the coin of buy-in. The more work you put in up front on the buy-in front, the more likely you're going to have alignment to actually respond.
And this may sound obvious, but this can be where chaos engineering efforts die on the vine, because teams are busy. We have a lot of work to do. The thing you did might actually have impacted some other team or some downstream thing, and now you've got to kind of get those folks on board. So I feel like it's a really almost obvious but critical prerequisite. It's a big cultural change, which most of you all here should be pretty familiar with trying to make happen.
I think the thing that's really great about this one is, like Troy said, if you can start small, limit the scope and the blast radius and everything, you get a good virtuous cycle going, right? Where people see the benefit of the experimentation. They put the changes in. Ultimately, you move on to bigger and thornier things, is really how that works.
But at that point, hopefully, you've basically put that cultural infrastructure in place where everyone is actually excited about this stuff instead of terrified by it, and sees the benefit of that experimentation. So I like to refer to this as your, I said, cultural infrastructure. We like to talk about other kinds of infrastructure a lot. But this one is also incredibly important. And so nurture that and don't forget that other people are going to have to get involved in the implementation side of it and be prepared to sort of help them do that.
And that's our prerequisites and our myths, and so I will hand it over to Troy to close things out with some final thoughts on chaos engineering and enterprise SRE.
Troy Koss
Yeah, definitely. And the whole cultural piece that you just hit on is, again, that is what SRE is. It's, yes, there's practices and there's toil and terms and buzzwords and all the good things that come with it. But it's a culture, it's an approach. It's like, how do we address these problems in a consistent way?
And all of the things we just discussed about chaos engineering and the kinds of experiments you can run and all the different sorts of things, it's really just part of it. It's really part of SRE in my mind, at least how I've defined it, in that you have to understand your systems. You have observability, resilient architecture, and all these patterns and things that we're following as our critical intents as part of our SRE program here at Capital One, and chaos engineering is a part of that. It's one of those intents.
There are teams that are going to be of different levels of maturity as you embark on your SRE journey, and you can think of it as like a menu. I always joke, probably because I like food, but there's a large menu of items that you can dive into on your SRE journey and pick what actually works for the teams, and chaos engineering should be one of those because there's different levels of maturity. There isn't a particular way to do it.
But if you set out the items on the menu, start early, and don't wait to understand your systems. You ultimately want to end up providing a better experience for your customers and a more reliable experience, which is most important.
Courtney Nash
Cool. Thank you so much for joining me today, Troy. And to everyone for joining us and for putting up with goat jokes, and we hope you...
Troy Koss
And plants.
Courtney Nash
And plants. Not jokes, just plants are great. Goats are jokes. And that's it for us, and we hope you enjoy the rest of the conference. Thanks.
Troy Koss
Thank you.