How Many Nines Are Enough?

London 2020

In this talk, Gremlin CEO Kolton Andrus shares insights from years at Amazon, Netflix, and now working with a wide array of customers across various disciplines and industries.

He'll describe what each level of availability looks like, the challenges faced at each stage, and the trade-offs required to achieve the next nine of uptime.

Full transcript

Kolton Andrus

Hello, my name is Kolton Andrus. I'm honored to be able to speak at the DevOps Enterprise Summit, London. Today, I'm going to talk to you about how many nines of availability is the right number for your team, your organization. I'd like to begin by giving public examples of why this is important. With the shift brought on by COVID, online systems have become paramount over the last few months. We've seen an increase in load and stress on these systems, and as a result, we've seen more outages related to them. Whether it's Zoom, our ability to game online, or our ability to make online trades, our opportunity to interact with these businesses and our systems online is paramount, now more so than ever.

Why are we seeing more of this failure? Well, the answer is that a lot of the design decisions we've made over the last 10 to 15 years come at a cost. We've prioritized decoupling our systems, we've prioritized speed of innovation and enabling teams to move quickly. But with this has come a cost in complexity. Gone is the day when an architect could hold the entire system in their head. Now there are many moving pieces, and they're changing often, if not daily, hourly.

What this results in is what I lovingly refer to as the microservice Death Star. These are examples from amazon.com in 2009 and Netflix in 2012. And just looking at this picture illustrates the point. It's hard to comprehend it. It's hard to make sense of it. There's an inherent complexity in our systems now. This is the chaos that we have to deal with, the chaos that we're here to tame.

The old-world approach to testing is no longer sufficient. We used to focus primarily on our code, unit testing and integration testing, what we'd written. But in modern distributed systems, the dependencies that we take and their health have a big impact on whether our system behaves correctly. The configuration needed to run a system within production is also paramount: the timeouts, the thread pools, the security groups, the auto-scaling. Our infrastructure is more ephemeral and can come and go at any point. And our people and our processes are critical to being able to operate our systems well. Many engineers are now finding themselves as SREs or on call and may not have had the experience and the opportunity to practice operating these systems at scale.

What we're seeing is a trade-off of this ability to move quickly that comes at the cost of reliability. And what we would really want is to be able to move quickly and maintain that reliability. To do that, we must shift the curve. We can't simply take the same approach that we have before. It requires a new way of thinking about things, a new approach.

From my time as an engineer on call at Amazon and Netflix, and my time building and operating systems such as these, the best answer that I've found is chaos engineering.

Now, chaos engineering means different things to different people. Some people believe it means we're going to randomly cause failures and see how the system and the people respond. And while there's truth to that, I think the definition that's best is around thoughtful, controlled experiments designed to reveal weakness in our system. Akin to the scientific method, we're going to go out and test a hypothesis, and because we're here to prevent outages and to build reliability, we're going to be very thoughtful about how to cause those failures so we don't inadvertently make things worse.

Now, some people feel like this is a little counterintuitive, and when I'm home for the holidays or hanging out with my family, the vaccine analogy has been one of the best ways to explain this. Now, we're going to inject a little bit of harm into our systems, same way that we might inject a little bit of harm into our bodies. But this is so our systems and the people that operate them have an opportunity to respond, to learn, and to build an immunity to those types of failures. Once we've seen a failure and we've mastered it, we'll be much better prepared to handle the next failure.

As I mentioned, we never want to cause a failure, we never want to cause an outage or customer pain as part of this, and so we're doing this by being thoughtful of what is the blast radius. What is the potential impact or side effect from an experiment that we're going to run? We always want to start with the smallest experiment that will teach us something and slowly grow it as we build trust and confidence in our systems. We might begin in development or staging by testing it as a single host or three hosts. And if we find a critical error, great. We've mitigated risk, and we're able to fix it much earlier in the process. But if it behaves how we expect, we continue to scale that till we've tested all of our staging environment. When we're comfortable there, we're going to move into production, but we're going to reset that blast radius back down to the smallest piece, a single device, a single user, and then grow it again.
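The ramp-up described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ramp_blast_radius`, the stage sizes, and the `run_experiment` callback are hypothetical names, not part of any particular chaos engineering tool. The idea is to run the same experiment at growing sizes and stop at the first one that misbehaves.

```python
from typing import Callable, List


def ramp_blast_radius(stages: List[int], run_experiment: Callable[[int], bool]) -> int:
    """Run an experiment at each stage size in order and stop at the first failure.

    `stages` lists how many hosts (or users, or devices) to include at each step,
    smallest first. `run_experiment` returns True when the system behaved as
    expected at that size. Returns the largest size that passed, or 0 if even
    the smallest stage failed.
    """
    largest_passed = 0
    for size in stages:
        if not run_experiment(size):
            break  # found a weakness: halt here and fix it before growing further
        largest_passed = size
    return largest_passed


# Hypothetical staging run: pretend the system copes with up to 10 hosts.
passed = ramp_blast_radius([1, 3, 10, 50], lambda hosts: hosts <= 10)
print(f"largest blast radius that passed: {passed} hosts")
```

When moving from staging to production, the same function would be called again with the stage list reset to its smallest value, which mirrors resetting the blast radius before growing it a second time.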

It's important that we're testing both at small scale and the large scale. At the small scale, we might be catching a null pointer exception or a failure condition we haven't tested for. At the large scale, we're testing how our system handles duress. Do we shed load as appropriate? Do we back off of downstream dependencies? Do we have enough capacity to handle that influx when things begin to slow down or degrade?
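One common way to "back off" a struggling downstream dependency is retrying with exponential backoff and random jitter. The talk does not say which technique was used, so the sketch below is an assumption, and `call_with_backoff` and its parameters are hypothetical names chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a failing call, waiting longer (with random jitter) after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The cap doubles each time (0.1s, 0.2s, 0.4s, ...) up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Sleep a random amount up to the cap so that many clients do not
            # all retry at the same instant against a degraded dependency.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so that a failure test can substitute a fake and check the retry behavior without actually waiting.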

So I want to step back and give a little bit of context. As someone that's served as a call leader or an incident commander for 10 years, I have a couple of tips and tricks about how to manage an incident and what's important when an incident occurs.

First is let's talk vernacular. Let's define some terms.

First of all, we need a good metric to track customer health and behavior.

At Netflix, we used stream starts per minute. At Amazon, we used orders per minute. This is some measure of whether customers can use our system and whether they are happy.

Now, what happens when bad things occur? Well, we can break an incident into a couple of pieces. How long does it take us to understand that there's a serious impact? That's the time to detection. And this usually takes a couple of moments as we're comparing week-over-week data, or we're waiting for a threshold to be hit.

Once we've detected it, we're going to page and alert the people to respond and resolve this incident. This is the time to engagement. How long does it take for those people from the time that they're paged to get on the call and start working on the issue at hand? In my experience, this can range anywhere from a couple of minutes to 10 or 15 if someone is caught in an odd situation, is out and about. One time I was call leader driving home on my motorcycle and got paged and had to pull over on the side of the freeway and manage that incident from the shoulder. That time to engagement was a little longer than normal, but faster than had I continued my journey home and then joined.

And then once we're on the call and addressing it, how quickly can we resolve it? That's that time to resolution, and really it's time to mitigation. We're in triage mode. We may not fully correct the issue, but we want to restore service to our customers and ensure that the system is operating as well as possible.

And then once we've fully resolved that failure, how long does it take until another failure occurs? That's the time between failures. Each of these is a metric that's good to know and good to measure. Because if we want to improve our reactive approach, if we want to improve how quickly we respond and fix an issue, each of these will need to be optimized.
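The phases above can be computed from four timestamps per incident. The following is a minimal sketch, assuming each incident record carries those timestamps; the `Incident` class and its field names are hypothetical, and the example times are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    impact_start: datetime  # when customers first felt the problem
    detected: datetime      # when an alert fired or a threshold was crossed
    engaged: datetime       # when the paged responders were on the call
    mitigated: datetime     # when service was restored (not necessarily root-caused)

    def time_to_detection(self) -> timedelta:
        return self.detected - self.impact_start

    def time_to_engagement(self) -> timedelta:
        return self.engaged - self.detected

    def time_to_mitigation(self) -> timedelta:
        return self.mitigated - self.engaged

    def total_impact(self) -> timedelta:
        return self.mitigated - self.impact_start


def time_between_failures(previous: Incident, current: Incident) -> timedelta:
    """Healthy time between the end of one incident and the start of the next."""
    return current.impact_start - previous.mitigated


# Made-up example: impact begins at 02:00, the alert fires 4 minutes later,
# responders are on the call 8 minutes after that, and mitigation takes 28 more.
start = datetime(2020, 6, 1, 2, 0)
incident = Incident(
    impact_start=start,
    detected=start + timedelta(minutes=4),
    engaged=start + timedelta(minutes=12),
    mitigated=start + timedelta(minutes=40),
)
print(incident.time_to_detection(), incident.time_to_engagement(),
      incident.time_to_mitigation(), incident.total_impact())
```

Tracking the phases separately shows where the minutes go: a slow detection calls for better alerting, while a slow engagement calls for changes to paging and on-call practice.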

So as I mentioned, we want to think about what are these metrics that really capture the customer experience, the value that our platform is providing, that give us a good signal of what is healthy? And we want to be able to tune our thresholds and our alerts and our SLOs so that those meet those goals.

Now, maybe you're on call for the first time, maybe you've been on call for 10 years, but I have a couple of tips and tricks for when you're on the call and on that conference bridge and managing an incident.

First of all, I'm a big believer that you need one person with the authority to make decisions. This is a key part of being a call leader, the judgment to, in the face of a lack of information, piece together your best guess and make a good call. This is often debatable, but in the heat of the moment when you're dealing with incomplete information, you need someone familiar with the context of the system and how past outages have gone to be able to guide the actions being taken.

Along those lines, I'm a believer that we don't want to be changing many things at the same time. We want to coordinate our efforts across our teams because if we make a change and it fixes things, we want to know what we did. And if we're making three changes in parallel, we may be unsure which one actually resolved the issue. It's also good to ensure the team is acting together, and we don't want an individual off on their own making changes, potentially improving things, but potentially making them worse without the knowledge of the group as a whole.

Now, when I first join these calls, for the first five minutes, I'm giving a status update every 30 seconds or every time two or three people join.

It's important for them to be able to know what's going on, what actions have we taken, and what would we like them to do.

Typically, when a service owner or someone is joining an incident call, the ask is, go look at your dashboards, go look at your service, and try to determine if you're participating or a part of this or if we can exclude you. And it's important to know who's not involved because these are people that have been woken up in the middle of the night or people that have been disrupted from their day jobs. And if they're not playing a part and we don't need them to do work, then we want to be able to excuse them to go back to sleep, to go back to their jobs so that they can focus on other things at hand.

So let's talk about what is the right number of nines for you. The short answer is not everyone is Netflix, and there's a cost-benefit analysis to how much we invest and what we're able to achieve. So I want to provide a little context about what does the world look like in each of these high-level number of nines, two nines, three nines, four nines. What could we be doing to improve and get better, and what are some of the costs of that improvement?
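For reference, the downtime budget behind each level is simple arithmetic: N nines of availability leaves a fraction of 10^-N of the year for outages. A small sketch, assuming a 365-day year (the function names are illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for `nines` nines of availability (2 means 99%)."""
    unavailable_fraction = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailable_fraction


def describe(minutes: float) -> str:
    """Format a duration in the largest unit that keeps the number readable."""
    if minutes >= 24 * 60:
        return f"{minutes / (24 * 60):.2f} days"
    if minutes >= 60:
        return f"{minutes / 60:.2f} hours"
    return f"{minutes:.2f} minutes"


for nines in range(2, 6):
    print(f"{nines} nines: {describe(downtime_minutes_per_year(nines))} per year")
```

This works out to roughly 3.65 days at two nines, 8.76 hours at three, about 53 minutes at four, and a little over 5 minutes at five, so each additional nine cuts the budget by a factor of ten.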

So we'll start in the two nines world. In this world, we're having three to four days of outage over the course of a year. This is the floor, in my opinion. Things are failing, they're failing often. Our customers likely have a perception that our service is broken or has some issues. We probably haven't invested in having the company focus on this effort or even a team focus on this effort. So likely this is one person's life, one person that holds a lot of the tribal knowledge that is responsible when things go wrong and steps up to help fix things.

For those of you that are familiar with "The Phoenix Project," and if not, I'd highly recommend it, this is Brent. Brent is the bottleneck here. He's the one who is reached out to when things go wrong. He's the one that is burdened with keeping the system up and alive. And it can be taxing and exhausting for a single person to focus on this.

In the two nines world, we may not yet have monitoring and alerting. We may have very basic logging. We probably have some unit and integration tests, but we haven't yet gotten into the world of a good deployment pipeline or more sophisticated tests. We probably don't have an incident management process. It's probably whatever folks feel is best to handle the issues at hand. And we may not have designed or built a lot of redundancy into our system. We may be running with a very bare-bones approach. And so the good news is these things are easy steps that we can take to improve. We need monitoring and alerting. The analogy I draw is to flying an aircraft without an instrument control panel. There's no way that we would do that. We need to understand how things are acting up, how things are responding, and how things are operating. Obviously, there's a lot of good advice behind build and deploy pipelines, making it easy to follow a set process, to iterate often and to ship our code often. And this will help us to improve the quality and catch issues earlier in the process.

It's important for us to have an incident management program. Now, this may be very lightweight, but what is the process when things go wrong? And how do we go about addressing it? Who's in charge? And it's important for us to have redundant capacity, whether it's at the zone level or at the host level. If something goes wrong, and something will always go wrong, what is our backup plan?

Now, this is where I'm a believer that fire drills are an important piece to help train teams and prepare them. My on-call training in my career has sadly amounted to, "Here's a pager. You're smart. Good luck. You'll figure it out." And I think that we, as an industry, can do much better about training this next generation of SREs and operations folks to know how to handle these incidents. There's a reason that many of us have grown up running fire drills, and that's because when a fire breaks out, we need people to not panic. We want them to respond safely and calmly and thoughtfully. The same applies when we have a major outage in our system. It's a stressful event. You might have a VP or a C-level on the call. You know that customers are being impacted, and it's all hands on deck to fix it as quickly as possible. You may not know what action to take or exactly what's happened, and that puts a lot of stress on individuals. And that opportunity to practice and prepare ahead of time allows us to role-play, to ask questions, and to build some comfort about an uncomfortable situation.

Now, from the chaos engineering world, there are a few things that we can take and apply here to make our lives better. We want to go understand what happens once our alerts and our monitoring have been set up. Have we tuned them correctly? Is there a lot of signal, or is there a lot of noise? And as someone who's operated these systems, if there are 300 alerts and they all make noise constantly, your engineers will quickly tune them out and stop listening. And so we need the smallest set of alerts and monitors that provide us insight and value without taxing us.

And we need an opportunity to practice. We need to let people pretend that there's a real outage: they join a call, they get engaged, they look at their dashboards, they log into their hosts. There are a lot of little details there, things that could go wrong and delay an incident. And by going through them in advance, we can really prepare. And this type of investment doesn't take that much time. This could be a monthly exercise with our teams. It could be a quarterly exercise where we get the whole company together. But with just a few hours of investment and a little bit of tooling, we can really improve from that two nines world into a three nines world.

And so what does the world look like when we arrive at three nines? Well, failure is happening less often. We've moved from days of failure to hours of failure over the course of a year. In this case, when failure does occur, it's likely our customers are annoyed. And if they have an option, they may choose to go to one of our competitors or another service if they're experiencing failure. But overall, the system feels like it's working correctly, and most of the time, people are able to have a good experience.

We've moved from the world of this being one person's problem to where the critical services are now part and parcel of ensuring the system's operating well. We likely have a set of tier one services that are the pieces we know cannot fail and a set of tier two or other services where failure is a little bit more tolerable. And at this point, we're beginning to capture the learnings. We have an opportunity to review our incidents, to talk about what we can do better, and to begin to share those learnings amongst our teams.

So this is really where we've arrived at an SRE team or operations teams, and it's no longer just a burden of Brent's to figure it out and fix it.

So in this world, we likely have logging and monitoring, but it still might be noisy and scattered. And so this is the opportunity for us to come in and tune those thresholds, tune those alerts, and make sure they're actionable, and that when they're producing noise, it's something we can act upon.

We're building and deploying more often, and now we're seeing code changes coming across teams and more frequently. So this is where it makes sense to start layering in things like canary deploys and failure testing, more in-depth insight in our pipeline that tells us if we've addressed past issues that could have occurred, and if our system is really ready to be deployed and run in production.

In three nines world, incident reviews might be happening, but we may not be doing a great job capturing those or sharing them. And this is really what helps inform a company's best practices and what the best approach is to improve their overall system. And so we can now begin to share those and teach the rest of the company what we're learning when things go wrong, because failure is really an opportunity to learn and improve.

And in this world, we might have redundancy, but it may be at the zone level. Maybe we haven't moved into regional redundancy. And there's an important aspect that comes with this. Once we begin to run in multiple regions, we have, one, a safety mechanism, and if things go wrong, we can shift traffic to another region. But with that safety mechanism comes a responsibility to test it and ensure that it behaves the way we expect. If we don't, we have what Adrian Cockcroft calls availability theater. We think that we're protected, but in reality, we aren't. And so it's important to be able to exercise these types of failure modes often.

At Netflix, we performed region evacuations every other week, and in the beginning, they were slow, and they were painful, and they took a lot of time and a lot of teams' efforts to get together. But each week, we got better and better, to the point that it became a five-minute automated process where only the core SRE teams needed to be involved. The other reason it's important to often exercise these is that the system changes. There might be a new scaling boundary in one of the regions that a service will hit if we fail traffic over to it. We may find that our proxy code or the way in which we're shifting traffic has a bug in it or it's changed. And so by testing it on a regular basis, when we need it to save us, we can have the confidence that it will, instead of resulting in two outages at the same time.

The other piece in distributed systems that's key, and if there's one thing you take away from this talk, it's this: we have to go test the failures of our dependencies. We're building distributed systems, and the adage goes, someone else's computer out of my control can cause a failure in mine. And the outages that follow this pattern are plentiful. And so we want to go out, and as a service, we want to say, "What are my critical dependencies? What are my non-critical dependencies?" And we want to go through and carefully fail each one of those. If I'm unable to load data from S3, can I continue operating my service? If I'm unable to reach my database or my internal identity service, do I have a fallback, a cookie or another mechanism that I can operate with? And by going through and thinking through these scenarios and testing them, number one, we can sift out what's critical and what's non-critical. And for the non-critical failures, we can ensure that we gracefully degrade and that they don't become a customer-facing issue. And for those that are critical, we can be aware of the rough edges and where we need to invest to improve those.
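A dependency failure test of this kind can be as small as swapping in a client that always fails and checking that the caller degrades instead of erroring. The sketch below is illustrative only: `load_recommendations`, `DependencyUnavailable`, and the default list are hypothetical and stand in for whatever non-critical dependency a service has.

```python
from typing import Callable, List


class DependencyUnavailable(Exception):
    """Raised when a downstream dependency cannot be reached."""


# Generic content to serve when personalization is unavailable (made-up values).
DEFAULT_ITEMS: List[str] = ["popular-1", "popular-2"]


def load_recommendations(user_id: str, fetch: Callable[[str], List[str]]) -> List[str]:
    """Treat the recommendation service as non-critical: fall back, don't fail."""
    try:
        return fetch(user_id)
    except DependencyUnavailable:
        return list(DEFAULT_ITEMS)


def healthy_fetch(user_id: str) -> List[str]:
    return [f"for-{user_id}"]


def broken_fetch(user_id: str) -> List[str]:
    # Injected failure: simulates the dependency being unreachable.
    raise DependencyUnavailable("injected failure")


print(load_recommendations("u1", healthy_fetch))  # personalized path
print(load_recommendations("u1", broken_fetch))   # degraded but still serving
```

For a critical dependency there may be no sensible fallback, in which case the same injected failure at least documents how the service fails and where investment is needed.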

So great. This helps us get into the four nines world. It's a sweet place. I've lived there for a time. It's less stressful. You're getting paged less often, and you're feeling better about the quality of your system.

But in this world, what we're seeing is that a lot of the low-hanging fruit has been picked. And so when a failure occurs, it might be a nasty failure. It's often two or three different things going wrong at the same time. It takes more time to diagnose and to understand and can be a little trickier in general.

In this world, customers aren't noticing failure. We've moved from hours of failure to less than an hour of failure. And hopefully, when these failures are occurring, they're brief, and they're for only moments at a time. And if the system is able to self-heal and customers don't notice, great. We always want to ensure that customers have a great experience and that we're winning those moments of truth.

In this world, it's more than just the critical teams that are firefighting. This is learnings and best practices we've shared across the company. More and more teams have had an opportunity to prepare, to practice, to understand what occurs. And by preparing and practicing up front, they're getting paged less. So ultimately, they're spending less time related to operations work because they're able to do it upfront and amortize that cost, as opposed to paying for it when things go wrong.

And as a quick aside, an outage is more expensive than just the revenue you lose and the customers that are unhappy during the time of that outage. There are often dozens of engineers involved, and lots of work after the outage has happened to understand the contributing factors and all of the things that influenced that outage. We often need to meet as a group and discuss how we can improve, and from that will come a set of action items that we need to go fix to ensure that these failures won't occur again. And all of that becomes a very time-intensive and expensive process. But like an iceberg, most of that cost sits beneath the water, and we may not think about it often as we're prioritizing features and customer-facing work. And so prudence says that by investing in this upfront, we're actually saving ourselves a lot of time and pain in addition to the revenue loss and brand impact.

This is really what we're striving for, a culture of resilience, a culture of learning and sharing, of practicing, of acknowledging that failure occurs and helping ourselves through it so that we realize that this is a team effort, that we can help each other improve and, as a result, build higher quality software and have a better customer experience.

So in the four nines world, observability is ubiquitous. We have everything everywhere we need. We know what's occurring. And now we can start to layer in things like anomaly detection, better analysis, predictive analytics, so that we can see failures coming more easily and earlier. And when they occur, we can act on them more quickly.

We're doing unit tests, integration tests, performance tests, and failure tests as part of our pipeline, but how we deploy and roll out changes, whether it's software or data, becomes important. I've seen systems that do regional deployments for their software, but global deployments for their data plane, and have been bitten by seeing the result of that failure hit worldwide instead of isolated to a region. And so by being thoughtful about how we deploy things and how we roll them out slowly, we're able to learn more about our systems and provide a better experience. And that canary experience, to me, is key. We can do artificial testing and synthetic testing all we want, but there's nothing, in my opinion, that's a real replacement for the diversity of customer traffic in production, the load that it provides, and the edge cases that we might hit.

Now, in this world, we're doing game days, we're doing blameless postmortems, we're doing trainings. But one of the dangers, ironically, is that we become complacent. When failure is occurring less and less often, we may take our eye off the ball or begin to focus on other things. And this can have a pendulum effect, where we feel really good and we regress back into an earlier stage. And so by being vigilant, by practicing, and by making sure we're thinking and talking about these issues that can occur, we can maintain that high level of availability and ensure we're in a strong position.

And being region-redundant and running an active-active architecture may no longer be sufficient in this world. We may need multiple cloud providers or multiple infrastructure providers so that we can mitigate those black swan events. Those events might be rare, but they could be very impactful to our business, and if one cloud provider has a major outage, we want to be able to continue operating on another.

And so this is where we're really stressing things at a deeper level. What happens when there's packet losses between our data centers? How do we handle these peak traffic spikes? What happens when something key, like DNS, fails? We're no longer able to route customers or traffic within our systems. And so in this world, again, we're testing more frequently, but it's just part of the process. We don't need to invest a vast period of time. If a team is spending an hour a week or an hour every other week thinking about this, practicing and preparing, we can really save a lot of pain and time spent in outages and really improve that customer experience.

And so this is the world we want to get to, and candidly, I haven't worked for a software company that's hit five nines yet, and I think the bar is extremely high here for us to hit.

But in this world, we're gracefully degrading and we're self-healing. The system is able to correct a lot of the issues without the intervention of people. And that becomes paramount because of engagement time. We don't have five minutes to spare. We have roughly five minutes for the whole year, so we can't wait five minutes for one person to join a call. We need as many failures as possible to be hidden from the customer and the system to continue operating well in the face of them. What we're really aiming for is being like a utility at this point. When you turn on the water, it works. When you flip the power switch, it works. And when you pull out your phone and you open a browser to transfer money or to buy something, it just works. And that's what people are going to expect more and more as time goes on.

To me, this is really just an efficient quality engineering culture. That everyone is responsible for the availability, the performance, and the efficiency of their code. That those are core engineering tenets that we're thoughtful of, and whenever we find opportunities to improve, we are.

And we're sharing these learnings, not just with our team, perhaps not just with the company, but with the community at large. We're able to talk about the failures that we've encountered, the lessons that we've learned, and really help lift up those around us by teaching them the hard-fought lessons we've learned at 2:00 in the morning when the coffee hasn't kicked in and when the system has been in a state of duress.

And this is really the future. The world that we want to live in. The world where software just works.

So I hope you found this useful and engaging. We'd love to have you come participate in our community so that you can learn more, so that you can share these stories, so that we can help improve the reliability of the Internet overall and all of our systems. Thank you very much.