Why Does Capital One Test in Production?

Virtual US 2022

Bryan Pinos, Sr. Director, Software Engineering · Capital One

Yar Savchenko, Director, Software Engineering · Capital One

In the IT industry, testing in production has always been considered an anti-pattern. However, at Capital One we have been successfully testing our critical, customer-facing digital applications in production for over 18 months now! Why do we do it?

The answer is simple: the alternative to proactive chaos engineering is reactive crisis management.

Chaos engineering is not new to the industry or to Capital One. What's new is the scale of the experiments we execute on a regular basis and the set of closely integrated software solutions we use to make them successful.

Full transcript

Bryan Pinos

Hi everyone. My name is Bryan Pinos. I'm a senior director of software engineering at Capital One. In my role, I'm responsible for keeping Capital One's banking and credit card services always on for our customers.

I do that through enabling teams to proactively find latent defects in their applications and infrastructure through chaos experimentation. And we develop a suite of in-house tools that enable our teams to remediate problems in production through automation.

With me today is Yar Savchenko. Yar, would you like to introduce yourself?

Yar Savchenko

Definitely. Thank you, Bryan.

Hello, everybody. My name is Yar Savchenko, and I happen to work for the same company as Bryan, Capital One, and I am a director of software engineering. I work in the same areas as Bryan, and we provide availability engineering operational support to all of our critical applications.

And not only do we do that, but we also try to find those latent defects that Bryan mentioned, and this presentation is all about how we're doing it.

So first, a little bit about Capital One and who we are.

When most people think of Capital One, they think either of the Visigoths getting ready to rampage on credit-card customers until they find out that they are Capital One customers, or of Jennifer Garner asking, "What's in your wallet?"

What most people don't know, though, is that we're the first bank running entirely on the public cloud. We're pretty proud of that. We're a 25-year-old company, and to think about that, we started out in the data center as a traditional, legacy bank, and then we migrated to the cloud through a transformation that took several years. And now we're 100% in the cloud.

All along the way, we've been led by our founder, Rich Fairbank. We are also one of the top 10 banks and credit card issuers, and we're the second-largest auto-loan originator. We have 100 million customers. We have 50,000 associates. Many of those associates are technology associates, and many of those are software engineers.

At Capital One, we don't say that we're just a bank. We like to say that we are a tech company that does banking.

Bryan Pinos

All right. Thank you, Yar.

Yar Savchenko

Thank you, Bryan. So now we get to the main portion of our presentation, and hopefully the title that intrigued you all: Capital One tests in production.

If you have been in tech for any amount of time, you know that testing in production is considered a bad word. That means that you did not fully test your code, you deployed it to production, and something went horribly wrong, and you negatively impacted your customer.

That is obviously something that every company out there is trying to avoid. However, at Capital One, we do test in production. And not only do we do it every once in a while, we do this on a regular basis, both planned and no-notice events. And you know what? We're not afraid to admit it.

So the question that we get asked a lot is: why do we do it? Why do we test in production?

Well, to be totally honest with everybody attending the session, it's because maintaining consistency between lower QA environments and production is extremely difficult. There are numerous reasons for that. Some of it comes down to cost. Spending millions, sometimes hundreds of millions of dollars, to make your QA environment just like production is not cost-effective for a lot of organizations.

In addition to that, there are so many changes that happen in the lower environments at all times. So bringing some level of consistency, to make sure that your codebase is the same across the production and QA environments, is almost impossible. There are ways that you can enforce that with a simulated production environment and others, but it's extremely difficult, and sometimes it isn't worth the trouble.

One of the things that we deal with is: how do we generate the load that we see in production? If you have an application that consists of a number of microservices, and the customer can use some of those microservices depending on their transaction, how do we mimic this in a QA environment?

We can do that with some tools that are available on the market. We can develop our own tools. But the complexity of it, to truly mimic what a customer does with your application, regardless of whether it's a mobile or a web application, is very, very challenging.

Again, just like mimicking production in a QA environment, it can be done, absolutely. But the cost and the effort to do that sometimes aren't justified.

So what are we doing about it? We're conducting chaos experiments on a regular basis, as I mentioned before, both planned and no-notice. And the way that we're able to do this is by utilizing industry tools such as AWS Fault Injection Service and other tools, as well as some of the internally developed Capital One tools that allow us to test various failure scenarios in a controlled environment in production.

And having those tools, we also have a program wrapped around it. One of our biggest programs that started several years ago is Game Day. And as part of the Game Day program, we're able to consistently, on a monthly basis, conduct chaos exercises on our most critical applications in production. And let me tell you, we have been very successful so far.

Bryan Pinos

So how do we do this at scale? That's what this really is all about.

If you have a simple application running on a few EC2 instances and a database, it's probably somewhat easy for the engineers to really understand the full dependency map and the complexity of it. But when we start scaling up from a few EC2 instances to thousands upon thousands of EC2 instances, and an entire ecosystem of microservices that all depend on one another, it gets too complex for the engineers and the production-support teams to really keep in their head.

So we look to tooling. As we instrumented and automated our tooling to implement chaos experiments, we looked at what was out in the marketplace, and we decided to leverage AWS Fault Injection Service. In addition, we also leverage AWS Systems Manager, but then we paired that with some of our internal tools, like Cloud Doctor, to manage our entire environment, to understand the complexity of our environment, and to tailor those tools to how we would use them in our environment.

We do all of this to try to simulate app-layer failures. Because when an app-layer failure occurs in the natural production environment, we always have to postmortem and figure out what happened, and customers are impacted. If we can simulate app-layer failures in a small percentage of the production environment, we can see what happened, and then we can make the application more resilient against it.

That's one aspect of our chaos experimentation. The other aspect is how we look at our infrastructure, our environments, and how they fail. AWS will tell you in their best practices that you have to be resilient against failures, especially within a region, and that's why they have provided multiple availability zones.

At Capital One, we like to simulate availability-zone failures to show that our applications, APIs, and microservices can always withstand a single availability-zone failure. Is there enough capacity in the other availability zones to handle it? When a failure occurs, does the primary database automatically move to one of the other availability zones? Do all the instances reconnect to it? Do the containers all reconnect to it automatically?

These are things that we like to simulate in the controlled environment so that we understand what could happen in a real environment in the middle of the night, when everybody's sleeping and can't react really quickly. We want it to obviously automatically heal and ensure that we protect the customer's experience.

Additionally, it doesn't happen often, but there have been times where there are regional failures, when an entire region at AWS might go down for one reason or another. In those occurrences at Capital One, we like to simulate what would happen and ensure that we have the capability to run out of a single region. So we set up large testing events to do this, and we use tooling and automation to simulate those types of events.

Some scenarios we just can't actually instrument and simulate. In those cases, we host exercises internally where we talk about what we would do. This is a little old-school, but in some cases it's still required. Let's say all of AWS went down. What would Capital One do? Those are the types of scenarios that we still have to think about.

But look at this in layers. We consider those discussion-based exercises the least technical layer. As we move backwards, we can go through and instrument and simulate different layers. And I think Yar mentioned it. We have game days. We also do specific chaos exercises that we leverage to test against specific architectural standards at Capital One. And then we also do what we mentioned before about regional isolation: how do we isolate all of our services to a single region and make sure that we can operate out of that one region?

But in order for all of this to work, you have to standardize deployment. In order to standardize deployments, you have to embrace infrastructure as code. Without that, you end up having uniqueness, or there end up being things that change because of the human aspect of going out and building those things.

In order to understand the complexity of all the dependencies in the cloud and the call flows, you have to invest in tooling. In a lot of cases, some of this is available in the marketplace, and some of this is stuff that you might have to build internally.

And then lastly, you've got to get rid of the manual intervention. You've got to do targeted exercises. So when we have exercises that we can do, like Yar mentioned, they're no-notice exercises, where we don't want to give notice that we're going to perform a chaos experiment. We want to just go in and do the experiment and see how the application naturally acted and self-healed.

That's how you've got to get to that, so that your engineers aren't getting paged at two in the morning, or on Christmas Day, or on New Year's Day, and they're able to enjoy their time with their families because the system is able to self-heal and continue to provide that customer experience.

Yar Savchenko

All right. Now let's talk a little bit about benefits.

I think both Bryan and I mentioned that it does take some time and investment to get to the point where you are consistently doing chaos exercises. So what are the benefits? We'll use Capital One as a case study and talk about the benefits that we have realized.

By conducting numerous chaos exercises, both planned and unplanned, we have identified latency. Latency is the bane of existence for anybody who has ever troubleshot a network problem. We can't beat the speed of light. So the more data you have to push through the wire, and the further your data centers or regions are apart from each other, the more latency you'll see.

In cases where you have a transaction that gets bounced multiple times between two separate data centers or regions, if you're in a cloud, you see a latency increase with each bounce back and forth. Sometimes that could still be within acceptable parameters. However, if something does change, one of your components fails, it doesn't have enough capacity, or your primary database that you're writing to moves from one data center to another or one region to another, your latency could exceed the timeout threshold, and then you're starting to negatively impact your customers.
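The latency arithmetic described here can be sketched in a few lines of Python. This is a minimal illustration, not Capital One's tooling: the function names, the 70 ms round trip, the bounce counts, and the one-second timeout are all assumptions chosen to show how a failover can push a transaction past its timeout.

```python
def total_latency_ms(local_processing_ms: float,
                     cross_region_rtt_ms: float,
                     bounces: int) -> float:
    """Latency of one transaction that crosses between regions `bounces` times."""
    return local_processing_ms + bounces * cross_region_rtt_ms


def exceeds_timeout(latency_ms: float, timeout_ms: float) -> bool:
    """True when the transaction would be cut off by the timeout."""
    return latency_ms > timeout_ms


# Hypothetical numbers, for illustration only.
TIMEOUT_MS = 1000.0

# Normal operation: the call path crosses regions 4 times.
normal = total_latency_ms(local_processing_ms=200.0,
                          cross_region_rtt_ms=70.0,
                          bounces=4)

# After the primary database moves to the other region, every database
# call crosses regions, so the same transaction now bounces 12 times.
after_failover = total_latency_ms(local_processing_ms=200.0,
                                  cross_region_rtt_ms=70.0,
                                  bounces=12)

print(normal, exceeds_timeout(normal, TIMEOUT_MS))                  # 480.0 False
print(after_failover, exceeds_timeout(after_failover, TIMEOUT_MS))  # 1040.0 True
```

The point of the sketch is that nothing in the code path changed; only the placement of one component did, and that alone moved the transaction from comfortably inside the timeout to outside it.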

Or it could get to the point where your timeout could be set to 30 seconds, and no customer is going to wait 30 seconds for your application to load. Let's be honest: people are not willing to wait for something to load. We have been using Netflix. We have been using all of those tools that are available to us immediately at the click of a button.

By conducting these exercises, we have found a number of cases where moving components away from each other introduced latency. We were able to resolve those, and we'll talk a little bit more about how we address them, sometimes by rearchitecting the whole application, which does take a while. Sometimes it's a quick fix. Sometimes it's moving one component, or maybe sizing it correctly.

Talking about sizing, I will move to the next most common finding that we have had: capacity. Again, it doesn't matter if you are housed in on-prem data centers, in the cloud, or hybrid, where you have some components in a cloud and some in your on-prem data centers. Being able to size your resources correctly is very, very important.

The benefit of cloud is that you can scale up almost without limit. AWS, Google, Microsoft, they all have almost unlimited resources, and you can scale up your servers to pretty much any reasonable number. There is a cost to that. But taking cost out of the equation, sometimes you don't size your cloud resources correctly.

So when the traffic shifts from one region to another, or from one data center to another, you just don't have enough pure computing power to process all of those requests, and you start dropping customer transactions, negatively impacting them. Again, that is something we're trying to avoid.

In conducting chaos exercises at Capital One, we have found a number of cases where our critical applications just weren't sized correctly for a spike in user access. Those spikes sometimes happen for reasons that we know and expect, such as when people get paychecks, and sometimes they happen for reasons that we don't expect or can't predict.

The other aspect you have to take into account is: how will your critical system perform under extreme load? If you're running in two data centers and you're splitting your traffic 50/50, then your systems are running at 50% utilization. What happens if one of those regions or data centers fails, and your single one has to service 100% of the traffic? Is your gateway or your load balancer sized appropriately to be able to handle that extreme load? Those are all of the great questions that you will hopefully answer by conducting chaos experiments.
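The 50/50 example above is simple arithmetic, and a short Python sketch makes the sizing question concrete. It assumes identical sites and evenly balanced traffic; the function names and the 80% safe ceiling are hypothetical, not figures from the talk.

```python
def utilization_after_failure(sites: int,
                              utilization_per_site: float,
                              failed: int = 1) -> float:
    """Utilization of each surviving site once the failed sites' traffic
    is spread evenly across the survivors.

    Assumes identical sites and evenly balanced traffic.
    """
    if failed >= sites:
        raise ValueError("at least one site must survive")
    total_load = sites * utilization_per_site
    return total_load / (sites - failed)


def survives_failure(sites: int,
                     utilization_per_site: float,
                     failed: int = 1,
                     safe_ceiling: float = 0.8) -> bool:
    """True when survivors stay at or below the (assumed) safe ceiling."""
    return utilization_after_failure(sites, utilization_per_site, failed) <= safe_ceiling


# The example from the talk: two sites, each at 50% utilization.
print(utilization_after_failure(sites=2, utilization_per_site=0.5))  # 1.0

# With an assumed 80% safe ceiling, that setup has no headroom left.
print(survives_failure(sites=2, utilization_per_site=0.5))  # False

# Three availability zones at 50% each leave the survivors at 75%.
print(survives_failure(sites=3, utilization_per_site=0.5))  # True
```

The same calculation applies to any shared component in the path, such as a gateway or load balancer, which is why a chaos exercise that actually shifts the traffic is a more trustworthy check than the spreadsheet version.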

Whenever you conduct chaos experiments and you have findings, following up on them is very important. If you have been in the IT industry, you know that nobody likes process, because process does add some red tape and slows things down. But wrapping a process around addressing the findings that you come up with from the chaos exercises is extremely important.

At Capital One, all of the high-severity findings, most of them -- I wouldn't say all -- have been resolved in 30 days or less. As I mentioned, there are some cases where whole applications have to be redesigned or rebuilt from scratch, and that does take months, sometimes years. However, if something is not sized correctly, that can be addressed fairly quickly, addressing those latent defects that we talked about and making your environment a lot more resilient.

Bryan Pinos

So are there risks? Well, of course, there's risk to testing in production. But does the value of those tests outweigh those risks? We think yes.

Things you've got to watch out for, though: sometimes there are unexpected impacts. Yar talked a lot about latency. Latency occurs sometimes when you're not thinking about how the call traffic is bouncing between regions. And so you eliminate or move one component to another region, and you see a bunch of latency.

It could be actual failures. Things aren't set up to work the way we thought they would have worked, and we see failures. Can that happen? We can actually have a real incident at the same time we're doing a chaos test in production. That's possible too.

So you have to be prepared. The key to managing risks is mitigation. We have some things that we've learned along the way that we think are important.

One, you have to have a playbook up front that defines the agreed-upon work with your business stakeholders, your product stakeholders, and your tech teams to really understand and say: okay, if this happens, if this number of customers is impacted, if things get this bad, we'll abort the test. Establish that up front so that it's not a line-of-scrimmage call that you're making in the heat of the moment, but you actually understand the rules that everybody's playing by. And ensure that whatever you do, you can undo in under five minutes.
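One way to keep the abort decision from becoming a judgment call in the moment is to write the agreed thresholds down as data and evaluate them mechanically. The Python sketch below is only an illustration: the class names, fields, and threshold values are hypothetical, and only the five-minute rollback limit comes from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortCriteria:
    """Thresholds agreed with business, product, and tech teams up front."""
    max_impacted_customers: int
    max_error_rate: float
    max_rollback_seconds: int = 300  # "undo in under five minutes"


@dataclass(frozen=True)
class Observation:
    """What monitoring reports at one point during the experiment."""
    impacted_customers: int
    error_rate: float


def should_abort(criteria: AbortCriteria, obs: Observation) -> bool:
    """Abort as soon as any agreed threshold is exceeded."""
    return (obs.impacted_customers > criteria.max_impacted_customers
            or obs.error_rate > criteria.max_error_rate)


def rollback_is_acceptable(criteria: AbortCriteria,
                           rehearsed_rollback_seconds: int) -> bool:
    """The rehearsed rollback must finish in under the agreed limit."""
    return rehearsed_rollback_seconds < criteria.max_rollback_seconds


# Hypothetical thresholds for one experiment.
criteria = AbortCriteria(max_impacted_customers=100, max_error_rate=0.02)

print(should_abort(criteria, Observation(impacted_customers=10, error_rate=0.006)))   # False
print(should_abort(criteria, Observation(impacted_customers=250, error_rate=0.006)))  # True
print(rollback_is_acceptable(criteria, rehearsed_rollback_seconds=240))               # True
```

Checking the rollback time before the experiment starts, rather than during it, mirrors the talk's point that the rules should be settled before anything is broken.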

This ensures that you have complete control to eliminate the impact that you might be causing on your customers during some sort of chaos experiment or game-day event that's happening in production.

But really the key is: you have to know there's a problem. Real-time monitoring of all your critical systems and transactions during and after the exercise is of the utmost importance. You have to understand what your steady state is. If you normally have a half-percent error rate, then you need to know that before you start your test. That way, when you start looking during your test, you're not saying, oh, there's a half-percent error rate, is this because of the test, or is this always out there? We need to understand that up front.

Understanding what your steady state is before the test, what things look like during the test, and then returning to steady state after the test is also very important. Because if you inject latency somewhere into the call path and then you don't verify after the test that the latency goes away, then you could have a very big problem.
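The before, during, and after comparison can be expressed as a small check against a recorded baseline. This Python sketch assumes hypothetical request counts and a hypothetical tolerance of 0.2 percentage points; only the half-percent baseline comes from the talk.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed in a measurement window."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return errors / requests


def deviates_from_baseline(observed: float,
                           baseline: float,
                           tolerance: float) -> bool:
    """True when the observed rate is above baseline by more than tolerance."""
    return observed > baseline + tolerance


# Steady state measured before the experiment: a half-percent error rate.
BASELINE = 0.005
TOLERANCE = 0.002  # assumed acceptable noise around the baseline

before = error_rate(errors=50, requests=10_000)   # matches steady state
during = error_rate(errors=120, requests=10_000)  # elevated by the experiment
after = error_rate(errors=52, requests=10_000)    # back near steady state

print(deviates_from_baseline(before, BASELINE, TOLERANCE))  # False
print(deviates_from_baseline(during, BASELINE, TOLERANCE))  # True
print(deviates_from_baseline(after, BASELINE, TOLERANCE))   # False
```

The last check is the one that is easy to forget: if the post-experiment window still deviates, the injected fault has not been fully removed.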

Another thing that we've done to mitigate the risks associated with production is maybe the first time you do the test, you don't do it at your peak production load. Obviously, I hope you would do it maybe in a non-prod environment first, although we know it doesn't match. But at least you can prove out the test.

Then take it to a production environment. Maybe start in the middle of the night. I know nobody wants to be up all night, but start in the middle of the night, prove it out, show how it might work, and then bring it more and more into production times.

We found that this also helps bring our business and product stakeholders along the way, because if you go to them and say, hey, our peak time is XYZ and I want to test at that time, they're going to look at you like you've lost your mind. So bring them along, because it's not just the tech stakeholders you have to bring along. You also bring along your business and product stakeholders.

Lastly, I know for some events we like to do what we call no-notice. That is, we're going to break something and see if automation fixes that, if our alerting works, if the engineers hop on a bridge and fix it. We're testing not just the technology, but also the people and the processes that go along with that technology.

But then in some tests, you also want to have your dedicated engineers on standby and ready, because if something breaks, if something goes sideways in a way you didn't anticipate, make sure you can fix it so that you're not impacting customers for a prolonged amount of time. That's obviously the most important point of this whole exercise: how do we protect the customer experience? How do we ensure that we're always on for those customers when those customers need our services?

And if our testing injects more problems for our customers than it helps us in the future, then we're not really getting the value out of the testing that we need. So it's important that we do all that we can to mitigate the risks to our customer experience along the way.

Yar Savchenko

All right. So now we get to the what's-next portion of the presentation.

We talked about some of the great things that we have accomplished with chaos engineering or chaos exercises at Capital One. We talked about the benefits that you could realize from conducting these types of exercises, as well as some of the potential risks and mitigation techniques.

But what do we plan to do next at Capital One? Again, you should never stop growing. Status quo is the worst thing that you can achieve in the IT industry, and we want to make sure that we continue to push the envelope and test different scenarios.

What we plan to do next year is to bring all our critical applications into the scope of chaos exercises. Right now, we primarily test with our customer-facing applications from the digital side, both website and mobile app. However, we plan to expand that to other parts of our environment, such as call centers and IVR systems.

Those have always been considered separate from the IT domains. However, chaos could really be a beneficial tool to test the resiliency of those systems as well, to provide better customer experience and brand, as Bryan mentioned.

In addition to that, third-party vendors. I would say most companies out there utilize third-party vendors to some extent, and chaos will allow us to test and proactively identify gaps that we might have with utilizing those third-party vendors. It could range from a connection point to a specific vendor, to what happens if the traffic you get from that vendor slows down -- back to the latency that we brought up across several topics of this presentation.

Being able to effectively test these third-party vendors is very, very important to ensure that your environment is fully resilient.

In addition to testing third-party vendors, we want to be able to test on the highest-volume days with no advance notice. As Bryan mentioned, if you're just starting on the chaos journey, obviously don't do this as your first exercise. But at Capital One, we're getting to the maturity level where we can do a chaos exercise on double-payday Friday, one of our highest-volume days, where our customers are logging in to check their account balance, pay their bills, and do a lot of various financial transactions.

We want to introduce chaos on that day to see how resilient our environments are without letting anybody know. That is the key, because when a real incident happens in production, usually nobody is aware that it's going to happen and there's not enough time to prepare. So doing this in a controlled environment is the key to building that muscle memory across all of the engineering teams.

In addition to that, we want to integrate what we call planned failures during the chaos experiments. There are a number of tools that engineers love using. For example, Zoom, as this is what we're recording this call on; in addition to that, Slack, Splunk, New Relic. There are a number of monitoring tools, and part of the experiment that we do when conducting game days is asking the engineers not to use a specific tool.

That gives them a chance to ensure that you're not tied to a single monitoring tool or a single communication tool. In case that tool goes away during a production incident, they could easily switch to something else and continue to troubleshoot and resolve problems.

What we are also working on is generation of fake HTTP error codes. I think we all know and love 404 as a common error, and the 500 error codes. The ability to inject those codes into the application stack will allow us to determine how downstream applications react, and do so in a very controlled manner, because we can control the number of errors we inject.
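As a rough idea of what controlled error injection can look like, the Python sketch below wraps a request handler so that a configurable fraction of calls returns a fake error status instead of reaching the real handler. Everything here is an assumption for illustration: the handler shape, the names, and the 10% rate are not Capital One's implementation, and the random generator is seeded only to make the run repeatable.

```python
import random
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]           # (HTTP status code, body)
Handler = Callable[[str], Response]  # takes a request path


def with_fault_injection(handler: Handler,
                         error_rate: float,
                         status: int = 500,
                         rng: Optional[random.Random] = None) -> Handler:
    """Wrap `handler` so that roughly `error_rate` of calls return `status`."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    generator = rng if rng is not None else random.Random()

    def wrapped(path: str) -> Response:
        # Decide per request whether to short-circuit with a fake error.
        if generator.random() < error_rate:
            return status, "injected fault"
        return handler(path)

    return wrapped


def healthy_handler(path: str) -> Response:
    """Stand-in for a real service endpoint."""
    return 200, f"ok: {path}"


# Inject HTTP 500 into about 10% of requests.
chaotic = with_fault_injection(healthy_handler, error_rate=0.1,
                               status=500, rng=random.Random(42))
statuses = [chaotic("/balance")[0] for _ in range(1000)]
injected = statuses.count(500)
print(f"{injected} of {len(statuses)} requests received an injected 500")
```

Because the error rate is a single parameter, the blast radius can start very small and be raised gradually, and setting it back to zero removes the fault entirely.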

The last thing is validating automated recovery techniques. I think Bryan really hit on this key point, but we want the system to self-heal. We want any incident that occurs in production to be self-resolved, so the engineers don't have to get called in the middle of the night.

Introducing chaos in our production environment allows us to test those recovery techniques and make sure that your triggers are set correctly, that you are moving traffic between regions or data centers at the appropriate time and not causing yourself more latency or more issues.

Our end goal is to ensure that the game-day exercises can be executed on the highest-volume day, where we have the highest number of customers logging in, unannounced, when nobody outside of a single team is aware that they're happening, with a single click. We don't want to make it very complicated. We don't want somebody to sit down and have to manage and operate it through multiple tools. We want to use a single tool, with one click, that allows you to start the experiment and, if something goes wrong, to roll back.

That is our ultimate goal. And once we're able to achieve all that, then we will achieve a higher level of resiliency. And I think that is the goal for everybody attending this conference.

With that said, I'm going to pass this on to Bryan for closing comments.

Bryan Pinos

All right. Well, thank you everybody for attending our presentation today, and we hope that you found the information useful. We look forward to answering your questions in the presentation Slack channel. Thanks.

Yar Savchenko

Thank you.