Continuous Chaos in DevOps
DevOps has brought cultural change to the delivery of software, and the cloud is a big enabler of DevOps.
The cloud is ubiquitous; however, failures are unavoidable and unpredictable, which raises uncertainty and puts systems at risk. Adding continuous chaos to DevOps culture helps build anti-fragile applications in the cloud. Traditional approaches can’t predict all failure modes. Chaos engineering is a discipline for simulating these failures.
In our Cloud Detour solution, we describe how mission-critical applications using compute, storage, and other services in the cloud can register themselves in an automated way to test against several failure scenarios and demonstrate their resiliency level. Cloud Detour itself is management-free thanks to its serverless implementation and can be integrated with application CI/CD pipelines.
Full transcript
Gnani Daththreya
The topics that we are going to cover today are tech transformation in Capital One: what is the tech transformation, how we are building our anti-fragile applications, and then we are going to talk about an in-house engineering solution for chaos testing and different types of failures we induce as part of our application resiliency testing.
So where we are today, right?
In Capital One, we started this tech transformation journey around four years ago. It's an exciting journey, and we are having an amazing time as part of the technology organization in Capital One. You will hear more about that in tomorrow's keynote presentation from Topo Pal from Capital One.
Four years ago, when we started, the timing was perfect. Everything came together. We made the decision to go to public cloud, and we made the decision to embrace open source-first solutions. We embraced the microservices architecture and DevOps culture. That was the beginning of the journey four years ago.
Fast-forward to today, this is what our architecture is like. Any net new applications we develop, we develop cloud-native using microservices architecture. And we slayed a number of monoliths, and we decomposed them into microservices, and we have them running in our public cloud. Hundreds of microservices are running in our systems, or in our public cloud.
So what are the side effects of this?
The side effects of going into public cloud, leveraging microservices architecture, is that you have increased the complexity and dependency among many services to provide your features.
And now what happens is we have this rate of change in how much we deliver software. The rate of software delivery is increasing, and our scale is increasing, too.
So when you have the rate of software delivery increasing and scale increasing, failures are bound to happen, and failures will happen. That is what we are going to talk about: how we are changing the culture within our development teams to embrace failures. Failures will happen.
So when you have the scale problem, we use a number of cloud services. Cloud services will fail. And because of our DevOps culture, our rate of software delivery has increased tremendously. That, too, will produce failures.
So now we need to embrace failure as part of our development. That's the culture change we are having in Capital One at this time of our tech transformation journey.
Here is what it is. Amazon's CTO said it: everything fails all the time. So when we deploy our applications, starting from our infrastructure layer, managed services layer, we have hosts, we have containers, we have N-tier applications multiplied by dependencies to others. Anything can fail anytime.
So how are we building our applications and embracing these failures so our applications are anti-fragile and resilient? And what solutions and tools have we developed within Capital One to put in the hands of our engineers, so our development teams can proactively deal with failures and don't run into them at the worst possible moment in production?
So that is why chaos engineering is embedded into our DevOps culture.
How do you navigate complexity and chaos within an organization?
Organizations like Capital One, which are financial institutions, or think about healthcare organizations or government agencies, have cultural differences. They have different skill levels. You cannot expect the same skill level in an enterprise that you would find in a software engineering company.
So how do you deal with that complexity of different skill levels? Also, when you navigate through these complexities, are you ready to embrace failures? How will you embrace major decisions like cloud migration and chaos engineering within an enterprise?
Chaos engineering is all about anti-fragility. Let's talk about fragility. When you package, say, some wine glasses, you label the package "Fragile, handle with care" and give it to your shipping carrier.
But can you be so confident in your packing that you declare, "I am anti-fragile. Please mishandle it. We'll be just fine"?
Can you get to that culture?
That's exactly what chaos engineering, embracing failures, is all about.
At the pace we are shipping things, can you be resilient? So when an outage happens, are you still operating at acceptable customer satisfaction levels? Are you resilient? Are you available?
When a regional-level outage or a data center outage happens, are you automatically running your application in a different region or a different data center? And do you continue to be available, even with limited functionality?
Are you reliable even during small outages or even major outages? Are you reliably operating at your SLA?
For example, when the S3 outage happened earlier this year, Netflix was still operating at acceptable levels of reliability. So it means they have designed their systems around anti-fragility. Basically, they are not fragile. So that's where chaos engineering comes in.
So how do we embrace failures?
In this case, chaos engineering is not testing. It's more like experimentation. It's a cultural change.
When you do testing, you know that you are going to click a button and something is going to happen. You are going to assert against what's going to happen.
But when you do experimentation, you are going to study the behavior of the system. You are going to define your steady state: this is how I am going to be operating in normal times. Now I am going to make a small change to my system, or a large change. What will happen to the system?
You may be able to predict what is going to happen, but the experimentation is going to be more a study than, "Hey, this is what's going to happen. I'm going to be just fine." Sometimes what happens may not be acceptable. So you go back to your resiliency and fix things.
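The experiment loop described here (define a steady state, make a change, study what happens) could be sketched roughly as follows. This is only an illustration under stated assumptions: `run_experiment`, the `error_rate` metric, the 0.05 threshold, and the toy three-instance system are all hypothetical and are not taken from CloudDetour.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ExperimentResult:
    steady_before: bool  # was the system in steady state before the change?
    steady_after: bool   # is it still in steady state after the change?


def run_experiment(
    measure: Callable[[], Dict[str, float]],
    is_steady: Callable[[Dict[str, float]], bool],
    disrupt: Callable[[], None],
) -> ExperimentResult:
    """Check steady state, apply one disruption, then check again."""
    if not is_steady(measure()):
        # Do not inject failures into a system that is already unhealthy.
        return ExperimentResult(steady_before=False, steady_after=False)
    disrupt()
    return ExperimentResult(steady_before=True, steady_after=is_steady(measure()))


# Toy system (hypothetical): three instances behind a load balancer.
system = {"instances": 3}


def measure() -> Dict[str, float]:
    # Pretend the error rate jumps once fewer than two instances remain.
    return {"error_rate": 0.01 if system["instances"] >= 2 else 0.5}


def is_steady(metrics: Dict[str, float]) -> bool:
    return metrics["error_rate"] < 0.05


def terminate_one() -> None:
    system["instances"] -= 1


result = run_experiment(measure, is_steady, terminate_one)
print(result)
```

The point of the structure is that the outcome is observed rather than asserted in advance: the same harness reports a surprise just as readily as a pass.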
For example, you cannot even do chaos engineering if your system does not have at least a minimal level of resiliency designed in. If you are not designed around multiple availability zones, for example, in an AWS region, then you may not even be willing to kill an instance or kill instances in an availability zone because you are absolutely sure that your system is going to fail.
So first you will work around your resiliency limitations, then you will come to a culture of being able to practice chaos engineering.
Why do we call chaos engineering experimentation?
When you make small changes, the impact will be tolerable. For example, you terminate an instance. You designed your system around availability, so your system may be able to handle it, barring a couple of people losing their sessions. Your system performance may be overall acceptable.
Let's say your caching node times out, and you may be just fine. You have made design decisions that help you survive those failures.
But can you handle both? You have a caching system failure, and one of your instances disappears at the same time. Do you know exactly what will happen to your system? Will it still operate at the acceptable level of resiliency and availability?
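One way to picture why combined failures turn this into experimentation is to enumerate the scenarios. The sketch below is illustrative only: the disruption names and the `failure_scenarios` helper are hypothetical, and a real tool would likely sample from this space rather than run everything.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def failure_scenarios(
    disruptions: Sequence[str], max_concurrent: int = 2
) -> List[Tuple[str, ...]]:
    """List every set of up to max_concurrent disruptions applied together."""
    scenarios: List[Tuple[str, ...]] = []
    for size in range(1, max_concurrent + 1):
        scenarios.extend(combinations(disruptions, size))
    return scenarios


# Hypothetical disruption names based on the examples in the talk.
disruptions = ["terminate_instance", "cache_timeout", "slow_database"]
scenarios = failure_scenarios(disruptions)

for scenario in scenarios:
    print(" + ".join(scenario))
```

Even with three disruption types, allowing pairs doubles the number of scenarios from three to six, and the count grows quickly as components and concurrency increase, which is part of why doing this by hand does not scale.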
This is more like experimentation. Now you don't know what to test. How will you test? Testing this manually is not going to be easy. You cannot rely on people just waking up to create failures. So that is where we need innovation and automation and integration with the DevOps pipelines.
So train in the calm. Before the storm hits, how do we train in the calm so our applications are resilient when the storm really happens?
And chaos engineering is this discipline of inducing intentional disruptions to your applications. So you study the application failures and actually have a plan for how you will tolerate such failures so you can build anti-fragile applications.
And here, Netflix, many of you may be aware, has released the Simian Army as an open source project. It's probably eight years old at this time. And we are totally inspired by the concepts of chaos engineering from Netflix Simian Army.
However, we couldn't use it as is for Capital One. Ours is a large company with hundreds of applications in a different operating model. And our requirements span across a number of cloud providers, a number of cloud services. It's not a single cloud service.
For example, in AWS, you have Auto Scaling groups. Simian Army, or Chaos Monkey, goes after Auto Scaling groups. But for Capital One, we use AWS and we use other cloud providers. In AWS, we use a number of services: Auto Scaling groups, ECS Container Stack, RDS, EBS volumes, and a number of others.
So we needed a solution that can create disruptions across any number of services and across cloud providers. That is our design goal. And today, we have made it successfully work for a number of services in AWS Cloud.
And speaking of our DevOps culture, we strongly promote inner source-based projects within Capital One. We set up this chaos engineering tool, which we named CloudDetour, as an inner source project, and we have participation from a number of development teams in contributing to the features.
The way this tool operates is that in each of the AWS accounts, it is installed as a foundational component. And as an application developer, when I want to test my application for different types of failures, I don't need to do anything, in the sense that I, as the app owner, don't have to create tools or install tools. It's all provided as a foundational service.
I only have to go and subscribe my application for the CloudDetour disruptions.
So let's take an example. I am developing an N-tier web app, right? So what are the number of components I will have?
I will have an infrastructure component, let us say Auto Scaling groups, and maybe I'll have a container stack, like an ECS cluster. And on top of that, I will have ECS services or the containers themselves. And then I will have application components running inside Docker containers, and there could be a number of processes. And this application will be talking to external services. Maybe I'll be talking to an S3 bucket. Maybe I'll be talking to another external microservice.
All of this makes up the ecosystem of my application.
So what do I do now? I go to CloudDetour, which is the solution, the chaos engineering automation solution that is running in our AWS account. And I make an API call or a CLI call describing my application, saying, "I have all these layers." And for each layer, I say, "I choose this type of disruption."
For example, if it's an Auto Scaling group, I will say, "Hey, go and terminate machines randomly." And I have my N-tier application: go and terminate the web part of my N-tier application. Slow down my database.
For each of the components, we have N number of disruption types. The app owner chooses the components that make up the application. For each component, what is the disruption type? And for each disruption, what is the schedule?
I want to do it every Wednesday at 12:00 noon. I don't want my application to be disrupted on Mondays, but Fridays are okay. You could choose all of these permutations and make the CLI call or the API call to CloudDetour, and it'll schedule disruptions to your application.
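A subscription of this shape (components, a disruption type per component, a schedule per disruption, and days to avoid) might be modeled as below. The talk does not show CloudDetour's actual API or CLI, so every class, field, and value here is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Set


@dataclass
class Disruption:
    component: str        # e.g. "auto_scaling_group" (hypothetical name)
    disruption_type: str  # e.g. "terminate_random_instance" (hypothetical)
    weekday: int          # 0 = Monday ... 6 = Sunday, as in datetime.weekday()
    hour: int             # hour of day on a 24-hour clock


@dataclass
class Subscription:
    application: str
    disruptions: List[Disruption]
    blackout_weekdays: Set[int] = field(default_factory=set)

    def due(self, now: datetime) -> List[Disruption]:
        """Return the disruptions scheduled for this weekday and hour."""
        if now.weekday() in self.blackout_weekdays:
            return []  # the owner opted out of disruptions on this day
        return [
            d for d in self.disruptions
            if d.weekday == now.weekday() and d.hour == now.hour
        ]


subscription = Subscription(
    application="n-tier-web-app",
    disruptions=[
        Disruption("auto_scaling_group", "terminate_random_instance", weekday=2, hour=12),
        Disruption("database", "slow_down", weekday=0, hour=9),
    ],
    blackout_weekdays={0},  # no disruptions on Mondays
)

wednesday_noon = datetime(2017, 11, 15, 12, 0)  # a Wednesday
monday_morning = datetime(2017, 11, 13, 9, 0)   # a Monday

print([d.component for d in subscription.due(wednesday_noon)])
print(subscription.due(monday_morning))
```

In this sketch the blackout rule takes precedence over any individual schedule, which is one possible reading of "not on Mondays"; a real scheduler could resolve such conflicts differently.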
If the app is built resiliently, then while CloudDetour goes and starts tearing apart your application, your application should keep working just fine.
That is where we have additional features like health check. We go terminate one or more components of your application, depending upon your application profile and request. And we also fast-follow that with a health check and tell you, "We terminated your application and your application is performing just fine." Awesome. Your application is resilient.
And if we terminate one or two components and your app does not work properly, then you have that result and you say, "Oh, maybe we thought this component is redundant, but it's not. Maybe failover is taking longer." You discover so many things that you thought would be working when you actually induce failures to them, one at a time or in many combinations.
So that is how CloudDetour helps applications test resiliency for their apps.
So where is this coming in as part of the CI/CD pipeline?
For many of our applications, we have categorizations based on the SLOs. Some need very high quality of service, meaning they have to be up and running all the time. So we deploy those applications using different availability zones, using different regions, and using the circuit breaker pattern. All of that stuff is part of the application design.
But how are we testing that on an ongoing basis, right? For high-fidelity or high-quality-of-service SLO applications, this is required: as part of your CI/CD pipeline, before you deploy your app in higher environments, we need to see the proof that you have passed.
You claim this app has this high quality of service. Let's test by terminating different components or causing degradation to a number of components and see if the app still meets that level.
Now, you intentionally make that a requirement as part of your CI/CD pipeline. You capture the results, and then you know, okay, this is really resilient, and then you let it go into the production environment or the higher environments.
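A pipeline gate along these lines could be as simple as comparing the captured results against the scenarios an application is required to pass. The function below is a hedged sketch, not CloudDetour's behavior: `resilience_gate`, the scenario names, and the result format are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def resilience_gate(
    results: Dict[str, bool], required: Iterable[str]
) -> Dict[str, object]:
    """Decide whether a build may be promoted to a higher environment.

    results maps a scenario name to True when the post-disruption
    health check passed. Every required scenario must be present
    and must have passed.
    """
    required = list(required)
    missing: List[str] = [s for s in required if s not in results]
    failed: List[str] = [s for s in required if results.get(s) is False]
    return {
        "promote": not missing and not failed,
        "missing": missing,
        "failed": failed,
    }


# Hypothetical scenarios for a high quality-of-service application.
required = ["terminate_web_instance", "slow_database", "block_s3_bucket"]
captured = {
    "terminate_web_instance": True,
    "slow_database": False,
}

decision = resilience_gate(captured, required)
print(decision)
```

Treating a missing scenario the same as a failed one keeps the gate strict: an application cannot pass simply by never running a disruption it was supposed to survive.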
And we don't want to stop there. We want to turn this on on an ongoing basis in the non-prod environments to begin with.
That covers different types of combinations, or maybe you are dependent on an external service, and when that goes down, you learn what the effect is. So we want to turn this on on a weekday, causing disruptions, and disruptions should be part of your daily operations.
And eventually, we want to turn this on in production. So we create those live disruptions, or the mini storms, and sustain those things, and that should be part of our daily operations.
So when really something knocks out, like the S3 failure we had in the beginning of the year, or like last year we had a DynamoDB failure which caused many other services to fail, we would probably be in a better place when we have all these things in place.
So that is the power of CloudDetour, the chaos engineering automation tool that we developed: our app teams don't have to do anything except subscribe for disruptions. Make disruptions part of your pipeline testing. That will intentionally make you think, "These are my areas of failure, my layers of failure," and you will build solutions for how you can react when such a failure happens.
At least it will make you think there are such failures that are going to happen, and you will have some solutions for handling those things.
So that's, at a high level, how we operate in our CI/CD pipeline: embed chaos engineering subscription, and then turn it on in the ongoing environment and make it part of their daily operations.
Sathiya Shunmugasundaram
On the various features of CloudDetour: we have several cloud service disruptions that we are able to cause, which we'll cover in the next slide.
Cloud provider abstraction. Currently, we have designed our system based on AWS Lambdas, completely serverless, which serves the AWS cloud. But we have an abstraction layer that considers services at a higher level: compute services, storage services, networking services, database services. And as we add additional providers like Google Compute or Azure, we will have to add those plugins.
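The abstraction layer with per-provider plugins could look something like the following. This is a sketch under assumptions: the `ComputeProvider` interface, the `register` decorator, and the in-memory plugin are invented for illustration, and the talk does not describe how CloudDetour actually structures its plugins.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class ComputeProvider(ABC):
    """Provider-neutral view of a compute service (hypothetical interface)."""

    @abstractmethod
    def list_instances(self, group: str) -> List[str]:
        ...

    @abstractmethod
    def terminate_instance(self, instance_id: str) -> None:
        ...


PLUGINS: Dict[str, Type[ComputeProvider]] = {}


def register(name: str):
    """Class decorator that adds a provider plugin to the registry."""
    def wrap(cls: Type[ComputeProvider]) -> Type[ComputeProvider]:
        PLUGINS[name] = cls
        return cls
    return wrap


@register("in-memory")
class InMemoryCompute(ComputeProvider):
    """Stand-in plugin so the sketch runs without any cloud account."""

    def __init__(self) -> None:
        self.groups = {"web": ["i-1", "i-2", "i-3"]}

    def list_instances(self, group: str) -> List[str]:
        return list(self.groups.get(group, []))

    def terminate_instance(self, instance_id: str) -> None:
        for members in self.groups.values():
            if instance_id in members:
                members.remove(instance_id)


def terminate_first(provider: ComputeProvider, group: str) -> str:
    """Disruption logic written once against the abstract interface."""
    victim = provider.list_instances(group)[0]
    provider.terminate_instance(victim)
    return victim


provider = PLUGINS["in-memory"]()
print(terminate_first(provider, "web"), provider.list_instances("web"))
```

With this shape, supporting another provider would mean registering one more class that implements the same two methods, while the disruption logic stays unchanged.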
As part of our open source initiative, we are planning to open source CloudDetour, most likely early in the first quarter of 2018. By that time, the AWS implementation will be complete, and while we continue to develop, we'll be looking for contributions for Azure and for industry contributions in general.
There are different types of disruptions: localized disruptions and API-level disruptions.
Localized disruptions are those that happen within a server. For example, in a Linux server, your process can fail, your CPU or memory can be hogged, and network traffic can be severely impacted by iptables changes, things like that. Overall, everything happens within an instance, and you have difficulties.
API-level disruptions mean you are not able to access things like S3 or EC2 Container Service.
Also, we have an approval workflow. In the Netflix case, for example, somebody schedules a disruption and your application will be impacted. Maybe that person is authorized, but still we feel that, as a financial institution, we need more accountability for who created it and who approved it. So we have systems that verify the owner of the particular service, and the owner is contacted and has the option to approve or decline.
That provides an audit trail of who approved. Every disruption is logged into a database, and reporting can be done on what happened to each service.
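The approve-or-decline step with an audit trail might be sketched like this. Nothing here reflects CloudDetour's real data model: the `DisruptionRequest` class, its fields, and the in-memory audit list (standing in for the database mentioned in the talk) are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class DisruptionRequest:
    service: str
    requested_by: str
    owner: str                      # verified owner of the service
    status: str = "pending"
    audit: List[Dict[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._log(self.requested_by, "requested")

    def _log(self, actor: str, action: str) -> None:
        # An in-memory list stands in for the audit database here.
        self.audit.append({
            "actor": actor,
            "action": action,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def decide(self, actor: str, approve: bool) -> None:
        """Only the service owner may approve or decline, and only once."""
        if actor != self.owner:
            raise PermissionError(f"{actor} does not own {self.service}")
        if self.status != "pending":
            raise ValueError(f"request is already {self.status}")
        self.status = "approved" if approve else "declined"
        self._log(actor, self.status)


request = DisruptionRequest(service="payments-api", requested_by="alice", owner="bob")
request.decide("bob", approve=True)
print(request.status, [(e["actor"], e["action"]) for e in request.audit])
```

Recording the request and the decision as separate entries is what lets a later report answer both questions raised in the talk: who created the disruption and who approved it.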
Health checks and preconditions. Health checks are very important. This is not a complete monitoring solution. But if your application availability can be tested using an HTTP URL or a CloudWatch alarm, for example, then CloudDetour, after each disruption, will go and check your health X number of times or for X number of minutes, whatever you wish.
It will create a report of what happened over that time, and it will send a detailed report saying, "Hey, after your disruption executed, we ran the health check five times, five minutes apart. The first two times it failed, and the rest of the times it succeeded." So overall, your health may be okay.
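The repeated health check and its summary could be sketched as follows. The function names and report fields are hypothetical; the `check` callable stands in for an HTTP probe or alarm lookup, and `wait` is injected so that a real run could sleep five minutes between attempts while this example runs instantly.

```python
from typing import Callable, Dict, List


def run_health_checks(
    check: Callable[[], bool],
    attempts: int,
    wait: Callable[[], None] = lambda: None,
) -> List[bool]:
    """Run the health check a fixed number of times, pausing in between."""
    results: List[bool] = []
    for _ in range(attempts):
        try:
            results.append(bool(check()))
        except Exception:
            # A probe that raises (timeout, refused connection) counts as a failure.
            results.append(False)
        wait()
    return results


def summarize(results: List[bool]) -> Dict[str, object]:
    """Build a small report of what happened after the disruption."""
    return {
        "attempts": len(results),
        "failures": results.count(False),
        "recovered": bool(results) and results[-1],
    }


# Simulated outcome matching the example in the talk:
# the first two checks fail, the remaining three succeed.
outcomes = iter([False, False, True, True, True])
results = run_health_checks(lambda: next(outcomes), attempts=5)
report = summarize(results)
print(report)
```

In real use, `wait` might be `lambda: time.sleep(300)` for five-minute spacing, and a report with `recovered` set to false is the kind of result that could trigger an escalation.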
We also integrated with things like PagerDuty to do escalation if the health check fails.
And CloudDetour has a CLI and an API and, as I said, the reporting.
There are a couple of reasons why we didn't want to invest in Chaos Monkey and instead built CloudDetour, our homegrown solution. In the corporate environment, there are things like proxies that you have to take care of. Netflix Chaos Monkey doesn't have proxy support.
Netflix Chaos Monkey is a Java-based application. They might have other versions now, but at the time we were writing ours, it was a Java application. We had to deploy it on a server, and we had to deploy it at the application level. It means if you have 10 applications, you have to have 10 different deployments, one handling each application.
In fact, we decided to go serverless because one day we were planning to demonstrate Chaos Monkey, and one of the developers, the night before, terminated the Chaos Monkey instance. And a few hours before the demo, we couldn't find it.
The demo scope was just terminating an Auto Scaling group. We decided to quickly write a Lambda function that did roughly the same thing, and the demo went just fine. That triggered the idea that we should do a completely serverless implementation so that you don't have to manage it actively in a cloud environment.
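A function of roughly that kind could look like the sketch below. It is not the speakers' code. The two client methods it calls, `describe_auto_scaling_groups` and `terminate_instance_in_auto_scaling_group`, are the boto3 Auto Scaling client operations of those names; everything else, including the function name and the fake client that lets the example run without AWS credentials, is an assumption for illustration.

```python
import random
from typing import Any, Dict, List, Optional


def terminate_random_instance(autoscaling: Any, group_name: str, rng=random) -> Optional[str]:
    """Terminate one random in-service instance of an Auto Scaling group.

    In a real Lambda the client would be boto3.client("autoscaling").
    Returns the terminated instance ID, or None if nothing was eligible.
    """
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = response["AutoScalingGroups"]
    if not groups:
        return None
    in_service = [
        i["InstanceId"]
        for i in groups[0]["Instances"]
        if i["LifecycleState"] == "InService"
    ]
    if not in_service:
        return None
    victim = rng.choice(in_service)
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=victim,
        ShouldDecrementDesiredCapacity=False,  # let the group launch a replacement
    )
    return victim


class FakeAutoScaling:
    """Minimal stand-in for the boto3 client (hypothetical, for local runs)."""

    def __init__(self, instances: List[Dict[str, str]]) -> None:
        self.instances = instances
        self.terminated: List[str] = []

    def describe_auto_scaling_groups(self, AutoScalingGroupNames):
        return {"AutoScalingGroups": [{
            "AutoScalingGroupName": AutoScalingGroupNames[0],
            "Instances": self.instances,
        }]}

    def terminate_instance_in_auto_scaling_group(self, InstanceId, ShouldDecrementDesiredCapacity):
        self.terminated.append(InstanceId)


fake = FakeAutoScaling([
    {"InstanceId": "i-a", "LifecycleState": "InService"},
    {"InstanceId": "i-b", "LifecycleState": "Terminating"},
])
print(terminate_random_instance(fake, "web-asg"))
```

Passing the client in as an argument keeps the selection logic testable on a laptop, and leaving the desired capacity unchanged means the group itself is expected to bring up a replacement.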
In principle, the same approach can also be extended to data centers, but our implementation mainly relied on AWS Cloud in the first cut.
These are the many different kinds of disruptions we support. Netflix Chaos Monkey only does Auto Scaling groups and EC2-level disruptions and nothing else. But they do availability zone issues and complete region issues. That's what they are doing.
So we have Auto Scaling groups, and the same kinds of things, like regional failure. We will terminate every single Auto Scaling group instance in a US East region. So if you are really resilient, you'll be working from West without any impact.
Then there is the EC2 instance level, as I said: memory, process, EBS, network failures. And the ECS container service, the ElastiCache service, S3 buckets. If you block access through the S3 bucket policy, what happens? You cannot access the bucket from US East. So does your application still work from US West? Or how does the user work?
Many people may not even be testing those kinds of scenarios. They think that S3 is 99.99 percent available and will be there all the time throughout the year.
Things like RDS. What happens when you fail over from a primary to a secondary server? How much of your application will be impacted?
We ourselves have tested some of the open source products like Consul, which kept on losing its quorum when a master failed. We were struggling to simulate it, but once we had CloudDetour, we ran a 72-hour endurance test. During that time, every two hours, we kept on randomly terminating one instance.
At least one time in three, the master was terminated, and we had plenty of logs that showed the behavior: the master was terminated, and the cluster couldn't form the quorum. And we were able to work with the vendor by providing those results.
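The shape of that endurance run can be illustrated with a small simulation. This is a simplified sketch, not the actual test: the node names are made up, the terminated node is assumed to be replaced under the same label, and the master is assumed to stay on the same node, which a real cluster would not guarantee.

```python
import random
from typing import List, Tuple


def endurance_plan(duration_hours: int = 72, interval_hours: int = 2) -> List[int]:
    """Hours (from the start) at which one instance is terminated."""
    return list(range(0, duration_hours, interval_hours))


def simulate(
    nodes: List[str], master: str, plan: List[int], rng: random.Random
) -> Tuple[int, List[Tuple[int, str, bool]]]:
    """Pick one random node per round and record whether it was the master."""
    master_hits = 0
    log: List[Tuple[int, str, bool]] = []
    for hour in plan:
        victim = rng.choice(nodes)
        hit = victim == master
        if hit:
            master_hits += 1
        log.append((hour, victim, hit))
    return master_hits, log


plan = endurance_plan()  # 72 hours at 2-hour intervals gives 36 rounds
hits, log = simulate(
    ["node-1", "node-2", "node-3"], "node-1", plan, random.Random(2017)
)
print(f"{len(plan)} rounds, master terminated {hits} times")
```

With three nodes and a uniform random choice, the master is expected to be hit in about a third of the rounds, which is consistent with the "one in three" figure quoted in the talk if the cluster had three server nodes; the talk itself does not state the cluster size.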
Without an automated tool like this, it wouldn't be possible to do this kind of test comprehensively.
So some final thoughts. We have piloted this with a number of applications, more than 50 in the last several months. We installed the tool, and the applications came and subscribed for disruptions. And those applications found tons of things they otherwise wouldn't have uncovered.
For example, teams thought, "Okay, I have my application. When something goes wrong, my machine will be terminated, a new machine will come up, and everything will be happy." So when we introduced a disruption where we keep the VM or the EC2 instance but go and shut down the processes inside it, the development teams had not thought of that as a situation.
They always built for, "If it goes, my whole VM goes down, and everything will be happy when the new VM comes up and it bootstraps with all the processes." So those are some examples.
And the other one is all about the degradation of services, for example, access to third-party APIs or access to an S3 bucket. If it is broken, you have means of handling that. But what if we severely degrade the path from your machine to any API call, so that it is barely working?
So we introduce situations like that, which are more real-life-like situations, and we as a team are learning a lot.
And now when teams build solutions, it goes into our design feedback. So in your DevOps culture, you have that feedback from the ops saying, "Oh, these are the things that you need to take care of." And some of those are making it back into the design book so the applications can effectively take care of them.
Do we have... I think we have no time? Okay. I think we don't have time for questions, but thank you all for attending.
Thank you to the speakers.