DevOps at Capital One: Focusing on Pipeline and Measurement
In my talks at DevOps Enterprise Summit in previous years, I spoke about starting and scaling DevOps at Capital One and about the importance of open source, open technology, and innovation in DevOps.
This year, I will present Capital One’s journey of maturing in DevOps and Continuous Delivery. My presentation will cover our current areas of focus: Delivery Pipeline, Flow and Measurements. I will also share some of the problems we faced and what we did to solve them.
Topo Pal
My name is Topo Pal. As he mentioned, I am a Senior Engineering Fellow at Capital One, which basically means nobody reports to me, and I can do whatever I want.
A dream. Yes, a dream job.
So Capital One. Before I go on to my presentation, I need to talk about Capital One. We have millions of accounts at Capital One, business units including card and bank and all the other financial services. We are one of the largest digital banks in the U.S., and number one as far as InformationWeek's Elite 100 companies go.
And we are about 20 years old. I stress this fact because if you compare us with the other banks in the U.S. and how old they are, we are a startup in the banking industry. And that's what has made us very agile in nature from the get-go.
We believe that we have a different DNA. We build our own software, so we have about 8,000 engineers at Capital One. We build on public cloud. That's interesting because being a bank and being on the public cloud don't go together very well. It's very scary, but we actually do.
We build with microservices architecture, meaning that we build components that are independent of each other, managed by individual teams.
We firmly believe in open source and open-source technologies. We not only consume open source across the whole enterprise, but we also contribute to open source.
And we believe in DevOps, security, and continuous delivery. I put DevOpsSec because every time I talk about DevOps, I see that security is missing. But leaving security out in the banking industry is very scary. So we put DevOps and security together. In daily terms, we use the term DevOps, but for us it really means DevOps with security.
My personal journey: I joined Capital One about six years ago as an enterprise architect, taking care of our SOA architecture and all that good stuff. Did a lot of PowerPoint presentations in my life, but before that, I was a developer, and I'm still a developer by heart.
I created a strategy around DevOpsSec, and I acted as a DevOps evangelist across the enterprise. And then I moved on to the Shared Technology Group. I'm the product manager of our continuous delivery tools platform, which starts with GitHub and ends with Chef, Ansible, and all our cloud tools.
I'm still a DevOps evangelist. And I'm a core contributor and the community manager of Hygieia, our first open-source product, which is basically a DevOps dashboard. There are a lot of users around it, and I'm trying to build a stronger community around Hygieia.
And this is what it is. We actually won the Open Source Rookie of the Year 2015. I'm very proud of that.
Our Agile and DevOps transformation: from waterfall to Agile, manual build to automated build, manual deployment to automated deployment, manual test to automated test, data center to public cloud, closed source first to open source first. This all happened in a short span of five years.
It sounds like it's a lot of years, but when I look at the work that has been done over the last five years, it looks like it was yesterday that we were fully waterfall and doing the classic sense of delivery and throwing things over the wall to ops and QA and security. And then security raises the red flag: "Stop. You cannot go to production because you have some kind of vulnerability in your source code."
There are a few other key things that have happened. We were mostly outsourced at that time, five years ago, about 80/20. Now we are mostly insourced. It's just the reverse of it.
Vertical silos like dev, QA, ops, and security became product-based teams that take care of the full life cycle of the product.
Dev, ops, QA, RM: we had developers, ops engineers, QA analysts, and release managers, and they all became engineers. So we have engineers right now. Some engineers write application code, some write infrastructure code, some write test code, and some build tools around release automation, and they also write code.
I presented our journey at DOES 2014 in San Francisco. It was about building out automation steps. And then, in 2015, I talked about scaling DevOps and open source and cloud and innovation across the enterprise. In '16, I talked about measuring and improving and creating more of a maturity model around DevOps. And I'm going to carry on the same thread in this talk, too.
A typical DevOps success story for our organization: code commits went from random five years ago to hundreds per day. Integration went from monthly to about every 15 minutes. Deployment went from manual to automated. And then for QA and performance, we used to deploy monthly. Now it's about four a day. It varies from team to team. Some teams do more, some teams do less, but that's an average.
Production deployment went from monthly or quarterly to once a sprint. And testing went from fully manual to automated.
Now, this is the area that we started focusing on over the last 18 months or so. How can I make QA, performance, and production deployments more frequent than these numbers? And I'm going to talk about that.
So we asked our developers what DevOps means, and everybody came up with the same answer. It is this. Some talk about automation, some talk about writing infrastructure code, some talk about building servers quickly, some talk about test automation. But overall, everybody has a good idea of what DevOps means, and the answer is, it depends on whom you ask.
So when we talk about DevOps in Capital One these days, we don't talk about DevOps in general. What we talk about is this: the goal of DevOps. So we moved from defining what DevOps is to why DevOps is important to us, which basically means delivering high-quality working software faster.
Now, there are three words or phrases in this sentence that are important to us: high quality, working, and faster.
High quality means no security flaws, no legal flaws, minimum defects, and all that good stuff. Working means end-to-end, it really works. And then faster. That's the biggest question. How fast is faster? As soon as possible?
And can you do it faster or better? There's a challenge between doing faster versus doing better. But we'll see whether they mean the same thing or one complements the other, or how it works.
So we looked at DevOps surveys, too, and you all know about them. If you don't, then look up the DevOps survey by DORA and Puppet Labs. It points to this particular topic of faster versus better and all the metrics around it.
And we found that everybody talks about faster and better together in the DevOps world. And I started thinking, I might have heard this before when I was a kid studying physics, and I found that it is not something new that has been discovered. Bernoulli was talking about it long ago, before we were even born.
And the principle is, if you constrict the flow of a fluid, then you can actually increase the speed and lessen the pressure. And that's the diagram. I don't remember it anymore, even though I have a PhD in physics. That was a long time ago.
But that's basically it. If you have a continuous flow, and if you constrict it, then you can actually increase the speed of the flow and, by the way, you can actually reduce the pressure.
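For reference, the physics behind the analogy can be written down compactly. This is the standard textbook form for steady, incompressible flow through a horizontal pipe; the symbols are assumptions introduced here (cross-section $A$, speed $v$, pressure $p$, density $\rho$) and do not come from the slide.

```latex
A_1 v_1 = A_2 v_2,
\qquad
p_1 + \tfrac{1}{2}\rho v_1^2 = p_2 + \tfrac{1}{2}\rho v_2^2
```

If the pipe narrows so that $A_2 < A_1$, the first relation gives $v_2 > v_1$, and the second then gives $p_2 < p_1$: higher speed and lower pressure in the constricted section.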
And you can draw a parallel to that by saying that, hey, if you have a pipeline from commit to deployment, then you can actually control the flow or increase the flow by making sure that you are delivering a smaller chunk at a time, which is a basic principle of Agile DevOps.
You deliver a smaller chunk at a time, thereby increasing the speed. Or, by the way, developers are less pressured. I'm very happy if I can actually commit my code, see it in production, and go home and not think about it, as opposed to delivering something in production that I coded a few months ago and I don't remember, and some defects pop up, and I go like, "What happened? Somebody must have coded after I coded, right? So it's not my fault anymore."
So again, the same principle, commit to deploy, and it's called a pipeline. I have seen many pipelines over the period of five years, not only from Capital One, but also from other organizations. And I can actually put them in three broad categories.
One is this. You can see branching strategy, parallel branches going nowhere almost. Kind of seems like they're getting deployed together, but they actually don't. They all develop on parallel branches, and after a few months, they think, "What happened? How do I merge the code and deploy to production?"
That's one kind.
This is another kind. There are complicated pipelines, very interdependent components put together. Each has its own code repositories and branches, and you really cannot figure out where the pipeline starts and where it stops.
And this is my favorite one. Everybody has a pipeline that needs people to actually manage the pipeline. There are holes in the pipeline that leak, some test cases fail, some builds fail, and you actually need people to fix the pipeline.
So what we focused on is pipeline design, measurement, and improvement.
First, pipeline design. A pipeline must have 16 gates, so we listed 16 criteria that a pipeline must have. I call it the Ten Commandments of DevOps in hex.
They go from source control, branching strategies, static analysis, code coverage, vulnerability scan, open source scan, artifact version control, auto-provisioning, immutable servers, integration testing, performance testing, build/deploy testing automated for every commit, automated rollback, automated change order, and then zero-downtime release. Whether you do it by canary or blue-green or something else, it doesn't matter. When you deploy to production, your production system, from the customer's perspective, has zero downtime.
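One way to picture the gate list is as a checklist that a pipeline definition is compared against. The following is a minimal sketch, not Capital One's tooling: the gate identifiers are paraphrased from the list above, and the function names `missing_gates` and `is_compliant` are hypothetical.

```python
# Gate names paraphrased from the list in the talk; the identifiers are
# illustrative, not an official Capital One schema.
REQUIRED_GATES = (
    "source_control",
    "branching_strategy",
    "static_analysis",
    "code_coverage",
    "vulnerability_scan",
    "open_source_scan",
    "artifact_version_control",
    "auto_provisioning",
    "immutable_servers",
    "integration_testing",
    "performance_testing",
    "build_deploy_test_on_every_commit",
    "automated_rollback",
    "automated_change_order",
    "zero_downtime_release",
)


def missing_gates(implemented):
    """Return the required gates a pipeline lacks, in checklist order."""
    return [gate for gate in REQUIRED_GATES if gate not in implemented]


def is_compliant(implemented):
    """A pipeline passes only when no required gate is missing."""
    return not missing_gates(implemented)
```

Calling `missing_gates` with the set of gates a team has actually implemented gives a concrete to-do list rather than a pass/fail verdict, which fits the idea of using the gates as design criteria.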
Pipeline measurement. So this is another area that we started focusing on, and we reached out to DORA, and we did some initial surveys. We talked about the questionnaire and the whole survey, and it got improved. And I think we contributed some of the questions in the DORA survey, too.
We looked at that, and we also used Hygieia. If you have not seen this product, please take a look at github.com/capitalone/Hygieia. It actually allows you to measure your cycle time and points out the stoppage time in between.
So we looked at this, and actually, this is a Hygieia screen where you can actually point out your stopping points between two stages. The idea is that you do not need to know why that stopping point is. You just ask the developers, "Hey, why are you stopping for, for example, five hours and 21 minutes on an average between my two stages? Maybe pre-prod and prod or pre-performance and performance." And ask the developers to reduce that.
There are many reasons why that stopping point exists, and we want to focus on that.
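The stoppage-time idea can be illustrated with a few lines of code. This is a sketch under stated assumptions, not Hygieia's data model or API: it assumes each stage of one release is recorded as a hypothetical `(name, started_at, finished_at)` tuple, and the timestamps are made up to reproduce the five hours and 21 minutes mentioned above.

```python
from datetime import datetime, timedelta


def stage_waits(stages):
    """Return (from_stage, to_stage, idle_time) for each adjacent pair.

    `stages` is a list of (name, started_at, finished_at) tuples in
    pipeline order.
    """
    waits = []
    for (prev_name, _, prev_end), (next_name, next_start, _) in zip(stages, stages[1:]):
        # Idle time is the gap between one stage finishing and the next starting.
        waits.append((prev_name, next_name, next_start - prev_end))
    return waits


def longest_wait(stages):
    """Return the hand-off with the most idle time."""
    return max(stage_waits(stages), key=lambda wait: wait[2])


# Hypothetical timestamps for a single release.
release = [
    ("build", datetime(2017, 1, 9, 9, 0), datetime(2017, 1, 9, 9, 15)),
    ("pre-prod", datetime(2017, 1, 9, 9, 20), datetime(2017, 1, 9, 10, 0)),
    ("prod", datetime(2017, 1, 9, 15, 21), datetime(2017, 1, 9, 15, 40)),
]
```

Here `longest_wait(release)` points at the hand-off from pre-prod to prod, which is the kind of gap a team would then be asked to explain and reduce.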
So between the studies, we found two areas of opportunity for us. One is branching strategy, and one is process.
The branching strategy is kind of important because we recommended trunk-based development, and everybody freaks out when you say that, all the development teams, because they do have competing requirements from different business units.
But we do recommend trunk-based development. Not many people can do that. So the other option is this equation.
Now that's crazy. What it basically means is that if you know that you are going to deploy so many times in production, then your merge time to your trunk or the release branch, or whatever you call it, has to follow this particular equation.
So, for example, if my CI time is about an hour, if my continuous delivery time is about three hours, and my deployment time to production is about three hours, and my goal is to release three times a day, then you make sure that you merge your code base to your release branch or trunk every three hours.
Now, you decide how frequently you want to deploy to production, and you already know your CI time, CD time, and deployment to production time. So you figure out how frequently you want to merge to your master. So that's the guideline we try to provide to our development teams.
Pipeline improvement. We talked about processes. And processes are a pillar in the DevOps pipeline because there exists something called a CAB, a Change Approval Board.
I don't know how many people are there because when I see the meeting, it's a huge list, and there are some groups also involved. So on the call, there are like, I don't know, 57 or 65 people saying yes and no.
And I actually asked one VP, "I know that you approved this change order. Do you know what it is?"
And he goes, "No."
I said, "Then why do you approve it?"
He goes, "Just a process."
I have not seen any single VP or senior leader getting penalized for breaking production. If that doesn't happen, where is the accountability, and why do they even come into the picture?
But there exist some things that we actually have to comply with, which brings us to rules and regulations. And we talked to the audit and our risk management office, and these are the risks that they actually pointed us to, and I think these are real risks.
The risk is intentional damage, meaning somebody actually wants to deploy to production and create some problems for our company.
Then there is unintentional damage. And being a developer, I have never done that. I've never produced a bug. It never went to production. Everything was cleaned up before it went to production. And that is the risk everybody is very focused on.
In order to go faster, I am committing faster, I'm doing builds faster, I'm deploying to production faster. But did we test everything that we needed to? Did we do the security compliance check? Did we scan our code to make sure that we are using libraries that are really open source and not commercial, and all those things?
And then untested code in production. Now, this is a critical area for everyone. How do I prove that everything is tested? I don't know. We just wrote enough test cases, I guess.
So, what's your code coverage? I don't know. We have about 1,000 tests, but does it cover everything? I don't know. Everything passed. That I know.
So that's a very interesting discussion altogether, whether we have enough test cases. And how many test cases do I run in one CI run? Do I run all the thousand test cases? Do I run 50? Do I run 10? What is the proof that my code is actually working or is going to work in production?
So these risks are real. And I think instead of manually checking all these risk controls, and everybody jumping onto the call saying, "Yes, I agree. Yes, I agree," there is a better way to do that. The better way is actually to use DevOps principles to build some of these risk mitigations into the pipeline.
So our hypothesis is that DevOps, security, and CI/CD provide better control around risk and security mitigations.
We have come up with about 30 practices to satisfy audit and compliance, and they are actually built into the pipeline.
If everything is source code, which by definition it is in the DevOps world (my application is source code, my infrastructure is source code, my test cases are all source code), then no one needs access to production.
So the first thing the audit and risk management office asked us was, "So you have developed this pipeline. You have all automations in place. Who has access to production servers?"
And my answer is, "No one."
And they go, "What do you mean? Hold on, hold on. Come back. How many people have access to production?"
I said, "You know what? If you are mature in DevOps, you actually don't need access to production. Having access to production means that you are not that mature. So our goal should be that final state where we are fully mature: a fully automated pipeline that goes from code commit all the way to production. Nobody needs access to production."
And when you say that, everybody gets so scared because now you do not have the key to your home.
So, our whole idea is to have these 29 principles built into the pipeline with the assumption that nobody has access to production boxes. However, there will be times when you need access to production because your pipeline, let's say, takes one hour, and you have a production bug that needs to be fixed immediately. And somebody knows that if they can get into the production box, they can change some configuration to make it work.
At that time, you actually break glass, which basically means you check out temporary access to your production box, get in, fix it, come out. And, oh, by the way, as soon as you get into the production box, every keystroke of your keyboard is going to get logged and will be analyzed later on from a detective-control perspective. So that's our hypothesis for emergency break glass.
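The break-glass idea, time-boxed access in which everything done is recorded for later review, can be sketched as follows. This is an illustration only, not Capital One's implementation: the class `BreakGlassSession`, its fields, and the 30-minute window used in the usage note are all hypothetical, and it records whole commands rather than individual keystrokes to keep the example short.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class BreakGlassSession:
    """Temporary, audited production access (all names are illustrative)."""

    user: str
    reason: str
    granted_at: datetime
    ttl: timedelta
    audit_log: list = field(default_factory=list)

    def is_active(self, now):
        """Access is valid only inside the granted time window."""
        return self.granted_at <= now < self.granted_at + self.ttl

    def run(self, command, now):
        """Record the command for later review; refuse once access expires."""
        if not self.is_active(now):
            raise PermissionError("break-glass access has expired")
        self.audit_log.append((now, self.user, command))
```

A session granted for, say, 30 minutes accepts and logs commands inside that window and raises `PermissionError` afterwards, so the access cannot quietly turn into standing access, and the log is what a detective control would analyze later.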
And we actually came up with this during one of the DevOps Enterprise Forums two or three years ago. There was a white paper, "An Unlikely Union: DevOps and Audit," which actually points to some of these principles. I'm a very proud participant of that forum, and we are still working through it in much more detail.
So this is what it looks like, our pipeline with all the 29 controls built in right now. Right now, we call this a "certified pipeline," where all these points are manually checked.
Where we are going with this is we are going to fully automate this so that every time there's a release, you can actually automatically check for all these 29 things, requirements that you need to satisfy your audit and compliance.
And we are going to do that by using our own open-source tool, Hygieia, which collects a lot of data along the pipeline tools and actually proves that when you release, all these 29 requirements are satisfied.
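Automating such checks amounts to evaluating each control against data collected for a release. The sketch below is hypothetical throughout: the talk does not enumerate the 29 requirements, so the three controls, the record fields, and the function names are made up for illustration and do not reflect Hygieia's data model.

```python
# Hypothetical controls; the talk does not enumerate its 29 requirements.
CONTROLS = {
    # The change was reviewed by someone other than its author.
    "peer_reviewed": lambda r: r["reviewer"] not in (None, r["author"]),
    # At least one test ran and none failed.
    "tests_passed": lambda r: r["tests_run"] > 0 and r["tests_failed"] == 0,
    # The vulnerability scan reported no critical findings.
    "no_critical_vulnerabilities": lambda r: r["critical_vulnerabilities"] == 0,
}


def audit_release(release):
    """Evaluate every control against the data collected for one release."""
    return {name: check(release) for name, check in CONTROLS.items()}


def failed_controls(release):
    """Return the names of controls that the release does not satisfy."""
    return sorted(name for name, ok in audit_release(release).items() if not ok)
```

A release is cleared only when `failed_controls` returns an empty list, and the per-control results from `audit_release` can be kept as the evidence shown to audit.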
Result. Production release was once a sprint, as I said on an earlier slide; now it's one-plus a day. And it's not for everyone. I should be very transparent about this. We have about 100 application teams that are going through these automated checks and pipelines. They release one-plus per day.
And I have taken some numbers randomly. The number of applications with release automation is about 20-plus. That was actually last November. Right now it's about 100.
Maximum number of releases in one day for one application has been 34. So imagine some application team is actually going through this automated pipeline, fully compliant, no need of separation of duties, et cetera, and releasing to production 34 times.
Our goal. So this was with segregation of duties, which has transformed into doing the whole deployment end-to-end without segregation of duties. That was our goal last November, and this late winter, we have been successful at releasing to production without segregation of duties. And we can actually show proof that there has been no violation of separation of duties along the way, which addresses the initial risks: there's no unintentional damage, there's no intentional damage, and no untested code in production.
So we are going to open-source that pipeline model again, as I said.
And then we do also have a fork, an enhanced version of LGTM. I don't know how many people know about LGTM. Looks good to me. It's a lightweight code review tool that was open source, and they stopped it, so we are taking it over. We are adding more features, and some of these compliance rules will be actually embedded into the tool. And we are going through our open-source process to actually open-source this.
The most difficult part in open-sourcing a tool, in my experience, has been how to name it.
And this is my favorite slide: all of Chuck Norris's change controls are full cycle, and they are always approved. That's where we need to go, and that's my favorite T-shirt also.
Thank you very much.
Q&A
Any questions? I have three minutes. Three minutes, so I can answer three questions.
Q: Are your compliance tools built in-house? Are they all built in-house, or are they...?
A: Yes, they have been built. We have our homegrown release automation tool that actually makes sure that the pipeline can be certified, and all these 29 criteria have been embedded. What we are working on is to automate those checks so that I can just look at the data and make sure that we already meet those.
Q: How do you work with regulation, satisfying regulation?
A: So it took us about 18 months. We initially started by meeting them and telling them what we were going to do, and they got scared. And then they came back with their viewpoint, and we had multiple meetings. And then we all came to the conclusion that we agree on those three risks. Whether we apply the legacy way of doing things or the new way of doing things, we compared both sides, and they said maybe the new way is better because now I can actually see the data and make myself comfortable.
Q: When you say that you don't have access to production, it's usually the production data which breaks the application. So how do you get around that, the access, the data...?
A: So if you look at this, the continuous delivery pipeline and deploying code to production multiple times, it's very infrequent that the bug is due to data. That's because the data is kind of stable. We are not adding new columns to our customer table. That's been there. I don't know when it was last touched.
So it's always the infrastructure, the configuration, the application code that is changing more frequently, and there's more chance that a bug will be introduced in that part of the pipeline than in the data.
So yes, somebody may need access to some data, but that can be controlled, and only a few people can access that, and you need separation of duties and all that. But as far as the application is concerned, nobody needs access to production boxes.