DevOps at Capital One: Focusing on Pipeline and Measurement
In my previous years’ talks at DevOps Enterprise Summit, I spoke about starting and scaling DevOps at Capital One and about the importance of open source, open technology, and innovation in DevOps.
This year, I will present Capital One’s journey of maturing in DevOps and Continuous Delivery. My presentation will cover our current areas of focus: Delivery Pipeline, Flow and Measurements. I will also share some of the problems we faced and what we did to solve them.
Topo Pal
My name is Tapabrata Pal. I go by Topo. Don't call me by Tapabrata, because you may not get an answer back because I don't know who you are. My Twitter handle is @tapopal, so please tweet.
And I can't tell you how honored I am to be on this stage for the third time. This is my third year, and I'm loving it. It's like homecoming for me. I see old faces. Just love it. But I remember how scared I was on this stage three years ago, and I'm still scared.
About Capital One. Most of you know about Capital One: "What's in your wallet?" How many people in here have a Capital One product in their wallet? Thank you. Awesome.
Most of you know Capital One is a credit card company with millions of accounts, one of the largest digital banks, one of the top banks in the U.S. right now, number one in InformationWeek's Elite 100, and we are just about 20 years old. And this is pretty significant because if you look at our nearest competitors, they're more than 100 years old. So we are kind of a startup in this whole industry, and we want to stay just like that because we feel that we have a different DNA, in the sense that we build our own software.
Our engineers write their own software. We produce it. We send it to customers such as you, and you use it. We build on public cloud. That's pretty big. Being a bank, using public cloud is a big step.
Microservices: all of our latest product suites are built on microservices architecture.
And using open source. Five years ago, Capital One was a closed-source-only company. Right now, it has to be open source first. And we love DevOps, security, and continuous delivery. We use them.
My personal journey at Capital One: I joined Capital One about six years ago as an enterprise architect, and then I started shifting to... Gene loves architects. So I started shifting to DevOps and DevSecOps. I created the strategy around DevSecOps for the whole enterprise. I acted as a DevOps evangelist, helping stand up our shared tools platform.
Today, I belong to our shared technology group as a product manager of our continuous delivery tools platform. I'm still a DevOps evangelist. Between now and then, what changed is that at that time, I was driving some changes, and today I'm learning new stuff. And I'm also a core contributor and community manager of our flagship open source product called Hygieia Dashboard. Thank you.
Talking about Hygieia: as part of our DevOps transformation, we always wanted a dashboard to use. We could not find one, so we wrote it ourselves. As I said, we build our own software. So one of my colleagues, Amit Malkin, and I started designing and writing some tools in a scrappy way, and the result was Hygieia. We open sourced it during OSCON 2015, and since then, it has grown bigger day by day. We actually won Open Source Rookie of the Year 2015 this year.
And if you like it, please go ahead and use it. And let me suggest this. If you like it, and if you are developing it, and if you want to develop it full time, please come and talk to me in the absence of your manager.
And I'll up the stake. I'll up the stake a little bit. If you are an open source contributor, if you love open source, if you are passionate about open source, you have an open source product that you are managing today, but if you want to do that full time, please talk to me in the absence of your managers. Thank you.
Talking about Capital One's DevOps journey. Within the last five years, this is what has happened: from waterfall to agile, from manual build to automated build, manual deployment to automated deployment, manual test to automated test, data center to public cloud, closed source first to open source first. These are all common, but I think the most impactful changes that have happened are these.
Number one, mostly outsourced to mostly insourced. That, I think, is key to transforming your company into a technology company today.
Second, vertical silos to product team. There's no DevOps, QA, release engineer. No, nothing like that. It's all product team. The whole team manages the whole thing from start to end.
And instead of DevOps, QA, and release management roles, we all have engineers. Everybody writes code. Developers write application code, ops write infrastructure code, QA engineers write test code, and release managers actually write release orchestration code.
Now, as I said, this is my third year on the stage. In 2014, I talked about building automation steps for DevOps transformation. 2015 was scaling of DevOps using open source, cloud, and lots of innovation across the company. And this year, we are going to concentrate on measurement, improvement, and how to mature further.
A typical DevOps success story is what I presented last time. It went something like this: code commits went from random to hundreds per day; integration from monthly to every 15 minutes; deployment from manual to automated; deployments to QA and performance environments from about two per month to about four per day; production releases from monthly or quarterly to once per sprint; and testing from manual to automated.
Now, this is all good. Moving from waterfall and manual steps to fully automated steps really is a big achievement. But we are not quite happy with these numbers.
As Gene was suggesting, the signature of a good lean company is the number of deployments, frequency, and the speed. We are not happy with our own speed.
And let me put a disclaimer right here about the deployment. For us, a deployment means a change of source code into production. It does not include configuration changes. It does not include styles or contents or all those things. It does not. It's just pure, simple code deployment to production. It does not matter if there are 10 servers, or one, or 100. It's still one deployment. So given that, I think these numbers are pretty significant.
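The counting rule in that disclaimer can be sketched in a few lines. This is a minimal illustration, not Capital One's tooling: the `ReleaseEvent` record, its field names, and the `"code"` / `"config"` / `"content"` labels are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReleaseEvent:
    app: str
    day: date
    kind: str          # hypothetical labels: "code", "config", "content"
    server_count: int  # recorded, but deliberately ignored when counting


def count_deployments(events):
    """Count code deployments per (app, day).

    One release event counts once, however many servers it touched.
    Configuration, style, and content changes are not counted.
    """
    counts = Counter()
    for event in events:
        if event.kind == "code":
            counts[(event.app, event.day)] += 1
    return counts


if __name__ == "__main__":
    day = date(2016, 11, 7)
    sample = [
        ReleaseEvent("cards", day, "code", 10),
        ReleaseEvent("cards", day, "code", 100),
        ReleaseEvent("cards", day, "config", 10),
    ]
    print(count_deployments(sample))
```

Keeping `server_count` on the record while ignoring it in the count makes the rule explicit: fleet size is data, not a multiplier.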
So 2016 is "What's in your pipeline?" matching with "What's in your wallet?" We have started asking our teams a simple question: "What's in your pipeline?"
So we have been doing DevOps security in our organization for about five years. Our engineers know what DevOps means. And guess what? Everybody says the same thing.
To me, if you ask my opinion, I think we need to stop defining DevOps. Instead, we should be asking, why DevOps? I think that produces a better result than trying to define what DevOps means.
In my point of view, DevOps' goal is this: delivering high-quality working software faster. And there are three words that are important here: high quality, working, and faster.
Now, if you look at the first two, high quality and working, nothing changed there. Waterfall has produced high-quality working software for years and years. What changed was this: faster.
Now, we used to do one release per quarter. Now we do one release every day, or every week, or every sprint. Is that fast enough? I don't know. I mean, how fast is faster? Where do you stop? And actually, what is the good measure of being fast?
But it is proven in the industry that the faster you go, the better you get. And these guys have proved it in DevOps Survey. Who has not read DevOps Survey for this year? Everybody read it. That's awesome.
So they have proven that if you go faster, it becomes better. But then I was thinking, there has to be scientific proof for that. Now, Nicole will stand up and scream at me, saying that this is scientific, and yes, it is. But being a physicist, I need some other kind of proof. And I started thinking, where did I actually see this proof somewhere when I was a little kid studying physics and all that? Guess what? I think Bernoulli proved it in the 18th century.
It was Bernoulli in the 18th century that actually proved that faster is better. What he proved was if you constrict the flow of a fluid, it increases speed and it lessens the pressure.
So think about a pipeline and you are sending water. You constrict the flow. The velocity will increase, but that's the secondary result of Bernoulli's theorem. The important thing is that it lessens the pressure, meaning that if you constrict the flow, meaning smaller batch size, you move faster. And not only that, the internal pressure in your delivery teams decreases.
Well, that was a bit stretched, but I just wanted to give that out. I thought that was funny.
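For readers who want the physics behind the analogy, here is a small numeric sketch. It assumes an ideal, incompressible fluid in a horizontal pipe; the function name and the sample numbers are illustrative only. The speed-up comes from continuity (A1·v1 = A2·v2), and the pressure drop then follows from Bernoulli's equation (p + ½ρv² is constant along the flow).

```python
def constricted_flow(v1, p1, area1, area2, density=1000.0):
    """Return (v2, p2) after a horizontal pipe narrows from area1 to area2.

    v1 in m/s, p1 in Pa, areas in m^2, density in kg/m^3 (water by default).
    Assumes ideal, incompressible, steady flow with no height change.
    """
    if area1 <= 0 or area2 <= 0:
        raise ValueError("areas must be positive")
    v2 = v1 * area1 / area2                        # continuity equation
    p2 = p1 + 0.5 * density * (v1 ** 2 - v2 ** 2)  # Bernoulli's equation
    return v2, p2


if __name__ == "__main__":
    # Halving the cross-section doubles the speed and lowers the pressure.
    v2, p2 = constricted_flow(v1=1.0, p1=200_000.0, area1=2.0, area2=1.0)
    print(f"speed: {v2} m/s, pressure: {p2} Pa")
```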
So we started looking at our pipeline. How do we actually take a code commit through a pipeline all the way to a production deployment?
Now, as I said, we looked at some sample pipelines inside and outside of Capital One just to understand what pipelines look like. Now, I have categorized these pipelines into three ways.
One is this. Sorry, how do I go back? All right. You remember these pipelines? A set of pipelines that run forever, right? Where do they meet? They don't. It appears that they'd meet somewhere, but that's an optical illusion. Mathematicians will say that they meet at infinity, but that's not good for us. This is called parallel branching. Multiple branches just keep moving, with no idea where to go.
This is the second category of pipeline. You cannot tell where it starts and where it ends. This is like a complicated pipeline. You know that code commit is happening somewhere, but where is it going? Nobody knows.
I love this one. It needs an army to manage this pipeline, and we see that every day. And if you ask me, I have seen a pipeline where code commit happens, build happens, and it goes directly to deployment. That's not a pipeline for me. If your pipeline does not contain information security scans, et cetera, and automated test cases, it's not a pipeline, period.
To summarize, for this year, or 2016, our goal was to look at our pipeline design, measure and improve the pipeline, fix bottlenecks.
The first one was pipeline design. So we defined for the whole enterprise, every pipeline must have 16 gates: source code version control, optimum branching strategy, static analysis, more than 80% code coverage, vulnerability scans, open source scans, artifact version control, auto provision, immutable servers, integration testing, performance testing, build/deploy testing automated for every commit, automated change order, zero downtime production release through blue/green or canary, and feature toggle.
Now, these 16 gates are used to measure each and every product team's movement through the DevOps transformation, and it is seen by the CIO at the top level. Now, these are 16 gates. I call it Ten Commandments in hex.
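A gate scorecard of this kind can be sketched as a simple checklist comparison. This is an assumption about how such a report might look, not a description of Capital One's actual tooling. The talk counts 16 gates; the `GATES` list below copies only the gate names that appear in the transcript, and the short identifiers are made up for the example.

```python
# Gate names as they appear in the transcript; identifiers are illustrative.
GATES = [
    "source code version control",
    "optimum branching strategy",
    "static analysis",
    "code coverage above 80%",
    "vulnerability scans",
    "open source scans",
    "artifact version control",
    "auto provisioning",
    "immutable servers",
    "integration testing",
    "performance testing",
    "build/deploy testing automated for every commit",
    "automated change order",
    "zero downtime release (blue/green or canary)",
    "feature toggle",
]


def gate_report(passed):
    """Compare the gates a team has passed against the enterprise list."""
    passed = set(passed)
    missing = [gate for gate in GATES if gate not in passed]
    return {
        "passed": len(GATES) - len(missing),
        "total": len(GATES),
        "missing": missing,
    }


if __name__ == "__main__":
    team_gates = ["source code version control", "static analysis"]
    report = gate_report(team_gates)
    print(f"{report['passed']}/{report['total']} gates passed")
    for gate in report["missing"]:
        print("missing:", gate)
```

A per-team report like this is what would roll up into the top-level view mentioned above.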
Pipeline measurement, second. So we looked at multiple ways to measure our pipeline and our team's DevOps performance.
So Jez Humble showed me an earlier version of a survey that he was designing last year at this conference. We looked at it. I went back, I talked to my folks at Capital One, then we decided we had to run some POCs, and the POCs turned out well. And we actually contributed to, or influenced, the survey a little bit in terms of security scanning and test data management. And those are included in the survey.
So we ran the survey across the whole enterprise, and we got some good results. A few things came out of the survey. One is that we have a way to go. And the second thing that came out that was very encouraging for us, that a few of our top-performing application or product teams are at the industry level. So that was very encouraging. That proved that we are moving in the right direction.
The second was our own Hygieia. So in Hygieia, you actually can measure the flow rate, or the commit moving from your commit stage all the way to production. And a sample screen looks like that. You can actually see that the commits are moving through your pipeline from commit, build, dev, all the way to production. And you can actually see the stoppage time.
So our theory is, instead of trying to understand how to speed up, try to reduce the wait time so that you can increase the speed finally. Because you never know where the wait time is and which wait time is meaningful to you.
For example, I have seen developers trying hard to reduce their build time from 25 minutes to 15 minutes while the test cases run for one hour, right? So do you spend a lot of energy and time on reducing the build time, or do you reduce the wait time and speed up testing?
Now, let the developers decide which wait time to reduce. You just make that transparent through a dashboard, or some report, or whatever. That's the whole goal.
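The wait-time idea can be sketched as follows. This is not how Hygieia computes it; it is a minimal illustration that assumes each commit carries a start and finish timestamp per stage. The stage names and sample times are made up, apart from the 25-minute build and one-hour test run mentioned above.

```python
from datetime import datetime


def breakdown(stages):
    """Split one commit's journey into run time and wait time.

    stages: ordered list of (name, started, finished) tuples.
    Returns a list of (label, duration) rows: the run time of each stage,
    plus the idle wait before every stage after the first.
    """
    rows = []
    previous_finish = None
    for name, started, finished in stages:
        if previous_finish is not None:
            rows.append((f"wait before {name}", started - previous_finish))
        rows.append((f"run {name}", finished - started))
        previous_finish = finished
    return rows


def biggest(rows):
    """Return the row with the longest duration."""
    return max(rows, key=lambda row: row[1])


if __name__ == "__main__":
    t = lambda hour, minute: datetime(2016, 11, 7, hour, minute)
    journey = [
        ("build", t(9, 0), t(9, 25)),    # 25-minute build
        ("test", t(12, 25), t(13, 25)),  # one-hour test run after a long wait
        ("prod", t(13, 40), t(13, 50)),
    ]
    for label, duration in breakdown(journey):
        print(f"{label:20} {duration}")
    print("largest:", biggest(breakdown(journey))[0])
```

In this made-up journey the largest row is the idle gap before testing, which is the kind of thing a dashboard makes visible so that the team can choose what to fix.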
The opportunities that we found through both studies fall into two high-level areas: branching strategy and process.
Now, branching strategy is a very critical one for DevOps success, and Jez Humble will tell you all about it if you give him time.
What we did was form a team to look at our branching strategy and how we can improve it. What we concluded, based on studies and numbers, is that we want trunk-based development. That makes your pipeline flow faster. But it's a tough choice. You cannot tell each and every product team to use trunk-based development. It's not going to go anywhere.
The other option is to give teams some kind of formula that allows them to decide what kind of branching they will do. It doesn't matter whether they are doing feature branching or whatever. As long as they know what their CI time, CD time, and production deployment time are, and they can figure out how many times they want to deploy to production, the formula gives them a number: if you want to do this, then your merging to master has to happen within this time frame. So with that guidance, teams can arrive at a conclusion about branching, and the decision is driven by the developers themselves.
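The talk does not give the formula itself, so the sketch below is only one plausible shape for it, under loudly stated assumptions: deployments are spread evenly over a working day, the pipeline runs serially, and whatever time is left in each deployment slot after CI, CD, and production deployment is the longest a change can wait before merging to master. The function name, parameters, and the 480-minute working day are all hypothetical.

```python
def max_merge_interval_minutes(ci_minutes, cd_minutes, deploy_minutes,
                               deploys_per_day, working_minutes=480):
    """Hypothetical guidance: how often merges to master must happen.

    Each desired deployment gets an equal slot of the working day. The
    pipeline (CI + CD + production deploy) must fit inside that slot, and
    the remainder is the longest gap allowed between merges to master.
    A result of zero or less means the target is out of reach with the
    current pipeline times.
    """
    if deploys_per_day <= 0:
        raise ValueError("deploys_per_day must be positive")
    slot = working_minutes / deploys_per_day
    pipeline = ci_minutes + cd_minutes + deploy_minutes
    return slot - pipeline


if __name__ == "__main__":
    interval = max_merge_interval_minutes(
        ci_minutes=15, cd_minutes=30, deploy_minutes=15, deploys_per_day=4
    )
    print(f"merge to master at least every {interval:.0f} minutes")
```

The point of a formula like this is that the team supplies its own numbers and reads off its own constraint, rather than being told which branching model to use.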
Pipeline improvement. Improving process. Automate the release process and revisit audit and compliance. This is huge.
The core of this bottleneck is what every big enterprise sees every day: CAB, change approval board. Before you go to CAB, you actually have to take pre-approvals. So approvals before pre-approvals, it's like boarding an airplane, pre-boarding and then boarding, right? I don't know. I'm sure it is much more complicated than that.
This year, we worked very closely with our auditors and risk compliance office to understand what that process is with regard to audit and compliance and how to improve it. We had a lot of sessions with our audit and compliance teams. We did a lot of back-and-forth brainstorming, and we came to a set of hypotheses, and I'm going to present those hypotheses to you shortly.
But before that, let me say that we came to a common ground, and the common ground is the risks are real. You can have intentional damage, bad actor scenario. You cannot ignore that. You can have unintentional damage. As a bad coder, I can produce some code that can break the production and allow the hackers to come in. And then untested code in production. I can code, but there were no test cases, so the code went out to the production and broke everything. It's not my fault. The test cases were not there. What can I do?
So as Gene says all the time, there's a better way, and that leads to the set of hypotheses that I'm going to present.
DevSecOps and CI/CD provide better controls, and we can prove that. What we came up with is a set of 30-odd practices that we think can satisfy audit and compliance.
The argument is, if everything is source code, no one needs access to production. For emergency, there has to be a process called break glass. Otherwise, nobody gets into production.
And last year, through this DevOps Enterprise Summit forum, there was a white paper called "An Unlikely Union: DevOps and Audit." Many of these hypotheses actually came out of that white paper. And this year, there is another forum paper coming out, and I think we will all benefit from it. It's pretty good.
The result of doing all this. Now, as I said in one of my previous slides, we are not very happy with our production deployment speed. Now, that has gone up. It used to be once per sprint. Now it is one-plus per day.
And actually, we have many application teams going through this new process, and they are seeing not only one-plus per day; for some, the average is 10-plus per day. Some of the high numbers: the number of applications with this kind of release automation is 20-plus, and the maximum number of releases in one day for one application was 34.
So again, as I said, our deployment is measured in terms of one event of deployment to production, whether it's 10 servers or 1,000 servers, doesn't matter. Whether there was a whole set of new infrastructure built on cloud or not, it doesn't matter. It's still one of such releases.
Now, this was with segregation of duties. That means somebody was actually clicking the button to send the code out to production. So we are not happy with that either.
Our goal is to actually do release automation without the classic segregation of duties. That means our theory is, with all those 30-odd practices, if you can certify, quote unquote, a pipeline like a clean room, then code can go from commit to production without any problem. As soon as somebody touches that clean room or the pipeline, then that flow breaks. It needs to be approved or tested or certified again.
And with that kind of automated control in place, I think we can achieve that. We are running some control tests. I'm not going to declare any success yet because it is still a control test. We are trying to figure out if this really works and if it is scalable.
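The clean-room idea can be sketched with a fingerprint check. This is an illustration of the concept, not Capital One's release tool: it assumes the pipeline is described by a dictionary-like definition, and the class and function names are hypothetical. Only the `hashlib` and `json` standard library calls are real.

```python
import hashlib
import json


def fingerprint(pipeline_definition):
    """Hash a pipeline definition in a key-order-independent way."""
    canonical = json.dumps(pipeline_definition, sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CertifiedPipeline:
    """Tracks the fingerprint of the last certified pipeline definition."""

    def __init__(self, definition):
        self.certified_fingerprint = fingerprint(definition)

    def may_auto_release(self, current_definition):
        """Allow commit-to-production flow only if nothing was touched."""
        return fingerprint(current_definition) == self.certified_fingerprint

    def recertify(self, definition):
        """Record a new fingerprint after the change has been re-approved."""
        self.certified_fingerprint = fingerprint(definition)


if __name__ == "__main__":
    definition = {"stages": ["build", "scan", "test", "deploy"],
                  "coverage_min": 80}
    pipeline = CertifiedPipeline(definition)
    print(pipeline.may_auto_release(definition))   # unchanged pipeline

    tampered = dict(definition, coverage_min=10)
    print(pipeline.may_auto_release(tampered))     # someone touched it
```

Any change to the definition changes the hash, so the automated flow stops until someone certifies the pipeline again, which mirrors the "touch the clean room and the flow breaks" rule described above.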
Now, preparing for this is not easy, right? So we had to build our own tool. Our release management team, which is now a release engineering team, is a set of engineers that actually write code. They actually developed a release engineering tool in-house that allows this kind of thing to happen.
And by the way, we also forked an open source project on GitHub called LGTM, Looks Good To Me. We modified it to build in some of these new capabilities. We wanted to give the changes back to the open source LGTM project, but they did not take them, so we are going to open source it ourselves as a new product.
Coming soon to open source, as I said, a secure and compliant pipeline model. So we're going to actually model up those 30-odd practices and open source it for all of you to use and contribute and comment. So that's number one.
A forked and enhanced version of LGTM. We are going to open source that, too. I think that will satisfy some of the audit concerns that your auditors or risk compliance officers may have.
And we have also open sourced something called Cloud Custodian. It's a new cloud management tool. Well, it's not a cloud management tool per se, but it is a policy enforcer. From a central place, you can enforce your cloud behavior so that you can meet some of the basic audit and security needs. Take a look at it. Search for Cloud Custodian on GitHub, and you'll get to that page.
And this is my favorite T-shirt. It says, "All of Chuck Norris's change controls are full cycle and they are always approved."
I love Chuck Norris. He can make things happen. I think as a collective force, we can make this happen.
And with that, I'll say thank you. Let me know how it goes for the rest of your conference.