Banking on Continuous Delivery
Tapabrata Pal has more than 20 years of IT experience in various technology roles (developer, operations engineer, and architect) in the retail, healthcare, and finance industries.
Over the last six years, Tapabrata has evangelized and led Capital One’s DevOps initiatives. He is currently Senior Director and Senior Engineering Fellow, focused on DevOps and Continuous Delivery at large scale in a regulated environment. Tapabrata is also the community manager of and a core contributor to the Hygieia open source project.
Previously, Tapabrata spent some time in academia doing doctoral and postdoctoral research in the field of solid state physics.
Full transcript
Topo Pal
My name is Topo Pal. I'm from Capital One.
There are two name tags here, and it's on purpose. So, as I said, I'm Topo Pal. I'm Senior Director and Senior Engineering Fellow from Capital One. And we actually did this work together with Jennifer Brady. She's our Director of Technology Governance. So I just wanted to put both the name tags on there just to share the credit. If you like it, all the claps will go to me as well as her.
Jennifer is a former Audit Director, and she's currently IT Governance Director, and she's responsible for both our control automation and data analytics team that kind of governs our CI/CD pipeline. She works extensively with data scientists, data engineers, and developers, and this is kind of a new area that we started working on at Capital One, where we bring audit and data analytics together.
Myself, I'm a core developer. I'm a DevOps evangelist, and that's why I show up at these conferences, to learn more. I'm also the product manager of our shared continuous delivery tools platform that starts with GitHub, goes all the way to Chef, Ansible, and so on and so forth. I'm also the creator and core contributor of Hygieia DevOps Dashboard, and we'll talk about it briefly later on.
Before I start my actual talk, let me talk about Capital One. We operate mainly in the U.S., but we also have a presence in the U.K. We have millions of accounts. We are one of the largest digital banks in the U.S. We have been named number one on InformationWeek's Elite 100.
And this is the thing that I like most about ourselves. We are just about 20 years old. Our closest competitor in that industry in the U.S. is more than 100 years old. So we think that we are a kind of startup in that industry.
And when you say we are kind of a startup, we do have a different kind of DNA, and the DNA shows up on multiple fronts. Number one is we build our own software. Instead of relying on vendor-supplied software, we build our own. We build it on public cloud. We probably are the first bank that has embraced cloud, the public cloud, openly.
We build our own software using microservices technology, and that doesn't mean that all of our applications are microservices-based, but that's where we are going. And, of course, if you are playing in this area, you have to embrace open source openly and strongly.
We rely upon continuous delivery. Our goal is to ship products to customers' hands as fast as possible, as and when needed, without any constraints.
Now, these led to many of the flagship products that Capital One has produced. You can actually talk to Alexa, or you can have a gender-neutral chatbot for your banking application. We are also the first bank to actually put a core banking application on public cloud. That happened last year.
We don't only create our own software for our own use; we also open source some of the products that we developed internally and are using. One of them is Hygieia, which I just talked about. It was the first open source product that came out of Capital One. Others include Cloud Custodian, which governs your cloud policies, and Hydrograph. If you go to github.com/capitalone, you will find about 24 projects. 119 developers are contributing to all these projects, and there are about 12 teams that are actually managing these projects.
So this is kind of not the open source kind of software that was developed internally and was thrown at GitHub and then forgotten about. We actually take care of those products. So if you want to use any of these products, you'll definitely find help on many fronts.
Now, let me talk about our five-year journey. We started with waterfall, manual build, manual deployment, manual test, data center, and we were closed source first. So that was about five years ago, or more than five years ago. It was a traditional software organization.
From that point onwards, we moved on to Agile, fully automated build, automated deployment, automated test. We are on public cloud. And the thing I like best about it is we are open source first. Five years ago, if I said, "I want to use this open source tool," I would be asked why and what's lacking in the commercial tool. Today, if I want to use a commercial tool, the question that I and every developer at Capital One will be asked is, "What's stopping you from using an open source product instead?" So that's a drastic change, and developers just love it.
Among these changes, there are certain things that happened that I like most. One is we went from mostly outsourced to mostly insourced. Back in the days, five or six years ago, we were 80% outsourced. All of our developers were offshore. Right now, it's almost the reverse. Mostly it is insourced.
We had vertical silos: your development team, your ops team, your QA team, your release management team. From that, we went to a product team, and I'll talk more about this. The product team means everybody that is needed to deliver a product and support it in production is actually the team. They don't have any external dependencies with other teams to make things work.
As I said, we had dev, ops, QA, and release managers. So those were kind of the roles. From that, we went to a single role called software engineer. Developers, of course, write code. Ops writes infrastructure code. QA writes test code. And the release manager actually engineers the release process.
So this was kind of the five-year journey, and we actually had a name for it. It's called JFDI, and I'll talk about JFDI. It basically means just do it.
So that led to the innovation of the kind that you are seeing in our organization today. And I talked about it at many conferences, especially the DevOps Enterprise Summit. In 2014, we talked about building out automation steps. In 2015, we talked about how to scale DevOps and open source and cloud and foster innovation. And in 2016, we talked about how to measure, because at the end, it's all about measurement. What did you get out of transforming into DevOps and into a modern-day software engineering company?
And since last year, 2017 and beyond, our goal is, number one, slay the monolith. And it comes with a hashtag, and it is on purpose, because everything that is important today needs a hashtag. So, slay the monolith. As I talked about microservices, going from monolithic application to microservices architecture-driven application is our number one goal.
No-fear release. Now, if you are in a development team like myself, you know the fear that I feel when I try to release something to production, and that is because there are a lot of things that are unknown in the release process, at least from my perspective as a developer. So I'm afraid to release anything to production and then go home happy. Removing that fear is our second goal for 2017 and beyond.
The next one is: you build it, you own it. And I will talk about that.
You build it, you own it. What does that mean? We started with, you coded it, you build it. That means there's no single team separately sitting in some corner of the room and actually building your software. If you have coded it, you probably should be building it, because if build fails, you are the first person who knows why it failed.
Since you built it, you test it also. There's no point giving it to another team: "Hey, this is the build that I just did last night. Can you test it?" No, it's your responsibility to test it, to make sure that everything works.
Now that you have tested it, why don't you deploy it too? What's the point of building something and testing something and then handing it over to somebody to deploy to an environment for testing, or production, or whatever that is? You tested it, you deployed it. It's your responsibility.
That's the last step. Since you deployed it into production, why don't you own it? Meaning that if there's a problem, if there's an incident, it's your responsibility to figure out what failed and how to solve it.
The next one is no-fear release. As I said, what are people afraid of? Fear of speed, fear of breaking things, fear of being out of control. If we are trying to release software to production as fast as possible, these are all the things that actually come to mind as a developer, or ops engineer, or test engineer. You're always scared.
And fear of being non-compliant is a major one. The first thing that audit compliance team will ask you: "Okay, you are releasing as fast as possible to production. How about compliance? How do you know that risks are being mitigated?"
At the end, what we want out of DevOps is this. We want to release software to production as fast as possible, just like this shiny car. What you don't want is one of these running into some house and breaking your car, or even this, hitting a light post on the way, or some rock fell on your car and spoiled the car. And the last thing we want are these people showing up in your backlog, and that is the compliance police that I've been talking about, and I'll be talking about more in this.
So what is important in a CI/CD pipeline is safety. How do I make sure that while I'm delivering software as fast as possible to production, I'm actually doing it safely? And that's kind of the rest of the talk.
Now, when you talk about safety in continuous delivery, you basically are talking about audit compliance, risk mitigation, how to minimize risk, and all that.
From a former auditor's perspective, when I talk to Jennifer, who is the co-author of this presentation, she has given this talk many times, and the first question she asks the developers: "How many people here actually like auditors?"
I didn't raise my hand, as you must have noticed, and I didn't see a single hand go up. And that was my feeling, too. When she, a former auditor, asked me, "Do you like auditors?" I said, "No, absolutely not. They are not the kind of people I love to work with."
But at Capital One, love, peace, and harmony between auditors and developers. I was just kidding. I just lied. And that's the only time I'll be lying in this whole talk.
So we work together with the audit and compliance team and the risk office, and try to figure out what it takes for a developer to ship code to production as fast as possible while staying safe.
The next pieces are what she taught me. She is actually the presenter for them, and I kind of go on the side, but since she's not here today, I'm going to be talking on her behalf.
Compliance, right? We all talk about compliance. Now, from her perspective, compliance is a bad word. It better be governance. We all know that governance is good. Some developers don't like governance, but when they are actually explained what governance means, they kind of appreciate it, just like me.
Compliance versus governance, from her perspective, is: compliance is nothing but checking the boxes. Have you tested? Yes. Has somebody reviewed your code? Yes. Have you tested in a performance testing environment? Yes. It's kind of checking the box. That's no good.
Governance, on the other hand, is awareness and active management of the risk. Instead of saying, "I have tested," I actually test and prove it to myself and others that, yes, I did test.
There are three lines of defense in the whole compliance, governance, and regulatory space. Number one, first line: who owns the risk? Risk of failure in production, risk of leakage, risk of security vulnerability. Who owns the risk? That's the first line.
The second line is there's a group of people who should set the policy based on the risk and monitor the risk. And the third line is, of course, independent assurance, which is external agencies.
Now, what's the developer's role in governance? If you tell a developer, "You actually need to be governed," they will be very unhappy because they think that they are not a part of it. They are being governed by outside.
But the developer's role in governance starts with awareness of what we are trying to do here through the CI/CD pipeline. Risk mitigation is one of their responsibilities, and so is following controls and best practices. We have auditors, the risk and compliance office, external auditors, and internal auditors who actually set all the controls and best practices. As a good citizen developer, I should follow those.
Question is, why do you need controls? As soon as you talk to a developer about controls, they feel like somebody's trying to dictate how you should work. But when I talk to the developer and when I take Jennifer Brady with me and explain the whole thing, things are a little better.
Controls are there to protect you and the company because, in the spirit of deploying to production as fast as possible, sometimes we forget about many things, especially security, testability, performance, resiliency, and other things. These controls are the guardrails that you drive between.
Provide assurance around financial reporting. At the end, it's your company, and the success of your company, that defines the success of your own career. And then provide comfort to investors. If you talk to the investors of Capital One and say that, "We just deploy code. As soon as somebody commits to GitHub, it goes out to production," I don't think that they'll be very happy about that.
And then the last thing, and this is my favorite quote that I always keep in my mind. Deming said, "Uncontrolled variation is the enemy of quality." So when we talk about control, we are basically talking about quality. How do you make sure that the thing that you're deploying to production and giving it to your customers' hands, they are of the highest quality?
Now, around this, there are some minimum set of controls that we kind of define that are absolutely needed for us to be successful in continuous delivery, as well as being safe in the continuous delivery environment.
Number one, we always need two sets of eyes. And this is the classic separation of duties, and I'm going to talk about this more.
Principle of least privilege. If you have access to some system, or some server, or something, then there must be a reason for it. If you don't need it, you shouldn't have it.
Unauthorized change monitoring. Who is making changes to your production box and your production instances? If it is not authorized, it should not happen. It should not be changing without any proper authorization. And I'm not talking about manual authorization, and I'll talk about those a little later.
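As one illustration of unauthorized change monitoring, the following is a minimal Python sketch, not Capital One's actual implementation. It assumes the pipeline records which artifact digest it released to which host, and that some separate process reports the digest actually running on each host. The `Deployment` record and the `find_unauthorized_changes` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One artifact that the pipeline released (hypothetical record shape)."""
    host: str
    artifact_sha256: str


def find_unauthorized_changes(
    authorized: list[Deployment],
    observed: dict[str, str],
) -> list[str]:
    """Return hosts whose running artifact was not released by the pipeline.

    `observed` maps host name -> digest of what is actually running there.
    """
    allowed = {(d.host, d.artifact_sha256) for d in authorized}
    return sorted(
        host for host, digest in observed.items()
        if (host, digest) not in allowed
    )


if __name__ == "__main__":
    released = [Deployment("web-1", "aaa111"), Deployment("web-2", "aaa111")]
    running = {"web-1": "aaa111", "web-2": "fff999"}  # web-2 has drifted
    print(find_unauthorized_changes(released, running))  # ['web-2']
```

The authorization here is data, not a manual sign-off: a change is authorized only if the pipeline's own records say it released that artifact to that host.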
Now, when Jennifer and I actually worked together and came up with this kind of framework, what I found is that automating the pipeline is the easiest task. Setting up a Jenkins pipeline, or whatever that tool you are using, setting up code scanning job, or test job, or even performance test, and then Ansible, Chef, and all, automating the whole thing is very easy. I mean, comparatively.
The same goes for building on every commit, static code analysis on every build, scanning for open source vulnerabilities, static security scans, automated tests, and more. These are all easy again.
But the biggest hurdle that I found is to prove this: ensure that a single developer cannot make changes to production bypassing all controls. Now, that is the biggest hurdle. It's easy to say that, but to actually implement that is very hard.
Now, when you go to developers, they'll talk about pull request reviews. Every GitHub pull request is being reviewed. I said, "What's the guarantee of that?"
"Well, GitHub has a checkbox that says don't allow merging of pull requests without being reviewed by a second person."
I said, "How about the GitHub admin? Can they change that? Or the repo admin?"
"Yes."
"How about the repo admin changes that, and then puts some code into the commit, and then merges without the review? Can that happen?"
"Yes."
So the control fails right there.
The same kind of gap shows up at every step. You cannot really be sure that the control holds, and until it does, you cannot be sure that your CI/CD pipeline is delivering code to production with a second set of eyes other than yours.
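The admin-override gap in this exchange can be checked after the fact from recorded evidence rather than trusted to a checkbox. The sketch below is a hedged Python example under assumed data: `MergedChange` is a hypothetical record that some collector would populate from the source control system, and `review_rule_enabled_at_merge` is an assumed field, not a real GitHub API attribute.

```python
from dataclasses import dataclass, field


@dataclass
class MergedChange:
    """Evidence about one merged pull request (hypothetical record shape)."""
    author: str
    approvers: list[str] = field(default_factory=list)
    # Assumed field: whether the required-review setting was on when merged.
    review_rule_enabled_at_merge: bool = True


def review_violations(change: MergedChange) -> list[str]:
    """List the reasons a merged change lacks an independent second set of eyes."""
    problems = []
    # Self-approval does not count as a second set of eyes.
    independent = [a for a in change.approvers if a != change.author]
    if not independent:
        problems.append("no approval from anyone other than the author")
    if not change.review_rule_enabled_at_merge:
        problems.append("required-review rule was disabled at merge time")
    return problems


if __name__ == "__main__":
    bypass = MergedChange(author="admin", approvers=[],
                          review_rule_enabled_at_merge=False)
    for problem in review_violations(bypass):
        print(problem)
```

The design choice is to evaluate what actually happened to each change, so a repo admin who switches the rule off and merges alone still leaves evidence that fails the check.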
So what are the options? Separate teams managing the pipeline. So that means we are going back to the old days where there's a team sitting in the corner of the room, and after you commit your code, they're actually building the pipeline and running all these automation steps to go to production.
Or separate team just to perform production deployment. That means you have gone all the way to performance or pre-production environment, and as the last step, you are relying upon some other team to actually do the production deployment.
Or we hire professional button pushers, meaning that everything has been automated, and there's a button, production deployment, and you go to pre-prod, all automated, and ask somebody to just push the button just for the sake of it, just to prove that there's a second set of eyes, and the second set of eyes are the eyes of the button pusher.
Now, what's the problem with that? First of all, this is going back to the old days. Yes, I do have everything automated, but I still need to depend upon somebody in the middle of the night to push my code changes in production.
Now, here are some assumptions. We have enough button pushers available, right? Think about that. You are a big enterprise, and you hire a bunch of people whose jobs day in, day out is actually pushing buttons. Who would like that?
And they cannot code, because as we said, you need a second set of eyes and separation of duties. There's a thought that separation of duties means that if you are a developer, you cannot actually do anything other than development. You cannot definitely send your code to production yourself. You need some non-developers to push your code to production.
So if you have hired a button pusher to deploy your code, they cannot code. Now, you cannot train them to do anything else because as soon as they learn coding, like Ansible scripting or Chef Cookbook or anything, they are now developers. They cannot deploy anything to production.
So the only expectation from them is when they actually press the button, they really know what they're doing. You coded, you tested, you went all the way to pre-production, everybody's happy, and then you are relying upon somebody who has absolutely no knowledge about the code. You are relying upon them to send the code to production. How good is that? That's like checking-the-box compliance. It's not governance.
So Jennifer and I, and the audit team, we kind of went head to head a lot of times over the months, and then we thought that, you know what? Let's take a step back.
And we kind of realized this: the secret of change is to focus all your energy not on fighting the old, but building the new one. So that means we came to the realization the old model of separation of duties, risk management, and all that does not hold good in the modern-day CI/CD world.
So what we did was we went back to the roots of DevOps, which lie largely in the manufacturing industry. And we said, let's look at whether there is something in the manufacturing industry that we need to port into our modern-day CI/CD environment. And we found a concept called the clean room.
Now, this is not something new to me. I did my PhD in semiconductor physics. In my old life, I spent a lot of time in clean room and kind of knew what it is. It just didn't click to me that that could be helpful here, too.
By definition, a clean room is an environment, typically used in manufacturing (including pharmaceutical products), scientific research, and semiconductor engineering, with a low level of environmental pollutants such as dust, airborne microbes, particles, and chemical vapors. Or, in other words, bugs.
So we defined a software delivery clean room, and it goes like this.
All product pipelines are identified and registered. That means the thing that you are deploying, you can trace it back through your pipeline all the way to a source code that actually produced that, whether it's infrastructure, or test, or actual application running, or the middleware. It doesn't matter. You have to have the ability to trace it back to something in the source code that changed, that created that new deployment.
Everything is under source control, as I said. Your application is source controlled, your infrastructure is source controlled, your configuration is source controlled, your tests are source controlled. If any of these are changed, then that's a new application that you are going to deploy in production.
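The traceability idea can be sketched in a few lines of Python. This is an illustrative assumption, not how Capital One registers pipelines: each release records the commit of every source-controlled component, and a release identifier is derived from all of them, so a change to any one component yields a new release. `COMPONENTS` and `release_id` are hypothetical names.

```python
import hashlib

# The four things the talk says are source controlled.
COMPONENTS = ("application", "infrastructure", "configuration", "tests")


def release_id(commits: dict[str, str]) -> str:
    """Derive one identifier from the commit of every source-controlled component.

    Raises ValueError if any component cannot be traced to a commit.
    """
    missing = [c for c in COMPONENTS if c not in commits]
    if missing:
        raise ValueError(f"untraceable release, no commit recorded for: {missing}")
    joined = "\n".join(f"{c}={commits[c]}" for c in COMPONENTS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    base = {"application": "a1", "infrastructure": "i1",
            "configuration": "c1", "tests": "t1"}
    config_only = dict(base, configuration="c2")
    # A configuration-only change still produces a different release.
    print(release_id(base) != release_id(config_only))  # True
```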
Every change is peer-reviewed. So we made sure that instead of implementing the classic separation of duties, the best way to implement not only the best practices in Agile and DevOps, but also best practices of risk mitigation is every change happening in your application, infrastructure, test, or any other configuration has to be peer-reviewed before it gets merged to the main branch.
Production changes occur only via code changes, as I said. One of the questions that audit and risk people will ask you: who has access to production? And our new answer to that question is nobody. Nobody has access to production.
And because everything is source code, there's nothing going on in production that is not in the source code, and hence nobody should need to log into production to make changes. Now, a lot of teams within Capital One and outside of Capital One have told me they're very mature in their CI/CD pipeline. However, they need access to production. My answer to that is: you are not mature if you still have access to production boxes.
Every code change goes through various levels of testing and scanning, ultimately to production, and those are the CI/CD pipelines that we talked about. The pipeline should stop or alert if things fail. And then all the evidence is captured through these pipelines and evaluated in near real time.
So imagine something is flowing through the pipeline. If any of the checks fail, then you should stop the pipeline and have somebody look at it before it's going to production.
Evidences are analyzed for discrepancies. Did something change without a proper peer review? Did something change without a proper security scanning? Was there a performance testing requirement before it went to the production? All these kind of things.
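A pipeline that stops on a failed check and keeps evidence of every check it ran could look like the following Python sketch. The check names and the `Evidence` record are assumptions made for illustration; a real pipeline would call its own scanners and test runners instead of the stand-in lambdas.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Evidence:
    """Outcome of one control, kept for later analysis (hypothetical shape)."""
    check: str
    passed: bool


def run_gates(
    checks: list[tuple[str, Callable[[], bool]]],
) -> tuple[bool, list[Evidence]]:
    """Run checks in order, record evidence, and stop at the first failure."""
    evidence: list[Evidence] = []
    for name, check in checks:
        passed = bool(check())
        evidence.append(Evidence(name, passed))
        if not passed:
            # Stop here: somebody has to look before anything reaches production.
            return False, evidence
    return True, evidence


if __name__ == "__main__":
    released, trail = run_gates([
        ("peer_review", lambda: True),
        ("security_scan", lambda: False),  # stand-in for a failing scan
        ("performance_test", lambda: True),
    ])
    print(released, [e.check for e in trail])
```

Checks after the failing one never run, and the evidence trail shows exactly where the change stopped.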
So we have defined a software delivery clean room, which looks like that. It's an arrow from code commit to release, and there are various controls in place, and we just talked about the clean room requirements. So these are definitive controls in the pipeline that are purely data-driven.
There is no rule that says if you're a developer, you cannot release code. If you're a developer and somebody other than you has reviewed the code, whatever that code may be, it can go through the pipeline all the way to release, provided it meets all the other requirements that are on display.
Now, these are purely data-driven. So it's very hard to do data-driven audit control in a CI/CD pipeline in today's modern world that still meets the classic governance policies and risk mitigation controls.
So before I go and talk about how we do it, I want to just share some results. In 2016, number of products deploying multiple times a day was about 20. In 2017, it's about 300. Average number of deployments per day was about one in 2016. 2017 is four. Maximum number of deployments for a given single product in a single day was about 30 in 2016. It's about 50 now.
And overall, about 25% of our applications are actually in this software clean room that is going to production. Our goal, again goal, by the end of this year, we should have 50% of our applications in the software clean room.
Now, how do you do all this clean room work? This is Hygieia. Hygieia, in essence, is nothing but a data collection tool that collects data from all the CI/CD tooling, brings it together, and analyzes it. It also provides a dashboard, but underneath that, the data analytics platform that we are creating around Hygieia is coming in handy in this regard.
Based on the data collected, let's say from the GitHub commit stream, I can actually tell you whether, between release A and release B, all the commits have had a peer review. If not, the pipeline stops. That's just an example.
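The release A to release B check can be sketched as below. This is a hedged Python example over assumed inputs, not Hygieia's actual API: `history` stands for an ordered commit stream and `reviewed` for the set of commits with recorded peer review, both of which a collector would have to supply.

```python
def unreviewed_between(
    history: list[str],
    reviewed: set[str],
    release_a: str,
    release_b: str,
) -> list[str]:
    """Return commits after release_a, up to and including release_b, lacking review.

    `history` is ordered oldest to newest; both releases must appear in it.
    """
    start = history.index(release_a)
    end = history.index(release_b)
    if start >= end:
        raise ValueError("release_a must come before release_b in history")
    return [sha for sha in history[start + 1:end + 1] if sha not in reviewed]


if __name__ == "__main__":
    commits = ["c1", "c2", "c3", "c4"]
    with_review = {"c2", "c4"}
    gaps = unreviewed_between(commits, with_review, "c1", "c4")
    print(gaps)  # ['c3']
    if gaps:
        print("pipeline stops")
```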
Now, after doing all these, I came to a realization, and I get asked this very often. Are you well-managed if you're doing continuous delivery? That is the first question you'll hear if you go to a company who's just starting off on this journey. How about risk mitigation, and governance, and controls, and all that? Are you well-managed if you are doing continuous delivery?
And after spending a lot of time, years on this, I came to a realization that the question should be the other way. Are you well-managed if you are not doing continuous delivery?
And the underlying principle is if you are doing continuous delivery, you are using these tools and automations and all the data that are generated by these tools, and you can actually use the data to prove that you are better governed, better managed, and you have higher quality.
So at the end, this is my favorite T-shirt. I wish I could actually physically have that, but I just have the picture. It says, "All of Chuck Norris's change controls are full cycle, and they are always approved."
You don't need a separate change approval board to go through the painful process of that before releasing software to production.
Thank you very much.