San Francisco 2017
Better Governance: Banking on Continuous Delivery

Continuous Delivery in a regulated environment can be challenging. Being a regulated company, Capital One must ensure that it remains focused on managing risk and strong governance while adopting DevOps and Continuous Delivery.

We are often asked, “Are you exposing the company to undue risk if you are doing Continuous Delivery?” Our experience tells us that “You are not as well-managed or as effective as you could be if you are not doing Continuous Delivery.”

In this presentation, we will dive deep into some of the core governance areas and describe how Continuous Delivery is helping us achieve better IT governance. We will describe how we have created (a) a shift-left of many controls, (b) automation of control monitoring, and (c) a better engineering experience. We will share some results of this work and discuss some questions that are still unanswered.

Full transcript

The complete talk, organized by section.

Topo Pal

It's a great honor to be here for the fourth time. The first time I did it was in 2014, and then since then, Gene keeps calling me, and I keep coming back here. But this time, I'm not alone.

I'm going to go through the introductions. Before that, let me just build out the slides.

I go first. I go by Topo. I'm a senior director and senior engineering fellow at Capital One. At the very core, I'm a developer. I still code. I'm a DevOps evangelist at Capital One. I am also the product manager of our shared continuous delivery tools platform. And then I also am the core contributor of Hygieia, the DevOps dashboard that we open-sourced a couple of years ago.

Jennifer Brady

Hi, I'm Jennifer Brady. Unlike Topo, this is my first time joining you and first time presenting, so thank you, and bear with me.

I'm a technology governance director for Capital One, where I've been a little over two years. My background includes 20 years in risk management, audit. Like Gene said, I'm a former internal audit director, also data analytics, mostly in the financial services industry. So I've dealt with regulators most of my career as well.

My team is responsible for evaluating new technology initiatives, such as moving to the cloud, and looking at the risks, making sure we have the right controls in place for any of the new technology initiatives that we're working on. So Topo and I work very closely together.

We also have an engineering team that looks at our controls, like our manual tech controls, and sees how we can automate them. And then I also have a group responsible for using data analytics to help monitor technology risk and/or monitor our tech controls. So I work with data scientists, developers, data engineers, and data analysts pretty much every day.

As you know, Capital One is a Fortune 500 company. You've probably seen our commercials with Jennifer Garner, Samuel Jackson. What's in your wallet?

We have a large retail bank in addition to our credit card company, commercial bank, and auto loan presence. We are about 20 years young, based in McLean, Virginia, which is the office I work in, and Topo's in our Richmond office.

And Capital One has a strong focus on technology, which Topo's going to tell you a little bit more about.

Topo Pal

So as Jennifer was saying, we are 20 years old or young. Our nearest competitor is about 108 years old. And so we consider ourselves to be a startup in this particular industry.

And just like any startup, we think that we have a different kind of DNA within ourselves, such as we build our own software. Not COTS products; we build our own software that we think our customers need.

We build on public cloud. That's, I think, everybody knows by now, that Capital One is big on public cloud, and we run our software on public cloud.

We build our software using microservices architecture, using open source. So at Capital One these days, if you want to build something and if you're not using open source, then you'll be asked as to why you are not using open source.

And of course, we are big into continuous delivery.

Now, what that does is it allows us to create products like this: Capital One Alexa app, or a gender-neutral chatbot that you can actually use and converse with your banking account, if you will.

And this year, in August, we are the first bank to be running bank core processing on public cloud. That's a huge achievement for us.

We don't only create banking applications, cool products that our customers want, but we also create software that we open source. An example is Hygieia, the DevOps dashboard that we open-sourced two years ago. Cloud Custodian and Hydrograph, these are some of the new projects that we open-sourced over the last 18 months or so.

But that's not all. We actually have 25 projects. About 109 developers are actually developing those projects, and about 12 teams are taking care of them. So I actually encourage you to go to github.com/capitalone and take a look at these projects.

This didn't happen in just one day. It took about five years. In five years, we went from waterfall to agile, manual build process to automated build process, manual deployment to automated deployment, manual testing to automated testing, from data center to public cloud, closed source first to open source first.

All these things happened. But in my mind, the biggest things that happened that changed our culture and changed our focus to our software engineering are these.

We went from mostly outsourced to mostly insourced. We tore down vertical silos and created product teams. Product teams that actually had everything that they needed to build the products. Starting from developers, to QA, to testers, to ops people, everybody that's needed to build a product was in the same team. There were no silos. There was no wall. Everything was broken down.

And we had a term for this. We called it JFDI. JFDI: just do it.

Instead of having DevOps, QA, RM, we had engineers. There's no different kind of roles as to you are dev, you are ops, you are QA. Everybody is an engineer. Everybody writes code.

Now, I actually went through this journey over the last four years on this stage.

In 2014, we talked about building our automation steps. That was kind of a starting point. And then 2015, I talked about how do you scale DevOps, and open source, and cloud, and innovation across the whole enterprise.

In 2016, I talked about how do you mature, improve, and measure our DevOps success.

And this year, I'm going to talk about the things that we said we are going to do for the rest of this year, starting 2017 and beyond.

Number one, slay the monolith. Now, the hashtag is there because everything that is important these days needs to have hashtags. So slay the monolith is number one, and you know what it means. It basically means the things that you're going to build will use microservices architecture.

No fear release. We often see that developers are scared to release code in production, and we want to break that and just emphasize that you need to release to production without any fear.

You build it, you own it. And I'm going to go through this again. The whole concept is if you have coded your application, then you must own it in production too.

Now, let's go through these one by one.

You build it, you own it. It's kind of a natural progression because first we said, you code it, you build it, which is the CI. We used to have teams that manually built developer software before, but then we changed that to say, if you have coded it, you build it, which basically means you automate your build process.

Next, we said if you built it, you test it. If you have built it, then who else can test it? So that kind of created the model of test automation and ATDD, acceptance test-driven development.

Next was, well, you have tested it, then you deploy it. Why should I deploy your code that you built and you have tested? So you deploy it. So that went through the deployment automation steps and all.

And then if you have done all these things, go ahead and own this thing too. Once it is deployed to production, take care of it just like your baby. You support the production environment. You support customer calls, and you respond to incidents and all that.

No fear release. Now, when it comes to no fear release, things get a little different because we first need to understand what is the fear?

Fear of speed. Now that everything is going through the delivery pipeline, we are all scared about the speed. What's going to happen if we just deploy multiple times a day to production environment?

Fear of breaking down. Of course, when I'm doing things faster, then there are more chances that things are broken. And then can I deploy to production knowing that it might be broken?

Fear of being out of control. That one is internal to every developer: I'm releasing things to production, so am I out of control, or am I within the control of all the other requirements that I need to care about?

And the last thing is fear of being non-compliant. Every time I go and speak to people outside of our company, people ask, "If you're doing DevOps and continuous delivery, are you compliant?"

What you want, basically, is this: continuous delivery, high-speed delivery to production.

What you don't want is running into some kind of building or a light post. We don't want rocks to be falling on us. And we definitely do not want these guys chasing us down.

So what that brings us to is safety in continuous delivery.

Jennifer Brady

So many of you heard I'm a former auditor. How many of you dislike auditors, dread seeing us coming? Come on, own up to it. I'm sure there are plenty of you out there. I won't be offended.

Well, let me show you an auditor's perspective on developers: wild, wild west.

Why do I think the wild, wild west? I saw at a former company one comma cause a half-a-billion-dollar error. One comma. Can you believe that? Without appropriate controls, it is the wild, wild west.

But at Capital One, we have peace, love, and harmony. Well, not really. Not really. But here we are on stage together, dressed very differently, but on stage together. It happens.

We have our developers thinking about compliance. But wait, not compliance. That's a bad word, and let me tell you why.

I prefer, and we in the industry prefer, governance. And why governance? Why do we want governance and not compliance? Compliance kind of means just checking a box. I'm just filling out a bunch of requirements. I'm not really thinking about what does this mean to me?

Whereas governance is actively understanding, managing that risk, awareness of, in everything you do, thinking about what is the risk of this, and how does it impact me and my company? You don't want to cause that half-a-billion-dollar error by missing one comma, do you? I don't think any of us do.

So let me talk now about the three lines of defense. How many of you are familiar with this risk management framework? Anyone out there? Okay, a few of you. I'm going to enlighten the rest of you.

This is a best-practice risk management framework. A lot of regulators are using it to evaluate enterprise risk management. It's been around for the last four years or so.

And how many of you actually realize you own the risk? You are the first line, and you own the risk. How many of you know that?

Yeah. Surprisingly, developers are in the first line with our lines of business. So you all own the risk.

The second line is our governance risk functions, compliance. That's who sets policy, monitors the risk on a daily basis.

And then our third line is where I used to fall, in internal audit. That provides independent assurance. It's because the third line reports usually directly to the audit committee or the board, so they provide that independent assurance.

And then, of course, on the outside of this model, depending on your industry, we also have external audit and regulators.

So what is the developer's role in governance?

First of all, awareness. You all need to be aware that you create risk and could bring down the company by just one little change in the code. So really making sure that you think about it and have that knowledge. And they say the first way to fix a problem is to realize you have the problem, right? So making sure you're aware.

Secondly, risk mitigation. Being responsible for that risk. Making sure you perform your work in a controlled manner and follow best practices and controls.

So we're going to talk about why controls. How many of you own stock out there? I'm sure we have lots of investors in the tech industry. So tell me, how would you feel if you were investing in that company if it truly was the wild, wild west?

Controls are there to protect you, your investment, protect what you're doing. And then they're really providing assurance around financial reporting, also giving us that comfort to investors that we talked about.

But the best quote that sums it all is from Dr. Deming: "Uncontrolled variation is the enemy of quality." I think that just about sums it up.

We're going to talk then about minimum sets of controls.

The first concept I want to talk about is two sets of eyes, making sure we have someone competent enough to perform a review of your code, any changes you make, before it gets launched into production. And the key there is competency.

You don't just want any old person, "Oh, let me check the box, do this review. I'm good." You really want to have someone who understands what that application is doing, what your code change is doing, so you can do an effective peer review before the code is launched. And that's that two sets of eyes concept.

Secondly, least privilege. This is a key access concept, making sure that you don't have access to things you don't need. Most developers on a daily basis don't need write access to production. Update access and write access, no. Read access, yes. No problem with that. But really limiting that access to what you truly need to do your job and not giving you the wild, wild west to do whatever you want.

And then these are both preventative controls.

The third control I'm going to talk about is unauthorized change monitoring. This is a detective control. You really need to have that key control at the end to really look at what did someone do? Is it what they were supposed to do? Can we go back and see?

We suddenly have a high-sev ticket come up. Did some code change cause that? So really have that monitoring set up in the back end to provide that detective control as well.
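
The question "did some code change cause that?" can be sketched as a lookup over deployment records. This is a minimal illustration, assuming deployments are recorded with a product name, commit, and timestamp; `Deployment`, `changes_before_incident`, and the 24-hour lookback are hypothetical names and values, not something described in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    product: str
    commit: str
    deployed_at: datetime


def changes_before_incident(deployments, product, incident_at,
                            lookback=timedelta(hours=24)):
    """Return deployments of `product` inside the lookback window, newest first."""
    window_start = incident_at - lookback
    hits = [
        d for d in deployments
        if d.product == product and window_start <= d.deployed_at <= incident_at
    ]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)


# Hypothetical records; a real system would read these from deployment tooling.
deployments = [
    Deployment("payments", "a1b2c3", datetime(2017, 11, 13, 9, 0)),
    Deployment("payments", "d4e5f6", datetime(2017, 11, 14, 15, 30)),
    Deployment("cards", "0a0b0c", datetime(2017, 11, 14, 16, 0)),
]
suspects = changes_before_incident(
    deployments, "payments", incident_at=datetime(2017, 11, 14, 17, 0)
)
print([d.commit for d in suspects])
```

The result is only a list of candidates to investigate. It does not establish that any of those changes caused the incident.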

Developers' answer is always automation, right? Kind of checking the box through build on every commit, static code analysis on every build, scanning for open-source vulnerability, static security scan, automated tests. I'm good. I'm checking all these boxes. I'm good.

But let me ask you: biggest hurdle, how do you ensure one person cannot do it all and have appropriate segregation of duties?

Simple answer is you cannot with automation alone. You need to make sure you have the controls we talked about up front built into your automation process, and then work with your governance, risk, audit folks as a partnership. Don't wait till audit comes in the end and says, "Bad. You shouldn't have done this." Make sure you're reaching out to your first-line, second-line governance folks who can help you build as you go.
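
A segregation-of-duties check over a change record might look like the sketch below. It assumes each change records who authored, reviewed, and deployed it; the `ChangeRecord` fields and the rule wording are hypothetical, chosen only to illustrate the "one person cannot do it all" idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    change_id: str
    author: str
    reviewers: frozenset
    deployer: str  # a person, or a pipeline service account


def segregation_violations(change):
    """Return a list of human-readable problems; an empty list means no finding."""
    problems = []
    independent_reviewers = change.reviewers - {change.author}
    if not independent_reviewers:
        problems.append("no reviewer independent of the author")
    everyone_involved = {change.author, change.deployer} | change.reviewers
    if everyone_involved == {change.author}:
        problems.append("one person performed every step of the change")
    return problems


ok = ChangeRecord("CHG-1", "alice", frozenset({"bob"}), "pipeline-bot")
bad = ChangeRecord("CHG-2", "carol", frozenset({"carol"}), "carol")
print(segregation_violations(ok))
print(segregation_violations(bad))
```

As the talk points out, a check like this only helps if the records it reads cannot be altered by the same person it is checking.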

That's really what we are doing every day. That's how I got to know Topo so well, and he invited me to come talk to all of you, that we are really working together on all our new technology initiatives and thinking about what the impact on risk is and how do we put those right controls in up front.

Topo Pal

So I want to stay on this slide for a little more. Okay.

Because it is very hard, as Jennifer was saying. It's very hard to ensure that a single developer cannot make a change to the code and send that code to production. I have seen so many pipelines, both internally and externally. I can find a hole in every pipeline.

And I learned that from Jen, actually. Every time I thought that I had a solution, Jen said, "What if?" And I'd go, "Oh, let me go back to the drawing board." And that's how I kind of learned many of these things that I keep talking about.

It is very hard to ensure that a single developer cannot actually make change to code and send that code to production. And the reasons are that if you say, "Okay, all my source code is in GitHub. Everything is peer-reviewed, and they are peer-reviewed before merge, before it goes out for the build process," I would say, what if the repository owner can actually turn off that setting that says before merging you need to have peer review, and actually make the changes to the release branch, and that goes out to production? My model breaks, right?

What if the person who created the Jenkins job that builds your code and sends it to production actually changes the configuration of the Jenkins job, saying that instead of getting the code from here, let's get it from there and build it and send it to production? How do you stop that?
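
Both "what if" cases, a repository setting switched off or a build job pointed at a different source, are configuration changes, and one way to notice them is to fingerprint the configuration that was agreed on and compare it with what is currently in place. The sketch below assumes the settings can be exported as a plain dictionary; the keys shown are hypothetical and do not correspond to any particular GitHub or Jenkins field.

```python
import hashlib
import json

_MISSING = object()  # distinguishes "key absent" from "key set to None"


def fingerprint(config):
    """Stable SHA-256 over a JSON-serialisable configuration dictionary."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_keys(registered, current):
    """Names of settings whose value differs from the registered baseline."""
    keys = set(registered) | set(current)
    return sorted(
        k for k in keys
        if registered.get(k, _MISSING) != current.get(k, _MISSING)
    )


# Hypothetical baseline captured when the pipeline was agreed on.
registered = {
    "source_repo": "git@example.com:team/app.git",
    "branch": "release",
    "require_peer_review": True,
}
# Hypothetical current state: source swapped and review requirement turned off.
current = dict(
    registered,
    source_repo="git@example.com:someone/fork.git",
    require_peer_review=False,
)

if fingerprint(current) != fingerprint(registered):
    print("configuration drift:", drifted_keys(registered, current))
```

This detects the change after the fact; it does not prevent it, so it is a detective control rather than a preventative one.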

So as I said, it's very hard. So what do you do?

The options are: separate team managing your pipeline. You have a different team that says, "We are the team who will build your pipeline, and you'll commit your code, and then we'll make sure that that code goes to production."

Or a separate team just to perform production deployment. Their job is you do the rest of it. When it comes to production deployment, you come to us, and then we deploy to production.

To me, these two are a kind of stairway to downstairs, where we actually came up from five years ago. And you'll see that. And these teams kind of show up, and they have different names depending upon which organization you go to. DevOps team, system team, please don't do it. That's my experience.

Or you can hire a bunch of professional button pushers.

Jennifer Brady

And staff them.

Topo Pal

The whole idea behind this is you automated the whole thing, the production deployment, and it just shows up as a button. And that button could be a script, or it could be really a button that you go to Staples and find one of those. And then you ask that team of people to just push the button when they're told to do so.

Now, I can guarantee you that after a few erroneous deployments, what's going to happen is these teams will say, "Okay, before you ask me to push the button, or before you ask me to deploy the code, I need approvals from the change approval board because last time when I tried to do that and it failed, you blamed me."

So these are not good options.

Also, the assumption is enough button pushers are available. Now, go find the people in the marketplace and hire them as engineers, and you tell them that your only job is to push the buttons because you cannot code. If you code, then you cannot push the button, right?

You cannot train them to do anything else because as soon as you train them, or you teach them a new cloud technology or things, they are not going to remain as button pushers anymore. They're going to move, and they'll start coding. And as soon as they code, they cannot push the button.

But it is okay for them to push the button because they know what's going on, right? No, that's not true.

So with all these, we thought that none of these models actually really works, and we kind of had good discussions over the meetings with Jennifer's—

Jennifer Brady

Some lively discussions.

Topo Pal

Yes, some lively discussions. Nobody left the room, ever.

And then we thought that instead of fighting over the old things, let's create something new. We really want to build something that is safe for the developers. Because at the end, we want to create an environment that is safe for developers.

We don't want one single person to get blamed for, let's say, Struts vulnerability that was not patched in time, and one security engineer got blamed for that. We don't want that to happen anymore.

So we thought that most of DevOps concepts came from manufacturing industry, so we'll take one more page out of that and try to use it in DevOps.

And we came to the conclusion that we should borrow something from the clean room. A clean room is an environment, typically used in manufacturing and scientific research, including semiconductor engineering applications, with a low level of environmental pollutants such as dust, airborne microbes, other particles, and chemical vapors.

This is basically about making sure you have a clean manufacturing process. And I'm not saying that if we do a clean room implementation in DevOps, we'll get bug-free code, because you all know that all software has bugs. Except mine, though. Right?

Jennifer Brady

I don't know about that, Topo.

Topo Pal

So we said we are going to create a concept called software delivery clean room that will have all product pipelines identified and registered.

You come and tell us, this is your GitHub repository, these are your pipeline jobs, these are your scanning jobs, et cetera. And we'll register those.
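
The registration step could be modelled roughly as follows. This is a sketch under the assumption that a registration is just a named set of repositories and jobs; `PipelineRegistration`, `PipelineRegistry`, and all the example names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRegistration:
    product: str
    source_repos: tuple
    pipeline_jobs: tuple
    scan_jobs: tuple


class PipelineRegistry:
    """In-memory registry; a real one would need durable, access-controlled storage."""

    def __init__(self):
        self._by_product = {}

    def register(self, registration):
        if registration.product in self._by_product:
            raise ValueError(f"{registration.product} is already registered")
        self._by_product[registration.product] = registration

    def is_registered_job(self, product, job):
        reg = self._by_product.get(product)
        return reg is not None and job in reg.pipeline_jobs + reg.scan_jobs


registry = PipelineRegistry()
reg = PipelineRegistration(
    product="payments-api",
    source_repos=("example-org/payments-api",),
    pipeline_jobs=("payments-api-build", "payments-api-deploy"),
    scan_jobs=("payments-api-static-scan",),
)
registry.register(reg)
print(registry.is_registered_job("payments-api", "payments-api-build"))
print(registry.is_registered_job("payments-api", "adhoc-deploy"))
```

Once jobs are registered, anything that reaches production through an unregistered job becomes a discrepancy that can be reported.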

From that point onwards, we'll make sure everything is under source control, whether it's application code or test code or infrastructure code.

Every change is peer-reviewed. Now, as I said, you can break that, but I'm going to show you how we are actually going to monitor that every change is peer-reviewed.

Production changes occur only via code change. There's no going into production box and changing something.

Jennifer Brady

Nobody has access to production servers.

Topo Pal

Right. Right? So that's kind of what you love.

Jennifer Brady

That makes the auditor happy.

Topo Pal

That makes her happy. Nobody has access to production. But how does the code go to production? It goes through the pipeline.

Every code change goes through various levels of testing and scanning, and that's why test automation, security scanning, dynamic scanning, pen testing, all these come together.

Pipeline stops or alerts if things fail.

Evidence is captured and evaluated at near real time. So you are kind of making sure that the code was committed, it was peer-reviewed, and if you detect something wrong, you just stop.

The code was committed, it got peer-reviewed, it's now getting built, and a static security scan failed. You stop the pipeline. So the whole idea is almost at near real time, you detect all these different impurities, if you will, and stop the pipeline in proper time.
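
The stop-on-failure behaviour in that example can be sketched as an ordered list of checks where the first failing check halts everything after it. The stage names mirror the example above; `run_pipeline` and the `stage` helper are hypothetical and stand in for whatever the real pipeline tool provides.

```python
def run_pipeline(stages):
    """Run (name, check) pairs in order; stop at the first check that fails.

    Returns (completed_stage_names, failed_stage_name_or_None).
    """
    completed = []
    for name, check in stages:
        if not check():
            return completed, name
        completed.append(name)
    return completed, None


calls = []  # records which stages actually executed


def stage(name, passed):
    """Build a fake stage whose check records the call and returns `passed`."""
    def check():
        calls.append(name)
        return passed
    return name, check


stages = [
    stage("peer_review", True),
    stage("build", True),
    stage("static_security_scan", False),  # the failing scan from the example
    stage("deploy", True),
]
completed, failed_at = run_pipeline(stages)
print(completed, failed_at)
```

Because the loop returns as soon as the scan fails, the deploy stage is never invoked.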

And then you analyze for discrepancies, such as, I send the code to production. Can you ensure that somebody's always using the pipeline and not just sending code to production with production access, code that came from God knows where?

You cannot, unless you have proper controls in place. Meaning that if there's a production change and you can detect that, you need to have traceability to the actual code that made that change happen.
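
One way to get that traceability is to match what is running in production against build records by artifact checksum. The sketch below assumes each build record stores the checksum of the artifact it produced, the commit it was built from, and whether that commit was peer-reviewed; `BuildRecord`, `find_discrepancies`, and the shortened checksums are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BuildRecord:
    artifact_sha256: str
    commit: str
    peer_reviewed: bool


def find_discrepancies(production_artifacts: Dict[str, str],
                       builds: List[BuildRecord]) -> Dict[str, str]:
    """Map each host to a finding when its artifact cannot be traced cleanly."""
    by_checksum = {b.artifact_sha256: b for b in builds}
    findings = {}
    for host, checksum in production_artifacts.items():
        build = by_checksum.get(checksum)
        if build is None:
            findings[host] = "artifact did not come from a recorded pipeline build"
        elif not build.peer_reviewed:
            findings[host] = f"commit {build.commit} was not peer-reviewed"
    return findings


# Shortened, made-up checksums for readability.
builds = [
    BuildRecord("aaa111", "c0ffee1", peer_reviewed=True),
    BuildRecord("bbb222", "c0ffee2", peer_reviewed=False),
]
production = {"host-1": "aaa111", "host-2": "bbb222", "host-3": "zzz999"}
print(find_discrepancies(production, builds))
```

A host with no finding traces back to a reviewed commit; the other two are the discrepancies worth investigating.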

So we came up with this model called Software Delivery Clean Room. That actually talks about various control points that we have, and these are not manual control points. These are being automated. Some have been automated, and some are in the process of getting automated.

And it has kind of seven groups of things around source control, build, binary repositories, quality controls, security controls, deployment, and your support. So we kind of follow that right now, and the results are kind of amazing.

With this particular clean room model, number of products that we are deploying multiple times a day in 2016 was about 20. In 2017, it's about 300.

Average number of deployments per day for these products in 2016 was about one, give or take a few from one. In 2017, it's about four, four times a day.

Maximum number of deployments for a product in a single day in 2016 was 30. In 2017, it's about 50. So enormous progress, right?

Now, if you do this all manually, this is not going to happen, right? What you want to do is create these clean rooms and create these control points and automation based on the data that is coming out from different tools.

And so that's why we thought that the best place to create these audit APIs, if you will, is Hygieia, because we already collect data from GitHub, your Jenkins jobs, your build jobs, your scanning jobs, just all these different data points are there. So by looking at a pipeline, I could almost tell whether, number one, you met peer review control; number two, you met all the other static security scan or Sonar scan controls that you have established for yourself.
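
To make the idea concrete, here is a minimal sketch of evaluating controls from data that has already been collected. It is not Hygieia's audit API; the function name, the input shapes, and the pass criteria (no critical or high findings, no blocker issues) are all assumptions made for illustration.

```python
from typing import Dict, List


def audit_pipeline(commits: List[dict], scans: Dict[str, dict]) -> Dict[str, bool]:
    """Return a pass/fail verdict per control from collected evidence."""
    # Peer review: every commit needs at least one approver other than its author.
    peer_review_ok = bool(commits) and all(
        any(reviewer != c["author"] for reviewer in c.get("approved_by", []))
        for c in commits
    )
    # A scan that never ran counts as a failed control, not a passed one.
    security = scans.get("static_security_scan")
    security_ok = (
        security is not None
        and security["critical"] == 0
        and security["high"] == 0
    )
    quality = scans.get("code_quality")
    quality_ok = quality is not None and quality["blocker_issues"] == 0
    return {
        "peer_review": peer_review_ok,
        "static_security_scan": security_ok,
        "code_quality": quality_ok,
    }


# Hypothetical evidence for one pipeline.
commits = [
    {"sha": "c0ffee1", "author": "alice", "approved_by": ["bob"]},
    {"sha": "c0ffee2", "author": "carol", "approved_by": ["carol"]},
]
scans = {"static_security_scan": {"critical": 0, "high": 0}}
print(audit_pipeline(commits, scans))
```

Treating missing evidence as a failure is a deliberate choice in this sketch: a control that cannot be shown to have run is reported the same way as one that ran and failed.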

And we are not hiding these audit APIs within our four walls. So these audit APIs are getting built in open source. So go to github.com/capitalone/hygieia, and there's an audit API repository. All this code is there.

What I want is kind of your participation in that effort, just to make sure as a group we come up with a simple model that represents the clean room concept, and we can all make our developers' life safer and easier.

Now, a lot of times, Jennifer, you have been asked this too, right?

Jennifer Brady

Yeah.

Topo Pal

That, are you well-managed if you are doing continuous delivery? Right? And I think the answer is—

Jennifer Brady

Are you well-managed if you are not doing continuous delivery?

Topo Pal

So, I'll end with my favorite slide.

Jennifer Brady

Uh-huh.

Topo Pal

It's a Chuck Norris thing, and I have this as my end slide for every one of my talks. All of Chuck Norris's change controls are full cycle, and they're always approved.

Thank you very much.

Jennifer Brady

Thank you for your time.

Gene Kim

Thank you, Jennifer, for teaching us all about control environments.

Jennifer Brady

You're welcome. Hopefully I didn't put anyone to sleep out there.

Gene Kim

No, no, no. We want to be taught.

And, by the way, I just want to say one more thing about Topo. When you're at Capital One, you hear Topo's name over and over again. But one of the things that I've heard that I find very meaningful is this, and we heard this at the speaker dinner. He said, "Topo is amazing. I love how he's made our work at Capital One more fun and meaningful."

So thank you, Topo.

Topo Pal

Thank you very much.

Gene Kim

Yeah. Thank you.