Banking on DevOps

San Francisco 2016

John Rzeszotarski

Director of DevOps · KeyBank

Chris McFee

Architect · KeyBank

We started our DevOps journey in January of 2016. We had a traditional, siloed operations organization and spent entirely too much time coordinating these teams. In fact, we were aligned to our risk organization rather than to enablement.

Our team planned to accelerate the Online Banking and Treasury Portal implementations. The bottlenecks we focused on first were infrastructure, automated testing, and the release platform. The team started the migration from proprietary solutions to open source platforms on containers. Containers alone are great, but enterprises need high availability, self-healing, auto scaling, and deployments that do not affect users during a release. Our biggest struggles were educating while adhering to security and risk policies, and communicating with offshore automated testing.

In seven short months we adopted Kubernetes in the enterprise, migrated multiple application components to open source platforms, hardened solutions with log forwarding and configurations for rolling deployments, and built out all the environments, from development, testing, staging, and disaster recovery to production. Oh, and we also automated everything with Continuous Integration and Continuous Delivery processes.

Full transcript

John Rzeszotarski

My name is John Rzeszotarski, Rez for short. I'm the director of DevOps at KeyBank, and this is...

Chris McFee

I'm Chris McFee. I'm an architect on our DevOps team.

John Rzeszotarski

So I was here about a year ago. The first day, I was sitting in on these sessions. This was my first time, and I'm watching these companies talk about it, and I'm like, "Well, we're not that company. We have way too much politics happening in my company. We have too many roadblocks, too many blockers. There's just no way we could do anything like this."

But now I'm here a year later to tell our transformational story. I will say I got very inspired, and we're going to tell you how we got kicked off.

Quickly about KeyBank: we're actually a big bank. We're not in California. We don't have any branches in California, but we are big in the Midwest, upper Northeast, and upper Northwest. Our big differentiator is our corporate bank. We have a very powerful and versatile corporate bank. We offer investment banking, capital markets. We write research on technology companies. And we're headquartered in Cleveland, Ohio, so go Cavs, and you can feel bad for us about the Indians.

Knew that was coming.

We're a typical bank, I guess I would say, and we run a typical IT organization within it. We've grown through acquisition. We've grown through brick and mortar. We have our traditional silos. In fact, those silos have only gotten probably more siloed in the last 10 to 15 years.

We have a very rigorous ITIL process. When we went through ITIL, we said, "All right, hey, these configuration items and service management are really, really good. We really want to do this." But we also started getting impacted a little bit by the overhead it added to delivery and release.

And then what happened was, about two years ago, we had a big outage. It was a very significant outage. And when we tried to build a failover to specific components, we actually did more damage than good because we didn't understand our complexity.

Anybody that's read Farley's Laws is probably going to be like, "Yeah, you should have read Farley's Laws." And we didn't understand that complexity.

So our executive team was like, "Okay, everything we tried didn't work, and we can't have that type of an outage again. It just cannot happen. That's too much of a reputational risk for KeyBank."

So what they really did was they gave us direction. They pulled a couple of us. I led a group that we took out of their day jobs, and we focused around documenting some of that complexity. We called it START. It was our Senior Technical Action Review Team. Chris was on the team as well. And we really focused around documenting that technical landscape, identifying what points of failures existed, and then reviewing all of the operational things that we have that support us from a banking perspective.

And we really kicked this off about a year and a half ago, I would say.

Chris McFee

Yep.

John Rzeszotarski

And it only took us about three months, and Chris is going to tell you how we executed that.

Chris McFee

So I don't know about all of you guys, but how many of you actually have a fully documented end-to-end transaction that goes through any of your applications or your systems, and it's up to date?

Exactly. So that was one of the things that we had a real problem with. We were going through these large-scale outages, and we didn't really understand our own complexity.

As John mentioned, we really sat down, and the first thing we did is we took a look at all of our network, all of our infrastructure, all of our platforms. We did interviews with our application support teams to really understand how a transaction flows through each one of the systems, through each one of the pieces of infrastructure. And then we tried to review how each one of those pieces would fail if it were to fail. And there were some that failed pretty frequently, unfortunately.

So we completed our research. We started to construct some hypotheses. One of the things that we found was that our networking was probably a little too complex. We started to look at things like distributed cache in order to lessen the impact of database failures. And then we really needed a concerted effort to get all of our events and alerts into one pane of glass for our enterprise operation center.

So it took about three or four months, and we came up with all these hypotheses. Then we actually had to communicate with each one of the stakeholders: all the portfolios, all the application support teams, all the application delivery teams, and all the infrastructure teams, which led us to the outcomes.

This is a sample of a diagram that we actually completed for our online banking application, and what it shows is it really shows the complexity.

One of the things that we heard this morning was it's really hard to understand or visualize, unless you go through an exercise like this, how complex you really are. So this shows 190 different hops between infrastructure pieces or application pieces for one login. So you hit Login, and then you're taken to your account summary screen. There are 190 different hops for each login.

The other thing is that we have two data centers, so this actually shows a login coming in through one data center or through the other. And the way that we had our networking infrastructure configured, it was possible that we were ping-ponging between data centers. If you were to do that once or twice, it wouldn't have a big effect. But imagine doing that 10 times. Obviously, the customer experience isn't going to be great.
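To make the ping-pong effect concrete, here is a minimal Python sketch that counts how often a transaction path crosses between data centers. The site labels "A" and "B", the sample paths, and the 2 ms per-crossing cost are all illustrative assumptions, not figures from the talk.

```python
from typing import List


def count_crossings(hop_sites: List[str]) -> int:
    """Count how many consecutive hops change data center."""
    return sum(1 for a, b in zip(hop_sites, hop_sites[1:]) if a != b)


def added_latency_ms(hop_sites: List[str], cross_site_ms: float) -> float:
    """Extra latency from cross-site hops, at an assumed per-crossing cost."""
    return count_crossings(hop_sites) * cross_site_ms


# Hypothetical paths: "A" and "B" stand for the two data centers.
pinned = ["A"] * 6
ping_pong = ["A", "B", "A", "B", "A", "B"]

print(count_crossings(pinned))            # 0
print(count_crossings(ping_pong))         # 5
print(added_latency_ms(ping_pong, 2.0))   # 10.0
```

Run over a documented hop list, a count like this would show which transaction paths are worth pinning to a single site.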

We also found that we lacked standard configurations for things like our network devices, some of our application platforms, and then even some of the frameworks and things that we were using from an application perspective.

We really had a siloed approach for when things failed over. So one person from the network team might try to do one thing while somebody else is doing another thing, and they really weren't communicating. And then we had exponential growth of the environment, which really led to a large amount of technical debt over time.

Some of the recommendations that we came up with were that big single pane of glass for alerting and monitoring. Our enterprise operation center, we began to give them playbooks on things to look for as things may be failing. And then we really came up with a new network re-architecture.

We also developed reference architectures so that we could deploy applications the same way across multiple iterations. And then finally, we thought that we could really leverage practices from DevOps, and John can talk a little bit about our elevator pitch.

John Rzeszotarski

If you look at this, lots of people say, "That was one person logging in?" And they're like, "How did it work?" That seems very complicated. So this really gave us a lot of ammo.

Then, like I'd mentioned, I went to the conference last year, was really inspired, and I said, "Well, we have all this ammunition around all of these complexities, all of these issues, all of this technical debt that we've created. Let's build this elevator pitch." And I took this to the CIO.

Once again, everyone here will tell you it's all about a metrics-driven approach. We focused around mean time to resolution, which was everybody's heartburn. We focused around taking people out of their day jobs. The one thing we had to tell the CIO was basically, "How specifically are you going to accomplish this? It needs to be tactical before it's going to be strategic."

So you have to fill in the gaps. We really had to be very thoughtful in how we were going to execute it. We built the team before we actually presented this pitch. We said exactly the projects we would focus in on. We said the metrics we were going to achieve. We put the timelines in play very early on, actually, as part of the pitch.

So we gave the CIO everything she needed that says, "Please let us go do this. This will help your organization." And I believe it did.

So how did we get started?

We started in January of this past year, so we've actually done this in a very short timeframe. And I'll tell you how we cut corners in order to do it. But we cut corners not in a bad way but in a very good way, in a way that's going to help us in the future.

The first piece is infrastructure automation. It would take us too long to get infrastructure changed and configured and up and running. We had to move fast in order to do that.

The second piece is test automation. I can't move fast if I still have to wait and spin up 30 manual testers to go through and do an end-to-end regression testing cycle. And we really had manual testing across the board. It was kind of the standard.

And then the last piece was really the glue that needs to be there to kind of tie this together and build your build pipelines, and that's continuous delivery.

So we created a team, and it was not a very big team. Initially, it was four people. We grew to flex between 10 and 12. This team really focused around these three things. We still leveraged a lot of the testing developers that we had within our centralized testing organization. We still relied on the development teams to actually do it, but we were coaching them. And then we managed and owned the continuous delivery pipelines and the containers that we'll talk about.

The other piece here is our chief architect and our chief technology officer basically said, "We're going to angel fund this." So they're going to take money out of it. Everybody says, "How did you get money to do this?" It's like the biggest question across the financial services space, because if you're a bank, you should have money to do this, right?

But what we did was they said, "Hey, I'll angel fund this. This is our startup. This is our little baby, and we're going to take our own budget. We'll figure out how we're going to fill the gaps in on some of those resources later on."

So it was our senior leadership there that really took the risk that said, "Hey, I'll figure out a way to get the budget. I'll give you some overhead. I'll give you a little bit of freedom from the security organization and some audit and risk compliance folks, and let's get up and running."

So you do have to get all of that buy-in as well.

Everybody likes to talk about tooling. It's been kind of our living document of the tools that we've leveraged. We thought it would be kind of interesting to have a picture of somebody taking a picture of our slides, and now I see a bunch of phones out, which is awesome.

Chris McFee

Yeah. See?

John Rzeszotarski

We're trolling you.

But a lot of it really came into, in the upper areas, around build quality, around test quality. Some of the things that we did implement were things like our own little Chaos Monkey so that we could play around and determine if something fails, how does it fail? And that actually saved us recently as we were going into production.
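The talk does not describe how the in-house Chaos Monkey works, so the following is only a rough Python sketch of the general idea: take down one instance at random, then check whether the service still has enough capacity. The instance names and the `min_running` threshold are hypothetical.

```python
import random
from typing import Dict, Optional


def kill_random_instance(instances: Dict[str, bool], rng: random.Random) -> Optional[str]:
    """Mark one running instance as down and return its name (None if nothing is running)."""
    running = sorted(name for name, up in instances.items() if up)
    if not running:
        return None
    victim = rng.choice(running)
    instances[victim] = False
    return victim


def service_available(instances: Dict[str, bool], min_running: int = 1) -> bool:
    """The service counts as available while at least `min_running` instances are up."""
    return sum(instances.values()) >= min_running


# Hypothetical pool of three instances; a seeded RNG keeps the run repeatable.
instances = {"web-1": True, "web-2": True, "web-3": True}
rng = random.Random(42)
victim = kill_random_instance(instances, rng)
print(f"killed {victim}; service available: {service_available(instances)}")
```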

The other thing is there's not one tool for everything. So if you get somebody that comes in and says, "I will sell you one unit of DevOps," this kind of shows that that's not really the case. There's a lot of different tooling that enables capabilities that allow you to go faster.

Another one to call out here is we use Hystrix, which is Netflix's framework for doing circuit breaker pattern. So we're a bank. We leverage a lot of application service providers, like everyone else, I'm guessing, in large enterprises, and this was huge.

The things that are coming out of Netflix really amaze me. So circuit breaker pattern, if you're not familiar with it, it's just like you have your circuit breaker at home. If you plug too many things in, the switch is going to flip, but it's not going to take down your whole house. It's just going to take down that stream of outlets that goes to that circuit breaker.

Same thing. If I've got a service that I don't want to take down the rest of my online banking experience, whether that be our bill pay services or our credit card services that we may get from MasterCard or Visa, I can actually utilize this framework. And we ended up finding even more benefits with the monitoring that kind of came out of it as well.
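A minimal Python sketch of the circuit breaker pattern described here. It is not Hystrix's API (Hystrix is a Java library); the class name, thresholds, and the `bill_pay_summary` function are assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; retries after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, allow a trial call (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # Fail fast without touching the dependency.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def bill_pay_summary() -> str:
    """Hypothetical call to an external provider that is currently failing."""
    raise TimeoutError("provider not responding")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(3):
    print(breaker.call(bill_pay_summary, lambda: "Bill pay is temporarily unavailable"))
print(breaker.is_open())  # True
```

The third call never reaches the failing provider, which is the point of the pattern: one slow dependency degrades a single feature instead of the whole page.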

Chris McFee

Yeah, and I'll actually talk a little bit more about that monitoring piece later on, but that project has been invaluable for us from a data perspective.

John Rzeszotarski

Lots of times, people love to make up their own words at these conferences, so I'm going to jump on that bandwagon. There's been a big buzzword called Mobilegeddon and Generation Z. I don't know if anyone's heard this, but I hate it.

But I said, you know what I'm going to do? I'm just going to change that, and I'm going to call it Containergeddon and Generation C, because I'm a big fan of containers, and I do think they are changing the world.

From an infrastructure perspective, when you're delivering a product, the first thing that you need is infrastructure, right? Whether that be cloud or whether that be on-prem, you still have to have some CPU, memory, and storage. Then we typically take it, and we slice it in half to get more money. We virtualize it. Then we install operating systems on top of it. Then we install frameworks. Then we install more frameworks, because one's not enough. Then we install platforms so that we can get warm and fuzzy with service-level agreements and vendor support and vendor lock-in. And then we install applications.

Finally, we're ready to go. And we're ready to go with that. No. Now I have to configure those. I have to configure those from a security. I have to operationalize those. Now I'm done, right? Now I have to go through and test and validate each one of these individual boxes. Now I'm done, right? Now I've got to repeat all of that because I have to patch, upgrade, and hotfix each individual component.

Each one of these lines is different teams for us. That's why it takes us two months in order to get something installed and out on a server. Oh, and by the way, I have to do that four times, because I have four different environments that I have to support. So you can imagine the amount of coordination that it actually takes in order for you to actually get all of this together.

This is where I say we cheated. We kind of cut corners. We just did Docker, right? It was like, we built it once. We built it through code. We didn't have to talk to 20 different teams. We talked to a couple of the senior engineers to make sure they feel comfortable with what we were doing. But now we're building all those platforms as code, and now I'm just migrating it around on our immutable infrastructure. I've heard that term a couple times today. I love it.

That's what gives us app isolation. It gives us application portability. And it was my super small team that created a lot of these, what we ended up calling Dockerfiles or other configuration scripts in order for us to actually build these.

Containers are great by themselves, but they do not give me high availability. I still have to tell the senior leadership that this container is available. If I'm going to run this in production, it has to be available, and it has to be secure, and it has to be able to come back up if something ever happens.

That's where Kubernetes comes in. So I call that our battleship for Containergeddon. And here's this really big sentence around Kubernetes: it's an open-source orchestration system for Docker containers and blah, blah, blah, blah, blah, blah. Very, very long sentence. And I could see Ron Burgundy saying, "What did you say? My hair."

What is it really? It's really Google, right? Google built what's called their Borg cluster management system. A lot of people use Google, and you've probably never seen, "Hey, we're not going to run this search for you right now. We're in the middle of a deployment." You don't see that, right? You don't see their outage windows at all. They have to be up, and they have to be able to service all those customers. They don't really have the option of downtime.

So what they did was they open-sourced, really, their Borg cluster management solution, and they called it Kubernetes, which is Greek for a ship's helmsman, which is kind of the steerer of the craft, which is why it's our battleship for Containergeddon.

And that gives us two most important things. One is rolling deployments. It allows me to roll into production without affecting any customers, without affecting any users. And the second one was auto-scaling, which we didn't have. These are two big things that we didn't have. So now I have elasticity within my private cloud, which is Kubernetes.

The third one that I should have written on here actually as well is agility. So now I've got my really fast way to build up containers, but now I've got a really fast way to scale and flex that infrastructure up and down.
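The idea behind a rolling deployment can be sketched as a small simulation in Python. This does not call the Kubernetes API. The `max_unavailable` parameter is loosely modeled on the Deployment setting of a similar name, and the replica list is hypothetical.

```python
from typing import List


def rolling_update(replicas: List[str], new_version: str, max_unavailable: int = 1) -> List[int]:
    """Replace replicas in batches; return how many replicas were serving during each batch."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    serving_during_batches = []
    for start in range(0, len(replicas), max_unavailable):
        batch = range(start, min(start + max_unavailable, len(replicas)))
        # While one batch restarts, the remaining replicas keep taking traffic.
        serving_during_batches.append(len(replicas) - len(batch))
        for i in batch:
            replicas[i] = new_version
    return serving_during_batches


pods = ["v1", "v1", "v1", "v1"]
serving = rolling_update(pods, "v2", max_unavailable=1)
print(pods)          # ['v2', 'v2', 'v2', 'v2']
print(min(serving))  # 3
```

Because capacity never drops below three of four replicas in this run, users keep being served throughout the release.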

Chris McFee

As we're talking about delivering our applications quickly, we need to make sure that we're keeping the quality up as well. So in an effort to achieve that quality as we're delivering these things faster, we needed to make sure that testing wasn't an afterthought.

Honestly, we started adding testing frameworks, and we had wrongfully assumed that somebody would just be able to pick up things like Selenium or WebdriverIO. So as you're going through a process like this, make sure that you account for the spin-up time of learning new techniques and new technologies.

But at build time, what we're doing is we're running our unit tests as it's building. We're then running API tests to make sure that we're getting data in and out that we expect. And then we're doing automated web regression testing, which we had never had before.

So on every given build, we're running over 5,000 tests, and we're doing it in under 15 minutes, which is kind of amazing for us.
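As a rough illustration of the build-time ordering (unit tests, then API tests, then web regression), here is a small Python sketch of a staged runner that stops at the first failing stage. The stage names and the trivial checks are placeholders, not the team's actual test suites.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, List[Callable[[], bool]]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order; stop at the first stage that has a failing test."""
    completed: List[str] = []
    for name, tests in stages:
        if not all(test() for test in tests):
            return False, completed
        completed.append(name)
    return True, completed


# Placeholder checks standing in for real unit, API, and browser tests.
stages: List[Stage] = [
    ("unit", [lambda: 1 + 1 == 2]),
    ("api", [lambda: {"status": "ok"}["status"] == "ok"]),
    ("web-regression", [lambda: "Account Summary".startswith("Account")]),
]
print(run_pipeline(stages))  # (True, ['unit', 'api', 'web-regression'])

# Throughput implied by the figures in the talk: 5,000 tests in 15 minutes.
print(round(5000 / (15 * 60), 2))  # 5.56
```

At more than five tests per second on average, a suite that includes browser tests presumably runs with a good deal of parallelism, though the talk does not say how.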

John Rzeszotarski

Yeah. I think it's amazing for the industry. Anybody that's messed around with Selenium knows sometimes you get yourself into a little bit of trouble.

So now really talking about the continuous delivery, the glue, right? How do you get all of these things to talk together?

The thing I like about this slide is along the top, everyone's a contributor for the most part to something that's in source control, that's in version control, right? So our developers are obviously checking code in. But now we're having our test analysts and our business analysts check in Gherkin. They're actually checking in something I'm going to plug into my automation service, and I don't have to rely on manual people putting spreadsheets together in order for me to say what's actually happening and what's right or wrong, what's the truth.

Our test developers, so instead of manual developers, we have real developers. And that was what Chris had mentioned, a little bit of a hard thing to digest initially, but now they're contributing to source control.

And then our infrastructure team: all of it going into source control for us to build our containers and to make configuration changes all through code.

Another real cool thing in here is we do continuous integration in our infrastructure space now. So we use a platform called KitchenCI, which actually runs all of our infrastructure code tests and validates that the infrastructure's got the right security parameters, it is configured correctly with the specific right versions, et cetera.

That's super powerful. And I'll tell you, that's when I was really like, "Aha. I can't believe we just did this. I have a report here that shows I just did continuous integration on all of our infrastructure."
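The kind of check an infrastructure CI run performs can be sketched in plain Python. This is not Test Kitchen's API; the policy keys, expected values, and the `image_config` sample are invented for illustration.

```python
from typing import Dict, List

# Hypothetical policy: required settings and their expected values.
POLICY: Dict[str, str] = {
    "tls_min_version": "1.2",
    "run_as_root": "false",
    "java_version": "8",
}


def validate(config: Dict[str, str], policy: Dict[str, str] = POLICY) -> List[str]:
    """Return a list of human-readable violations (empty means compliant)."""
    violations = []
    for key, expected in sorted(policy.items()):
        actual = config.get(key)
        if actual != expected:
            violations.append(f"{key}: expected {expected!r}, found {actual!r}")
    return violations


image_config = {"tls_min_version": "1.2", "run_as_root": "true", "java_version": "8"}
for problem in validate(image_config):
    print(problem)  # run_as_root: expected 'false', found 'true'
```

Failing the build whenever the list is non-empty is what turns a configuration standard into something enforced on every change.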

So I love this slide because this is what we get to be proud of. This is why we get to be up here.

About a year ago, we announced that we acquired First Niagara, and we had to get our new online banking platform up and running before we were actually able to onboard these new customers from First Niagara. First Niagara had different capabilities than Key. In fact, they had better capabilities in their online banking platform, so we had to get our new platform up and running.

So all these things we put together allowed us to move very fast from a development perspective in order to accomplish that. But nothing showed it more than the day we went live, customer day one with First Niagara. We had the biggest burst we've ever had in our online banking experience.

And in the middle of that, we were getting a lot of phone calls, or we were getting a lot of feedback that came back that said, "You know what? It's a little hard for me to enroll." Some people were struggling.

We were able to do 10 deployments in the first four days. We did those deployments in the middle of the day during the highest peak volume we've ever had. And we did that because KeyBank's online banking platform runs on Kubernetes, runs on Docker, and runs on open source. And I can't tell you how jazzed I am to tell you that. I just think that's awesome.

Like Chris mentioned, the test automation was another huge differentiator for us. What we used to call test automation, which still required manual testing, was really nothing, I think, is what we realized.

We were able to run twice as many test scenarios in a fraction of the time that it actually took to execute those tests. And we were actually finding defects. A lot of people say they're doing test development, and the first question I like to ask is, "How many defects are you finding in your test automation?"

We were averaging about 10 in the middle of the biggest amount of feature development that we were doing. The problem we did have, though, was we were fixing these things before they were actually ever logged, because the developer is getting this instantaneous feedback. And so they're like, "I can go fix that before I put this defect in." And we don't want to slow anyone down.

So this is an area where we have to improve, and we kind of manually kept some of those numbers together. But extremely proud of the team, and this is our proud slide.

Chris McFee

One of the big things that we've done is we've started to employ, and have started to go down the path of contributing to open source. So we use visualizations like Grafana, Graphite, in order to look at the health of our servers.

We mentioned Hystrix before, so this is a picture of Hystrix that really shows you how your service transaction times are going. We actually are in the process of writing a Sensu plugin that will allow us to feed that directly into Graphite so that we have a more historical view of some of the things that are coming in and out of Hystrix.

And then we've also started to implement Capital One's Hygieia project. And we're in the process of creating a plugin for Kubernetes so that we can see where in the pipeline our code is from a Kubernetes perspective.

John Rzeszotarski

Isn't it kind of cool that KeyBank is using Capital One's open source project? Isn't it awesome how far open source has come? It's really cool.

When are you guys using yours?

Chris McFee

That's right, and it's coming. It's coming.

John Rzeszotarski

I will say, I love the little plugin we built for it, though, because I know anytime anybody says, "Hey, this environment's down," I'm like, "Did you check Hygieia?" Because environment health is all there, and the revisions within all of our environments are actually all there.

Chris McFee

Some of the lessons that we learned: obviously, as a bank, we're in a highly regulated industry. We also have our ITIL processes. So in order to try and get through some of those hurdles, communication has been critical.

We met early and often with our ITIL partners. We met early and often with our risk, audit, and security partners, and really showed them the benefits of immutable infrastructure, infrastructure as code, and all the tests that we were putting in place.

Legacy skills in the offshore model. So one of the big benefits that we were getting from our offshore model from a testing perspective is we could get our development done during the day, send it over to our QA partners offshore, and they would test it overnight so that we could come back in the morning and begin to fix defects.

Moving away from that has been a little bit of a struggle. So one of the things that we've done is they've started to actually shift some of their skill sets so that they're providing some of the test automation for us, so that we just keep increasing our code coverage.

And then the other thing is change management for our existing teams. This kind of change has been a real shock, and nobody likes big, disruptive changes. I know I don't.

So as they're inundated with their backlogs and supporting their current infrastructure, they're kind of pushing back on us a little bit. And quite frankly, some of their managers are protecting their employees, which we understand.

But in order to overcome some of that, what we really want to do is start injecting some of what John calls DevOps ninjas into some of our delivery projects to speed along the initiatives and to give them that skill set and the practices.

And the other thing that we want to do as we start to scale is we want to start to implement what Target has done with their dojo. So people can come in, we kind of train them up, and we release them back into the environment. And hopefully, it's more of a snowball effect.

So that's really our journey so far.

John Rzeszotarski

Yep.

Chris McFee

Thank you, guys.

Q&A

John Rzeszotarski

I don't know if there were any questions.

Moderator

You have three minutes for questions.

John Rzeszotarski

Oh, we have three minutes. Oh, question.

Q: So my name's Tia White. I'm with Oracle. Congratulations. Question for you, being a bank. Can you talk a little bit on that first bullet around segregation of duties and what you did to make sure that you...

John Rzeszotarski

The regulatory side of things?

Q: Yes. Segregation of duties and IT access control.

John Rzeszotarski

A: So once again, we weren't breaking the model, right? Developers were still doing development. They didn't have access to production. They didn't have anything there.

My team was kind of split in half, and that's the way we decided to group it from an ITAC perspective. I have individuals on my team that are responsible for test development and development, and then I have individuals that are infrastructure engineers that are my infrastructure developers. And so they have access to some production systems, just so that they can monitor and configure those.

But they are developers. It's a bit of a shade of gray. I'll just tell you that I was still able to slide all my guys into the existing ITAC policies that we had out there to adhere to it.

The other thing is everything's automated in our pipeline, so there's no real touching of the code once it gets past development. So that segregation of duties is still there. They're not actually able to touch the servers. They're not able to change the code. And because everything is baked into an image, we can guarantee the lineage back to where it was built, how it was built, and what source code revision it was built from.
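One way to picture the lineage guarantee is a registry keyed by a hash of the image contents, sketched below in Python. The talk does not say how lineage is actually recorded; `BuildRecord`, `LineageRegistry`, and the sample values are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BuildRecord:
    source_revision: str  # e.g. a version-control commit id
    pipeline_run: str     # identifier of the automated build


class LineageRegistry:
    """Maps the SHA-256 digest of an image to the build that produced it."""

    def __init__(self) -> None:
        self._records: Dict[str, BuildRecord] = {}

    def register(self, image_bytes: bytes, record: BuildRecord) -> str:
        digest = hashlib.sha256(image_bytes).hexdigest()
        self._records[digest] = record
        return digest

    def trace(self, image_bytes: bytes) -> BuildRecord:
        # Any change to the image changes its digest, so a modified image has no record.
        return self._records[hashlib.sha256(image_bytes).hexdigest()]


registry = LineageRegistry()
registry.register(b"image-layers", BuildRecord("abc123", "run-42"))
print(registry.trace(b"image-layers").source_revision)  # abc123
```

An image altered after the build would hash to a different digest and fail the lookup, which is what lets an auditor trust the recorded source revision.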

Q: Okay.

John Rzeszotarski

Yeah.

Any other? Oh.

Q: So myself is Sunita. By the way, congratulations for the extraordinary work that you guys have done.

John Rzeszotarski

Oh, thank you.

Q: I have a question. We do a lot of system integration, and one of the biggest things we struggle with our clients is they think that we are having so many changes that we can't really do automation. But really the problem is that they have monolithic systems. So whenever they touch anything, everything breaks, and they think that they have to keep changing their automation scripts. So how did you get past those challenges, and how were you able to convince everybody to come on board for automation?

John Rzeszotarski

Okay. So it's a good question.

Audience Member

Can you repeat the question, please?

John Rzeszotarski

Okay, I'll do my best.

In a world where we have monolithic applications, and you are trying to keep up with automation within groups that are asking for change, how do you... Oh, I hope I said that right. How do you overcome those challenges?

And I guess I would say, one, we were lucky. We chose projects that were building more greenfield, so we didn't have to deal with the legacy pieces.

The other thing we did was we took people out of their day jobs. They were 100% focused on building this automation. That's what kickstarted us.

Now our next big challenge is exactly your challenge, which is, how do I put this within the rest of the organization? And I think a lot of it's going to be like Chris mentioned. Let's take some of our big change agents and let's start putting them on teams to say, "Hey, guys, here are some of the lessons we learned about how we can do this faster."

And the other thing is you can take pieces-parts. So if there's something that there's not a whole lot of value in, let's say for whatever reason, there's not a whole lot of value in continuous delivery, you don't have to do continuous delivery. Do what adds the most value for you.

So one of the things that we're looking at as a part of our monolithics is we're not going to be able to provide that automated pipeline for them, but we can help them do automated testing so that you can add value in ways like that.

Moderator

Time's up.

John Rzeszotarski

Time's up. But we'll be around if anybody has any questions.

Moderator

Okay. Give them a...

John Rzeszotarski

Thank you, guys.