Augmenting the Org for DevOps

San Francisco 2017

SVP, Head of Digital Channel Technology · KeyBank

Director of Continuous Delivery · KeyBank

Year two in our DevOps journey is all about scaling out our success. We had an amazing year last year, in which we helped accelerate an acquisition of another bank with a continuous delivery process and modernization of our platforms. So much so that our CIO called it out in a Wall Street Journal article: http://blogs.wsj.com/cio/2017/03/23/keycorp-cio-develops-digital-construction-zone/?platform=hootsuite

What we needed to do now was move out of a bi-modal approach and extend this collaboration to the rest of the organization. That drove us to review how we got where we are and to use lean concepts to drive us where we need to be. Our story is about how we took a highly distributed and siloed organization and started breaking it into teams that are organized around how we build solutions, aka Conway's Law. We will also talk about our next steps toward quicker and more stable releases: taking what we did with application and web server technologies and applying it to databases.

There is a lot in progress right now, but the entire company is now embracing DevOps in all parts of the organization. We still have a lot of work to do, like internal DevOps days and building that culture of curiosity, but starting out by positioning folks in a place to embrace these changes is exciting for us.

Here is the link to the CNBC Mad Money show with Jim Cramer that is mentioned in this video. (We had to remove it from the video because it is copyrighted content.)

https://www.youtube.com/watch?v=-hCoEW4n6so&feature=youtu.be&t=4m57s

Full transcript

The complete talk, organized by section.

John Rzeszotarski

Hi, I'm John Rzeszotarski. I'm the director of our continuous delivery and feedback organization. I manage part of our infrastructure for Stephanie, as well as our code and release management teams and our monitoring teams.

Stephanie Gillespie

And I'm Stephanie Gillespie. I am head of digital channel technology at KeyBank for our community bank. So my team aligns directly with our retail line of business, and we design, develop, and deliver online banking solutions for our retail clients. Think online banking authenticated space, mobile apps that run on Android and iOS, and then our online account opening systems.

So, in effect, if you think dev and ops at Key, I'm the dev and John's the ops.

Who is Key? Well, we're probably the biggest bank you've never heard of. We are the 13th largest bank in the United States. We were founded in 1825, and we are headquartered in Cleveland, Ohio. So go Cavs, for all you Golden State fans out there.

We have about 1,200 branches across 15 states currently, and we have about 20,000 employees, of which about 25%, or 5,000, are within our IT and ops organization. We have three million clients and about $135 billion in assets. So that puts us at the highest threshold of the highest standard, I guess, of regulatory requirements from an OCC perspective. So not much different than our friends over at Bank of America. It makes our lives in IT just that much more interesting, but also complex.

Here's a map of the states that we're currently located in. The ones that are highlighted in black are where we have our retail presence, and the ones highlighted in gray are where we have our corporate and private banking presence. We've grown over the years through acquisitions and mergers to scale our footprint across the United States.

John Rzeszotarski

We are that 190-year-old bank, and we grew up through acquisition. And so that's a lot of technical debt. That's a lot of systems that we have to upkeep and maintain. And we did that primarily through a very traditional model.

We had a separate development team that was really focused on line of business. We have a separate security and EA team focused on standards, and we never fired anyone at Key for adding another layer of security. And then we have an operations team that is very siloed based off of technology platforms.

These teams all have to come together to develop a solution, but they all have different priorities. The development team's priority is speed to market, trying to get the features out as fast as possible, sometimes cutting corners maybe. Security and EA team is really focused on governance and standardization. And then the operations team is focused on reliability. They don't want too much change because they're just trying to keep the lights on.

When you have to design a solution, it becomes this downward spiraling effect, and you design point-to-point solutions. This is most notable that we had a significant outage in 2015. This is a representation of our online banking platform. You'll see one red user and one green user and just one transaction where they're clicking the login button.

One login transaction effectively means 200 network hops behind the scenes. We could ping-pong back and forth across our data center anywhere from seven to 30 times for one transaction. We also had single points of failure built within both data centers, so we weren't any more highly reliable by having two different data centers by any means. We were just more complex.

So we had a network outage, and this really caused a catastrophic event. We tried to fail around the network outage, and because we didn't have this diagram, which would've been nice, we made it worse.

We were down for the better part of a day. And our CIO and our chief architect demanded change, not just from the infrastructure teams, but also from the application.

Thank you.

Stephanie Gillespie

So, kind of messy.

Enter Digital 17, or D17, as we called it at Key, which was really the name of the project that we were going to implement in order to get rid of all that complexity and craziness that John just showed on the screen. Digital 17 started out in 2015 and was intended to be a two-year project for the two million clients that we had in our online banking digital applications at the time.

But the question wasn't necessarily, "How could we create an online banking application that didn't suck?" The real question was, "How can we build the digital framework that allows us to test and learn inexpensively?"

And the answer was to change everything.

We looked to change the architecture, getting away from that monolithic single application that everything ran within, and breaking it out into a three-tiered architecture, separating the user interface from the channel services from the core processes and the enterprise services.

Then we looked at the user experience. How do we redefine the user experience and break it up into a widget-based design that we could then extend to both our web and our mobile applications?

Then we focused on people, and we had to re-skill the team that we had to learn the new technologies and to learn the new frameworks.

Then we changed the way we worked and collaborated together by leveraging the agile-based practices. Everything needed to be built to change and change quickly.

And good thing for that, because a few months into the project, Key announced we were going to purchase First Niagara. And oh, by the way, those one million clients that we had just purchased were going to be migrated into Key's environment before our Digital 17 application was going to launch.

So wow, what do we do, right? I think any reasonable organization would say, "Hey, let's stop the project. Let's focus on this acquisition." At the time, it was the largest banking acquisition in the country since the 2008 crash, so we needed it to be successful.

But that meant bringing one million clients onto that crazy complex platform, only to then migrate them to a new experience and a new platform a few months later.

So we made the decision to accelerate. That 24-month project became an 18-month project, and we had to scale for an additional 30% user base. We kept the name Digital 17, even though we were actually going to implement in what was now Q2 of 2016.

So why did we think we could accelerate? And if that was the case, why wouldn't we have just done that to begin with? Why wouldn't we have just had an 18-month project to start?

Well, really, to make this work, we had to change the traditional approach to application development and infrastructure management at Key. There were two things that we wanted to focus on, and both involved speed.

We needed speed in decision-making on the left side of the equation. We identified decision owners. We held them accountable. Answers were needed in 24 hours or less, or we were escalating.

And because of that, we also needed a way to deploy our applications quickly, and we needed a way to scale our infrastructure quickly. Because if we were making quick decisions, chances are we weren't going to get everything right the first time. And when those scenarios cropped up, we wanted to be able to change them quickly.

There's actually a great quote from Kurt Bittner from Forrester Research which I think sums this up perfectly, which basically says, if agile was the opening act for a great performance, then continuous delivery would be the headliner.

So really, you need both. You need to consider the full end-to-end spectrum. And hence our exploration into DevOps.

John Rzeszotarski

Right. So we went to the DevOps Summit in 2015, like Gene mentioned, and we got to hear Target and Capital One, and you're like, "Oh, man, we would never do that. Not at our bank."

But at the end of the day, after we went, we basically came out with three big things. One, we had to have executive support. You're going to have to get your CIO's buy-in. Two, it's got to be metrics-driven. For us, it was mean time to resolution, it was release frequency, and it was some of our service levels for infrastructure services. And then lastly, the focus of removing bottlenecks had to be at the forefront of every problem that we came across.
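As an illustration of the metrics-driven point, here is a minimal sketch of how two of the metrics named above, mean time to resolution and release frequency, could be computed. The function names, record layout, and sample data are all hypothetical; nothing here reflects KeyBank's actual tooling or numbers.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - opened) over a list of (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


def releases_per_week(release_dates, window_start, window_end):
    """Releases inside [window_start, window_end) divided by the window length in weeks."""
    count = sum(1 for d in release_dates if window_start <= d < window_end)
    weeks = (window_end - window_start) / timedelta(weeks=1)
    return count / weeks


# Hypothetical sample data, for illustration only.
incidents = [
    (datetime(2016, 1, 4, 9, 0), datetime(2016, 1, 4, 11, 0)),     # resolved in 2 hours
    (datetime(2016, 1, 12, 14, 0), datetime(2016, 1, 12, 18, 0)),  # resolved in 4 hours
]
releases = [
    datetime(2016, 1, 5),
    datetime(2016, 1, 12),
    datetime(2016, 1, 19),
    datetime(2016, 1, 26),
]

mttr = mean_time_to_resolution(incidents)
freq = releases_per_week(releases, datetime(2016, 1, 1), datetime(2016, 1, 29))
print(f"MTTR: {mttr}, releases per week: {freq:.2f}")
```

Tracking both numbers over the same window makes it easier to see whether releasing more often is helping or hurting time to recover.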

We can't change the entire organization overnight. So Matthew Skelton wrote a great blog post that talks about DevOps as a service and about how you have to act tactically but think strategically of how it's going to evolve into the organization.

Stephanie Gillespie

All right. So we had to figure out where are we going to start.

What you see on the screen is our traditional waterfall enterprise software development lifecycle framework, which I'm sure many of you have some form or factor within your organizations. But the point was, we couldn't wait for a top-down corporate initiative to tell us how to get started with DevOps. We had to purposely pick certain areas of that framework where we thought we could be successful in order to get this movement started.

What you see highlighted in blue with the circles is where we chose to start. We chose to start with the installation of the infrastructure and the configuration, and that's where we brought in containers and Kubernetes.

We also looked at automated testing, and how do we automate the regression testing? How do we automate testing so that we can identify defects earlier in the cycle at the time that the builds are being committed?

And then how do we leverage continuous delivery within our development pipeline? And then we layered in agile practices throughout the lifecycle as it made sense in order to enable that quick decisioning.

So the point was, it was a grassroots, bottom-up initiative. We chose where we wanted to start, and then we took the pitch to our executive leadership team.

Here's John riding up the elevator with our CIO, Amy Brady. But trust me, Amy is a lot less scary and a lot more attractive than the picture, but we thought the picture was fitting.

The crux of the conversation was, look, we are in the middle of this major program to re-platform our application, and now we have to accelerate for this First Niagara acquisition. So we needed to think differently about the way we run and operate our platforms to bring the speed. And obviously, the conversation went well, or we wouldn't be here today.

So I would just say, as you guys look to carry these movements forward in your own organization, just be thoughtful about who you're making that pitch to, because DevOps is not an easy thing to explain to people, especially outside of the technical community. Just think about who that pitch is going to and think about what would be in it for them, and hopefully you'll have the same success that we did.

John Rzeszotarski

So you don't have to be handcuffed to go meet our CIO, by the way.

I'll talk briefly about containers, and I love the term "containergeddon" because I think containers are going to take over the world. The way I like to explain it is, when we start and we want to build a specific product, we start with CPU, RAM, resources, disk. We virtualize that to get more bang for our buck. We install operating systems on top of it. We install frameworks. We install more frameworks, because one's never enough. Then we install a platform so we can get vendor support and vendor lock-in sometimes. And then we install applications on top of that.

We're done? No. I got to go through and I got to configure it. Some have security vulnerabilities. We got to get them fixed. So I have to operationalize it, make sure I have the right alerting, make sure I have the right monitoring all put in place.

Now I'm done? Now I got to test. I got to test those dependencies between all those different layers to test and validate to make sure the application's working.

Now I'm done? Nope. It's time to start all over again because I got to patch, upgrade, and hotfix the application.

At Key, each of these different lines and boxes is typically a different team. So a project manager's having to cross-coordinate across dozens of teams in order to put a system together. And by the way, we don't just have to do that once. We have a dev environment, and an IT environment, and a QA environment, et cetera.

Containers: game changer. I build it once as an image, I build it through code, and I just deploy it on top of the infrastructure. The patching, upgrading, hotfixing, all goes away. I deploy it with the application. I'm also giving more responsibility back to the developers. They control more of their configuration for what they need to operate with. But I also have to protect the developers from themselves, because sometimes they want more resources than I'm willing to give them.

And this is where Kubernetes enters.

So we still have to make these containers highly available. We didn't go to Kubernetes because we wanted to build a platform as a service and let developers go build out innovative apps. We went there for reliability. We went there because Google's always up, and we want to emulate that. Their rolling deployments, their auto-scaling, their high availability they offer: that's exactly what we wanted, and that's what we put together.

So this team that we had was really building to offer these capabilities. And one other area we had to focus was continuous delivery. Because we're a very large bank, 700 applications, hundreds of project teams, lots of teams doing things differently, we wanted to standardize those pipelines but offer really good flexibility.

XebiaLabs XL Deploy and Release came in and has done a great job for us.

The other thing to mention is that Steph's team and our team had to collaborate together to help build reliability within the application. And we built something called the circuit breaker pattern.

All banks have to use third-party services, whether it's FIS or Visa or Mastercard, and we don't want when they're down to affect us. So this is where the circuit breaker pattern comes in. We use Netflix's Hystrix framework to really help safeguard our application, and it's come in handy countless times.
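The circuit breaker idea can be sketched in a few lines. This is a simplified illustration of the general pattern, not Hystrix itself (Hystrix is a Java library with its own API) and not the talk's actual implementation. The class name, thresholds, and fallback mechanism are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Stop calling a failing dependency for a while instead of waiting on it."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("dependency temporarily disabled")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            if fallback is not None:
                return fallback()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each third-party request, for example `breaker.call(fetch_card_status, card_id, fallback=lambda: "unavailable")`, where `fetch_card_status` is a hypothetical function. While the circuit is open, users get the fallback immediately rather than waiting for a timeout on a service that is known to be down.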

Stephanie Gillespie

Great. So yeah, great tools. What's DevOps without great tools, right? And what's DevOps without automation?

For Key, with the speed that we were trying to sustain with D17, we needed to really focus automation around testing, because you're only as fast as your slowest bottleneck. And manual testing takes a long time.

Given the fact that we had one million logins per day, we weren't willing to risk quality for the sake of speed and shortchange testing. So we took time to build automated test scripts, which got us much more coverage than we were used to, and also took much less time, from 20 hours down to less than 12 minutes.

It really helped to increase our confidence in what we were deploying, because those defects were caught earlier in the cycle at the time that the code was being built and migrated through the environments. So that was, I think, a big win for us.

How did everything go? We talked about the tools. We talked about testing.

D17 actually was very successful. We met our accelerated timeline. We migrated Key clients into the new platform, and we got prepped and ready for First Niagara.

So when that day came to migrate those one million clients into Key's environment, of which 500,000 were online banking clients, things didn't go so well.

But it wasn't the technology. The technology actually performed successfully. What it ended up being was a decision around the user experience and the first-time login. Where we thought we were going to be making it easier for clients, we were actually making it really confusing, and they were locking themselves out left and right.

Calls started flooding into the contact center. We had over two-hour wait times. There was a social media frenzy about how bad KeyBank screwed this up, which was not the headlines we were looking for, given all the planning and preparation that went into the migration. And the technologies were performing.

So although it was a firestorm, it actually ended up being kind of a blessing in disguise for the DevOps movement at Key, because this is when the real beauty of DevOps came into play.

For those of you that have read The Phoenix Project, this was eerily similar to maybe that first implementation approach for Phoenix.

But what happens, so envision a command center, and envision you've got senior executives, you've got your business, you've got your developers, you've got your operations teams, you've got your testers all hunkered down in a room, talking, working, collaborating. And we made changes to the user experience on the fly.

We had 10 changes within four business days, all done during the day, and not a single one impacted our clients or brought the system down. We were able to quickly make changes to the way the application was working and the experience that the users had in order to manage their first-time login.

So with that, DevOps basically sold itself that day.

It really helped those people at Key realize that we were well on our way to becoming that 190-year-old digital bank that our CEO, Beth Mooney, so often refers to. And it's not every day that an FI CEO gives a reference to DevOps, and I think Beth Mooney summed it up really well in her quote, and it was a very proud moment for all of us in Key Technology.

So where do we go from here?

With all that success, there was a real appetite to get DevOps more in the enterprise, like where else can we put it? So we started looking at other areas that could leverage this framework. Where can we grow one team at a time?

And we know it's going to be an iterative, continual process, but it's not also just about expanding DevOps across different teams. It's also about how do you go deeper with DevOps into the teams that are currently using it?

As an example, in digital, we're running in containers, we're leveraging a lot of these frameworks, but our releases, they're still pretty big. So how do we break those up into smaller chunks? And how do we leverage containers more effectively so that we can deploy changes to just small subsets of users and test those out before we roll it out to the broader user base? So we're thinking about those things.
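One common way to release a change to a small subset of users first is deterministic percentage bucketing. The sketch below is a generic illustration of that technique under assumed names (`in_canary`, the feature label, the user IDs); the talk does not describe how KeyBank planned to implement this.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` (0-100) of buckets.

    Hashing feature and user together gives each user a stable bucket per feature,
    so the same user sees the same version on every request.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100


# Hypothetical usage: route roughly 5% of users to the new login flow.
if in_canary("user-12345", "new-login-flow", 5):
    version = "new"
else:
    version = "current"
```

Because a user's bucket does not change when the percentage is raised, everyone already on the new version stays on it as the rollout widens from 5% to 20% to 100%.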

D17 was the flagship application that started running within this framework at Key, and since, our corporate banking online application is now running in this environment as of Q3 into Q4 of this year, and our online account opening application is moving to this framework in December.

So there's a lot of excitement. We're going to keep moving, but we're also going to not just expand, but continue to go deep into some of these principles, and I'm excited to maybe learn some of that from some of you guys here this week.

John Rzeszotarski

As we started to scale this, not all the engineers got on board. Kind of shocker, right?

We definitely have some passionate change agents that want to change, sometimes even change a little bit too much. But we also have the engineers that have been able to keep reliable systems up for a very long period of time. When these two guys get together, there can be a little bit of animosity.

It doesn't necessarily mean it's a bad thing, right? Because you wouldn't stand up for what you believe in if you didn't strongly believe it. So there's right and wrong points on both sides of this. You have to be empathetic. That's a must-have, and that's hard to say for an engineer, to be empathetic also as well.

Stephanie Gillespie

Mm-hmm.

John Rzeszotarski

But the way we're handling this is through three different types of leadership.

One, leadership at the engineering level to actually show how to do the changes. Yes, you can do this in a large enterprise, and I'm going to show you how.

We do think there's leadership at the middle management layer that's also required. That's to sell the business case, put the plan together, prioritize it as part of that continuous improvement backlog.

And then lastly, you still need that executive leadership that's going to give you the funding and has the strong beliefs that you have to continuously invest in your systems long term.

Stephanie Gillespie

Yep. And can't underestimate the importance of people enough, I guess, in at least how this has gone with Key.

You can see over time in the last several years, our employee base within digital applications is growing, and we're changing the mix of contractors to employees. It's not because contractors don't bring value and expertise, but it's because we also want to protect the risk of critical subject matter expertise staying internal within the organization.

So as we're growing our employee base, we're growing not the same type of talent that we've always grown. We're not necessarily looking for a developer anymore who knows how to write their application code. We're looking for an engineer who knows how to write application code, but also understands how that code gets deployed, also understands how that code works within the broader ecosystem, how caching can come into play, how proxy settings can come into play, and just a full end-to-end engineer.

So we continue to grow the team. One of the things that we had done just actually in July is Key announced the purchase of HelloWallet. So here we are, a bank buying a software company. HelloWallet is a startup. They've got about 32 employees in the Washington, DC area, and we purchased their capability, which is going to be a core strategic capability for Key's strategy going forward, but we also purchased the talent.

This is a group of engineers who have new ways of thinking, challenging the historical, traditional approach that we've got in our 190-year-old bank, but really embracing open source technologies, embracing agile, and challenging us in the way we've been thinking. So we're really excited about having this team on board and bringing them into our thought leadership.

John Rzeszotarski

We brought in a new CTO this year, and he definitely is behind a lot of the practices and principles of DevOps. In his first town hall, he came out and made a big statement that says, "Look, we need more technology generalists if we think we're going to be able to move to the speed the line of business wants. So Linux engineers, start using a mouse. Windows engineers, start using a keyboard."

Yeah, you guys definitely got that one.

And they love it, right?

His other analogy that he uses that I really like, I think it hits home, is that we're actually air traffic controllers, right? We have to land 99 planes. If we land 98 out of 99, it's a failed day. That is not acceptable. But our job now is to get those planes out faster, get more of them out, and make sure that they're always on time. And doing so, we have to reevaluate our organizational structure to try to optimize for speed.

We were a vendor that very much focused on keeping the lights on versus continuously improving the infrastructure. Now, we were very eager. We're starting to take baby steps. They're kind of hanging outside the door.

So we followed our DevOps books under Conway's Law and said, "All right, well, how do we get work through the infrastructure pipeline today?"

There needed to be a much bigger focus on planning. We planned kind of within the silos, and we really just planned to do the minimum amount of upgrades, et cetera. So there had to be a better focus on planning and prioritization.

Infrastructure development was also at the forefront. It's funny because we've had engineers push back on automation because they say, "Well, we don't do it that often." And I'm like, "Well, wait a second. That's exactly why you would like to develop it and automate it, because if you're not touching it very much, most likely you're going to make a mistake the next time you do it. So it's going to be more accurate, it's going to be traceable because it's going to be sitting in source control, it's going to be versioned, and it's going to be much easier for that next person to pick that up. So you want to automate it if it's not a very common task just as much as you want to automate it if it is a common task."

We do think that we want to organize the infrastructure development by our lines of business to help share the prioritization. And then we still have 700 applications that we have to support in keeping up and running. Some of them are very specific, so we still have specialists, but we're striving to make those specialists more generalized.

Now, throughout our entire technology organization, there is a huge focus on continuous learning. And one of the things that our CTO also did was implement something called our 8 AM, and it's a postmortem call. He's very vocal in saying that it's not a blame session. He's very adamant about it.

But we all need to learn together. We all need to be able to figure out why is this and how is this affecting our customers, and let's continuously improve everywhere within the bank. Most of the entire technology organization calls into this 8 AM, and it's actually inspired a lot of people and drove a lot of passion.

Ultimately, we still have to make sure that each one of our organizations is able to federate to support Stephanie, because obviously being able to hit our line of business objectives is the most important thing as well.

So we've had a really good year. Last couple years, actually. We were probably one of the first, if not the only, large banks running their digital applications in Kubernetes and Docker, and that was back in 2016. Our test automation story has just saved us countless times.

And it's thanks to guys like John Willis, who, when he took this picture, I think he was opening up for an '80s rock band, by the way.

But no, he made this statement back in 2010 that we know that DevOps will have made it when we hear it on Mad Money. And one of our engineers found this clip after the day it aired, and we thought it was pretty fitting. It's pretty awesome when Red Hat goes on Mad Money and they end with Cramer pitching KeyBank stock.

So all because of our DevOps journey.

I'm like, "Yeah, that's pretty good."

So thank you guys very much.

Stephanie Gillespie

Yep. Thank you.

Buy Key stock.