KeyOps: An Operations Story

London 2018

Senior Vice President & Director of Continuous Delivery & Feedback · KeyBank

Our organisation is made up of 50 folks responsible for managing change, release, and feedback for over 700 different applications and supporting an agile platform for application teams to implement a continuous delivery model.

Full transcript

John Rzeszotarski

This is Cleveland, Ohio. That's where I'm from in the United States, and I work for KeyBank. I'm senior vice president of their infrastructure organization, really what we call technology operations.

We're about $140 billion in assets. That puts us among the 15 largest banks in the United States. So we have all the regulatory requirements, but we don't have the size of the BofAs and the Capital Ones, et cetera. We're about 20,000 employees. We're spread all across. We're a full-service bank, so we have retail, we have investment banking, we have capital markets. It's a lot of systems to keep up.

Our DevOps journey started in 2015. I actually came here, listened to John Willis and Gene talk, and just got really passionate about this whole thing. I said, "Look, we got to get this going." It was actually a really good time because there was a catalyst. There was another program getting started up called Digital 17, which was basically, we were going to rewrite our online banking and our mobile banking applications, and we were going to deliver them in 2017.

We're like, this is great. We can start building a DevOps pipeline. We can really start integrating all of these great things that we're learning about. But there was another catalyst that kind of hit us, and it was an acquisition. We acquired First Niagara, and due to the timing of the acquisition, we had to roll their customers on in basically half the time that we had to deliver Digital 17.

So we basically said, "Hey, we got to go fast. We got to go now, and we got to make decisions, and we have to make those decisions quickly." So really, agile took off. We went from four releases a year to basically weekly releases, and we had our first implementation in July of 2016, and we were able to hit the October 2016 deadline to move those customers over. It was a huge success.

We automated a ton of different components, and you can watch all those stories out on YouTube. I'm not going to rehash those again. I want to tell a little bit of a new story. But the one thing I want to call out here was we had two catalysts. We had the First Niagara acquisition, and we had a program already running. But I have 800 other applications, so how do I do DevOps for those when I don't have a catalyst? And that's what I want to talk about.

So how do I do DevOps for business as usual, or operations as usual? I'm going to tee this up with two of my favorite talks. John did a talk about breaking bad equilibrium, which actually built on a previous talk, There Is No Talent Shortage, by Andrew Clay Shafer, and they do a really good job of teeing up what the issue is. Both talks walk through this concept called the prisoner's dilemma. It's part of game theory, and if you've never heard of it, I'll do a quick explanation. Richard Cook will probably call me out and say it's a horrible explanation later on.

But it's basically putting two different prisoners against each other. If prisoner A basically says, "Look, prisoner B did it. I didn't do anything," prisoner A might not get any time, but prisoner B might get three years, and vice versa. Now, if they both go crying to the police and the judge, they're both going to get the maximum sentence. So they're both going to get three years. But if they stay silent, that's actually the best outcome for both of them.

Basically what John and Andrew are saying is, you're always driven to this dominant strategy, and it's really hard to go away from it. Basically they say that if you think about Nash equilibrium, the only way to actually get the optimal outcome is if both parties change their strategy.
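The payoff structure described above can be written down as a small table. This is only a sketch: the zero-year and three-year sentences come from the talk, while the one-year sentence when both stay silent is an assumed value, since the talk only says that staying silent is the best outcome for both. All names are illustrative.

```python
# Years in prison for (prisoner A, prisoner B), keyed by (A's choice, B's choice).
# 0 and 3 come from the talk; the 1-year "both silent" sentence is an assumption.
SENTENCES = {
    ("blame", "silent"): (0, 3),
    ("silent", "blame"): (3, 0),
    ("blame", "blame"): (3, 3),
    ("silent", "silent"): (1, 1),
}
CHOICES = ("blame", "silent")


def years_for_a(a_choice, b_choice):
    return SENTENCES[(a_choice, b_choice)][0]


def blame_is_never_worse_for_a():
    """True if blaming gives A no more years than silence, whatever B does."""
    return all(
        years_for_a("blame", b) <= years_for_a("silent", b) for b in CHOICES
    )


def best_joint_outcome():
    """The pair of choices with the fewest total years served."""
    return min(SENTENCES, key=lambda pair: sum(SENTENCES[pair]))


def worst_joint_outcome():
    """The pair of choices with the most total years served."""
    return max(SENTENCES, key=lambda pair: sum(SENTENCES[pair]))
```

With these numbers, blaming is never worse for either prisoner individually, yet both blaming is the worst joint outcome, and neither prisoner can improve on it by changing strategy alone.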

Now, the reason I think that this is important for us, you're like, "Why does this have anything to do with DevOps and teams?" Well, let's say hypothetically, a security guy walked into your office and said, "Those cloud servers you want to put on there, they're not allowed to talk to the internet." And you're like, "Well, that's a dominant strategy." It's a pretty harsh one, right?

This is the unbalanced equilibrium that we all deal with. It's both parties: security, infrastructure, development, operations. We go against each other all the time, and it's funny because we work at the same company. You're like, "Why can't we just get on the same team and do the optimal thing?" But it's so hard to see that when you're driven to your dominant strategy.

So what do they say? Andrew and John say, you got to change the game. And don't try to persuade them; change the game. That's what you actually have to do.

So you can take that to your boss, and if he says you're going to have unlimited budget and unlimited time to execute your DevOps work, that's not going to happen, right? He's going to say, "If you believe me, we got big problems."

So you have to come up with a plan. And to do the plan, I really think you have to understand the constraints that live inside technology operations organizations.

One of them is cost. We're a bank. We're measured by the efficiency ratio: how much does it cost for us to make a dollar? So the less it costs us to make a dollar, the better our efficiency ratio looks. Is anybody's budget really growing in technology operations, or is it shrinking? In our world, a 200-year-old bank, it's definitely shrinking.
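As a rough illustration of the arithmetic, a bank's efficiency ratio is commonly computed as noninterest expense divided by revenue, so a lower number means it costs less to make a dollar. The figures below are made up for the example and are not KeyBank's.

```python
def efficiency_ratio(noninterest_expense, revenue):
    """Cost to produce one dollar of revenue (lower is better)."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return noninterest_expense / revenue


# Hypothetical figures, in millions of dollars.
before = efficiency_ratio(noninterest_expense=650, revenue=1000)
after = efficiency_ratio(noninterest_expense=600, revenue=1000)
```

Cutting 50 million of operating cost at the same revenue moves the ratio from 0.65 to 0.60, which is the kind of pressure a technology operations budget sits under.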

We're a bank. We're going to be highly segmented. We're going to have multiple systems. We're going to be very highly complex. We're also not going to be able to swap out all of our old systems, so we're going to have to deal with legacy. And what I mean by capacity here is really, do we have the workforce ready that is willing to take us through our next journey? So do we have the capacity, the aptitude? Do we have the skill sets to actually take us on?

So if these are constraints, if we know about them, well, let's come up with a plan. Let's use that to our advantage. I'm going to talk through a little bit about how we're going to use open source to adjust for costs. I'm going to talk through how we're going to automate to try to reduce complexity, and how we're going to defend ourselves against legacy applications. We call it the defense against the dark arts. And then from a capacity standpoint, how we're going to continue to drive towards continuous learning.

I want to make sure I call out: open source does not mean free. And if you think it means free, then you have a very bad idea of what open source is. But our chief data officer actually had this really good plan when we talk about open source. It gives us something that we call mark to market, which is a fair value estimate for what we should be paying for software solutions.

Now, I'm sure no one in here has ever signed a bad software contract that they wish they could get out of, right? It's never happened? I want to hear a hand of anyone that has not signed a contract that they wish they could get out of. Open source opens up that opportunity for you to basically drive decisions, build minimum viable products with the open source solutions that are out there.

So I think it's a great way for you to drive this as part of your operations strategy. Instead of having these big projects where it's, "Hey, we've got to upgrade this system, so these 300 apps have to come along with it, and we just turned what should be a $100,000 project into a $2 million project." You shouldn't have to do that. You should be able to break it up. And I'm going to talk a little bit more about open source as we go on.

So complexity. This is how I like to explain containers. I always use the term Containergeddon because it's still coming and it's still taking over the world. And if you saw John's talk yesterday, he did say it's the next-gen infrastructure. It's really today's infrastructure for me.

But if you're going to build a product, you're going to start with CPU, memory, disk, et cetera. The first thing we would typically do back in the day would be virtualize it to get more bang for our buck. We'd install an operating system, we'd install frameworks. We'd install more frameworks because one's never enough. We'd install vendor products and platforms because we like that vendor lock-in. And then we would install applications that finally sit on top of that.

Now we're finally done. Right? Now I have to go through and now I have to ensure that we have the right security, that we don't have any vulnerabilities within our frameworks. To operationalize it, I have to make sure I have the right logging in place, I have the right alerting in place.

Now I'm done. No, now I have to test it, and I have to test the dependencies between them all. Now I'm done. No, I start all over again because it's now time to patch, upgrade, and hotfix. And each of these are typically different teams that don't have the access to actually install and do the work on the servers.

Oh, and I got to do that for my dev environment, and my test environment, and my stage environment. So it would take us two months to get a server up and running. And you're working with the line of business, and they're like, "Hey, can you just let me use AWS? I could have done this in a couple of hours."

And this is where Docker and Kubernetes can really come in. I loved the first talk, about Jaguar. He said that it gives the responsibility back to the dev team to make a lot of infrastructure decisions. That is so much workload taken off of our infrastructure teams, and the value of that is hard to overstate. So this just reduces that complexity a great deal.

However, not everything can run in a container. Not everything is great for these types of workloads. So there's other things we can do to reduce complexity.

One of the other things we're doing is we're following Gary Gruver's model for evolutionary database design, and if you don't understand it, I'll do my best to explain it as well.

So if we're going to build a simple application, and let's say we have a database schema that's version 1.1. The first thing we're going to do is, we're a good company, so we should be building feature toggling in there. It's so ironic that the operations guys want the developers to do this the right way so that it makes their lives so much easier, but building feature toggling is a must-have requirement for doing evolutionary database design.

The next piece is, all right, well, we have to build the 2.0 version. We're going to add some fields. We're going to maybe do some transformations. And now we're telling the DBAs, you got to work within a source control management repository. However you used to do your job, it's completely different. So they got to get used to that.

So then you also have to build your application so that it supports at least one version of backward compatibility. So if we build the application to support the 2.0 schema while still keeping the feature that's running on 1.1, now we have the ability to use any of the different tools out there, and they work with all of your continuous delivery products, like XebiaLabs, to actually implement and release the 2.0 changes.

But you can see I'm not actually enabling that in the application. I'm still using 1.1 because I'm going to preview that specific release. And then when I've tested it and I know it works in production, very similar to a blue-green deployment, now I can activate that feature toggle, and I've now just done a zero-downtime database release.
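The release flow described here can be sketched with an in-memory SQLite database. This is a minimal sketch under assumptions, not the tooling used at the bank: the table, column names, and the `FEATURE_TOGGLES` dictionary are all hypothetical. The 2.0 change is additive, the application keeps reading the 1.1 column until the toggle is flipped, and no restart or outage is needed in between.

```python
import sqlite3

# Hypothetical toggle; in practice this would live in a config or toggle service.
FEATURE_TOGGLES = {"use_schema_2_0": False}


def create_schema_1_1(conn):
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customers (name) VALUES ('ada lovelace')")


def migrate_to_2_0(conn):
    """Additive change only: the 1.1 column stays, so old code keeps working."""
    conn.execute("ALTER TABLE customers ADD COLUMN display_name TEXT")
    conn.execute("UPDATE customers SET display_name = upper(name)")


def customer_label(conn, customer_id):
    """Read through whichever schema version the toggle selects."""
    column = "display_name" if FEATURE_TOGGLES["use_schema_2_0"] else "name"
    row = conn.execute(
        f"SELECT {column} FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
create_schema_1_1(conn)
migrate_to_2_0(conn)                       # 2.0 is released but not enabled
label_before = customer_label(conn, 1)     # still served from the 1.1 column
FEATURE_TOGGLES["use_schema_2_0"] = True   # activate once verified in production
label_after = customer_label(conn, 1)
```

Turning the toggle back off returns the application to the 1.1 read path, which is what makes the release safe to preview.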

So this is great, and this is really, really helpful. And if you can get your organization to start thinking in this model, it will save you from those horrendous eight-hour, 10-hour, 12-hour releases that you would have to do on the weekends.

But the next thing I'm going to say is, it doesn't solve everyone's problems. What about that vendor that doesn't version their web services? Or what about that vendor that just doesn't support this type of model, or this old application? What do we do for that?

Oh, real quick. Really important, just a note. This only works with small releases. Every developer's going to need their own database. As an operations resource, think about that. It could get a little expensive. And test automation is absolutely required.

So since this isn't going to work for everything, we have to think about how do we protect ourselves from those old legacy apps. I hope Jason Cox isn't here, because this is just not a great graphic. Not pretty. I need Disney to kind of brush this one up for us.

So we're utilizing something called the circuit breaker pattern, which is very similar to how the electrical outlets in your house work: if you plug too many things into one outlet, it's only going to take down that string of outlets. It's not going to take down the entire house because it's going to trip one circuit. Same concept for applications.

We offer many different services. We offer bill pay services, we offer peer-to-peer payment services, and all of those are basically the little strands of breakers that we need to be able to protect ourselves with.
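The idea can be sketched in a few lines. This is a generic illustration of the circuit breaker pattern, not Hystrix and not the bank's implementation; the class, its parameters, and the `bill_pay` function are hypothetical. After a set number of failures the breaker opens and returns a fallback immediately instead of calling the failing service again.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()          # open: fail fast, skip the dependency
            self.opened_at = None          # cool-down over: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


calls = []


def bill_pay():
    calls.append("attempt")
    raise ConnectionError("bill pay backend is down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
results = [breaker.call(bill_pay, lambda: "bill pay unavailable") for _ in range(5)]
```

Five requests come in, but the failing backend is only hit twice; the other three are answered by the fallback while the rest of the application keeps running.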

We use a framework called Hystrix. Do you see his mouth went upside down? My team said that the app wasn't happy because he's riding a whale, and he's probably sick, but I'm not sure.

But we use Hystrix, which is a Netflix open source framework. It's actually been invaluable to us. We've been on calls where we talk and we're blaming a specific thing. We're like, "Look, this outage is severe, and we need to really keep that in mind." But at the same time, we have to look up and say, "Guys, it could've been so much worse. If we didn't have Hystrix, we would've been down for four hours."

And so we're actually realizing that on some of these incident calls, and we're actually talking about it, be like, "You know what? You're right. We need to talk about the value of designing our systems so that they are ready for failure," just like Richard Cook talked about yesterday.

So if you don't know Hystrix, go Google it. It's out on GitHub. It's a terrific framework.

But it doesn't solve all of your problems either. And this is where you might have legacy systems out there where you say, "Look, I can't even wrap that because we're so dependent on those legacy systems." How do we get out of that? Well, this is where I think you see a lot of good modern companies adopting something called the strangler pattern.

Martin Fowler wrote a great blog about it, and what I liked about it is he says, when you're looking to replace a legacy system, you have to think about these three things. And one is, systems are always complex. If you think you're going to go rewrite it and say, "Look, let's just get it off the mainframe and put it over on this distributed system," you're never going to really understand the complexity of everything that went into that. And if you think you are, you're naive.

The other thing is, it's going to be extremely risky in order for you to accept that big bang approach. And the last one is, you're always going to want to improve it. Those requirements are always going to pop in. So break it apart, try to knock out each edge.
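One common way to "knock out each edge" is a routing facade in front of the legacy system: each request goes to a new service if that piece has been migrated, and to the legacy system otherwise. The sketch below is illustrative only; the class, the route names, and the handlers are hypothetical.

```python
class StranglerFacade:
    """Routes each request to the new service if migrated, else to legacy."""

    def __init__(self, legacy_handler):
        self.legacy_handler = legacy_handler
        self.migrated = {}  # route name -> new handler

    def migrate(self, route, new_handler):
        self.migrated[route] = new_handler

    def handle(self, route, payload):
        new_handler = self.migrated.get(route)
        if new_handler is not None:
            return new_handler(payload)
        return self.legacy_handler(route, payload)


def legacy_mainframe(route, payload):
    return f"legacy:{route}"


def new_balance_service(payload):
    return "new:balance"


facade = StranglerFacade(legacy_mainframe)
facade.migrate("balance", new_balance_service)   # one edge carved off so far
```

Callers only ever talk to the facade, so each additional `migrate` call moves one more piece of behavior off the legacy system without a big bang cutover.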

And that's something that we're following. I know Capital One's following that same pattern as well. And it's our last resort when it comes to these legacy systems that there's not a lot we can do with.

One other big thing that we're doing, and I appreciated the talk about Adidas, about their ability to capture metrics. And it's not just for the development and build pipeline. For us as an operations team, what's even more important is all of the analytics that are running within all of our different systems.

This was a model that we came up with, where we're capturing as many metrics and as much information as we can within our endpoints. We're classifying those based off ISO type, and then we're feeding them into our event system.

The really cool thing is most of this is open source. This is all built with Ansible playbooks, so before we actually went live, we rebuilt this 40, 50 times. It's been invaluable, that ability to learn, code, improve, and move really, really fast in the infrastructure space.

So this gives us all types of different lenses. The one I'll focus in on, and if anybody wants to talk about any of the other stuff, just ping me on Slack afterwards. I can tell you why we went in specific directions.

But the Elastic analytics part, it actually proved to be quite valuable initially and early on. The first thing we did was we started installing some of the Beats agents in the teller systems themselves. And we were actually able to see all types of analytics on the teller machines that we didn't even know we had.

One of the big complaints the tellers had was like, "It takes forever for us to print, takes us forever to get XYZ done." Well, as soon as we installed these agents, we realized that there were some processes that would just eat up the CPU as soon as the tellers locked the computer to go pick something up off the printer. And so we basically said, "Hey, if we modify or kill these processes, it's going to improve the performance a great deal."

So this is getting rolled out across our entire branch, that ability to capture those analytics. And I think the other big thing that we're doing here is we're not thinking about applications point to point. We're thinking about applications and infrastructure as a platform, broadcasted. And we're trying to capture it all in the same way. So this has been invaluable.

So we're building all this automation, and the one other thing that we had to do was take a step back and say, "Who are the consumers of our automation?" It's mostly internal employees. It could be support teams.

One of the things that we really found was, if we're going to build this automation, we have to make sure we offer it across multiple channels, and we have to put our development hat on when we build it. The first thing we built was a chatbot to do password resets, and it just took off. Password resets were the number one phone call into our help desk, and we were trying to cut down that number of calls. As soon as we put it out there, it was a big success story.

But then we started adding all types of other intents or activities that the users wanted to accomplish. And we started with very simple fuzzy matching. We followed the Hubot model that GitHub put out. But we found we were getting our consumers a little confused, and we weren't actually able to identify intent very well.
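To make "very simple fuzzy matching" concrete, here is one way it could look using Python's standard `difflib` module. This is an assumed illustration, not the bot's actual code or Hubot's matching logic; the example phrases and intent names are hypothetical.

```python
import difflib

# Hypothetical example phrases mapped to intents.
EXAMPLES = {
    "reset my password": "password_reset",
    "unlock my account": "account_unlock",
    "install software": "software_install",
}


def match_intent(utterance, cutoff=0.6):
    """Return the intent of the closest example phrase, or None if nothing is close."""
    normalized = utterance.lower().strip()
    close = difflib.get_close_matches(normalized, list(EXAMPLES), n=1, cutoff=cutoff)
    return EXAMPLES[close[0]] if close else None
```

This tolerates typos such as "Reset my pasword", but it compares characters rather than meaning, so a request phrased very differently from the examples may be missed or matched to the wrong intent. That limitation is what motivates a language understanding service.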

Well, that's where the AI capabilities in cloud and the APIs that are out there are just providing tremendous benefit. So one of the things we're doing is we're integrating with LUIS, which stands for Language Understanding Intelligent Service. It's an Azure API, super easy to use. You can go get a demo account and be up and running in 10 minutes.

What it does is it basically takes in all the utterances, all the input that the users would type in, and you define your intents, you build the models, you build the examples, and it basically will tell you what the intent is. It'll also tell you what sentiment the user is giving you as well. So if we're really frustrating them, we can send them to a live agent.

We're doing this across all of our different channels. And the other really awesome thing that we noticed is it can actually identify specific variables within the utterances. So it's using natural language processing, identifying the nouns, the verbs, the adverbs, but then it's actually able to pull out, if we know that the user needs help with a password, what type of password it is. So we can drop them into our intent workflows much sooner.
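The routing that follows from intent, sentiment, and extracted variables might be sketched like this. The `Prediction` class is a hypothetical shape invented for the example and is not the response format of LUIS or any other service; the sentiment scale, threshold, and workflow names are also assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Prediction:
    """Hypothetical shape; not the actual response format of any service."""
    intent: str
    sentiment: float                  # assumed scale: -1.0 (negative) to 1.0 (positive)
    entities: dict = field(default_factory=dict)


def route(prediction, frustration_threshold=-0.5):
    """Pick the next step for the conversation."""
    if prediction.sentiment <= frustration_threshold:
        return "live_agent"
    if prediction.intent == "password_reset":
        # A recognized entity lets the user skip a clarifying question.
        password_type = prediction.entities.get("password_type")
        if password_type:
            return f"reset_workflow:{password_type}"
        return "ask_password_type"
    return "fallback_menu"
```

A frustrated user goes straight to a live agent, and a user who already named the password type lands directly in the matching workflow instead of being asked again.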

So I want to call this out because I think that this is a very tangible way to get very easy wins within your operations group, and it helps drive wanting to build more automated intents that can be consumed by your employees.

So Gene, I had this guy up here before your slide came out this morning. Gene had called out The Structure of Scientific Revolutions, amazing book where I think they came up with the term paradigm shift. It was the first instance of that in the 1960s, when Thomas Kuhn wrote this.

And so this goes back to that capacity conversation. Do we have the right workforce in order for us to complete these activities? And I love this sentence because I think he hits home what our issue is: "Because he there joins men who learned the bases of their field from the same concrete models, his subsequent practice will seldom evoke overt disagreement over fundamentals."

It's about diverse thought, right? And he's talking about how there were gigantic gaps between scientific revolutions, and one of the causes is that there wasn't enough diverse thought. They were all following the same models.

As an operations resource, you say, "We've only followed vendor X for networking," or, "We've only done this for that." Yeah, we definitely live in this. We're always following the same models.

So how do you get diverse thought? And so that's where I think what really hits home is what does our future-ready workforce actually look like? And our chief data officer actually built this.

As we were flying over here, someone had asked me, "Well, what skills do you think are the most important?" And when I answered, they said, "Well, I'm surprised you said that. I would've thought it would've been Python and C# and .NET, or whatever it is." And I said, no, the most important thing is passion, right? To have the passion and curiosity to learn.

And the fact that we're in this new world, and there's this term called micro-learning, where you don't go and read the book from beginning to end anymore. You do the tutorial. You learn in segments, and you build in segments while you're learning. And I really think that's an important aspect because what we're asking our employees to do, six months from now, we might not ask them to do any of that anymore. We're going to ask them to do something different because technology's changing so fast, and our business is changing so fast.

So this has been really important for us to drive home that capacity side. I think that not everyone's up for the journey, but I think this at least starts driving our framework for it. And if we can mix in some new people with some of the people that are very curious, this is going to put us in a really ideal state for us to have the right workforce.

So I think there's a model, and I don't think there's a right or wrong model for you to implement DevOps by any means, or do DevOps in your organization. But for us, we have a lot of governance requirements. We have a lot of traceability. We still have to report back to Ernst & Young and do ad hoc reporting on our releases. So there still needs to be what we call the rails that we have to run on, that our train has to run on.

And so that's how we're set up. The first thing I want to say, though, is automation in our organization is a federated approach. Automation is everyone's job within the IT technology operations space. No matter who you are, it is your job to automate. But we can at least provide an easy means and an effective and reusable means to do automation within the bank today. And that's where some of these teams come in.

I think these teams are an important model to follow, even if it's not the exact same approach. The middle boxes are the most important boxes. So that's our continuous integration, continuous delivery, and continuous feedback teams. And these teams are building the reusable capabilities and frameworks in order for us to actually deliver software throughout the bank.

Just out of curiosity, does anyone else have problems with vulnerabilities in Jenkins? Because you should all raise your hands. It's our number one vulnerability issue. So we need a team that's going to make sure that you have the ability to use Jenkins in a self-service fashion, but also keep it up and ensure that you're not going to have those vulnerabilities sitting within your organization.

We also have, like I said, the traceability and governance: being able to release and show the workflow of who approved which release, and how that validates and goes back to some of our ITIL practices. All of that is built by the continuous delivery team.

And then we're building reusable monitoring frameworks so that we're not wasting cycles trying to redo all that work.

The top box is really our release engineering. These are almost like our coaches. These are the ones that are going to go out and work with the teams. They typically build the underlying container images that we support, because from a security perspective, we still want to keep that close to home. We still want to make sure we understand which images are running in production and which ones are not. So there's a lot of coaching that happens at the top on how to do these specific activities.

The bottom box is really more around cloud services, but also for infrastructure automation. It's very similar to the continuous delivery and continuous feedback model, but it's just a different infrastructure that we actually use to support it.

And then our pillars on the side. The one thing I really like with our new monitoring model, we're now looking for data scientists because we have all of this great data. We're not doing capacity management by pulling up the capacity management report and tool and say X, Y, Z, because I think that's a very poor way to actually do capacity management. Instead, we're going to actually build models against the data that we have so that we can actually feel comfortable. So that's more of that data scientist team that's really looking at all of the infrastructure analytics that we're evaluating.
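The talk does not say which models the team builds. As one simple possibility, a straight-line trend fitted to usage samples can project when a resource runs out. The sketch below uses ordinary least squares on made-up disk usage numbers; every name and figure is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_gb, capacity_gb):
    """Project when usage reaches capacity; None if usage is not growing."""
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None
    return (capacity_gb - intercept) / slope - days[-1]


# Hypothetical daily disk usage samples.
days = [0, 1, 2, 3, 4]
used_gb = [100, 110, 120, 130, 140]
remaining = days_until_full(days, used_gb, capacity_gb=200)
```

Real infrastructure data is rarely this linear, so a production model would need to account for seasonality and noise; the point is only that the forecast comes from the collected data rather than from a static report.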

And then on the right side, it's really around performance engineering. This is the squad that gets called into the big P0s, what we call P0s, P1s, which are the critical incidents. These are the A players that know how to navigate application performance management tools and know how to get through the logs, and that A squad that just goes in and does this because we still have a need for things like that.

But I just wanted to share this because I think it's an interesting model. It's an interesting approach that we've undertaken. It's definitely evolved. This is not where we started with by any means, but this is where we're landing right now.

So what do I need help with, right? A lot, and I would hope everybody would say "a lot" in their slides, by the way, because everyone needs help. But I think it's really just anything others are doing to help enterprises move fast and effectively.

And then maybe lastly, we're really starting the cloud journey. We haven't put a lot in the cloud. We've put some in the cloud. We're having to navigate some decisions that I think other companies have already gone through: hey, how do you keep costs under control? Do you guys do a chargeback? Do you guys do a direct bill model? And then even justifying those on depreciated infrastructure, we still have that issue.

And I'm not sure how, but a lot of other people seem to get over that hump. That's not something we've been able to convince our finance team of yet.

So that's it. That's all the information I got. I think I'm actually way early because I talked really fast.

Thank you.