Building Continuous Integration and Feedback
Learn how the Continuous Delivery and Feedback organization at KeyBank built its teams and delivery pipelines to focus on providing continuous feedback for its development community.
Chris McFee is currently maturing DevOps practices, including continuous integration, continuous delivery, infrastructure as code, and configuration management and automation. Chris has experience designing and implementing highly available, enterprise-ready systems, processes, and infrastructure solutions while providing engineering, deployment, administration, and support. Chris has a master's degree in Digital Sciences with a focus on Enterprise Architecture from Kent State University.
Full transcript
Chris McFee
Good morning, everyone.
So this talk is about building continuous integration and feedback.
My name is Chris McFee. I'm the Senior Vice President of our Enterprise DevOps practices at KeyBank.
KeyBank is the 13th largest bank in the United States. We operate over 1,100 branches. We have over three million clients. Our online banking presence has over a million daily logins. We operate in 15 states. We have approximately 18,000 employees and just under $140 billion in assets.
So, from a KeyBank perspective, the path to financial wellness is really one of the things that our business has really been focusing on recently: developing solutions to help our clients understand where their money's going, help them plan better for the future, and essentially help them get more value out of the products and services that we're offering. So how do we do that?
So let's take a little history lesson. Our DevOps journey actually started back in 2015, and we really focused on three areas at the time. The first was our online and mobile banking presence. We hadn't really updated the user experience in probably about 10 years. So we wanted to make sure that we built teams and we built a new user experience, something that was going to be really good for our customers. So our lines of business set out on a path to, they called it Digital '17. And the goal was to build this new UI, and new user experience and have it deliver in 2017.
Around that time, there were six of us in technology services. And we said, "We can help this team deliver faster. We can help them deliver those applications and that user experience. And we can help guarantee some of the quality on those applications." So we really set forward to kind of build that framework and those patterns. Around that time that we said, "Hey, we want to partner with this digital delivery team," it was announced that we were acquiring a bank out of New York. So we had to either scrap that effort or accelerate it, and we chose to continue to accelerate it.
So KeyBank has talked about this in several DevOps Enterprise Summits in the past. We focused on two key areas to help remove some of the bottlenecks in those processes.
The first was automated testing. At any given time, our testing cycles were taking about a month. We knew that in order to help facilitate the new user experience, we needed to really accelerate those testing efforts.
The other thing that we focused on was infrastructure delivery. Back in 2015, it would take three or four months to spin up a server for application teams to take advantage of, and get their applications installed on it, get our platforms installed on it. And that's really when we started to focus on containerization and what that looks like.
But I'm not really going to talk about all those pieces parts. You can look at some of the previous presentations that we've done in those areas.
So this goes over some of the statistics around our test automation, and some of the efforts that we did there. The moral of this story is we went from about a 20-hour test execution, to test automation that ran in about 12 minutes. So I don't know about any of you, I can't remember what I did yesterday, let alone waiting 20 hours for a testing cycle. If I broke something from a development perspective, I may not know what it is until the next day. But now with our new test automation, we can run those tests in about 12 minutes, pretty near immediate feedback. So that's awesome.
So, First Niagara customer day one. This really covers some of the things we ran into as we were acquiring and onboarding the new bank out of New York that I mentioned. Some of the things that we didn't account for in the user experience caused customers to have to call our contact center. With the login process, they didn't know what user IDs to use. Some of our legacy KeyBank clients were using wrong passwords or didn't know what links to click on.
So the day we went live with the new user experience, we had about 500,000 users logging onto the platform, about 30 logins a second. That was the most we had ever seen at the time. There were waits of over two hours for phone calls in our contact center for people to get walked through the process.
So in order to help get the user experience up to speed, in the first four days, we did 10 production releases in the middle of the day. There was no customer impact as a result, mainly because we trusted the test automation and we could spin up those containers quickly.
Funny story, as we were doing the deployments at 10:00 in the morning, the CIO was in the room at the time. No pressure as we click the button, right? But everybody had confidence in the platform, and the test automation. So, 10 releases, no customer impact. That was awesome. Great story for us there.
But how did we evolve as a team? So, like I started off, we had six individuals on our original DevOps team. And our goal at the time was kind of to do DevOps as a service. We would be a consulting group within the organization. We would hit kind of high-level targets for initiatives and programs. But we knew we couldn't really scale to be able to be involved in every single initiative or project. But we did want to federate out.
Around that same time, we started to go through some reorganization within the bank. It effectively tripled the size of the team. That new organization that came in after the reorg was kind of iteration two of our DevOps team. Within that version two, we took on code management responsibilities. We still were working in containers. We took on events management, or logging and monitoring. And we also brought our change management ITIL function into the group. The thinking around all of that at the time was that now you're really affecting all of the code migration processes. You're able to monitor and get feedback from applications running in production, and bringing the ITIL function and the change management function into the group helps remove, again, some of the bottlenecks in the delivery process.
So that was iteration two. We're on iteration three now, and part of that, we took a step back. So there's a great anecdote from Larry Wall, who created the programming language Perl. He once said that when they built the University of California at Irvine, they put the buildings in. They did not put in any sidewalks, and they just planted grass. The next year, they came back, and they put the sidewalks where the trails were in the grass.
So we kind of took a step back, looked at what we were doing. We started to visualize the work and look at the value streams of everything that our teams were involved in. We landed kind of on these five different teams now within the organization. We've got about 40 individuals within this organization. And I'm really going to focus the rest of the talk kind of around this continuous integration and continuous delivery function, the cloud native function, and the digital intelligence function.
But we do have a tech planning group, a tech planning team that helps to prioritize the intake process of our projects that are still using Waterfall. We have about 30 agile teams now at this point that we federate in with, and we have a development team, which is called the Digital Assistance Development, where we're doing chatbots, and we're doing workflows with things like Siri, Alexa, and the Google Assistant. So some really interesting stuff, but let's talk about building this team and what that means as a part of our delivery processes.
So the cloud native team continues the work we started with containers. We must be able to provide end-to-end automation, zero touch, dynamic infrastructure. And this is really to help reduce time and effort and costs required for our delivery teams to deliver the software solutions.
So I mentioned that we were doing a lot of work in containers. We were one of the first banks that publicly talked about running production containers. Right now, all of our user-facing applications are currently running in containers within our data center today. But we wanted to take what we learned there and apply it more broadly. We announced in October that we were actually partnering with Google in providing feedback to their Google Kubernetes Engine on-prem product, which is one of the key components of Anthos and their Google Cloud platform, which I believe was just GA'd last week. So really continuing that effort, again, zero touch, on-demand dynamic infrastructure that allows us to spin things up quickly and scale.
The continuous integration and continuous delivery team is really focused on the build and release pipelines, utilizing feedback from our pipelines to further reduce bottlenecks in the processes.
So we've got great tooling now that can provide us visibility into how many processes within our delivery pipelines are manual, how many things are automated. When they are manual processes, how long is it taking? And it's really good metrics for us to focus in on and help fix some of that flow.
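As a rough illustration of the kind of pipeline metrics mentioned above (how many steps are manual, how many are automated, and how long the manual ones take), here is a minimal Python sketch. The `PipelineStep` type, the step names, and the durations are all hypothetical and are not taken from KeyBank's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineStep:
    """One step in a release pipeline (hypothetical structure)."""
    name: str
    automated: bool
    duration_hours: float


def summarize_steps(steps):
    """Count automated and manual steps and total the time spent on manual ones."""
    manual = [step for step in steps if not step.automated]
    return {
        "automated_steps": len(steps) - len(manual),
        "manual_steps": len(manual),
        "manual_hours": sum(step.duration_hours for step in manual),
    }


# Made-up example release; the numbers are placeholders, not real measurements.
example_release = [
    PipelineStep("build", automated=True, duration_hours=0.2),
    PipelineStep("security review", automated=False, duration_hours=16.0),
    PipelineStep("change approval", automated=False, duration_hours=8.0),
]
summary = summarize_steps(example_release)
```

A summary like this, tracked per release, is one way to see where manual time concentrates before deciding what to automate next.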
But if you kind of look at our continuous integration and continuous delivery pipelines from a capability perspective, they're providing the source code repositories. They're providing the automated application build processes. They're integrating in with automated tests written by our development and QA organizations. We provide functionality and capabilities around artifact repositories, and then the actual application deployment processes.
So we really define continuous integration as everything up to that application deployment process. Continuous delivery is really about governance. In a highly regulated environment, there are a number of checks and balances that we need to make sure that we're providing. That's where we're focused on the continuous delivery side, and that's where we're trying to remove a lot of those bottlenecks in the delivery process.
Kind of in this graph here, you'll see that the upper right-hand corner, the 158 days that it says, those are still manual processes in our releases. So while we do have 30 agile teams that represent probably about 80 applications, we have over 500 applications at the bank that are still in various maturity levels in their delivery processes. And we still do have some teams that are doing waterfall development. So on average, each one of our releases is taking about 158 days of manual tasks. But that's mostly representative of those waterfall projects still.
From a digital intelligence perspective, so this was really a shift in focus. We changed the way that we thought about event management and logging. So now it's more than just event management and logging. So we're correlating purpose-driven data from our systems, applications, network devices, endpoints, ATMs, in order to drive meaningful insights into the systems that we're delivering on to add business value.
So what this means is, we're orchestrating a bunch of different systems across all of our infrastructure and in our environment, systems that hadn't been orchestrated to date. So we can watch transactions as they flow from system to system now. We bring the events in through an ingestion layer, we do some enrichment, and then we send the enriched information and data off into some indexes so that we can visualize the things that are happening with our environment at any given time.
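The ingest, enrich, and index flow described above can be sketched in a few lines of Python. This is only an in-memory illustration: the `HOST_OWNERS` lookup table, the field names, and the dictionary used as an index are assumptions made for the example, not details of the platform described in the talk.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical enrichment lookup: hostname -> owning application.
HOST_OWNERS = {"atm-0042": "atm-network", "web-01": "online-banking"}


def enrich(event, received_at=None):
    """Return a copy of the raw event with extra metadata attached."""
    enriched = dict(event)
    enriched["application"] = HOST_OWNERS.get(event.get("host"), "unknown")
    timestamp = received_at or datetime.now(timezone.utc)
    enriched["received_at"] = timestamp.isoformat()
    return enriched


def ingest(raw_events, index=None, received_at=None):
    """Enrich each event and group it by transaction id.

    Grouping by transaction id is what lets a single transaction be
    followed as it moves from system to system.
    """
    index = defaultdict(list) if index is None else index
    for event in raw_events:
        index[event["transaction_id"]].append(enrich(event, received_at))
    return index
```

In a real deployment the index would be a search backend and the events would arrive over a stream, but the shape of the steps stays the same: accept the raw event, attach context, then store it under a key that supports correlation.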
So this has been really powerful in hunting down potential issues, getting feedback, and providing the visualizations necessary. We get the data into the system, make sure that everybody can visualize it, and keep it completely open across our infrastructure.
So bringing it all together, what does that mean? So leveraging each one of these teams and the capabilities that they provide, we're able to build this automated delivery pipeline that focuses on providing continuous feedback for our development organizations. So this is an example of one of the more mature teams at the bank that we've been able to consult with and help provide the ability for them to move quickly within our environment.
So on the left-hand side, you'll see the development team, you'll see source code repositories, our continuous integration engine. Everything that's in that gray box there is where we're really talking about that zero touch, on-demand infrastructure. So we can spin things up quickly, and when the ephemeral workloads or the atomic workloads are done, we can spin it down. And it really provides density and the ability to scale the environment as we need.
So what does that look like? So red dot starts on the development team. They check in their code into the source code repository. It kicks off a build. From Jenkins or the continuous integration engine, we then run some unit tests and we run some code quality analysis, some static code quality analysis.
Once those processes have been run, we take all of the data as a part of those tests that have been run, we put it through our ingestion tier within our digital intelligence platform, we add some enrichment, add some metadata around it, and we toss it off into the indexing for visualization.
It then continues on to the actual automated application build. And again, we get the feedback from the application build, add some enrichment for visualization, and then we go off and we do some more testing.
So within this tier, we're using automated functional testing. Think of solutions like Selenium and WebDriver, which really drive user interactions within a browser. We do automated API testing here in this space as well, using frameworks that are open source and available within JavaScript and Java.
We run small performance tests based on core functionality within the applications. Because all of this is parallelized, we can really do this in about 15 minutes. So think about spinning up a 15-minute load test, providing feedback on that core functionality, and then we can visualize it later.
And then we also do automated security scanning, checking for things like licensing and open source vulnerabilities. So again, shoot that off in a parallel fashion, send all the data back into our ingestion and enrichment tier, and then it moves on to the next phase.
So the next phase is really more the continuous deployment. But we put the built artifacts into an artifact repository, assuming that all of the tests have passed. If the tests don't pass, we break the build.
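The idea of running the test stages in parallel and breaking the build if any of them fails can be sketched as follows. This is a minimal Python sketch using the standard library's `concurrent.futures`. The stage names and the callables standing in for real test runs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stages_in_parallel(stages):
    """Run every stage callable concurrently and return {stage name: passed}.

    `stages` maps a stage name to a zero-argument callable that returns a
    truthy value on success. An exception raised inside a stage propagates
    to the caller when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=len(stages) or 1) as pool:
        futures = {name: pool.submit(stage) for name, stage in stages.items()}
        return {name: bool(future.result()) for name, future in futures.items()}


def build_should_break(results):
    """The build breaks as soon as any single stage has failed."""
    return not all(results.values())
```

For example, passing `{"functional": run_ui_tests, "api": run_api_tests, "security": run_scan}` (all hypothetical functions) would start the three stages together and only publish the artifact when `build_should_break` returns `False`.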
From the artifact repository, then our automated governance gates reach back into the indexing tiers in order to provide automated gates. So we can set up automated gates around, you have 80% code coverage, you have no high priority or high security risks, you're not using any open source software or libraries that are using bad license structures, et cetera.
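A gate check of the kind described above can be sketched in Python as below. Only the 80% coverage figure comes from the talk. The `BuildMetrics` structure, the gate names, and the license deny-list are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BuildMetrics:
    """Metrics a gate might read back from the indexing tier (hypothetical shape)."""
    code_coverage: float                 # percent, 0-100
    high_severity_findings: int          # count of high-risk security findings
    licenses: set = field(default_factory=set)


MIN_COVERAGE = 80.0                      # threshold mentioned in the talk
DISALLOWED_LICENSES = {"AGPL-3.0"}       # hypothetical deny-list


def evaluate_gates(metrics):
    """Return the names of the failed gates; an empty list means all gates passed."""
    failures = []
    if metrics.code_coverage < MIN_COVERAGE:
        failures.append("code_coverage")
    if metrics.high_severity_findings > 0:
        failures.append("security")
    if metrics.licenses & DISALLOWED_LICENSES:
        failures.append("license")
    return failures
```

Returning the list of failed gate names, rather than a single pass or fail flag, keeps the decision auditable: the release record can state exactly which check stopped it.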
And then it will also check to make sure our ITIL processes are in a good state. So is there a change record out there? If there's not a change record out there, we can go in and we can create the change record in an automated fashion. In instances like this, where we're getting all the feedback and we've trusted the testing that we've been able to provide, we're actually able to now release on demand. Still following ITIL practices because it's completely auditable, it's completely traceable, and now we no longer have engineers where they're spending hours putting in change records for all the different things that they're trying to release. So that's really good, too.
The other thing is, at this point, we can override anything that may have failed in the previous steps. So maybe they're at a 78% code coverage level, and we're okay with that. We accept that risk. So at this point, we can also go in and manually override some of those steps. And that is essentially the functions of our automated test pipelines.
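The manual override step could look something like the following Python sketch, which records who accepted the risk and why so that the release stays traceable. The `Override` structure, the gate names, and the audit log format are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    """A recorded decision to accept the risk of one failed gate (hypothetical)."""
    gate: str
    approved_by: str
    reason: str


def apply_overrides(failed_gates, overrides):
    """Return (gates still blocking the release, audit log of accepted overrides)."""
    accepted = {override.gate: override for override in overrides}
    still_blocking = [gate for gate in failed_gates if gate not in accepted]
    audit_log = [
        f"{gate} overridden by {accepted[gate].approved_by}: {accepted[gate].reason}"
        for gate in failed_gates
        if gate in accepted
    ]
    return still_blocking, audit_log
```

With a failed coverage gate at 78% and an override for that gate, `still_blocking` comes back empty and the audit log holds one entry naming the approver.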
One of the things that is still prevalent for us is how do we continue to scale these processes? So, like I said, we're a team of 40 now. We've got 18,000 employees at the bank. Probably about 7,000 of them are in technology. How do we continue to scale, how do we continue to share knowledge, and how do we continue to upskill the current workforce, really around those change management pieces?
I actually ended early. I think I went through the history stuff way too fast, but I am available for questions or commentary.
Q&A
Audience: So the last slide that you explained, this is all about unit testing, or it covers also the normal functionality? Is it CI/CD or only CI?
Chris McFee: So, everything up to the artifact repository, I would consider CI. So it includes API testing, service testing, functional testing, which is the web browser testing. The unit test is actually before the application build. Yeah, so this is all at component level testing. Yeah. It's not integrated environment testing.
Audience: So the automated functional testing can include--
Chris McFee: Yes.
Audience: ...the full end-to-end integration tests.
Audience: That comes in CD part, right?
Chris McFee: Yeah. Yeah.
Audience: Do you have a lot of legacy code here? And how do you change the legacy code to bring it into this new pipeline, this new way of working with CI/CD?
Chris McFee: Yeah, so the question was, do we have a lot of legacy code, and how do we build it in here? So what's great about the platform and the framework and the patterns that we've put into place is we can turn any one of those pieces, parts off as necessary.
So there are actually applications from a legacy perspective that are going through pieces and parts of this pipeline. So as an example, code quality analysis. We're getting code quality analysis for all of our applications that are going through our release pipelines now. We were never getting that data before. And because it's integrated into the pipeline, it just happens. So our delivery teams are now getting that data that they didn't have before.
Like I said, this is one of our more mature processes, one of our more mature teams. But we can turn any one of those pieces, parts off based on use case or how far along in the maturity process those teams are. As far as how do we deal with some of the legacy applications, so we're really taking more of a strangler pattern approach, if you're familiar with the strangler pattern. So building systems around the legacy, and slowly chipping away, so that you can switch the functionality over time.
Audience: And is it working, strangler pattern?
Chris McFee: It has been working, yep.
Audience: You are exposing via API, or how are you doing it?
Chris McFee: So yeah, we're creating the APIs, and we're creating the services in front of some of those legacy systems and then slowly moving them away.
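A strangler pattern of the kind discussed here is often implemented as a facade that sends each operation either to the legacy system or to its replacement. The Python sketch below assumes a hypothetical `StranglerFacade` class and handler signatures. It illustrates the general pattern and is not a description of KeyBank's services.

```python
class StranglerFacade:
    """Route each operation to a new service once migrated, otherwise to legacy."""

    def __init__(self, legacy_handler):
        # legacy_handler(operation, payload) stands in for the legacy system.
        self._legacy = legacy_handler
        self._migrated = {}

    def migrate(self, operation, new_handler):
        """Switch one operation over to its replacement, new_handler(payload)."""
        self._migrated[operation] = new_handler

    def handle(self, operation, payload):
        handler = self._migrated.get(operation)
        if handler is not None:
            return handler(payload)
        return self._legacy(operation, payload)
```

Callers only ever talk to the facade, so functionality can be moved one operation at a time without the callers changing.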
Audience: Okay. And this is all Kubernetes?
Chris McFee: Yeah, so everything running in that gray box is essentially running on Kubernetes.
Audience: Okay. So my question is that you have showed two Jenkins outside of Kubernetes, and build is part of Jenkins. And also, what all tools are you using for different, for example, automated security scanning, if you can?
Chris McFee: Sure, great question. Yeah, so the question was really around, she noticed that Jenkins is outside of the Kubernetes piece, and then what are some of the other tools in the processes.
So what we did is we found that with the way that Jenkins runs and some of the data that Jenkins has, we really wanted to take more of a stateful approach, which means that our Jenkins master server is running on a VM. But all of the build executions are running within Kubernetes. So we have agents using the Jenkins Kubernetes plugin. It will spin up the build agents to do the other jobs and report back into the master process.
As far as some of the other applications, so we're using Git and Subversion from a source code repository perspective. Jenkins is really that CI function. From a unit test perspective, it depends on the frameworks that are being used. So Java's typically JUnit. The JavaScript stuff, they've got some of their other things. Code quality analysis is currently being done in SonarQube. Application builds, combination of NPM, Maven, Gulp, and Gradle. So, there's a bunch of pieces there.
Functional testing and integration testing, really using Selenium and WebDriver for a lot of that testing. Automated API testing has really been a lot of, I believe it's Mocha and Chai from a JavaScript perspective in order to drive that. Automated performance profiling, so using JMeter in order to generate the load and then using Dynatrace for application profiling. From an automated scanning perspective, we use a combination of Sonatype Nexus and Black Duck. And then our ingestion tier, with the digital intelligence items, is really an entire platform that we've spun out, but it's based on the Elastic Stack. For some of that ingestion and enrichment tier, we're using Kafka, from an event streaming perspective.
Audience: And how much is this straight through for you? Does it break the pipeline and you need manual intervention there, or it is working seamlessly for you?
Chris McFee: So, the question is, is it straight through or how much manual intervention? So, it depends. It's very contextual. For the mature applications that are going through these processes, any one of those pieces that fails or doesn't hit the metrics breaks the build. And then it's on the development team and the delivery team at that point to determine why the build broke. So, when a build breaks, they get notified that the build broke, so that they can take a look at it. If it broke because of Kubernetes or Jenkins, they'll engage the team.
But for the most part, the framework and the patterns that we've built up in this space, we've said, "This is the standard." So, if you talk about the easy mode that was discussed this morning, that's really the easy mode is we provide that. If they want to go hard mode, they can absolutely go hard mode. The goal here is zero touch on demand.
So, we don't like the term self-service. I think that there's typically a negative connotation that we're pushing the work off when we say self-service. So we've really gotten in the habit of saying zero touch. So our development teams can really take advantage of the pipelines.
Audience: Can you say a few more words regarding the visualization? So how do you get to the process, what you visualize, and finally what are you visualizing?
Chris McFee: I was having trouble hearing, but I think the question was what metrics are we visualizing and looking at. So unit test pass-fail, percentage of things that are passing or failing, is one of the metrics that we're looking at. Code coverage is another. Code quality in the deltas. So did the code quality get significantly worse from the last runs? So we keep the historical data, and we can determine what SonarQube had scored those previously.
And then most everything else is percentages around-- So from an automated test perspective, it's percentages around what percent failed, what didn't. And then application performance profiling, again, is the deltas between the different runs. So we get that historical view. This function in the application previously took five milliseconds. Now it's taking 20 milliseconds. We're graphing that out so that we can see that over time.
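Comparing deltas between runs, as in the 5 ms versus 20 ms example above, can be sketched in Python like this. The function name, the input shape (function name mapped to milliseconds), and the `max_ratio` threshold are assumptions made for the example.

```python
def find_regressions(previous_ms, current_ms, max_ratio=2.0):
    """Return {function: (previous, current)} for functions that slowed down.

    A function is flagged when its current timing exceeds the previous
    timing by more than `max_ratio` (a hypothetical threshold). Functions
    with no previous measurement are skipped.
    """
    regressions = {}
    for name, current in current_ms.items():
        previous = previous_ms.get(name)
        if previous is not None and previous > 0 and current / previous > max_ratio:
            regressions[name] = (previous, current)
    return regressions
```

Keeping the historical values alongside the current ones is what makes the comparison possible. A single run on its own cannot show that a function got four times slower.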
Audience: And how often do you release this at the release deploy, and when is the version number given? Is it for deployment or is it part of the deployment?
Chris McFee: So the question was, how often are we releasing, and how are we dealing with versioning?
Audience: Yes, and is every release deployed?
Chris McFee: Is every release deployed? Okay. So, because of the Kubernetes environment, we have multiple places that we can deploy the application to. So within our dev region, where we're doing all of this testing, there is a static release. So if it passes everything and it goes on, it automatically goes on into our development region.
At that point, there are nightly builds into our integration test environment and QA environment. Those are automated. Those are nightly. From a production perspective, our mature teams are typically going about once a month for their releases. They can go quicker, but the business is essentially driving the decision on how fast they want to release those features. But because they're all containerized, because they're all production-like environments, we can get them in there, and they're essentially deploying every time the build passes.
From a versioning perspective, they're doing it through the source code version with tagging, and then incrementing it through the actual application build processes. That answer your question?
Audience: Yeah. Do you pull some versioning scheme like semantic versioning, or is yours your own?
Chris McFee: So the question was, do we do semantic versioning or do we do something else? It's really based on the development teams and what they're doing. We haven't put any standards from that perspective. Most of the teams, though, are using some form of semantic versioning.
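For teams that do use semantic versioning, incrementing a version from the latest source tag during the build could look like the Python sketch below. The tag format (an optional `v` prefix followed by MAJOR.MINOR.PATCH) and the `bump` helper are assumptions, and pre-release or build metadata suffixes are deliberately not handled.

```python
import re

# Accepts "1.4.2" or "v1.4.2"; pre-release and build metadata are out of scope.
_SEMVER_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def bump(tag, part="patch"):
    """Return the next version string after `tag`, incrementing the given part."""
    match = _SEMVER_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {tag!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

Under semantic versioning, incrementing a higher part resets the lower ones to zero, which is why a minor bump of `1.4.2` yields `1.5.0`.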
Audience: How do you take care of the integrity of your applications for a release? Like say, this could be there for a deployment where you have, say, a standalone application, this would work across. But when you have multiple applications which need a deployment or release, how do you take care of that integrity and solve CI/CD? The most challenging part is that in CD.
Chris McFee: Yeah, so the question was multiple application releases, how are we handling those that maybe aren't going through these processes?
Audience: Even for CD, that would be a challenge because you're waiting for one application which is not available. You need to do integration testing. It's automated, but...
Chris McFee: Yeah, so it really comes down to architectural pattern. So, a lot of the releases that the teams are going through are microservices, so it's very small blast radius. So they can release when they're ready.
For some of the more legacy systems where they're integrating in with some of the other systems, we can actually manage that through the automated governance gate and say, "We're going to stop this automated release until the dependencies for the legacy--"
Audience: Setting up there.
Chris McFee: Yeah.
Audience: XL Gate is what you're using there?
Chris McFee: Yeah. So we're using XebiaLabs in order to do that.
And we're actually up. We are at lunch, so I don't want to hold anybody up for lunch. I will be in the speakers' lounge this afternoon, and I can hang out here for other questions, if anybody has them. Since we're breaking for lunch, I don't know that I'm going to get kicked out.