Clean Handoff: Giving Devs the Power and Speed to Deploy Without the Power of Production

Virtual US 2022

This talk explains the tools and techniques we've developed at Truist to give developers full control over their CI process and enable them to deploy through integrated testing, while keeping production deployments identical in code, but separate in control, and automatically enforcing all regulatory requirements at a large bank.

Full transcript

The complete talk, organized by section.

Evan Chiu

Good morning. Let's go ahead and get started.

Hi, good morning. My name is Evan Chiu, and this is my talk, Clean Handoff: Giving Developers the Power and Speed to Deploy Without the Power of Production. I'll talk about my company's journey, the techniques that we used, and how that might be beneficial for you.

I'm a software engineering director at Truist. Truist is a new bank. It's the merger of two heritage organizations, BB&T and SunTrust. In the merger, we've tried to come together and figure out what's the best from both sides: what technologies are we going to use, how is everything going to fit together? This talk is about some of the things that we learned and the things that we're using.

My team is part of the automation guild. It's our job to make sure that everything is flowing smoothly. When things are automated, they work faster, better, more repeatably, more reliably: all the kind of things that people at DevOps Enterprise Summit know all about. We are supporting already over 100 applications deploying about 400 artifacts, and we're expecting to grow that portfolio by about three times to five times as we are becoming the enterprise team.

First of all, why does it matter? Why do we need to separate developers from production? Part one could just be separation of duties. It's important for our regulators to make sure that everything is organized and maintained. In the financial industry, we have a lot of regulations. There's a lot of opportunity to set policy and write how you want things to work, and then you get graded on how well things actually work. Being able to separate the duties between the developers and the production support engineers is really helpful in terms of making sure that we meet our commitments and do things the way they're supposed to be done.

To give you a preview, we're going to focus on empowering dev speed, empowering production controls, doing clean handoff techniques between the two, and then we wrote a custom service called Kali. I'll give you an overview of how that all looks.

The tools that we're using could be swapped out for others in the same space, but we're happy with the set that we've got. We use GitLab for all of our source control management and all the pipelines. We love pipelines as code; that's been working really well for us. We're using JFrog Artifactory for all of our binary artifacts, and that's also been working pretty reliably for us. We use Veracode for security scanning, and then ServiceNow to try and capture everything: all the changes and all the pieces.

You can see that we tried to break out which pieces the developers can own. They own their own build pipelines. They own their development Artifactory repository. Then which things does DevOps own on their behalf? That will be the deploy pipeline and then their higher-level Artifactory repositories.

Let's get into it: empowering dev speed. The first thing that really helps developers move fast is giving them ownership. When the developers have ownership, they don't need to wait on our team for anything in particular, so they're able to move fast and make good progress.

One of our big mantras is that everything is an artifact. Whether they're building a Lambda function, whether they're building a Java jar, whether they're building a Docker container or an Angular static compile, or whether it's just their Terraform code for their infrastructure as code, we zip that up and everything is treated as an artifact. It all goes into Artifactory the same way, and then we're able to move that through each of our environments. Build once, run everywhere is really important. The developers control their builds. They build their artifacts and then put them into Artifactory, and that allows us to do that clean handoff to the pipelines.

Take a look at their build pipeline. It's got different pieces to it, including compilation, unit tests, linting, and SAST. We'll get to that in a future slide. Once all of that's done and they've merged into their primary branch, then they can move over into publishing that artifact into Artifactory. They own their own development repository in Artifactory, so they can put whatever resources they want in there, and then we'll talk about how they move from there in a little bit.

One of the keys that we used to help them go fast is templates. One of the things that we like about GitLab pipelines as code is that the code's really easy to copy and paste, but the thing about copying and pasting is that you never get any updates. It's even better to do templates. When we write templates, we write templates that are generic so they can be used by many different teams and that are parameterized, so that the teams are allowed to provide their own variables so that each template works for them.

We've got an example on screen. There's a yarn install. If you're not familiar with the Node ecosystem, that means: go and grab all of my dependencies, get everything out of Artifactory that I need for my application to work. Yarn build might be a TypeScript compile. It might be an Angular build. Whatever the application needs to build will be in that script. Then yarn pack zips all of that up and puts it into one file that's ready for deployment.

Then you can see the curl command there. We'll pass that artifact and upload it into Artifactory, and you can see it's got the variables in there of artifact name and version so that each team can provide their own parameters so that we can make it work for whatever the team is.
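To make the parameterized upload concrete, here is a minimal Python sketch of the same idea, not the template from the slide. It builds (but does not send) an HTTP PUT to an Artifactory path assembled from the team's artifact name and version. The host, repository name, path layout, and token are all hypothetical placeholders.

```python
from urllib.parse import quote
from urllib.request import Request


def build_upload_request(base_url: str, repo: str, name: str, version: str,
                         payload: bytes, token: str) -> Request:
    """Build (but do not send) a PUT request that uploads one artifact."""
    # Assumed path layout: <repo>/<name>/<version>/<name>-<version>.tgz
    parts = (repo, name, version, f"{name}-{version}.tgz")
    path = "/".join(quote(part, safe="") for part in parts)
    return Request(
        url=f"{base_url.rstrip('/')}/{path}",
        data=payload,
        method="PUT",
        headers={"Authorization": f"Bearer {token}"},
    )


# Each team supplies only its own parameters; the template logic is shared.
request = build_upload_request(
    "https://artifactory.example.com/artifactory",  # hypothetical host
    "team-a-dev-local",                             # hypothetical dev repository
    "payments-ui", "1.2.3", b"tarball bytes", "dummy-token",
)
print(request.get_method(), request.full_url)
```

Because the name and version are the only team-specific inputs, a fix to the shared function reaches every team that uses it, which is the advantage of templates over copy and paste.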

One of the things that we really like to push to help teams go fast is pre-merge testing. When we do the testing pre-merge, each developer gets their own tests on their own branches without having to involve their teammates, which is great because everybody's able to focus on their own piece and deliver that.

The parts of pre-merge testing: we like to have compilation, if it's a compiled language, and unit tests. Unit tests test the smallest unit of code and make sure that it's working as expected. Unit tests are really great at baking in assumptions, identifying those, and making sure that things work the way that we expect. Unit tests, we say, are one of the developer's best defenses against their own teammates and their future selves. If a unit test breaks while they're doing a refactor or paying down tech debt, they know: oh hey, I've modified the assumptions. Those unit tests really help them make sure that everything is working as expected.

Next up is linting. Linting can make sure that the code is written well. In an uncompiled language like JavaScript, you can use a variable before you declare it, but in most cases that's an error. Linting can catch things like that, warn you, and help bring those errors back earlier.

Then the last is sandbox SAST scanning. SAST is static application security testing. It's just looking at the code or at the binary. We use SonarQube for that. It looks through all the code and identifies things like writing your password or database credentials into the logs. One of the things that SonarQube does a great job with is tracking the complexity of your functions. It flags a function that is too long and too complex, but it's not worried about lines of code. It's worried about branches: the different things that you might have to think about as you're trying to understand what a function does.
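The idea of measuring branches rather than lines can be illustrated with a small Python sketch. This is a simplified branch count in the spirit of cyclomatic complexity, not SonarQube's actual metric; the set of node types counted and the sample function are assumptions chosen for illustration.

```python
import ast

# Node types treated as decision points (a simplifying assumption).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)


def branch_complexity(source: str) -> int:
    """Return 1 plus the number of branching constructs found in the source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

print(branch_complexity(SAMPLE))
```

Adding blank lines or comments to the sample leaves the score unchanged, while adding another `if` raises it, which matches the point that the reader's mental load comes from branches.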

All of that data feeds back into the pre-merge test, and the developer fixes all of those things. Once all of those are fixed, then they're ready for another human to come review. What's great about having the testing shifted left all the way before the merge request is that all of those things can get checked before another human is even brought into the process.

Another check that they can do is a branch deploy. This allows developers to take the code that's on their branch and deploy it in a limited dev environment. This is really useful for our teams that are developing in the cloud. One of our teams is using Amazon Lex, which doesn't really have a great story for developing it locally. This team has like 20 different dev environments, one for each developer, so that they can take their branches and, before they get everything right, just deploy it in their own environment in AWS and make sure that it's working before bringing more people in to look at it.

Another feature that helps the developers move fast is that they are able to own their own IAM roles and policies. IAM is identity and access management. It's the definition of what the roles in the application are: who can do what. For a Lambda, which databases can it touch and how? Which S3 buckets can it touch and how? Just all the connections between the different pieces of the application. The developers are able to define that in their own code in either Terraform or CloudFormation.

Then security comes in at merge request time. Once they get it tested in their own branch, then they bring in security. We use a GitLab feature called code owners. Code owners lets you specify: hey, if you change code in these files or these files or these files, then you need additional review, you need additional signoff by these people. Security-injected code review means that security comes and reviews anything that's going to get merged into the main branch that's going to be headed for production. It's nice to have all of it fixed right at that time, so that nothing heading to production has insecure roles.
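The rule "changes to these files need signoff from these people" can be sketched in Python as a pattern lookup. This is only an illustration of the concept: it does not reproduce GitLab's CODEOWNERS file syntax or matching rules, and the paths and group names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical ownership rules: path pattern -> group that must approve.
OWNER_RULES = [
    ("iam/*", "@security-reviewers"),
    ("terraform/*", "@security-reviewers"),
    ("src/*", "@team-a"),
]


def required_reviewers(changed_files: list[str]) -> set[str]:
    """Return every group whose pattern matches at least one changed file."""
    return {
        owner
        for path in changed_files
        for pattern, owner in OWNER_RULES
        if fnmatchcase(path, pattern)
    }


print(required_reviewers(["iam/lambda_role.tf", "README.md"]))
```

A merge request that touches only `README.md` would need no extra reviewers under these rules, while any change under `iam/` pulls in the security group.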

The next step: let's talk about empowering production control. One of the things that we use is deploy projects. The build projects are completely separate from the deploy projects. Everything's an artifact. Those build projects do their builds, they get the artifact, they get that binary just the way they want it, they put it in Artifactory, and then it's referenced from the deploy project. The deploy project is also in GitLab, doing pipelines as code using templates so that everything works the same way for everyone. That also gives us a single upstream view: all of the pipelines of this flavor work this way, all the pipelines of that flavor work that way, so that teams have a standard and an expectation for how things will go.

A tool that we use to identify all of those artifacts is the manifest file. The manifest file is a JSON file that defines all the attributes about each of the artifacts. We'll take a look at one of those. This is a simplified manifest file, but you can see we've got two artifacts here that each define their name and their type and their version. The location tells the pipeline where to go get that artifact out of Artifactory. Then it's got a reference to the commit so that they can go back and see: okay, what changed? Where did this come from? What was it?
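The slide itself is not reproduced in the transcript, so the following Python sketch only illustrates what a manifest with those attributes might look like and how a pipeline could read it. The JSON key names, artifact names, locations, and commit hashes are all assumptions.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str
    type: str
    version: str
    location: str  # where the pipeline fetches the artifact from
    commit: str    # source commit the artifact was built from


# Hypothetical two-artifact manifest; field names are illustrative only.
MANIFEST_JSON = """
{
  "artifacts": [
    {"name": "payments-api", "type": "jar", "version": "1.2.3",
     "location": "release-local/payments-api/1.2.3/payments-api-1.2.3.jar",
     "commit": "0a1b2c3"},
    {"name": "payments-ui", "type": "static", "version": "4.0.1",
     "location": "release-local/payments-ui/4.0.1/payments-ui-4.0.1.tgz",
     "commit": "9f8e7d6"}
  ]
}
"""


def load_manifest(text: str) -> list[Artifact]:
    """Parse manifest JSON into typed artifact records."""
    return [Artifact(**entry) for entry in json.loads(text)["artifacts"]]


for artifact in load_manifest(MANIFEST_JSON):
    print(artifact.name, artifact.version, artifact.commit)
```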

Being able to have those versions really helps the clean handoff between development and operations and DevOps, because operations can say, okay, we know we deployed version 1.2.3. That's not quite working; something changed. What was it? They can say, okay, the previous version we deployed was 1.2.1. What happened in those two intervening versions? Then developers can look back and they've got a very limited set of commits to look at to figure out what were the changes, what was different, and how do we remedy that in the next one?
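As a rough illustration of how versions narrow the search, the Python sketch below lists the versions released after the last known-good deployment, up to and including the one that misbehaves. The release history and commit hashes are hypothetical, and the sketch assumes simple dotted numeric versions.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.2.3' into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def versions_to_review(history: dict[str, str], last_good: str,
                       current: str) -> list[str]:
    """Return versions newer than last_good, up to and including current."""
    low, high = parse_version(last_good), parse_version(current)
    return [
        version
        for version in sorted(history, key=parse_version)
        if low < parse_version(version) <= high
    ]


# Hypothetical release history: version -> commit it was built from.
HISTORY = {"1.2.0": "aaa111", "1.2.1": "bbb222",
           "1.2.2": "ccc333", "1.2.3": "ddd444"}

suspects = versions_to_review(HISTORY, last_good="1.2.1", current="1.2.3")
print([(version, HISTORY[version]) for version in suspects])
```

With the commit recorded per version, the developers only need to read the commits behind the versions this returns.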

Here's our diagram of a deploy project. It's owned by the DevOps team, and the production pipelines are only run by our operations production support team.

The deploy pipeline has the manifests like we talked about. It's got three main phases. The first is prepare. This is where we would run a Terraform plan to make sure that everything is all set, make sure all the roles are ready, make sure all the pieces are available, make sure that all the manifests are accessible. All the artifacts can be downloaded from there.

Then we go ahead and do the deploy. Depending on the application, that might be a canary deploy, where we leave the current version running, bring up the new version alongside it, and then slowly shift the traffic over, making sure that all the metrics are okay and everything looks successful as we do that shift.
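The canary pattern can be sketched as a loop that raises the new version's share of traffic step by step and backs out if a health check fails. This is a generic Python illustration, not the deploy pipeline described in the talk; the step sizes and the health-check callback are assumptions.

```python
from typing import Callable, Sequence


def canary_rollout(steps: Sequence[int],
                   healthy: Callable[[int], bool]) -> int:
    """Shift traffic in steps; return the final percent on the new version.

    `healthy(percent)` stands in for checking metrics after traffic has been
    shifted to `percent`. A failed check rolls everything back to 0.
    """
    current = 0
    for percent in steps:
        if not healthy(percent):
            return 0  # roll back: all traffic returns to the old version
        current = percent
    return current


# A rollout where metrics stay healthy reaches 100 percent.
print(canary_rollout([10, 50, 100], lambda percent: True))
# A rollout whose metrics degrade at 50 percent is rolled back.
print(canary_rollout([10, 50, 100], lambda percent: percent < 50))
```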

Then the last phase is integration and smoke testing. Integration testing is great in lower environments to make sure that, once deployed, the application works as expected. In an integration test, we use Selenium a lot to drive all the way from the front end, from the web interface, through the various application layers all the way down to the database. End-to-end testing really helps make sure that all the assumptions are correct and everything fits in place. With unit tests, you're using a mock back end where the developers say, okay, I expect the database to give me this. But if the data in the database doesn't match the expectations of the code, then integration testing will catch that discrepancy and reveal it.

If we're deploying to production, that's when we do a smoke test. Once we've deployed the new artifacts, then we make sure that everything is working before we start releasing new traffic onto that code.

The last piece of that, that helps production control really have the right controls, is the ServiceNow validation. ServiceNow is where we do all of our change tickets. ServiceNow shows that everybody has signed off, everybody knows about this deployment, everybody knows that this change is going to be happening in this time window. That call to ServiceNow validates that everything is good for the pipeline to go ahead and run and we're good on all of our regulatory concerns.
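The gate this validation provides can be sketched as a simple check: the change must be approved and the current time must fall inside the agreed window. The Python below is a hypothetical model of that decision only; it does not call or imitate the ServiceNow API, and the record fields are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeTicket:
    # Hypothetical fields; a real change record carries far more detail.
    number: str
    approved: bool
    window_start: datetime
    window_end: datetime


def may_deploy(ticket: ChangeTicket, now: datetime) -> bool:
    """Allow the pipeline to run only for an approved change inside its window."""
    return ticket.approved and ticket.window_start <= now <= ticket.window_end


ticket = ChangeTicket(
    number="CHG-EXAMPLE",
    approved=True,
    window_start=datetime(2022, 10, 18, 22, 0),
    window_end=datetime(2022, 10, 19, 2, 0),
)
print(may_deploy(ticket, datetime(2022, 10, 18, 23, 30)))
print(may_deploy(ticket, datetime(2022, 10, 19, 9, 0)))
```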

Next we're going to talk about clean handoff, which is some additional techniques that the DevOps team has provided to help those developers and operations work hand in hand.

First of all is the DevOps Dojo. We provide a walkthrough of building an application at Truist. We provide walkthroughs for all the major different types of languages that they're using. This is a great resource to help new teammates get onboarded to doing things the Truist way. It covers everything from getting SSH keys set up with GitLab to downloading the code, to getting all the right software installed on a workstation, running unit tests locally, making changes, watching the pipeline succeed, watching the pipeline fail and debugging it: all the things that we expect a developer to come across in their main development process. It gives an overview of how all the pieces fit together.

One of the techniques that we use to separate permissions is called forked deploy projects. What we do is we fork the deploy project to allow the devs to control the lower environments and allow operations to keep track of the higher environments. Basically, we've got two copies of the same project with a fork relationship between them, so that the developers, as they're working on the exact script that they need to deploy, putting the files on the server in just the right place if it's an on-prem pipeline, can work on that, make changes, deploy to dev one, deploy to dev two, and test those deploys. Once they feel solid and confident on it, they can make an unsolicited merge request over to the forked project.

What's great about that is that they don't need any permissions on the higher DevOps-owned project. They can make that merge request and they don't cause any pipelines to run in the higher project. Then DevOps and production support can review those changes, review those scripts to make sure everything's safe and not trying to reach outside the boundaries. Then we can merge that and use that code written by the developers to do their deploys to get down to the QA and the prod and other protected environments.

Another technique that we use to separate permissions we've called GitLab utility pipelines. The big idea here is that it's pipelines as an API. This technique allows us to give dev teams a trigger token, and that trigger token only has permission to kick off the pipeline. It doesn't provide any additional visibility. It doesn't provide any editability into the project. It just allows the dev team to kick off a pipeline in this utility pipeline project. That allows us to fully separate the access control.

Here in this diagram you can see the developer owns their build pipeline. There's a trigger job in there which might do different things. The examples that we have are they might need to destroy an environment or promote. In their pipeline, they'll have multiple jobs that are the trigger. Those triggers kick off the utility pipeline, and then the utility pipeline can validate the parameters that came in. It can have its own set of credentials because it's in a completely separate namespace from the developer's project, and then it can take actions on the deploy pipeline.

For example, on environment destruction: like we talked about, some of our teams have a lot of different development environments, and they need to destroy those at some frequency. One reason is just cleaning things up: if a developer moves on or goes on vacation for a while, they can destroy those resources so that they're not incurring costs while they're out. The other is that destroying the environment allows them to get a clean first run and make sure that everything works on first run, because sometimes we get infrastructure into a state where it will keep working, but we can't build a new one. We try to avoid those situations and test our way out of that.

The other is for promotion. Some developers and some teams have the ability to promote to their test environment. We allow them to trigger that from their dev environment. The GitLab utility pipelines allow us to have complete control over what the developers can and can't do.
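The parameter validation a utility pipeline performs can be sketched as an allowlist check. In this Python illustration the action names, environment names, and the allowlist itself are hypothetical; the point is only that the utility pipeline, not the caller, decides which requests are acceptable.

```python
# Hypothetical allowlist: action -> environments it may target.
ALLOWED_REQUESTS = {
    "destroy": {"dev1", "dev2"},
    "promote": {"test"},
}


def validate_trigger(action: str, environment: str) -> None:
    """Raise ValueError unless the requested action is allowed for the environment."""
    allowed_environments = ALLOWED_REQUESTS.get(action)
    if allowed_environments is None:
        raise ValueError(f"unknown action: {action!r}")
    if environment not in allowed_environments:
        raise ValueError(f"{action!r} is not permitted on {environment!r}")


validate_trigger("destroy", "dev1")      # accepted silently
try:
    validate_trigger("destroy", "prod")  # rejected: prod is not on the allowlist
except ValueError as error:
    print(error)
```

Since the check runs in the utility project's own namespace, a dev team holding only a trigger token cannot edit the allowlist.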

Here's what a trigger looks like. It's just a simple curl request. Again, this is GitLab YAML script, and you can see here the form variables. It's setting the environment to dev one. This is saying I want to call my utility pipeline which destroys environments, to destroy dev one so the stack gets cleaned up and I can start fresh. This is all the developers need in their build project to kick that off.
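The slide's curl command is not in the transcript, so here is a Python sketch of an equivalent request against GitLab's pipeline trigger endpoint, which accepts a trigger token, a ref, and `variables[...]` form fields. The request is built but not sent; the GitLab host, project ID, token, and the variable name `ENVIRONMENT` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_trigger_request(gitlab_url: str, project_id: int, token: str,
                          ref: str, variables: dict[str, str]) -> Request:
    """Build (but do not send) a POST to GitLab's pipeline trigger endpoint."""
    form = {"token": token, "ref": ref}
    # Pipeline variables are passed as form fields named variables[KEY].
    form.update({f"variables[{key}]": value for key, value in variables.items()})
    return Request(
        url=f"{gitlab_url.rstrip('/')}/api/v4/projects/{project_id}/trigger/pipeline",
        data=urlencode(form).encode("ascii"),
        method="POST",
    )


request = build_trigger_request(
    "https://gitlab.example.com",  # hypothetical GitLab host
    1234,                          # hypothetical utility pipeline project ID
    "dummy-trigger-token",
    "main",
    {"ENVIRONMENT": "dev1"},       # hypothetical variable name
)
print(request.full_url)
print(request.data.decode("ascii"))
```

The trigger token is the only credential in the request, which is what keeps the developer's access limited to starting the pipeline.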

Another piece that we've been putting a lot of work into lately is automated change tickets. We just got this working so that when the developers deploy to whatever their pre-prod environment is, we have automated change ticket creation. That automated change ticket creation will create the change request in ServiceNow.

Then it does automated evidence collection. It's pulling everything that it can over into ServiceNow so that it's all in one place. We have a module in ServiceNow called ServiceNow DevOps, and that is able to take in all the commits, all the merges, all the GitLab issues. Everything out of GitLab has been coming in as webhooks over time so that ServiceNow has all the data. It also reaches out to our work management tool, which is Rally, to pull the user stories, the features, the epics, so that ServiceNow has the full end-to-end value stream of user stories being ideas, being created, being changed, being sent down for implementation to GitLab, getting done, getting unit tests, getting built, getting deployed to lower environments, and then being ready to deploy to production.

Since ServiceNow has that full end-to-end view, we can create dashboards and tooling to look at how things are running in the whole system. Then it gives a lot more context in that change ticket of what's changing, why it's changing, and who it's changing for.

In addition to all these things to help the handoff be even more clean between the builds, the artifacts, and the operations, we built a custom service. We call it Kali.

Kali build integration is pretty straightforward. At the end of the build pipeline, the developers call Kali, and that takes the artifact and uploads it into Artifactory and then tells the Kali service about it. Here's a diagram of that. The developer owns their build pipeline. They can use the Kali CLI to reach out to the Kali service, and then the Kali CLI will put their artifact in their development Artifactory.

Then the Kali service can reach into the deployment pipeline to update that manifest, which kicks off a deployment to their development environment. While it's doing that, it also reaches over to Veracode to kick off the long-term full official scan. What's nice about that is those scans can take some time, so it's nice to have those kind of outside the primary pipeline. Developers can build and run and get their deployments going to their dev environments, and then Veracode can be running the scans in the background.

Like we said, Kali kicks off those long-running static application security testing scans, and then integration with the deploy pipelines will confirm that the scans are successful.

The deploy integration is that Kali is tracking every version of every artifact. It knows what changed, by whom, when, and all those details, and is able to update that manifest.

The last feature of Kali we'll talk about is Kali promotion. Kali promotes artifacts through the environments upwards from dev to QA to prod, and it allows increasing control requirements for each of those because Kali knows about each of the environments. It knows what are the requirements for each. For example, dev doesn't require any scanning to pass, but QA and production do require the scans to pass, and it requires the scans to have been completed within a certain time frame.
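Those increasing requirements can be sketched as a per-environment rule table. The Python below is a hypothetical model, not Kali's implementation: the environment keys, the record fields, and the 30-day maximum scan age are all assumptions, since the talk says only "a certain time frame".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_SCAN_AGE = timedelta(days=30)  # assumed value; the talk gives no number
SCAN_REQUIRED = {"dev": False, "qa": True, "prod": True}


@dataclass(frozen=True)
class ScanResult:
    passed: bool
    completed_at: datetime


def can_promote(target_env: str, scan: Optional[ScanResult],
                now: datetime) -> bool:
    """Apply the target environment's scan requirement to one artifact."""
    if not SCAN_REQUIRED[target_env]:
        return True  # dev accepts artifacts without a passing scan
    if scan is None or not scan.passed:
        return False
    return now - scan.completed_at <= MAX_SCAN_AGE  # scan must be recent


now = datetime(2022, 10, 18)
fresh = ScanResult(passed=True, completed_at=now - timedelta(days=3))
stale = ScanResult(passed=True, completed_at=now - timedelta(days=90))
print(can_promote("dev", None, now), can_promote("qa", fresh, now),
      can_promote("prod", stale, now))
```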

Kali is continually rescanning those artifacts. If nothing has changed and the scan still passes, we can release the same artifact again to production. But an artifact with no changes might still fail a later scan, because new vulnerabilities are revealed in the open source dependencies over time.

Kali promotion will copy the artifacts between the Artifactory repositories, which we need for retention. When Kali does a promotion, it reaches into the developer's Artifactory repository, picks up those artifacts, and then puts them into the release repository. The reason that we need to do that is for retention, because Artifactory only keeps dev artifacts around for 30 days, release artifacts for more like six months, and then once the artifacts are marked for production, they need to be kept around forever.
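The retention tiers can be written down as a small lookup. This Python sketch only restates the numbers from the talk; the tier names are hypothetical, and "more like six months" is approximated as 182 days.

```python
from datetime import date, timedelta
from typing import Optional

# Retention per repository tier; None means the artifact is kept forever.
RETENTION = {
    "dev": timedelta(days=30),
    "release": timedelta(days=182),  # roughly six months (approximation)
    "production": None,
}


def expires_on(tier: str, stored_on: date) -> Optional[date]:
    """Return the date an artifact ages out, or None if it never does."""
    keep_for = RETENTION[tier]
    return None if keep_for is None else stored_on + keep_for


stored = date(2022, 1, 1)
for tier in RETENTION:
    print(tier, expires_on(tier, stored))
```

Copying an artifact into the next repository at promotion time is what moves it onto the longer retention schedule.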

Again, here's our diagram of the pipeline framework. There are pieces that the development team owns. We want to give them the most ownership and the most control possible where we can. Then for everything else, we try to provide ways for them to interact with it and to drive the pieces that are available to them, given the requirements and the framework that we have.

In summary, we want devs to own as much as possible so they can move fast and they're not dependent on and blocked on our single team. Everything is an artifact, and so devs build those binaries and send those forward: build once, run everywhere. We use utility pipelines to specifically control what the developers can and can't do. We have automated ticketing that collects the evidence so that everything's in one place for review and for signoff. We built a custom service so that we could do that compliance kind of out of band, and so that we could be sure that we're tracking every artifact the whole way through.

Last of all, we're hiring, so feel free to reach out to me, or connect through careers.truist.com. Thank you very much.