Moving Mountains: Security & Compliance Guardrails in Pipelines

Virtual US 2022

Bobbi Wenzler

Lead Technical Product Manager · Northwestern Mutual

Nicole Schultz

Assistant Director - CICD Engineering · Northwestern Mutual

How can large organizations move the needle on compliance, audit, and security standards in CICD pipelines when you have over 20,000 source-code projects? How can these standards be made highly visible, so developers are aware of them earlier in the software delivery process to ensure code and pipelines are compliant? How can you drive organizational alignment on these compliance standards that will impact every engineering team within the organization? How does an organization's culture adapt to this new way of integrated compliance standards within the software development lifecycle?

Over the past year, Northwestern Mutual has been tackling these problems head-on. We identified existing compliance gaps in our pipelines by deep-diving into data and visualizing the results. We pushed to gain organizational alignment - bringing together leaders across many verticals to take action. We acknowledged the reality that given certain scenarios, engineering teams would need short-term exceptions to certain compliance requirements - and we created a custom solution to address that need. Finally - we visualized all compliance and exception data in Grafana for clear transparency of organizational progress.

Join this session to hear about how Northwestern Mutual dramatically changed the way we're working to automatically integrate compliance requirements into every single CICD pipeline in Technology.

Full transcript

The complete talk, organized by section.

Bobbi Wenzler

Hi. Welcome to our talk today. We are so excited to talk to you about one of the big mountains we've had to overcome in our organization.

I'm Bobbi Wenzler. I'm the lead product manager for the CI/CD engineering team at Northwestern Mutual. My role focuses on being out in front of the engineering team on multiple fronts. I build out product roadmaps so we have a clear vision of what we want to deliver, why it helps our company, and how it ties back to our enterprise goals.

The other big area I focus on is ensuring we keep the voice of our customer front and center in everything that we do. For me, this is getting input from engineers who use our products every day. I use data to target the customers that will be heavily impacted by our work all across our organization so I can get feedback on where we want to go. This is so critical so we can understand how our product changes will impact multiple engineering customers.

I'm really lucky to work with Nicole and her awesome team of engineers on the CI/CD engineering team. CI/CD engineering is responsible for creating a positive developer experience and capabilities that enable us to have a sustainable way for developers to create, test, maintain, support, and release all the applications at Northwestern Mutual. We intersect software development and deployments. Our goals are to help technology reduce the time to market while guaranteeing that the best practices for security, availability, and reliability are met.

Now I'll turn it over to Nicole to tell us more about how we work together.

Nicole Schultz

Thanks, Bobbi. And hi, everyone. Thanks for joining our session today.

I'm Nicole Schultz, and I'm the assistant director for the CI/CD engineering team here at Northwestern Mutual. As Bobbi highlighted, she lays the runway for the team on the upcoming work and customer asks we're getting. Once she's gathered enough customer data points, workflows, and high-level requirements, that's really where the team and I come in. We dig into the problem and start breaking it apart to determine how we are going to engineer a solution to meet the needs of the enterprise.

Bobbi and I have been working in this product lead and engineering leader duo for just over three years now, and we are really passionate about creating customer-first solutions for the business that make it simple for teams to adopt and accelerate their CI/CD and DevOps practices.

This year, one of those enterprise asks we tackled was figuring out how to build compliance and security guardrails into the pipeline. But before we deep dive into that, I wanted to share some information about Northwestern Mutual for those of you who maybe haven't heard of us.

We are a financial planning company with a vision to free Americans from financial anxiety. We are headquartered in Milwaukee, Wisconsin, and also have offices in Franklin, Wisconsin, and New York.

A few more fun facts about Northwestern Mutual to help you understand the landscape that Bobbi and I are working in: we are currently 97 on the Fortune 500, with annual revenue of $34 billion. We have 4.9 million clients, and year over year, 97% of those clients continue to stay with us for their financial planning needs, which is awesome. We have over 7,600 employees. And finally, Northwestern Mutual is over 165 years old, so we have a really diverse tech stack, which is why we needed to have a centralized way to ensure pipelines are meeting security and compliance guardrails.

So I'll turn it over to Bobbi to talk about how we began our journey up this mountain.

Bobbi Wenzler

Thanks, Nicole. So now that you've learned a little bit about Northwestern Mutual and our team, let's talk about the mountains we had to overcome.

The gist of the problem, like Nicole mentioned, is that we're trying to ensure our source code repositories and pipelines are meeting the necessary compliance and security checks. On paper, I realize that may not sound hard, but we have over 24,000 codebases, and that number is growing every day. We also have 15,000 pipelines that run every day. So making changes to this landscape is really not something we can do overnight. It's a really large effort.

The next problem we had to overcome is a lot of our standards are captured in various documentation locations across various departments. So how can we expect new or existing engineers to know what our standards are or when they change?

Reports have also been created for many of these items, but in some cases the data actually lagged, meaning teams had to wait overnight, or sometimes until after they were already in production, to see their repository on the naughty list. That's not a good feeling.

Speaking of manual work, a lot of overhead went into manually reviewing those reports and following up. In some cases, compliance findings would also be opened, which then required teams to manually fill out remediation plans for how they would address those gaps too. So folks, this is our mountain.

Now that we know what our mountain is, how did we start to address these problems?

First, we gathered data. We really wanted to show what gaps existed in our technical landscape. We did this by querying project and pipeline metadata to understand the count of projects that were already compliant and to illustrate we have the ability to source this data from all of our toolchains programmatically.

That got us thinking: if we can source this data programmatically, why not move those compliance checks into the pipelines and run them in real time on every pipeline?

So then we decided we needed to socialize this with our executive leadership, such as our CTO, our CSO, and their leadership teams. We wanted to gain alignment that a problem exists and that it needed to be solved. We wanted to ensure we had top-down alignment, since we knew if we wanted to require these compliance and security checks in the pipeline, we would be impacting the developer workflow and how they do their jobs every day.

While we were working on that cross-functional alignment, we also had the CI/CD engineering team working on a proof of concept to see: could we really programmatically source this compliance and security data from projects and pipelines? It turned out we could.

Once we had enough technical certainty that we could programmatically evaluate projects and pipelines for compliance rules, we began working through: how is this going to impact developers?

We knew we really wanted to maintain a positive developer experience. So we knew we needed a rock-solid communication plan with ample runway. To give teams that runway, we started using that data we gathered in step one and began visualizing it. We pulled all that data together and created a dashboard to lay out what projects would have to change to be compliant and who owned those projects.
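The data gathering and dashboard view described above could be sketched roughly as follows. This is a minimal Python illustration, not the team's actual query or schema: the record fields (`name`, `owner`, `has_approval_rule`) and the sample projects are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ProjectRecord:
    # Hypothetical fields; the real metadata schema is not described in the talk.
    name: str
    owner: str
    has_approval_rule: bool


def compliance_summary(projects):
    """Return (compliant_count, total, percent) for one compliance rule."""
    total = len(projects)
    compliant = sum(1 for p in projects if p.has_approval_rule)
    percent = round(100 * compliant / total, 1) if total else 0.0
    return compliant, total, percent


def non_compliant_by_owner(projects):
    """Group non-compliant project names by owner, for a dashboard-style view."""
    grouped = {}
    for p in projects:
        if not p.has_approval_rule:
            grouped.setdefault(p.owner, []).append(p.name)
    return grouped


# Made-up sample data, for illustration only.
sample = [
    ProjectRecord("billing-api", "team-a", True),
    ProjectRecord("claims-ui", "team-b", False),
    ProjectRecord("quotes-svc", "team-a", False),
    ProjectRecord("docs-site", "team-c", True),
]
print(compliance_summary(sample))
print(non_compliant_by_owner(sample))
```

Grouping by owner is what lets a dashboard show each team exactly which of its projects still need to change.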

We socialized this dashboard everywhere and began communicating about all the upcoming changes. The good news is developers did begin burning down the count of non-compliant projects, but we still needed to ensure that we kept communicating so they were aware. To do that, we gave targeted enforcement dates so our customers could do planning. That made sure that they could put this work into their program increments and be ready to go with us.

Now I'll turn it over to Nicole so she can share what our proof of concept turned into.

Nicole Schultz

Thanks, Bobbi. The proof of concept the team worked on was officially named Pipeline Enforcer. At the highest level, Pipeline Enforcer is a microservice that provides guardrails on security, audit, and quality best practices. Pipeline Enforcer does this by essentially analyzing every single project and pipeline in real time from our source code management tool to determine if it is compliant or not.

So with Pipeline Enforcer now, we have started defining some of those standards in code, which makes it much easier to maintain, as they are in a central location within the source code management tool. And as Bobbi mentioned, they're not disparate and split across various documentation sources.

Pipeline Enforcer also enables us to shift feedback left in the pipeline. We want teams to get fast feedback as soon as their pipelines run to alert them if any guardrails failed so they can address those failures early in the development lifecycle. Like Bobbi talked about earlier, developers no longer have to wait for a manual report to be done and shared with them weeks or months after they did a deployment.

By moving these checks into the pipeline and providing that fast feedback to developers immediately when their pipelines run, we have really been able to move the needle on resolution of compliance checks in the pipeline, and Bobbi is going to show some metrics later on in our presentation.

So as we worked to create Pipeline Enforcer, we realized there were two things we would need in order to try and provide a positive developer experience as we were looking to enforce compliance guardrails in the pipeline. And those two things were warnings and short-term exceptions.

So we added a flag on each check within Pipeline Enforcer to be able to run it in either a warning mode or an enforcement mode. Every new check we add to Pipeline Enforcer first rolls out in warning mode. So if a pipeline runs and is not passing a compliance check in warning mode, Pipeline Enforcer just adds a comment to the pipeline.

This allows us to give the developer real-time feedback and details on what action they need to take to make their pipeline compliant, and also allows us to runway new checks with ample time. We typically roll out new checks in warning mode for at least 90 days so we can allow development teams to pull in those updates to make their pipelines compliant. And we combine that warning mode feature with data visualizations that Bobbi talked about earlier so we can easily track the progress of adoption of the new check that we want to roll out.

We typically aim to reach 80% of projects being compliant with a check before we move it into enforcement mode. When we flip a check to enforcement mode, Pipeline Enforcer actually cancels the entire pipeline if it is not compliant. The cancellation of a pipeline ensures that code will not be deployed to production without passing all the compliance and security checks.
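The warning-versus-enforcement flag and the rollout rule of thumb described above might look something like this in code. It is a sketch under assumptions: the `Check` fields, the function names, and the dates are hypothetical, and the 90-day and 80% figures are the "typical" values from the talk treated as tunable constants, not hard rules.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Mode(Enum):
    WARNING = "warning"
    ENFORCEMENT = "enforcement"


@dataclass
class Check:
    name: str
    warning_started: date
    mode: Mode = Mode.WARNING  # every new check starts in warning mode


# Typical values mentioned in the talk; assumed here to be adjustable.
MIN_WARNING_DAYS = 90
TARGET_COMPLIANCE = 0.80


def ready_for_enforcement(check, compliant, total, today):
    """True when the check has had enough runway and adoption is high enough."""
    days_in_warning = (today - check.warning_started).days
    adoption = compliant / total if total else 0.0
    return days_in_warning >= MIN_WARNING_DAYS and adoption >= TARGET_COMPLIANCE


def action_on_failure(check):
    """What happens to a failing pipeline under the check's current mode."""
    if check.mode is Mode.WARNING:
        return "comment"
    return "cancel_and_comment"


# Hypothetical check and dates, for illustration only.
check = Check("container-scanning", warning_started=date(2022, 1, 10))
print(ready_for_enforcement(check, 850, 1000, date(2022, 5, 1)))
print(action_on_failure(check))
```

Keeping the mode as a per-check flag means a check can move from warning to enforcement without any change to the check logic itself.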

So with enforcement and actually canceling pipelines on projects and impacting work, we also needed to be realistic. And we knew that there were going to be times when a development team needed to bypass a compliance check for a business-approved reason. The main use case for short-term exceptions is a system outage in production. In that scenario, teams may need to skip certain checks in their pipeline in order to get the system back up and working in production as fast as possible.

For short-term exceptions, we would need a system that would allow teams to search by their project and then select the specific compliance check they needed to bypass. We also needed to have an approval process in place to ensure exceptions had a peer review to meet audit requirements.

Internally, we didn't have an existing tool that met those requirements, so we got to work and built our own exception API and UI application that allows teams to submit and approve short-term exceptions. Short-term exceptions are only ever approved by engineering team members and engineering leaders who own the application code. This allows them to determine if that exception is needed for a business reason and govern any active exceptions on their projects at any given time.

All right. So let me show you how these tools are architected and working.

Here is our architecture for Pipeline Enforcer. We have our source code management tool configured to trigger an event on every commit and every pipeline that gets sent out to Pipeline Enforcer. Pipeline Enforcer is a microservice running in Kubernetes. And when it receives a payload, it goes out to our source code management tool and pulls project and pipeline metadata. It also calls out to our exception API to get a list of any active exceptions for the current project. From there, it runs through all the programmatic compliance checks to determine if they are met or not, and then logs an audit entry into a database that we have.
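To make that flow concrete, here is a minimal Python sketch of handling a single event. It is not the real service: `handle_event`, the payload keys, and the four injected collaborators (metadata lookup, exception lookup, check runner, audit log) are all hypothetical stand-ins, stubbed out so the example runs on its own.

```python
from datetime import datetime, timezone


def handle_event(payload, fetch_metadata, fetch_exceptions, run_checks, audit_log):
    """Process one commit/pipeline event end to end.

    All collaborators are injected, hypothetical callables/containers.
    """
    project_id = payload["project_id"]
    metadata = fetch_metadata(project_id)        # source code management tool
    exceptions = fetch_exceptions(project_id)    # exception API
    results = run_checks(metadata, exceptions)   # programmatic compliance checks
    entry = {
        "project_id": project_id,
        "pipeline_id": payload.get("pipeline_id"),
        "results": results,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(entry)                      # stands in for the audit database
    return entry


# Stub collaborators so the sketch is self-contained.
audit_log = []
entry = handle_event(
    {"project_id": 42, "pipeline_id": 1001},
    fetch_metadata=lambda pid: {"id": pid, "jobs": ["build", "test"]},
    fetch_exceptions=lambda pid: set(),
    run_checks=lambda meta, exc: {"code-quality": "fail"},
    audit_log=audit_log,
)
print(entry["results"])
```

Injecting the collaborators keeps the handler independent of any particular source code management tool or database.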

So let's walk through the anatomy of a check within Pipeline Enforcer. First, we pull our project data. Then we pull a list of applicable checks for the project based on the metadata we have. Not every single compliance check is applicable to every single project. For example, security scanning for containers would only be relevant for projects that are building and deploying containers.

If a check applies to the given project, then we run that check against the project to see if the project passes or fails it. If the project passes that check, that's the happy path and we move on to any remaining applicable checks for that project.

If the project fails the check, we check to see if it has any active approved exceptions. If it does have an approved exception, then we continue on to the next check because we don't take any action on a project that does have a valid exception. If the project, however, does not have a valid exception, we then determine if the check is in warning or enforcement mode. If it's in warning mode, like I mentioned earlier, we post a comment with the information on how the developer can take action to update their project to pass that given guardrail. If the check is in enforcement mode, we cancel their pipeline and again post the same comment back to the developer with the information on how they can update their project to pass the given guardrail.
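The decision flow just described can be sketched in a few lines of Python. This is an illustration only: the `Check` structure, the project dictionary keys, and the sample checks are assumptions. The sketch also assumes the remaining checks keep being evaluated after an enforced check fails, which the talk does not specify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    applies_to: Callable[[dict], bool]  # is this check relevant to the project?
    passes: Callable[[dict], bool]      # does the project satisfy the check?
    enforced: bool = False              # False = warning mode, True = enforcement mode
    help_text: str = ""


def evaluate(project, checks, active_exceptions):
    """Walk every check and return (actions, cancel_pipeline).

    `active_exceptions` is the set of check names that have an approved,
    unexpired exception for this project.
    """
    actions = []
    cancel = False
    for check in checks:
        if not check.applies_to(project):
            continue  # check is not applicable to this project
        if check.passes(project):
            continue  # happy path: move on to the next check
        if check.name in active_exceptions:
            continue  # valid exception: take no action
        # Both modes post the same guidance back to the developer.
        actions.append(("comment", check.name, check.help_text))
        if check.enforced:
            cancel = True  # enforcement mode also cancels the pipeline
    return actions, cancel


# Hypothetical checks and project metadata, for illustration only.
checks = [
    Check("container-scanning",
          applies_to=lambda p: p.get("builds_containers", False),
          passes=lambda p: "container_scan" in p.get("jobs", []),
          enforced=True,
          help_text="Add a container scanning job to the pipeline."),
    Check("code-quality",
          applies_to=lambda p: True,
          passes=lambda p: "quality_scan" in p.get("jobs", []),
          help_text="Add a code quality scan job to the pipeline."),
]
project = {"builds_containers": True, "jobs": ["build", "test"]}
print(evaluate(project, checks, active_exceptions=set()))
```

Passing `active_exceptions={"container-scanning"}` for the same project leaves only the code quality warning and no cancellation.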

So let's see some examples of this. This is an example of our warning message. You can see Pipeline Enforcer adds an external job to our pipeline, which is in the red square on the screen, and then posts a message here on the screen that details how the developer can take action and the upcoming date of enforcement for this check.

Here's an example of our enforcement message. You can see again Pipeline Enforcer adds an external job to our pipeline. However, all other jobs in the pipeline are aborted and canceled because this pipeline did not pass the compliance check for static security scanning in the pipeline. You can see at the bottom of the screen here, we post a similar message back to the developer with just an extra icon to grab their attention. And we also link out directly to our exception app if they do need to get an exception. Speaking of which, let's take a look at how teams can submit exceptions.

This is what our exception app looks like. Users can search for their project and then will click Request Exception to show the pop-up we see here on the screen. So on the bottom-left corner, we can see the logged-in user is Bobbi. So Bobbi is submitting an exception today, and we can see in the pop-up here that the exception type selected is merge request approvals, and the reason code selected is the app is being sunset. So Bobbi is submitting an exception for this project saying they need to be allowed to self-approve merge requests because their app is going to be sunset and decommissioned soon anyways. She has the expiration date set for December 11th, and we can also see towards the top of the pop-up that she has selected my name as an approver for this exception.

So when she submits this exception, I am going to get a Slack notification with the exception information and a direct link that takes me to the approval screen. The list of eligible approvers for a project is anyone with admin or elevated access to that project. Since these are short-term exceptions, we really wanted to empower the developers and leaders who own that code to determine and govern their own exceptions.

Now this exception we can see is a little different here. Bobbi is requesting an exception with the reason code selected as recovering from a system outage. So when users select this reason code, we automatically hardcode the expiration date for the exception to two days from the current date, and we actually bypass approvals, since these are considered to be urgent scenarios that require a hotfix to be deployed out as fast as possible. However, users are still required to enter in the high-priority incident number, which we do validate on the back end to ensure there is an active open high-priority incident before we allow them to submit the exception.
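The two submission paths, a regular exception that needs an approver and an outage exception that skips approval, could be sketched as below. This is a hedged Python illustration: the `ExceptionRequest` fields, the reason code strings, the status names, and the `open_incidents` set (standing in for a real incident-system lookup) are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

OUTAGE_REASON = "system_outage_recovery"  # hypothetical reason code
OUTAGE_EXPIRY_DAYS = 2                    # "two days from the current date"


@dataclass
class ExceptionRequest:
    project: str
    check_name: str
    reason: str
    requested_on: date
    expires_on: Optional[date] = None
    approver: Optional[str] = None
    incident_id: Optional[str] = None


def submit(request, open_incidents):
    """Validate a request and return its resulting status.

    `open_incidents` stands in for a lookup of active high-priority incidents.
    """
    if request.reason == OUTAGE_REASON:
        if request.incident_id not in open_incidents:
            raise ValueError("an open high-priority incident number is required")
        # Expiration is fixed and approval is bypassed for outage recovery.
        request.expires_on = request.requested_on + timedelta(days=OUTAGE_EXPIRY_DAYS)
        request.approver = None
        return "active"
    if request.approver is None or request.expires_on is None:
        raise ValueError("an approver and an expiration date are required")
    return "pending_approval"


# Hypothetical request, for illustration only.
outage = ExceptionRequest("billing-api", "static-security-scan", OUTAGE_REASON,
                          requested_on=date(2022, 11, 1), incident_id="INC-1")
print(submit(outage, open_incidents={"INC-1"}), outage.expires_on)
```

Validating the incident number on the server side is what keeps the approval bypass from becoming a general-purpose escape hatch.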

So as we were pushing towards MVP release for both of these tools, Pipeline Enforcer and our exception app, we partnered with our security organization to prioritize three categories of guardrails that we wanted to roll out in warning mode and then enforcement mode. We wanted to enforce security scans in the pipeline, code quality scans in the pipeline, and peer reviews. We broke those categories down into specific checks that were put into Pipeline Enforcer with corresponding exception types in our exception app, and then we staggered dates for rollout and we started rolling them out to every pipeline in our environment.

So we made it. We made it to the top of the mountain, and we did successfully roll out MVP for Pipeline Enforcer and our exception app. And since Bobbi had the fun of rolling out all of our customer communications along this journey, I'll let her do the honors of sharing what the view was like post-MVP.

Bobbi Wenzler

Awesome. Thanks, Nicole.

So you can't talk about outcomes without data. So let's walk through what our organization actually achieved. This chart shows over time the percent of source code repositories that have an approval rule set in our source code management tool.

Approval rules are really critical to ensure a review is occurring on merge requests. That way there is separation of duties and engineers get fast feedback on their changes. As you can see here on the left, our original adoption of this setting was 45%. If you go to the right of the graph, you can see now we are at 96%.

I'm guessing you're wondering, why aren't we at 100%? Well, we do have some projects that teams may not work on actively at this time that they didn't archive. The good news is if they pick them back up and start making commits, Pipeline Enforcer is going to catch them.

Now I bet you're wondering: when did warnings and enforcements begin? So I'll add some markers here. You can see the value that our tools bring here, with the largest portion of our adoption happening right after we added those warnings and enforcements in the pipeline itself.

Let's move on to our next case study. This visual is showing us the percentage of pipelines with static application security scanning that actually have their jobs stop when there are vulnerabilities in the code. This is critical so that teams are reviewing and correcting new vulnerabilities in their code to keep our company secure. Our journey for this one shows on the left: we started at 21%. As you can see on the right, we are now up to 90%.

Now, let's take a look where the warnings and enforcements occurred again. You can see again our largest jump in adoption happened once again after warnings were added to our pipelines.

The next case study here is pipelines recording code quality scanning. We really want to ensure that teams are doing scanning in their pipelines to find out about coding issues, code that's really hard to maintain, or code that hasn't been well unit tested ahead of it going to production. In this instance, as you can see on the left, we started at 38%, and if you look at the right, we are now up to 84%.

Now, let's take a look at our warnings and enforcements again. So while in this instance, this might not have been our largest period of growth, we still picked up 16% adoption during this warning phase. The one thing I really want to highlight, though, is look at after that red indicator where enforcement began. You can see we continue to get adoption. That's really important because no additional outreach is happening. This is all happening because of the tools that our team built.

And last but not least, our very first standard we enforced: this one ensured that for pipelines that build containers, they have container scanning in their pipelines. I want to point out, while we do have runtime scanning, it's really important that teams are getting feedback in their pipeline when they're making changes prior to it being deployed to any environment.

This graph shows the starting point for us on this one on the left at 62%, and we are now up on the right at 98%. Let's see where those enforcements and warnings began again.

This one is a little bit different again because this was the first one we rolled out, so there was a much longer runway for teams to adopt. But I still wanted to include this graph to show that, if you look to the right, we are still maintaining our numbers at this point because all new and existing projects now have to have container scanning. It's no longer optional or something you can forget.

So now after all this great data, I want to talk about what's on the horizon for us. So we've moved that mountain. What's next? More mountains.

So first, we want to really create standardized and opinionated pipelines to streamline adoption and streamline staying compliant in our pipelines. Right now, there are so many patterns and tools in use at Northwestern Mutual. This is going to take time for us to build out those standardized pipelines. But the goal with this work is to make it really easy for engineers to create new projects that are compliant and using our predefined pipelines.

The next item we want to focus on is automating Day 2 operations. This will make it easier for engineering teams to adopt new compliance rules or updates in their pipelines. This will really allow engineering teams to focus more time on delivering features for our clients instead of maintaining pipelines.

Lastly, we want to provide a holistic view of compliance findings at an application level. Right now teams are bouncing between so many tools in the toolchain. It's really hard to tell what's accounted for and what's not. We want to make it easy to see the end-to-end toolchain and what it uncovered and what actions teams took, all in one place.

So I do want to give one last disclaimer: while Nicole and I are great at moving mountains at work, we have not actually climbed any mountains in real life. And I want to say thank you for joining our talk, and we really look forward to answering any questions that you have.