Creating A Secure Software Supply Chain In A Large Engineering Organization

Europe 2022

Rosalind Radcliffe

IBM Distinguished Engineer, DevSecOps CTO for the CIO · IBM

Thomas Lawless

IBM STSM, CIO Developer Experience · IBM

A secure software supply chain that complies with industry best practices, compliance and regulatory standards, and specific customer requirements is becoming more and more difficult for individual product teams to create, enhance and maintain.

Uniqueness in how each product's CI / CD automation is created in a large engineering organization leads to countless hours of duplicate effort as each product team enhances its CI / CD pipeline to meet the next standard. This is the problem the IBM CIO organization is solving by providing a common CI / CD platform focused on the developer experience.

During this session we will share our experiences to date by outlining the challenges we face, the progress we've made and our plans for the future.

Full transcript

Rosalind Radcliffe

Hello, my name is Rosalind Radcliffe, and welcome to this latest session as part of DevOps Enterprise Summit. I'm very happy to be with you here again at this event this year.

You may know me from before as Mainframe DevOps, but here, this time, I'm going to introduce myself as IBM Fellow and responsible for DevSecOps within the IBM CIO organization. And I'll fill you in a little bit more about that later. Tom, why don't you introduce yourself?

Thomas Lawless

Thanks, Rosalind. I'm very excited to be here as well, and very excited to be presenting with you. My name is Thomas Lawless. I'm a senior technical staff member in the CIO, and I work really closely with Rosalind on our software supply chain initiative.

Rosalind Radcliffe

The IBM CIO organization runs the IT, which runs IBM. It includes HR, sales, supply chain, all of the systems that you would expect to run a large business. And IBM is a very large business. If we think about IBM, we have 300 and some thousand people around the world, and all of that infrastructure is supported by our CIO office. We have approximately, within the CIO office, 6,000 developers, 6,000 applications, 70,000 GitHub repositories. So we have a very large organization responsible for providing very critical applications, financially significant applications, and very important from the standpoint of personnel and running the systems for IBM.

This organization has come together to bring together all of the support services across a number of different parts of IBM to centralize that capability so that we can provide services to IBM in a better way. The intention in my coming over to the CIO organization is to help ensure we become the client zero for IBM and the hybrid cloud showcase to demonstrate to our clients best practices in the industry. And as we go through this session, we're going to talk about something we've learned that we're working on that we hope will also spread to the rest of the industry to help the developer experience overall for everyone. Tom?

Thomas Lawless

Thank you, Rosalind. So we are going to talk a bit about our initiative around centralizing our CI/CD execution and really trying to drive a secure software supply chain, a common software supply chain across our entire organization.

As Rosalind called out, we have a very diverse set of technology that we deal with. We have everything from mainframe applications that have quite literally been running for decades to containers running on OpenShift on Linux and in IBM Cloud. And as the industry changes and cybersecurity threats become much more intelligent and pervasive, we continue to work with our teams to make sure they understand the responsibility that they have in terms of the way that they build and deploy their code.

Some of our applications have regulatory requirements, some of them don't, but we all need to be responsible for certain things in our pipeline. And it winds up being very time-consuming and costly for an organization of our size to ensure that all of our teams are doing what they should be doing in this area, given a distributed CI/CD system.

Historically, our teams haven't had a lot of options. They could use Jenkins. They could use other open source build tools. But we're seeing that the amount of redundancy and the amount of wasted work that is done across all of those teams adds up quite quickly. So we start to look at all of the requirements that our CI/CD pipelines have to ensure a secure software supply chain, and we start to multiply that by the number of repositories, the source code repositories that we have. Even just a small reduction in the amount of hours it takes to manage each repository in this way over the course of the year for us adds up to tens of thousands of hours.
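To make the scale concrete, here is a minimal arithmetic sketch. The repository count is the approximate figure quoted earlier in the talk; the 30 minutes saved per repository per year is a hypothetical value chosen only for illustration.

```python
# Approximate number of GitHub repositories, as quoted in the talk.
REPOSITORIES = 70_000


def hours_saved_per_year(repositories: int, minutes_saved_per_repo: float) -> float:
    """Total hours saved per year if each repository needs fewer minutes of CI/CD upkeep."""
    return repositories * minutes_saved_per_repo / 60


# Hypothetical assumption: 30 fewer minutes of pipeline maintenance per repository per year.
print(hours_saved_per_year(REPOSITORIES, 30))  # 35000.0
```

Even with a saving of only half an hour per repository, the total lands in the tens of thousands of hours the speaker describes.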

In addition, of course, late in 2021 we were also hit by the Log4j vulnerability, which I'm sure a lot of the people listening to this recording were also exposed to. And while our teams handled it really pretty well, what we found is that we had issues in our ability to have visibility into our source code repositories, right? Our exercise here wound up being more of an investigation than a reporting exercise, and we need to get to a point where it's more of a reporting exercise.

Rosalind Radcliffe

If you think about this and you think about the scale and size of our organization, and the fact that we have so many different applications, you've got to understand that all the teams work independently, so we don't necessarily know centrally what each one is doing. The teams get to have their own CI/CD pipeline. They get to do development in their own part of the organization. The question is, what are they running? Does Log4j affect them or not? So we had a lot of applications whose status we couldn't understand centrally.

And so everyone, whether they were COBOL or some other language that wouldn't have Log4j, had to say whether or not it affected them. So this is a challenge across a large organization, and it isn't necessary. There was work done that shouldn't have had to be done if we had a better central way of having the data managed, so we could pull it together and easily see the applications that might be affected by Log4j, and those teams could then report how they were doing on the Log4j remediation.

Thomas Lawless

So how do we plan to solve this problem or improve the environment for our developers? Well, we talk a lot about our software supply chain initiative, and really, we want to get to a place where our developers can just really focus on creating high-quality, secure application code. They don't have to reinvent the wheel with CI/CD, and we can automate as much of this process as possible.

And the way that we're currently thinking about this, we're thinking about this in terms of three major components. The first is the automation catalog. The automation catalog, its true intention there is to, one, drive a culture of contribution in our engineers to take the work that they're doing. They're already doing this work in their own CI/CD system. We have subject matter experts all across our organization that are experts in mainframe, they're experts in containers, they're experts in infrastructure, they're experts in SaaS tools. We have a ton of expertise in our organization. The intention of the automation catalog is to really harness their power, codify their knowledge, so that we can bring it out of their teams, out of their silos, and make it available to the whole organization.

The automation catalog will then be consumed by what we're calling this pipeline execution management component. And what we want to do there is we want to abstract away what is actually executing the pipeline. We want to decouple our repositories from our build system so that we can change the build system over time, or we can use a different build system based on the platform that we're targeting. And this will help us weather the storm of changes. Everything changes in our industry on a fairly regular basis. If we need to go back and touch our repositories, our tens of thousands of repositories, to make even a simple CI/CD change, that is thousands of hours that we're spending doing that.

Then the third component is what we're calling the developer data lake. So as our pipelines execute, we are collecting a tremendous amount of data, metrics data, evidence of our software supply chain security. And right now, our developers, they go to all these different tools. They go to their tool to look at code quality. They go to their tool to look at open source vulnerabilities. They go to a different tool to maybe look at the results of their CI process. The point of the developer data lake, the goal of the developer data lake is to start to bring all of that data into one place, so that we can provide a developer-centric experience on top of that data to improve their productivity.

But we can also look at automating some of our other compliance and audit readiness activities that we currently do from team to team, and we currently do sometimes manually, right? So the developer data lake, we're positioning to be able to automate that work in the future.

So to dive into the automation catalog a little bit more, as I said, we really want to create a culture of contribution around here. We want to leverage the expertise of our subject matter experts. And we're looking at the automation catalog as four major sub-components. The first is classification of the automation that we have, that we want to build, and we want to have teams contribute to us. We want to be able to understand what the purpose of this automation is, what platform it targets. Is it a script that runs on a mainframe? Is it container-based, is it a container image with automation in it? And we want to be able to classify that so we can then bring that together into pipeline stages and end-to-end pipelines in a more dynamic way than just hand-coding those pipelines with Jenkins or with Tekton.

In order to do that, we need to be able to support the publication of this automation. And the catalog may not actually store the automation, but it'll store the description of the automation so we know what it is, we know what it does, we know where the artifact is. And this will also give us the ability to start treating our automation like tiny little applications, right? That we need to be aware of what vulnerabilities they have in them, what their quality is, because as we move this automation out of the team level and move it to the organization level, a piece of bad automation has a much larger impact than it would at the team level.

So once we are able to enable our subject matter experts to publish their automation, the next step is being able to drive discovery of that automation. We need to make sure our developers know where to go, and they're able to easily find the automation that they want, and then we can assist them in configuring it for their repository.
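The classification, publication and discovery ideas described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the field names, the `Platform` and `Packaging` values, the `AutomationCatalog` class and the example artifact location are all hypothetical and are not taken from IBM's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Platform(Enum):
    ZOS = "z/OS"
    CONTAINER = "container"


class Packaging(Enum):
    SCRIPT = "script"
    CONTAINER_IMAGE = "container-image"


@dataclass(frozen=True)
class CatalogEntry:
    """Description of one piece of automation; the artifact itself lives elsewhere."""
    name: str
    purpose: str            # classification, e.g. "build", "scan", "deploy"
    platform: Platform      # where the automation runs
    packaging: Packaging    # how the automation is delivered
    artifact_location: str  # pointer to the artifact, not the artifact
    owner: str              # contributing team


class AutomationCatalog:
    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def publish(self, entry: CatalogEntry) -> None:
        """Publication: register a description so others can find it."""
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already published")
        self._entries[entry.name] = entry

    def find(self, *, purpose: str | None = None,
             platform: Platform | None = None) -> list[CatalogEntry]:
        """Discovery: filter entries by their classification."""
        return [
            entry for entry in self._entries.values()
            if (purpose is None or entry.purpose == purpose)
            and (platform is None or entry.platform == platform)
        ]


catalog = AutomationCatalog()
catalog.publish(CatalogEntry(
    name="image-scan",
    purpose="scan",
    platform=Platform.CONTAINER,
    packaging=Packaging.CONTAINER_IMAGE,
    artifact_location="registry.example.com/automation/image-scan:1.0",  # hypothetical
    owner="container-experts",
))
print([entry.name for entry in catalog.find(purpose="scan")])
```

Storing only a description and a pointer keeps the catalog independent of where each artifact is hosted, which matches the point that the catalog may not store the automation itself.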

And as I said before, the goal here is not to tie the repository directly to a pipeline execution service. The goal here is to provide a layer of abstraction, because what we define automation to be for our organization and how we run it are two different things. And that's where the transformation piece comes in. Our vision here is to be able to say our catalog has a description of automation that we control, that we define. But we should be able to transform that description into any declarative build system. So we should be able to transform that into a Tekton pipeline. We should be able to transform that into a Jenkins pipeline. We should be able to transform that into anything that we can create a declarative pipeline for.
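A minimal sketch of the transformation idea: one neutral description rendered into two declarative targets. The `Stage` type, the stage names and the commands are hypothetical; the Tekton output is a plain dictionary shaped like a Tekton `Pipeline` resource (tasks ordered with `runAfter`), and the Jenkins output is a declarative `Jenkinsfile` string. A real transformer would need to handle parameters, workspaces, credentials and quoting, none of which is covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str        # stage name, also used as the Tekton pipeline task name
    automation: str  # hypothetical catalog entry, used as the Tekton Task reference
    command: str     # shell command used in the Jenkins rendering (no single quotes)


def to_tekton(pipeline_name: str, stages: list[Stage]) -> dict:
    """Render the stages as a Tekton Pipeline resource, run sequentially."""
    tasks = []
    previous = None
    for stage in stages:
        task = {"name": stage.name, "taskRef": {"name": stage.automation}}
        if previous is not None:
            task["runAfter"] = [previous]
        tasks.append(task)
        previous = stage.name
    return {
        "apiVersion": "tekton.dev/v1",
        "kind": "Pipeline",
        "metadata": {"name": pipeline_name},
        "spec": {"tasks": tasks},
    }


def to_jenkinsfile(stages: list[Stage]) -> str:
    """Render the same stages as a declarative Jenkins pipeline."""
    lines = ["pipeline {", "    agent any", "    stages {"]
    for stage in stages:
        lines += [
            f"        stage('{stage.name}') {{",
            "            steps {",
            f"                sh '{stage.command}'",
            "            }",
            "        }",
        ]
    lines += ["    }", "}"]
    return "\n".join(lines)


STAGES = [
    Stage(name="build", automation="maven-build", command="mvn -B package"),
    Stage(name="scan", automation="image-scan", command="./scan.sh"),
]

print(to_tekton("demo", STAGES)["spec"]["tasks"])
print(to_jenkinsfile(STAGES))
```

Because the repository only refers to the neutral description, switching the engine means changing the renderer rather than touching every repository.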

Which brings us to our pipeline execution management component. And this component, we look at this as kind of a back-end component. The developers won't have direct exposure to this, per se. This is the glue. This is the glue that binds the catalog to the data lake, and this is what hides what is the engine that's actually executing the automation in the background.

Two major components here that we're thinking of. The first is integrations. Integrations would enable triggers, for example. So the most obvious one would be a trigger with our source code repository. So when a change is made to code, we can trigger a pipeline, right? Bare minimum there. But we have the ability, and we want to, again, make sure that our teams have the ability to contribute and enhance the platform for their needs. And platform integrations is a mechanism to do that, right? And we're architecting it in a way that it's open and contributable so they can add other integrations that meet their needs.
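One way an open, contributable integration mechanism could look is a registry that maps event types to handlers, so teams can add their own triggers. This is a sketch only; `IntegrationRegistry`, the `"push"` event type and the event fields are hypothetical names, not part of any IBM or GitHub API.

```python
from typing import Callable

Event = dict[str, str]
Handler = Callable[[Event], str]


class IntegrationRegistry:
    """Maps event types to the handlers that teams have contributed."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = {}

    def register(self, event_type: str) -> Callable[[Handler], Handler]:
        """Decorator that adds a handler for one event type."""
        def decorator(handler: Handler) -> Handler:
            self._handlers.setdefault(event_type, []).append(handler)
            return handler
        return decorator

    def dispatch(self, event_type: str, event: Event) -> list[str]:
        """Call every handler registered for the event type, in registration order."""
        return [handler(event) for handler in self._handlers.get(event_type, [])]


registry = IntegrationRegistry()


@registry.register("push")
def trigger_pipeline(event: Event) -> str:
    # The bare-minimum trigger: a code change queues a pipeline run.
    return f"pipeline queued for {event['repository']}@{event['commit']}"


print(registry.dispatch("push", {"repository": "payroll-api", "commit": "abc123"}))
```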

Then the second piece of this is pipeline execution orchestration. And this is what we wrap our execution engines with so we can come up with a common mechanism of how to execute a pipeline. Our intention here is to use Tekton as the primary workflow engine, and then in areas, in pipeline segments, pipeline stages that need to run on non-containerized environments, extend Tekton to execute that automation on a native platform. Maybe that native platform is z/OS. Maybe that native platform is macOS. It could be anything that doesn't run OpenShift today. And we can control the end-to-end execution of the pipeline, but we can run the individual stages and tasks of the pipeline where they need to run.
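The orchestration idea, one end-to-end pipeline whose stages run where they need to run, can be illustrated with a simple dispatcher. This is not how Tekton is extended in practice; the executor functions, platform labels and stage names are hypothetical stand-ins that only show the routing decision.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    platform: str  # hypothetical labels: "container", "z/OS", "macOS"


def run_in_container(stage: Stage) -> str:
    # Stand-in for a step the workflow engine runs as a container.
    return f"{stage.name}: ran as a container step"


def run_on_native_agent(stage: Stage) -> str:
    # Stand-in for handing the step to an agent on a non-containerized platform.
    return f"{stage.name}: delegated to a {stage.platform} agent"


EXECUTORS: dict[str, Callable[[Stage], str]] = {"container": run_in_container}


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order, routing each to the executor for its platform."""
    results = []
    for stage in stages:
        executor = EXECUTORS.get(stage.platform, run_on_native_agent)
        results.append(executor(stage))
    return results


STAGES = [
    Stage(name="unit-test", platform="container"),
    Stage(name="cobol-build", platform="z/OS"),
]

for line in run_pipeline(STAGES):
    print(line)
```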

And then finally, the third component, as we start to collect all of this data from pipeline execution, we have the ability to provide much deeper insight to our developers, to our engineering management, our engineering leadership, our executives on their applications, right? And we're really focused on the source code through deployment. And we're looking at four sub-components of the data lake. So the first one is data aggregation. We need to be able to go get this data and bring it all into one place. And that may be retrieving code quality from a tool like SonarQube. It may be retrieving vulnerability data from a tool that gives us feedback on open source libraries or container images. It may be things that are generated uniquely during pipeline execution, like unit test case results, not the coverage necessarily, but the actual results, or test cases that target a specific type of compliance that we need to record so that we are ready for our audit process. And the data aggregation component is positioned to be flexible to accept this data, to go get this data, so that we can get all of our data into one place, where then we can do analytics and reporting on it.
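A minimal sketch of data aggregation, assuming each source is wrapped in a collector that returns records in one common shape. The `Record` fields, the collector functions and the values they return are made up for illustration; real collectors would call the tools' own APIs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Record:
    repository: str
    source: str   # tool or pipeline step that produced the data
    metric: str
    value: float


Collector = Callable[[str], Iterable[Record]]


def code_quality_collector(repository: str) -> list[Record]:
    # Stand-in for retrieving results from a code quality tool.
    return [Record(repository, "code-quality", "code_smells", 12.0)]


def unit_test_collector(repository: str) -> list[Record]:
    # Stand-in for results generated during pipeline execution.
    return [
        Record(repository, "pipeline", "unit_tests_passed", 148.0),
        Record(repository, "pipeline", "unit_tests_failed", 2.0),
    ]


def aggregate(repositories: list[str], collectors: list[Collector]) -> list[Record]:
    """Gather every collector's records for every repository into one list."""
    return [
        record
        for repository in repositories
        for collector in collectors
        for record in collector(repository)
    ]


records = aggregate(["payroll-api"], [code_quality_collector, unit_test_collector])
print(len(records))  # 3
```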

And analytics and reporting is meant to provide a mechanism of exploring this data for the developer, but also for the non-developer role. If we have a cybersecurity expert that wants to go determine what repositories are using Log4j of a certain version, a vulnerable version, we should be able to do that from here because we have the complete inventory. If a compliance expert wants to come in and see the results of our compliance testing from a specific application version, they should be able to come in and do that. For the developer, we want to create a specific web-centric experience for the developer.
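With a complete dependency inventory in one place, the Log4j question becomes a query rather than an investigation. The sketch below assumes a flat inventory of `(repository, package, version)` rows with purely numeric dotted versions; the repository names are invented, and the `2.15.0` cutoff is only an illustrative parameter, so the real affected range should be taken from the relevant security advisory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    repository: str
    package: str
    version: str  # assumed to be purely numeric, e.g. "2.14.1"


def parse_version(version: str) -> tuple[int, ...]:
    """Turn "2.14.1" into (2, 14, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def repositories_using(inventory: list[Dependency], package: str, below: str) -> list[str]:
    """Repositories that depend on `package` at a version lower than `below`."""
    cutoff = parse_version(below)
    return sorted({
        dep.repository
        for dep in inventory
        if dep.package == package and parse_version(dep.version) < cutoff
    })


# Hypothetical inventory rows.
INVENTORY = [
    Dependency("payroll-api", "log4j-core", "2.14.1"),
    Dependency("hr-portal", "log4j-core", "2.17.1"),
    Dependency("sales-ui", "commons-lang3", "3.12.0"),
]

print(repositories_using(INVENTORY, "log4j-core", below="2.15.0"))  # ['payroll-api']
```

Repositories with no matching dependency, such as a COBOL application, simply never appear in the result, so their teams would not need to report anything.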

Right now, like I said before, our developers currently go to all of these different tools to view the health information about their application. Not the operational health, but how well their repository is configured, what their code quality looks like, their open source vulnerabilities and inventory licensing information. We want to bring all of that information into one place, and we want to provide them a tailored experience so that they can get that data from this single location.

And we've talked to our developers about this because we want to make sure we're giving them what they want and what they need. And they are asking for this. They're asking for our help in this area because they are feeling overwhelmed with all of the work that they're being asked to do beyond their day job of coding.

Finally, with all of the data that we have, we want to get to a place where we can use pipeline gates to enable organizational and team-level policies. So a team-level policy, the best example there is we don't merge code that doesn't have 80% test case coverage. So that one is the one that probably most people will relate to. From an organization point of view, maybe we create a policy more along the lines of, you added a new dependency to your application. This is the first time we're building it. That new dependency has a known vulnerability in it that you didn't catch during your process. This is the guardrail. This is the guardrail that says that breaks our organizational policy. We don't allow known vulnerabilities to go in production on a first push, right? If it's discovered in production, that's different. Then that needs to be fixed, and we have policies to do that. But we can put guardrails in place to help teams stay secure and not miss things along the way. We can automate a lot of that.
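The two example policies can be sketched as gate functions evaluated against facts collected for a build. The 80% threshold comes from the talk; the `BuildFacts` fields, function names and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BuildFacts:
    test_coverage: float                 # percent, 0 to 100
    new_dependencies: tuple[str, ...]    # dependencies added in this change
    vulnerable_packages: frozenset[str]  # packages with a known vulnerability


Gate = Callable[[BuildFacts], list[str]]


def coverage_gate(facts: BuildFacts, minimum: float = 80.0) -> list[str]:
    """Team-level policy: do not merge code below the coverage threshold."""
    if facts.test_coverage < minimum:
        return [f"coverage {facts.test_coverage:.0f}% is below {minimum:.0f}%"]
    return []


def new_dependency_gate(facts: BuildFacts) -> list[str]:
    """Organization-level policy: no known vulnerabilities in newly added dependencies."""
    return [
        f"new dependency {name} has a known vulnerability"
        for name in facts.new_dependencies
        if name in facts.vulnerable_packages
    ]


def evaluate(facts: BuildFacts, gates: list[Gate]) -> list[str]:
    """Collect violations from every gate; an empty list means the build may proceed."""
    violations: list[str] = []
    for gate in gates:
        violations.extend(gate(facts))
    return violations


facts = BuildFacts(
    test_coverage=72.0,
    new_dependencies=("log4j-core",),
    vulnerable_packages=frozenset({"log4j-core"}),
)
for violation in evaluate(facts, [coverage_gate, new_dependency_gate]):
    print(violation)
```

Keeping each policy as a separate function lets a team add its own gates alongside the organizational ones without changing the evaluator.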

So finally, thank you. And Rosalind, I'll turn it back over to you.

Rosalind Radcliffe

So hopefully this has been helpful. We've talked a little bit about what we're doing. We're really trying to make the developer experience better in a very large organization in which we have a variety of application types. As you all know me, or if you've seen me before, I'm normally talking about mainframe pipelines. And so this is a larger role, a broader role, and when we think about this experience across an entire organization, distributed and Z, we're trying to make it simpler and easier for the developers so they don't have to do those mundane tasks.

If we think about DevOps, we're talking about automation, making it easier, taking those mundane tasks away. But then we've made all of the teams do this work, and we want to reduce that load on all of the teams.

What we want help with is understanding what you're doing. How are you scaling? How are you providing that visibility to security issues? How are you dealing with those kinds of things, and what is your experience when it comes to centralized management or standardization of CI/CD within your large organizations? We're starting down this journey with just the CIO office, but logically, we've got a whole bunch of developers across all of IBM. So we want to understand what others are doing. What challenges have you faced, and what have you learned? What hasn't worked and what has worked? We'd love to hear from you as part of the breakouts or on Slack. Just reach out to us. Thank you very much, and have a great rest of your conference.