DevSecOps - The Broken or Blurred Lines of Defense

US 2021

John Willis

Senior Director, Global Transformation Office · Red Hat

A classic model for risk management and control is something called “The Three Lines of Defense (3LoD).”

The three lines are as follows:

Line 1: Risk Owners - Front line staff and operational management

Line 2: Risk Oversight - Risk management and compliance functions

Line 3: Risk Assurance - Internal audit

However, with the advent of modern sociotechnical systems like Agile, Cloud Native, and Event-Driven architectures, these legacy lines (3LoD) are at best blurred and at worst completely broken. With the modern patterns and practices of DevOps and DevSecOps, it’s not clear who the front-line owners are anymore. Risk management and organizational compliance teams struggle to adapt to new cloud-native models such as ephemeral containers, functions, microservices, and event-driven architectures. Most organizations' internal audit processes today are toil-heavy and have very low efficacy. This is something I have called, in previous presentations, “Security and Compliance Theater.”

In this presentation, we are going to look at a couple of case studies that include the good, the bad, and the ugly when it comes to 3LoD. Primary topics covered will be organizational design, DevSecOps, and Automated Governance.

Full transcript

John Willis

Hello everybody. I'm John Willis. I'm a Senior Director of the Global Transformation Office at Red Hat. This presentation is called Security Differently.

I have to put in my shameless plug: I started a podcast this year. I've had a passion for Dr. Deming for years, and doing these podcasts has been amazing for me. I talk to people inside and outside our community. I just did one with someone on the history of autonomous vehicles, which goes all the way back to General Motors and motorsports. It's really cool, and it will be up in a couple of weeks. Anyway, check it out.

So, Security Differently. Here's how I came to the topic. I was involved in the early days of the DevOps movement, and then, give or take five years later, probably a little longer, the DevSecOps movement started taking hold. I was in on that very early. I helped Sonatype build the first DevSecOps Days, modeled after DevOpsDays, so I got thrown in the lake early on.

But at some point, probably last year, I started thinking: the horizontal breadth of security is enormous, so if we just talk about DevSecOps in broad strokes, are we accomplishing a whole lot? Personally, I felt like I wasn't really being intellectually honest with myself.

So I started asking myself a very meta question, and now I'm posing it to you: what would DevSecOps look like if DevOps never existed? Let me be clear: DevOps has been phenomenal. With DevSecOps, we accomplished quite a bit, automating a lot of the security controls in the pipelines. But I wonder whether, even with all that great advancement and opportunity, we've been trying to force a square peg into a round hole.

What if, in some strange historical-fiction version, the innovation that became DevOps had started as a security innovation? Would we have made all the same choices, as good as those choices were?

The example that probably got me thinking about it the most: I would visit companies, and it would start with, "John, do you want to take a look at our DevSecOps reference architecture?" We'd look at it, and you would see really good stuff. They had automated things like SAST and DAST, or they had software composition analysis built into the pipeline. Really, really good stuff.

Then, in a deeper conversation, we would talk about audit, and as an outsider looking in, I would find a disconnect. They would say, proudly, "Yeah, we're using the three lines of defense model." I'd say, "Well, isn't that basically three silos?" In fact, there are virtual firewalls designed to keep the lines separate. The third line never talks to the first line. The second line is a translator, which, by the way, in most organizations does a terrible job.

Why isn't the third line showing up in design and requirements? The model also has this pattern of: if the first line doesn't catch it, the second line will. It immediately reminded me of the original caricature of DevOps, which Andrew Clay Shafer, who I work with now, presented in 2009 at the O'Reilly Velocity conference. That was the same conference where John Allspaw gave the landmark "10+ deploys a day at Flickr" talk. That same day, Andrew gave a presentation called Agile Infrastructure. Andrew's wasn't recorded, so it didn't become a manifesto. Both were great presentations, but in his, Andrew was the first to describe the wall of confusion, a beautiful caricature of the problem.

The developer says, "I want to change this." Operations says, "No, we can't do anything." They throw it back and forth. When I think about three lines of defense, I also think about Conway's Law, not from the perspective of monoliths to microservices, but in terms of organizational design. Melvin Conway's adage states that organizations design systems that mirror their own communication structures. That isn't only about how you program. It was probably Conway's original intent, but it extends to organizational design.

So then I look at three lines of defense, and this isn't a popular question to ask in a large bank; I feel like I'm going to get chased out with pitchforks, swords, sticks, or whatever. But have we just created two walls of confusion by design? Think of everything we've accomplished in DevOps, and I would even say DevSecOps. If you're adhering to this model (and there's now an updated version, the Three Lines Model, from the Institute of Internal Auditors), it is siloed by design. That's exactly what we spent ten years solving with DevOps, so isn't this our square peg in a round hole?

Over the years, after I left Docker, I hung out my own shingle and started doing something I call qualitative data analysis for transformation. That sounds like fancy nonsense, but the qualitative data analysis wasn't nonsense. Part of it was classifying recurring patterns, and I came up with seven that I call the Seven Deadly Sins because it sounds cool. Ultimately, these were the patterns you saw everywhere, and I've done full presentations on them. What I didn't realize at the time was that they all rolled up into the seventh deadly sin: security and compliance theater.

What I found was that even in the most advanced DevOps and DevSecOps organizations, I would end up sitting in front of the CIO, who was the person I delivered the final report to. In some companies, I interviewed 400 people over a whole summer. And you have to tell them. It's a dangerous proposition to sit six feet away from a CIO behind a fancy wooden desk and say, "By the way, your security audit posture is just theater."

Once you get into the cloud-native world, there is simply no connection between what the auditors ask you to do and what you save as evidence for it. I'll break that down a little further.

If we look at how we do things today, and how we probably should do them, i.e., security differently, there are really three questions. How do we prove that we're safe? How do we demonstrate that we're safe? Those are not the same thing. And how do we do both? We have to do both; we can't just do one.

I use the phrase post-cloud-native modernization, or post-cloud-native world, and I keep saying it until somebody punches me in the nose for saying it. I haven't been punched in the nose virtually or physically yet, so I'm going to keep using it. To me, it's a line in the sand: the world has changed.

In a pre-cloud-native world, how do we prove we're safe? In general, we use ServiceNow. We use change records. Bob tells Sue, who tells Sally, who tells Sam, and Sam says, "You can't go into production unless you do this." We create this telephone game of abstract, subjective knowledge related to incredibly complex operating environments and systems.

And how do we demonstrate it? We audit. That's the 30 or 40 days of toil: email exchanges, screen prints, and long discussions about the third line not understanding the first line, or the second line not being able to explain the difference between a CMDB and an access log, and why they appear to do the same thing but really have different intentions.

What I think Security Differently is about is moving from implicit security models to explicit proof models. More specifically, how do we change subjective to objective? Today, people create a narrative about the change and its success, maybe with links to some logs, or a checkbox saying, "We did these six things." That completely disconnected model creates most of the low efficacy and high toil of most audits. The alternative is a model where the evidence is objective, digitally signed, and generated by a machine, with no human involved in producing it.

By the way, even that is not good enough, though it sounds like fiction; I do know a couple of banks that are doing it right now. It's not good enough because we know complex systems drift, so we have to constantly verify that the things we stated in the attestation data, or the configurations that drove that data, are still accurate. Are they accurate after one hour, one day, one week, one month?
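
A minimal sketch of that kind of continuous check, assuming the pipeline records a SHA-256 digest of the deployed configuration as part of its evidence and a scheduled job later re-reads the live configuration. The function names and configuration fields are hypothetical.

```python
import hashlib
import json


def config_digest(config: dict) -> str:
    """SHA-256 over canonical JSON, so dictionary key order does not change the digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drifted(attested_digest: str, live_config: dict) -> bool:
    """True if the live configuration no longer matches what the pipeline attested."""
    return config_digest(live_config) != attested_digest


# At deploy time, the pipeline stores this digest alongside its other evidence.
deployed = {"service": "payments", "open_ports": [443], "tls": True}
attested = config_digest(deployed)

# Later (every hour, day, week...), a verifier re-reads the live config and compares.
live = {"service": "payments", "open_ports": [443, 8080], "tls": True}
print(has_drifted(attested, deployed))  # False
print(has_drifted(attested, live))      # True
```

A digest only says *that* something drifted; a real verifier would also diff the fields to report *what* changed.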

That calls for chaos injection with a security mindset, which is more than classic chaos engineering; it's not just spiking CPU. If a configuration manifest that made it through the pipeline says a port should never be open, then let's open that port in live production and see whether we catch it. That kind of thing.
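
The experiment can be sketched against a simulated host, assuming an in-memory model in place of real infrastructure; `Host`, `verify_policy`, and `run_port_experiment` are hypothetical names. In a real environment, the injection would be an actual firewall or security-group change, and the check would be whatever verification tooling runs in production.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    open_ports: set = field(default_factory=set)


def verify_policy(host: Host, allowed_ports: set) -> set:
    """Continuous-verification check: ports open on the host that policy forbids."""
    return host.open_ports - allowed_ports


def run_port_experiment(host: Host, allowed_ports: set, port: int) -> bool:
    """Open a forbidden port, check whether verification notices, then roll back.

    Returns True if the injected fault was detected.
    """
    if verify_policy(host, allowed_ports):
        raise RuntimeError(f"{host.name} is already out of policy; fix that first")
    host.open_ports.add(port)  # inject the fault
    try:
        return port in verify_policy(host, allowed_ports)
    finally:
        host.open_ports.discard(port)  # always undo the injection


web = Host("web-1", {443})
print(run_port_experiment(web, allowed_ports={443}, port=22))  # True
print(web.open_ports)  # {443}
```

The `finally` block matters: a security chaos experiment should leave production exactly as it found it, whether or not the check fires.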

As I said earlier, I was trying to get my head around what is really meaningful. More than anything else, I consider myself a student of this industry. If I'm going to talk about post-cloud-native security and DevSecOps, I don't want to talk about everything across the horizontal breadth of security. In conversations and working groups, it boiled down to three things. I'm not saying everything else isn't important; I'm saying I can't think of anything more important than these three right now.

We landed on risk, defense, and trust. On a grid from subjective to objective and verifiable, risk today in most organizations is subjective. The root evidence of our audit is the change record, and it's subjective. How do we get that into an automation engine? We've been calling it DevOps automated governance: a system that creates digitally signed evidence of attestations and controls (think blockchain, don't use blockchain), and then uses continuous verification, or security chaos, to make sure that nothing actually got into the system with those artifacts or configurations, even though theoretically nothing should have.

Then defense. We're moving from classic detect and respond, or in some cases sophisticated remediation systems, to everyone trying to build a cyber data lake. That's great, but later I'll tell you about a project I'm working on where we're using what we call the decorator model. Just throwing every security event or security orchestration output into Elasticsearch or some cyber data lake still leaves a lot of work to do if you haven't worked on the front end. It's a shift left of the context and the metadata.

One of my favorite mentors, Shannon Lietz, a brilliant woman who has spoken to us many times, talks about adversary analysis. That goes beyond objective, correlated cyber data lake data, which I think is table stakes in a post-cloud-native world. She talks about adversary retention rate and analysis of adversaries. This is the verification: I can say I'm doing all this great stuff from a defense perspective, but the true test is whether I know which adversaries are coming, when they're coming, and how long they're staying. Is anything I'm doing deterring them from coming and staying?
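
The talk doesn't define adversary retention rate precisely. As an assumption, here is one simple reading: dwell time per adversary is the last sighting minus the first, and the retention rate is the share of adversaries that stayed longer than a chosen threshold. The names and sightings below are hypothetical.

```python
from datetime import datetime, timedelta


def dwell_times(sightings):
    """Map each adversary to (last sighting - first sighting).

    `sightings` is an iterable of (adversary_id, datetime) pairs, for example
    from correlated detections in a cyber data lake.
    """
    first, last = {}, {}
    for adversary, ts in sightings:
        first[adversary] = min(first.get(adversary, ts), ts)
        last[adversary] = max(last.get(adversary, ts), ts)
    return {adv: last[adv] - first[adv] for adv in first}


def retention_rate(dwell, threshold):
    """Share of observed adversaries that stayed longer than `threshold`."""
    if not dwell:
        return 0.0
    return sum(d > threshold for d in dwell.values()) / len(dwell)


t0 = datetime(2021, 1, 1)
sightings = [
    ("actor-a", t0), ("actor-a", t0 + timedelta(days=40)),
    ("actor-b", t0), ("actor-b", t0 + timedelta(days=2)),
]
dwell = dwell_times(sightings)
print(dwell["actor-a"].days)                      # 40
print(retention_rate(dwell, timedelta(days=30)))  # 0.5
```

Tracked over time, a falling rate would be one signal that defenses are deterring adversaries from staying.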

Last but not least, and I probably won't have much time for this, I want to share my ideas about the third primitive: trust. Think about what software-defined networking did for networking. At the highest level, it changed the mindset from north-south network transactions to east-west. The world had changed, and there was a lot of cross-rack communication. Thinking about traffic as east-west was a big part of what started the SDN conversation.

I think we should start a conversation about thinking about trust the same way: moving from north-south trust (you are authenticated, so you're in) to ephemeral, cluster-based, selective node-to-node constructs where, when you're in that cluster or pod, you're automatically authenticated. There are technologies that can do that very well. That includes secrets management. I think Vault is a great product, and many would argue with me, but to me it is still north-south-ish. It's the best game, maybe the only game, in town right now, but imagine secrets management following that east-west pattern, very much like authentication.

So, three novel ideas: DevOps automated governance, pragmatic cyber data lakes (a project I'm doing with ONUG), and distributed trust models for identity and secrets.

Risk: what are we trying to do? We're trying to reduce audit toil and increase audit efficacy. It is not good enough to say we're doing all these things in our DevSecOps reference architecture. The real question is: what is the efficacy of your audit? And I mean honest efficacy. When I sit in a CIO's office and tell them the reality of what's going on, audit is where they get furious. They'll accept all the other Seven Deadly Sins, but on audit they say, "John, I'm going to have to disagree with you here. We've won awards. Ernst & Young, or one of the big three or five, said our audit is the best in finance."

Sure. But I just talked to a team that is doing about 300 deploys a day, all on Amazon to DynamoDB, using ephemeral pipeline-as-code infrastructure. I will tell you right now, there is zero evidence that relates to your audit. That's not even getting into Kubernetes, containers, OpenShift, serverless, and functions. The more technology we expand on, the more of these long-winded conversations we have at audit time.

A common adage I hear, and that I use back on the CIO, is: "We don't tell auditors things they don't already know." In effect, that says we don't care about protecting the brand. If protecting the brand isn't the most important thing at a Fortune-level company, I don't know what is. Yet most organizations flippantly say, "We'll tell them about Kubernetes next year."

On efficacy and control, there have been a number of publications over the years from IT Revolution. In 2015, there was a paper, An Unlikely Union: DevOps and Audit, arguing that DevOps could actually be the solution for segregation of duties, also called separation of duties. In 2018, there was a funny, tongue-in-cheek apology letter to auditors. These are all Creative Commons on the IT Revolution website. It wasn't just the two-page apology letter; it came with another 30 pages of control regs and promises that we're going to do this, this, and this.

In 2019, I worked with PNC, Capital One, Marriott, and Sabre Group to put this objective, referenceable architecture on paper. I'm very proud of the work we did there, and you can download it. It is the foundation for everything I've worked on in the last two years with a couple of large companies. It is organized into seven stages, and what we're trying to do is create boundary points for gating controls and generating attestation evidence.

The original 2019 publication has probably 75 attestations, well documented: where each works, where it might come from, and why it matters for some control reg. Things people are doing with this model include percentage test coverage, change size, cyclomatic complexity, whether there was a review on the pull request, and optimum branching strategy. These should be both gated and recorded in an immutable evidence list for that deployment.
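
A minimal sketch of such a gate, assuming the pipeline has already collected these facts about a change. The control names and thresholds are hypothetical, and cyclomatic complexity is left out for brevity. Each control yields an attestation whether it passes or fails, so the evidence list records the decision as well as the outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeFacts:
    """Facts a pipeline can collect automatically for one change."""
    test_coverage: float  # percent, 0-100
    lines_changed: int
    pr_reviewed: bool
    branch: str


# Hypothetical thresholds; risk and engineering would agree on the real values.
POLICY = {
    "min_test_coverage": 80.0,
    "max_lines_changed": 400,
    "allowed_branches": ("main",),
    "allowed_branch_prefixes": ("release/",),
}


def evaluate(facts: ChangeFacts, policy: dict = POLICY):
    """Gate a change; return (passed, attestations), one attestation per control."""
    branch_ok = (facts.branch in policy["allowed_branches"]
                 or facts.branch.startswith(policy["allowed_branch_prefixes"]))
    attestations = [
        {"control": "test_coverage", "value": facts.test_coverage,
         "passed": facts.test_coverage >= policy["min_test_coverage"]},
        {"control": "change_size", "value": facts.lines_changed,
         "passed": facts.lines_changed <= policy["max_lines_changed"]},
        {"control": "pr_review", "value": facts.pr_reviewed,
         "passed": facts.pr_reviewed},
        {"control": "branching_strategy", "value": facts.branch,
         "passed": branch_ok},
    ]
    return all(a["passed"] for a in attestations), attestations


ok, evidence = evaluate(ChangeFacts(86.5, 120, True, "release/1.4"))
print(ok)  # True
ok, evidence = evaluate(ChangeFacts(61.0, 120, True, "main"))
print([a["control"] for a in evidence if not a["passed"]])  # ['test_coverage']
```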

Imagine zero days in audit, because the evidence is an immutable digital signature, or a set of signatures representing everything that happened in that deployment, produced by the computer, not by humans. It can't be changed. Again, think blockchain, don't use blockchain. The evidence is a digitally signed event, and there is no debate. It is there, unless you dispute the engine that produced it.
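
"Think blockchain, don't use blockchain" can be sketched as a hash-chained, signed evidence log. This is an illustration under stated assumptions, not how any particular product stores attestations: the shared HMAC key stands in for a real asymmetric signing key, and the function names are hypothetical.

```python
import hashlib
import hmac
import json

# Assumption: a shared HMAC key stands in for a real asymmetric signing key
# (for example, one held in a KMS). Illustrative only.
DEMO_KEY = b"demo-pipeline-key"
GENESIS = "0" * 64


def _hash(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def _sign(digest: str, key: bytes) -> str:
    return hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()


def append(log: list, event: dict, key: bytes = DEMO_KEY) -> None:
    """Append an event, chained to the previous entry's hash and signed."""
    body = {"event": event, "prev": log[-1]["hash"] if log else GENESIS}
    digest = _hash(body)
    log.append({**body, "hash": digest, "sig": _sign(digest, key)})


def verify(log: list, key: bytes = DEMO_KEY) -> bool:
    """Recompute every link; editing or reordering entries breaks the chain."""
    prev = GENESIS
    for entry in log:
        digest = _hash({"event": entry["event"], "prev": entry["prev"]})
        if (entry["prev"] != prev or entry["hash"] != digest
                or not hmac.compare_digest(entry["sig"], _sign(digest, key))):
            return False
        prev = digest
    return True


log = []
append(log, {"stage": "build", "control": "sast", "passed": True})
append(log, {"stage": "package", "control": "code_signing", "passed": True})
print(verify(log))  # True
log[0]["event"]["passed"] = False  # someone "fixes" the evidence after the fact
print(verify(log))  # False
```

One limitation: dropping entries from the end still leaves a valid chain unless the latest hash is also anchored somewhere else, which is one reason transparency logs publish signed tree heads.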

In the early days of Chef, I was there really early, before beta. We had an alpha and no enterprise customers. A large financial institution, one of the top investment firms in the world, came to visit us, and we were like, "Oh my goodness. Why?" So we asked why they were there. They said they felt Chef would be useful for things like PCI DSS: if the engine was certifiable, and an automated engine was building the thing, it would save a lot of audit time compared with reviewing instructions or notes on how people built servers. They would accept that the Chef server is secure and that what it builds is the evidence itself. It elevated everything. This is an expansion on that idea.

In the build stage, you might have linting, SAST, and other checks. The package stage gets more interesting: artifact versioning, package metadata, code signing. A reasonably mature implementation of DevOps automated governance treats code signing as table stakes. Then come container image scanning, pre-prod controls, and prod controls.
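
One way to picture those stage boundaries is a table of required controls per stage, checked before promotion. The stage and control names below are a hypothetical subset drawn from this paragraph, with container image scanning placed in the package stage as an assumption; the 2019 publication describes roughly 75 attestations.

```python
# Hypothetical mapping of pipeline stages to the controls each must attest to.
# Pre-prod and prod controls would be filled in by the teams that own them.
REQUIRED_CONTROLS = {
    "build": {"linting", "sast"},
    "package": {"artifact_versioning", "package_metadata", "code_signing",
                "container_image_scan"},
}


def missing_controls(attested: dict) -> dict:
    """Per stage, the required controls with no attestation; empty means the gate passes."""
    gaps = {}
    for stage, required in REQUIRED_CONTROLS.items():
        missing = required - attested.get(stage, set())
        if missing:
            gaps[stage] = sorted(missing)
    return gaps


attested = {
    "build": {"linting", "sast"},
    "package": {"artifact_versioning", "package_metadata", "container_image_scan"},
}
print(missing_controls(attested))  # {'package': ['code_signing']}
```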

Over time, the thinking matured beyond the first paper toward how to get the third line and the first line to work together. We have to give them a collaborative way to create an artifact. It's not good enough to just bring risk into design and requirements, because you still have communication toil. They don't speak the same language, and even with the second line in between, there may be a lot of passive agreement not to disagree. The conversation gets elevated when both sides have to agree on a shared construct, or on code.

In the early days, people said, "There's no way risk is going to work with YAML." That's not true. YAML seems complicated, but honestly, almost anybody can read it. The artifact can be a collaboration, created together, between risk (how this will show evidence to audit) and the engineering team. Then there is very little confusion about why this evidence is proof, as opposed to a screen print or an email conversation.

At the beginning of this year, we decided to do a second version of the 2019 automated governance paper. We were going to put it out as a paper, but we felt the original was very technical, and probably a lot of people got past the third page and thought, "I don't have time for this." So we decided to write it as a fictional story about a bank that fails an audit and then goes through the process of implementing DevOps automated governance. It was so well received by IT Revolution and crew that they are turning it into a full book. It should come out sometime next year: a GRC book in the style of The Phoenix Project, with Erik and Bill, and Alex and Jonah from The Goal. It will be fun. There are eight of us on it now, eight authors.

I was invited into SolarWinds to talk to executives about automated governance. Someone from a large consulting organization knew I was an expert on the subject, so I went in and talked to the executives about why I thought automated governance would help SolarWinds. I took all the public information about the attack and its kill chain (CrowdStrike's analysis was by far the best) and used it to present the things CrowdStrike had publicly reported.

If you don't know, the attackers were able to sit in there for almost 18 months, sit on MSBuild, eventually hijack it, and then insert their Sunburst code, which went out to everybody, probably including some of you listening. What I was trying to say is that there would have been a lot of red flags if they had automated governance. First, use a pipeline-as-code model: something we've written at Red Hat called Ploigos, or in-toto and a couple of others. We talk about infrastructure as code and pipeline as code; make them ephemeral, changing and repaving constantly. If adversaries are trying to understand your Jenkinsfiles and those files stay static for six or eight months or a year, that's one thing. When an engine uses abstractions to generate the pipeline, it's a whole other game. Most likely, the adversary will just move on to somebody else.

There was a lot of masquerading, which an immutable attestation store would have exposed. Red Hat uses something called Sigstore, which builds a Merkle tree; it's an append-only, tamper-evident log, so entries cannot be changed. But the red flag that was most glaring to me in the CrowdStrike analysis, which I mapped onto the MITRE ATT&CK framework, was code signing. There were mismatched code signatures all over the logs. Catching that would have been table stakes for a base implementation of automated governance.
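
The basic mismatch check is simple to sketch. It assumes the signed digests have already been verified (checking the signature over them is out of scope here), and the artifact names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def signing_mismatches(artifacts: dict, signed_digests: dict) -> list:
    """Names of artifacts whose bytes differ from the digest recorded at signing
    time, or that have no signed digest at all."""
    return sorted(
        name for name, data in artifacts.items()
        if signed_digests.get(name) != sha256_hex(data)
    )


built = {"app.dll": b"original build output", "helper.dll": b"helper"}
signed = {name: sha256_hex(data) for name, data in built.items()}

# What actually ships: one artifact swapped after signing, one never signed.
shipped = {**built, "app.dll": b"tampered build output", "extra.dll": b"unsigned"}
print(signing_mismatches(built, signed))    # []
print(signing_mismatches(shipped, signed))  # ['app.dll', 'extra.dll']
```

Run as a gate at every stage boundary, a non-empty result stops promotion instead of becoming a line in a log that nobody reads.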

I've only touched on some of the tools; this is a quick presentation. If you want to learn more, absolutely ping me at Botchagalupe or jwills@redhat.com. We've done some internal work. Bill Bensing, one of the authors of the new paper, wrote an infrastructure system for the government called DeadSWORd, which ultimately turned into something called Ploigos. I could do a whole presentation on how well it fits. It's a great interface definition for how you want to build your pipelines, but it doesn't have an opinion on the implementation, which makes it the perfect tool for automated governance.

Sigstore is another one I'm spending a lot of time thinking about. This isn't a product presentation, but Red Hat does have strengths in OpenShift. Our Kubernetes distribution has a compliance operator, Advanced Cluster Management, and Advanced Cluster Security, which is basically StackRox, similar to a Twistlock or CoreSec solution.

Just winding down: that's a quick view of risk differently and the Security Differently story. Now, defense differently. It all comes down to reducing toil and increasing efficacy. Look at all the work we do today to react to our SIEMs and SOARs. What if we took a shift-left mindset to our cyber data lake constructs?

I've been working with a group called ONUG, one of the largest networking user groups. They were involved in SD-WAN and SDN. It's based primarily in New York, and some of the biggest large-cap banks, financial institutions, and healthcare companies are involved, including on the board of directors. I was invited in to look at DevOps automated governance from a cloud perspective. The first version of that work is Creative Commons. In the second version, we focused on shifting left on cyber data lakes.

There were two problems. The first was the original problem. We interviewed about 30 large-cap companies and tried to figure out what cloud security operations people, especially those dealing with multi-cloud, were facing. The common thing we heard was that it is very hard to establish the common context of events from Amazon, Google, and Azure that mean the same thing but are classified differently, even down to the text of the event. There is a lot of toil in figuring out that two events are the same thing.

The bigger problem we discovered was metadata normalization. When we interviewed these companies, it was almost comical how they gathered their metadata and where they combined it: Prisma, F5, spreadsheets, and so on. Even then, it was difficult. In a multi-cloud environment, take some of the best-known metadata fields from a cloud provider, such as account, resource, event name, and event type: Azure, Amazon, Google, IBM, and Oracle all use different tags for them.

If you dump all of that to the far right, into a cyber data lake, without working on context normalization and metadata normalization, advanced data lake tools can still do the correlation, but your accuracy will be lower and the time it takes will be higher. Why don't we work on context normalization earlier? That's where we came up with the idea we call a decorator.
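
A Python function decorator happens to be a handy way to sketch the idea: wrap each event source so that normalized metadata is attached at the point of emission, before anything reaches the data lake. This analogy is ours, not the ONUG project's SDK. The provider field names are simplified assumptions to check against each provider's real log schema, and the event values are placeholders.

```python
import functools

# Illustrative mapping from provider-specific keys to one common schema.
# The source keys are simplified assumptions; check each provider's real log schema.
FIELD_MAP = {
    "aws": {"account": "recipientAccountId", "event_name": "eventName",
            "event_type": "eventType"},
    "azure": {"account": "subscriptionId", "event_name": "operationName",
              "event_type": "category"},
}


def normalized(provider: str):
    """Decorator: wrap an event source so every event it yields carries common metadata."""
    mapping = FIELD_MAP[provider]

    def wrap(source):
        @functools.wraps(source)
        def emit(*args, **kwargs):
            for raw in source(*args, **kwargs):
                common = {name: raw.get(key) for name, key in mapping.items()}
                yield {"provider": provider, **common, "raw": raw}
        return emit
    return wrap


@normalized("aws")
def aws_events():
    yield {"recipientAccountId": "111122223333", "eventName": "ExampleEvent",
           "eventType": "ExampleType"}


@normalized("azure")
def azure_events():
    yield {"subscriptionId": "sub-0001", "operationName": "ExampleOperation",
           "category": "ExampleCategory"}


for event in [*aws_events(), *azure_events()]:
    print(event["provider"], event["account"], event["event_name"])
```

Downstream correlation can then query `account` or `event_name` without knowing which cloud an event came from, while `raw` keeps the original record for audit.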

Check out ONUG and look at this automated cloud project. It's really cool and it is all going to be open sourced. The idea is to create these decorators. This year we've had Microsoft, Google, IBM, and Oracle working on this project. We're still trying to recruit Amazon. Some large companies, Cigna and some large-cap banks, are also involved. One of the things we're producing soon is an open source repository with an SDK to start the conversation. It's very exciting stuff.

Finally, trust differently. Looking at that third primitive, zero trust architecture is certainly table stakes. Modern least privilege, absolutely. We need to think about secrets management. I think Vault today is an amazing, almost must-have product. But I'm always looking at the future: what are the problems we have to solve next?

When I interviewed at Red Hat, I asked Jim Whitehurst, "Why do you want to hire me?" He said, "You, Kevin, Andrew, and the gang were deeply involved in the conversation that created the first ten years of DevOps. John, we point to your books. But we want you here for the next ten years, to help Red Hat define it." That was the sell, and great leaders are great sellers. I didn't see myself working for a big company, even with Andrew, Kevin, and Jay. But at the end of the day, I thought, "Okay, you got me. Yes, I'm in."

I used to joke that if the hot topic in 2025 is still GitOps, then we failed miserably. I'm not saying I'm special; I just hate the status quo, and I always have, in everything I do.

This East-West trust idea comes from the fact that most of our workloads are moving into clusters, pods, and ephemeral, more atomic compute structures: container images running as containers, functions, and event-driven architectures. I think we see a lot of that bounded by the service mesh structure. Today most of it is Istio and Envoy, but those are not the only tools.

In that world, there are interesting conversations about what I call East-West trust models. There is node to node. A cluster of nodes might represent a particular trust domain. Because you have things like Istio or service mesh traffic management, there is clever stuff we should probably take advantage of.

In the end, trust comes back to the age-old adage from DevOps: can you throw the server out the seventh-floor window and still operate? There is nothing new about repave and rotate, but it is an important discussion in security. If adversaries have 18 months, or a couple of years, or in the Marriott scenario maybe four years, to sit and learn, they're going to get you. But if we're dynamically throwing them off guard by repaving and rotating, that changes the equation.

Zero trust architectures matter. Emerging tech matters. I'm a big fan of SPIFFE/SPIRE. Andres Vega, one of the committers on that project, is another author of Investments Unlimited. I think this is promising ground for what I call East-West trust.
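
SPIFFE expresses workload identity as a URI of the form `spiffe://<trust-domain>/<path>`, which maps naturally onto treating a cluster as a trust domain. Here is a minimal sketch of an east-west policy decision, assuming identities were already authenticated (for example, by SPIRE-issued X.509 SVIDs over mutual TLS). The IDs and allowlist are hypothetical, and this is not the SPIFFE or SPIRE API.

```python
from urllib.parse import urlparse


def trust_domain(spiffe_id: str) -> str:
    """Return the trust domain of a SPIFFE ID (spiffe://<trust-domain>/<path>)."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    return parsed.netloc


def allow_call(caller_id: str, callee_id: str, allowed_callers: set) -> bool:
    """East-west authorization: same trust domain and on the callee's allowlist.

    Assumes both IDs were already proven, e.g. by X.509 SVIDs over mutual TLS;
    this function only makes the policy decision.
    """
    return (trust_domain(caller_id) == trust_domain(callee_id)
            and caller_id in allowed_callers)


payments = "spiffe://prod.example.org/ns/payments/sa/api"
checkout = "spiffe://prod.example.org/ns/checkout/sa/web"
outsider = "spiffe://dev.example.org/ns/checkout/sa/web"
print(allow_call(checkout, payments, {checkout}))  # True
print(allow_call(outsider, payments, {checkout}))  # False
```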

Sigstore is really interesting. It's a project Red Hat helped develop, built on a Google project called Trillian, which was originally created for certificate transparency. It has a beautiful back-end Merkle tree structure that can be used for attestation stores. I also think Sigstore may have properties that support East-West trust models as well, but that conversation is very early.

Anyway, I think that's about it for me. Thank you so much. I'm jwills at Red Hat, or I can always be found at Botchagalupe on gmail.com, @botchagalupe on Twitter, or just Botchagalupe.