Securing DevOps: Where to Start and What to Measure

Europe 2022

How do we secure our DevOps processes? Why is shifting left important? How do we get developers to care about security and empower them to make a difference? Where do we start and what do we measure?

Often in software development we operate in silos. Different tribes have different priorities and lexicons. How do we break down these preexisting silos and continue innovating and optimising our software development process? Shifting left can help to break down silos and empower developers to take a security-first approach. Measuring DevOps can be hard, but DORA metrics can help you to become an Elite performer.

Join this session to find out more about these topics and, importantly, where to start and what to measure when it comes to securing DevOps.

Full transcript

The complete talk, organized by section.

Stefania Chaplin

Good morning, good afternoon, good evening. Thank you so much for joining my session, Securing DevOps: Where to Start and What to Measure. My name is Stefania Chaplin, aka devstefops. You can find me on my different channels at the bottom. I'm also a Solutions Architect at GitLab.

My agenda, I always keep it a little bit vague. I'm going to introduce myself, talk a little bit about what we're here to talk about: DevSecOps, common pain points, who our audience is, a little bit about organizational culture, how we can secure DevOps, shifting left, what to measure, why we're here, a summary, and you can also find me for the Q&A in Slack.

Who am I? I used to be a developer: Python, Java, REST APIs. I was one of those cool people who, when you hand them a JSON document and you're looking for a specific value, can give you the command with all of the square and squiggly brackets to get you what you need. Then I moved into the wonderful world of security, focusing on DevSecOps, application security, software composition analysis, and cloud security as well. And now I work at GitLab, the DevOps platform for end-to-end DevOps.

Outside of work, I am really into surfing. I'm also very much into yoga, the whole vegan lifestyle, and I really like tropical plants. You can't see them, they're just over there, but I have about 30 plants taking over my dining room table. I've got one here as well.

So enough about me. What are we here to talk about? DevSecOps. We have our developers. They are usually involved in creating features: the coding, doing builds, and running tests, unit tests, integration, et cetera.

Then we have security. One of the common myths is that DevSecOps is putting security in the middle. For example, once we've done our unit tests, maybe we might do some static code analysis. Okay, we ticked the security box, and now we hand it over to ops. It doesn't quite work like that. DevSecOps is about having security embedded at every stage, and you want to shift left as far as possible.

We also have ops. They're responsible for release, deployment, having it in production, and also monitoring. I always say, in my humble opinion, incident response is one of the most important teams. Because unfortunately, on a personal or on a professional level, we are probably all going to get hacked one day. So it's about how quickly can you notice, how quickly can you recover. Mean time to remediation is a very important metric I'll be talking about later.

I'm going to talk a bit about security. OWASP is the very famous Open Web Application Security Project. I always forget what the P is. But anyway, they publish a top 10 of vulnerabilities, and they update it every few years. With the latest update at the end of last year, there was a new category which I personally am very excited about: insecure design. This is very much the shift-left mindset. To quote Deming, "You can't inspect quality into a product. You have to build quality throughout." When you look at some of the other OWASP top 10 entries, for example number 10, server-side request forgery, SSRF, that's a very specific vulnerability. It has a type, SSRF. It has a CWE, a Common Weakness Enumeration entry, which is the weakness type associated with it. So it's a one-to-one mapping. Insecure design has over 40 CWEs. It's the first time we're really seeing this shift-left approach in our vulnerabilities.
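The contrast between a one-to-one and a one-to-many mapping can be sketched in a few lines of Python. This is only an illustration: the dictionary name is made up, and the CWE list for insecure design is a deliberately partial sample, not the full OWASP mapping.

```python
# Illustrative subset only: the CWE lists below are NOT the full OWASP mapping.
OWASP_2021_SAMPLE = {
    "A10 Server-Side Request Forgery": ["CWE-918"],
    # Partial sample of the weaknesses grouped under insecure design.
    "A04 Insecure Design": ["CWE-209", "CWE-256", "CWE-501", "CWE-522"],
}


def mapping_kind(cwes):
    """Describe whether a category maps to a single weakness or to many."""
    return "one-to-one" if len(cwes) == 1 else "one-to-many"


for category, cwes in OWASP_2021_SAMPLE.items():
    print(f"{category}: {mapping_kind(cwes)} ({', '.join(cwes)})")
```

A category that groups many weaknesses is harder to catch with a single scanner rule, which is one reason it pushes the conversation towards design time.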

Some of the common pain points that we can uncover: security is the bad guy, which we don't really want to be. We are all on the same team. We're just trying to stop us being hacked. Vulnerabilities, known and unknown, make it to production. Yes, you may have a huge list of red flashing errors from your static application security testing, but for example, are you doing software composition analysis? Do you know what's in your open source? Are you doing secret detection? Are you doing dynamic application security testing? There's a lot of different types of security testing, and if you're not thorough, then unknown vulnerabilities could be making it into production, which is not where you want to be. Because you end up with delays, fails, or worse. No one wants to be on the front page.

A few years ago, it was all about stealing data, stealing customer databases. Yes, that's bad, but what we're seeing since 2021 is a massive increase in ransomware. For example, you saw what happened with the Irish Health Service. The ethics, I'd say, have dropped. It's no longer your credit card, it's health. In the UK, they even went after one of my favorite childhood snacks. If anyone has ever tried Hula Hoops or McCoy's, KP Snacks got hacked a few months ago. It almost feels like no one is safe. It's really getting personal. So we really need to start securing our DevOps pipeline.

Who are we talking about? I'm going to introduce you to my silos. We have developers, and as developers, we like new and shiny, and we're interested. Our priority is features: okay, this is our application, this is what we need to get done in this sprint. But where does security come into all this? Because security often isn't a priority within said sprint.

And then there are the numbers. You've usually got 100 developers to one security person to 10 operations people. What you find is security is spread very, very thin, and security obviously wants to keep things secure. But even things like feedback cycles: once security finds a bug, maybe it goes into a ticketing system, then what happens? Can you map that to a specific commit? Is that going into production? How do you get the visibility?

And then you have ops. They like stability. With ops, there's always been a slight difference, even just in terms of the language that's used, the behavior, the KPIs. Between dev, sec, and ops, there is a bit of a mismatch, and you end up with these cultural silos.

So, what is culture? Westrum did some great work and helped to break this up. To summarize, first you've got power-oriented, which Westrum calls pathological. I think of this as a bit tyrannical. I like to look at the failure metrics. If something goes wrong, heads will roll. You will know, and it's very much a top-down approach. It can work well sometimes, in very small organizations, but I would not recommend the pathological style.

You have bureaucratic. This is more rule-oriented. This is also, if you've read about microservices, the whole project mindset versus product. You've got almost a waterfall approach, like these are the boxes I must tick. If there's a failure, it leads to justice. There's a process. Maybe the head doesn't roll, but there is a process involved.

And then you have generative. This is my favorite. This is where you get trust. This is where you can work together. This is where you get innovation. When it comes to failure, failure leads to inquiry. Failure is seen as a way of improving the system. You pivot, you learn, you move on. When, not if, because there's always going to be a failure at some point. When there is a failure, you do a postmortem. How can we make the system better? We have this cooperation and innovation within that.

You can also think about culture, and here, like I said, the yoga in me comes out, in terms of cognitive behavioral therapy. We have our beliefs and our values, and then they will shape our thoughts, and then our thoughts will become our actions. So if you have a pathological culture, a culture of fear, a heads-will-roll culture, guess what? If that's what you believe, if that's the way that the world is in your opinion, on a personal level, within your team, within your org, then what's going to happen? You're probably going to start hiding things. And when things get hidden, to quote Deming, "Whenever there is fear, you get the wrong numbers." If you have time, read a little bit about what happened to Nokia. I always thought it was because of the iPhone. Actually, it was to do with top-level management creating a culture of fear. Middle management hid stuff, and they ended up losing market dominance.

Have a think about your culture, because the culture of your organization and the culture of your team can be different. I've worked in a generative team in a pathological organization, and I've also worked in a bureaucratic team in a generative org. You can have all combinations, and it's really about where you want to be as an individual and in your team as well.

Because you don't want to end up like this. When I join a company, I join two Slack channels by default: engineering and gaming, because that's where all the fun stuff happens. When I was in the engineering channel, I saw this graph. It was very high, then a cliff face, then very low, and some very senior people had created and commented on this Slack post. The summary was: we've gone into Amazon and found a misconfigured EKS cluster, so Kubernetes, where something was spawning, deploying, falling over, spawn, fail. Sounds expensive, right? In this Slack post with this graph, it said, "We've fixed the configuration, and we have reduced our bill from $6,000 to $1,000." So it's not even a case of security. There are cost savings to be had. When we are in these silos, it's very hard to communicate. We want to have this generative culture where we're empowering people and we're working together.

So how are we going to do this? Make security fun and easy. I'm one of those really cool people that think security is fun. Sometimes I hear, I've overheard people say in group calls on customer site, "Well, we all know this isn't the funnest topic." I literally want to be like, "I disagree." But anyway, how can we make it fun?

Gamification. I was working with a German bank, and they showed a graph of developers. There was a big peak around age 18 to 35, and then it kind of tapers off. That was the graph of developers. Underneath they had a graph of video game players, and with the age, big peak, 18 to 35, and tapered off, and the two were almost identical. So is there a way you can add gamification to your security program?

Hackathons are really fun. You can put everyone on the red team, you can split red, blue. What you'll normally find is senior developers know about some skeletons in the closet, and the junior developers are like, "Oh my God, is it really that easy to hack?" Because that's exactly what happened to me, and I was like, "Oh my God, I need to write more secure code." But anyway, you can have hackathons. You can make things a bit more gamified.

Recognition is really important. It doesn't have to be, okay, you can do an Uber Eats voucher, or if you want to be nice, you could do an iPad. You don't even have to do that. You could have an email or a letter from the VP of engineering or the CISO being like, "Congratulations on winning the most points at our latest events." Or you could have a system where you're looking at change failure rate. How many of your changes are introducing vulnerabilities or causing more problems? And if someone consistently doesn't introduce vulnerabilities for a sprint, guess what? Maybe they get a lunch voucher or any other type of recognition. It's also about speaking to your developers. We can do a top-down approach, but it's about understanding what's important to them.

Also, automation. We want security to be part of the existing workflow. There's no value. I speak from experience. I was a developer. We like our tool sets. I'm happy in my IDE. Is there a way that we can add security into our pipelines, into our IDE? Is there a way that we can get results to developers whilst they're in their feature branch and empower them to be able to fix the vulnerabilities there and then? Because I would much rather, oh, I've just written 20 lines of code. I introduced a library. Oops, that was a bad library. Maybe it's vulnerable, maybe it has a bad license. At which point I'm like, "Oh, okay. Let me just save face. I'm just going to fix that." And then I merge my pretty branch that's nice and clean, and I have peace of mind that I am merging secure code, and fingers crossed, I won't need to go back and change it in a few months once the pen test results come back.

So shifting security left, what can that look like? I'm going to give you a little example. This is from GitLab. Here we have a pipeline. This is quite a special pipeline because this pipeline is part of my feature branch. I have created a feature branch, and I have just committed some code, and that is triggering this pipeline. So it's going to build. It's quite clever because I didn't actually tell it it was a JavaScript app. It figures out what it's building, and then it does a range of testing.

Most people are familiar with static code analysis. Logically, it makes sense. It's the code developers are writing themselves. But what about the open source dependency they're using? That is a big part. If you look at software supply chain, that is huge. We've heard about Log4j. We've heard about Spring4Shell. So are we looking at our open source components? Are we looking at the licenses associated with those, especially if we're distributing our applications? Is it a nice Apache or MIT license, or are we looking at an AGPL that if we distribute our software, it's going to be expensive?
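As a rough illustration of the license question, here is a minimal Python sketch that flags dependencies whose license is not on a permissive allow-list. The policy sets, function names, and dependency names are all hypothetical, and which licenses actually need review is a decision for your own legal guidance, not something this sketch settles.

```python
# Hypothetical policy sets using SPDX license identifiers; adjust to your own guidance.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}
NEEDS_REVIEW = {"AGPL-3.0-only", "AGPL-3.0-or-later"}


def classify_license(spdx_id):
    """Return a verdict for one license identifier."""
    if spdx_id in PERMISSIVE:
        return "allowed"
    if spdx_id in NEEDS_REVIEW:
        return "needs review"
    return "unknown"


def flag_dependencies(dependencies):
    """Return {name: verdict} for every dependency that is not plainly allowed."""
    verdicts = {name: classify_license(lic) for name, lic in dependencies.items()}
    return {name: verdict for name, verdict in verdicts.items() if verdict != "allowed"}


# Made-up dependency names, for illustration only.
deps = {
    "example-http-client": "MIT",
    "example-pdf-kit": "AGPL-3.0-only",
    "example-utils": "Custom",
}
print(flag_dependencies(deps))
```

Treating an unrecognized license as "unknown" rather than "allowed" keeps the check conservative: anything the policy has not seen before gets a human look.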

We can also do container scanning, secret detection. But the point is, we're having all of this scanning at the commit level, so we're empowering developers. I mentioned before 100 developers to one security. If we make each developer 1% better, if we give them their security results in an easily consumable format within their existing workflow, guess what? We're going to get results a lot easier and with a lot less friction.
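One way to picture "results in an easily consumable format" is a single short summary per commit that merges the findings of every scanner. The Python sketch below assumes a made-up `Finding` record and a severity threshold of "high"; it is not the output format of any particular tool.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Finding:
    scanner: str   # hypothetical labels, e.g. "sast", "dependency", "secret", "container"
    severity: str  # one of the keys in SEVERITY_RANK
    title: str


def blocking_findings(findings, threshold="high"):
    """Keep only the findings at or above the chosen severity threshold."""
    limit = SEVERITY_RANK[threshold]
    return [f for f in findings if SEVERITY_RANK[f.severity] >= limit]


def summarize(findings):
    """Build the short message a developer would see on their feature branch."""
    blockers = blocking_findings(findings)
    if not blockers:
        return "OK to merge"
    lines = [f"[{f.scanner}] {f.severity}: {f.title}" for f in blockers]
    return "Fix before merge:\n" + "\n".join(lines)


findings = [
    Finding("sast", "medium", "Unvalidated redirect"),
    Finding("dependency", "critical", "Vulnerable logging library"),
    Finding("secret", "high", "API token committed"),
]
print(summarize(findings))
```

The threshold is a design choice: start strict only on the highest severities so the gate stays credible, then tighten it as the backlog shrinks.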

What we can also do: why don't we do dynamic application security testing as part of this? In order to do this, at GitLab we have something called Review Apps. It's an ephemeral instance. You spin it up, and then you can start attacking it with dynamic application security testing. Are we leaking any headers? So we even have a live instance of our application at commit stage. We're really shifting security left. And it gets a lot easier, cheaper, and faster when you shift left than when it's in production, the developer's moved on, they're working on something else, and all of a sudden, you're like, "Ah, we have a critical vulnerability in production that's live right now." That is not where you want to be.
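To make the "are we leaking any headers?" question concrete, here is a small Python sketch of one passive check a dynamic scan might perform against a running instance. It works on a plain dictionary of response headers so it runs without a network; the two header sets are assumptions chosen for illustration, and a real DAST tool does far more than this.

```python
# Assumed examples of headers that reveal implementation details.
LEAKY_HEADERS = {"server", "x-powered-by"}
# Assumed examples of security headers we would expect to be set.
EXPECTED_HEADERS = {
    "content-security-policy",
    "strict-transport-security",
    "x-content-type-options",
}


def check_headers(headers):
    """Report leaking and missing headers; header names are compared case-insensitively."""
    present = {name.lower() for name in headers}
    return {
        "leaking": sorted(present & LEAKY_HEADERS),
        "missing": sorted(EXPECTED_HEADERS - present),
    }


# Made-up response headers standing in for a review app's response.
response_headers = {
    "Server": "nginx/1.18.0",
    "X-Powered-By": "Express",
    "X-Content-Type-Options": "nosniff",
}
print(check_headers(response_headers))
```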

So the fun stuff, DORA metrics. If you haven't already, I would really recommend reading Accelerate. It's a really good book. It's not too long. I read it over a weekend. It really talks about the science of DevOps. The awesome authors set out a survey. I think it was over 2,000 results, from startups to massive enterprise. They did some very scientific sampling. They talk about the results, what they learnt. I really would recommend it. They came up with four key metrics.

Number one, lead time. This is from a customer making a request to it being satisfied. There's two parts to it. One is the more feature design: creating the feature, the UX, making it all pretty, et cetera. The second part, which is what we technically measure, is the time to deliver. This is how long it takes to get implemented, to get tested, and to get delivered. What we are looking for: shorter is better. For example, if a customer makes a request and you can design it fairly quickly, how long until it's in production? Less than a day? Or one to six months? What this does, it enables faster feedback, and you can course correct quicker as well. You can listen to your customers and do what they want. How great is that?
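The time-to-deliver part of lead time can be sketched as the gap between a commit and its deployment. The Python below assumes you can export (commit time, deploy time) pairs from your tooling; the sample timestamps are invented, and a month is approximated as 30 days.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(changes):
    """changes: (committed_at, deployed_at) pairs; returns the median delivery time."""
    return median(deployed - committed for committed, deployed in changes)


def describe(lead_time):
    """Place a lead time against the two ranges mentioned in the talk."""
    if lead_time < timedelta(days=1):
        return "less than a day"
    if timedelta(days=30) <= lead_time <= timedelta(days=180):
        return "one to six months"
    return "between or beyond the two ranges mentioned in the talk"


# Invented sample data: three changes with 6, 10, and 48 hours from commit to deploy.
changes = [
    (datetime(2022, 3, 1, 9, 0), datetime(2022, 3, 1, 15, 0)),
    (datetime(2022, 3, 2, 10, 0), datetime(2022, 3, 2, 20, 0)),
    (datetime(2022, 3, 3, 9, 0), datetime(2022, 3, 5, 9, 0)),
]
typical = median_lead_time(changes)
print(typical, "->", describe(typical))
```

The median is used here instead of the mean so that one unusually slow change does not dominate the picture.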

The second one, deployment frequency. This is a proxy for batch size. The way I think of this, to quote the common phrase, how would you eat an elephant? I'm vegan, so I wouldn't. But if I were to eat an elephant, I would break it into chunks. It's the same idea with deployment frequency, because what we're trying to do is reduce cycle time and variability. Smaller batches accelerate feedback, reduce risk and overhead, and increase efficiency and motivation, and they also reduce deployment pain. The Accelerate book is great because it does talk realistically about burnout, about people, about deployment pain. Because if something is very painful, guess what? Developers or anyone in IT probably aren't going to stick around. They're going to look for somewhere where they can actually do their job without having multiple layers of bureaucracy. So when you look at deployment frequency, how often are you deploying? Is it multiple times a day, or is it once a month, once every six months? That is the difference we saw between the elite performers and the low ones. What's also worth noting is that for the elite performers, deployment is on demand, multiple times a day.
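Deployment frequency is simple to compute once you have a list of deployment dates. This Python sketch averages deployments per day over an inclusive window; the function name and the sample dates are assumptions for illustration.

```python
from datetime import date


def deployments_per_day(deploy_dates, window_start, window_end):
    """Average number of deployments per calendar day in an inclusive window."""
    days = (window_end - window_start).days + 1
    in_window = [d for d in deploy_dates if window_start <= d <= window_end]
    return len(in_window) / days


# Invented sample: 14 deployments spread over one week.
deploy_dates = (
    [date(2022, 3, 1)] * 3
    + [date(2022, 3, 2)] * 2
    + [date(2022, 3, 4)] * 4
    + [date(2022, 3, 7)] * 5
)
rate = deployments_per_day(deploy_dates, date(2022, 3, 1), date(2022, 3, 7))
print(f"{rate:.1f} deployments per day")
```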

Mean time to restore. I mentioned this earlier. When something goes down, how long does it take to restore? If, for example, you have a service outage, if there's an incident, how quickly is it being fixed? Is it within the hour, like the elite performers, or is it a week or a month? The reality is, if you have customers, which I really hope you do, if your system goes down, how long before they start going to competitors? Yes, there'll be some loyalty. But for example, if your service goes down for two months, are you going to have any customers left by the end? Who knows? Probably not the active users. They will have moved to your competitors.
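Mean time to restore is the average gap between an incident starting and service being restored. A minimal Python sketch, assuming incidents are available as (started, restored) timestamp pairs; the sample incidents are invented.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """incidents: (started_at, restored_at) pairs; returns the average outage duration."""
    durations = [restored - started for started, restored in incidents]
    # sum() needs a timedelta start value, otherwise it starts from the integer 0.
    return sum(durations, timedelta()) / len(durations)


# Invented sample: outages of 30, 45, and 75 minutes.
incidents = [
    (datetime(2022, 3, 1, 9, 0), datetime(2022, 3, 1, 9, 30)),
    (datetime(2022, 3, 8, 14, 0), datetime(2022, 3, 8, 14, 45)),
    (datetime(2022, 3, 20, 2, 0), datetime(2022, 3, 20, 3, 15)),
]
mttr = mean_time_to_restore(incidents)
print(mttr, "within the hour" if mttr <= timedelta(hours=1) else "longer than an hour")
```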

Finally, and I didn't actually know this was a metric, and then when I was reading about it, I was like, "Wow, okay. That is a very valid metric." Change failure rate. How many of these changes break the system, so that you then have to do a rollback or hot patch or fix? For the elite performers, we saw zero to 15%. For the low performers, 46 to 60%. That's a real two steps forward, one step back. Imagine if half your changes or over half failed. That sounds incredibly inefficient.
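Change failure rate is a plain ratio: failed changes divided by total changes. The Python sketch below compares the result with the two ranges quoted in the talk; the function names and sample counts are assumptions for illustration.

```python
def change_failure_rate(total_changes, failed_changes):
    """Fraction of changes that needed a rollback, hot patch, or fix."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return failed_changes / total_changes


def describe(rate):
    """Compare a rate with the ranges quoted in the talk (0-15% and 46-60%)."""
    if rate <= 0.15:
        return "elite range"
    if 0.46 <= rate <= 0.60:
        return "low range"
    return "outside the two quoted ranges"


# Invented sample counts.
for total, failed in [(40, 3), (10, 6)]:
    rate = change_failure_rate(total, failed)
    print(f"{failed}/{total} = {rate:.1%} -> {describe(rate)}")
```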

When we look at them, we've got the top two, one and two. They're all about throughput. That's kind of speed. There's also been a common misconception that you can't have quality and speed, but actually, if you follow these metrics, you can get both, and that's why I recommend the book. It really helps to explain why. The bottom two, this is about stability. Because it's great if we're going fast, but we also need to not break the system, because yeah, we're releasing 10 times a day, and six of the changes break, and then we have to spend our whole evening fixing those. That is not where we want to be.

What we saw, this is a 2019 stat, but you kind of get the idea. This is the difference between the elite performers and the low performers. One of my favorite stats from the book: elite performers spend 50% less time remediating security issues than low performers.

That's the thing about remediation and about unplanned work and rework in general. It represents a lack of quality, because if you are building quality in, you're not really going to have to do the rework or the unplanned work. This also leads into, say for example, SRE, where we're trying to reduce our unplanned work by improving the system. You're never going to stop the pager going off, but you want to minimize it as much as you can. I do a lot of yoga. I was at a yoga retreat. The fire alarm would go off every couple of weeks. I said to the yoga teacher, I was telling her about SRE, "If we fix the fire alarm, then we won't have to go after it every three weeks at 3:00 in the morning. Let's improve the system." Disclaimer, I spoke to the yoga teacher recently. The fire alarm is now fixed.

So why are we here? Why have you been listening to this talk? Why is this important? I'm going to read off the slide a bit, so I apologize in advance. If we secure our DevOps pipeline, we can improve operational efficiency, deliver better software faster with reduced security and compliance risk. We can also innovate and iterate so we can listen to our customers and outperform our competitors. Because if you think about it, lead time. If our customer is looking for a particular feature, maybe it's a slight improvement on a current one. They're like, "Oh, I love this drop-down menu, but it'd be great if I could edit or configure this." If we can get that in the tool within days, maybe weeks, that will include all the UX stuff. If we can write it quickly and get it in the system in a day, all of a sudden, our customers are going to really like us. We're going to start to gain market share. Because if you look at the flip side, if we're not listening to our customers, we're going to lose it. This will drive true business value, because when we listen to our customers and we deliver what they're looking for, then we gain market share, and then we're all in a good place.

To summarize: take a security-first approach. This comes back to the insecure design. Take that whole shift left. Break down silos. We are all on the same team. We all work for the same organization. We want the same end goal. Security is not the bad guy, I promise. We're just trying to keep everyone safe. See if we can just speak to each other at a normal level. We've had a pretty weird two years with this whole lockdown, so take a more empathetic approach.

Finally, make it fun, automate, and measure results. Empower developers. I've mentioned 100 developers, one security. If we can make every developer 1% better, all of a sudden we're going to see real results, and we want to measure those results across the business.

Thank you so much for listening. I hope you have enjoyed this talk. If you want to reach out to me, my name is Stefania. You can find me on my different channels here. Feel free to pop me over an email, slide in my DMs. I really love feedback. It helps me make my talk better, so feel free to reach out. I do this public speaking a lot, and I'd love to speak to you. If you are interested in GitLab, you can also reach out on my GitLab email as well. I will be in Slack for the Q&A, so feel free to reach out, and thank you so much for attending this session.