DevSecOps: Security Automation at Scale

London 2018

Margo Cronin

Solution Architect · Amazon Web Services

Security is job zero at Amazon.

How do they retain a high pace of innovation while meeting increasingly tougher security requirements?

Margo Cronin

My name is Margo Cronin. I'm a solutions architect. I work for Amazon Web Services, and I'm here today to talk to you about DevSecOps.

My personal opinion: I'm not particularly crazy about the term DevSecOps. It's like the last kid getting on the bus and all the seats are taken, and Sec is told to sit there between Dev and Ops, to squeeze in. And that's definitely not the case. Security is not an afterthought for any of us, especially now with data regulation getting so much tighter, with the likes of GDPR, et cetera.

I mentioned I work for Amazon Web Services. Maybe something you don't know about Amazon Web Services is the rate at which we release services and features, and the rate at which it's increased exponentially over the last couple of years. If you look at 2010, we released 61 major new features and services, as opposed to last year, where it was 1,430.

But security is really job zero. It's the first job we do at Amazon. Aside from our own internal security regulations, we are regulated very heavily outside by SOC 1, 2, 3, ISO, FedRAMP, C5, GDPR, FIPS, and many other security regulations. So how do we retain this pace of innovation, but yet have all these features meet these security requirements?

Likewise, if you look at amazon.com, these are the metrics for the deployments on amazon.com. Roughly every 10 seconds there's a deployment on amazon.com, up to as much as 30,000 hosts. That means there have been hundreds of thousands of deployments since I've started talking to you. So how did they retain that pace of innovation but also retain the security practices that they have? Because can you imagine if there was a security incident on amazon.com?

So when I began my session, I said I wasn't crazy about the term DevSecOps. There are other terms out there, other terminology we use, something like rugged IT. You sometimes hear that as well.

At Amazon Web Services, what we look to do is to take our core security practices, automate them, integrate them into our software delivery lifecycle, and our release practices, our release processes. But what we specifically look to do is to do that at scale, because when you deploy a service on Amazon Web Services, be it you as a customer or us as the vendor releasing a new feature or service, this is across 18 regions across the world, across 55 availability zones. So it needs to be security automation at cloud scale.

So for us, a fundamental principle of DevOps and to retain our pace of innovation is automation. Now, we like humans. We do like people. But people make mistakes. Okay?

Think about being an engineer experiencing an incident in production. You're on a Slack channel, there are 40 people there, and they're not really contributing much. You're on hour five of a severity one call, cup of coffee number 11. You've got a C-level guy ringing you saying, "Fix it. Just fix it." If you're going to make a change in production under those circumstances, the likelihood is you could make an error, or you could put a workaround in place and forget to roll it back, or forget to document it, or you leave it and the next time there's an update, the next time there's a release, this incident is back again.

Also, as humans, we're very fond of bending rules. I think this is actually one of our favorite things to do. For me, what is interesting about rule bending is we often do it from a place of goodness. Right?

I know when we work on projects and you've been working on something and you're looking to release something, maybe it's a new application, a new app. Your C-level guy, he's really anxious to start tweeting about it, to tell shareholders about it. Friday afternoon, you're ready to go to the pub, the beers are bought. You're just going to skip a couple of steps. You're just going to push it out there, and you'll come in on Saturday and you'll document the change or you'll update it later on. So you're bending rules, and this can compromise your production landscape.

And people can act with malice. Okay? Attacks on our production landscape, automated DDoS attacks, bots that are pulling down keys from GitHub. Yes, this is automated, but invariably there's a human somewhere behind that. Okay? So people act with malice.

And machines don't. Yet, at any rate.

So today I want to talk to you about four steps to implement security automation in your organization at scale. They're not exhaustive. There are other steps out there, but I think they're a good place to start.

The first is to establish your level of trust in your cloud service provider, in your tools, and in your vendors. It sounds kind of obvious. As I mentioned, I'm a solutions architect. I live in Switzerland, actually, and I work predominantly with finserv clients, so I work with banks.

Now, their level of trust is typically here. They have their own security departments, their own security policies, their security teams, security tooling, very established security practices. In Switzerland, where I live, we're highly regulated in how banks deal with client-identifying data. So it's very understandable that their level of trust is down at that end of the scale.

Sometimes when I deal with fintech and with startups in the fintech industry, their level of trust will be up at this end. They'll use pretty much every service and feature that comes from the cloud service provider. And then sometimes with customers, they're somewhere in the middle.

So when you're using cloud, it doesn't matter where you are on the trust scale. Okay? You can create your permutation to meet your trust needs.

Let's look at it from the perspective of encrypting data. My customers that are highly regulated, and therefore have low trust, they might say to me, "We're going to use our keys. We have to do encryption. We have to use our hardware security module." Maybe they don't have a hardware security module on premise, or they don't have a good way to store the keys, so we move them a bit up the trust scale, and we ask them to use our key management service.

Or alternatively, if it's a fintech startup, it might be something where they use our keys, we do the rotation, they use our hardware security module. So it doesn't matter where you are on the scale. You can choose a permutation that meets your needs. It does impact your security automation, however, which we'll see in a minute.

Let's look at another example using the trust scale. Let's say you want to deploy something like Kubernetes. If you're down at the zero level of trust, you will possibly look at doing everything yourself. Okay? So you'll deploy the master nodes, you'll manage distributed consensus, you'll manage scaling, you'll manage the encryption of the etcd, you'll manage the pods. You're doing everything yourself.

If you're at a higher level on the trust scale, you would possibly use a managed service like Elastic Kubernetes Service. Here, Amazon exposes an API to you. We manage the master nodes. We manage the scaling. We manage the distributed consensus. Your worker nodes register with an API endpoint, which is your master. AWS manages AuthN, and that way, you don't have to worry about a certificate authority, or self-signed certificates, or certificate rotation.

So in this scenario, RBAC is enabled by default, Kubernetes role-based access control. In the previous scenario, where you're managing it natively, you're managing RBAC. And this is where we're going to talk about security automation, because where you are on the trust scale impacts how much you have to automate and how much you have to plan to automate.

So if security is job zero for you, if it's your top priority, you must plan more effort in security automation down at the left-hand side of the scale rather than up at the right-hand side of the scale.

If you look at a security incident like Tesla experienced at the beginning of the year, their Kubernetes control plane was compromised. So RBAC was off by default. This hadn't been enabled. These are the kind of stories you need to look to automate and the kind of stories you need to plan to automate.

It doesn't matter where you are on the trust scale. You can configure the tools to meet your needs. But plan your security automation.
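One way to turn the RBAC story into an automated check is to inspect the API server's start-up arguments before a self-managed cluster is accepted. The sketch below assumes the arguments are available as a list of strings; `--authorization-mode` is the real kube-apiserver flag, while the function names are made up for illustration.

```python
def authorization_modes(apiserver_args):
    """Return the modes listed in --authorization-mode, or [] if the flag is absent."""
    prefix = "--authorization-mode="
    for arg in apiserver_args:
        if arg.startswith(prefix):
            return [m.strip() for m in arg[len(prefix):].split(",") if m.strip()]
    return []


def rbac_enabled(apiserver_args):
    """True only when RBAC is explicitly listed as an authorization mode."""
    return "RBAC" in authorization_modes(apiserver_args)


if __name__ == "__main__":
    args = ["--secure-port=6443", "--authorization-mode=Node,RBAC"]
    print("RBAC enabled:", rbac_enabled(args))
```

A pipeline step could fail the deployment whenever `rbac_enabled` returns False, so the check does not depend on somebody remembering to look.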

So talking about planning, this brings us to security by design. For me, the introduction of GDPR earlier this year was quite interesting. As well as introducing the data breach notification, the need for data portability, and the right to be forgotten, they also introduced this idea of privacy by design, which is where you demonstrate that you're implementing GDPR through your projects and programs, no matter whether you're using Scrum or Waterfall, no matter what practices you're using, DevOps or site reliability engineering, that you're following GDPR policies all through your project lifecycle.

And this philosophy is actually a really good philosophy to adopt for DevSecOps. Interestingly, it nearly implies that if you want to be GDPR compliant, you have to do DevSecOps. But security by design means introducing security all the way through your software development lifecycle.

At Amazon, we famously have leadership principles. Okay? They govern how we make decisions. There are 14 of them. They govern how we make decisions. They govern how we hire. They govern how we, as employees, are reviewed.

One of the leadership principles is ownership. This is a very good leadership principle, I think, for DevSecOps. In the same way that DevOps has broken down the silos between development and operations, and you're no longer throwing units of code over to the ops guys, it's now everybody's responsibility. Now everybody in the team is a security practitioner. Security is no longer the job of a team down the corridor or of a team in another building. Security is everybody's job on the team.

Because security is everybody's job on the team, you can now look at pushing your security stories through the same pipeline as you push your application stories through. So the security jobs you need to harden your production landscape can all be described as epics and reduced to functional stories now really easily.

In the past, if you were to take something like infrastructure security and you were to say, "Okay, the functional story here is we want to deploy web application firewalls," this is something that might have taken security teams a long time to do. They would have had to evaluate vendors, send out RFPs and RFIs, find partners or maybe resources internally to implement and build POCs to test the web application firewall.

Now, with the cloud, you can just deploy a web application firewall within seconds, and within your sprint, test the web application firewall and test your acceptance criteria for the web application firewall. And that's the case with all these security epics. They can be reduced down to functional stories to harden your production landscape.

So for security by design, every member of your team is now a security practitioner. Okay? It's everybody's job, security ownership. Decompose security epics down to security functional stories and create security acceptance criteria.

So your application stories should have security acceptance criteria. If you have an app and the story is that the user creates a login, a security acceptance criterion should be: how are these user details deleted? If you're deploying a piece of infrastructure like a web application firewall, the acceptance criterion might be testing the access control list rules of the web application firewall.
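A security acceptance criterion can be written as an executable check that runs in the same pipeline as the functional tests. This is a minimal sketch for the login story; `UserStore` is a hypothetical in-memory stand-in for whatever really holds user details.

```python
class UserStore:
    """Hypothetical in-memory stand-in for the system that holds user details."""

    def __init__(self):
        self._users = {}

    def create_login(self, username, email):
        self._users[username] = {"email": email}

    def delete_user(self, username):
        # Deleting an unknown user is treated as a no-op.
        self._users.pop(username, None)

    def find(self, username):
        return self._users.get(username)


def deletion_criterion_met(store):
    """Acceptance criterion: after deletion, no details of the user remain."""
    store.create_login("alice", "alice@example.com")
    store.delete_user("alice")
    return store.find("alice") is None


if __name__ == "__main__":
    print("deletion criterion met:", deletion_criterion_met(UserStore()))
```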

And then finally, you should be using the same continuous integration and continuous delivery pipeline to push these stories through, to push your security infrastructure through as your application. And if you're not, the question is, why not?

Which brings us to step three: what are you securing?

So you should look at your continuous integration and continuous delivery pipeline as a shared responsibility model. You should look at the security of the pipeline and the security in the pipeline.

How are you controlling access to the pipeline? Are the users who push units of code through your pipeline assigned to groups, and assigned a role under which they can do this activity? If an event occurs in your pipeline, is your user assigned a token for a particular period of time, so you can identify who did what and when? Do you use multi-factor authentication to access the pipeline?
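Those questions map onto a small access model: groups resolve to roles, tokens expire, MFA is required, and every decision is recorded. The following is only a sketch under those assumptions; the group names, role names, and the 15-minute lifetime are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical mapping from group membership to a pipeline role.
ROLE_BY_GROUP = {"developers": "pipeline-push", "release-managers": "pipeline-deploy"}


@dataclass(frozen=True)
class Token:
    user: str
    role: str
    expires_at: float


def issue_token(user, group, mfa_passed, now, ttl_seconds=900):
    """Issue a short-lived token, but only after MFA and for a known group."""
    if not mfa_passed:
        raise PermissionError("multi-factor authentication required")
    role = ROLE_BY_GROUP.get(group)
    if role is None:
        raise PermissionError(f"group {group!r} has no pipeline role")
    return Token(user=user, role=role, expires_at=now + ttl_seconds)


def authorize(token, required_role, now, audit_log):
    """Check role and expiry, and record who attempted what and when."""
    allowed = token.role == required_role and now < token.expires_at
    audit_log.append({"user": token.user, "role": required_role, "at": now, "allowed": allowed})
    return allowed
```

Passing `now` in explicitly keeps the sketch deterministic; a real service would read the clock itself.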

And likewise, you should look at the security in the pipeline itself. So if you're in the cloud, you're possibly doing something like this. You have developers, and they're writing smaller units of code now, and this is great because you've reduced your blast radius and it's easier to find issues. However, it means you're doing many more pushes, which increases the need for security automation.

You're probably using some sort of version control system, maybe GitHub or Bitbucket, and you're pushing up to a CI server where we're orchestrating our builds, we're bundling together our code, we're finding our dependencies, we're pushing up to an artifact repository. And now, because we're in the cloud, there's a lot we can do here. We can include AMIs, server images, configuration management packages, deployment mechanisms, and do infrastructure as code. We can spin out our environments, deploy our infrastructure, deploy the application, test, and repeat.

But across this pipeline, there are many points where you can introduce security. At the developer workstation itself, under the "people make mistakes" category, something that can sometimes happen is a well-meaning developer can include AWS access keys and secret keys in the code that's being checked into this public repository.

So when this happens, you're essentially putting the keys of your production, of your landscape, your cloud landscape, out there to the public. And unfortunately, there are bots out there that scan public repositories and pull these keys down within seconds. Seconds. I have had a customer test this, and it took two seconds for it to be pulled down.

When this happens, they start spinning up infrastructure and essentially start Bitcoin mining on your production landscape. So this is really an oh-my-God incident, right? This is something you just don't want happening. So wouldn't it be great if you could just prevent it from happening?

And the answer is you can. There are many ways to prevent it from happening. If you're using something like GitHub, there's a great tool out there called git-secrets. It installs a pre-commit hook that checks your changes against a set of regular expressions, and it actually prevents the code from being committed in the first place. And that's what you need to do here.
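The idea behind such a hook can be sketched in a few lines of Python. This is not git-secrets itself, only an illustration of the technique: scan the text about to be committed for strings shaped like AWS access key IDs and return a non-zero exit code if any are found. The pattern covers the common `AKIA` and `ASIA` prefixes only, and the function names are hypothetical.

```python
import re

# Access key IDs are 20 characters: a 4-character prefix plus 16 uppercase
# letters or digits. Only two common prefixes are covered in this sketch.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_key_ids(text):
    """Return (line number, match) pairs for anything shaped like a key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def hook_exit_code(staged_text):
    """A pre-commit hook blocks the commit by exiting non-zero."""
    return 1 if find_key_ids(staged_text) else 0


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = 'region = "eu-west-1"\naws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'
    print(find_key_ids(sample), hook_exit_code(sample))
```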

At your CI server, at your build, you have the opportunity to do some coding best practices, do some static code analysis, do some linting. At your artifact repository, you can now audit and validate your packages, sign and validate your packages, and all through the pipeline, you should be logging. Right? Your developers need logs, but even more importantly, your security practitioners, right? Everybody on the team is a security practitioner. They need that kind of visibility. So you should be logging.
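Signing and validating packages at the artifact repository could look roughly like this. The sketch uses an HMAC with a shared key purely to stay within the standard library; that is an assumption made for brevity, and real pipelines more commonly use asymmetric signatures so that the verifying side never holds the signing key.

```python
import hashlib
import hmac


def sign_artifact(data: bytes, key: bytes) -> str:
    """Produce a hex signature over the artifact bytes."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_artifact(data: bytes, key: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_artifact(data, key), signature)


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    package = b"contents of app-1.0.tar.gz"
    sig = sign_artifact(package, key)
    print("untampered:", verify_artifact(package, key, sig))
    print("tampered:", verify_artifact(package + b"!", key, sig))
```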

And I mentioned infrastructure as code. Now in the cloud, you can describe your entire environments as code, and you can describe your infrastructure the same way you describe your application. And you can check it into GitHub and Bitbucket and CodeCommit the same way you do your application. So you can write, version, store, and deploy your infrastructure.

It means you can move away from the classic dev, SIT, UAT, and prod, and you can create verticals across those horizontals dependent on what you're trying to do. So a vertical for security where you deploy all your security infrastructure, for example, if you're using things like Security Token Service and Identity and Access Management. A vertical maybe for networking, a vertical for logging and monitoring, and a vertical for your applications.

It greatly reduces your blast radius of your environment. So if you believe you've had a security incident and you need to tear down that bit of the environment, it is much easier to do and easier to achieve.

Which brings us to automating responses.

So we've all heard of Love Island, except for the contestants of Love Island. When it comes to DevSecOps, I want you to think of log love. Or more specifically, you need to think of event log love. But event log love does not sound as snazzy as log love. So we're going to stick with log love, but you know what I mean.

And so now with DevSecOps, logging services and monitoring, these are the guardians of your galaxy. Right? And you need to ask yourself certain questions. When are you collecting log files? Is it just when your application is in production? Because you could be doing so much more. You should be collecting log files, for example, during your pipeline as well.

Why are you collecting log files? Is it just for auditors and regulators? Because you can do so much more to enable DevSecOps.

Where are you collecting log files? Okay, this has an impact on your cost management. There are many cost-effective ways to gather log files.

But most importantly, for security automation, what are you doing based on your log files?

So I remember developing and working on applications on premise, and one of the things I remember is that as we went up the stages, you typically reduced the amount of log files you were gathering because you were somewhat constrained regarding capacity in production. And I also remember with log files that we used to have this kind of marking, like E for error, I for information, W for warning, and F for fatal.

But now this is no longer the case. Now every bit of our log files is informative, and we can act on the information in the log file.

So let's say you have an application running in production. Maybe as part of this application, you have a user accessing a file in a bucket that's encrypted. When this event happens, your application logs to a service called Amazon CloudWatch. Amazon CloudWatch captures application-level logging.

But at the same time, another service called AWS CloudTrail captures all the activity of your account, so everything related to your account under the hood. Very important for you and for regulators and auditors.

All of this information can be stored for regulators and auditors, but even more importantly, with all of this logging from both CloudTrail and CloudWatch, you can trigger an automatic event based on the information in the log files. So let's say, as I said, somebody has accessed a file and it's resulted in something being decrypted, a key being decrypted under the hood. You can create an event-driven function, so serverless, and notify your security team that somebody has called decrypt.

Maybe you want to send a notification to tools on premise, or maybe based on what the user has done, you want to make an application-level API call.
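An event-driven function of this kind could be sketched as follows. It assumes the CloudTrail record arrives under a `detail` key, as it does when CloudTrail events are delivered through CloudWatch Events, and it takes a `notify` callable as a hypothetical stand-in for whatever actually alerts the security team or calls on-premise tooling.

```python
def handle_event(event, notify):
    """Notify the security team when a KMS Decrypt call shows up in the trail."""
    detail = event.get("detail", {})
    is_decrypt = (
        detail.get("eventSource") == "kms.amazonaws.com"
        and detail.get("eventName") == "Decrypt"
    )
    if not is_decrypt:
        return False
    who = detail.get("userIdentity", {}).get("arn", "unknown principal")
    when = detail.get("eventTime", "unknown time")
    notify(f"Decrypt called by {who} at {when}")
    return True


if __name__ == "__main__":
    sample = {
        "detail": {
            "eventSource": "kms.amazonaws.com",
            "eventName": "Decrypt",
            "eventTime": "2018-10-01T12:00:00Z",
            "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"},
        }
    }
    handle_event(sample, print)
```

Injecting `notify` keeps the decision logic testable without any cloud credentials; in a deployed function it might publish to a topic or post to a chat channel.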

Now, CloudTrail, it's a little bit like the security cameras of a bank. Right? So if you think back to the old banking movies, the heist movies, and somebody is breaking into the bank, what's the first thing they do? They disable the security cameras.

So if your account is being attacked, being violated, one of the first things somebody will do is disable services like these. Wouldn't you want to know about this? A really nice feature is to have these services send a notification, "Hey, I've been switched off," and notify your security teams. This gives your security team the opportunity to see, is this overuse of privileges, or is there actually something happening? Is there somebody attacking our account and we need to ring-fence web servers?
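The "I've been switched off" check and the first triage question can be sketched together. `StopLogging` and `DeleteTrail` are real CloudTrail API actions; the list of approved administrator ARNs and the three outcome labels are assumptions made for this example.

```python
# CloudTrail API actions that stop or remove a trail.
TAMPER_EVENTS = {"StopLogging", "DeleteTrail"}


def triage(detail, approved_admin_arns):
    """Classify a CloudTrail record: ignore it, review privilege use, or suspect an attack."""
    if detail.get("eventSource") != "cloudtrail.amazonaws.com":
        return "ignore"
    if detail.get("eventName") not in TAMPER_EVENTS:
        return "ignore"
    actor = detail.get("userIdentity", {}).get("arn")
    if actor in approved_admin_arns:
        # A known administrator switched logging off: possible overuse of privileges.
        return "review-privilege-use"
    return "possible-attack"


if __name__ == "__main__":
    admins = {"arn:aws:iam::123456789012:user/ops-admin"}
    record = {
        "eventSource": "cloudtrail.amazonaws.com",
        "eventName": "StopLogging",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/unknown-user"},
    }
    print(triage(record, admins))
```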

Another example of using logging services to prevent as well as just log is in the case of malicious IP addresses. So there are known lists of malicious IP addresses out there. Your security teams might be using third-party services to provide this information to your organization. They can update this information in an S3 bucket.

Once this information is in the S3 bucket, Amazon CloudWatch can trigger an automatic event, an event-driven function, to update your web application firewall automatically. So when the malicious IP address tries to access your web servers, it's blocked immediately by the web application firewall. Again, your logging service doing something automatic to protect your environment.
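The core of that function is working out what to change in the firewall's block list when the file changes. A minimal sketch, assuming the file holds one IP address or CIDR range per line with optional `#` comments; the helper names are hypothetical, and actually applying the plan to the firewall is left out.

```python
import ipaddress


def parse_blocklist(text):
    """Parse one IP or CIDR per line, skipping blanks, comments, and invalid entries."""
    networks = set()
    for raw in text.splitlines():
        entry = raw.split("#", 1)[0].strip()
        if not entry:
            continue
        try:
            # Normalise single addresses to /32 (or /128) CIDR notation.
            networks.add(str(ipaddress.ip_network(entry, strict=False)))
        except ValueError:
            continue
    return networks


def plan_update(current, desired):
    """Return the ranges to add to and remove from the firewall's block list."""
    return {"insert": sorted(desired - current), "delete": sorted(current - desired)}


if __name__ == "__main__":
    uploaded = "203.0.113.7\n# known scanner range\n198.51.100.0/24\n"
    print(plan_update({"192.0.2.1/32"}, parse_blocklist(uploaded)))
```

Computing a diff rather than replacing the whole list keeps each update small and makes the change easy to log.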

And how do we do logging at Amazon? Can you imagine the petabytes of data and logs that we have at Amazon Web Services? So we gather all of these logs, all of our logs, raw logs, permissions, VPC logs, application logs, CloudTrail logs, CloudWatch logs, and we stream them to dedicated accounts. These are all sent to S3 buckets, including all of our CloudTrail logs.

And many people might stop here, or they might use something like Elasticsearch or Kibana. What we do is we parse this on a managed Hadoop platform called Elastic MapReduce. They're then uploaded to a data warehouse on which we run analytical tools and machine learning tools.

Likewise, as I mentioned at the beginning, we're heavily regulated, so everything is stored in Glacier for regulators and auditors, and everything is encrypted end to end. So this is how we do logging for our security automation.

What are we looking for? We're looking for unused permissions. We're big on least privilege at Amazon Web Services, so we always remove privileges if people shouldn't have those privileges. We're looking for overuse of privileged accounts, people doing things like switching off CloudTrail. We're looking for the usage of keys. Who is decrypting what? How long is that data decrypted for? We're looking for anomalous logins, people logging on for 0.0001 of a second. And then we're looking for the obvious stuff, right? The policy violations, the systems abuse, the attacks.

But we collect the data once, we parse the data, and then we have many use cases for it.
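Two of those checks, unused permissions and implausibly short logins, reduce to simple operations once the logs are parsed. The sketch below assumes the parsed data is already available as plain dictionaries; the field names and the one-second threshold are illustrative choices, not anything described in the talk.

```python
def unused_permissions(granted, used):
    """For each principal, list granted permissions that never appear in the logs."""
    report = {}
    for principal, permissions in granted.items():
        unused = permissions - used.get(principal, set())
        if unused:
            report[principal] = sorted(unused)
    return report


def short_sessions(sessions, min_seconds=1.0):
    """Return sessions whose duration is shorter than a human login plausibly lasts."""
    return [s for s in sessions if s["logout"] - s["login"] < min_seconds]


if __name__ == "__main__":
    granted = {"alice": {"s3:GetObject", "kms:Decrypt"}, "bob": {"s3:GetObject"}}
    used = {"alice": {"s3:GetObject"}, "bob": {"s3:GetObject"}}
    print(unused_permissions(granted, used))
    print(short_sessions([{"user": "carol", "login": 0.0, "logout": 0.0001}]))
```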

So through my session, I have spoken about four steps to help you enable security automation at scale. First, establish your trust. Okay? It helps your planning by clarifying which parts of your cloud service provider you can use. But more importantly, it helps you plan your security automation effort, because that effort is higher if you're down at the zero point.

The second one is security by design. If you want to be GDPR compliant, I think doing DevSecOps is actually mandatory now.

The third, to take a shared responsibility model approach to your CI/CD pipeline: the security of the pipeline and the security in the pipeline.

And then finally, your logging services, your monitoring services, your log files, they're the guardians of your galaxy.

But the key takeaway from my session today is that if security is the top priority for you, then your security shouldn't be relying on caffeine. If you want to ensure security in your DevSecOps practice, automate security at scale in your organization.

Thank you.