Prove it! The Last Mile for DevOps in Regulated Organizations

San Francisco 2015

Principal Security Solution Architect · Amazon Web Services

In mathematics, the proof is an established style of argument designed to show that a statement is true. In compliance and audit circles the equivalent of “true” is when a control is operating effectively and validated using appropriate evidence. Despite recent years of progress in developer efficiency and feature flow, the style of argument used to demonstrate control often still resembles spreadsheets, screen shots and live observations of operators.

In a DevOps world, practices like continuous deployment and infrastructure automation actually implement key controls. Asking engineering teams to perform manual tasks required for control artifacts creates friction that often stops DevOps culture shifts before they get off the ground.

This talk focuses on the last mile of demonstrating security and compliance in enterprises embracing DevOps: proving that you control risk without resorting to legacy control attestation. Based on thousands of engagements with global and Fortune 1000 companies migrating workloads to AWS, this talk tells the stories of highly-successful enterprises trying to demonstrate security and compliance in a cloud and DevOps world.

This talk showcases the following areas:

-Identifying blockers and friction – reasons current compliance and audit reporting practices in regulated enterprises can slow transitions to the cloud and DevOps practices.

-Partnership – Organizational behaviors and leadership moves that help overcome objections to the automation of security controls typical of DevOps delivery models.

-Control Design – Examples of how to translate compliance requirements into engineering specs that work in an environment built on automation.

-Control Evidence – Proofs! Patterns used by highly regulated companies to automate evidence and artifact gathering.

Full transcript

The complete talk, organized by section.

Bill Shinn

Hi, my name is Bill Shinn. I'm a Principal Security Solution Architect with Amazon Web Services. What that means is that I help our customers understand privacy, security, risk management, compliance as they move their workloads and their businesses to Amazon Web Services.

This talk is definitely not about the cloud because, as we all know, DevOps is not the cloud. There are just heavily overlapping Venn diagrams: the people who are taking on DevOps opportunities, looking to modernize their feature flow and their development practices, happen to be a lot of the same people who are moving to the cloud.

When you can program your infrastructure, when you can treat your infrastructure as code, when you can deploy servers and storage and everything else the same way you would deploy code and manage it the same way, there's a big correlation between taking opportunities to use the public cloud in the right way and also use DevOps practices.

I thought of this talk, and I've been with probably a thousand different enterprise customers. I spend a lot of time with GE, with folks like Hearst and Philips and Pacific Life and other publicly referenceable enterprise use cases on top of Amazon Web Services. The security is not so much the problem anymore.

Three years ago, there were a lot of security questions, a lot of security objections. I think the "prove it" part is really about compliance and audit now. We have folks like Capital One who talk about DevSecOps or DevOpsSec, and we talk about it with our customers, and the security teams are starting to participate, and they're starting to get on board. I think the last mile, as I'll call it, is really proving it to audit and compliance and the regulators that what you've done has managed control as you move to DevOps or you move to public cloud.

Today, we'll talk a little bit about blockers and objections. Hopefully, have a little bit of fun with that. Tell a little bit of history lessons, hopefully. We'll talk about the partnerships that we see in organizations that are working really, really well.

Before, it was development and operations having to get together and have a partnership. Now, I think that the last mile and the next step, and we see this in some successful organizations or highly successful organizations, is bringing in audit and compliance into that as well, and how they do the control attestations and control evidence and gathering control artifacts.

We'll talk a little bit about control design, and then the proofs, right? In the abstract, you were promised math a little bit. Mathematical proofs are kind of the language of proving a conjecture or proving a math theorem. We'll talk about that a little bit, too.

Gene, when he asked me to do this talk, he talked about the research agenda from last year's conference. We'll talk about the middle three, so a little bit about culture and leadership, some top approaches to design and how teams are partnering together. We'll talk a little bit about the roles and responsibilities throughout this, and then certainly, we'll talk about information security and compliance practices of folks who are moving their workloads to the cloud and using DevOps to do it.

You were promised math, right? Everybody knows from junior high or high school math, or wherever you took your geometry, everybody understands the Pythagorean theorem, right? Good old Pythagoras. The top equation there is the Pythagorean theorem, usually written as X squared plus Y squared equals Z squared.

Pierre de Fermat was a French lawyer who did a little more than just lawyering. He dabbled in math quite a bit. He's pretty much responsible for differential calculus, for a lot of number theory that was created in that era. And he worked out, somewhat questionably, as history will tell you, many proofs for his conjectures and math theories, right?

He didn't write any of them down very well, and the techniques he used allegedly or would have had to use to solve some of these things were definitely unavailable at the time. Some of them were not available in modern math for more than 100 or 200 years after he allegedly did his proof. So some of them, certainly he could be accredited for, but some of them are a bit questionable.

Fermat's Last Theorem is a famous math problem that was unsolved for hundreds of years. Basically, it says that X to the N plus Y to the N equals Z to the N has no positive whole number solutions for N greater than two. So X cubed plus Y cubed never equals Z cubed, and so on to infinity. The greatest mathematicians in history tried to make a proof for this for hundreds of years.
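Written out, the two statements from the slide look roughly like this. This is a plain restatement of the standard forms, not a reproduction of the slide itself; the `\text` command assumes the `amsmath` package.

```latex
% Pythagorean theorem: whole-number solutions exist, for example (3, 4, 5).
\[
  x^2 + y^2 = z^2, \qquad \text{e.g. } 3^2 + 4^2 = 5^2 .
\]

% Fermat's Last Theorem: raise the exponent above 2 and the solutions vanish.
\[
  x^n + y^n = z^n
  \quad \text{has no solutions in positive integers } x, y, z
  \text{ for any integer } n > 2 .
\]
```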

I guess one day he was spending some quality time with a third-century Greek text, Arithmetica, and he wrote in the margins this quote, and his son found it after his death. And to me, I'll just let you read that for a second: "I have a truly marvelous demonstration of the proposition which this margin is too narrow to contain."

I don't know about you, but that sounds an awful lot like some of the conversations between DevOps folks and their auditors, right? So it's secure, I promise. I just didn't write it down.

And the highly successful organizations, they are writing it down, right? They're actually writing down the proof. The margin is not too small. They put it in the margin. They write it down. And I think that's kind of the last mile there.

Let's talk a little bit about maybe an SDE equivalent, I guess, of his time, Sir Andrew Wiles. He went to Oxford, went to Cambridge, went to the Princeton Institute for Advanced Study. It's a place that von Neumann and others went to. He first encountered Fermat's Last Theorem at age 10.

This guy's really into math. He's definitely a maker. There's a quote in a great book by Simon Singh called Fermat's Enigma: "I love doing the problems in school. I'd take them home and make up new ones on my own." Very DevOps. You find a problem, you explore it, you dig in, you dive deep, and it's just a passion and it's interest for this person.

So it's kind of what it takes to get to proving it for audit and compliance, too. We need more people like Andrew. He spends seven years working in isolation on this problem, starting in about 1986. In 1993, he goes to Cambridge, I think, and does this big blackboard demonstration of his proof, along with a 200-page proof of a solution, proving that Fermat's Last Theorem is true, that there is no whole number solution for X to the N plus Y to the N equals Z to the N.

So what can we learn from the proof of Fermat's Last Theorem by Andrew Wiles, or Sir Andrew Wiles now, as applied to DevOps and compliance?

Write down the proof, right? He did what Fermat did not do, which is he wrote down the proof. It was a 200-page proof, and if you've ever been through an audit, a regulatory audit, or internal audit, or a PCI audit, it's hundreds of pages. For a DevOps engineer who's writing a piece of software and deploying that through a CI/CD pipeline, there are a lot of controls that go into that for a regulated environment.

There's the traditional separation of duties, there's access control, there's logging and monitoring, there's encryption. There's lots of things that have to go into documenting a complete control. And people in banks and pharmaceuticals and industrial enterprises have been doing this for decades, and they have pages and pages and pages of audit documentation, particularly in the FDA compliance space or in the banking space.

They have control catalogs, they have audit artifacts that they've been using for years and years and years. The auditors come in, they open up the work papers, and then say, "What did you do last year for your control? What are you going to do this year, and has it changed?" And the effort to change that and educate the auditors is a pretty high bar.

If you're changing the way that you're shipping code, that you're developing code, that you're doing operations, it can be disruptive to the way that auditors and compliance folks communicate to the regulators, for example. So even if you have security on board, and we often find that security is on board, they're at the table now. They're operating things. They're helping to make the solutions. But it's often not security or even the CISO who talks to the regulators. It's corporate compliance. It's internal audit. It's not necessarily security. So write it down.

Don't work in isolation. He worked for seven years in isolation, obviously one of the most brilliant mathematicians of our time. But when he published everything, they found a flaw in August of 1993 in his proof.

Now, there was a public competition for this, by the way. I think it was 1908, there was a prize offered for the person who proved the theorem. And he wanted to work in isolation, under total secrecy almost, with only his wife knowing that he was actually working on this problem. During the time he was working on the proof, he would periodically publish things that he'd already done before starting the work, just so people didn't think he was doing nothing.

So he didn't share. And then once he started sharing, he was under very strict scrutiny, kind of like an audit, right? And the thing, too, is that people will criticize, right?

His proof was so large, 200 pages, that like a corporate compliance or corporate audit team, they had to break it up into chapters. These are renowned mathematicians at Cambridge and Oxford who were reviewing his work. But because he worked in isolation, it took a long time for them to review it.

They found a flaw, and he spent another year under public scrutiny, kind of like an audit, trying to fix the flaw. Eventually, in 1994, he proved the theorem. And after another year of work, he almost gave up, and he was able to persevere and to have a proof for Fermat's Last Theorem.

Another thing is everyone reads each other's proofs in math, right? They bring the most qualified people to judge the most qualified people. And oftentimes when you're doing things like using Jenkins or using Docker and using modern methodologies for development and deployment and operations, there's a lot of education that has to go on.

How many people are developers? Sweet. How many people are compliance and audit folks? Right. All right. Do the shake-the-hand, meet-each-other kind of thing.

How many developers have read NIST 800-53? One, two, three. Awesome. Awesome. So how many compliance people write and read code? Right. I'm not asking auditors to read code necessarily, but when your entire infrastructure and all your security controls are suddenly written and checked into GitHub, which is the way it ought to be, how do you possibly communicate?

And when you're writing code and you have to attest to the effectiveness of the control and prove that you're meeting a control objective that's stated in law, right? How do you prove that without understanding the law?

I think development and operations are a lot closer together in terms of language than something like Gramm-Leach-Bliley or the FDA regulations or the HIPAA Security Rule is to something like an SDK that deploys servers and deploys a virtual private cloud, or a continuous integration pipeline.

So in math, they read each other's work, but in DevOps and audit, we don't. And I think that needs to change. And I think that as we educate each other, that last mile, we'll cross that last mile.

Perseverance pays off. It's hard. These enterprises that we're working with, and you're part of them too, have been doing this for years now, and I think we're well along the journey. We definitely have a lot of regulated workloads moving to the cloud using DevOps. But we're still a ways away, and to have that perseverance and that optimism is critically important.

I think we talk a lot about optimism, and we talk a lot about friction sometimes. But I think the key thing to look for is the lack of apathy. It's okay to argue with the methodology. It's okay to test the controls and critique the proof and break it up into chapters and give it to the best of the best to scrutinize. But you can't be apathetic, right?

When you run into that apathy in an organization, you have to persevere through that, and you can call it out for what it is. If it's just apathy that no one wants to change or rewrite the control or translate the way you're doing something into the way that the control's been stated in the past, you have to work through that apathy.

So this guy, Galileo, he's the guy who proved that servers are not the center of the universe, right? Actually, yeah. We all know him for heliocentrism. There's the Copernican way, and then he was a big fan of it.

He was put on house arrest, not for his scientific theories about heliocentrism, about the Earth going around the sun, but it was the Roman Inquisition. Really what made the Pope a little angry was his reinterpretation of the Bible, trying to worm his explanations into existing scripture. Right?

Like how you have these corporate policies and you don't really want to change the policies because they have to go to the board, and they have to go to legal, so you try to change the way you're implementing the policy. Right? So he was put on house arrest for pretty much the remainder of his life for trying to prove it.

And it wasn't so much the fact that he was saying the Earth goes around the sun, but he's trying to sort of work against gravity almost by saying in scripture, there's things like, "The world is firmly established, it cannot be moved." Right? So if that's true, then the sun must go around the Earth, right?

But he was working against these essentially policy statements that he was unable to work through, and he was persecuted for it as part of the Roman Inquisition. So he's put on house arrest, spent his whole life trying to prove it, published some books, was basically put on house arrest.

So what can we learn from Galileo? The world does not revolve around servers. That's important, right? Not so much the policies or the controls, but the control procedures in a lot of regulated companies assume a fixed server or a fixed asset.

This is actually where we see quite a bit of friction where companies are moving to the public cloud, and they're using DevOps to do it. They're programming their infrastructure. They're using things like auto-scaling.

The asset management that underlies that and the way that auditors take their samples of the running environment, they say, "You have 10,000 servers in your environments. We're going to randomly sample 100 of them or 1,000 of them. Give me all the controls. Show me access control. Show me agent installations. Show me server logs."

And then you move to an operating environment that's programmed, where auto-scaling, or its equivalent in other environments, is bringing the environment up and down. There's no more fixed server. So the sample that existed at noon, when there's lots of traffic, had 1,000 servers, and then it's 900 servers. What happened to the other 100? How do you possibly sample that?
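One way to think about the sampling question is to sample from lifecycle records rather than from whatever happens to be running. The sketch below is illustrative only: `InstanceRecord` and both functions are hypothetical names, and it assumes you already keep launch and termination times for every instance somewhere.

```python
import random
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class InstanceRecord:
    """Hypothetical lifecycle record for one server in an auto-scaled fleet."""
    instance_id: str
    launched: datetime
    terminated: Optional[datetime]  # None means the instance is still running


def population_for_period(records: List[InstanceRecord],
                          start: datetime, end: datetime) -> List[InstanceRecord]:
    """Return every instance that was alive at any moment in [start, end]."""
    return [
        r for r in records
        if r.launched <= end and (r.terminated is None or r.terminated >= start)
    ]


def audit_sample(records: List[InstanceRecord], start: datetime, end: datetime,
                 size: int, seed: int) -> List[InstanceRecord]:
    """Draw a repeatable random sample from the period's full population."""
    population = sorted(population_for_period(records, start, end),
                        key=lambda r: r.instance_id)
    rng = random.Random(seed)  # fixed seed so the auditor can re-run the draw
    return rng.sample(population, min(size, len(population)))
```

The servers that scaled in at 12:30 are still in the population for a noon-to-one audit window, because the sample is drawn from the records and not from the live environment.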

It's indicative of the struggles that I think DevOps organizations are going through, where you used to put code in one place, and someone would manually move it to another place, and those places are no longer there anymore. And the controls and the tests and the audit artifacts you used in previous years have to be retooled to account for the fact that the servers are there sometimes, and sometimes they're not there, and the code used to be here, and now it's done automatically. And that friction can be impactful and slow down the feature flow.

If an auditor is looking for a screenshot, which is sometimes the common denominator and the best and current practice of audit fieldwork even today, or they're looking for samples from logs put into a CSV file and then put into a file folder, sometimes a manila file folder on paper, then how do you move through that?

The policies and the way the controls are written might not accommodate changes to ops. That's the key thing, right? When you're doing your epic, or even your user stories to some extent, and you're building a feature flow, that one epic has to be "pass the audit," right? And you have to treat that audit as a customer, essentially, and then build sprints for each one of the controls or parts of the controls. Or maybe even, if you want to get really granular and it's complicated, maybe even each artifact becomes its own sprint.

You can treat it just the same way, and you can automate it just the same way. Hopefully, you also are educating the auditors, and they're understanding a little bit about how you're doing development and operations.

We have folks who take, I'll talk about it more in a bit, this solution called CloudTrail, and it audits all the API calls that you make to AWS, and it puts them in a JSON file and then puts it wherever you want, in a bucket or a Syslog server or a log management server or something like that.

We actually have a company who's built a process to take the JSON and make a PDF out of it and then put that in a folder. And to me, it's like, okay, you're working your way into the way the control used to be written, that logs were exported and dumped, and evidence of those was put into a folder for an auditor to see and sample from.

But really, they've got the whole thing. There's no need to sample. They've got the census, right? They could say, "Here. Here's the login to Splunk," or, "Here's the login to Kibana," or, "Here's the login. Put in a time search for the period you want to audit, and you'll see all the logs."

There's no need to print samples. And I think the old way of doing that audit and the new way of how operators work and how security is now working can be reconciled.
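The "census, not sample" idea can be sketched as a time-window query over the raw log file. This assumes the CloudTrail log file layout of a top-level `Records` array whose entries carry an ISO 8601 `eventTime` in UTC; the function names are made up for illustration, and a real deployment would more likely run the equivalent search in Splunk or Kibana.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List


def parse_event_time(value: str) -> datetime:
    """Parse a timestamp such as 2015-10-01T12:00:00Z into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def events_in_period(log_file_text: str, start: datetime, end: datetime) -> List[Dict]:
    """Return every record with start <= eventTime < end, with no sampling."""
    records = json.loads(log_file_text).get("Records", [])
    return [r for r in records if start <= parse_event_time(r["eventTime"]) < end]
```

The auditor picks the period; the query returns everything in it.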

Tides. Galileo, he's really smart. He discovered moons and all kinds of stuff. But he was wrong sometimes. He used to think that tides were caused by the rotation of the Earth around its axis. We know that's not true anymore. But that's an indication that it's okay to iterate, right?

Even though you're a big advocate, and you're probably right about the Earth going around the sun. It's okay to be wrong. I think failing fast, and that's why treating audit artifacts as sprints and quickly exiting if you don't have the right evidence and working with your auditors along the way, not waiting for the annual audit, makes it okay to iterate.

One of my favorite powdered-wig wearers is John Harrison. This guy is pretty much responsible for solving longitude at sea. He's kind of a hacker, right? Built his first clock when he was 20.

Prior to that, ships would just get lost at sea all the time because they'd use methods like dead reckoning, and they had these things called knot logs, K-N-O-T, and they'd tie knots around logs, or they had these little boards. They would track the speed of the ship, and they could do latitude, but they couldn't do longitude, so they'd wreck and crash and get lost.

There was a Longitude Act of 1714 because it was so important to the king and to English society that we not lose ships at sea anymore. So there's a prize awarded for it.

He basically did his first unit test in 1736. He built a clock, a big clock that went on a ship at sea, and it was supposed to keep accurate time of the place where you left from, and you could determine the time where you were by other calculations, and you could compute the difference, and then you'd figure out where you were in terms of longitude, right? So you wouldn't be lost anymore.
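The arithmetic behind the clock method is short enough to write down. The Earth turns 360 degrees in 24 hours, so each hour of difference between local time and home-port time is 15 degrees of longitude. The function names and the sign convention (east positive) are choices made for this sketch.

```python
DEGREES_PER_HOUR = 360 / 24  # the Earth turns 15 degrees every hour


def longitude_from_clocks(local_solar_hours: float, home_port_hours: float,
                          home_port_longitude: float = 0.0) -> float:
    """Longitude in degrees (east positive) from two simultaneous clock readings.

    local_solar_hours: local time found by observation, e.g. 12.0 at local noon.
    home_port_hours:   what the ship's clock, still on home-port time, shows.
    """
    hours_ahead = local_solar_hours - home_port_hours
    return home_port_longitude + hours_ahead * DEGREES_PER_HOUR


def longitude_error_degrees(clock_error_seconds: float) -> float:
    """Longitude error caused by a clock that has drifted by this many seconds."""
    return clock_error_seconds * DEGREES_PER_HOUR / 3600
```

If it is local noon and the clock shows 14:00 at home, the ship is 30 degrees west of home. A clock that has drifted by one minute puts the position off by a quarter of a degree, which is why the accuracy of the timepiece was the whole problem.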

His next sprint was five years later, and his clock was called H2. Seventeen more years for H3, and he almost gave up until he tried a watch. The time was pretty accurate. It was better than the current methods of dead reckoning. Pretty much way better than the other things that the non-hackers were trying to do, the aristocrats and the Astronomer Royal and everything.

He eventually tried a watch, a much smaller timepiece. Six years to build his first sprint, basically, or his first feature, and the sea trial was in 1761.

In 1765, one of his competitors, who was a fan of the lunar calculation method for longitude, which was totally wrong, was now the Astronomer Royal. So he basically had to get the approval for proving longitude at sea from one of his previous competitors, who was Astronomer Royal of England, and not really a mechanic or an engineer or a hacker, a DevOps person.

And the Astronomer Royal said, "Oh, that's just luck. It was accurate because of luck. Go do another sea trial." His son did the sea trials because at this point, he was quite old.

What can we learn from him? Makers win. I think we all know that. That if you're just sitting in a room thinking about the right way to do operations or the right way to do an audit, the makers are clearly going to be the ones that win.

Defer to expertise, not authority. Gene and I were having a conversation about this pretty recently. We've talked about it a bunch. The authority at the time, people like Isaac Newton, believed that lunar calculation was the right way to determine longitude at sea.

It's a clear indication that even when there's a policy statement, or we've done things this way in the past, or experts say that we can do these things to pass an audit, you defer to the makers. The operators who are actually shipping code and operating the environment know how it's controlled.

And we see this all the time in public cloud deployments that use DevOps methodologies: it's actually much more secure than the previous ways they've done it. There are fewer humans involved, so there's more trust in the systems. They're immutable. The audit transparency and audit logging with Slack and other systems like HipChat, and the way that check-ins are done, all have code commits. That security, that visibility and transparency and auditability is much, much, much stronger than the way code used to be shipped and the way operations used to be done.

But if you defer to authority and say, well, code has to be put in this folder, and then somebody has to manually move it, there's a lot more room for error. There's a lot more room for security flaws. So deferring to expertise, not necessarily to authority.

Lastly, optimism pays off, right? We looked at his first clock he built when he was 20. He was 80 years old in 1773 and finally received more money for his work. The Longitude Award was never given, but his optimism paid off. He actually solved a massive problem for naval navigation or maritime navigation, and optimism paid off. He's now recognized as one of the biggest contributors to marine navigation in the history of the world.

So that could be you for DevOps, right?

Lastly, don't get lost at sea. You want to automate the navigation. Wherever there's a control that's audited and is essential for your control effectiveness, you want to basically try to automate that. I think that goes without saying in a room full of DevOps folks, but we still see people doing manual things, and it's quite possible now to program the infrastructure, program the deployments.

I love talking about this because you go back to the compliance folks and the security folks, and then you have the development and operations folks sitting together. Deriving engineering from regulations is one of the hardest things we run into. It creates probably the most friction.

I love the example of the Security Rule for HIPAA. If you haven't looked at the Code of Federal Regulations, which I can't imagine that you would have, it's the government regulations. It's pretty exciting stuff.

The Security Rule for HIPAA is located at 45 CFR Part 160 and Part 164, Subparts A and C specifically. Now, how do I get engineering from this, right? How far down, how many licks do I have to do? How far down do I have to go before I can build to the Security Rule, specifically?

So you go to Title 45, which is public welfare. Subtitle A is health and human services. Subchapter C, administrative data standards. So up top is kind of where the senators and representatives play. Maybe down here, we start to get maybe a privacy officer or maybe a lawyer.

General administrative requirements. There's some of the security rules in there. Security and privacy. Okay, now we're at the security team. The small font's kind of for effect.

But security standards for the protection of ePHI. We still don't have anything we know how to engineer to. Administrative safeguards, physical safeguards. Aha, technical safeguards, way down there in Section 164.312. Still nothing to really build to. Ah, audit controls. So there we are. We've got 164.312(b), the standard for audit controls.

Still not really specific if I'm building a code pipeline or I'm deploying an application that's going to process, store, or transmit protected health information, or PCI data, or banking data with personal information. Could be clinical trials, could be anything. But in this case, we'll talk about PHI.

What does this say? Implement hardware, software, mechanisms that record and examine activity.

So when you're actually audited, it would be the Office for Civil Rights. They have an audit protocol. This is actually the most specific thing that there is in terms of guidance. It's the audit protocol.

You have to determine activities that will be tracked or audited. If you're a developer building a system that processes PHI, you're probably not reading the audit protocol, right? This is actually somewhat specific. So it actually calls out some things, or the things that you have to do.

You have to determine the activities, you have to do documentation, you have to determine whether audit controls have been implemented, you've got to select tools, you have to implement those audit capabilities, and you have to review and capture the appropriate audit information. So those are pretty good requirements. We can build and engineer to that.
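As a rough illustration of "we can build and engineer to that," the steps above can be written as a list of requirements, each paired with an automated check. Everything about the `env` dictionary (its keys, the 30-day default review interval) is an assumption invented for this sketch; the requirement wording paraphrases the talk, not the official audit protocol text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Requirement:
    text: str                       # what the audit protocol asks for
    check: Callable[[Dict], bool]   # hypothetical automated test of that ask


AUDIT_CONTROL_164_312_B = [
    Requirement("Determine the activities that will be tracked or audited",
                lambda env: len(env.get("audited_activities", [])) > 0),
    Requirement("Document the audit controls",
                lambda env: bool(env.get("documentation_url"))),
    Requirement("Select the tools used for auditing",
                lambda env: bool(env.get("logging_tool"))),
    Requirement("Implement the audit capability",
                lambda env: env.get("logging_enabled") is True),
    Requirement("Review and capture the appropriate audit information",
                lambda env: env.get("last_review_days_ago", 10**6)
                <= env.get("review_interval_days", 30)),
]


def failing_requirements(requirements: List[Requirement], env: Dict) -> List[str]:
    """Return the text of every requirement whose check does not pass."""
    return [r.text for r in requirements if not r.check(env)]
```

A list like this gives compliance and engineering one shared artifact to argue over: the left column is the regulator's language, the right column is the test.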

This is the conversation that we expect to be happening between the compliance and regulatory and audit folks and the security team, and maybe the DevOps team if you're deploying in a regulated environment.

With regard to AWS, and I'll be specific here, certainly this doesn't cover the application stack, but there are areas within the use of AWS's services where, if you made an API call and you copied a snapshot of an Elastic Block Store volume, or you deleted a bucket or something in S3, our Simple Storage Service, our object store, that could have an impact on PHI.

You could lose PHI, you could disclose PHI, or inappropriately use PHI. CloudTrail is our audit service that tracks all the API calls. Every platform has some way of auditing the use of different APIs. But in this case, these are the activities and the core services that could be used to move around PHI that you want to be auditing.
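The "activities you want to be auditing" can be kept as an explicit watch list. The two entries below are the examples from the talk (copying an EBS snapshot, deleting an S3 bucket), matched on the `eventSource` and `eventName` fields of CloudTrail records. The list is deliberately incomplete and the function name is hypothetical; a real list would come out of the conversation with compliance.

```python
from typing import Dict, List

# Illustrative, incomplete watch list of (eventSource, eventName) pairs.
PHI_RELEVANT_CALLS = {
    ("ec2.amazonaws.com", "CopySnapshot"),
    ("s3.amazonaws.com", "DeleteBucket"),
}


def flag_phi_relevant(records: List[Dict]) -> List[Dict]:
    """Return the audit records whose API call is on the watch list."""
    return [
        r for r in records
        if (r.get("eventSource"), r.get("eventName")) in PHI_RELEVANT_CALLS
    ]
```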

So how do you prove to an auditor that you're actually doing this, that you're actually recording the activity, that you're actually doing what you say you're going to do?

We have CloudTrail. This is the output of a command line, but you could just as easily do this with an SDK as part of a code pipeline of some kind and capture the output of this and check it into a code repository or an audit repository or a GRC tool or something. But this is actually the proof. This is the prove it.

Whatever platform you're using, whether it's AWS or not, there's something automated that you can do and capture with actual granular detail. The very bottom of that has a log group that defines exactly what log group all these logs go into. Pretty simple. It's good audit documentation.
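Here is a minimal sketch of the SDK version of that capture. It assumes a boto3-style CloudTrail client whose `describe_trails()` call returns a `trailList`; the client is passed in so the function can be exercised without credentials. The evidence record layout (timestamp, source label, SHA-256 digest) is an invention for this example, not an AWS or GRC-tool format.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def capture_trail_evidence(cloudtrail_client: Any,
                           now: Optional[datetime] = None) -> Dict:
    """Snapshot the trail configuration as a timestamped, hashed evidence record."""
    trails = cloudtrail_client.describe_trails().get("trailList", [])
    # Canonical JSON so the same configuration always produces the same digest.
    body = json.dumps(trails, sort_keys=True, indent=2, default=str)
    return {
        "captured_at": (now or datetime.now(timezone.utc)).isoformat(),
        "source": "cloudtrail:DescribeTrails",
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "trails": trails,
    }


# Possible use inside a pipeline step (requires boto3 and AWS credentials):
#   import boto3
#   evidence = capture_trail_evidence(boto3.client("cloudtrail"))
#   print(json.dumps(evidence, indent=2))
```

Checking the resulting JSON into a repository on every pipeline run gives the auditor a dated history of the control's configuration instead of a single screenshot.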

You might also have an architecture that looks something like this if you're on our platform or any logging ingest platform. A lot of customers do this. They have agents on hosts, they have CloudTrail, they push that to a logging service called CloudWatch. They push it through Kinesis, and they put it in the ELK stack, Kibana, or they put it in Elasticsearch that's managed by us or something.

So how do you document all this if this is your tool chain? In an old world, a lot of this ran on servers. Now a lot of it's just public APIs that you can call for Kinesis and CloudWatch Logs. There's no infrastructure for you. How do you document it?

I work with a lot of our healthcare customers. And a lot of this is describing the log groups. All of this is something you can do through a command line. So you compare the key activity that you're auditing with the way you technically implement that through code, and we encourage everyone to do that.

Go back, talk to your compliance and audit folks. If you're in compliance, talk to your developers about how you're implementing these things in a cloud platform or in a continuous integration pipeline or something. But this documentation, the stuff in Courier font, is the artifact. It's the way that DevOps meets audit, and it's a thing you can capture and prove to an auditor that it's being done.

Same thing. We have log subscriptions, and you can put those out with all the details of the log subscription. And finally, what this means is that it's all being pushed into ELK or Kibana or Splunk or anything like that, and if you have the right searches and you have the right dashboards built, you can actually put the audit control and the statement from the HIPAA Security Rule into a dashboard along with the evidence.
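The same pattern works for log groups and their subscriptions. This sketch assumes a boto3-style CloudWatch Logs client with `describe_log_groups()` and `describe_subscription_filters(logGroupName=...)`; it ignores pagination for brevity, and both function names are hypothetical.

```python
from typing import Any, Dict, List


def subscription_evidence(logs_client: Any) -> Dict[str, List[str]]:
    """Map each log group name to the destinations its logs are streamed to."""
    evidence: Dict[str, List[str]] = {}
    for group in logs_client.describe_log_groups().get("logGroups", []):
        name = group["logGroupName"]
        filters = logs_client.describe_subscription_filters(
            logGroupName=name).get("subscriptionFilters", [])
        evidence[name] = [f["destinationArn"] for f in filters]
    return evidence


def unshipped_log_groups(evidence: Dict[str, List[str]]) -> List[str]:
    """Log groups with no subscription, i.e. logs that never reach the dashboard."""
    return sorted(name for name, destinations in evidence.items() if not destinations)
```

An empty result from `unshipped_log_groups` is itself a piece of evidence: every log group is wired into the search platform the auditor will be looking at.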

To me, that's the modern version of a manila file folder. It is the system that's doing the auditing, and it also aligns the control along with the evidence. And next to this, you could have a link to GitHub that has all the configuration artifacts that actually do this end-to-end.

This is just one control, and there are hundreds and hundreds of controls in most organizations. If you're shipping code in a DevOps way and you're running your environment that way, it's pretty straightforward to have this be the new version of the audit artifact and cross that last mile and prove it.

Similar controls, and this is very granular just to auditing, but there's access control and crypto and everything else. You want to track and monitor. In PCI, you've got 10.1 through 10.8, NIST 800-53, AU-1 through AU-16, ISO 27001, A.10.10. They all have audit requirements. And being able to prove it and align to those control frameworks and pass those audits, you have to basically have automation around that.
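Because the frameworks overlap, one automated artifact can be tagged against all of them. The mapping below uses only the control ranges named above plus the HIPAA citation from earlier in the talk; the function and the evidence identifier are hypothetical.

```python
from typing import Dict, List, Tuple

# Audit-logging controls as cited in the talk, one list per framework.
AUDIT_LOGGING_CONTROLS: Dict[str, List[str]] = {
    "HIPAA Security Rule": ["164.312(b)"],
    "PCI DSS": [f"10.{n}" for n in range(1, 9)],        # 10.1 through 10.8
    "NIST 800-53": [f"AU-{n}" for n in range(1, 17)],   # AU-1 through AU-16
    "ISO 27001": ["A.10.10"],
}


def tag_evidence(evidence_id: str) -> List[Tuple[str, str, str]]:
    """Return (framework, control, evidence_id) rows for a GRC tool or dashboard."""
    return [
        (framework, control, evidence_id)
        for framework, controls in AUDIT_LOGGING_CONTROLS.items()
        for control in controls
    ]
```

Collect the evidence once, then let each auditor filter the rows by the framework they care about.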

Additional resources. I'm definitely going to call out the DevOps Audit Defense Toolkit. I think it's a beautifully documented control for a code pipeline. It talks about Jenkins. It's super granular. It's one of the best examples I've seen of how people ship code these days, enforcing separation of duties, enforcing an audit trail, and it's a well-documented control for that piece of your architecture that would have to pass an audit.

We have a couple of presentations, one from Jason at Netflix talking about how they do their audits. They built some bespoke tools to do it. We have secure by design templates for AWS, and then the three books that I've referenced throughout the talk: Fermat's Enigma, Longitude, and Galileo's Daughter. They're pop science books, but I think they're great engineering stories as well.

Thank you for your time. I'm Bill Shinn, or Packet 791. Thanks a lot.