Compliance and Audit Readiness: The DevOps Killer?
Ann Marie Fred has worked at IBM as a Software Engineer since 1998, and a manager since 2015. She has a Bachelor of Science in Computer Science degree from Duke University, and a Master's of Computer Science degree from the University of North Carolina at Chapel Hill.
She worked on the first DevOps-focused team at IBM in 2011, and currently works in the IBM Marketplace organization, where development squads deploy dozens of changes per day to production, monitor their own components, and support them.
Full transcript
Ann Marie Fred
I'm Ann Marie Fred. I'm a senior engineering manager at IBM.
Just by a show of hands, think about the software development teams that you work with on a regular basis. How many would you say are deploying software at least once per year?
All right. How about once a month? Once a week? All right, some came down. Once per day?
All right. So now imagine, and maybe it's not so hard to imagine, that you've built a compliance and audit readiness culture and processes around deploying maybe a few times per year, and suddenly you're deploying several times per day. What kind of pain would you experience? And that's what my talk is about today.
I'm going to talk about myself and the group that I work in, a bit of history, and our challenges around compliance and audit readiness. Then I'll talk about why DevOps and continuous delivery can make these a little bit more difficult, and then go into several aspects of compliance. There's a pattern to what we learned, and finally, how you can help me.
There's my family. I have a husband and two daughters. We love to travel. My husband and I love to go scuba diving. I'm an avid reader. I have a Master of Science in Computer Science from the University of North Carolina, and I worked for 17 years as a software engineer at IBM, and then the last three as a manager.
It's very important to note I am not an attorney, I am not a compliance expert, I am not a consultant. You get what you pay for. These are my personal memories and opinions, and what I hope you will do is take them back, anything that you find interesting, and run them by your own people and see if they're interested in trying it.
I work in the Digital Business Group. IBM has about 350,000 employees, and several hundred of them are in the Digital Business Group. What we do in our group is manage IBM's digital presence worldwide, and that includes things like websites, pricing information, checkout, provisioning software-as-a-service offerings when you order them, search engine optimization, analytics, developer outreach programs like developerWorks, and even conferences and events. So as you can see, we do a great deal of customer-facing work, but we don't sell any products ourselves.
If you look at my reporting chain, you'll see IBM and then the Digital Business Group. There are about 75 squads in DBG. If you're not familiar with a squad, it's basically an autonomous team with a clear mission, and they have everybody they need on that team to deliver on that mission. They have their own business owners, designers, developers, project managers, and so on. Myself, I am a manager for four squads. I won't name them here, but basically, between these four squads, we're responsible for roughly 150,000 of the webpages on ibm.com.
Everybody has to care about compliance and audit readiness, but in large enterprises, we have to do it at scale, right? We have thousands of applications and services in IBM, and for each one of them, we have to ensure compliance. We have to worry about all these things, right? And it can get a little bit overwhelming. I can't give you a how-to guide for all this in half an hour, but I can tell you about the things that are special for DevOps and continuous delivery.
Some things that make this interesting. First of all, we have very frequent deployments. So any process that relies on doing something before you release is not going to work very well with continuous delivery.
We also have very short-lived services. So anything that's very heavyweight or cares about the specific IP address or location of a service or anything like that, it will need to change.
Third, we have very few technical gates to production. Anybody can go get a free trial on some public cloud and deploy an application out there without asking you.
Finally, we've blurred the lines between developers and operations. So those responsibilities that your operations team used to have, if you don't have a separate operations team doing the deployments, you have to teach the developers how to take those on.
Think about something with me. If your CEO asked you today, "What applications and services are we running right now?" would you be able to answer that question? How long would it take you to answer the question? And how accurate would it be? How many things are deployed out there that you don't even know about necessarily?
What you need is an application and service registration system, and we call ours the Enterprise Application Library. It includes information like the system application name, the business and engineering owners, and other basic information.
If I could change one thing about our library, it would be this. We had this library, and people saw this as a perfect opportunity to make sure that systems were compliant at their first release. So they said you had to register before you could release, and you had to be compliant before you could register.
The problem is that meant for some applications, like those that process personal data, it was taking six to eight weeks to register the application. And I think this is a mistake. You should make it very easy to register an application and then follow up on the compliance immediately after that.
Because what happens is developers are like water, and if you make something difficult, they will find a way around it, right? So people were very resistant to registering the applications. Make it easy, and then make it very clear what your guidelines are for which applications need to be registered with your registration system.
Finally, you want to be very clear about who is personally responsible for making sure that those applications are registered. And for us, that's the business owners and our HR managers.
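As a minimal sketch of the "register first, follow up on compliance afterwards" idea, the registry below accepts an application as soon as it has a name and accountable owners, and tracks outstanding compliance items separately. The class names, field names, and sample data are assumptions for illustration; they are not the actual schema of the Enterprise Application Library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AppRecord:
    """One registry entry; every field name here is hypothetical."""
    name: str
    business_owner: str
    engineering_owner: str
    # Compliance work is tracked after registration, not required up front.
    open_compliance_items: List[str] = field(default_factory=list)


class Registry:
    def __init__(self) -> None:
        self._apps: Dict[str, AppRecord] = {}

    def register(self, record: AppRecord) -> None:
        # Registration only needs a name and accountable owners.
        if not record.business_owner or not record.engineering_owner:
            raise ValueError("every application needs named owners")
        self._apps[record.name] = record

    def running_apps(self) -> List[str]:
        # Answers "what are we running right now?" from registered data.
        return sorted(self._apps)

    def needs_follow_up(self) -> List[str]:
        # Applications that still have compliance items to close out.
        return sorted(
            name for name, rec in self._apps.items() if rec.open_compliance_items
        )


registry = Registry()
registry.register(AppRecord("checkout", "alice", "bob", ["privacy assessment"]))
registry.register(AppRecord("search", "carol", "dave"))
print(registry.running_apps())
print(registry.needs_follow_up())
```

Keeping the compliance items on the record, rather than as a precondition for creating it, is what lets registration stay quick while the follow-up stays visible.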
One of the very first questions on that registration form is, what is your business continuity value? For this, we ask people to assess how critical the application or service is to IBM's business. We ask them to think about things like, if your service is down, will we lose money, cause any irreparable harm? Will we break our contracts or service-level agreements? Will we even harm our company's reputation or anger our customers?
Depending on your answers to these questions, you'll get a score, one to four, of what your business continuity value is. If you have a high BCV score, that indicates the need for more caution in how you deploy your service. So for those applications, we need to have off-site data backups, a disaster recovery plan that you've actually practiced and tested, an IT support workforce continuity plan, and so on.
Now, web and application security is obviously critical. You don't ever want to be in a situation where at any time your systems are exposed to hacking. But again, with frequent deployments, we can't rely on manual processes to enforce this.
Fortunately, there's a whole field of study in this now. It's called DevSecOps. There's a conference on it here in two days. You should go.
But I did want to mention one thing that was particular to compliance, which is this GDPR secure-by-design requirement. Your applications need to be secure by design. And what does that mean? Well, it's an evolving area, and I think that we're growing in our understanding of what that means to us as well. But here are a few things that we're doing to make our applications secure by design.
The first is education. Everybody in the company gets annual IT security education, and we're actually tracking that they complete that. So we have just a general level of familiarity with IT security across the company.
Secondly, as a best practice, we want to have a security focal in each squad, somebody who's had a moderate amount of security training so they know what some of the common attacks are. What is a man-in-the-middle attack? What are the things that I can do wrong on my web forms that'll make me vulnerable to hacking?
Third, we have experts who are our security architects, and they work across several squads, and that's their life's work, IT security. And I'll show you in a second here how they get involved.
Another thing is that we do have a pretty extensive IT security standards checklist for your first deployment into production. This is a set of a couple dozen questions, and what we're doing is asking you to fill this out in order to educate the security architect on how your application works and how it's secured. Then you send that to the architect, and they review it with you, and together you come up with a set of remediation steps that you need to take, and you get those completed, and then everybody signs off that it's good at the first deployment.
Then we have to maintain our security on an ongoing basis. One thing that we use that I don't know if it's unique is in our story templates, we're starting to ask people, "Have you considered the security implications of this change before you make it?" So this is early on in the planning process, before you write a line of code. Just asking people to spend 10 seconds thinking that through.
Secondly, in our code reviews, we've trained our developers to think, "Oh, does this have any security implications? This code change that I'm about to put in, has somebody done something silly like check a private key into our source code repository?" That never happens.
So these are two good ways that we maintain this on an ongoing basis.
Another kind of fun thing that we do is periodic external penetration testing. We bring in outside consultants who try to hack into our systems on a regular basis, and they write up bug reports for any vulnerabilities that they find or even potential vulnerabilities. And then they don't just throw it over the wall. The nice thing is they stay with us and help us fix them.
Finally, and perhaps most importantly, is security automation tools. We have two classes of automation tools that are used pretty broadly.
One is static code analysis tools, and these are able to actually process any of a number of different programming languages and find common vulnerabilities or mistakes that people make in their code. These can run in every build, and they can fail your build and prevent a deployment that would make your application less secure.
We also have web crawlers, like IBM AppScan Web, that run against our production servers on a frequent basis, maybe daily. And they are checking for common exploits and hacks. Again, they can create a report and automatically open a defect against us, and we can fix that very quickly.
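To make the "fail the build" idea concrete, here is a minimal sketch of one very narrow static check: looking for a PEM private key header in source files and returning a non-zero exit code when one is found. It is an illustration only, not how any particular static analysis product works; the function names and the sample file contents are assumptions.

```python
import re
from typing import Dict, List

# Matches PEM headers such as "-----BEGIN RSA PRIVATE KEY-----"
# or "-----BEGIN PRIVATE KEY-----".
PRIVATE_KEY_PATTERN = re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----")


def find_private_keys(files: Dict[str, str]) -> List[str]:
    """Return the paths whose contents contain a PEM private key header."""
    return sorted(
        path for path, text in files.items() if PRIVATE_KEY_PATTERN.search(text)
    )


def build_exit_code(files: Dict[str, str]) -> int:
    """Non-zero means the build should fail and block the deployment."""
    return 1 if find_private_keys(files) else 0


# Hypothetical repository contents, keyed by path.
sample = {
    "app/server.py": "print('hello')\n",
    "deploy/id_rsa": "-----BEGIN RSA PRIVATE KEY-----\n(key material omitted)\n",
}
print(find_private_keys(sample))
print(build_exit_code(sample))
```

A real pipeline would read the files from the checkout and call `sys.exit(build_exit_code(...))`, so the deployment stage never runs when the check fails.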
Just a definition: access control is the selective restriction of access to a resource, and permission to access a resource is called authorization.
In a DevOps world, some of the access controls that we see frequently are API keys, IDs, and passwords. Just a couple of rules of thumb that we find useful: we want to use individual credentials for any manual action, so that you can trace who made a change, and this is great for audit or fraud detection. You never want to share a password, even amongst a team, because you lose that auditability and traceability.
In cases where it makes sense for something to last a long time, if people come and go, functional IDs are a really good answer there. We also have API keys that can be either long-lived or short-lived, depending on the account they come from. We actually have our managers set up the functional IDs and own those, and then they will just encrypt the secrets before they put them into our deployment pipelines.
A couple of things are special about GDPR. It actually requires you to have solid access controls. It requires you to limit who has administrative access to your applications. You have to have a plan for what you're going to do in case of a fraud or a security breach, and how you can respond quickly, and you need to be able to revoke access quickly when it's no longer needed. So you should have a documented process for every time somebody leaves the company: how do you revoke their access?
A global privacy assessment is one of the few things that we do require people to do before delivering to production. And what this is, is just another questionnaire about what personal data you collect. How do you process it, how do you store it, who uses it and why? Which is very important with GDPR. What's the purpose of the processing, the access controls you've put into place, what countries are involved in storing or processing the data, and so on.
The answers to this questionnaire are then reviewed by our legal and privacy experts in various countries to make sure that we're not breaking any local laws. The output of this, just like our IT security testing, is a series of actions that are required to comply with the laws, including GDPR.
Oh, I meant to mention here too, there is a way to make this easier. We do have two fast paths through the global privacy assessment. One is for applications that don't store or process any personal data at all, and another one is for applications that only process a very limited type of personal data, which I've heard called pseudonymized data, which is basically data that's not personal in nature, but it's a reference to a person.
So this is something like an IP address or maybe an internal ID number that we use to identify a person that does not equal their email address. So if that's all you have is maybe some IP addresses in your logs, there's a fast path through this assessment, and you can get through it in a day or two. This is the thing that was taking six to eight weeks, if you do process personal data.
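The routing between the two fast paths and the full assessment can be sketched as a small decision function. This is a sketch under stated assumptions: the data type labels and the returned path names are hypothetical, and the only pseudonymized types listed are the two examples mentioned in the talk (IP addresses and internal ID numbers).

```python
from typing import Set

# The two examples of pseudonymized data given in the talk; labels are assumed.
PSEUDONYMIZED_TYPES = {"ip_address", "internal_id"}


def assessment_path(data_types: Set[str]) -> str:
    """Pick a privacy assessment path from the personal data types an app handles."""
    if not data_types:
        # Fast path one: no personal data stored or processed at all.
        return "fast path: no personal data"
    if data_types <= PSEUDONYMIZED_TYPES:
        # Fast path two: only references to a person, nothing personal in nature.
        return "fast path: pseudonymized data only"
    # Anything else goes to the full review by legal and privacy experts.
    return "full assessment"


print(assessment_path(set()))
print(assessment_path({"ip_address"}))
print(assessment_path({"ip_address", "email"}))
```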
All right, GDPR. I hear GDPR is very popular in Europe. For those of us in the United States, it was more like, "Oh, God, GDPR," because we're getting all the work and very little of the benefit.
Although, I will say we're getting more of the benefit than we thought initially because everybody's had to batten down the hatches, tighten up their security policies, and so we are benefiting from that as well.
Now, to be GDPR compliant, the good news is, if you look at what we talked about earlier, if you do all those things, it sets you up very well for GDPR compliance. You need accountable, easy-to-find owners for your applications and services, and that's your application registry. You need to have clear, documented security standards and compliance with sign-offs that those are completed. You need to have strong access controls. You need to do global privacy assessments. And you need to have audit-ready documentation in case somebody comes and claims that you're not processing their personal data or controlling it correctly.
Now, another thing that I don't have on my slides, but I did want to mention, is that you also need to ensure that the third-party services you're working with are themselves GDPR compliant. Like anybody who's a processor for us, we want to make sure they're GDPR compliant. We also want to make sure they're up to our IT security standards.
I don't know if this is the way everybody does it, but we do it through our procurement process. So we are not allowed to pay for a third-party service unless they have agreed to abide by our standards. And this can make procurement take a longer time, but for us, it's worth it. We renegotiated so many contracts because of GDPR, and I think many people did as well.
Another thing that's kind of interesting about GDPR is the data subject access requests. DevOps makes this a little bit more difficult, especially microservices. When you have many services with many small databases, you can end up with a proliferation of personal data repositories. How many of them are storing a copy of some data from the user's profile or something like that so they don't have to look it up again later?
To address this, the first step was really to identify where our personal data repositories were, and we started with our application registry and went from there. We said, "Okay, here are a series of questions that will tell you if you're a personal data repository or not, yes or no. And if you are, you're going to participate in the DSAR process when it comes out."
Well, this was a pretty powerful motivating factor for people to get rid of extra personal data repositories. A couple of the ways we did that: one is we took the profile data and we centralized it in one place, and we said, "If you can, please rewrite your application so instead of storing a copy of somebody's profile data, you look it up from the profile service every time."
And furthermore, on the profile APIs, they are asking you what is the purpose for which you are going to use this data. The profile service is connected to the consent service, so it can look up what kinds of processing each customer has consented to, and they will only send you the data that you're allowed to use for that purpose.
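A minimal sketch of that purpose-based lookup follows, using in-memory dictionaries as stand-ins for the profile and consent services. Everything named here (the purposes, the field lists, the `get_profile` function, the sample user) is an assumption for illustration, not the real API.

```python
from typing import Dict, Set

# Hypothetical stand-in for the central profile store.
PROFILES: Dict[str, Dict[str, str]] = {
    "u1": {"email": "pat@example.com", "country": "US", "phone": "555-0100"},
}

# Hypothetical stand-in for the consent service: purposes each user agreed to.
CONSENTS: Dict[str, Set[str]] = {"u1": {"order_fulfillment"}}

# Assumed mapping from a declared purpose to the fields that purpose needs.
PURPOSE_FIELDS: Dict[str, Set[str]] = {
    "order_fulfillment": {"email", "country"},
    "marketing": {"email", "phone"},
}


def get_profile(user_id: str, purpose: str) -> Dict[str, str]:
    """Return only the fields the caller may use for the declared purpose."""
    if purpose not in CONSENTS.get(user_id, set()):
        # No consent for this purpose: send nothing back.
        return {}
    allowed = PURPOSE_FIELDS.get(purpose, set())
    profile = PROFILES.get(user_id, {})
    return {key: value for key, value in profile.items() if key in allowed}


print(get_profile("u1", "order_fulfillment"))
print(get_profile("u1", "marketing"))
```

Because callers fetch the data each time instead of keeping a copy, a change in consent takes effect on the next lookup.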
Many of our services did that, and then they deleted their copies of the data. We also had many services that maybe had personal data for some kind of fluffy function that we really didn't care about anymore, and so we just got rid of some features. And we even shut down entire services because they would have been too difficult to remediate. This kind of dovetailed nicely with this server consolidation project that we were in the middle of.
Again, we have a fast path for a few cases for the data subject access requests. It's for that pseudonymized kind of data, those IP addresses and those internal ID numbers. It wouldn't be very helpful if you asked a company, "What data do you have on me?" and then we said, "Well, what's your IP address?" No. Or in fact, even if we tell them what their internal ID number is, it's not very helpful to them. So it's more helpful for us to say, as a blanket statement, "If you visited our website, we've logged your IP address, and we use your internal ID number across our systems to track your sessions," and so on.
The other thing that makes that easier is that we consider those to be types of data that don't require consent, because we need to keep your IP address in order to keep our system secure, to prevent a denial-of-service attack, to respond to fraud or security problems.
Separation of duties is another interesting one. This is the practice of having more than one person who's required to complete a task, and its intent is to prevent fraud and error. But with DevOps, you might not have an operations team to separate your duties to, right? It's the same people.
So separation of duties is required for some sensitive applications by law and by best practice. Things like healthcare data do actually still have separation of duties. But for things like websites, web servers, which is a lot of what we run, it's overkill.
For us, it's more about the spirit of the law. How are we going to prevent fraud and errors without actually having many people involved in the deployment process?
First of all, we want to avoid breaking changes, and the most important way to do this in continuous delivery is with really good automated testing coverage.
Secondly, we have code reviews as another way that we avoid breaking changes. You have to get another committer/owner of the application to agree to the change in the first place.
Then we have accountability and traceability. If you remember, we talked about using individual credentials where it makes sense, or functional IDs where it's automated in a build pipeline, so we can see which ID was responsible for each change. We also don't allow manual changes to our production systems at all in our DevOps environment. It's not possible to log into those systems and accidentally make a change because we shut off the remote access.
Doing all this gives us traceability because we can see all the changes through our source code management system, which in our group is generally GitHub Enterprise. You have to make a change to the configuration management code in order to make a change to the production systems.
Third is instead of preventing all errors, we just want to have quick recovery from errors. And we do that in a couple different ways. Whoops. Okay, I'll just skip that.
One is good monitoring, right? So we don't want to wait until a customer reports a problem before we resolve it. We want to have monitors that are fairly sophisticated. For example, on our web servers, we don't just have ping tests making sure the host is up. We also have tests that display the page and check for certain words on the page, and we have visual regression checker tests. I don't know if you've ever seen those, but it's actually an image of the page, and it will raise an alert if the page has changed, so you can make sure that that was intentional, and so on. And of course, we have our security checks on a regular basis.
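The "check for certain words on the page" kind of monitor can be sketched as a pure function over a response that has already been fetched. The function name, the sample HTML, and the expected words are assumptions; a real monitor would fetch the page on a schedule and raise an alert when the returned list is not empty.

```python
from typing import List


def check_page(status: int, body: str, required_words: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the page looks healthy."""
    problems: List[str] = []
    if status != 200:
        # Roughly what a simple "is the host up" test would catch.
        problems.append(f"unexpected status {status}")
    lowered = body.lower()
    for word in required_words:
        # A page can return 200 and still be an error page or blank.
        if word.lower() not in lowered:
            problems.append(f"missing expected text: {word}")
    return problems


healthy = "<h1>Marketplace</h1><a href='/cart'>Checkout</a>"
print(check_page(200, healthy, ["Marketplace", "Checkout"]))
print(check_page(200, "<h1>Something went wrong</h1>", ["Checkout"]))
```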
Blah, blah, blah.
All right, on to accessibility. Accessibility is the design of products, devices, services, or environments for people who experience disabilities. IBM takes accessibility very seriously. We follow the standards from bodies like the World Wide Web Consortium and its Web Accessibility Initiative, and then we have our own accessibility standards on top of that. And we don't just require that the applications that we sell for government bids are accessible. Our internal standard is that all of our websites will be accessible, and even our internal documentation and training. So this is something that touches all of us who develop software at IBM.
Now, with DevOps, this is a little bit more difficult because again, we have frequent deployments. So in the past, the standard was that you did accessibility tests pretty late in the product development cycle, when the UI was fairly settled. But this doesn't work anymore. So we had to develop processes that made this lightweight and fast.
Fortunately, we have a website that's publicly available to everybody, which is www.ibm.com/able, and I strongly suggest that everybody go out there and take a look at that website. There's a lot of best practices for accessibility, and as you can see, there's actually a tile on that page about how to streamline your agile DevOps processes. So we have open source tooling that's available for everyone to use where you can check your source code and your web pages for accessibility.
Now, automation can't catch everything. The easiest example is probably screen readers. It's hard to tell if something's going to make sense when you read it or hear it through a screen reader, unless you've actually done it.
So we do very intensive accessibility tests the first time an application is deployed, usually actually right after it's deployed, if it's not something we're selling. If it's a website, we'll do it right after it's deployed into production, and we'll maybe spend a week where the whole team is finding any accessibility problems and fixing them.
We also have periodic manual checks, where somebody will go back and spot-check pages and see if they find any problems.
There are two other things that we do, similar to our security checks. One is that we ask people to think about accessibility when they're doing code reviews. Part of your testing of a change to the user interface should be to actually bring it up with the browser plugins like Chrome DAP plugin. I forget what it stands for. But anyway, bring it up in the browser with the plugins that are showing you if the accessibility looks good. Is your contrast good? Can you tab through the page? And things like that. So that's part of our code review process ongoing.
And also the story template, again, just asking people to take a moment to think if there are any accessibility implications to the change they're about to make.
Open source is another fun one with DevOps. Modern package management tools like npm make it trivially easy to pull open source into your project, and they're very widely used in a DevOps environment. Furthermore, if you just pull in one package, you're usually going to end up automatically pulling in several other packages without even knowing that it's happening.
We actually have different processes for software that we sell and internal-use software. And most of what we're doing continuous delivery on is internal-use software, though not all of it.
For software that we sell, we actually have a complete sign-off where they have to list for the release every single package and version and what its license was, and was that approved by our open source standards committee.
For internal-use software, we actually allow teams to self-certify their compliance, and then we give them the tools to make that easier for them.
The first step in an open source process--well, actually step one is to educate everybody in open source. Everybody who touches open source in any way, which includes pretty much all of our developers, but also their managers and project managers and so on, has to take annual open source training so they understand the concepts and what they need to be looking for.
Next, we have code reviews, and we tell our developers that one of the checks that they should go through is, is there a new open source package, and what is its license, and is that one of the licenses that we like to use? Because not every open source license is okay for commercial use. So we have lists of license types that are generally fine, like MIT and Apache, and ones that are yellow flags that we need to have reviewed by our open source experts before we use them, because whether they're okay or not actually depends on how they're used. And ones that are red flags, like proprietary licenses or ones that have gotten us in trouble in the past.
Then we have a database of all the packages and versions that have been pre-cleared, blacklisted, or require a review.
Finally, we have a code scanning tool. This can be automated in the builds, and it recognizes several different programming languages and environments. What it will do is it will automatically find all the packages you're using and all their license files, classify each package by its license type, tell you if it already has a license type that it knows is fine, if it recognizes that specific package version, if you need to contact legal, or what you need to do.
Another useful thing is that, if you configure it this way, any time the tool finds a package it doesn't recognize, it will automatically request a review of that new package or version via a POST request. Within a day or two, our open source team is able to tell you thumbs up or thumbs down on the package. This way, you don't spend weeks building something on top of a new package only to find out that you need to tear it out later.
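The green, yellow, and red triage described above can be sketched as follows. MIT and Apache are the approved examples named in the talk; the yellow-flag entry (`LGPL-3.0`), the license identifiers, and the package names are assumptions chosen only to make the example run, and unknown licenses are sent to review in the same way as yellow flags.

```python
from typing import Dict, List

GREEN = {"MIT", "Apache-2.0"}   # generally fine, per the talk
YELLOW = {"LGPL-3.0"}           # assumption: an example that depends on usage
RED = {"Proprietary"}           # red flags, per the talk


def classify(license_id: str) -> str:
    """Map a license identifier to a triage outcome."""
    if license_id in GREEN:
        return "approved"
    if license_id in RED:
        return "blocked"
    # Yellow flags and licenses the tool does not recognize both need a human.
    return "needs review"


def triage(packages: Dict[str, str]) -> Dict[str, List[str]]:
    """Group package versions (name -> license) by triage outcome."""
    result: Dict[str, List[str]] = {"approved": [], "needs review": [], "blocked": []}
    for name in sorted(packages):
        result[classify(packages[name])].append(name)
    return result


# Hypothetical dependency list as a scanner might report it.
deps = {
    "pkg-a@1.0.0": "MIT",
    "pkg-b@2.3.1": "LGPL-3.0",
    "pkg-c@4.1.0": "Proprietary",
    "pkg-d@0.1.0": "UNKNOWN",
}
print(triage(deps))
```

In a build, anything under "blocked" would fail the pipeline, and anything under "needs review" is what would be sent to the open source team.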
Now, I didn't talk too much about audit in particular. It was kind of woven in there. But one thing that we have for audit specifically is documentation. One thing that's great is to standardize the documentation for all the areas that we talked about as much as possible.
A very simple thing that's working for us is Box folders. Box, it's not an IBM company. It's a third-party company we use that has secure shared cloud storage. We have folders that roughly mirror the organizational structure. So for example, there's one folder for the commerce platform, which is my boss's level, and then there's a subfolder for each squad in that area. And within each squad's folder, they're responsible for gathering up the compliance documentation that they need. Individual pieces of documentation may have additional access controls if they're sensitive, but at least this gives our managers and our project managers a very easy place to go in case they have a request for audit.
Another thing that's useful for audit is to use Git for sign-offs. If you're requiring somebody to assert that they're compliant to a standard, you can actually just set up a README file in a GitHub repository, and they can make a pull request to sign their name to it. And it's a very auditable and traceable sign-off.
So there is a pattern to all these different compliance things that we've done with our DevOps systems.
First, you have to discover what you have today. Get a sense of what's out there right now. What are your applications and services, and what's their current state?
Educate everyone. Educate, educate, educate. Share best practices with each other. We had weekly calls about compliance and GDPR, and we learned a lot from each other as we went through this journey.
You need to identify the responsible parties and hold them personally accountable for doing this.
You need to plan the remediation steps and get them planned as part of your agile process. These were regular stories for us with the known assumption that they were the most important stories for us to get completed quickly.
Give people helpful tools to make it easy for them to get their jobs done.
Report and track progress against your goals.
Remove roadblocks. If you hear people complaining about something that's too difficult, fix it.
And automate to maintain compliance.
How you can help me. We have a DSAR process that we've put together within our own company, and we have a JSON format for these requests. I don't think it's standardized across companies yet, but that's probably a good starting point if anybody else is interested in working on that.
And if you want to tell me what compliance automation tools you love, that would be great.
Thank you, and please stop by the IBM booth. They can tell you about the compliance and DevOps tools and services that we sell, and you can pick up a free copy of a book like DevOps for Dummies there.
Thank you.