Digital Framework for an Agile Cloud Governance Program

Las Vegas 2019

Brian will cover specific requirements and a solution that USAA used to solve these challenges, in a vendor-neutral and technology-sanitized way.

Full transcript

Brian McCarty

Hi, my name is Brian McCarty. I'm with USAA. I'm going to talk today about what USAA has done for a cloud governance program. Basically, going to focus a lot about how we try to digitize as much as possible to remain agile in a highly regulated environment. So before I really get started, though, I've done this talk a couple times. This, by far, is the largest group. Normally, there's only a small number of people that are really interested in FFIEC regulations and how to do DevOps there. So just maybe a quick show of hands, how many people actually work for a financial services organization? So, okay. So I guess that explains it then. So, okay. So, this is a significant amount of work when this is something that's actually operationalized. So we won't be able to cover everything in the small amount of time we have here. However, I will be at the Speakers' Corner.

You can also just find me on LinkedIn and just send me a note. I love talking about this stuff. So, we'll get into that a little bit in a minute. So, maybe just another quick show of hands. Who actually knows what USAA is? Okay, even better. Members, hold your hand up if you're still a member. Okay, great. So if you were in the armed forces, we appreciate your commitment to USAA and to the US. So, in case you didn't know, even if you are a member, USAA was founded as an innovation group back in 1923 by 22 army officers. Basically, what happened was, at the time, army officers, or actually anyone in the military, were deemed too high a risk to insure for automobile insurance. So basically, USAA was founded on an innovative idea of like, well, if we can't find someone else to do it, we'll just do it ourselves.

So the 22 army officers actually created an organization that's referred to as a reciprocal insurance exchange, if you're paying attention. But so, it's a great story, and we actually use it all the time, even internally, about any new innovations that we're working on. So 100 years later, though, we do provide a full range of insurance, banking, and life products. So, I am going to talk a little bit about background just to be clear about where we came from, the reason that we had to put so much emphasis on cloud governance. So just give me a second, and it will come together. So, USAA recently actually just broke into the Fortune 100, actually in the 100th spot.

That by itself isn't as relevant as the next point, which is we're now in the top 30 largest banks in the United States, which means there's a level of rigor that goes along with being a large financial service institution that we weren't previously subject to. But more on that in a second. But basically, USAA has always been considered a great place to work. I've worked there for the bulk of my career, and I love every minute of it. So yeah, specifically though, since we're not a corporate structure, all profits return to members or reinvest in the company. But since we're owned by the membership, and that membership has bravely fought and died for the United States, for America, what we do is we have an extreme focus on understanding, being ethical and having compliant operations, right? It's part of our DNA, it's part of our culture.

And so, when we realized that we needed to increase the level of rigor and accountability for delivering technology solutions as a large banking institution, we had to retune ourselves and reassess what it is we weren't doing or should be doing better to meet those needs. So, part of those regulations, though, came mainly out of the financial crisis of 2008. As you may know, some of the FFIEC regulations go back decades, but some of the increased scrutiny did come out of the 2008 too-big-to-fail factors, which is where that kind of 30-bank threshold came from, right? But since we've always strived for ethics, accountability and compliance, we had to develop some new muscles, and so that's what we're going to talk about today. Also, to put this mildly, our membership is unique, right?

Being 100% owned by members of the armed forces, we're actually a high-profile target for state-sponsored terrorism, state-sponsored cyberattacks, that kind of thing. I'm not actually getting into some of the statistics there, but it's safe to say you'd be very shocked at the level of cyber threats that do face us as a financial institution. There is at least one statistic on there, 13 plus million cyberattacks. That was a 2018 statistic. So, since cloud and security is cloud, that goes without saying. So, we start every meeting reviewing our mission and our USAA standard to make sure that whatever we're about to do or talk about or decisions we're about to make live up to the mission and live up to the standard. So recently, though, we did a review of this, and we have now added an additional attribute to the USAA standard: being compliant and managing risk.

Well, cloud governance is being compliant and managing risk. So, and we're going to show how that's directly related to the USAA standard. So just a quick thing about me. So I work with the Chief Technology Office. I'm a principal architect there, but I'm responsible for cloud governance, architecture, technology business management, and agile. I'm probably going to end up referencing technology business management a little bit later. Does anybody here know what technology business management is? You raise your hand. Does anybody know? Okay. Just one? Two, maybe? Okay. Interesting enough, I'm actually back here next week for the Technology Business Management Conference. Two weeks of Vegas in a row. Not sure what that's going to do to me, but you get to see me only on the second day. So there's a few disclaimers, though.

As we're talking about this, just wanted to be aware, I did take some screenshots, not a live demo of some things we built. This is a non-vendor specific presentation, as was disclosed. If you've got a quick eye, you probably recognize what platform is used to develop some of these, the digitized workflow for governance. I can talk more about that in more of a one-on-one setting. I just can't do it up on the stage. But it's not really relevant to the rest of the conversation. And also, I may say verbally a few statistics, but they're not up on print for reasons. That's why they'll be redacted. And just to be clear, most of what we're referring to for the FFIEC, specifically around the OCC, Office of the Comptroller of the Currency, is a US regulatory authority. But some of these things still apply if you're from the EU.

I've given a talk in Dublin one time for the Open Group, and they knew, they immediately latched onto it as well because they basically have the same kind of requirements. And also, I got a little bit of jet lag, so work with me there. So what are the motivational forces for being able to do cloud governance? So basically, there's what I call internal factors, which is really business growth, both organic, our membership's just growing, as well as marketing. You may have seen commercials, probably. Hopefully, you liked them. They're not quite as funny as some of the competition, but they're good. So we're doing a lot more innovation around eliminating technology debt at USAA, as well as building some new solutions.

There's been a few in the marketplace, including an announcement we made with Google a little while ago about using automated machine learning to immediately assess auto damage just by taking a picture. Does anybody see that in the news? It's pretty cool. So basically, just snap a photo of your car and we can tell you instantly what's it going to cost, what's the estimate to replace it, to fix it. So that kind of thing. As well as technology business management. It's basically like we have a renewed focus on trying to understand all of our technology investments. We have about a $3 billion technology spend per year. And so we're going through an exercise now of really trying to rationalize all those expenses and really understand what's driving costs so that we can get good return on investment.

Has anybody noticed this? How many people think that their public cloud expense may be running away from them? Does anybody have that concern? So what we did, and I'm going to talk about this some more, is basically ingrain our use of infrastructure and platform service solutions directly in our technology business management strategy. That's part of the governance. So we're going to try to prevent the runaway spend that some other companies that adopted public cloud earlier are kind of exposed to now. We kind of learned from others and then embedded some things immediately in our workflow here. And of course, external factors. Almost all suppliers are shifting to cloud in terms of the elimination of commercial off-the-shelf software, transitioning that to software as a service. The technology cycles are far faster now.

Technologies used to be in place for 20, 30 years. Now they may be in place for only months to solve some particular problems. So we've got to go faster. And of course, the regulatory scrutiny. So all of that basically means that we need to be able to digitize all of our work. We need to automate the heck out of it. And we need to do risk management, but we need to do that in a reasonable way, meaning not all use cases have the same level of risk. So therefore, they don't need the same level of rigor around compliance. So a lot of what we did was take a risk-based approach to governance. You try to identify early on whether something is going to be risky or not. Shift left so that you don't put a burden to the right on the pipeline unnecessarily. So there's a couple tricks to that. We'll talk about that.

Of course, trying to maintain agility. So just as a reminder, if you're not familiar, there are several regulations in the space, but basically it all kind of boils down to the FFIEC saying, and I think rightly so. I'm not actually dragging any regulatory bodies over the coals here. It actually makes sense that a large financial institution ought to have a really good third-party risk management strategy. No company in the world survives without third parties. So all public cloud, in some shape or form, is third party. Either you're buying something from a public cloud provider, or you're about to deploy something you built to a public cloud provider. But it's impossible to do cloud without understanding what your third-party risk is. So at the end of the day, really what these say is you need to be doing good third-party risk management. And so that translates to public cloud.

I actually saw somebody take a picture. I usually stop to say that. It's like, this is really easy to Google. You think it would be real easy to Google? But actually, it's hard to get down to the specific line item that says what is it you're supposed to be doing or not doing. The handbooks are this thick. Literally, I've seen somebody print them out. They're that thick on one of my business partner's desks. So anyway, so that's it. So what we did was, is that first, we established a program. A whole program around cloud governance. And it starts at the very top, the board of directors level.

So the board of directors decided to-- Of course, we proposed something to them, but they basically said, "We're going to put in place a cloud computing policy, where it's tied to mission standard corporate strategy," and then left it up to people like myself, in technical leadership roles, to define what that cloud strategy ought to be. So the reason I start there and I say, "Well, Brian, we're not really talking about standards and strategies. We're talking about governance." The point is, is that you can't do governance if you don't understand what it is you're supposed to be delivering on. So you have to have some sort of strategy in place first. That is harder than it is to do governance. If you think your organization has a really good strategy in place, I'd love to talk about that later.

It is very difficult to outline a crisp and clear strategy, both a technology strategy and, of course, also a cloud strategy. So maybe one day I'll give another talk about going through that. But basically, cloud governance is saying we need to align the ability to manage risk and measure risk versus the reward we're going to get on what the strategy is. So if you don't know what the strategy is, you can't really do governance. So let's move on to governance itself. At USAA, we actually refer to basically kind of six pillars: privacy, data, risk, architecture, audit, and security. You may have something different than that, but basically, governance should encompass at least those dimensions, if not others. Okay? So that's kind of clear. We have this really cute name called Control Partners for that. I don't know if that's an industry term, that's just what we say.

And then there's one layer down, which is referred to as cloud management. So if you Google cloud governance versus cloud management, you find out that governance is about establishing standards and a way of behaving to be able to manage and measure risk. Cloud management is about actually operationalizing the pipelines to get code into production. Internally at USAA, we have a product we refer to as Safe Landing, meaning if we're going to manage the cloud, we need to be able to land workloads into the public cloud in a safe and secure way. And so that's really at the heart of it. It's a DevSecOps discipline. It just so happens that its ecosystem is specific around public cloud. Does that make sense? Okay, so that's where the DevSecOps comes in.

I am going to talk about software as a service just a little bit because I did see there are so many people with hands up that work for a financial service institution, but I'm sure most people are a little more interested, this being a DevOps conference, in software that's being developed and then, of course, deployed. But we'll go through that. So the plan of attack. All right, so we had an outline of what it was we were going to do, but we realized we needed to make sure we delivered on some really key things before we could really establish a good program itself. So we established a cloud computing policy approved by the board. We identified applicable control partners, and we solicited executive support. If you don't have those first two layers, board of directors level approval and senior vice president type commitments, you might as well quit. You might as well give up.

Because you can't do this unless you have full commitment both horizontally, across reporting structures, and vertically, through the vertical channels of your organization, because governance is a crosscutting or horizontal concern. So if everyone isn't on board in the different vertical silos in a large organization, you'll never be able to get that crosscutting horizontal view that's so important. Does that make sense? Okay, so we also were able to secure funding to basically build up a small team, because even though we're digitizing and we're trying to automate, you still need a good small team to continuously look at the consistency of it. Is it doing what it was supposed to do? Is the data hygiene, that data that we're capturing to report to regulators, good and clean? So that does still require some human capital investment, even if you do focus on digitization and automation.

And then we came up with a risk-based approval process that took some iterations. We're on version three now, and it has some key performance indicators. And then, of course, we digitize as much data collection as possible and automate. All right. So I'm just going to hit the high level real quick. So basically, in the risk-based approach, corporate leadership supports our cloud enablement group, which is a body of executives that has the authority to actually make executive-level decisions. They meet very regularly. All exception cases, all outlier cases that we haven't thought about before, get reviewed there. It's actually a very lively discussion. We couldn't get people to show up at a meeting two years ago. Now it's a packed room, people bringing questions and concerns. So it's really cool.

That's a measure of success when people actually show up and volunteer information, right? So then there's what we refer to as the cloud review panel. It's basically the worker bees. So it's the people, they're the SMEs in the different spaces, actually providing support to do the governance itself and say, "Well, what is that governance?" It's what we just referred to as cloud review process. You can come up with your own cute name. We like CRP because it sounds a little funny when you say it. Do you see the thing? Some people got it. You can see so. All right. So basically, we review use cases and then we approve them. And you say, "Well, Brian, what happens if you-- Do you approve everything?" No, we don't approve everything. We deny things. Okay, well, what happens if you deny things? Well, we're grown up, so what do we usually do?

We have a conversation about it and figure out what is it that kept someone-- What is it that's causing concern or alarm, right? What is the risk that we feel is unworthy of the reward that we're about to receive? And so it is possible that you could escalate up to the corporate leadership. Oh. But has anyone ever done that before? No. Problems get solved before you have to escalate to that level, right? Theoretically, I could take something to the board of directors. I don't plan on ever having to do that, right? That means we failed as adults to have conversations, right? So, and then basically, the review process doesn't actually end there. We trace everything all the way to the time it retires. And so that's actually an important topic. You can't let things just dwindle or just survive in perpetuity. They should be reviewed on a regular basis to ensure they're still valid.
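The point about reviewing use cases regularly until they retire can be sketched in a few lines of Python. This is only an illustration: the record fields, the 365-day interval, and the function name are assumptions, since the talk does not say how often reviews recur or how the records are stored.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical interval; the talk only says reviews happen "on a regular basis".
REVIEW_INTERVAL = timedelta(days=365)


@dataclass
class UseCase:
    name: str
    last_reviewed: date
    retired_on: Optional[date] = None  # None means the use case is still live


def due_for_review(use_cases, today):
    """Return names of live use cases whose last review is older than the interval."""
    return [
        uc.name
        for uc in use_cases
        if uc.retired_on is None and today - uc.last_reviewed > REVIEW_INTERVAL
    ]
```

Retired use cases drop out of the check, which mirrors the idea that tracing ends only at retirement.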

Okay, so let's just go through the flow real quick. I'm just going to touch on SaaS real fast because it's really pretty straightforward. It's what I just said before, except once it's approved, then it goes to contracting. You're not allowed to sign a contract in any shape, fashion, or form for a public cloud, for a technology provider, third-party technology provider, without first having an approval from cloud review panel. That's a really key point, okay? This is what, at the corporate level, prevents business areas from purchasing technology solutions of their own without understanding the risk that they're exposing the organization to. This is why you need board of directors level approval. So that-- Well, there you go. I'll leave it there. And so that moves into adoption, right? All right, let's talk about IaaS a little bit more because it's slightly more complicated.

So with that review, what we do is we try to analyze what's the risk and what is the outcome that we're trying to achieve for some new effort. The reason that we do this upfront and we get approval basically in step two down here-- I'll just tell you because for the sake of time, I don't want to do too many trivia questions. The reason is is because we're trying to protect the developer community. Do you really want the developers to have to worry about the level of complexity of FFIEC regulations and am I supposed to be doing this? Am I not allowed? We don't want to do that, right? By the time it gets to them, they can do their development. They can put things in the pipeline. The work has already been done by people that wear jackets and pocket squares, right?

It's done before they have to work, so we can try to protect them from context switching or stopping and starting an effort because someone's put them on hold. So that's the point, right? Is we try to review what's really shown in step two right here. Then development happens, right? Work with operations, work with security, make sure the solution is going to our safe landing environment. Meeting some criteria that was set out ahead of time. Hits the pipeline, gets deployed. I can't mention the name of our preferred public cloud provider, but it's one of the large ones everybody uses. And then the workload lands. So now what? All right. This is what we're working on next. Is that we're trying to get it such that all-- Because we're doing everything from source, right? Basically use Terraform.

So everything comes from source, and the goal is that the appropriate tagging is in place ahead of time so that we can actually verify in production on a regular basis, to prove to external authorities and internal auditors, that we consciously made a decision to approve this use case being pushed to the public cloud. And so that's not actually very easy, right? Because use cases ebb and flow, they flux. They add scope, they reduce scope. So we're trying to figure out what we can detect in the workload itself to know that it's still within the guardrails that were set in place early on. We get a few of those because we tag some very important things, the environment ID, the application ID, and the cost center, but we're also looking at other options.
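A minimal Python sketch of what checking for those three tags could look like. The tag key spellings and the shape of the resource inventory (a dict of resource ID to tags) are assumptions for illustration; the talk names the three tags but not how they are spelled or collected.

```python
# Hypothetical key spellings for the three tags named in the talk.
REQUIRED_TAGS = ("environment_id", "application_id", "cost_center")


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or blank on one resource."""
    return [
        key
        for key in REQUIRED_TAGS
        if not str(resource_tags.get(key, "")).strip()
    ]


def untagged_resources(resources):
    """Map resource id -> missing tag keys, only for resources that fail the check."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report
```

Run against a periodic export of production resources, a report like this gives a simple list of workloads that cannot be traced back to an approved use case.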

If any of you all are working on how to do that, right, basically more like a day two type scenario, to prove what workloads are actually being done in the public cloud, I'd love to talk about that. Oh, let's just hit this real quick. I know you may not see down here, so I'm just going to read it. It says there's one and only one enterprise-level agreement allowed with IaaS and PaaS providers. So I'm not going to mention vendor names, but let's just say GCP, right? Google Cloud Platform, Azure, AWS, any of those big ones. Let's just say you use AWS. How many separate agreements do each of your lines of business have with those public cloud providers? Does anybody want to-- One? Two? What? Just one? Pretty good? Okay. Up to five. Five? Yeah. That we know of. That you know of. You have to put a stop to that.

Part of this, right, is that a master service agreement needs to be in place with all the terms and conditions negotiated with the full force and effort of whatever it is that your organization is capable of doing from your contracting department with those cloud providers. And then you need to monitor, through expense management, that people are not going outside of that master agreement. Okay? If you're not financial services, you can do what the heck you want to. I'm just saying this is what it takes. Because if you're not in the account taxonomy under that master service agreement, you're going to have a bad time. Right? Well, anyway. So have one and only one. A few things on safe landing controls.
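Monitoring through expense management could be as simple as the following Python sketch, which flags spend on cloud accounts that are not in the approved account taxonomy. The expense-line fields (`account_id`, `amount`) and the function name are hypothetical; no real billing export format is implied.

```python
def accounts_outside_agreement(expense_lines, approved_accounts):
    """Sum spend per cloud account that is not in the approved account taxonomy."""
    approved = set(approved_accounts)
    rogue = {}
    for line in expense_lines:
        account = line["account_id"]
        if account not in approved:
            rogue[account] = rogue.get(account, 0.0) + line["amount"]
    return rogue
```

An empty result means every charge rolled up under the master agreement; anything else is a conversation to have with that line of business.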

The reason that we have only one account taxonomy that we use and one master agreement is because all of the controls for the public cloud provider on landing workloads are standardized at this point. They're like guardrails, right? So, the big circle is the big wheel of the ones that we care about. Something on the order of over 400 separate controls are in place that basically fall into these spaces. I just highlighted a few of the key ones that you need to do up front, but as you get more mature, there'll be additional ones. There's some industry work going on right now around trying to establish what those controls are because something like an AWS or GCP or Azure, they expect a shared responsibility agreement, right? You have shared accountability and responsibility of using those providers.

So we had to step back and figure out what it is that we should be responsible for that the vendor is not doing for us. And so all those controls fall in that space. So if you deploy in our pipeline, to our public cloud provider, you're going to automatically receive those 400 and something controls, right? Because we've governed and we standardized the method for pushing software to that public cloud provider. Just hang on. Let me just hit one more thing and I'll take some questions real quick. So there's a cloud service catalog. Basically as new initiatives come in, I just wanted to hit some screenshots. You could see that we're trying to digitize as much information as possible. No governance is ever done via email, walk-ups, spreadsheets, anything like that. If it's not actually in the system, it doesn't exist. Okay? I even do that for my...

I'm the technical architect for dozens of cloud use cases. I don't just tell my tech lead over Slack or email, "Oh, it's fine." What I do is I actually go to the system and record that I made a decision here about something, right? So it's recorded in perpetuity, basically till the end of time, is the point. So a couple screenshots. I redacted a few things. The main thing I just wanted to point out here was notice the life cycle: draft, submitted, pending updates, canceled, validated, scheduled, follow-up, withheld, reviewed. So there's no argument about where we are in the governance life cycle. That's a really key point. Also, we do this, back to the risk-based approach.
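One way to picture "if it's not in the system, it doesn't exist" is an append-only decision log per use case. The Python sketch below is an assumption-laden illustration: the status names follow one reading of the life cycle shown on the slide, the class and field names are hypothetical, and no transition rules are enforced because the talk does not describe them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    # One reading of the life cycle states mentioned in the talk.
    DRAFT = "draft"
    SUBMITTED = "submitted"
    PENDING_UPDATES = "pending updates"
    CANCELED = "canceled"
    VALIDATED = "validated"
    SCHEDULED = "scheduled"
    FOLLOW_UP = "follow-up"
    WITHHELD = "withheld"
    REVIEWED = "reviewed"


@dataclass(frozen=True)
class Decision:
    status: Status
    decided_by: str
    note: str
    at: datetime


@dataclass
class GovernanceRecord:
    use_case: str
    history: list = field(default_factory=list)

    def record(self, status, decided_by, note=""):
        """Append a decision; earlier entries are never edited or removed."""
        self.history.append(
            Decision(status, decided_by, note, datetime.now(timezone.utc))
        )

    @property
    def current_status(self):
        """The latest recorded status, or None if nothing was recorded yet."""
        return self.history[-1].status if self.history else None
```

Because the current status is always derived from the last recorded decision, there is a single answer to where a use case stands, and the full trail is kept for audit.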

I saw some people that seemed interested in this. Based on the phase of the effort, trial, pilot, or production, we increase the level of rigor, the level of accountability, the level of controls, because the level of risk has increased. So this is a real key point. We try to really understand, are you in a trial, pilot, or production phase, so that we don't overtax the engineers, the people that are working on the solution, for something that is not yet applicable. This takes practice. This is not something you can do overnight. You have to practice this. Trial basically means you're doing something with no risk to USAA, meaning it's usually using a contrived or industry data set of some kind just to test a capability. Right? There's no risk of data loss or something like that at all. So that's basically what trial means. And then we repeat the review process on the phase changes.
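The phase-based scaling of controls can be sketched as a small lookup in Python. The three phases come from the talk; the control names attached to each phase are invented placeholders, since the actual controls are not listed.

```python
from enum import IntEnum


class Phase(IntEnum):
    TRIAL = 1
    PILOT = 2
    PRODUCTION = 3


# Hypothetical control names; the talk does not enumerate controls per phase.
CONTROLS_ADDED_AT = {
    Phase.TRIAL: ["use_case_registered"],
    Phase.PILOT: ["security_review", "privacy_review"],
    Phase.PRODUCTION: ["audit_evidence"],
}


def required_controls(phase):
    """Controls accumulate: each phase inherits everything required before it."""
    controls = []
    for p in Phase:
        if p <= phase:
            controls.extend(CONTROLS_ADDED_AT[p])
    return controls


def needs_new_review(approved_phase, requested_phase):
    """A phase change sends the use case through the review process again."""
    return requested_phase != approved_phase
```

The design choice here is that a trial carries the lightest set, and rigor is only added when the phase, and therefore the risk, goes up.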

So it's fully auditable history. Current status is never ambiguous. I'm just going to put this up for a second because then, we're going to switch to some questions real quick. So basically, these are the digitized sources of truth that we have so far. Every single one of these is actually pulled together as a digital source of truth in our what we refer to as cloud service catalog. But basically, the inventory of all the governance use cases. We're constantly trying to find better ways of eliminating humans from having to type, copy, paste. I don't know if you realize this. Humans are actually kind of, they're difficult, right? To deal with. So, it's better to work with trusted sources of information than try to rely on someone's intuition or their memory. So, we relate these sources of truth back to the cloud catalog. And so all these sources are governed.

This took a lot of work, okay? I don't want to act like this is just all unicorns and rainbows. This is a lot of work to do this. But it's worth it in the end because now we're driving through something like 400 use cases a year. That's something like 300% growth year over year, right? If we didn't invest in this back when we did, we would not be able to sustain the level of throughput that we're doing right now. That's the bottom line. So with that, though, I'll take a couple of questions since it's two minutes, 20 seconds worth of questions, I guess.

Q&A

Audience: How long was your journey from the start?

Brian McCarty: We started the cloud governance program in 2009, but what you are looking at right here, this is version three. Version three took about two years. That entire process where we got everyone on board and we knew all the data sources took about three years.

Audience: Do you send your CI/CD pipeline changes and plans through that same process?

Brian McCarty: What we are trying to do is keep, and that is why there is a handoff here, we are trying not to expose the pipeline, like the people with hands on the keyboard trying to crank out the software, to having to worry about what is going on around this from a governance perspective. So the answer is no, but that is by design. We do not want them to have to worry about it. All I really need to be able to do is verify when workloads deploy, if someone comes and tries to audit, that those people built the software, they are delivering a solution that we said they ought to be able to do, and we have fully evaluated it. That is where the tagging comes in. The only thing I really need them to worry about is making sure they are following the tagging standard, but they should be doing that anyway.

Audience: What is your ratio of on-prem and cloud?

Brian McCarty: We have about 3,100 applications in our application portfolio managed inventory. About 30% of them are public cloud or hybrid. We are actually pretty risk-averse in terms of adopting public cloud, but I see the number probably doubling within the next couple of years.

Audience: Is your database and everything in cloud, or is everything on-prem?

Brian McCarty: That is where the hybrid part comes in. There is work going on to build systems of engagement in public cloud but still have some reference data, mainframe-based data, and things like that exposed through APIs. The API would be exposed on-prem, but the application would be running in public cloud. There is this nasty thing called the speed-of-light problem that gets in the way of some of that, so we are still working through that. We have not figured out how to violate the laws of physics quite yet, but we are working on it.

Audience: You are using the CMDB as your source of truth? How did you figure out what upstream assets, repos, pipelines, dependencies, et cetera, belong to one of the apps listed in the CMDB?

Brian McCarty: The application ID, that tag right there, is the business application CI class in the CMDB. That application tag means anything deployed with that Terraform deployment, anything found with whatever was configured on deployment, any services spun up, are mapped back to that business application, the one that has its lifecycle managed. That is the short answer.
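As a rough Python illustration of that mapping, the sketch below groups cloud resources under the business application named by their application ID tag and sets aside anything that does not match a known application. The tag key, the inventory shape, and the function name are assumptions; this does not use any real CMDB API.

```python
from collections import defaultdict


def group_by_application(resources, known_app_ids):
    """Group resource ids under the business application named by their tag.

    Resources with a missing or unknown application id are returned as orphans.
    """
    by_app = defaultdict(list)
    orphans = []
    for resource_id, tags in resources.items():
        app_id = tags.get("application_id")  # hypothetical tag key
        if app_id in known_app_ids:
            by_app[app_id].append(resource_id)
        else:
            orphans.append(resource_id)
    return dict(by_app), orphans
```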

Audience: You track dependencies and source back to the app ID?

Brian McCarty: We are trying to. There are some technical limitations to still do that, but some of the large providers of CMDBs are working on discovery in the public cloud and that kind of thing, so that we can capture what it is supposed to be on the deployment, but then also discover it in the public cloud provider and try to match the two up. We deploy something and then verify by discovering it and bringing it back into the CMDB. There is definitely room for improvement on the technologies available to do that kind of thing, though.
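The deploy-then-discover matching described here amounts to comparing two sets. A minimal Python sketch, assuming each side can be reduced to a list of resource identifiers (how those identifiers are produced by the deployment tooling or by discovery is outside the sketch):

```python
def reconcile(declared, discovered):
    """Compare what the deployment declared with what discovery found."""
    declared, discovered = set(declared), set(discovered)
    return {
        "matched": sorted(declared & discovered),
        "missing_in_cloud": sorted(declared - discovered),
        "unexpected_in_cloud": sorted(discovered - declared),
    }
```

Anything under `unexpected_in_cloud` is a resource running in the provider that no governed deployment accounts for, which is the case an auditor would ask about first.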

Audience: How about if you have multiple assets under one application? One large application system?

Brian McCarty: It is not really a reasonable statistic because it could be a small app, or it could be replacing the entire claims system. So they are all apples and oranges. Does that make sense? Anything else? Because we are out of time. Great. I appreciate it. Thanks a lot.