Is This Thing On? Instrumenting DevOps for Architecture Health, App TCO and Compliance

Las Vegas 2020

Large organizations are making progress towards agility yet struggling with the acceleration of technology investments and detecting desired outcomes.

DevOps and pipelines have driven efficiency gains, but what is the Total Cost of Ownership (TCO) for each application in a large portfolio? Are we achieving the ROI expected? Are compliance and architecture policy standards being met?

The next evolution of DevOps requires instrumentation converging with compliance, Enterprise Architecture (EA) and financials (TBM) to drive health and cost insights into the business planning near real-time.

In this session we will explore an approach used to unite DevOps, EA and TBM for cost elimination and help business partners make informed decisions.

Full transcript

Brian McCarty

Good morning, everyone. My name is Brian McCarty. I'm with USAA. I'm here to talk a little bit about instrumenting DevOps for application total cost of ownership and compliance. Thanks for joining.

Like I said, my name is Brian McCarty. I'm a principal technical architect with USAA's Chief Technology Office. I work for the CTO. I do a lot with specializing in the practice and the business of IT. Specifically, I focus a lot around our cloud governance practice, the tools and techniques for architecture. And then I spend a lot of time on technology business management, which is really the discipline and the practice of merging technology consumption information, the usage of technology, with finance. So we try to understand things like total cost of ownership for technology, the return on investment for technology-based initiatives, as well as expense management, planning, budgeting, and forecasting. I also do a lot to support our agile tool chain as well.

USAA, if you're not familiar, is an organization that provides financial services to the United States military and their families. We start every mission, every meeting, with a look at our mission and our USAA standard to ensure that the topic, the decisions we're about to make, are in the best interest of our membership and the association. In particular, I'd like to point out number three under USAA standard. For the purposes of this presentation, I think it's most impactful that "be compliant and manage risk" is the standard that applies the most.

USAA, just a little bit about what we do: we provide a full range of insurance, banking, and investment products to the military community, like I mentioned. We are now a Fortune 100 organization and steadily growing. We do have a very large security and innovation practice in USAA, which does translate to a fairly large technology landscape that we're going to talk about here just a little bit more.

Just to give you an idea for the purposes of this discussion about what we had to construct to try to drive out more data from our technology environment, and to be able to create automation to eliminate manual and redundant tasks that were being placed on our development community: we do have over 30,000 employees. 4,400 of them work in what we refer to as a chief administrative office, basically technology, design, and digital professionals. We do have a 96% retention rating for technology staff, which is pretty good. It's one of the highest in the industry. We like to think that the reason that we have a high retention rate for technology staff is because we put a lot of attention, a lot of focus, on letting them do the job that they're best suited for: the job of actually designing, developing, testing, and supporting technology and applications that best serve the membership, as well as some internal type applications.

The topics that we're going to cover today are: first, we've got to hit a little bit of background and key terms so you understand where we're coming from and we have a common language, a common way of understanding some of the topics we're going to be covering later in the presentation. Then we're going to go straight into looking at a demonstration of some of the control automation that we've created, as well as the ability to do the application total cost of ownership calculation.

Now, this is probably the best time to stop and discuss why this guy is talking about controls automation and application total cost of ownership during the same presentation. They seem like topics that don't have much to do with each other. In fact, it turns out that they do. What we've learned through this process is that the data and the discipline around understanding the entire technology landscape, from all of the applications that we manage and support in IT down to the underlying infrastructure and supporting configuration items, that data and that discipline of having that complete and accurate inventory of our technology landscape actually drives the outcome for both controls automation and being able to calculate the total cost of ownership for an application. So we'll get into that a little more when we walk through the method and how we were able to achieve this.

Just so you know, this is still early in our journey, but we've learned enough to understand that this does have the return that we're looking for, the return on that investment, as well as builds for the future. So I will show you some examples, but there is some additional room for opportunity, and so we'll cover that in the what's next.

A little background. At USAA, we've had a lot of demands placed on us the last few years, both the increase in regulatory scrutiny, our desire to be absolutely as compliant as possible, as well as that organic business growth of the organization. We've doubled in size multiple times since I've actually started working here 20 years ago. So these demands, of course, were not going away. At the same time, we actually have some additional new desires. We really want to attempt to accelerate our technology cycle to be even faster: our ability to develop and adopt new technologies as the business needs, as well as to pay down technical debt that may exist in some places.

Then another strong desire is that our business partners have become much more engaged the last few years, really on trying to understand where their costs are coming from and how to make good decisions, good technology investment choices. It used to be, 20 years ago, that there was business and then there was IT. The way that the world operates now is that every company that's going to succeed is really going to be a technology-driven company. So our business partners are much more educated and much more involved in wanting to understand their technology landscape: what are they paying for, and where can they make changes to drive better value?

That combination of the demands and the desires would be almost impossible to achieve if we hadn't adopted some of our DevSecOps principles way back earlier, as in a few years ago, to try to create efficiencies to be able to deal with these demands and desires. However, our DevSecOps disciplines alone weren't going to achieve the efficiency gains necessary to be sustainable further in the future, so we needed to make some new investments to try to generate more efficiency and to purge a growing set of manual tasks that have been created over the years to deal with compliance, application support, and that business growth. So we couldn't go any further.

What we've done is we've actually made new investments in automation and data about the technology landscape itself. We've actually figured out ways of extracting and aggregating information about the applications and their actual configurations, what exactly their dependencies are. We have over 3,000 applications in operation at any given time. Those islands of data about the underlying configuration ecosystem that those applications depend on, we needed to do a better job of aggregating that into one place so that we could drive and build automation that's effectively business rules around managing and monitoring that technology landscape. We're talking, of course, about two of those today.

So far, this investment is starting to pay off. However, we of course are going to do more. It's really a never-ending type of initiative. There are always ways of finding more efficiency and developing more capabilities. It's almost like scientific discovery. You answer one question; you end up asking many more.

A couple of key terms. General controls: we're referring to more of that regulatory terminology around general controls. Since USAA is also a bank, we do closely monitor and adopt the IT handbook from FFIEC. There's a link down there at the bottom if you're not familiar with it, but I extracted some relevant talking points for this key term: ensure the proper development and implementation of systems and integrity of program and data. So it's that kind of controls that we're talking about when I say the word control.

Application portfolio management: this is a sort of a modified definition from some industry thought leaders. This is the one we use internally. Basically, it's that actual inventory of all of our application landscapes to describe the technical architecture, as well as the information about each of the application's health, the business value that we expect from it, and some of the support information as well. Of course, value is a measure of cost as well. So the fact that we're working on application TCO is what's going to drive our value measurements for the future.

Configuration management database: probably a familiar term to most people, so I won't spend a whole lot of time here. I just want to point out that what we really focused on is understanding the configuration items and their relationships. We've done a lot of work on that, as well as methods for certifying that data over the last couple of years, monitoring for health and for completeness and accuracy, and then building some controls directly into CMDB to ensure that those relationships and certification health are sustainable over the long term.

App TCO: lots of people probably have different definitions or opinions, but when it comes down to it, what we're really meaning by app TCO is we're looking at the sum total of all of the labor, hardware, software, and services, the cost per application. Sum that up to get a TCO. Now, we're not actually tackling the development costs yet. That's a next step for us. The TCO calculation that we're using right now is that direct and indirect cost for managing the applications, and it's helping estimate the business value so that we can rationalize applications from that.
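As a rough illustration of the definition above, the sketch below sums run costs per application across the four categories named in the talk (labor, hardware, software, services). The record layout, the application IDs, and every amount are hypothetical; the talk does not show how the real calculation is implemented.

```python
from collections import defaultdict
from decimal import Decimal

# The four cost categories named in the talk.
COST_CATEGORIES = {"labor", "hardware", "software", "services"}


def app_tco(cost_records):
    """Sum run costs per application across all cost categories.

    cost_records is an iterable of (app_id, category, cost) tuples.
    This layout is an assumption made for the sketch.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for app_id, category, cost in cost_records:
        if category not in COST_CATEGORIES:
            raise ValueError(f"unknown cost category: {category}")
        totals[app_id] += Decimal(cost)
    return dict(totals)


# Made-up figures, for illustration only.
records = [
    ("app-a", "labor", "1200.00"),
    ("app-a", "hardware", "300.50"),
    ("app-a", "software", "150.00"),
    ("app-a", "services", "75.25"),
    ("app-b", "labor", "800.00"),
]
tco = app_tco(records)
print(tco["app-a"])  # 1725.75
```

Development costs are left out here, matching the statement that they are not yet part of the calculation.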

Just a little bit more about regulatory and compliance. This is just an example of one of the control needs that we have out there. I just picked one of these at random from the FFIEC handbook, just as an example to say that regulators, it's important to remember, describe the requirement, not the implementation. They don't necessarily say the method you should use. They say establish appropriate change management standards and procedures. So what we do is you take that requirement, and you work with your first, second, third line to try to understand what is a satisfactory implementation of a control for that requirement to establish appropriate change management standards and procedures.

Okay, so let's go right into some demo time. I'm going to show you a couple of views of our controls automation, the output from that, as well as app TCO.

This is an example of a report that we've automated the creation of. It's specific to one control. It's the detailed evidence that control 106156 is being met. This control, there's of course many of them, hundreds of potential controls. I did just pick this one to illustrate. If you see there in the change request, what is it we're doing here? This control, we're saying the requirement is that the changes need to be approved by an authorized person before implementation.

Out of our last change windows, probably last seven days, there were 666 changes that went to production. What we're saying is out of that 666, 664 of them passed. They met that business rule requirement of saying that the date of approval was prior to the date of implementation. So it's just a business rule looking at the data, right? Compare two dates together. If the implementation date is greater than or equal to the approval date, you pass. If it's less than, you fail.
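The date comparison described here can be sketched in a few lines. The `ChangeRequest` fields and the sample change numbers are hypothetical, and treating a missing approval as a failure is an assumption of this sketch rather than something stated in the talk.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ChangeRequest:
    # Hypothetical field names; the real change system's schema is not shown.
    number: str
    approved_at: Optional[datetime]
    implemented_at: datetime


def approved_before_implementation(change: ChangeRequest) -> bool:
    """Pass when the implementation date is on or after the approval date."""
    if change.approved_at is None:
        return False  # assumption: no recorded approval counts as a failure
    return change.implemented_at >= change.approved_at


def evaluate(changes: List[ChangeRequest]) -> Tuple[int, List[ChangeRequest]]:
    """Return the number of passing changes and the list of failing ones."""
    failures = [c for c in changes if not approved_before_implementation(c)]
    return len(changes) - len(failures), failures


changes = [
    ChangeRequest("CHG-1", datetime(2020, 9, 1, 9, 0), datetime(2020, 9, 2, 1, 0)),
    ChangeRequest("CHG-2", datetime(2020, 9, 3, 8, 0), datetime(2020, 9, 3, 8, 0)),
    ChangeRequest("CHG-3", datetime(2020, 9, 5, 12, 0), datetime(2020, 9, 4, 23, 0)),
]
passed, failures = evaluate(changes)
print(passed, [c.number for c in failures])  # 2 ['CHG-3']
```

Returning the failing records alongside the count is what makes a failure-detail view possible: each failure carries the change number that can be linked back to the change management system.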

In this case, it looks like we're doing really well. Over time, this view is the whole month of September. Looks like we've met that requirement, we stayed in the green for that the whole time. You say, "But it's not 100%." Well, from an organizational perspective, there are thresholds to say if you met or did not meet those expectations. In this case, as long as it's above 98.4, we're still in the green. But that doesn't mean it's perfect. There's always room for opportunity, and in this case, we're saying, well, we failed two of those 666. What do you do about that? This view, this environment that we've created, makes it very easy to create those coaching opportunities by presenting that failure detail right away. You can see down towards the bottom, the middle, the actual change request number is hyperlinked directly to the change management system so that all details about that are available. Those approvers, I've redacted the names there, but the approver could easily have a conversation, a coaching opportunity, with the implementer to make sure that that gets done. They partner up to get that done in the future.
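To make the threshold arithmetic concrete: 664 passes out of 666 changes is about 99.7%, which is above the 98.4 threshold quoted here. The sketch below encodes that comparison. The function names and the "red" label are assumptions; the talk only mentions being "in the green".

```python
GREEN_THRESHOLD_PCT = 98.4  # threshold quoted in the talk


def pass_rate_pct(passed: int, total: int) -> float:
    """Percentage of changes that met the control."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def control_status(passed: int, total: int,
                   threshold: float = GREEN_THRESHOLD_PCT) -> str:
    """'green' when the pass rate is above the threshold, else 'red' (assumed label)."""
    return "green" if pass_rate_pct(passed, total) > threshold else "red"


print(round(pass_rate_pct(664, 666), 2))  # 99.7
print(control_status(664, 666))           # green
```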

So that's a view of all changes for one control, or the whole ecosystem for one control. This next view I'm going to show you is: let's look by application. What I've done here is this is a view showing how well all of the applications in my domain, I'm a domain architect, you can see up in the upper right-hand corner, it's saying display my apps. How well, for this one particular control, is my domain performing? You can see there's some risk level there. It's basically our method for understanding risk. Looks like we're doing pretty well. Out of 33 applications in my domain, only three of them are not meeting this control. You say, why is that, Brian? Well, let's find out.

Let's click on one of the ones that is not meeting that expectation and look at some details here. So it's saying, for this one application, there was no assignment group specified or it is not found in Active Directory. What this is saying is, for one app, there's an issue with the data in the CMDB telling us who is responsible for providing that initial return to service, that first contact for restoring service to this application. Since we have all of the data available to us, we can easily present those views very quickly. So it's easy for me now, when I have hundreds of controls to meet, to look very quickly and say: show me how am I performing for all controls, one control, and I can take action on that.
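A check like the one described, "no assignment group specified or it is not found in Active Directory", could look roughly like this. The CMDB record shape is hypothetical, and a plain set of group names stands in for the real directory lookup, which the talk does not describe.

```python
from typing import Dict, Optional, Set, Tuple


def check_support_group(app: Dict[str, Optional[str]],
                        directory_groups: Set[str]) -> Tuple[bool, str]:
    """Verify that an application names a support group that exists.

    `app` is a hypothetical CMDB record; `directory_groups` is a
    stand-in for a directory query.
    """
    group = (app.get("assignment_group") or "").strip()
    if not group:
        return False, "no assignment group specified"
    if group not in directory_groups:
        return False, "assignment group not found in directory"
    return True, "ok"


directory = {"deposits-support"}  # made-up group name
apps = [
    {"name": "app-a", "assignment_group": "deposits-support"},
    {"name": "app-b", "assignment_group": None},
    {"name": "app-c", "assignment_group": "ghost-team"},
]
for app in apps:
    print(app["name"], check_support_group(app, directory))
```

Returning a reason string with each failure is what lets the view say why a specific application is not meeting the control.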

As the architect, I know the reason we haven't worked on this yet, why this is not being met, is because this application is a low-risk, small application that we have yet to clean up or sanitize some missing data values to bring this whole. So rather than email or spreadsheets or trying to dig up some details on that, what I can do is Slack the application owner directly the deep dive link and say, "Hey, we failed on this control. We need to get this cleaned up. Can you please take care of that?" Once that piece of data is established, that proves that there is in fact a proper support group for that application. Within four hours, that control will now be met, and so we'll go green.

Now let's pivot and look a little bit at application total cost of ownership. I've produced two views here. This is the end result of the automation that we're running to be able to perform a look at the entire landscape of the underlying configurations that went into the maintenance, the hosting, and support of that particular application. In this case, this is a business view. I'm going to show you the business view, then I'm going to show you an IT view.

This business view is saying to our business partners, they're fully accessible to everyone in the company that has a need to see this kind of information. They can easily see what they are getting from our IT hosting for their apps. They're very familiar with the application names. We're trying to get them more familiar with understanding what's the actual ITIL service portfolio that they're receiving benefit from, as well as very specifically out of the service catalog, what is the technology, the technical capabilities that application is dependent on. In this case, we're seeing that it's got some distributed databases. They've got middleware talent. That's actually what we refer to as our private cloud environments, our more modern ability to host applications basically based on Kubernetes. And then our classic, which is really just our traditional way of JEE support for deployment to JVMs. And there's some storage and some virtual costs.

We didn't tell them every specific server or storage block that was used, or which individual databases. This is rolled up to the level of understanding which applications they are, because they understand applications now. This, of course, has something to do with deposits. I'm redacting part of the name, but you can see that it's an application that revolves around deposits, as well as what was the unit of measurement they're being charged. There's a little bit of need to understand some things around support. For example, this is a medium-level support because the architectural complexity of this application, looking at our application portfolio management system, tells us this is a medium-complexity application.

As well as the contract between the two parties, in order for us to meet Reg W compliance, the lines of business need to contract with our internal affiliates, IT being one of them, and say, "What are you offering me? What am I paying for?" And so this is the contractual obligation record. After that, we can simply look at the unit of measurement. How many of that unit of measurement did you consume? What was the charge? What was the price for it? And then we just multiply those two together to get a cost: price times quantity. In this case, we can also look out for this number, 27,781. I'm going to show you that in the next chart and kind of tie this out between the business view and the IT view to show continuity here.

So let's look at the IT view. The IT view is saying, here's a collection of individual technology offerings. What we're saying is that from an IT perspective, we can deep dive and actually see from the data directly from the CMDB what was the relationship between that application and where it's being deployed. In this case, it looks like it's got four JVMs. It looks like it's probably four different JVMs that this application is deployed to. So of course, we need to understand the aggregate of that. That's where that 27,781 number comes back. That's the total megahertz consumed by all of the JVMs that this application is dependent on.
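The link between the two views is an aggregation followed by price times quantity: sum the consumption of every JVM related to the application, then multiply by the unit price. The sketch below shows that shape. The application IDs, the per-JVM megahertz values, and the unit price are all made up, and the row layout is an assumption; none of these figures come from the talk.

```python
from decimal import Decimal

# Hypothetical CMDB rows relating JVMs to an application ID.
jvm_rows = [
    {"app_id": "app-uuid-1", "jvm": "jvm-01", "mhz": 1000},
    {"app_id": "app-uuid-1", "jvm": "jvm-02", "mhz": 2500},
    {"app_id": "app-uuid-1", "jvm": "jvm-03", "mhz": 1500},
    {"app_id": "app-uuid-1", "jvm": "jvm-04", "mhz": 3000},
    {"app_id": "app-uuid-2", "jvm": "jvm-05", "mhz": 900},
]


def consumed_mhz(rows, app_id):
    """IT view: total megahertz across every JVM related to the application."""
    return sum(row["mhz"] for row in rows if row["app_id"] == app_id)


def charge(quantity, unit_price):
    """Business view: cost = price x quantity, rounded to cents."""
    return (Decimal(quantity) * unit_price).quantize(Decimal("0.01"))


quantity = consumed_mhz(jvm_rows, "app-uuid-1")
cost = charge(quantity, Decimal("0.01"))  # made-up price per megahertz
print(quantity, cost)  # 8000 80.00
```

Keying the aggregation on the application ID rather than the name matters for the reason given next in the talk: the ID is the stable identifier across the ecosystem.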

I know it's this application because this is the secret sauce here: we are keeping a unique UUID for every application in the entire ecosystem. Without doubt, without fail, this is the unique thumbprint on record throughout our entire ecosystem. Of course, we do show the name for convenience purposes, but when it comes from a data perspective, that unique ID is the important bit. So you say, how did it get there? We're going to talk about it a little more. What we're looking at in the CMDB is that all JVMs for this one app showed up in the CMDB because of this one-time registration that was done in the CI/CD pipeline. And so let's look at that method.

I'm going to walk through the method a little bit. From an imaginary perspective, think of this as a new initiative. Someone in the business area has decided they need to fund or establish a new initiative. So they work with someone like myself, a technical architect, to understand. The architect makes a distinction, a decision: I can tell you that this is not a capability we currently offer, so a new capability is going to have to be constructed, a new application. So the architect goes and establishes a new application in application portfolio management. They work with the DevOps team, usually a solution architect or a tech lead type person, someone leading the development team, to establish the initial configurations for that. Think of this as a one-time registration process for entering our pipeline.

Before source code is checked in, or before first deployment, initial deployment, you can't join up without that tech lead or that solution architect establishing a one-time configuration. That one-time config has got that UUID, that unique ID, so that downstream, anytime that application deploys, all of the dependent CIs that we are actually going to be looking for, for example a container, that container namespace is automatically related to that application ID, regardless of the number of times it's actually deployed.
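The one-time registration and the automatic relation at deploy time can be sketched as a small in-memory registry. The class, its method names, and the namespace value are hypothetical; the real pipeline and CMDB integration are not shown in the talk.

```python
import uuid


class ApplicationRegistry:
    """Toy stand-in for the one-time registration plus CMDB relationships."""

    def __init__(self):
        self._apps = {}          # app_id -> application name
        self._relations = set()  # (app_id, ci_type, ci_name)

    def register(self, name):
        """One-time registration: issue the application's unique ID."""
        app_id = str(uuid.uuid4())
        self._apps[app_id] = name
        return app_id

    def record_deployment(self, app_id, namespace):
        """Relate the deployed container namespace to the application ID."""
        if app_id not in self._apps:
            raise LookupError("application is not registered; deployment refused")
        # A set keeps one relation no matter how many times the app deploys.
        self._relations.add((app_id, "container_namespace", namespace))

    def cis_for(self, app_id):
        """All configuration items related to the application."""
        return sorted(ci for a, _, ci in self._relations if a == app_id)


registry = ApplicationRegistry()
app_id = registry.register("deposits-app")        # made-up name
registry.record_deployment(app_id, "deposits-ns")
registry.record_deployment(app_id, "deposits-ns")  # redeploy, same relation
print(registry.cis_for(app_id))  # ['deposits-ns']
```

Refusing a deployment for an unregistered ID mirrors the statement that a team cannot join the pipeline without the one-time configuration.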

What that means is that we can pick up in the CMDB that relationship between application portfolio management and all of the federated and discovered configuration items that exist in the ecosystem. This is the point we were at a little while back, but what we've done now more recently is actually established an ability to collect the health, the consumption, the configuration items all in one location, in a data mart, so that we can run additional automated control monitoring, as well as the full technology business management calculations, which is where app TCO comes from, in one aggregated location. This is very powerful. This means we can answer a lot of questions much quicker. It eliminates islands of automation and eliminates manual handoffs between teams, and really establishes a holistic ecosystem between application development and support, architecture, and the business partners that are dependent on it.

Let's just talk a little bit about what's left for us to work on. This journey, like I said at the beginning, is not complete. We've laid the groundwork. We've implemented several controls, but we still have quite a bit of work to do to enhance the data that we're getting from our public cloud providers. We're finding ways of actually doing discovery or federation of that data, as well as ingesting financial data near real time. If you're not familiar with the term FinOps, it's something you can Google. It's basically a consortium that's a member of the Linux Foundation focusing on really maximizing the value you get and managing costs in your public cloud ecosystem. So we're going to bring that data in here in 2021 so we can start looking at studying that.

We're also going to expand control testing. Now that that framework is in place, it's just a matter of coding up as many rules as we can find to do testing in an automatic fashion. It should eliminate hundreds or even thousands of manual efforts to do that control testing, for auditors, for regulators, and for our internal needs to understand the health of our environment.

We're going to do a lot more around deepening the actual APM, application portfolio management, and technology business management insights. This is a lot where I'm going to focus for the next year. Now that we actually have that financial and technology ecosystem data coming together, what can we do with it? What are the insights we can derive from it? I want to try to find ways to automate or find patterns that we could automate recommendations around application rationalization or making architectural recommendations back to our business partners. That's getting pretty exciting. There's a lot of opportunity for machine learning there, as well as some more traditional data analytics.

And then we're going to do some more work around automating the actual data certification process itself so that we can absolutely be very assured that all of the data that we're looking at in this environment is actually accurate and complete.

With that, I just want to say that also hopefully what's next is I'm going to get to see you all in 2021. Maybe, hopefully, everything will be going better next year, and we can actually see each other in person in Las Vegas. One more time, my Slack handle is @bmmc. Feel free to look for that in the conference Slack channel, as well as if you miss me at the conference, feel free to reach out to me on LinkedIn. Thanks a lot.