Automated Change Management

Europe 2022

Executive Director Application Infrastructure · Morgan Stanley

Automate the change management process by using development activities and data driven automated assessment of risk of making a change to production, such that changes were automatically approved, resulting in both better delivery outcomes and reduced risk of change implementation.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

I am so excited about all the talks here at DevOps Enterprise this year. So our first speaker of the conference is Gus Paul, an executive director at Morgan Stanley, one of the largest financial services firms in the world. I can say with some certainty that in my decades of studying high performing technology organizations, I've never met anyone in banking like Gus. So when I met him, he told me about his 20 years at Morgan Stanley and his early days on the trading floor where traders and developers worked side by side, where quickly developing capabilities and even microsecond advantages could not only help their firm win, but more importantly to help the clients.

He talked so eloquently about how life changed after the 2007 financial crisis, when Morgan Stanley was one of nearly 30 banks designated as a systemically important financial institution. Those are the organizations that are sometimes called too big, too complex and too interconnected to fail. So all of those organizations are impacted, often creating an increased focus on controls, which in turn creates more processes, more rules, more approvals, which definitely changed the way that developers work. Those stories reminded me of that amazing poem from the Lego movie.

The world was once free and full of possibility, then came order, and after it authority, everything changed until nothing changed at all. So Gus is going to tell the incredible story of how a group of amazing and dedicated technologists worked to liberate over 15,000 software engineers across the firm from an ineffective change management process, to liberate their full creativity and problem solving potential, and by doing so improve the reliability, safety, and security of the control environment far better than what one could do with manual reviews alone.

Here's Gus Paul.

Gus Paul

Thanks Gene. As you said, I'm from Morgan Stanley. My name is Gus Paul. I'm a product owner in our Software Delivery Assurance squad, which is part of our DevOps enablement platform. I'd like to talk to you today about our journey from what we had as our change management process through to our new automated change management process, and hopefully give you some insights into what life is like in a huge financial institution. For those of you who don't know Morgan Stanley, we've been in business for almost a hundred years.

We split three ways: institutional securities, which is our investment banking, sales and trading, and advisory business, and our wealth management and investment management franchises, where we manage assets for our clients. For those of you in the US, you may have heard of a company called E-Trade. If you haven't heard of E-Trade, they were the original Robinhood. That's where you go if you wanna trade your securities as an individual customer. Wealth management is for people with lots of money. We split our revenues basically 50/50 between institutional securities and our asset management businesses.

We have global scale, we are on every continent, and as you can see, our institutional securities business is split fairly evenly between those different regions. So, a huge business like this, with over 76,000 employees: what does technology look like at that business? Well, we've got over 15,000 people working in technology. We've got three and a half thousand applications and systems, and those systems are processing up to 10 billion transactions every year. The critical things about Morgan Stanley technology are volume processing and time to market efficiency.

We have trading algorithms that require sub-microsecond response times to give them that little bit of an edge in the market. Because of this, Morgan Stanley adopted technology aggressively in the early nineties. Back then there was no cloud, and there was not a lot of stuff that you could use. So we hired a bunch of smart engineers and we built a lot of things that are still with us today: server management, binary artifact distribution on a global basis, our market data plant for providing price feeds to our traders.

All of those things are bespoke to Morgan Stanley and have given us a significant competitive edge over the years. It was fun being a developer when I started 20 years ago. You would sit on the trading floor, with people you'd work with every day. If there was a problem, you roll it out, but that was part of the fun. You enjoyed being able to enable the business to work faster. When we became a bank holding company in 2008, there was a subtle change at first, but we increasingly ended up with more and more processes and procedures.

Some of them were completely understandable. We had to be able to document what we were doing and why. But with some of the interpretations of those processes, I was convinced there was a better way of doing it. What benefit was it to the firm that I had to go on a call every day at 2:00 PM to tell somebody what I was gonna do the next day, only for them to say yes every time? It seemed like we could find a better way of doing that. In 2018, our CIO at the time decided we needed to get more on board with this fancy new cloud thing that everyone else was using.

So we started an agile and cloud transformation. I know sometimes those transformations with a capital T get a bad reputation, but I think we had a good balance between teaching people the foundational subjects (there's lots of lifers like me at Morgan Stanley who had no idea what Agile and DevOps were) and then, once they have those foundational skills, letting them adapt and run their teams the way they think is appropriate for their business area. When we started that, DevOps was seen as a critical enabler.

And, as with most things at Morgan Stanley, we have a big hinterland of people working in tech who are asked to volunteer for these new initiatives. When the volunteer call went out for DevOps, I couldn't wait to get involved because I knew that there was a better way of doing change management. We started off the first year of transformation just teaching people what DevOps was. It was around the time Accelerate came out, and it was really useful to have the DORA metrics to base our conversations on. Because we already had a pipeline that we used to build our applications, we were able to use that as the starting point for increasing people's automation and making sure that they knew that it wasn't just about speed, it was also about efficiency and reducing risk.

That went well for the first year or so, and we decided to turn DevOps into a strategic program at Morgan Stanley. From that point onwards, and we continue today, we've had three key focus areas. Accelerating software delivery and deployment: that's the one where we talk about how we can improve automated deployment and automated testing. Morgan Stanley has many different endpoints for doing web services, for doing binary distribution, for doing all kinds of things. Those are all custom to Morgan Stanley.

We need custom tools to be able to talk to those. Increasing predictability, frequency and quality of change: that's one of our three key areas, and I'll talk to you about that in a second. And then revolutionizing how we operate technology: SRE opened a lot of eyes at Morgan Stanley. In our institutional business we always had a very close relationship between our developers and our operations people. I was never in a situation where we had to give our release to somebody else to deploy it.

We deployed our own stuff, but there was definitely a divide between the guys who had to sit on the trading floor every day and listen to the support queries from the traders, and developers, sometimes in other regions, who were not quite as close to that cutting edge. How could we bring that back together? The SRE principles seem to fit, and we've been on that journey for the last three years. But back to change management: why do we bother with change management? We did a developer survey, and it came out with the result that I wasn't the only one who thought that our change process needed some work.

One engineer took it on themselves to document all the things they had to do to get one line of code into production. Three different JIRAs, a change ticket, 81 individual steps just to get that paperwork created and get it approved. You can have up to seven or eight different approvers on that, usually senior people in the business or senior people in IT. They have better things to do than approve these change tickets, surely. Four hours of effort over two business days just to get the permission to put something into production.

I really took heart from the part of Accelerate, in chapter seven, where they talked about how change approval, change advisory boards and manual approvals are not correlated with better outcomes when it comes to risk or efficiency. In fact, having a change approval process is worse than having none at all. That definitely rang a bell with me. So what else was wrong with our change process? Well, it was manual. Our tooling is very old. We don't use ServiceNow; we built our own change management system, because we're Morgan Stanley. And we had many different approvals that were required.

When we started, there were only one or two approvals, but we added a few more over the years, when individual events resulted in the call going out: we need more oversight, so let's add another approver. That approver stayed forevermore, and we kept adding additional bits and pieces to the change management process. On their own, they all sounded fine, but as we continued to scale up and increase the amount of change we were doing, it was really slowing down the change process.

We also, after we became a bank holding company, had to do a software delivery lifecycle procedure that came in after the change management procedure and actually duplicated large parts of it. Things like testing approval and approval of the thing you were deploying were repeated in both processes. One thing that came in about five years after we started this more formal change approval process was that we needed something better than just a risk assessment that said low, medium, and high.

So we put in place a process where you had to answer eight questions, but those questions were filled in by the person creating the change. They became so routine, because you're doing so much change, that you answered them the same way every time. So they were not contextual. They were not necessarily providing the right level of risk assessment. They weren't bad, but surely we can do better than routinely filling in the same questions. And then we were behind the curve by this point.

We did that in the era of ITIL 2. Now we're at ITIL 4. ITIL, for better or worse, is a good way of managing our IT service management function. We needed to make sure that we adopted the best practices from there while still being able to be as efficient and as risk reducing as possible. And finally, it was time to make sure that change wasn't seen as this spooky barrier by a lot of developers.

Quite often developers were not necessarily aware that it was the change process that was causing some of their behaviors. Some teams over-interpreted the rules. We had to make them simpler to follow so that everyone could benefit. What kind of volume are we talking? As the slide said, we've got three and a half thousand systems, over two and a half thousand of them software systems. This chart is showing you the volume of change we do every year, from 160,000 in 2019 to 175,000 last year.

A small dip in 2020. But if you look on the right hand side, the percentage of software change keeps increasing every year. This is a trend we definitely expect to continue 'cause we're gonna start doing more infrastructure as code, more cloud deployment. Those are still early days in Morgan Stanley. How are we gonna cope if we have this same onerous change process? We also have lots of different change restrictions that we put in place that we don't have good ways to make granular enough to allow 'em to be effective.

How can you do a change on the weekend, if you are an infrastructure team, when every weekend is restricted for one reason or another? Because of this onerous process, we saw a lot of people batching up their changes into larger and larger releases. And because the releases were then so large, we've gotta do it at a time when we're not gonna risk the business being unstable. At Morgan Stanley, we don't operate 24 by seven apart from E-Trade; the rest of the business, particularly the big volume businesses, doesn't operate on Saturdays.

So a lot of our changes got pushed to the weekend because that was the time when we thought it was easier to recover from 'em if things went wrong or the change itself was complicated. Now, some of that is 'cause of the architecture of the systems, but there's also no incentive to change the architecture, because you still have to suffer the same change process whatever architecture you have. How can we make this better? First of all, we focused on the SDLC side of the equation that we had alongside the change management process with its flaws.

We had an SDLC process where we were asking developers to commit their code to source control, to use Jira for their requirements, and to make sure they were testing stuff in a pipeline. And then we'd get to the end of that process and have someone approve all those things all over again. Why do we have to do it twice? Let's fix the SDLC. People are already approving the requirements in their squads. People are already approving the code because, for the most part, everyone was using the pull request workflow in Git.

People were already generating test cases. They had the results from their automated testing system. Does it really need someone else to say, yes, I approve that this log file says all the tests pass? Is the log file itself not evidence enough? So we were able to change the pipeline and shift left. For SDLC, we reduced the approvers from the senior people who were approving at the end to the people in the squads who are working together, approving each other's work.

This crucially meant we were able to keep the separation of duties that's required. I've seen discussions on Twitter where people are like, oh, even with pull requests, you shouldn't do that, it's constraining on the team. Unfortunately, we do end up ultimately working in a regulated environment. Separation of duties is critical for the things we do, but separation of duties only has to be one person. It doesn't have to be five. So let's see how far we can get. If the pull request and the requirements are approved by a human, can we automate the rest of the process?

So with that in place, we were ready to start tackling the change management side. With change management at Morgan Stanley, as I said, because we have such a complicated process, we were actually able to divide lead time up into subcategories of lead time, believe it or not. The lead time, from the time the story is marked in progress in Jira to the time it was deployed, was further broken down into the delivery lead time. That runs from the point when the change was committed into source control to when it was released in production.

We broke that down further because there was enough of a gap between these pieces. So there's the delivery prep time, which is the time from when the code was committed to when it was built and ready to deploy, and then the change approval lead time: everything is finished, it's ready to deploy, we've just got to get the paperwork done and then push it out the door. We wanted to know, from this value stream mapping exercise, whether we would actually see any benefit if we fixed the change management process.

The answer was yes, because on average from our data, we saw that it was three and a half days to get a change approved in the change approval lead time, but only half a day after that approval happened for it to be deployed. Now, quite often that's probably because people were getting to the point where they wanted to deploy their change and then saying to their boss, oh, quick, can you approve that ticket for me so I can get it out the door?
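As an editorial illustration of the lead time breakdown described above, here is a minimal Python sketch. It is not Morgan Stanley's tooling: the `ChangeTimeline` fields, the function name, and the example timestamps are all invented, and splitting the final stage into "build ready to approved" and "approved to deployed" is an assumption chosen to match the three-and-a-half-day and half-day averages quoted in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeTimeline:
    """Timestamps for one change. All field names are hypothetical."""
    story_started: datetime    # story marked "in progress" in Jira
    code_committed: datetime   # change committed to source control
    build_ready: datetime      # built and ready to deploy
    approved: datetime         # change ticket fully approved
    deployed: datetime         # released to production


def lead_time_breakdown(t: ChangeTimeline) -> dict[str, timedelta]:
    """Split the overall lead time into the sub-intervals named in the talk."""
    return {
        "lead_time": t.deployed - t.story_started,
        "delivery_lead_time": t.deployed - t.code_committed,
        "delivery_prep_time": t.build_ready - t.code_committed,
        # Assumption: approval lead time ends when the ticket is approved.
        "change_approval_lead_time": t.approved - t.build_ready,
        "approval_to_deploy": t.deployed - t.approved,
    }


# Invented example that mirrors the quoted averages.
example = ChangeTimeline(
    story_started=datetime(2022, 3, 1, 9, 0),
    code_committed=datetime(2022, 3, 3, 9, 0),
    build_ready=datetime(2022, 3, 3, 10, 0),
    approved=datetime(2022, 3, 6, 22, 0),   # 3.5 days after build_ready
    deployed=datetime(2022, 3, 7, 10, 0),   # 0.5 days after approval
)
breakdown = lead_time_breakdown(example)
for name, value in breakdown.items():
    print(f"{name}: {value}")
```

In this made-up example the build takes an hour while the approval wait takes three and a half days, which is the kind of imbalance the value stream mapping exercise was looking for.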

So that was more evidence that we needed to fix this process, and that we could maybe get three days back, potentially, on average from each team. Some of that batching, as I said, is due to architecture. Some of that batching is to do with the business requirements of changing things when the business is not open. But we were confident that we could find a way to get that down and get back to a point where we didn't have such a big chunk of our time spent chasing approvals for changes.

So how are we gonna do that? We looked at the different bits of the change process and wondered, what could we automate here? So we have the SDLC, as I said, that's taking care of the code approval and separation of duties. Then we have the risk assessment. Of those eight questions, most of them were to do with things like: are you deploying this in a repeatable way? Do you have a good sense of the risk of the system you're deploying it to?

Do you have a good sense of the backup procedures? Couldn't we automate some of those things? We know how risky the system is. That's metadata. We have a history of how many incidents this system has caused related to change. We can track that. We know how this deployment is done. We can tell the difference between an automated and a manual deployment. We have the evidence from the SDLC that it's been tested. Let's feed all of that data into a risk calculation and we'll give you a score back.

We decided to make it a number rather than just low, medium, and high because that would allow more differentiation. We decided that the more points you'd get, the more risky your change was. There would be a certain threshold, which we'd decided we'd work out from looking at the data. If you were under that threshold, you could go down this automated change approval path. It wasn't no approval. It was automated approval: the systems decided that you were low risk.

If we could strip out all those low risk changes, then the really risky changes could get the focus they needed from the human approvers, who could drill into what was happening. At the moment, you couldn't see the wood for the trees. Every change went through the same process. How can you work out which ones are higher risk and which ones you need to focus your time on? This would give us increased confidence in our risk assessment, which would allow us to do the automatic approval, which would then allow us to push these things out without having to go through many different change approval boards or change advisory boards.

Later on, we'd go back and fix the normal change process; even there, we don't need five approvals. Let's work out the best way to get the next approval, but that's for later. So what things did we think would make up this automated risk assessment? There's a whole list of things here on this slide. Just to highlight some of the things we thought were important: the quality of your SDLC release, and other things that came from industry statistics as well as our own data analysis.

Things like the size of the change and how long it takes you to do the change: the longer it takes, the more likely it's complicated. Are you automating your execution? Things with manual steps are usually more likely to go wrong. What's your previous history, both of incidents and deployment failures? That's likely to be a leading indicator that you might have a problem again. And what's the risk of the system you're changing? High risk systems will have more impact if you cause an incident.

Remember, our whole basis here is: are we gonna have an incident from this change? Not is it a good change, not is it gonna make money. Just: are we gonna cause stability problems or incidents when we make this change? Because if we can eliminate that risk, then if there's a business problem, we'll just turn around and do another change to fix the business problem. We don't have to wait another week to do that. So that was our hypothesis. We built a whole matrix of the things we thought were important, and the score weighting we'd apply to each one.
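To make the idea of a weighted risk matrix with a threshold concrete, here is a minimal Python sketch. The factors follow the ones named in the talk (SDLC evidence, change size, duration, automated execution, incident and deployment-failure history, system risk), but every field name, weight, cap, and the threshold value is invented for illustration; the talk does not disclose the real scoring rules.

```python
from dataclasses import dataclass


@dataclass
class ChangeFacts:
    """Inputs to the score. All names and scales are hypothetical."""
    sdlc_evidence_complete: bool    # requirements, review and tests recorded
    files_changed: int              # proxy for the size of the change
    duration_minutes: int           # how long the deployment takes
    automated_deployment: bool      # no manual execution steps
    recent_change_incidents: int    # past incidents caused by changes here
    recent_failed_deployments: int  # past deployment failures
    system_criticality: int         # 1 (low) to 3 (high), from metadata


# Invented threshold: a score under this goes down the automated path.
AUTO_APPROVAL_THRESHOLD = 40


def risk_score(c: ChangeFacts) -> int:
    """More points means a riskier change. Weights are made up."""
    score = 0
    if not c.sdlc_evidence_complete:
        score += 30
    if not c.automated_deployment:
        score += 20
    score += min(c.files_changed // 10, 10)
    score += min(c.duration_minutes // 30, 10)
    score += 10 * min(c.recent_change_incidents, 3)
    score += 5 * min(c.recent_failed_deployments, 3)
    score += 5 * (c.system_criticality - 1)
    return score


def approval_path(c: ChangeFacts) -> str:
    """Route a change to automated or manual approval by its score."""
    return "automated" if risk_score(c) < AUTO_APPROVAL_THRESHOLD else "manual"


small_change = ChangeFacts(True, 5, 10, True, 0, 0, 2)
risky_change = ChangeFacts(True, 5, 10, False, 1, 0, 3)
print(risk_score(small_change), approval_path(small_change))
print(risk_score(risky_change), approval_path(risky_change))
```

Because the inputs are facts about the system and the pipeline rather than answers typed in by the person raising the change, the same change always gets the same score, and a team can see which factor to improve to bring its score down.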

We had over 200,000 changes every two years. We could use that data, take the rules we'd got, apply them back to that data and ask: do we think this is gonna work? What that meant was we had confidence in the data structure and the data thresholds we built, because we found some things that were not relevant to the scoring and we found other things that were very strongly correlated.

The ones that were strongly correlated, we were able to weight higher than the ones that were not as strongly correlated. That meant we were fairly confident we had a good range of things that would allow us to reduce the impact of changes if they went wrong. But Morgan Stanley's a big place. There's a lot of people who've been here longer than me. There's people who've been here less time than me. Some people have never done anything other than traditional change management, lucky them. That meant we had to have a process where we could both show that the system was working as we expected and overcome the resistance of people who were still convinced that manual approval was always gonna be better than having a machine do it for you.
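One simple way to check candidate factors against historical changes, as described above, is to compare incident rates across the values of each factor. The Python sketch below shows that idea only; the record layout, the factor names, and the eight toy records are invented, and the talk does not say which statistical method was actually used.

```python
from collections import defaultdict


def incident_rate_by_factor(history, factor):
    """Return {factor value: share of changes that caused an incident}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [changes, incidents]
    for record in history:
        bucket = counts[record[factor]]
        bucket[0] += 1
        bucket[1] += int(record["caused_incident"])
    return {
        value: incidents / changes
        for value, (changes, incidents) in counts.items()
    }


# Toy records, invented for illustration only.
history = [
    {"automated": False, "other_factor": True, "caused_incident": True},
    {"automated": False, "other_factor": False, "caused_incident": True},
    {"automated": False, "other_factor": True, "caused_incident": False},
    {"automated": False, "other_factor": False, "caused_incident": False},
    {"automated": True, "other_factor": True, "caused_incident": False},
    {"automated": True, "other_factor": False, "caused_incident": False},
    {"automated": True, "other_factor": True, "caused_incident": False},
    {"automated": True, "other_factor": False, "caused_incident": False},
]

automated_rates = incident_rate_by_factor(history, "automated")
other_rates = incident_rate_by_factor(history, "other_factor")
print(automated_rates)
print(other_rates)
```

In this toy data the incident rate differs sharply between manual and automated deployments, so that factor would earn a high weight, while `other_factor` shows the same rate either way and would be dropped from the scoring.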

So what we did was, first of all, we built the risk service. We had a system where we calculated the risk score based on the different inputs, and we said, let's roll that out first, because then everyone can find out what they need to do, how they need to build it, and how they can potentially change their system to get underneath the thresholds. One senior person said to me, I don't understand these rules. You've set it up so people can easily game these rules and get under the score.

I'm like, yeah, that's exactly the point. We want people to game this, because with the rules we've set up, if they game it, their system will be lower risk. That's exactly what we wanted to happen. Then we started a pilot with the actual change process from end to end. First of all, we did it internally with our own teams. That was used as proof to say, can we start a pilot with the wider teams? You know, Morgan Stanley's a regulated environment. It's not easy to vary things like change procedure without having new procedures in place first.

This way, the organization worked with us and said, yes, we can give you a risk exception to do a pilot, and then that can inform whether we go forward with the full thing. So we did a pilot. It took us six months, in the second half of last year. We did 1500 changes across 58 different systems. There was not one change related incident, not one. Our average is about 1.5%. There was significant improvement in lead time, even better than we thought there was gonna be.
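For a sense of scale on those pilot numbers, a quick back-of-the-envelope calculation in Python follows. It is an editorial illustration, not an analysis from the talk: it assumes each of the 1500 pilot changes would independently carry the roughly 1.5% baseline incident rate, which probably overstates the comparison, because pilot changes were the ones already scored as low risk.

```python
baseline_incident_rate = 0.015  # "about 1.5%" of changes, per the talk
pilot_changes = 1500

# Incidents you would expect at the baseline rate.
expected_incidents = baseline_incident_rate * pilot_changes

# Chance of seeing zero incidents if every change independently
# carried the baseline risk (a simplifying assumption).
p_zero_incidents = (1 - baseline_incident_rate) ** pilot_changes

print(f"expected incidents at baseline: {expected_incidents:.1f}")
print(f"probability of zero incidents: {p_zero_incidents:.1e}")
```

At the baseline rate you would expect roughly 22 incidents from 1500 changes, and zero incidents would be vanishingly unlikely under that assumption, so the pilot result is hard to put down to luck alone.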

And we reckoned about $10,000 in savings a month, just in approvals, from doing this work. So then we released it to everybody. This year we've seen increasing adoption. We're up to about 10% of all software changes now being done by this process. We're hoping for 25% this year and then maybe 40% next year. But just to emphasize, this is not just about going fast. You know, you get a lot of people who say, oh, developers, you just don't wanna have a change call because you wanna just do it and not have any oversight.

We also strongly believe this is about reducing the risk. We have two examples from the pilot. First of all, there was one of our critical systems in terms of compliance. The product owner found a bug at 5:57 PM. By 6:42 PM, less than an hour later, that change was fixed and in production. This is a system that used to take two weeks to get a change out the door. It didn't have to go through any exception process, no emergency change, no break the glass. Totally legit, fully automated: committed the code, code review, boom, in production.

Similarly, there's another system, a much bigger system, one of our key infrastructure components. They did a regular change the old fashioned way, where they rolled it out and they did all the paperwork. They saw there was an issue with that change that, if they'd left it untouched, would've resulted in an incident of increasing severity the longer it went on. But because it was a small fix, they were able to use systematic change to get that fix out the door, again without any break the glass, without waking people up in the night to approve things.

And they were able to resolve that incident before it became a big incident, which was obviously beneficial. Sometimes you can't show the opportunity cost for these things, right? You don't necessarily see the incident that didn't happen. You can't prove to me that this was the thing that helped. Here we have evidence that it prevented a bigger incident. And then the stats were even better than we expected, right? So this is from the pilot. The blue bars are the old fashioned way. The orange bars are the systematic change way: significantly improved delivery lead time, significantly reduced deployment size, significantly increased frequency of deployment, and many more changes done on weekdays.

That was critical because we had quite a large morale issue with our developers, who were fed up with doing releases on Saturday morning. They wanted to do it when they were in the office. Why can't we do that? Key to that was that a lot of these systems had to adopt zero downtime releases. But that's part of the architecture discussion. There's probably another talk in that. And then, was this broad based, or was it just one system that did all those changes and made the numbers look good?

No, almost three quarters of the systems in the pilot showed improvement in these metrics: the delivery lead time, the deployment size, and the deployment frequency. I don't necessarily like to say I told you so, but I was pretty pleased with these numbers because they really bore out the impact we thought this could have. And then, almost as important, is the sentiment. As I said, I used to be a developer. In my heart, I'm still a developer, but I wanted to make sure that people saw the value in this.

These are the kind of quotes we were getting. Things like the ability to move faster and in a more sustainable way. People saying this is the best thing that happened to them in 20 years at the firm. People who felt like they were working with a different company, and even somebody who felt this was just an example of us implementing exactly what happened in The Phoenix Project. Those kind of quotes really gave us heart that we were doing the right thing.

And it's created its own culture of increasing adoption, because people are seeing these things shared by other teams and they want a piece of it too. And then how is this impacting the business? Well, these are quotes from our overall DevOps program and how it impacted particular businesses. In wealth management, there was one non-technical product owner. They were very skeptical about doing things more frequently. They thought going faster means higher risk, surely. But we were able to show them that actually going faster with smaller releases more often is a reduced risk because it's less complex.

In algorithmic trading, the key is time to market. The faster you can get something of benefit to the customer into production, the faster the customer benefits. That ability to iterate rapidly unlocked what I used to do when I worked on the trading floor: being able to push stuff out very quickly, even the same day, to benefit those clients. And then for one of our big IPO clients, we had to build a tiny system, 'cause they wanted to run the IPO in a specific way. That was built with automation from the start.

And it was revolutionary to the investment bankers, who'd been used to working with the old tech, how fast and how easily it could be built and rolled out to enable that client business. As usual, Gene always asks us to say what help we're looking for. Big-cap financial services firms: there's only a few of us. We're trying to work together to see whether there's a way we can present a consistent view of this stuff.

You know, we went through that internally, overcoming people who thought manual approval is better than automated approval. Can we present that consistently across all our different organizations to external regulators and say, look, we've got the results that prove this is good? We are working together through an organization called FINOS, which is the Fintech Open Source Foundation. Many big firms are part of it. If you're a financial firm, you probably are a member even if you don't realize you are.

We've got a working group in there called DevOps Mutualization. If you wanna get involved with that and you're a member of FINOS, let me know. If you're not a member of FINOS, I'd still love to hear from you about what you are doing around automated DevOps in a regulated environment. Our risk assessment: can we make that better? It's targeted at software changes right now. What things do we need to think about when we're talking about infrastructure changes? We're looking at a number of vendor products that could potentially do machine learning on this.

So rather than being a static set of rules, every change gets its own assessment that's contextual to that change. If you're doing something in that space, we'd love to hear from you. And then the audit trail: we're getting better at this, but we still have a number of different tools involved in this pipeline. It's still hard to present an end-to-end picture simply and consistently to our auditors. If you've got anything working in that space, or you've solved that problem, I'd love to hear from you as well.

So that's it. I hope you enjoyed this talk and found it useful, and I'd love to hear from you if you're going through something similar.