Log4j Hygiene
Investigating Log4j vulnerabilities within the Morgan Stanley software real estate.
Leveraging various tools to scan and provide reports for remediation. The ideas and tools are neither Log4j-specific nor Morgan Stanley-specific.
Full transcript
Paul Fox
Welcome to this presentation on the Log4j vulnerability and the remediation exercise that has been taking place at Morgan Stanley since the middle of December 2021. During the course of this exercise, I estimate about 40,000 pieces of work were executed.
Quick table of contents: who am I, Morgan Stanley, some information about Log4j in case you're not aware of what it is, some comments by Gene Kim when we had a pre-presentation discussion, and a little bit of the story as it unfolded internally, along with many of the interesting questions and issues that arose during the exercise of discovering data and remediating.
Who am I? I've been at Morgan Stanley for more than 20 years working in production, plant engineering, configuration management, many tools, and really with an interest in root cause analysis: why do things go wrong? And when something goes wrong, what's the chance of it happening again and again? It fits quite nicely and squarely with things like vulnerability management and software quality.
I work with many people solving some of these problems, trying to understand them, helping people out with career aspirations, talent spotting. So hopefully, a general all-rounder.
I have an interest in software quality, security, licensing, standards. They're all closely related questions about software. What is it made of? What are the components, and how do they apply to the application? Such a vast software real estate in the organization opens itself up to a lot of data mining, with some very interesting patterns becoming apparent due to the many years and many types of software that exist in such a large organization.
And really, all of these things eventually culminate in some of the big-ticket items. So Struts from a few years ago, the big Equifax issue: a piece of open source. Now, as of a few months ago, Log4j. And these are escalating in terms of number, frequency, and severity: H2, Spring4Shell, and others. So with all of this experience, all of these tools, you'd think it's got to be easy. It's not as easy as it would at first appear.
Before we go into the details, let's talk about Morgan Stanley. It's an investment bank: more than 100,000 employees, 20,000-plus developers, 25 years of software, new, old, legacy, pretty much every technology ever. And over the last many years, it has embraced continuous integration pipelines, DevOps, Agile, all of the buzzwords to help improve software quality and improve throughput and velocity.
We have our own bad practices, like everybody. We're an attack victim. The nature of our business: lots of data and lots of dollars. We're proud of what we do, like everybody, and billions of dollars of trades move through all the machinery every day. That's millions of trades and so much complexity.
So this means we've got a lot of diversity in terms of software. Generally, people don't like to work on legacy. They like to work on the latest and greatest things. Developers want to develop, not fix things, and more use of open-source industry standards, which leads to more surface area.
So as technologists, we consider ourselves to work for an IT company that happens to sell stocks and shares. As a business, the traders think of us as a bank, and IT is just a tool. So we have a sort of diversity of opinion or approach to doing things. And cyber is happening more often with more impact, more eyes, more effort by the bad guys. We, like all organizations, are a target. We don't want to be in the headlines for the wrong reasons.
Just a brief recap on Log4j, if you're not aware of it. It's a hugely popular logging library used by many or most Java applications, and it has a long history. The recently publicized vulnerability got the maximum score from the people who rate and grade these things. It allows remote code execution, which basically means if you've got an external web-facing Java application, a user could sit there and type in the right magic characters in order to download code and execute it on your systems.
Obviously not a good thing to have happen, and because many Java applications tend to be web-based services, that is even worse. It sat in the wild for years until it was uncovered, and so before publication, I guess the bad actors were busy writing the tools to start scanning the world and attack or investigate every website in the world. Every organization, big and small, worked very quickly after the public announcement to investigate or remediate or block these attacks.
The attack itself turned out to be very easy and had very little in terms of precondition. So if you read the internet guides on this, it shows how simple it is to do this. So the mere presence of the Java file in an application is enough to presume that the application is at risk of attack, i.e., it's guilty.
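As a rough illustration of "presence implies risk", here is a minimal sketch of a presence scanner. It is not the tooling used at the firm. It assumes that finding the `JndiLookup` class of log4j-core 2.x inside an archive is a good enough marker, and it looks inside nested archives (a JAR inside a WAR, for example). The function names, the suffix list, and the depth limit are all hypothetical choices.

```python
import io
import zipfile
from pathlib import Path

# Class implementing the JNDI lookup in log4j-core 2.x. Its presence is used
# here as the marker for "Log4j 2 core is bundled in this archive".
MARKER = "org/apache/logging/log4j/core/lookup/JndiLookup.class"
ARCHIVE_SUFFIXES = (".jar", ".war", ".ear", ".zip")  # assumed list, extend as needed


def find_marker(archive, label, max_depth=5):
    """Return labels of every (possibly nested) archive that contains MARKER."""
    hits = []
    try:
        with zipfile.ZipFile(archive) as zf:
            names = zf.namelist()
            if MARKER in names:
                hits.append(label)
            if max_depth > 0:
                for name in names:
                    if name.lower().endswith(ARCHIVE_SUFFIXES):
                        # Read the nested archive into memory and recurse.
                        nested = io.BytesIO(zf.read(name))
                        hits += find_marker(nested, f"{label}!{name}", max_depth - 1)
    except (zipfile.BadZipFile, OSError, RuntimeError):
        pass  # unreadable, corrupt, or encrypted archive: skipped in this sketch
    return hits


def scan_tree(root):
    """Walk a directory tree and report every archive that bundles the marker."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in ARCHIVE_SUFFIXES:
            hits += find_marker(path, str(path))
    return hits
```

Calling `scan_tree` on an application directory returns one entry per archive that bundles the class, with `!` separating nesting levels. A hit only says the library is present, not that the application is exploitable, which matches the "guilty until shown otherwise" stance described above.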
So the Log4j story within Morgan Stanley, I like to think of it as having a beginning, a middle, and an end. It started for myself one Friday morning with some emails that were coming in and me trying to figure out what I'm going to do for the day. The subsequent days and weeks were a lot of concentrated work by many teams and people collecting data, interpreting it, getting the message out, and getting accountability for application remediation.
And then as we get to the end of the exercise, interesting things come around. At some point, the company wanted to take a firm decision that we were going to destroy the copies of software in the organization, knowing that nothing was using it. So that required a lot more analytics and discussion in order to figure out: are we ready? Have we got all the data and signs that we need?
I had a conversation with Gene Kim before doing this presentation, and it was an interesting conversation we had because I guess to an outsider, renovating or remediating this issue is really very simple. There is a vulnerability in the library, just fire off an email to everybody and tell them to go fix it.
I said it wasn't as simple as that. That's the high-level goal. You want people to go off and fix it. So how many ways are there for it to go wrong if you did happen to take that approach? Maybe in a small organization that's viable, but in a very large organization, you really want to be confident that you've covered all your bases, all applications are properly remediated, investigated, and that you don't have legacy sitting in the corner somewhere that everybody's forgotten about, that is open to attack.
So do you know all the applications you have? Do you know who owns them? Who's accountable? In a large organization, that data is changing as people come and go, and all sorts of things happen. How many are going to understand the email and actually respond? People are busy doing their day job. You're now asking them to do some high-priority thing, and for all of us on the exercise, it's a lot of learning to understand the nuances.
Whatever the answers are that may come back, do you trust them? So an example is an application team may say, "We looked at our code. There is no Log4j." Do you trust that answer? The answer for myself is no. If the data, the tools, the probes that exist in our organization demonstrate that it is using the technology, even if the application owner doesn't believe it is, we need to trust the tools and the data that's being collected.
Whether you use Log4j or not, it's not necessarily obvious. The layering of software components and libraries can mean that maybe one of the components that you're using itself is using Log4j, and you need to remediate that component. So it's either a direct dependency or an indirect dependency or transitive dependency. That can get quite complicated for application owners who may not actually be very familiar with their own application.
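To make the direct versus transitive distinction concrete, here is a small sketch that searches a dependency graph for the chain leading from an application to a given library. The graph, the component names, and the function name are invented for illustration. In practice the graph would come from build metadata.

```python
from collections import deque


def dependency_path(graph, root, target):
    """Return the shortest dependency chain from root to target, or None."""
    queue = deque([[root]])
    seen = {root}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(path + [dep])
    return None


# Hypothetical dependency graph: each key lists the components it depends on.
graph = {
    "trade-app": ["reporting-lib", "http-client"],
    "reporting-lib": ["pdf-toolkit"],
    "pdf-toolkit": ["log4j-core"],
    "http-client": [],
}

path = dependency_path(graph, "trade-app", "log4j-core")
print(" -> ".join(path))
# prints "trade-app -> reporting-lib -> pdf-toolkit -> log4j-core"
```

A chain of length two is a direct dependency; anything longer is transitive. Showing the owner the full chain tells them which intermediate component they actually need to upgrade.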
Lack of proof means you're still susceptible to an attack. The bad guy doesn't care how good or bad your internal processes are. They can just fire off a script and demonstrate that you've got an application weakness. We don't want to be in that situation where we believe we've done a good job and provably we haven't.
Tracking hundreds or even thousands of remediations based around some honor system has too many weak spots. People report results to the best of their knowledge, and that's why you need something better to keep track of everything going on. Occasionally you need to re-question everybody because maybe some of the boundary conditions changed.
So in the beginning, it was a quiet Friday. I'm getting ready to wind down for Christmas, and an email arrives asking for details on applications using Log4j. I delved into my tool and sent out a quick reply with this link: "Help yourself." And then another email arrived and another. So I started thinking, "Hmm, by the time multiple emails are coming in, that tends to indicate high severity," and started reading up on the issue. And yep, it was a big one.
Fundamentally, the questions that everybody is asking at the beginning of this exercise are: what is our exposure? Which applications are impacted? And the answer, generally speaking, is all of them or most of them, which is a very vague answer. And that is not a good answer. Unless you can answer the question very scientifically, it's going to be a bad day, bad week, bad month.
Now, most of what I'm talking about here is a personal view, but behind this is teamwork, cross-silo, cross-division, getting anyone and everyone to do whatever it takes to get results.
We have two key technologies in the firm which can help answer these questions. One is called AFS, the Andrew File System. It's a globally distributed network where all the applications reside. And we have our continuous integration system called Train, where all software is built and deployed.
And I've got a footnote there that the word "all" does not mean whatever you think it means. It's nice to say all of our software is built on an SDLC-approved continuous integration process. But not all software is built by Morgan Stanley. We have external vendor applications, legacy applications, things that just are not doing the right thing. So we need to be cautious when we consult these systems. And I've deliberately simplified in mentioning AFS and Train because we have other pieces of technology for deploying applications, such as Docker, which adds more complexity to the question and answer.
So it was Christmas or nearly Christmas, and the meetings had started. The initial meetings were getting lots of people together to try and understand the gravity of what this issue is. What are the first steps in remediating? I would describe the beginning of an exercise like this as very opaque. We don't understand the true impact of what the vulnerability is. We don't understand how many applications. We are probably all guessing how long it's going to take to remediate the one, the tens, the hundreds, the thousands of applications.
So many meetings started happening within the first few hours and probably for the next week: regular meetings, status updates, what are we doing, what have we collected, where are we going? And that built up a strong hierarchy and a strong sense of presence by those concerned: senior management, the cyber team, the hunt team, the software developers, and many other people.
In looking at all of the applications, the primary focus was on anything that's external facing. Anything that can be attacked from morganstanley.com is a very high priority. And anything that is a vendor application, something not built by Morgan Stanley, would also be of high interest because somebody needs to reach out to them and find out what they are doing. Are they issuing a patch? So suddenly a lot of work was going on.
Whilst most software in the firm is built internally, sits on our CI system, and we have data catalogs that let us know which applications may be impacted, I decided to go off at a tangent and start looking at the external applications and to see what evidence existed to demonstrate they're using Log4j.
So this leads to a series of questions, really. What is a vendor-based application? What is a vendor system? How do we catalog these? Are they correctly cataloged? What can we tell by inspection of the applications?
Generally speaking, we do a good job of cataloging the applications because that data is used in so many different ways internally. But if an external application was not marked as an external or a vendor application, that might mean we skip over it. We may not notice that we don't have the metadata to tell us whether it uses Log4j or not. We wouldn't know what to do. So that's a very important part of the exercise: are the catalogs up to date? Are they trustworthy?
In starting to look at the vendor applications, there were definite signs of the offending versions of Log4j, so that was very useful and confirmed that we could detect them. What we couldn't tell was whether the application was actually vulnerable: merely using the software doesn't necessarily imply the application is vulnerable, but there's a very strong likelihood.
In parallel to the data gathering and analysis, vendor engagement was taking place. So this is an interesting item: a large organization deals with many businesses, some big, some small, the IBMs, the Oracles, the software components, business, cloud, whatever. Turns out in a company of our size, we deal with tens of thousands of organizations that supply everything from simple little libraries to major applications.
So the firm needed to reach out to the vendors in order to find out what their take was on mitigation or remediation. And although I had nothing to do with that, it was just interesting watching it because this culminated in thousands of emails being sent out, asking for responses from the application companies.
Also, bearing in mind we're a regulated industry, the regulators were interested in understanding what our exposure was. So this data had to be pulled together, whether from the conversations with external vendors or from internal applications, in order to respond to the regulators about where we were in the discovery process.
Whilst we were busy trying to contact the companies that supply us with software, of course, our clients were busy trying to contact us. How is Morgan Stanley doing? Are we impacted? What's our response? Where are we? One can imagine the torrent of incoming and outgoing traffic and communication required careful management. Certainly, those of us on the technical side, we're not spokespeople for the organization. Luckily, people don't reach out to us to ask for our opinion on what's going on. But managing that communication in a mature fashion is quite something else.
On the technical side, we ended up creating a repeatable process. We're not only discovering the applications; we also need to provide the information to the application owners and to the people monitoring the situation in the firm. This is not something you want to do by hand, collecting spreadsheets, merging data. You need to have an automation system that can collect this stuff and generate reports and free us up, the humans, to get on with it.
We knew we would need to track this to zero. This was not a nice-to-have program. This was a mandatory program. We needed to eradicate this, so this was going to stick around for a while until all the due diligence and remediation had happened.
This basically entailed a load of standard report generation and data acquisition. A lot of caching. I was writing a lot of the tooling, and having a report that takes more than 24 hours to generate results is not really desirable, so a lot of work was put in to ensure it could run reasonably fast.
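The talk doesn't say how the caching was done. One common approach, shown here purely as an assumed sketch, is to remember each file's scan result along with its size and modification time, and rescan only files whose fingerprint has changed. The `scan` callable and the in-memory `cache` dictionary are placeholders. A real system would persist the cache between runs.

```python
import os


def scan_with_cache(paths, scan, cache):
    """Scan each path, reusing the cached result when the file is unchanged.

    `scan` is any function taking a path and returning a result.
    `cache` maps path -> (fingerprint, result) and is updated in place.
    """
    results = {}
    for path in paths:
        stat = os.stat(path)
        fingerprint = (stat.st_size, stat.st_mtime_ns)
        cached = cache.get(path)
        if cached is not None and cached[0] == fingerprint:
            results[path] = cached[1]  # unchanged since last run: skip the scan
        else:
            results[path] = scan(path)
            cache[path] = (fingerprint, results[path])
    return results
```

Size plus modification time is a cheap fingerprint and can miss a file rewritten with identical size and timestamp. Hashing the contents is safer, but it costs a full read of every file on every run.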
So the data we could plausibly collect is: what's running, what are the applications, what are the processes? What source code has references to these versions of Log4j? What Docker images exist? What old releases exist and need to be destroyed? We really wanted to ensure that no old application prior to remediation could suddenly be fired up, because that would reintroduce the vulnerability into the ecosystem. It's a multi-dimensional view to help isolate the applications and owners, not just for remediation, but accountability.
And one of the interesting questions that came up at the beginning of the exercise was: which is the correct version of Log4j to use? In the early days of this remediation exercise, the Apache Software Foundation was releasing brand-new releases that were subsequently found to have vulnerabilities of their own, so the recommended version kept changing daily. Trying to keep it in your head was becoming impossible. So we needed a central place that would catalog which versions are bad and which are good and acceptable.
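A minimal sketch of that multi-dimensional view: findings from each source are folded into one record per application, with the owner attached so there is someone accountable for every row. The source names, application IDs, and owner catalog below are invented for illustration.

```python
from collections import defaultdict


def merge_evidence(sources, owners):
    """Combine per-source findings into one record per application.

    `sources` maps a source name to the application IDs it flagged.
    `owners` maps an application ID to its accountable owner.
    """
    merged = defaultdict(set)
    for source_name, app_ids in sources.items():
        for app_id in app_ids:
            merged[app_id].add(source_name)
    return {
        app_id: {"owner": owners.get(app_id, "UNKNOWN"), "evidence": sorted(found)}
        for app_id, found in sorted(merged.items())
    }


# Hypothetical inputs: three evidence sources and an incomplete owner catalog.
sources = {
    "running_processes": ["app-001", "app-002"],
    "source_code": ["app-002"],
    "docker_images": ["app-003"],
}
owners = {"app-001": "alice", "app-002": "bob"}

view = merge_evidence(sources, owners)
for app_id, record in view.items():
    print(app_id, record["owner"], ", ".join(record["evidence"]))
```

Rows whose owner comes back as `UNKNOWN` are worth as much attention as the findings themselves, since they point at gaps in the catalog rather than in the scan.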
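A central catalog can be as simple as one shared definition of the minimum acceptable version, which every report then consults. The sketch below assumes a single threshold. The value `2.17.1` is used for illustration only, since the talk doesn't state which version the firm settled on, and the function names are hypothetical.

```python
import re

# Illustrative threshold only. The point is that it lives in one place and
# is updated there whenever the recommended version moves.
MINIMUM_ACCEPTABLE = (2, 17, 1)


def parse_version(text):
    """Turn '2.14.1' or 'log4j-core-2.14.1.jar' into (2, 14, 1); None if absent."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if not match:
        return None
    # A missing patch component (e.g. '2.16') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def classify(text, minimum=MINIMUM_ACCEPTABLE):
    """Label a version string or file name against the catalog threshold."""
    version = parse_version(text)
    if version is None:
        return "unknown"
    return "acceptable" if version >= minimum else "needs-remediation"


print(classify("log4j-core-2.14.1.jar"))  # prints "needs-remediation"
print(classify("log4j-core-2.17.1.jar"))  # prints "acceptable"
```

A single ordered threshold is a simplification: it ignores patched releases on older branches and pre-release suffixes. A fuller catalog would list explicit good and bad versions instead.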
A lot of the early work was about senior management and data collecting. But ultimately, this data is all meant for the developers to remediate apps. They had no idea this was going to hit them in the couple of weeks just before Christmas, and a lot of communication, education, and knowledge needed to be disseminated to get people to the same level of understanding.
The program itself was split into a tactical and a strategic approach. So the highest-priority applications needed to be fixed prior to Christmas, so that's about a two-week timeframe. And then a slightly more leisurely pace in 2022, post-Christmas and post-New Year. And that worked really well to focus on getting early results, proving the methodology, and then following through in January.
The next few weeks turned into months. Whilst the original goal was full remediation by the end of January, things just tend to take a little bit longer than expected. As of this presentation, which is dated May, we've pretty much reached the end of the line. The remediation has completed; that happened sometime in April.
So what are some of the support questions that happened along the way? One of the most interesting ones I saw in my inbox was, "Why is my .NET application showing up in the report?" And the answer turned out to be that whilst it was a .NET application that didn't use Java, the Docker image in which it was running had a Java component, and that Java component had an issue.
There were a lot of conversations and questions around Docker images. Once we identified the family of images that had Log4j in there and that needed to be rebuilt and destroyed, it turned out that nobody actually knew how to destroy one. The use of Docker technology within the firm is relatively new in the last few years, and nobody had really given much thought to at-scale deletion of them. So a lot of conversations went on with the Docker team to educate everybody on how to do the deletion and handle some bugs which nobody had seen before in doing that.
Another issue was people remediating things and saying, "Why am I still on the report?" So the report is pulling in data which is often a few days out of date, so it can take a few days for things to drop off. Trying to explain this to people time and time again is really difficult, but it did highlight later on in the workflow that as people were remediating, they really wanted hard, real-time updates to the spreadsheet. They knew they were accountable, and they wanted the positive feedback that they had been dropped from the report.
To me, as an implementer of some of this tooling, this demonstrated that as you get closer to the finishing line, the demand for hard real-time data increases, and the technical implementation of that gets much more complicated. So if you're generating a report and the data is a few days or a few weeks out of date, you have people who are okay with that. But when people want the data within 24 hours or less, the tool needs to do more polling, more probing, more computation, and that is actually quite a stretch.
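One way to soften the "why am I still on the report?" question without going fully real-time is to expose the age of the data on each row. The sketch below is an assumption about how that could look, not a description of the firm's tool. The status labels and the 24-hour default are hypothetical.

```python
from datetime import datetime, timedelta


def report_status(last_scanned, remediated_at, now, max_age=timedelta(hours=24)):
    """Classify a report row by how far its underlying scan can be trusted.

    `remediated_at` is when the owner reported a fix, or None if they haven't.
    """
    if remediated_at is not None and remediated_at > last_scanned:
        # The owner fixed it after the last scan, so the row is out of date.
        return "fix reported, awaiting rescan"
    if now - last_scanned > max_age:
        return "stale data, rescan needed"
    return "current"


scan_time = datetime(2022, 1, 10, 9, 0)
print(report_status(scan_time, scan_time + timedelta(hours=1), scan_time + timedelta(hours=2)))
# prints "fix reported, awaiting rescan"
```

Rows in the "awaiting rescan" state can also be used to prioritize which applications get probed next, so the polling effort goes where people are waiting for feedback.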
I'm aware that we're running out of time, unfortunately, in this presentation, so I'm going to have to fast-forward over the subsequent slides. One of the interesting facets of this investigation is, whilst I spent a lot of time early on generating the raw data for people to consume, somebody put together a web service built out of Bash shell, of all things, to help people query the data.
And I thought it was a really innovative and very useful tool, and I was very proud to see that somebody had spent the time to do that. And I actually took that code and then enhanced it many fold over, because having a central portal that people could go to is far better than people emailing spreadsheets that would inevitably be days out of date by the time you're looking at them. Being able to look at the live data at the point it's being consumed and generated turned out to be really useful.
And this system is now being reused for other vulnerabilities and other hygiene exercises. So out of the chaos that ensued from Log4j, we've built a data collector that's reusable for other vulnerabilities. But also the end-user experience is now the same, and a lot of smart features are available in this system.
What are some of the takeaways in this whole exercise? I've listed a whole bunch of things. In the heat of the moment when you're trying to generate data, everybody's looking at it, everybody's asking questions. There's a lot of similarity no matter the organization or the problem. Generate a report and people consume it. But then people start asking, "What does this column mean, and why is that row there?"
I'll just stick for the moment with the very bottom item, which is focus on success, not failure. We had thousands of line items that needed to be fixed, and people were being hounded because they still had one thing left to do. Even if they'd fixed 99, the one thing that was left was the main focus. Being able to show the positive work that people were doing can make us all feel good as technologists, that people are actually doing what they're being asked to do. And yeah, while senior management may focus on whether the risk has been eradicated, we need to be fair to the effort thrown at this by the developer teams.
My apologies, I've run out of time, and I'd love to talk about this for much, much more, but I'm going to have to terminate here. So thank you all very much for your time, and I hope to speak to you in person at some point in the future.