Saving Millions of Dollars by Auto-Configuring JVM Memory Settings (Introducing Adaptable Heap Sizing!)
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
The first speaker this afternoon is Jonathan Joo. He is a senior software engineer at Google and is part of the Java Platform Team. This is a team that supports the thousands of Google developers who rely on Java or the JVM to develop and run some of Google's largest and most critical services.
His team is responsible for many things, including creating and maintaining all of the custom tools that Google has built over the decades, as well as liberating developers from having to know about all of these sometimes arcane aspects of running the JVM in production.
But something surprising happened as they started migrating Google services from Java 8 to Java 11: workloads started blowing up in production. To massively oversimplify the problem, they discovered that bad things can happen if developers are allowed to use as much memory as they think they need.
This talk is so startling for so many reasons, beyond just the incredible amount of ingenuity that Jonathan brought to bear to solve this problem. His talk also gives us a glimpse of how platform teams at Google work and what it takes to run things at Google scale. It shows how even Google doesn't want to be on Galapagos Island all by themselves, and it is another example of them building bridges back to the mainland.
And Jonathan also presents a radically different philosophy on memory garbage collection than I've ever heard, but one that I think you'll agree is nearly impossible to argue against because of the money it can save. I am so excited that Jonathan is willing to share his story with this community because it shows the incredible value that platform expertise can create. Here's Jonathan.
Jonathan Joo
Hi everyone. Thank you all for coming to my talk. As Gene mentioned, I'll be talking about one of the projects I've been working on at Google called Adaptable Heap Sizing, and this is sort of our solution to Java memory management, which has been a proper pain for developers for a long time. The title of the talk, No More Xmx, is a bold claim, but we will hopefully convince you that this may be the case in the future one day.
A little bit about me: my name is Jonathan Joo. I'm a senior software engineer on the Java Platform Team at Google. I work in California and I've been at Google for around two and a half years now. Prior to Google, I worked at a startup called Rubrik on their distributed file systems team for also two and a half years, so I have a lot of experience doing backend development stuff. This is kind of where my interests are at the moment. I've also included my email and my LinkedIn on this slide, so if someone has any questions during the talk or just wants to connect, feel free to shoot me an email.
Let's talk a little bit about what the Java Platform Team does at Google. We're responsible for making sure that Java just works. There are thousands of Java developers and they all rely on having a reliable Java platform. That's the goal of our team. At Google we run a custom fork of the OpenJDK. We pretty much use everything that is in the OpenJDK at first, but then we have applied our own local patches to help things run better at Google scale.
We also handle upgrades and improvements. For example, one of the upgrades that we were responsible for recently was the JDK 8 to 11 migration, which was a pain because there are so many servers running Java, and changing all of them to a fundamentally different garbage collector is tough. But we were able to get through it with some of the Google engineering ingenuity, as one might say. I'll get more into that in a little bit. We are also responsible for making sure that Java runs well at Google scale, and this involves some projects, for example the one that I'll be talking about today.
Getting into the agenda: first, I'll describe a little bit of the background of Java memory management and Java in a container, as well as the history of Java memory at Google. Next, we'll talk a bit about the implementation of Adaptable Heap Sizing, as well as the benefits that we've seen from applying Adaptable Heap Sizing in some of our biggest jobs, and then looking forward to what we want to get out of AHS in the future.
As a quick note for the rest of this presentation, I'll be using the phrase AHS instead of the full name Adaptable Heap Sizing because the latter has too many syllables.
Let's talk a little bit about the background. When we look at Java memory in a containerized environment, this is sort of the mental model one can have. I've color-coded it to make it easier to comprehend. The green sections represent the memory allocated for the Java heap. Java as a process has both non-heap and heap RAM usage, but the heap is the one that we're most interested in, and this is represented by the green.
The dark green box is the Xmx. It is the maximum amount of RAM that can be used by the Java heap. This is something you set in advance when you start up the job, but you'll find that the actual heap usage, the actual RAM used by the Java heap, tends to be less than the full amount. It cannot be more than that, so Xmx is a hard limit.
The yellow box corresponds to the RAM required by the Java process not as part of its heap, but just as part of its other things, for example thread stack, code cache, metaspace, etc. Inside the container you may also find that you are running other processes. At Google, for example, we have a number of them, but they all consume varying amounts of RAM and it's hard to tell in advance just how much RAM they will use.
This is kind of what a container looks like. In an optimal scenario, we see that there is a good amount of free space so that if one of these things fluctuates, we should be able to handle the increases in RAM without having any problems. But as there's more variance and uncertainty in the amount of RAM that a process can use, we sometimes run into a scenario where these different processes end up using the full amount of container space, and that is what results in a container out-of-memory error, a container OOM.
You might be thinking, why is this such a common issue? To understand why it is so common at Google, we have to talk a little bit about the history of Java memory at Google. Most notably, we recently migrated from JDK 8 to JDK 11. While there are quite a few changes, the main thing that caused us problems was the change in the default garbage collector. JDK 8 uses a garbage collector called Concurrent Mark Sweep, which has a fundamentally different algorithm from that of the new one in JDK 11 called Garbage First. I say new; it's not really that new. I think it's been worked on for many, many years, but it's new to us.
There are a lot of differences between the two garbage collectors, which I've listed on this slide, but the one that we care about the most is the fact that G1 tends to use heap in a more greedy fashion than CMS. This is a problem because, going back to our previous Java memory model, it's possible for all of these in the worst case to exceed the container. With CMS, the Java heap, the RAM used by the Java heap, is not using the full amount that it is allotted, and this is a common configuration we've seen. But with the same configuration, once switched to G1, we see that the heap now uses pretty much all of the RAM that's given, and this results in a container OOM.
Another common problem we notice at Google is this anti-pattern, which I like to call the memory cycle of doom. Someone who owns a Java service observes that their container goes OOM, and they say, okay, if the container is OOM, then that means clearly the container is not big enough. So they increase the container size of their service, thinking that fixes the problem. Sometimes it does alleviate the problem, but the problem is that if you also increase Xmx by the same amount, and Java heap is using the full Xmx allotted, then you really still are just as likely to run into a container OOM as before.
We have this thing where people see a container OOM, they say, okay, it's not enough memory, let's add more memory to the container, and then they keep doing this until the memory of the service has just blown up in size, and they still continue to hit container OOMs once in a while.
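The arithmetic behind this cycle can be sketched in a few lines. The numbers and the `headroom_gib` helper are invented for illustration; the only assumption carried over from the talk is that the heap grows to use its full Xmx.

```python
def headroom_gib(container_gib, xmx_gib, non_heap_gib, other_procs_gib):
    """Free container memory left once the heap has grown to its full Xmx."""
    return container_gib - (xmx_gib + non_heap_gib + other_procs_gib)


# Hypothetical service: 16 GiB container, 12 GiB Xmx, 2 GiB of non-heap JVM
# memory, and 1.5 GiB used by other processes in the container.
before = headroom_gib(16, 12, 2, 1.5)

# The "fix": grow the container by 8 GiB and raise Xmx by the same 8 GiB.
after = headroom_gib(24, 20, 2, 1.5)

print(before, after)  # both are 0.5: the safety margin did not change
```

The container got 8 GiB bigger, but the margin that protects against a container OOM is exactly what it was before.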
What is the solution to all of these problems? AHS. The goals for AHS are: one, we don't want to have to worry about setting the correct value for Xmx, so we can let it be something high and let the JVM manage its usage on its own. Next, we want to try and prevent container OOMs if possible. Third, like I mentioned in the cycle of doom case, we want to prevent excessive heap usage or excessive container usage.
Let's talk a bit about the implementation of Adaptable Heap Sizing. We'll start with the high-level overview here with a diagram which I've created. There aren't actually too many parts. It's fairly straightforward, and we'll go into each of these individual parts shortly.
First, let's talk about how we get the container information. Understanding the state of the container is fundamentally important for being able to manage the memory inside the container. How do we get this container information? We get it by querying the cgroup information that's available via the Linux kernel cgroup, and the things that we look for are the container limits and the container size. This gives us a snapshot of what the container looks like at any given point in time.
Looking again at this Java memory model, the total size of the container is given by the hierarchical_memory_limit value in the cgroup's memory.stat file. The actual RAM usage of the container is obtained via memory.memsw.usage_in_bytes. Keep in mind that this is not the worst-case usage. This is the literal usage at this given point in time. That is what we are querying.
I'll explain how that container information is useful later. Next, we also need to obtain information about GC. We do this by reading from the HSPerfData file. This is a file that is written by the JVM. It's a custom-formatted binary file and it's used in part for monitoring and performance testing, but it is also nice because it allows us to write our own custom stats to it.
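As a rough sketch of what querying those two cgroup values could look like: the code below assumes a cgroup v1 memory controller mounted at `/sys/fs/cgroup/memory` and swap accounting enabled (otherwise the `memsw` file is absent). The function names are made up for this example, and the real AHS worker is written in C++, not Python.

```python
from pathlib import Path

CGROUP_DIR = Path("/sys/fs/cgroup/memory")  # assumed cgroup v1 mount point


def parse_memory_limit(memory_stat_text):
    """Return hierarchical_memory_limit (bytes) from the text of memory.stat."""
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "hierarchical_memory_limit":
            return int(value)
    raise ValueError("hierarchical_memory_limit not found")


def parse_usage(usage_text):
    """Return the usage (bytes) from the text of memory.memsw.usage_in_bytes."""
    return int(usage_text.strip())


def read_container_snapshot(cgroup_dir=CGROUP_DIR):
    """Read (limit, usage) in bytes for the current container."""
    limit = parse_memory_limit((cgroup_dir / "memory.stat").read_text())
    usage = parse_usage((cgroup_dir / "memory.memsw.usage_in_bytes").read_text())
    return limit, usage


# Usage with sample file contents (values are invented):
sample_stat = "cache 1048576\nrss 2097152\nhierarchical_memory_limit 8589934592\n"
print(parse_memory_limit(sample_stat))  # 8589934592
print(parse_usage("6442450944\n"))      # 6442450944
```

Keeping the parsing separate from the file reads makes the logic easy to check without a real container.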
As part of a Google-local specific patch, we have exported a lot of the metrics for GC time and the CPU times specifically to this HSPerf file, so that we can at any point in time query how much time do we spend on GC. We then export this information to the AHS worker thread, which reads from this HSPerfData file to obtain all this GC information.
Now let's get to the bread and butter of AHS, which is the worker thread. The worker thread is a lightweight native thread written in C++, and it is an external-to-JVM thread that is enabled at runtime. It periodically queries, like I mentioned before, container RAM information as well as information about the JVM GC. It uses this information to make some calculations and then tell the JVM how to react.
It does this through two manageable flags that we've created, which we call ProposedHeapSize and CurrentMaxExpansionSize. What is a manageable flag? It is simply a flag that can be changed dynamically at runtime. This is very useful for us because the AHS worker thread is constantly creating this information and feeding this information to the JVM.
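For a feel of what changing a manageable flag looks like on a stock JDK, the `jcmd` tool has a `VM.set_flag` command that sets a flag on a running JVM. The sketch below only builds that command line. It uses `MinHeapFreeRatio` because that flag is manageable in standard HotSpot builds, while the two AHS flags are ones the team created. The process ID is invented, and the talk does not describe how the AHS worker itself sets its flags.

```python
def set_flag_command(pid, flag, value):
    """Build the jcmd invocation that changes a manageable flag at runtime."""
    return ["jcmd", str(pid), "VM.set_flag", flag, str(value)]


# MinHeapFreeRatio stands in for the AHS flags here, purely as an example of
# a flag that stock HotSpot marks as manageable. The pid is hypothetical.
cmd = set_flag_command(12345, "MinHeapFreeRatio", 20)
print(" ".join(cmd))  # jcmd 12345 VM.set_flag MinHeapFreeRatio 20

# To actually apply it to a running JVM:
#   import subprocess; subprocess.run(cmd, check=True)
```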
What are these two flags and what do they do? Before I talk about what ProposedHeapSize is, let's talk about how AHS actually functions. The key insight is that we have a target GC CPU overhead percentage. By that I mean the share of CPU time that we spend doing GC versus running application threads. We found at Google that a default of 20% actually works pretty well for most jobs. In other words, 20% of CPU time is spent doing GC and the remainder goes to application threads. This is actually a lot higher than the GC overhead we've seen in most cases, which you'll see in the case studies coming up. We set it to 20%, which was kind of a revolutionary idea in that most people were afraid to spend so much time on GC.
How do we actually achieve this target GC CPU overhead? We do so via the ProposedHeapSize flag. The idea here is that if we see that the amount of time we spend doing GC is higher than our target, say we spend 70% of our time doing GC, then we increase the value of ProposedHeapSize, and vice versa if we see less time spent doing GC than we expect. The bigger the heap, the less time we spend doing GC. This intuitively works because with a larger heap, GCs happen less frequently and each GC is more efficient. We can clear more dead objects, and the ones that get promoted to the next generation tend to be ones that are actually expected to be longer-lived.
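One way to picture the GC CPU overhead metric is as a ratio of two counter deltas. This is a sketch under assumptions: it supposes the worker can sample cumulative GC CPU time and cumulative total CPU time, and the sample layout and function name are invented here. The talk only says that GC and CPU times are exported through the HSPerfData file.

```python
def gc_cpu_overhead(prev, curr):
    """Fraction of CPU time spent in GC between two samples.

    Each sample is a (gc_cpu_seconds, total_cpu_seconds) pair of cumulative
    counters; this layout is an assumption of the sketch.
    """
    gc_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0  # no CPU time elapsed between samples
    return gc_delta / total_delta


TARGET_GC_OVERHEAD = 0.20  # the 20% default mentioned in the talk

# Two invented samples: 3 s of GC CPU out of 60 s of total CPU in the interval.
overhead = gc_cpu_overhead((10.0, 400.0), (13.0, 460.0))
print(overhead)                       # 0.05
print(overhead < TARGET_GC_OVERHEAD)  # True: GC is well under the target
```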
Just a visual on what that might look like: if the GC is on fire, we increase the size of the heap. If the GC is relaxed, we can decrease the size of the heap.
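The feedback rule can be sketched as a small function. The talk gives the direction of the adjustment (grow when GC overhead is above target, shrink when below) but not the formula, so the proportional step and the 10% cap below are assumptions made only for this example.

```python
def next_proposed_heap_size(current_heap_mib, gc_overhead, target=0.20,
                            max_step=0.10):
    """Nudge the proposed heap size toward the target GC CPU overhead.

    Grows the heap when GC overhead is above target and shrinks it when
    below. The step rule (relative error, capped at max_step per update)
    is an assumption of this sketch, not the AHS formula.
    """
    error = (gc_overhead - target) / target
    step = max(-max_step, min(max_step, error))
    return int(current_heap_mib * (1 + step))


print(next_proposed_heap_size(4096, 0.70))  # 4505: GC is "on fire", grow
print(next_proposed_heap_size(4096, 0.01))  # 3686: GC is relaxed, shrink
print(next_proposed_heap_size(4096, 0.20))  # 4096: on target, no change
```

Capping the step keeps a single noisy sample from swinging the heap size too far in one update.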
What is CurrentMaxExpansionSize then? This is a flag that keeps us from expanding the heap too much. In other words, it is a safeguard against container out-of-memory errors. The value of CurrentMaxExpansionSize is generally computed to be somewhere around container limit minus container usage, in other words how much free space is left in the container. Of course, there will be a little bit of buffer because of the variability in all the other memory consumers in the container. In general, we set CurrentMaxExpansionSize to be just the difference between the limit and the usage, and unlike ProposedHeapSize, this is a hard limit. We actually prevent any expansions of the heap larger than this because from our perspective, we say, hey, if we expand past this amount, this is a guaranteed container OOM. We'd rather try to do more GCs and do whatever we need to do to keep the heap small rather than expand past this amount.
This is also just another visual: as heap expands, CurrentMaxExpansionSize says, hey, that's fine. But once we get to a point where it's pretty limited, we will stop further expansions.
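The expansion cap can be sketched the same way. The talk describes the value as roughly container limit minus container usage, less some buffer. The buffer size, the function names, and the way the cap is applied to a proposed heap size are assumptions of this example.

```python
def current_max_expansion_size(container_limit, container_usage, buffer_bytes):
    """Bytes the heap may still grow by without risking a container OOM."""
    return max(0, container_limit - container_usage - buffer_bytes)


def allowed_heap_size(current_heap, proposed_heap, max_expansion):
    """Cap a proposed heap size so growth never exceeds max_expansion."""
    return min(proposed_heap, current_heap + max_expansion)


MIB = 1024 ** 2
limit, usage, buffer = 8192 * MIB, 7680 * MIB, 256 * MIB  # invented values
room = current_max_expansion_size(limit, usage, buffer)
print(room // MIB)  # 256

# The heap wants to grow from 4096 MiB to 4505 MiB, but only 256 MiB is free.
print(allowed_heap_size(4096 * MIB, 4505 * MIB, room) // MIB)  # 4352

# A proposal to shrink is never blocked by the cap.
print(allowed_heap_size(4096 * MIB, 3686 * MIB, room) // MIB)  # 3686
```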
To recap quickly, the AHS worker thread is responsible for taking container and GC information and then using it to set the two manageable flags, which then the JVM uses to modify its own behavior. That is the implementation section.
Let's talk now a little bit about the benefits we've seen from using Adaptable Heap Sizing within Google. Why AHS? The benefits we have are simplified tuning, a decrease in container out-of-memory errors, and memory savings for overconfigured jobs, and we're able to do all this without actually having any latency regressions, which is pretty cool.
Simplified tuning is kind of hard to show via case study or anything like that. In general, we no longer have to worry about figuring out what is the right Xmx value for this service. Let's just set Xmx to something arbitrarily high, or set it to even just the size of the container, and then let AHS do all of the heap sizing management on its own. It'll use more heap if it has a lot of free space, but if there's not much free space, then we won't be at risk of a container OOM.
That leads into the second point, which we can actually see via a case study of Google Earth. This is a real production workload for services that you might have used yourself. We applied AHS at two different times to two different cells, represented by the green and the blue on this graph. The first time we enabled AHS, in the first arrow, we went from hundreds of container OOM errors per hour to absolutely zero. It's a little hard to see because the two lines are similar colors, but it's even more noticeable once we turned it on for the second cell, because after that you'll see that in either cell there are no errors whatsoever. Deploying AHS resulted in perfect behavior in terms of preventing the container out-of-memory errors that were so prevalent before.
To give you an idea of the scale of these jobs, we had 20,000 or so live tasks spread among 600 or so live jobs at any given time. These are constantly running, and in aggregate all of these jobs consume like 90 terabytes of RAM. This is a huge service, a lot of impact there, and according to the person who was responsible for rolling AHS out on these jobs, they didn't see any regressions in throughput, CPU use, or anything. Even just by having this safeguard, we were able to prevent the container OOMs. This has been running for probably like six months now or so and we haven't had any complaints or issues so far. I would say this was a win.
Let's talk a little bit now about the memory savings we've seen for poorly tuned jobs. The previous example showed that clearly for OOM scenarios this was effective, but how do we take the servers which previously the owners had given way too much memory to, and how does it reduce the memory usage there?
We can observe this via Google Meet. Google Meet has a lot of jobs, but we applied it to some of their servers. This is an example of two cells before and after they enabled AHS. As you can see, there's a drastic drop in memory consumption from over 600 gigabytes of RAM to around 150 or so for both of these servers before and after AHS was enabled. This kind of speaks for itself. It's pretty clear when AHS was turned on because of how much memory we started saving.
That memory savings was just looking at two cells. When we aggregate across a bunch of the jobs that this production service is running, the savings are on the order of terabytes. If you look at the bottom red and green lines, we went from using 16 terabytes of RAM for these servers to 8 terabytes of RAM. This is just in terms of the heap usage. Because of the drastic reduction we've seen in the heap usage, we're able to also bring down the containers by around the same proportion, to around half the size. We went from using like 48 terabytes of RAM for all of the containers for these jobs to around 24 terabytes of RAM, so that's a massive 24 terabytes of savings. This is just for one part of Meet. This is not all of Meet. This is just one specific job related to Meet.
Let's look at some of the garbage collection metrics and see how AHS affected its behavior. When AHS was enabled on these cells, it's pretty clear from the timing that we see significantly more GC behavior. This is what we want. This is actually what we told AHS to do, because like I mentioned earlier we set a target GC CPU overhead percentage of around 20%. Basically we're saying, hey, you were using 1% of your CPU time on GC before. We've got to crank that up. The old gen and young gen pauses all go up a lot. What's interesting is the middle graph on the bottom row. This is directly the metric that we're measuring. Even though we told it to crank up to 20, it says, hey, we don't actually need all 20; we're happy staying between two and 16% or whatever. Despite AHS telling it to shrink its heap and do GC more, it still wasn't even hitting 20.
This goes to show that this server in the past was massively overconfigured, and those RAM savings we see were not too expensive from a latency or throughput perspective. Here is throughput before and after AHS was enabled. These are the two cells that I was talking about before, and the point where AHS was enabled is marked by a big vertical red bar. If you look before and after, you really cannot see much of a difference at all in terms of throughput. If you had asked me to guess where to draw that line, I would have had nowhere to put it. This shows that there really was no throughput regression after enabling AHS. Very similarly for latency, we also see that before and after AHS was enabled, there really isn't much of a latency regression, if at all. I don't think there's any here. We're able to get these massive RAM savings, dozens of terabytes, without really any cost to the service at all.
Hopefully this convinced you that there were some benefits that we've seen from enabling AHS, and that while these benefits do come at the cost of higher time spent doing GC and potentially a little bit more CPU usage, the RAM savings that we see are massive. Right now I think RAM tends to be more expensive than CPU, so it's a good cost trade-off there.
Looking forward, what do we want to do next with this? First, obviously, there is the rollout within Google. Right now very few people are using it, just the teams that I've worked with directly, but to roll it out on a larger scale will require a lot of monitoring and whatnot, so that will be tough. Beyond Google, we also want to start upstreaming this solution to the OpenJDK.
I've already begun conversations with the OpenJDK via the mailing lists, and we discussed how AHS works and the design. We noticed that there are some similarities to existing requests for enhancement. Notably, ProposedHeapSize has a very similar parallel RFE called "Use SoftMaxHeapSize to guide GC heuristics," and just from the title, you can see how they're somewhat similar. The idea of CurrentMaxExpansionSize is also not new; there is something called a dynamic max memory limit, which if implemented we could use sort of in place of CurrentMaxExpansionSize.
Of course, I think this dynamic max memory limit doesn't take into account the container, but we can pipe in the information that we grab from, for example, the cgroup, and we could use that to set this dynamic max memory limit. These are the similarities we see in terms of how AHS could be integrated into the OpenJDK via things that already exist.
You might be asking why we even want to upstream this solution to the OpenJDK. The answer is two-fold. First, if we have AHS working in OpenJDK, that's less of a maintenance burden for us within Google, because right now every time they release a new patch, if something changes the behavior, then we have to scramble and figure out how we're going to integrate those changes with what we already have in our own custom version of the JDK.
Secondly, and I think this is often overlooked, if it's part of the OpenJDK there will presumably be more users, and as more people use it, the product can become more mature, we can get better feedback, and we can uncover different edge cases. Ultimately it'll help everyone, not only ourselves, but everyone using the product. I think it's a good thing to have. It's a good thing for everyone in the OpenJDK community to be aware of, and by having it available as an option for people to experiment with or try in their own servers, we can hopefully have industry-wide savings and also improve how well it works for ourselves.
This is where I could use your help. As you can see, the principles established as part of these prototypes, or even the working product, have demonstrated millions of dollars of savings within Google already. If we were able to get this sort of thing working industry-wide, the savings could be massive. We could put the RAM industry out of business. No, I'm just kidding, but that would be nice if we could save lots of RAM, right? I think that would be cool.
Again, this is a lot of work for just one person. I am the only person working on this project at Google, and so my time is mostly spent just trying to make sure everything is working and answering questions and fixing fires as they come up. It would be awesome if any of you who are interested would get involved, whether that's helping with upstreaming some of these ideas or becoming familiar with the concepts of AHS yourself so that you could look into how this might work in your own company. I would just love to have more people talking about and discussing this.
I'll list the RFEs here that I think are most relevant or most related to what I'm working on at Google, and maybe those of you who are interested could take a look at this or comment on it or say, hey, we think this is important, we can work on it.
If any of you are interested in helping with this or just want to learn more or want to try developing something like this for your own company, please reach out to me. I'm very happy to help and I'm excited to see where this is going next.
I put my contact information on this slide again. Feel free to reach out to me via LinkedIn or my email. Happy to set up some time to talk about this with anyone. I appreciate you all for listening. Thank you.