newrelic-java-agent: Caffeine cache causing memory leak and OOM
Description
Recently, we’ve updated our services to Java 17 (v 17.0.1), which also forced us to use the NewRelic java agent version 7.4.0.
After the migration, we noticed that our services regularly crash due to OOM. At first we suspected an issue with Java 17 itself, but after obtaining several heap dumps it looks like a problem with the Caffeine caching introduced in 7.3.0 #258 (we migrated from an older version of the agent).
We believe it might be related to the MariaDB driver we’re using. In one of the heap dumps, the `com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.WS` map has 2.1 million entries of prepared statements. That would fit, as the cache relies on the mariadb-java-client implementation of `PreparedStatement`, which does not implement equals/hashCode.
After taking two more heap dumps, one right after startup of the app and one an hour later, this is the comparison:

Looks like the cache contained only PreparedStatement entries.
Expected Behavior
Agent and service running smoothly, no memory leak.
Troubleshooting or NR Diag results
Steps to Reproduce
Your Environment
NewRelic agent: 7.4.0
Java: eclipse-temurin:17 docker image (currently 17.0.1)
Spring Boot: 2.5.6
MariaDB driver: 2.7.4 (managed by Spring Boot BOM)
Additional context
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 10
- Comments: 61 (30 by maintainers)
Commits related to this issue
- Protect reference caches from a discarding executor The maintenance task is scheduled on an executor and disables further scheduling until the submitted task runs. If the task is not run when expecte... — committed to ben-manes/caffeine by ben-manes 3 years ago
- Configure bounded Caffeine caches with a same-thread executor By default Caffeine uses `ForkJoinPool.commonPool()`. When running Java 17, usage of `ForkJoinPool` early in the agent's life cycle may t... — committed to PicnicSupermarket/newrelic-java-agent by Stephan202 3 years ago
Caffeine’s fix is released in v2.9.3 and v3.0.5. This will recover if the executor malfunctions, but affected users should apply a workaround for the JDK bug.
Today the application used less memory, though there was still a slight upward trend. I created two heap dumps; in the meantime the application was restarted for unrelated reasons. Key findings:
- The `WS` instance referenced by `ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache` that contains the prepared statements is now small (just 267776 and 37752 bytes retained, respectively), with `keyReferenceQueue.queueLength == 0`, as it should be. 🎉
- All other `ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache` instances are now also small (on the order of kilobytes). I went back to yesterday’s heap dump and saw that a few other such instances were also larger back then (on the order of megabytes). All in all: whatever the issue/bug is, `Caffeine.executor(Runnable::run)` avoids it. ✔️
- Two other `WS` instances that take up a significant amount of memory remain:
  - `JdbcHelper.connectionToURL`, comprising 22.17% / 8.12% of the heap, with a `keyReferenceQueue.queueLength` of 616446 / 102386.
  - `JdbcHelper.connectionToIdentifier`, comprising 18.47% / 5.74% of the heap, with a `keyReferenceQueue.queueLength` of 501032 / 76343.

IIUC these caches are created by `AgentCollectionFactory.createConcurrentWeakKeyedMap`, for which we did not set `Caffeine.executor(Runnable::run)`. It’s safe to say that these two caches explain why memory usage was still trending upwards today.

As previously discussed, the next step would now be to test whether the issue is specific to Java 17, by downgrading to (e.g.) Java 16. Unfortunately my schedule today was even busier than expected, and tomorrow will certainly be no better. So instead I have started a different experiment that took me a bit less time to set up: I now updated the New Relic Agent to use `Executors.newCachedThreadPool()` rather than `Runnable::run` (this time also for the cache in `AgentCollectionFactory`). This experiment should hopefully tell us whether (a) the issue is somehow related to any kind of scheduling/thread switching or (b) it is more likely to be specific to `ForkJoinPool.commonPool()`. Stay tuned 😃
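For reference, here is a minimal sketch of the two cache configurations being compared in this experiment. The builder calls are Caffeine’s public API; the cache shape (weak keys, no size bound) is an assumption about what the agent’s maps roughly look like, and the class/method names are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class ExecutorExperiment {
  // Variant A: run Caffeine's maintenance work on the calling thread,
  // bypassing ForkJoinPool.commonPool() entirely (mirrors Guava's behavior).
  static <K, V> Cache<K, V> sameThreadCache() {
    return Caffeine.newBuilder()
        .weakKeys()
        .executor(Runnable::run)
        .build();
  }

  // Variant B: still hand maintenance off to another thread, but to a
  // dedicated pool rather than the common ForkJoinPool.
  static <K, V> Cache<K, V> dedicatedPoolCache(ExecutorService pool) {
    return Caffeine.newBuilder()
        .weakKeys()
        .executor(pool)
        .build();
  }

  public static void main(String[] args) {
    Cache<Object, String> a = sameThreadCache();
    Cache<Object, String> b = dedicatedPoolCache(Executors.newCachedThreadPool());
    a.put(new Object(), "a");
    b.put(new Object(), "b");
  }
}
```

If both variants keep the `WS` instances small, the problem is specific to `ForkJoinPool.commonPool()` rather than to asynchronous maintenance in general, which is exactly the distinction this experiment is meant to draw.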
Yes. I’ll provide a patch jar this evening (PST) and notify you here. If it helps then we can merge that in for a release. We can target the weekend to give us both some bandwidth if non-urgent. If urgent then I think a temporary workaround on your side is most prudent so that you can control the code.
@ben-manes tnx for the update: that makes sense 😃
As for the Java Agent warning: I suppose that could be turned into a full-fledged new feature: detection of in-use functionality impacted by known bugs (not just in the JDK, but also other libraries on the classpath). 📈
@Stephan202 Had a sick kid this week so wasn’t able to engage with this issue as I’d hoped. We’re running in k8s so it’s almost certainly https://bugs.openjdk.java.net/browse/JDK-8274349 as you suggest. Testing whether that arg resolves the leak now
Edit: Been running for a few hours now with that arg. OU (old-gen) space isn’t growing, so I’m pretty confident the suggestion is a usable workaround until Java 17.0.2. Thanks for figuring this out
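The exact JVM argument isn’t quoted in the thread, so it is left as-is above. For anyone wanting to confirm whether their own environment is affected, a rough empirical probe along these lines (purely illustrative, not part of the agent) checks whether fire-and-forget tasks handed to the common pool ever run, which is how Caffeine schedules its maintenance:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

class CommonPoolProbe {
  public static void main(String[] args) throws InterruptedException {
    // Caffeine submits maintenance via executor.execute(...) without ever
    // joining it, so the relevant question is whether such a task runs at all.
    CountDownLatch ran = new CountDownLatch(1);
    ForkJoinPool.commonPool().execute(ran::countDown);
    if (ran.await(5, TimeUnit.SECONDS)) {
      System.out.println("common pool ran the task; this environment looks fine");
    } else {
      System.out.println("common pool did not run the task within 5s; "
          + "cache maintenance (and weak-key draining) would be starved");
    }
  }
}
```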
@trask currently we have not reproduced but reasoned through it based on heap dumps, Doug’s deep knowledge, and code spelunking. The main bit of insight came from the email below,
A quick status update.
I completed the work on v3, but have not yet backported it to v2 used here. Given the U.S. holiday week, I’ll try but may not have a release until next weekend. Those changes only mask the problem because the ForkJoinPool may be in a corrupted state and impact the application’s behavior.
@Stephan202 and Doug further isolated the problem to static initialization circularity. While there are some in-progress improvements to relax this, it sounds like a mistake in New Relic’s agent, which performs weaving during class loading. I suspect this is because Caffeine defaults to using `ForkJoinPool.commonPool()`, so constructing a cache forces that to initialize when it sets the internal executor field. This oversight might have occurred when replacing Guava’s CacheBuilder usages with a Caffeine builder, as they appear to be nearly identical. However, Guava lacks this option and simply runs work on the calling thread. That has the potential to degrade external latencies if user-supplied logic is expensive, such as callbacks to a removal listener. Similar behavior can be obtained using `Runnable::run`, which was shown to be fine for this workload. A similar project is OpenTelemetry’s java agent, which also embeds Caffeine for its weak-keyed caches. They switched to a same-thread executor after other approaches to wrangle the FJP became tiresome. /cc @anuraaga @trask
While these are preliminary findings, I believe the correct fix is to specify `Caffeine.executor` in all instances created by the java agent.
Okay, I didn’t wait for the application’s memory usage to reach critical levels, but I can confirm that the patch doesn’t make the issue go away. A new heap dump shows that `ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache` again references a large `WS` instance, now accounting for 47.58% of the heap. This time `keyReferenceQueue.queueLength` is 298808. (And `drainStatus` is again `PROCESSING_TO_REQUIRED`.) I have now updated the New Relic Agent fork to include one more commit, configuring `Caffeine.executor(Runnable::run)` for the relevant cache instance. I’ll let this run overnight and check again some time tomorrow morning. (The next two days are also full of meetings, so I won’t be able to iterate much faster, sorry.)
Whoa, awesome! Yes, thank you @ben-manes @Stephan202 and @mateusz-bajorek for this community effort. We’ll plan to update our dependency with the patch version in the next release.
@Stephan202 When working on the release, I decided to remove the proposed detection and fallback logic. It would only mask the problem in some ways, but the application would be broken in any other usage. While it is very surprising that the commonPool is broken, it’s a JVM issue that we can now explain to confused users. The only change that the library will make is to be more robust about evicting in this case, which would have avoided the memory leak.
I do think it might be reasonable for a java agent or similar tools to detect these types of bugs and warn users of problematic application settings. That fits with their goal of monitoring the system’s health.
I toyed with that docker flag quite a bit and I think checking the java version is unfortunately the best way. The snippet below is what I’m thinking of for 3.x; for 2.x it would have to check the `java.version` system property. wdyt?
Informing users of the incompatibility and pointing to JDK-8274349 does seem reasonable and would be in line with how we’ve treated other issues that point back to the JDK (usually some SIGSEGV on an older JDK). Although, if we can provide a code fix that avoids the issue, we’d do that as well. We do plan to take https://github.com/newrelic/newrelic-java-agent/pull/572, as it’s more in line with what was done in the old Guava dependency and, if I understand correctly, avoids this issue altogether, and would therefore prevent time spent on future issue reports.
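The version-check snippet ben-manes mentions above isn’t reproduced in this dump. Purely as an illustration of the idea (not the actual Caffeine code), a check along the described lines might look roughly like this, with the `Runtime.version()` form for 3.x and a `java.version` system-property parse for the Java 8-compatible 2.x line:

```java
final class JdkBugCheck {
  // 3.x style: Runtime.version() is fine on the 3.x baseline (Java 11+).
  static boolean commonPoolPossiblyBroken() {
    Runtime.Version v = Runtime.version();
    return v.feature() == 17 && v.update() < 2; // fixed in 17.0.2 (JDK-8274349)
  }

  // 2.x style: only the java.version system property is assumed available.
  static boolean commonPoolPossiblyBrokenLegacy() {
    String version = System.getProperty("java.version", "");
    return version.equals("17") || version.startsWith("17.0.0") || version.startsWith("17.0.1");
  }

  static void warnIfAffected() {
    if (commonPoolPossiblyBroken()) {
      System.err.println("WARNING: running on Java 17.0.0/17.0.1; "
          + "ForkJoinPool.commonPool() may silently drop tasks (JDK-8274349)");
    }
  }
}
```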
I will add that we saw 8 full garbage collections on startup (and, to be honest, a judicious 9th would’ve been good) with newrelic enabled, and without it there are none, so even though the profile stabilises I feel like there are still issues here. I haven’t yet tested parallel=1 + newrelic=off, or the number of FGCs for Java 11 + newrelic. Will try to do so on Monday, but for the latter I need to get tools installed on the Java 11 image so I can compare, and… 🙄 Please don’t hold your breath, anyway.
Thank you @Stephan202. Sadly the issue does not impact me directly, so I do not have a setup prepared to retest it. There is no OOM in my application, even when I tried to force one. The growth is probably too small to impact frequently deployed app instances. Still, this blocks wider adoption of Java 17 for me 😿 I asked for the information to better plan future work. Thank you for doing an awesome job with this issue 🙇♂️
^ I just added a comment to the PR (https://github.com/newrelic/newrelic-java-agent/pull/572#issuecomment-976944700), and based on an analysis of the latest memory dump it might be that @ben-manes’s suggestion is a hard requirement for a successful resolution of this issue.
If possible, you might want to initialize `ForkJoinPool.commonPool()` explicitly and early in the life cycle, in a manner that avoids circular initialization. There are some rare gotchas with j.u.c. and classloading for very core features, such as the lost unpark bug. If you can safeguard during bootstrap then maybe you’ll be protected from this happening again due to hidden usages that you did not know to account for. I’m not familiar enough with agents and weaving to know if that is possible and its implications, though.
Left some comments in the PR.
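A minimal sketch of what such eager initialization could look like during agent bootstrap. The `premain` placement and class name are assumptions for illustration; the point is only to touch `ForkJoinPool` on a plain thread, outside any class-loading or weaving path:

```java
import java.lang.instrument.Instrumentation;
import java.util.concurrent.ForkJoinPool;

public final class AgentBootstrapSketch {
  public static void premain(String agentArgs, Instrumentation inst) {
    // Force ForkJoinPool's static initialization (and common pool creation)
    // now, rather than lazily from inside a class-transformation callback
    // where circular static initialization could occur.
    ForkJoinPool.commonPool();

    // ... continue with normal agent setup (hypothetical).
  }
}
```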
Makes sense. We’ll of course continue working with the customer and see if there’s anything else we can surface around it.
I see you already found #572 😃. I noticed that you’re no longer an outside collaborator on the fork, but if you’d like to have access again let me know and I’ll send a new invite.
That’s interesting indeed. Not sure if we can see the source code of that JDK version, but the current JDK 17 implementation of `ForkJoinPool` uses `VarHandle`, so it’s unlikely that those changes were backported all the way to Java 8. Maybe there’s a subtly different root cause there. In any case, it can’t hurt for them to try the `Caffeine.executor(Runnable::run)` changes as well, I suppose.
Alright, with `Executors.newCachedThreadPool()` the issue also does not reproduce. 🎉 (The heap dump now contains only small `WS` instances, with the largest one (containing the prepared statements) retaining only 339046 bytes.) So this definitely points to the issue being somehow related to `ForkJoinPool.commonPool()`. I have now repackaged the application based on Docker image `openjdk:16.0.2-slim-bullseye` (Oracle Java 16.0.2+7-67) and deployed it in combination with the original New Relic Agent fork, i.e. without any of the modifications applied over the past few days. I hope to have a moment EOD to report any findings of this (hopefully final 🙃) experiment.
Wonderful! The openjdk github shows that changes were made to ForkJoinPool in jdk17, which had been untouched since jdk13.
Thanks @Stephan202! Then this is as expected that Caffeine’s memory barrier wasn’t the cause. This next test run should let us know if the executor is silently discarding the task.
I deployed New Relic Agent 7.3.0 + Java 17 support + the custom Caffeine SNAPSHOT from Jitpack to the affected application. It’ll take a bit (possibly until tomorrow, since some key fulfillment center processes will soon wrap up) before I can say with certainty whether the issue still occurs. Will post another message once I know more.
Thanks for that context @ben-manes! I just checked the heap dump associated with the screenshot above, and the shown `WS` instance has a `keyReferenceQueue` field with a `queueLength` of 1641240 (📈). The `drainKeyReferences()` logic looks fine in combination with the `WS` method overrides, so (IIUC) the question now becomes: why is `drainKeyReferences()` not invoked?
I see `drainKeyReferences()` is only invoked from `maintenance()`, which is guarded by `evictionLock`. The heap dump shows that `evictionLock.sync.state` is `0` (with the other fields being `null`), which IIUC means that this lock wasn’t held at the time. (Unfortunately we didn’t also create a thread dump.)
I see that the cache’s `drainStatus` is `3` (`PROCESSING_TO_REQUIRED`); not sure whether that’s a relevant hint.
That’s all I got for now. If something else comes to mind I’ll share here.
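That is consistent with an executor that never runs the scheduled maintenance: `drainStatus` stays at `PROCESSING_TO_REQUIRED` and the reference queue is never drained. A small standalone illustration of that failure mode, using Caffeine’s public API and a deliberately discarding executor to stand in for the misbehaving common pool (with the fixed 2.9.3/3.0.5 releases the cache recovers on its own, so the exact numbers depend on the Caffeine version):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class StarvedMaintenanceDemo {
  public static void main(String[] args) throws InterruptedException {
    // An executor that drops every task, simulating a common pool that never
    // runs Caffeine's scheduled maintenance work.
    Cache<Object, String> cache = Caffeine.newBuilder()
        .weakKeys()
        .executor(task -> { /* discarded */ })
        .build();

    for (int i = 0; i < 100_000; i++) {
      cache.put(new Object(), "value"); // each key becomes unreachable immediately
    }
    System.gc();
    Thread.sleep(500);

    // Entries whose weak keys were collected linger until the reference queue
    // is drained, which only happens as part of maintenance.
    System.out.println("before cleanUp: " + cache.estimatedSize());

    // Running maintenance explicitly (here on the calling thread) drains the
    // key reference queue and retires the dead entries.
    cache.cleanUp();
    System.out.println("after cleanUp:  " + cache.estimatedSize());
  }
}
```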
Had a closer look, and indeed the `WS` instance is referenced by an `ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache`. It appears that the agent loads this code in some special way using `CaffeineBackedExtensionClass`. No idea whether class loader/class weaving/java agent “magic” is relevant here. 🤔