newrelic-java-agent: Caffeine cache causing memory leak and OOM

Description

Recently we updated our services to Java 17 (17.0.1), which also forced us to use New Relic Java agent version 7.4.0.

After the migration, we noticed our services regularly crash due to OOM. At first we suspected an issue with Java 17 itself, but after obtaining several heap dumps it looks like a problem with the Caffeine caching introduced in 7.3.0 via #258 (we migrated from an older version of the agent).

We believe it might be related to the MariaDB driver we’re using. In one of the heap dumps, the com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.WS map has 2.1 million entries of prepared statements. The driver could be a factor, since the cache would rely on the mariadb-java-client implementation of PreparedStatement, which does not implement equals/hashCode.

After taking two more heap dumps, one right after startup of the app and one an hour later, this is the comparison:

[screenshot: comparison of the two heap dumps]

Looks like the cache contained only PreparedStatement entries.

Expected Behavior

Agent and service running smoothly, no memory leak.

Troubleshooting or NR Diag results

Steps to Reproduce

Your Environment

  • New Relic agent: 7.4.0
  • Java: eclipse-temurin:17 Docker image (currently 17.0.1)
  • Spring Boot: 2.5.6
  • MariaDB driver: 2.7.4 (managed by Spring Boot BOM)

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 10
  • Comments: 61 (30 by maintainers)

Commits related to this issue

Most upvoted comments

Caffeine’s fix is released in v2.9.3 and v3.0.5. This will recover if the executor malfunctions, but affected users should apply a workaround for the JDK bug.

Today the application used less memory, though there was still a slight upward trend. I created two heap dumps; in the meantime the application was restarted for unrelated reasons. Key findings:

  • The WS instance referenced by ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache that contains the prepared statements is now small (just 267776 and 37752 bytes retained, respectively), with keyReferenceQueue.queueLength == 0, as it should be. 🎉

    All other ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache instances are now also small (on the order of kilobytes). I went back to yesterday’s heap dump and saw that a few other such instances were also larger back then (on the order of megabytes).

    All in all: whatever the issue/bug is, Caffeine.executor(Runnable::run) avoids it. ✔️

  • Two other WS instances that take up a significant amount of memory remain:

    • Instances referenced by JdbcHelper.connectionToURL comprising 22.17% / 8.12% of the heap, with a keyReferenceQueue.queueLength of 616446 / 102386.
    • Instances referenced by JdbcHelper.connectionToIdentifier comprising 18.47% / 5.74% of the heap, with a keyReferenceQueue.queueLength of 501032 / 76343.

    IIUC these caches are created by AgentCollectionFactory.createConcurrentWeakKeyedMap, for which we did not set Caffeine.executor(Runnable::run). It’s safe to say that these two caches explain why memory usage was still trending upwards today.

As previously discussed, the next step would now be to test whether the issue is specific to Java 17, by downgrading to (e.g.) Java 16. Unfortunately my schedule today was even busier than expected, and tomorrow will certainly be no better. So instead I have started a different experiment that took me a bit less time to set up: I now updated the New Relic Agent to use Executors.newCachedThreadPool() rather than Runnable::run (this time also for the cache in AgentCollectionFactory). This experiment should hopefully tell us whether (a) the issue is somehow related to any kind of scheduling/thread switching or (b) it is more likely to be specific to ForkJoinPool.commonPool().

Stay tuned 😃
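To make the experiment concrete, here is a sketch of the two executor variants being compared, using the unshaded Caffeine package for readability (illustrative only; the agent’s actual builder code may differ):

import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import com.github.benmanes.caffeine.cache.Caffeine;

// (a) A thread pool other than the common pool, to test whether asynchronous
//     scheduling in general is the problem.
Executor cachedPool = Executors.newCachedThreadPool();
// (b) Same-thread execution, which has already been shown to avoid the leak.
Executor sameThread = Runnable::run;

// Either one replaces the default ForkJoinPool.commonPool() in the agent’s builders.
Caffeine<Object, Object> builder = Caffeine.newBuilder()
    .weakKeys()
    .executor(cachedPool); // or .executor(sameThread)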

Yes. I’ll provide a patch jar this evening (PST) and notify you here. If it helps then we can merge that in for a release. We can target the weekend to give us both some bandwidth if non-urgent. If urgent then I think a temporary workaround on your side is most prudent so that you can control the code.

@ben-manes tnx for the update: that makes sense 😃

As for the Java Agent warning: I suppose that could be turned into a full-fledged new feature: detection of in-use functionality impacted by known bugs (not just in the JDK, but also other libraries on the classpath). 📈

@Stephan202 Had a sick kid this week so wasn’t able to engage with this issue as I’d hoped. We’re running in k8s so it’s almost certainly https://bugs.openjdk.java.net/browse/JDK-8274349 as you suggest. Testing whether that arg resolves the leak now

Edit: Been running for a few hours now with that arg. OU space isn’t growing, so I’m pretty confident that the suggestion is definitely a usable workaround until Java 17.0.2. Thanks for figuring this out

@trask currently we have not reproduced it, but have reasoned through it based on heap dumps, Doug’s deep knowledge, and code spelunking. The main bit of insight came from the email below:

On Sun, Nov 21, 2021 at 12:14 PM Doug Lea dl@cs.oswego.edu wrote: Also…

On 11/21/21 1:55 PM, Stephan Schroevers wrote:

In summary, the heap dump contains a single ForkJoinPool instance, referenced by ForkJoinPool.common and the various Caffeine caches. Inspecting its fields, I see that:

  • ctl, mode, bounds, threadIds, scanRover and stealCount are 0.

The only thing I know that can account for this is if there is a static initialization circularity in the system. Which is one of the possibilities that the in-progress updates try to reduce. (They can’t be eliminated because they are due to user programs, but opportunities can be reduced.)

Well, the other possibility is memory corruption or a GC bug. Maybe try different GCs.

A quick status update.

I completed the work on v3, but have not yet backported it to v2 used here. Given the U.S. holiday week, I’ll try but may not have a release until next weekend. Those changes only mask the problem because the ForkJoinPool may be in a corrupted state and impact the application’s behavior.

@Stephan202 and Doug further isolated the problem to static initialization circularity. While there are some in-progress improvements to relax this, it sounds like a mistake in NewRelic’s agent, which performs weaving during class loading. I suspect this is because Caffeine defaults to using ForkJoinPool.commonPool(), so constructing a cache forces that to initialize when it sets the internal executor field. This oversight might have occurred when replacing Guava’s CacheBuilder usages with a Caffeine builder, as the two appear to be nearly identical. However, Guava lacks this option and simply runs work on the calling thread. That has the potential to degrade external latencies if user-supplied logic is expensive, such as callbacks to a removal listener. Similar behavior can be obtained using Runnable::run, which was shown to be fine for this workload.

A similar project is OpenTelemetry’s java agent, which also embeds Caffeine for its weak-keyed caches. They switched to a same-thread executor after other approaches to wrangle FJP became tiresome. /cc @anuraaga @trask

While these are preliminary findings, I believe the correct fix is to specify Caffeine.executor in all cache instances created by the java agent.
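A minimal sketch of that fix using a plain (unshaded) Caffeine builder, not the agent’s actual code: every cache the agent creates would pass an explicit executor so that maintenance never relies on ForkJoinPool.commonPool().

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Weak-keyed cache in the style of instanceCache / connectionToURL; the
// explicit same-thread executor keeps eviction and reference draining on the
// calling thread, mirroring Guava’s behavior.
Cache<Object, Object> cache = Caffeine.newBuilder()
    .weakKeys()
    .executor(Runnable::run)
    .build();

The trade-off noted above is that any expensive user-supplied work (e.g. removal listeners) would then run on application threads.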

Okay, I didn’t wait for the application’s memory usage to reach critical levels, but I can confirm that the patch doesn’t make the issue go away. A new heap dump shows that ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache again references a large WS instance, now accounting for 47.58% of the heap. This time keyReferenceQueue.queueLength is 298808. (And drainStatus is again PROCESSING_TO_REQUIRED.)

I have now updated the New Relic Agent fork to include one more commit, configuring Caffeine.executor(Runnable::run) for the relevant cache instance. I’ll let this run over night and check again some time tomorrow morning. (The next two days are also full of meetings, so I won’t be able to iterate much faster, sorry.)

Whoa, awesome! Yes, thank you @ben-manes @Stephan202 and @mateusz-bajorek for this community effort. We’ll plan to update our dependency with the patch version in the next release.

@Stephan202 When working on the release, I decided to remove the proposed detection and fallback logic. It would only mask the problem in some ways, but the application would be broken in any other usage. While it is very surprising that the commonPool is broken, it’s a JVM issue that we can now explain to confused users. The only change that the library will make is to be more robust about evicting in this case, which would have avoided the memory leak.

I do think it might be reasonable for a java agent or similar tools to detect these types of bugs and warn users of problematic application settings. That fits with their goal of monitoring the system’s health.

I toyed with that Docker flag quite a bit, and I think checking the Java version is unfortunately the best way. The snippet below is what I’m thinking of for 3.x; for 2.x it would have to check the java.version system property. wdyt?

Executor getExecutor() {
  // Honor a user-supplied executor, unless it is the (possibly broken) common pool.
  if ((executor != null)
      && !((executor instanceof ForkJoinPool) && (executor == ForkJoinPool.commonPool()))) {
    return executor;
  } else if ((ForkJoinPool.getCommonPoolParallelism() == 1) && (Runtime.version().feature() == 17)
      && (Runtime.version().interim() == 0) && (Runtime.version().update() <= 1)) {
    // Single-CPU common pool on Java 17.0.0 / 17.0.1: run maintenance on the calling thread.
    return Runnable::run; // JDK-8274349
  }
  return ForkJoinPool.commonPool();
}
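For 2.x, which still supports Java 8 and therefore cannot use Runtime.version(), a hedged sketch of the equivalent check could parse the java.version system property (method name and parsing are illustrative, not actual Caffeine code):

// Hypothetical 2.x counterpart of the check above.
static boolean isBrokenCommonPool() {
  if (ForkJoinPool.getCommonPoolParallelism() != 1) {
    return false;
  }
  String[] parts = System.getProperty("java.version").split("[._-]");
  try {
    int feature = Integer.parseInt(parts[0]);
    int interim = (parts.length > 1) ? Integer.parseInt(parts[1]) : 0;
    int update = (parts.length > 2) ? Integer.parseInt(parts[2]) : 0;
    return (feature == 17) && (interim == 0) && (update <= 1); // JDK-8274349
  } catch (NumberFormatException e) {
    return false; // unrecognized version string; assume unaffected
  }
}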

If that is the case, then the fix for this issue is to warn users of the incompatibility, such as by logging an error on startup when it is detected.

Please confirm if my understanding is correct and @tbradellis if that suggested change would be an acceptable fix.

Informing users of the incompatibility and pointing to JDK-8274349 does seem reasonable and would be in line with how we’ve treated other issues that point back to the JDK (usually some SIGSEGV on an older JDK). That said, if we can provide a code fix that avoids the issue, we’d do that as well. We do plan to take https://github.com/newrelic/newrelic-java-agent/pull/572, as it’s more in line with what was done with the old Guava dependency and, if I understand correctly, avoids this issue altogether, which would prevent time spent on future issue reports.

I will add that we saw 8 full garbage collections on startup (and, to be honest, a judicious 9th would’ve been good) with newrelic enabled, and without it there are none, so even though the profile stabilises I feel like there are still issues here. I haven’t yet tested parallel=1 + newrelic=off, or the number of FGCs for Java 11 + newrelic. Will try to do so on Monday, but for the latter I need to get tools installed on the Java 11 image so I can compare, and… 🙄 . Please don’t hold your breath, anyway.

Thank you @Stephan202. Sadly the issue does not impact me directly, so I do not have a setup prepared to retest it. There is no OOM in my application, even when I tried to force one; the growth is probably too small to impact frequently redeployed app instances. Still, this does block wider adoption of Java 17 for me 😿 I asked for the information to better plan future work. Thank you for doing an awesome job with this issue 🙇‍♂️

^ I just added a comment to the PR (https://github.com/newrelic/newrelic-java-agent/pull/572#issuecomment-976944700), and based on an analysis of the latest memory dump it might be that @ben-manes’s suggestion is a hard requirement for a successful resolution of this issue.

If possible, you might want to initialize ForkJoinPool.commonPool() explicitly and early in the life cycle, in a manner that avoids circular initialization. There are some rare gotchas with j.u.c. and classloading for very core features, such as the lost unpark bug. If you can safeguard during bootstrap then maybe you’ll be protected from this happening again due to hidden usages that you did not know to account for. I’m not familiar enough with agents and weaving to know if that is possible and its implications, though.
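A rough sketch of that idea, assuming the agent has some bootstrap hook that runs before weaving starts (the method name is hypothetical):

import java.util.concurrent.ForkJoinPool;

// Hypothetical bootstrap step: referencing the common pool here completes
// ForkJoinPool’s static initialization before any application classes are
// woven, so that initialization cannot interleave with transformed classes.
static void eagerlyInitializeCommonPool() {
  ForkJoinPool pool = ForkJoinPool.commonPool();
  // The parallelism can be logged for diagnostics, e.g. to flag the
  // single-CPU case behind JDK-8274349.
  int parallelism = pool.getParallelism();
}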

I see you already found #572 😃

left some comments in the PR.

Maybe there’s a subtly different root cause there.

Makes sense. We’ll of course continue working with the customer and see if there’s anything else we can surface around it.

I’m starting a branch that will use Caffeine.executor(Runnable::run) as @ben-manes suggested.

I see you already found #572 😃. I noticed that you’re no longer an outside collaborator on the fork, but if you’d like to have access again let me know and I’ll send a new invite.

We have a customer reporting this in a support ticket on IBM J9 VM JDK 8u301. The heap profile is the same.

That’s interesting indeed. Not sure if we can see the source code of that JDK version, but the current JDK 17 implementation of ForkJoinPool uses VarHandle, so it’s unlikely that those changes were backported all the way to Java 8. Maybe there’s a subtly different root cause there. In any case, it can’t hurt for them to try the Caffeine.executor(Runnable::run) changes as well, I suppose.

Alright, with Executors.newCachedThreadPool() the issue also does not reproduce. 🎉 (The heap dump now contains only small WS instances, with the largest one (containing the prepared statements) retaining only 339046 bytes.) So this definitely points to the issue being somehow related to ForkJoinPool.commonPool().

I have now repackaged the application based on Docker image openjdk:16.0.2-slim-bullseye (Oracle Java 16.0.2+7-67) and deployed it in combination with the original New Relic Agent fork, i.e. without any of the modifications applied over the past few days. I hope to have a moment EOD to report any findings of this (hopefully final 🙃) experiment.

Thanks @Stephan202! Then, as expected, Caffeine’s memory barrier wasn’t the cause. This next test run should let us know if the executor is silently discarding the task.

I deployed New Relic Agent 7.3.0 + Java 17 support + the custom Caffeine SNAPSHOT from Jitpack to the affected application. It’ll take a bit (possibly until tomorrow, since some key fulfillment center processes will soon wrap up) before I can say with certainty whether the issue still occurs. Will post another message once I know more.

Thanks for that context @ben-manes! I just checked the heap dump associated with the screenshot above, and the shown WS instance has a keyReferenceQueue field with a queueLength of 1641240 (📈). The drainKeyReferences() logic looks fine in combination with the WS method overrides, so (IIUC) the question now becomes: why is drainKeyReferences() not invoked?

I see drainKeyReferences() is only invoked from maintenance(), which is guarded by evictionLock. The heap dump shows that evictionLock.sync.state is 0 (with the other fields being null), which IIUC means that this lock wasn’t held at the time. (Unfortunately we didn’t also create a thread dump.)

I see that the cache’s drainStatus is 3 (PROCESSING_TO_REQUIRED); not sure whether that’s a relevant hint.

That’s all I got for now. If something else comes to mind I’ll share here.
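For readers less familiar with this part of Caffeine: cleared weak keys accumulate on an internal reference queue until the cache’s maintenance runs (normally asynchronously on the configured executor, the common pool by default), which is exactly the step that appears to be missing here. A small standalone illustration of the mechanism, unrelated to the agent’s own caches:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

Cache<Object, String> cache = Caffeine.newBuilder()
    .weakKeys()              // keys are held via WeakReference and compared by identity
    .build();

Object key = new Object();
cache.put(key, "value");
key = null;                  // make the key unreachable
System.gc();                 // a hint only, but usually enough for a demo
// Until maintenance drains the key reference queue, the dead entry lingers.
cache.cleanUp();             // forces maintenance on the calling thread
System.out.println(cache.estimatedSize()); // expected to approach 0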

(I suspect it’s this cache, but don’t yet understand the New Relic Agent code well enough to be sure.)

Had a closer look, and indeed the WS instance is referenced by an ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache. It appears that the agent loads this code in some special way using CaffeineBackedExtensionClass. No idea whether class loader / class weaving / java agent “magic” is relevant here. 🤔