classgraph: OutOfMemoryError spins library out of control
Issue summary
`nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler#close` may never complete if the thread was interrupted, which is always the case when scanning failed. See below for details.
Background
I was writing some unit tests for a really basic Neo4J graph, but while Neo4J-OGM was looking for my entities during test setup, it just never completed. With some very roundabout debugging (all the async and OOM interplay was making me tear my hair out), I managed to narrow down the root cause to how classgraph works. I have 32G of memory, and Java defaults to using 1/4 of it per process, which means I cannot run much in parallel, so I have a global 256M override for each process. It works fine for Gradle, Neo4J, and other traditionally resource-hungry setups. It even works for my test setup most of the time, so for this repro I lowered the memory limit further to make the failure consistent.
Repro Setup
Trivial library usage:
```java
// implementation "io.github.classgraph:classgraph:4.8.62"
ScanResult result = new ClassGraph()
        .whitelistPackages("pack.age")
        .verbose()
        .scan();
```
plus a dependency with a nested jar in it:
```groovy
runtimeOnly "org.neo4j:neo4j-lucene-upgrade:3.4.9"
```
and the non-trivial part: resource-constrained memory via `-Xmx128M` (see `build.gradle`).
Full minimal repro can be found here: https://github.com/TWiStErRob/repros/tree/master/classgraph/oom-NestedJarHandler-spin
Repro steps
```shell
gradlew run
```
or
- Import the project into IntelliJ
- Execute `Main.main` (should pass, depending on machine setup)
- Edit the Run Configuration to have VM options: `-Xmx128M`
- Run or debug again (will fail; expected: print a line)
Pieces of the puzzle
Here is what I found to be contributing to the issue.
Nested Jar files
```groovy
runtimeOnly "org.neo4j:neo4j-lucene-upgrade:3.4.9"
```
is an example from Neo4J/OGM which contains `!/lib/lucene-backward-codecs-5.5.0.jar`. This is required to trigger the
```java
} else {
    // This path has one or more '!' sections.
```
code path in `NestedJarHandler`.
Large allocation to trigger OOM
This is baked into the library. `MappedByteBufferResources` allocates 64M per JAR file, which is quite large considering there could be many jars like this on the classpath, so a lot of allocations could be happening at the same time, consuming tons of memory. Note: I have a 14-core/28-logical-core processor, and the library runs on 39 threads. So if all of those are processing nested JARs, 2.5G of memory is required. That doesn’t sound “ultra-lightweight” 😃
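The 2.5G figure above is just the thread count times the per-jar buffer size; a quick sanity check of that arithmetic (the numbers are taken from this report):

```java
// Back-of-the-envelope check of the worst-case buffer footprint described above.
public class BufferFootprint {
    public static void main(String[] args) {
        int threads = 39;                       // observed worker thread count
        long perJarBuffer = 64L * 1024 * 1024;  // 64 MB buffer per nested jar
        long worstCaseMb = threads * perJarBuffer / (1024 * 1024);
        System.out.println(worstCaseMb + " MB"); // prints "2496 MB", i.e. ~2.5 GB
    }
}
```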
This allocation throws an OOM, which is caught by `AutoCloseableExecutorService.afterExecute`, and the thread is interrupted. I think this interrupt doesn’t hurt `NestedJarHandler.close`, so it could be an irrelevant red herring, but this location came up during the investigation so I thought I would mention it.
`Scanner.call` catches and cleans up
When an exception occurs (e.g. our OOM), `removeTemporaryFilesAfterScan` is set to `true` and threads are interrupted. `finally` then comes and tries to close `nestedJarHandler`. The relevant parts of the method:
```java
@Override
public ScanResult call() throws InterruptedException, CancellationException, ExecutionException {
    try {
        scanResult = openClasspathElementsThenScan();
    } catch (final Throwable e) {
        // Since an exception was thrown, remove temporary files
        removeTemporaryFilesAfterScan = true;
        // Stop any running threads (should not be needed, threads should already be quiescent)
        interruptionChecker.interrupt();
    } finally {
        if (removeTemporaryFilesAfterScan) {
            // If removeTemporaryFilesAfterScan was set, remove temp files and close resources,
            // zipfiles and modules
            nestedJarHandler.close(topLevelLog);
        }
    }
    return scanResult;
}
```
`NestedJarHandler#close` spins
See the numbered inline comments for an explanation.
```java
if (canonicalFileToPhysicalZipFileMap != null) {
    // (4) spins out of control, because of (3)
    while (!canonicalFileToPhysicalZipFileMap.isEmpty()) {
        try {
            // (1) throws InterruptedException if the thread was interrupted,
            //     which is always, because Scanner's error handling interrupted it
            for (final Entry<File, PhysicalZipFile> ent : canonicalFileToPhysicalZipFileMap.entries()) {
                final PhysicalZipFile physicalZipFile = ent.getValue();
                physicalZipFile.close();
                // (3) never executed, because of the interruption handling in
                //     SingletonMap (backed by ConcurrentHashMap)
                canonicalFileToPhysicalZipFileMap.remove(ent.getKey());
            }
        } catch (final InterruptedException e) {
            // (2) re-interrupts the Thread
            interruptionChecker.interrupt();
        }
    }
    canonicalFileToPhysicalZipFileMap = null;
}
```
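One possible shape for a non-spinning close loop (my own illustrative sketch against plain JDK types, not the library's `SingletonMap`/`PhysicalZipFile` API): remove each entry before the interruptible close call so the loop always makes progress, and restore the interrupt flag only once, at the end.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cleanup that terminates even when the calling thread
// is already interrupted. The two key changes versus the spinning loop above:
// the interrupt flag is cleared up front (and restored afterwards), and each
// entry is removed BEFORE its close() call, so a failing close cannot stall
// the loop on the same entry forever.
public class SafeClose {
    static int closeAll(Map<String, Closeable> open) {
        boolean wasInterrupted = Thread.interrupted(); // clear flag so close() can run
        int closed = 0;
        for (var it = open.entrySet().iterator(); it.hasNext(); ) {
            var ent = it.next();
            it.remove();                // progress is guaranteed regardless of close()
            try {
                ent.getValue().close();
                closed++;
            } catch (IOException ignored) {
                // best-effort cleanup: keep going
            }
        }
        if (wasInterrupted) {
            Thread.currentThread().interrupt(); // restore interrupt status for callers
        }
        return closed;
    }

    public static void main(String[] args) {
        Map<String, Closeable> open = new ConcurrentHashMap<>();
        open.put("a.jar", () -> {});
        open.put("b.jar", () -> { throw new IOException("boom"); });
        Thread.currentThread().interrupt(); // simulate the interrupted scan thread
        System.out.println(closeAll(open) + " closed, map empty=" + open.isEmpty());
        // prints "1 closed, map empty=true"
    }
}
```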
Proposed solutions
- Quick treatment, with the library gracefully executing to successful completion: catch the OOM explicitly when allocating `buf` in `MappedByteBufferResources`, and spill to disk if that happens. This would be in line with how `IOException` from `makeTempFile` is caught.
- Lower memory usage (could be combined with the previous): `fastZipEntryToZipFileSliceMap` knows the size of the nested jar file (`childZipEntry`); if that was passed down, the decision to spill to disk could be made immediately, before allocating and without reading the whole stream.
- Don’t spin: in any case, `NestedJarHandler.close` should probably be able to close `canonicalFileToPhysicalZipFileMap` even in an interrupted state, so that failed classpath scans can receive the exceptional termination and handle it accordingly.
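The first suggestion could look roughly like this (all names here are illustrative, not the library's actual `MappedByteBufferResources` code): attempt the in-RAM buffer, and fall back to a temp file when the allocation itself fails with `OutOfMemoryError`.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of "catch OOM and spill to disk". The method returns
// either a byte[] (nested jar buffered in RAM) or a File (spilled to disk),
// mirroring how an IOException from temp-file creation is already handled.
public class SpillOnOom {
    static Object bufferOrSpill(InputStream in, int sizeHint) throws IOException {
        try {
            byte[] buf = new byte[sizeHint]; // may throw OutOfMemoryError
            int n = 0, r;
            while (n < sizeHint && (r = in.read(buf, n, sizeHint - n)) > 0) {
                n += r;
            }
            return buf;
        } catch (OutOfMemoryError oom) {
            // Recovery path: copy the remaining stream to a temp file instead
            File tmp = File.createTempFile("nested-jar-", ".jar");
            tmp.deleteOnExit();
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }

    public static void main(String[] args) throws IOException {
        Object result = bufferOrSpill(new ByteArrayInputStream(new byte[] { 1, 2, 3 }), 3);
        System.out.println(result instanceof byte[] ? "buffered" : "spilled");
        // prints "buffered" (the tiny allocation succeeds)
    }
}
```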
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 18 (18 by maintainers)
Commits related to this issue
- Repro for classgraph/classgraph#400 — committed to TWiStErRob/repros by TWiStErRob 4 years ago
- Trim 64MB buffer if jar is smaller; use InputStream size hint (#400) — committed to classgraph/classgraph by lukehutch 4 years ago
- Add `ClassGraph#setMaxBufferedJarRAMSize(int)` (#400) — committed to classgraph/classgraph by lukehutch 4 years ago
- Add disableMemoryMapping() to use RandomAccessFile instead of mmap: #400 — committed to classgraph/classgraph by lukehutch 4 years ago
- Merge pull request #401 from TWiStErRob/issue400_test Add performance regression test for #400 — committed to classgraph/classgraph by lukehutch 4 years ago
- Avoid infinite loop on interruption (#400) — committed to classgraph/classgraph by lukehutch 4 years ago
- Replace file slicing layer and remove mmap support (#400) — committed to classgraph/classgraph by lukehutch 4 years ago
- Add mmap support back as an option (#400) — committed to classgraph/classgraph by lukehutch 4 years ago
- GH-762 - Update classgraph dependency to start up in low-memory environments. See https://github.com/classgraph/classgraph/issues/400 for more details. Also be precise in our internal docs why we ar... — committed to neo4j/neo4j-ogm by TWiStErRob 4 years ago
- GH-762 - Update classgraph dependency to start up in low-memory environments. See https://github.com/classgraph/classgraph/issues/400 for more details. Also be precise in our internal docs why we ar... — committed to neo4j/neo4j-ogm by TWiStErRob 4 years ago
- [SVN] r2932 Cinema - Heroku in /Projects/Cinema/ [FIX] Increase Gradle daemon and hence Unit runner memory to avoid https://github.com/classgraph/classgraph/issues/400 — committed to TWiStErRob/net.twisterrob.cinema by TWiStErRob 4 years ago
After working around #403 by using JDK 9 for the runtime, I can confirm that a complex Neo4J-OGM, Kotlin, Ktor app runs happily with `-Xmx64M`; `-Xmx32M` still runs into OOM, but doesn’t spin.
You’re welcome! That’s the goal. Especially when somebody files a bug that is as detailed and deeply researched as yours is!
I usually try to ensure any bug that is reported is fixed within 24 hours, and a new release pushed out – but this particular bug report has triggered a whole set of domino changes that affect ClassGraph at a very deep level, and it might take a few days to get all the changes in place. I’ll try to get some more of that done tomorrow.
Right, but I thought you were saying `-Xmx` sets the maximum limit for all allocations, both heap and memory-mapped files? Or do memory mappings fall outside the `-Xmx` limit (while still being limited, probably relative to RAM size)? (I guess I could just go test this, but it seems like you know the answer…)

Yes, this was a major oversight, that I forgot to trim that array! I think that alone is the source of the sometimes dramatic drop in memory consumption you saw in your latest comment.
I assumed that zipping a zipfile rarely if ever resulted in further compression, but I guess if a zipfile already contains stored (uncompressed) content, it should be compressible. Or maybe some zipfiles contain central directories that are large enough that zip thinks it is worth zipping the file for the compression gains. However, I just looked at my `~/.m2/repository` cache, and there isn’t a single jar in there that contains a nested jar. So I think it is only web applications, Spring-Boot applications, etc. (with custom classloaders) that cause this issue to arise, and not libraries in general.

Well, that’s even more amazing then, that you put so much time and effort into tracking down the source of the problem, when it was a dependency of a dependency that was causing it!
I always thought so before, but I’m not so sure anymore. Actually the new `RandomAccessFile` method only ever relies on operating system buffers, which will only ever use unused RAM, so the memory overhead is basically zero.

Yes, and this is done lazily. But I didn’t know that OS file buffering worked so well with `RandomAccessFile` – I thought it would cause many additional disk reads, leading to greater overhead.

I should test this on at least Windows before I totally pull the plug on `MappedByteBuffer`, but to be honest the mmap support in ClassGraph has caused a neverending series of headaches over the years, e.g. because without using `misc.Unsafe`, you can’t forcibly unmap mapped byte buffers – you just have to wait for the GC to decide to collect them, then they’re freed – and if you don’t unmap them, Windows holds a file lock on them, which can cause files that were supposed to be deleted to be left behind. Anyway, although ClassGraph seems to work flawlessly with mmap now across all three major operating systems (and it took a lot of work to get there), I won’t shed any tears ripping all of it out as long as the performance is not impacted.

Thanks, I really appreciate the contribution!
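The forcible-unmap workaround mentioned above can be sketched as follows (my own illustrative code, not ClassGraph's): on JDK 9+, `sun.misc.Unsafe.invokeCleaner` releases a mapping immediately, which is what makes the mapped file deletable on Windows without waiting for GC.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: forcibly releasing a MappedByteBuffer via sun.misc.Unsafe (JDK 9+).
// Without this, the mapping lives until the GC collects the buffer, and on
// Windows the mapped file cannot be deleted in the meantime.
public class Unmap {
    static void unmap(MappedByteBuffer buf) throws Exception {
        Field f = Class.forName("sun.misc.Unsafe").getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Object unsafe = f.get(null);
        Method invokeCleaner = unsafe.getClass()
                .getMethod("invokeCleaner", java.nio.ByteBuffer.class);
        invokeCleaner.invoke(unsafe, buf); // releases the mapping immediately
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("mapped", ".bin");
        Files.write(tmp, new byte[] { 42 });
        MappedByteBuffer buf;
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, 1);
        }
        System.out.println(buf.get(0));        // prints 42
        unmap(buf);                            // buf must never be touched after this!
        Files.delete(tmp);                     // now safe to delete, even on Windows
        System.out.println(Files.exists(tmp)); // prints false
    }
}
```

Note the hazard the comment hints at: touching the buffer after `invokeCleaner` crashes the JVM, which is part of why mmap support is such a headache.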
I played around a bit more; I was curious whether I could reproduce what I jokingly referred to above.
I changed the terminating code to:
and gave the process much more memory with `-Xmx10G`.
The numbers are: before GC/after GC on 4.8.62 → before GC/after GC on 4.8.63-SNAPSHOT, in megabytes.
- Without any dependency: 9M/2M → 9M/2M
- With `runtimeOnly "org.neo4j:neo4j-lucene-upgrade:3.4.9"`: 208M/202M → 15M/4M
- With `runtimeOnly files("libs/nested-stored.jar")`: 7M/3M → 7M/3M
- With `runtimeOnly files("libs/nested-deflated.jar")`: 8482M/8453M → 14M/4M

The last two are artificial files with 128 nested JAR files, which I can’t attach here 😦 I know it’s unrealistic, but it proves your fix is working, and it is working really well! See also PR #401.
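For reference, a minimal sketch of how such before-GC/after-GC heap numbers can be collected (my illustrative code; the actual terminating code used for the measurement is not shown above):

```java
// Reports used heap in MB before and after requesting a GC. System.gc() is
// only a hint, so the "after" number is best-effort, but it is good enough
// to show whether large buffers are still reachable after a scan.
public class HeapReport {
    static long usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        long before = usedMb();
        System.gc(); // request (not force) a collection
        long after = usedMb();
        System.out.println("before GC: " + before + "M, after GC: " + after + "M");
    }
}
```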
@TWiStErRob you sir deserve a medal for this bug report. I wish all bug reports were this detailed and well-researched!!
Not quite – firstly, this is virtual memory space that is used, not allocated RAM. The full 2.5GB will only become “resident” (swapped in) if the entire range is in constant active use – and under RAM pressure, virtual memory pages that haven’t been used recently can simply be swapped back out to disk if they have changed, or in the case of jars (which never change), simply dropped from RAM. Also, unless you are running multiple copies of the same program in completely different processes, memory-mapped jars for a single `ClassGraph` instance are shared between threads in the same process (e.g. the ClassGraph worker threads) – and even if you do have multiple processes mapping the same file to RAM, any modern OS (at least Linux, but I assume also Mac OS X and Windows?) should see that the same file is already mapped when a request to map it again is issued, and re-use the mapping without allocating more virtual memory range. This should also cause mapping reuse across `ClassGraph` instances within or across processes.

It’s still possible to run out of virtual memory space easily on a 32-bit OS though, and it’s possible to hit an OOM condition by running out of Java memory-map file handles on Linux (there’s a limit of 64k of them).
(Update, based on your suggestion to use `-Xmx128M`: Oh, I hadn’t considered that Java limits not just the resident size but also the entire accessible virtual memory size with `-Xmx`! That’s actually pretty horrifying if true…)

ClassGraph can still consume an enormous amount of RAM simply to build the graph of `ClassInfo` objects, and if you have multiple instances of this, and/or you’re scanning a large number of classes, this can quickly consume all RAM (#338, #371). `ClassInfo` objects cannot be reused across `ClassGraph` instances or across processes. I spent a lot of time carefully looking at RAM usage for #338, and reduced RAM usage as much as possible, but there was really no way to reduce it further without the caller taking actions to reduce the size of the `ScanResult` by using more aggressive whitelisting etc. – so further lowering memory usage in general is not really a possibility.

This is in tension with the fact that some users of ClassGraph really, really don’t want ClassGraph ever writing anything to disk, or perhaps don’t even have a writeable temporary directory at all (#236). The 64MB default was intentionally set high to cause ClassGraph to avoid writing to disk wherever possible. Also, what you’re suggesting – lowering the 64M and spilling to disk more readily – would absolutely kill performance for some runtimes with lots of nested jars, e.g. a Spring-Boot application, because right now, as long as a nested jar is stored and not deflated, the nested jar can be read directly, using file slicing, when its own entries need to be deflated. If the 64M were lowered, then it would be more common that the inner jar would need to be copied to a temporary file on disk just to have its entries read, which would be wasteful.
Probably though this should be configurable on a per-application basis, so it’s a good idea to add a configuration option for it. I’ll do that.
On your suggestions:
Lots to do here – thanks for bringing these issues to my attention!