spring-boot: Iterating over JarFile.entries() is very slow.

Iterating over org.springframework.boot.loader.jar.JarFile#entries() is very slow. The successive calls to JarFileEntries#getEntry by EntryIterator#next cause the underlying RandomAccessFile to jump back and forth in order to re-read the file headers from the central directory of the jar file. This could be much faster if JarFileEntries#visitFileHeader stored the complete FileHeader instead of just its offset and the hash of its name.
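To make the access pattern concrete, here is a toy sketch of an index that remembers only each record's offset and re-reads the record on every iteration step. It is not the spring-boot-loader code: the class OffsetIndexSketch, its methods, and the length-prefixed record format are all made up for illustration, and only show why each element costs a seek plus a read.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in spring-boot-loader.
class OffsetIndexSketch {

    /** Writes length-prefixed names, a toy stand-in for central directory records. */
    static void writeRecords(Path file, List<String> names) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            for (String name : names) {
                raf.writeUTF(name);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** First pass: keep only the offset of each record, not its content. */
    static long[] indexOffsets(Path file, int count) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long[] offsets = new long[count];
            for (int i = 0; i < count; i++) {
                offsets[i] = raf.getFilePointer();
                raf.readUTF(); // skip over the record itself
            }
            return offsets;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Iteration: every element needs a seek and a fresh read of its record. */
    static List<String> iterateBySeeking(Path file, long[] offsets) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            List<String> names = new ArrayList<>();
            for (long offset : offsets) {
                raf.seek(offset);
                names.add(raf.readUTF());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the decoded records from the first pass would avoid the second round of seeks, at the cost of holding one object per record in memory.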

Some Background

While working on https://github.com/joinfaces/joinfaces/pull/565 I noticed that ClassGraph is about 500ms slower when scanning a repackaged Spring Boot application than scanning the same application in its unpacked form. So I had a look at the code of ClassGraph and saw that it extracted all the nested jars in order to scan them. Hoping to improve the performance of scanning nested jars, I prepared the following patch to use the JarFile implementation of spring-boot-loader in order to avoid the extra cost of extracting all the nested jars: https://github.com/larsgrefer/classgraph/compare/ba4c69347eaf915571e9f5142e09f7a481471570...cfb317aff4d6949afbddcc1fd0ad78b118ef52ec?expand=1

Surprisingly, this approach is even a bit slower, so I dug deeper into the code and traced the performance difference down to the iteration done here: https://github.com/classgraph/classgraph/blob/b170d2bebb871824f7d53d54aa7a9b6939f25cf0/src/main/java/io/github/classgraph/utils/JarfileMetadataReader.java#L148

In my tests, iterating over an org.springframework.boot.loader.jar.JarFile is about 5 to 10 times slower than iterating over a java.util.zip.ZipFile. This performance impact is so severe that it's even faster to extract the nested jar first, to be able to use java.util.zip.ZipFile.
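For anyone who wants to reproduce the java.util.zip.ZipFile side of this comparison, a minimal timing sketch could look like the following. The class name ZipIterationTiming and its methods are made up for this example; it is not the benchmark used for the numbers above, and a single System.nanoTime() measurement is only a rough indication (no warm-up, no repetition).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for a rough measurement; not part of ClassGraph or Spring Boot.
class ZipIterationTiming {

    /** Iterates over all entries of the archive and returns how many were seen. */
    static int countEntries(Path archive) {
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            int count = 0;
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                entries.nextElement().getName(); // touch the entry, as a scanner would
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the wall-clock time of one full iteration, in nanoseconds. */
    static long timeIterationNanos(Path archive) {
        long start = System.nanoTime();
        countEntries(archive);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Path archive = Paths.get(args[0]);
        System.out.println(countEntries(archive) + " entries, "
                + timeIterationNanos(archive) / 1_000_000.0 + " ms");
    }
}
```

Measuring org.springframework.boot.loader.jar.JarFile the same way would need spring-boot-loader on the classpath and is not shown here.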

Conclusion

After reading the commit message of https://github.com/spring-projects/spring-boot/commit/e2368b909b46bc5bcec6792fb208ba9bd0fe6aaa, it seems to me that memory efficiency is more important to you than iteration performance. So my question is whether you would accept a pull request that changes the current behavior, or that allows ClassGraph to change the behavior of the JarFileEntries implementation, and what such a PR should look like.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 24 (21 by maintainers)

Most upvoted comments

While working on this I noticed that my idea would have a much bigger impact on the memory usage than I thought it would. In the current implementation only one CentralDirectoryFileHeader instance is created for each jar file and subsequently filled with the information of the jar file entries. With my implementation I’d have to create one instance per entry.
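The trade-off can be sketched roughly as follows. The Header class below is a hypothetical stand-in, not CentralDirectoryFileHeader, and both indexing methods are invented for illustration: one refills a single shared instance and keeps only a name hash per entry, the other allocates one instance per entry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; these are not the spring-boot-loader classes.
class HeaderReuseSketch {

    /** Toy stand-in for a parsed central directory record. */
    static final class Header {
        String name;
        long offset;

        void load(String name, long offset) {
            this.name = name;
            this.offset = offset;
        }
    }

    /** One shared Header is refilled for every entry; only the name hashes are kept. */
    static int[] indexWithSharedHeader(List<String> names, long[] offsets) {
        Header shared = new Header();
        int[] hashes = new int[names.size()];
        for (int i = 0; i < names.size(); i++) {
            shared.load(names.get(i), offsets[i]);
            hashes[i] = shared.name.hashCode();
        }
        return hashes;
    }

    /** One Header is allocated per entry and retained, so iteration needs no re-read. */
    static List<Header> indexWithHeaderPerEntry(List<String> names, long[] offsets) {
        List<Header> headers = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            Header header = new Header();
            header.load(names.get(i), offsets[i]);
            headers.add(header);
        }
        return headers;
    }
}
```

The second variant keeps every name and offset reachable for the lifetime of the index, which is where the extra memory goes.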

In the meantime we found two other solutions to our initial problem:

  1. @lukehutch implemented a custom directory parser for classgraph, so classgraph can now scan nested jar files directly, without extracting them first.
  2. At JoinFaces we implemented a Maven plugin and a Gradle plugin which perform the classpath scan at build-time.

With the build-time scan we reduced this part of the application startup to ~200ms for reading a list of class names from a text file and loading them.
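The startup side of such a build-time scan might look roughly like this. It is a sketch under assumptions, not the JoinFaces implementation: the class ClassListLoader, the one-class-name-per-line file format, and the choice not to initialize the classes are all invented for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the actual JoinFaces plugins may store and load results differently.
class ClassListLoader {

    /** Reads one fully qualified class name per line and loads each class. */
    static List<Class<?>> loadClasses(Path listFile, ClassLoader loader) {
        List<Class<?>> classes = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(listFile)) {
                String name = line.trim();
                if (name.isEmpty()) {
                    continue; // tolerate blank lines in the list
                }
                // initialize=false: load the class without running static initializers
                classes.add(Class.forName(name, false, loader));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class from scan result not found", e);
        }
        return classes;
    }
}
```

With this shape, no jar has to be iterated at startup at all; the cost is reading one small file plus the class loading itself.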