async-profiler: Wall-clock profiler - hangs JVM

Java version:

# java -version
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Corretto-11.0.6.10.1 (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.6.10.1 (build 11.0.6+10-LTS, mixed mode)

Tomcat server 9.0.33.0.

After running the wall-clock profiler, the JVM hung in such a state that the process was not killed and there was no core dump. This was a production environment, so we needed to restart the application ASAP. What data can I gather the next time such a hang occurs?

From the user's perspective, the server didn’t accept any connections, neither HTTP nor JMX.

The profiler was started as an agent and was managed by:

profiler.sh start -e wall -o jfr -f <file> <pid>
profiler.sh stop <pid> > /dev/null

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 32 (21 by maintainers)

Most upvoted comments

Released 1.8

So far so good, 24h on 2 nodes on test env., everything is working. I will try to test it on production next week.

No, it’s another issue.

After investigation I came to the conclusion that this is a JVM bug that appeared with JDK-8132510. Since JDK 9, the implementation of Thread::current() in HotSpot has used a glibc thread-local variable instead of pthread_getspecific. However, the former is not async-signal-safe. Formally, pthread_getspecific is not safe either, but in practice it is.

The AsyncGetCallTrace API requires a pointer to JNIEnv*. But all legal ways to get JNIEnv* inside a signal handler result in a Thread::current() call, which in turn accesses the thread-local variable and causes a non-reentrant call to malloc.
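To make the distinction concrete, here is a minimal C sketch of the pattern described above as safe in practice: a per-thread slot is filled from normal context, and the signal handler only reads it back with pthread_getspecific. This is an illustration under assumptions, not HotSpot or async-profiler code, and not necessarily the workaround mentioned later in this thread. The names thread_ctx, register_thread, on_signal and sample_current_thread are hypothetical, and thread_ctx merely stands in for per-thread data such as a JNIEnv*.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical per-thread context; stands in for data such as a JNIEnv*. */
typedef struct {
    int thread_id;
} thread_ctx;

static pthread_key_t ctx_key;
static pthread_once_t ctx_key_once = PTHREAD_ONCE_INIT;
static volatile sig_atomic_t last_seen_id = -1;

static void make_key(void) {
    int rc = pthread_key_create(&ctx_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Call from normal (non-signal) context, once per thread, before a signal
   can arrive. Any allocation or lazy initialization happens here. */
void register_thread(thread_ctx *ctx) {
    pthread_once(&ctx_key_once, make_key);
    int rc = pthread_setspecific(ctx_key, ctx);
    assert(rc == 0);
    (void)rc;
}

/* Signal handler: only reads the slot that register_thread() filled.
   It never triggers first-time initialization of thread-local storage. */
static void on_signal(int signo) {
    (void)signo;
    thread_ctx *ctx = pthread_getspecific(ctx_key);
    last_seen_id = (ctx != NULL) ? ctx->thread_id : 0;
}

/* Installs the handler, raises SIGUSR1 on the calling thread, and returns
   the id seen by the handler (0 if the thread was never registered). */
int sample_current_thread(void) {
    struct sigaction sa;

    pthread_once(&ctx_key_once, make_key);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGUSR1, &sa, NULL);

    last_seen_id = -1;
    raise(SIGUSR1);
    return last_seen_id;
}
```

In this sketch, a thread that has called register_thread() with thread_id 42 gets 42 back from sample_current_thread(), while an unregistered thread gets 0, so a handler can skip unknown threads instead of initializing anything inside the signal.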

Why I think this is a VM bug:

  1. It is a regression. Worked fine on JDK 8.
  2. It can be easily fixed in the JVM, while there is no safe straightforward workaround in library code.
  3. All profilers relying on AsyncGetCallTrace are potentially affected. There should be a safe way provided by the JVM to call AsyncGetCallTrace.

There was a long discussion thread about signal safety of the proposed solution, but unfortunately it did not consider profiling scenarios.

So I think I should probably resurrect that discussion. Meanwhile, I came up with an idea for a workaround. I will post an update a bit later.