async-profiler: Wall-clock profiler - hangs JVM

Java version:

# java -version
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Corretto-11.0.6.10.1 (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.6.10.1 (build 11.0.6+10-LTS, mixed mode)

Tomcat server 9.0.33.0.

After running the wall-clock profiler, the JVM hung in a state where the process was not killed and no core dump was produced. This was a production environment, so we needed to restart the application ASAP. What data can I gather the next time such a hang occurs?

From the user's perspective, the server did not accept any connections, neither HTTP nor JMX.

The profiler was started as an agent and was managed by:

profiler.sh start -e wall -o jfr -f <file> <pid>
profiler.sh stop <pid> > /dev/null

About this issue

State: closed
Created 4 years ago
Comments: 32 (21 by maintainers)

Commits related to this issue

#335: Do not restart poll() calls with finite timeout — committed to async-profiler/async-profiler by apangin 4 years ago
#335: Fixed unsafe thread local storage access — committed to async-profiler/async-profiler by apangin 4 years ago

Most upvoted comments

Released 1.8

apangin on Aug 10, 2020

So far so good: after 24 hours on 2 nodes in the test environment, everything is working. I will try to test it in production next week.

krzysztofslusarski on Jul 30, 2020

No, it’s another issue.

After investigation, I came to the conclusion that this is a JVM bug that appeared with JDK-8132510. Since JDK 9, the implementation of Thread::current() in HotSpot has changed from pthread_getspecific to a glibc thread-local variable. However, the latter is not async-signal-safe. Formally, pthread_getspecific is not safe either, but in practice it is.

The AsyncGetCallTrace API requires a pointer to JNIEnv*. But all legal ways to get JNIEnv* inside a signal handler result in a Thread::current() call, which in turn accesses a thread-local variable and causes a non-reentrant call to malloc.
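The pre-JDK 9 style of lookup described above can be illustrated with a minimal C sketch. This is not HotSpot or async-profiler code: `fake_thread_t`, `thread_key`, `on_signal` and `sample_current_thread_id` are hypothetical names invented for the illustration, and SIGUSR1 is used only as a convenient test signal. The handler reads the "current thread" pointer through pthread_getspecific, which is not on the POSIX list of async-signal-safe functions but in practice only reads an already populated per-thread slot. The comment describes the thread-local alternative and why, as far as I understand glibc, it can end up calling malloc when the variable lives in a dynamically loaded library.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the VM's per-thread object. */
typedef struct { int id; } fake_thread_t;

/* JDK 8 style: the current thread is stored under a pthread key.
 *
 * JDK 9+ style would instead be something like
 *     static __thread fake_thread_t *current_thread;
 * In a library loaded with dlopen(), the first access to such a variable
 * from a given thread may allocate that thread's TLS block lazily, and the
 * allocation can go through malloc. If the signal interrupted malloc
 * itself, the handler re-enters it and the process can deadlock. */
static pthread_key_t thread_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

/* Written by the handler, read by the interrupted code afterwards. */
static volatile sig_atomic_t seen_id = -1;

static void make_key(void) {
    pthread_key_create(&thread_key, NULL);
}

/* Signal handler: looks up the current thread without any allocation. */
static void on_signal(int signo) {
    (void)signo;
    fake_thread_t *t = pthread_getspecific(thread_key);
    seen_id = (t != NULL) ? t->id : -1;
}

/* Registers a fake thread object for the calling thread, delivers one
 * signal to it, and returns the id that the handler observed. */
int sample_current_thread_id(int id) {
    fake_thread_t self = { id };
    struct sigaction sa, old;

    pthread_once(&key_once, make_key);
    pthread_setspecific(thread_key, &self);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, &old);

    seen_id = -1;
    raise(SIGUSR1);   /* the handler runs before raise() returns */

    sigaction(SIGUSR1, &old, NULL);
    pthread_setspecific(thread_key, NULL);
    return seen_id;
}
```

Calling `sample_current_thread_id(42)` from a `main` function returns the id that the handler read back through the key. The sketch only shows the lookup path; it does not reproduce the hang, which depends on the signal arriving while the interrupted thread is inside malloc.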

Why I think this is a VM bug:

1. It is a regression. Worked fine on JDK 8.
2. It can be easily fixed in the JVM, while there is no safe straightforward workaround in library code.
3. All profilers relying on AsyncGetCallTrace are potentially affected. There should be a safe way provided by the JVM to call AsyncGetCallTrace.

There was a long discussion thread about signal safety of the proposed solution, but unfortunately it did not consider profiling scenarios.

So, I think I should probably resurrect the discussion. Meanwhile, I came up with an idea for a workaround. I will post an update a bit later.

apangin on Jul 29, 2020