async-profiler: Wall-clock profiler - hangs JVM
Java version:
# java -version
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Corretto-11.0.6.10.1 (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.6.10.1 (build 11.0.6+10-LTS, mixed mode)
Tomcat server 9.0.33.0.
After running the wall-clock profiler, the JVM hung in such a state that the process was not killed and there was no core dump. This was a production environment, so we needed to restart the application ASAP. What data can I gather the next time such a hang occurs?
From the user's perspective, the server did not accept any connections, neither HTTP nor JMX.
Profiler was started as agent and was managed by:
profiler.sh start -e wall -o jfr -f <file> <pid>
profiler.sh stop <pid> > /dev/null
About this issue
- State: closed
- Created 4 years ago
- Comments: 32 (21 by maintainers)
Commits related to this issue
- #335: Do not restart poll() calls with finite timeout — committed to async-profiler/async-profiler by apangin 4 years ago
- #335: Fixed unsafe thread local storage access — committed to async-profiler/async-profiler by apangin 4 years ago
Released 1.8
So far so good, 24h on 2 nodes on test env., everything is working. I will try to test it on production next week.
No, it’s another issue.
After investigation I came to the conclusion that this is a JVM bug that appeared with JDK-8132510. Since JDK 9 the implementation of Thread::current() in HotSpot was changed from pthread_getspecific to a glibc thread-local variable. However, the latter is not async signal safe. Formally, pthread_getspecific is not safe either, but in practice it is.
The AsyncGetCallTrace API requires a pointer to JNIEnv*. But all legal ways to get JNIEnv* inside a signal handler result in a Thread::current() call, which in turn accesses a thread-local variable and causes a non-reentrant call to malloc.
Why I think this is a VM bug.
[...] AsyncGetCallTrace are potentially affected. There should be a safe way provided by the JVM to call AsyncGetCallTrace.
There was a long discussion thread about signal safety of the proposed solution, but unfortunately it did not consider profiling scenarios.
So, I think I should probably resurrect the discussion again. Meanwhile I came up with an idea for a workaround. Will post the update a bit later.
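For readers unfamiliar with the hazard described above, here is a minimal C sketch of the two ways a signal handler can find "the current thread": a pthread key lookup and a compiler thread-local variable. It is only an illustration, not HotSpot code and not the workaround mentioned above. All names (demo_thread, demo_key, demo_tls, demo_handler, demo_run) are hypothetical, and the sketch assumes Linux with glibc and GCC or Clang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-thread VM structure (not HotSpot's Thread). */
typedef struct {
    int id;
} demo_thread;

static pthread_key_t demo_key;           /* slot read with pthread_getspecific */
static __thread demo_thread *demo_tls;   /* compiler thread-local variable */
static volatile sig_atomic_t seen_by_key = -1;
static volatile sig_atomic_t seen_by_tls = -1;

static void demo_handler(int signo) {
    (void)signo;

    /* Key lookup: POSIX does not list pthread_getspecific as
     * async-signal-safe, but for a key that already exists it is
     * typically a plain table read with no allocation. */
    demo_thread *t = pthread_getspecific(demo_key);
    if (t != NULL) seen_by_key = t->id;

    /* Thread-local read: in a main executable this is a simple load.
     * In a dlopen'ed shared library, the first access in a thread may
     * go through the dynamic TLS path and allocate memory, which is
     * unsafe if the signal interrupted malloc. */
    if (demo_tls != NULL) seen_by_tls = demo_tls->id;
}

/* Registers the calling thread in both slots, delivers one signal to it,
 * and returns 0 if the handler saw the same thread through both lookups. */
int demo_run(int id) {
    static demo_thread self;
    self.id = id;

    if (pthread_key_create(&demo_key, NULL) != 0) return -1;
    if (pthread_setspecific(demo_key, &self) != 0) return -1;
    demo_tls = &self;  /* first touch happens here, outside any handler */

    struct sigaction sa;
    sa.sa_handler = demo_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) != 0) return -1;

    raise(SIGUSR1);  /* the handler completes before raise() returns */
    return (seen_by_key == id && seen_by_tls == id) ? 0 : 1;
}
```

Calling demo_run(7) from a main function (compiled with -pthread) exercises both lookups once. The sketch deliberately does not try to reproduce the hang, since that depends on a signal landing while the interrupted thread is inside malloc.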