dd-trace-java: Frequent JVM crashes on 1.16.0
We recently introduced the DataDog JVM agent for our applications. We used 1.15.3 initially and then upgraded to 1.16.0. With both versions we are seeing frequent JVM crashes. We get the following message during shutdown:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fc3e4316cdd, pid=58, tid=205
#
# JRE version: OpenJDK Runtime Environment (17.0.7+7) (build 17.0.7+7-Ubuntu-0ubuntu120.04)
# Java VM: OpenJDK 64-Bit Server VM (17.0.7+7-Ubuntu-0ubuntu120.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libjavaProfiler13615730602053061346.so+0x31cdd] FlightRecorder::recordTraceRoot(int, int, TraceRootEvent*)+0x3d
#
# Core dump will be written. Default location: /opt/site/app/bin/core.58
#
# JFR recording file will be written. Location: /opt/site/app/bin/hs_err_pid58.jfr
#
# An error report file with more information is saved as:
# /opt/site/app/bin/hs_err_pid58.log
#
# If you would like to submit a bug report, please visit:
# Unknown
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
/opt/site/app/bin/app.jar: line 279: 58 Aborted (core dumped) "$javaexe" "${arguments[@]}"
Can you help us understand if this error is triggered by DataDog?
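For context, the agent is attached at JVM startup roughly as shown below; the paths and flags are simplified for illustration and are not our exact launch command.
```
java -javaagent:/opt/datadog/dd-java-agent.jar \
     -Ddd.profiling.enabled=true \
     -jar app.jar
```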
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 26 (13 by maintainers)
Hi @romaintonon, apologies for the inconvenience. This crash is an unrelated issue. As a temporary workaround, can you set -Ddd.profiling.ddprof.enabled=false, which will fall back to built-in JFR profiling while we reproduce and fix this issue?
Hi, we are facing the same issue in one of our applications since version 1.18.0 of the tracer. The JVM crashes after a while. We are using Temurin JDK 17 in an Alpine Docker image. Here are the logs, if they can help to solve this issue:
@Stephan202 thanks so much for the useful diagnostic information and helping to confirm the fix. I’ll close this and reopen the issue if it recurs on 1.17.0+.
So far no more segfaults! 🚀
1.17.0 has been released and should resolve this issue; we reduced the problem to a reproducible test case, which was fixed in 1.17.0, but please report back if the crashes persist.
We don’t need the hs_err file, the cause is now understood.
Any users encountering this issue should stay on 1.14.0 until 1.17.0 is released, or set -Ddd.profiling.ddprof.enabled=false, ensuring a fallback to built-in JFR until they are ready to upgrade to 1.17.0.
@Stephan202 thanks for your efforts and reporting back.
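To make the workaround above concrete, the property can be added to the JVM arguments directly or picked up via the standard JAVA_TOOL_OPTIONS environment variable; the launch command below is illustrative, not taken from any report in this thread.
```
# Directly on the command line (example launch, adjust to your setup):
java -javaagent:/opt/datadog/dd-java-agent.jar \
     -Ddd.profiling.ddprof.enabled=false \
     -jar app.jar

# Or via the JAVA_TOOL_OPTIONS environment variable:
export JAVA_TOOL_OPTIONS="-Ddd.profiling.ddprof.enabled=false"
```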
For transparency’s sake: since GA, this is the second time we have had crashes reported related to the serialisation of JFR events in our native profiler after adding new event types, and preventing it from crashing the profiled process again is going to be our top priority in the short term. We have a change in the pipeline that alters the behaviour on buffer overflow: rather than risk crashing the process or writing to arbitrary memory locations, the result would be a truncated recording (which we have metrics for, so we can react to it). In the longer term, we will completely rewrite the event serialisation to prioritise safety.
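A minimal sketch of the "truncate rather than overflow" idea described above, written in C++ since the profiler component is native; the class and member names are hypothetical and are not taken from the actual profiler source.
```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical bounds-checked event buffer: when an event does not fit,
// it is dropped and the recording is flagged as truncated, instead of
// writing past the end of the buffer.
class EventBuffer {
public:
    explicit EventBuffer(size_t capacity) : _data(capacity) {}

    bool write(const void* event, size_t size) {
        if (size > _data.size() - _pos) {
            _truncated = true;   // surfaced as a metric so truncation can be reacted to
            return false;        // drop the event; never write out of bounds
        }
        std::memcpy(_data.data() + _pos, event, size);
        _pos += size;
        return true;
    }

    bool truncated() const { return _truncated; }
    size_t used() const { return _pos; }

private:
    std::vector<uint8_t> _data;
    size_t _pos = 0;
    bool _truncated = false;
};
```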
Unfortunately, as of now we have no way to reproduce this in a non-prod environment. For now we are downgrading to 1.14.0 to be on the safe side; if that doesn’t work out, we will try the new patch version. Thanks for the quick response anyway. I’ll let you know about our findings, if any.
Hi @martin-tarjanyi thanks for the bug report and apologies for the crash. We will fix this and put out a patch release ASAP.