dd-trace-java: Frequent JVM crashes on 1.16.0
We recently introduced the DataDog JVM agent for our applications. We used 1.15.3 initially and then upgraded to 1.16.0. With both versions we are seeing frequent JVM crashes. We get the following message during shutdown:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fc3e4316cdd, pid=58, tid=205
#
# JRE version: OpenJDK Runtime Environment (17.0.7+7) (build 17.0.7+7-Ubuntu-0ubuntu120.04)
# Java VM: OpenJDK 64-Bit Server VM (17.0.7+7-Ubuntu-0ubuntu120.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libjavaProfiler13615730602053061346.so+0x31cdd] FlightRecorder::recordTraceRoot(int, int, TraceRootEvent*)+0x3d
#
# Core dump will be written. Default location: /opt/site/app/bin/core.58
#
# JFR recording file will be written. Location: /opt/site/app/bin/hs_err_pid58.jfr
#
# An error report file with more information is saved as:
# /opt/site/app/bin/hs_err_pid58.log
#
# If you would like to submit a bug report, please visit:
# Unknown
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
/opt/site/app/bin/app.jar: line 279: 58 Aborted (core dumped) "$javaexe" "${arguments[@]}"
Can you help us understand if this error is triggered by DataDog?
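For context, the agent is attached at JVM startup roughly as shown below; the paths and flags are simplified for illustration and are not our exact launch command.
```
java -javaagent:/opt/datadog/dd-java-agent.jar \
     -Ddd.profiling.enabled=true \
     -jar app.jar
```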
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 26 (13 by maintainers)
Hi @romaintonon, apologies for the inconvenience. This crash is an unrelated issue. As a temporary workaround, can you set -Ddd.profiling.ddprof.enabled=false, which will fall back to built-in JFR profiling while we reproduce and fix this issue?
Hi, we are facing the same issue in one of our applications since version 1.18.0 of the tracer. The JVM crashes after a while. We are using Temurin JDK 17 in an Alpine Docker image. Here are the logs, if they can help to solve this issue:
@Stephan202 thanks so much for the useful diagnostic information and helping to confirm the fix. I’ll close this and reopen the issue if it recurs on 1.17.0+.
So far no more segfaults! 🚀
1.17.0 has been released and should resolve this issue; we reduced the problem to a reproducible test case, which was fixed in 1.17.0, but please report back if the crashes persist.
We don’t need the hs_err file, the cause is now understood.
Any users encountering this issue should stay on 1.14.0 until 1.17.0 is released, or set -Ddd.profiling.ddprof.enabled=false, ensuring a fallback to built-in JFR until they are ready to upgrade to 1.17.0.
@Stephan202 thanks for your efforts and reporting back.
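To make the workaround above concrete, the property can be added to the JVM arguments directly or picked up via the standard JAVA_TOOL_OPTIONS environment variable; the launch command below is illustrative, not taken from any report in this thread.
```
# Directly on the command line (example launch, adjust to your setup):
java -javaagent:/opt/datadog/dd-java-agent.jar \
     -Ddd.profiling.ddprof.enabled=false \
     -jar app.jar

# Or via the JAVA_TOOL_OPTIONS environment variable:
export JAVA_TOOL_OPTIONS="-Ddd.profiling.ddprof.enabled=false"
```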
For transparency’s sake: since GA, this is the second time we have had crashes reported related to the serialisation of JFR events in our native profiler after adding new event types, and preventing it from crashing the profiled process again is going to be our top priority in the short term. We have a change in the pipeline that alters the behaviour on buffer overflow: rather than risk crashing the process or writing to arbitrary memory locations, the result would be a truncated recording (which we have metrics for, so we can react to it). In the longer term, we will completely rewrite the event serialisation to prioritise safety.
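A minimal sketch of the "truncate rather than overflow" idea described above, written in C++ since the profiler component is native; the class and member names are hypothetical and are not taken from the actual profiler source.
```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical bounds-checked event buffer: when an event does not fit,
// it is dropped and the recording is flagged as truncated, instead of
// writing past the end of the buffer.
class EventBuffer {
public:
    explicit EventBuffer(size_t capacity) : _data(capacity) {}

    bool write(const void* event, size_t size) {
        if (size > _data.size() - _pos) {
            _truncated = true;   // surfaced as a metric so truncation can be reacted to
            return false;        // drop the event; never write out of bounds
        }
        std::memcpy(_data.data() + _pos, event, size);
        _pos += size;
        return true;
    }

    bool truncated() const { return _truncated; }
    size_t used() const { return _pos; }

private:
    std::vector<uint8_t> _data;
    size_t _pos = 0;
    bool _truncated = false;
};
```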
Unfortunately, as of now we have no way to reproduce this in a non-prod environment. For now we are downgrading to 1.14.0 to be on the safe side; if that doesn’t work out, we will try the new patch version. Thanks for the quick response anyway. I’ll let you know about our findings, if any.
Hi @martin-tarjanyi thanks for the bug report and apologies for the crash. We will fix this and put out a patch release ASAP.