grpc-java: JVM crash with grpc-java 1.42.x and alpine docker image

An attempt to upgrade from grpc-java 1.41.1 to 1.42.x ends with a JVM crash. The problem looks specific to Alpine Linux: it reproduces on openjdk:15-jdk-alpine and openjdk:8-alpine and goes away after switching to the openjdk:X-slim [debian] images. It may also be affected by the fact that the openjdk:X-alpine images are no longer maintained and so receive no new JDK updates. The first grpc-java version with the problem is 1.42.0; earlier versions work fine.

It may be related to https://github.com/grpc/grpc/issues/27995

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x0000000000003efe, pid=372, tid=0x00007fbffbc9bb10
    #
    # JRE version: OpenJDK Runtime Environment (8.0_212-b04) (build 1.8.0_212-b04)
    # Java VM: OpenJDK 64-Bit Server VM (25.212-b04 mixed mode linux-amd64 compressed oops)
    # Derivative: IcedTea 3.12.0
    # Distribution: Custom build (Sat May  4 17:33:35 UTC 2019)
    # Problematic frame:
    # C  0x0000000000003efe
    #
    # Core dump written. Default location: /temporal-java-client/temporal-kotlin/core or core.372
    #
    # An error report file with more information is saved as:
    # /temporal-java-client/temporal-kotlin/hs_err_pid372.log
    #
    # If you would like to submit a bug report, please include
    # instructions on how to reproduce the bug and visit:
    #   https://icedtea.classpath.org/bugzilla
    #

hs_err_pid372.log

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 8
  • Comments: 19 (8 by maintainers)

Most upvoted comments

I had the same issue having grpc libraries in the classpath like:

io.grpc:grpc-stub:1.42.1

and alpine:3.15.0 with this JDK:

openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-alpine-r0)
OpenJDK 64-Bit Server VM (build 11.0.13+8-alpine-r0, mixed mode)

Thanks to @ejona86, here is what I did:

  • I added the gcompat library:
apk add gcompat
  • I now run Java like this:
LD_PRELOAD=/lib/libgcompat.so.0 java -cp @/app/jib-classpath-file io.my.MyApplication

Or Dockerfile should include just:

RUN apk add gcompat

ENV LD_PRELOAD=/lib/libgcompat.so.0

TL;DR: Alpine doesn’t have compatibility for the __strndup symbol. I don’t know why the behavior is k8s-dependent, though. And it’ll take some more research to determine appropriate next steps.

Looking at objdump output, it looks like the problem is happening in parsePackagePrefix.

libio_grpc_netty_shaded_netty_transport_native_epoll_x86_642526953976876250345.so+0xb487:

    b47c:	48 89 de             	mov    %rbx,%rsi
    b47f:	4c 89 e7             	mov    %r12,%rdi
    b482:	e8 59 fe ff ff       	call   b2e0 <parsePackagePrefix>
    b487:	83 7d dc ff          	cmpl   $0xffffffff,-0x24(%rbp)

But I don’t see any obvious places the stack could get corrupted in parsePackagePrefix and there’s no callout to a passed function. I’m suspicious though that the linker is broken. This is in parsePackagePrefix:

    b395:	e8 5e 8b ff ff       	call   3ef8 <__strndup@plt>

The address displayed is relative to the .so, so isn’t the problem itself. It jumps to the PLT:

0000000000003ef8 <__strndup@plt>:
    3ef8:       ff 25 52 c2 20 00       jmp    *0x20c252(%rip)        # 210150 <__strndup@GLIBC_2.2.5>
    3efe:       68 36 00 00 00          push   $0x36
    3f03:       e9 80 fc ff ff          jmp    3b88 <.plt>

And the indirect jump goes to the GOT which should be filled with the adjusted address of 3efe. But maybe something is broken in the linker and it didn’t get adjusted?
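The address arithmetic in the two listings above can be checked by hand. The following is only a worked calculation, not code from grpc-java or Netty: the class and method names are hypothetical, and the constants are copied from the objdump output. On x86-64, a relative `call` and a RIP-relative `jmp` are both resolved against the address of the next instruction, so the result is instruction address + instruction length + signed 32-bit displacement.

```java
/** Worked example of x86-64 relative addressing, using values from the objdump output. */
final class PltMath {
    /** Result = address of the next instruction + signed 32-bit displacement. */
    static long relativeTarget(long insnAddr, int insnLen, int disp32) {
        return insnAddr + insnLen + disp32;
    }

    public static void main(String[] args) {
        // b395: e8 5e 8b ff ff  -> call, little-endian displacement 0xffff8b5e (negative)
        long pltEntry = relativeTarget(0xb395L, 5, 0xffff8b5e);
        // 3ef8: ff 25 52 c2 20 00 -> jmp *0x20c252(%rip), which reads its target from a GOT slot
        long gotSlot = relativeTarget(0x3ef8L, 6, 0x0020c252);
        System.out.printf("call target: 0x%x%n", pltEntry); // 0x3ef8
        System.out.printf("GOT slot:    0x%x%n", gotSlot);  // 0x210150
    }
}
```

Both results agree with the objdump annotations (`3ef8 <__strndup@plt>` and `# 210150`). The instruction right after the indirect jump sits at 0x3efe, which is also the `pc` in the crash report at the top of this issue. That would be consistent with the GOT slot still holding the unadjusted value, although it does not prove it.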

/ # apk add gcompat
/ # ldd libnetty_transport_native_epoll_x86_64-4.1.63.so 
	ldd (0x7fa4cfc2f000)
	librt.so.1 => ldd (0x7fa4cfc2f000)
	libdl.so.2 => ldd (0x7fa4cfc2f000)
	libc.so.6 => ldd (0x7fa4cfc2f000)
Error relocating libnetty_transport_native_epoll_x86_64-4.1.63.so: __strndup: symbol not found

Well, there we go… Older versions of epoll linked against strndup, not __strndup. This difference may have been caused by a glibc upgrade when compiling.
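To spot this condition automatically, for example in an image smoke test, the `ldd` output could be scanned for unresolved symbols. This is a minimal sketch that assumes the "Error relocating ...: symbol not found" line format shown above. The class and method names are hypothetical. As discussed further down, `ldd` may not reflect what happens when the gcompat loader is in use, so treat the result as a hint only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts unresolved symbol names from ldd output in the format shown in this issue. */
final class LddSymbols {
    // Matches e.g. "Error relocating libfoo.so: __strndup: symbol not found"
    private static final Pattern MISSING =
        Pattern.compile("^Error relocating (.+): (\\S+): symbol not found$");

    static List<String> missingSymbols(List<String> lddLines) {
        List<String> symbols = new ArrayList<>();
        for (String line : lddLines) {
            Matcher m = MISSING.matcher(line.trim());
            if (m.matches()) {
                symbols.add(m.group(2)); // group 2 is the symbol name
            }
        }
        return symbols;
    }

    public static void main(String[] args) {
        List<String> output = Arrays.asList(
            "\tlibc.so.6 => ldd (0x7fa4cfc2f000)",
            "Error relocating libnetty_transport_native_epoll_x86_64-4.1.63.so: __strndup: symbol not found");
        System.out.println(missingSymbols(output)); // [__strndup]
    }
}
```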

Or Dockerfile should include just:

RUN apk add gcompat

ENV LD_PRELOAD=/lib/lib/libgcompat.so.0

Thank you, @artemptushkin ! Your solution works. 😃 But there’s a typo in the path. It should be /lib/libgcompat.so.0 like you used for the previous command.

Relying on glibc-compat is not a safe or wise thing to do… and I can confirm that using -Dio.grpc.netty.shaded.io.netty.transport.noNative=true avoids the segfault. E.g.

# hello, I am an alpine 3.15
java -Dio.grpc.netty.shaded.io.netty.transport.noNative=true -jar my_grpc_app.jar

Adding the grpc-java package did not seem to help.
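If changing the JVM command line is inconvenient, the same flag could be set programmatically. This is a minimal sketch, not an official grpc-java recipe. The property name comes from the comment above. The class name is hypothetical, and the sketch assumes the property is read when the shaded Netty classes initialize, so it has to run before the first channel or server is created.

```java
/** Sketch: disable the native transport in grpc-netty-shaded from application code. */
final class DisableNativeTransport {
    // Property name as used in this thread; the shaded package prefix is part of the key.
    static final String NO_NATIVE = "io.grpc.netty.shaded.io.netty.transport.noNative";

    /** Sets the flag unless it was already supplied with -D. Returns the effective value. */
    static String apply() {
        if (System.getProperty(NO_NATIVE) == null) {
            System.setProperty(NO_NATIVE, "true");
        }
        return System.getProperty(NO_NATIVE);
    }

    public static void main(String[] args) {
        // Assumption: nothing has touched the shaded Netty classes yet at this point.
        System.out.println(NO_NATIVE + "=" + apply());
        // ... create the gRPC channel or server after this line ...
    }
}
```

An explicit `-D` on the command line still wins, so the native transport can be switched back on for testing without a rebuild.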

TL;DR: Try setting the LD_PRELOAD=/lib/libgcompat.so.0 environment variable on ~old~ Alpine versions, so gcompat is actually used. Hopefully that won’t break any of the musl binaries. ~Best approach is probably “upgrade Alpine to 3.13 or later” though.~ (Edit: New versions exhibit similar behavior, because this flow avoids libc6-compat glue)

openjdk:8-alpine is based on Alpine 3.9.4, and using alpine:3.9.4 directly produces similar ldd results. Interestingly, alpine:3.15.0 (latest) has more unresolved symbols:

/ # ldd libnetty_transport_native_epoll_x86_64-4.1.63.so 
	/lib/ld-musl-x86_64.so.1 (0x7f7bfe2f5000)
	librt.so.1 => /lib/ld-musl-x86_64.so.1 (0x7f7bfe2f5000)
	libdl.so.2 => /lib/ld-musl-x86_64.so.1 (0x7f7bfe2f5000)
	libc.so.6 => /lib/ld-musl-x86_64.so.1 (0x7f7bfe2f5000)
Error relocating libnetty_transport_native_epoll_x86_64-4.1.63.so: __strdup: symbol not found
Error relocating libnetty_transport_native_epoll_x86_64-4.1.63.so: __strndup: symbol not found

But I think that isn’t the full story. Looking at the older Alpine:

/ # cat /etc/os-release 
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.9.4
...
/ # apk info -L gcompat libc6-compat
gcompat-0.3.0-r0 contains:
lib/ld-linux-x86-64.so.2
lib/libgcompat.so.0

libc6-compat-1.1.20-r6 contains:
lib/libm.so.6
lib/libutil.so.1
lib/libc.so.6
lib/libpthread.so.0
lib/libcrypt.so.1
lib/librt.so.1
lib64/ld-linux-x86-64.so.2
/ # ls -l lib64/ld-linux-x86-64.so.2
lrwxrwxrwx    1 root     root            26 Dec 20 18:30 lib64/ld-linux-x86-64.so.2 -> /lib/libc.musl-x86_64.so.1

lib64/ld-linux-x86-64.so.2 is used for x86_64, not lib/ld-linux-x86-64.so.2. So I suspect gcompat is useless on this older version. Instead, libc6-compat is providing the linker which just forwards to the musl linker, and is missing the symbol. So I think ldd on this older Alpine is accurate.
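The symlink check above can also be done from Java, for instance as a startup diagnostic. This is a rough sketch under assumptions: the class name is hypothetical, and looking for "musl" in the link target is only a heuristic based on the listing above, not a general rule.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: report where the glibc-style loader path points on this system. */
final class LoaderLink {
    static final String LOADER = "/lib64/ld-linux-x86-64.so.2";

    /** Heuristic: true when the link target names a musl file, as in the 3.9.4 listing. */
    static boolean pointsToMusl(String linkTarget) {
        return linkTarget.contains("musl");
    }

    public static void main(String[] args) throws IOException {
        Path loader = Paths.get(LOADER);
        if (!Files.isSymbolicLink(loader)) {
            System.out.println(LOADER + " is not a symlink (or does not exist)");
            return;
        }
        String target = Files.readSymbolicLink(loader).toString();
        System.out.println(LOADER + " -> " + target
            + (pointsToMusl(target) ? " (forwards to musl)" : " (not a direct musl link)"));
    }
}
```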

/ # cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.15.0
...
/ # apk info -L gcompat libc6-compat
gcompat-1.0.0-r4 contains:
lib/ld-linux-x86-64.so.2
lib/libgcompat.so.0
lib64/ld-linux-x86-64.so.2

libc6-compat-1.2.2-r7 contains:
lib/libc.so.6
lib/libcrypt.so.1
lib/libm.so.6
lib/libpthread.so.0
lib/librt.so.1
lib/libutil.so.1

/ # ls -l lib64/ld-linux-x86-64.so.2
lrwxrwxrwx    1 root     root            27 Dec 20 18:57 lib64/ld-linux-x86-64.so.2 -> ../lib/ld-linux-x86-64.so.2

Here though, gcompat is providing the linker which loads lib/libgcompat.so.0. That means I don’t think the ldd output is accurate. I see that gcompat 0.3.0 (Alpine 3.9) and 1.0.0 (Alpine 3.15) have __strndup, so I think the wrong linker on older Alpine versions is the trouble. Trying the gcompat linker approach “manually” on the old Alpine seems to work:

/ # cat /etc/os-release 
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.9.4
...
/ # LD_PRELOAD=/lib/libgcompat.so.0 ldd libnetty_transport_native_epoll_x86_64-4.1.63.so 
	ldd (0x7f72fc25e000)
	/lib/libgcompat.so.0 => /lib/libgcompat.so.0 (0x7f72fc03c000)
	librt.so.1 => ldd (0x7f72fc25e000)
	libdl.so.2 => ldd (0x7f72fc25e000)
	libc.so.6 => ldd (0x7f72fc25e000)

Given what I saw with grpc/grpc#27995, I expect that if the binary you execute is musl-based then gcompat wouldn’t be used automatically. That’s just a deficiency of how the gcompat linker works. libc6-compat could have provided symbols, but not with its symlink-to-musl approach. So I guess LD_PRELOAD is with us long-term.
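Since the workaround depends on an environment variable that is easy to forget, an application could warn at startup when it is missing. This is a minimal sketch, assuming the `/etc/os-release` format and the library name `libgcompat.so.0` shown earlier in this thread. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

/** Sketch: warn when running on Alpine without gcompat in LD_PRELOAD. */
final class AlpinePreflight {
    /** True if the os-release lines contain ID=alpine (quoted or unquoted). */
    static boolean isAlpine(List<String> osReleaseLines) {
        for (String line : osReleaseLines) {
            String v = line.trim();
            if (v.equals("ID=alpine") || v.equals("ID=\"alpine\"")) {
                return true;
            }
        }
        return false;
    }

    /** True if the LD_PRELOAD value names a gcompat library. */
    static boolean preloadsGcompat(String ldPreload) {
        return ldPreload != null && ldPreload.contains("libgcompat.so");
    }

    public static void main(String[] args) throws IOException {
        Path osRelease = Paths.get("/etc/os-release");
        List<String> lines = Files.isReadable(osRelease)
            ? Files.readAllLines(osRelease)
            : Collections.<String>emptyList();
        if (isAlpine(lines) && !preloadsGcompat(System.getenv("LD_PRELOAD"))) {
            System.err.println("Warning: Alpine detected but LD_PRELOAD does not name libgcompat");
        }
    }
}
```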

@alexfeigin we hope for the one after 3.15

Could this be related to https://github.com/netty/netty/issues/11701? A project I’m working on has been experiencing similar issues, which appear to indicate that the netty build pulled in by grpc-netty-shaded breaks on Alpine Docker base images.

Not likely. We have this issue on an Alpine image with glibc installed. Also, the image works with grpc-java (,1.42.0) and doesn’t work with [1.42.0,). So the problem doesn’t look like simply a missing glibc that netty fails to handle gracefully.