testcontainers-java: Execution gets stuck on ResourceReaper.start on linux host
The jvm gets stuck and has to be manually terminated. This happens intermittently and is hard to reproduce. This issue only happens after upgrading to 1.15.0 (was also seeing issues on 1.15.0-rc2`, 1.14.3 was working fine. Would appreciate any help with figuring out why this happens.
Environment information
$ docker version
Client: Docker Engine - Community
Version: 19.03.12
API version: 1.40
Go version: go1.13.10
Git commit: 48a66213fe
Built: Tue Jul 28 02:14:52 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.12
API version: 1.40 (minimum version 1.12)
Go version: go1.13.10
Git commit: 48a66213fe
Built: Tue Jul 28 02:14:13 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.3.6
GitCommit: be75852b8d7849474a20192f9ed1bf34fdd454f1
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
$ uname -r
4.14.158
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
Here is the jvm thread trace
Test has timed out! Here is a trace of what it is currently doing:
2020-11-23 13:41:46
Full thread dump OpenJDK 64-Bit Server VM (25.265-b01 mixed mode):
"Attach Listener" #1216 daemon prio=9 os_prio=0 tid=0x00007fbde8001000 nid=0x563 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"testcontainers-ryuk" #1215 daemon prio=5 os_prio=0 tid=0x00007fbd80488000 nid=0xf592 in Object.wait() [0x00007fb8157d6000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.testcontainers.utility.ResourceReaper.lambda$null$1(ResourceReaper.java:142)
- locked <0x000000073a8214c8> (a java.util.ArrayList)
at org.testcontainers.utility.ResourceReaper$$Lambda$589/432240114.run(Unknown Source)
at org.rnorth.ducttape.ratelimits.RateLimiter.doWhenReady(RateLimiter.java:27)
at org.testcontainers.utility.ResourceReaper.lambda$start$2(ResourceReaper.java:133)
at org.testcontainers.utility.ResourceReaper$$Lambda$588/1491766901.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- None
"docker-java-stream-2071761001" #1214 daemon prio=5 os_prio=0 tid=0x00007fbd80377800 nid=0xf568 runnable [0x00007fb8153d2000]
java.lang.Thread.State: RUNNABLE
at org.testcontainers.shaded.okio.Buffer.getByte(Buffer.java:312)
at org.testcontainers.shaded.okio.RealBufferedSource.readHexadecimalUnsignedLong(RealBufferedSource.java:310)
at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.readChunkSize(Http1ExchangeCodec.java:492)
at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.java:471)
at org.testcontainers.shaded.okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.java:286)
at org.testcontainers.shaded.okio.RealBufferedSource$1.read(RealBufferedSource.java:436)
at org.testcontainers.shaded.com.github.dockerjava.core.FramedInputStreamConsumer.accept(FramedInputStreamConsumer.java:30)
at org.testcontainers.shaded.com.github.dockerjava.core.FramedInputStreamConsumer.accept(FramedInputStreamConsumer.java:12)
at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.lambda$executeAndStream$1(DefaultInvocationBuilder.java:275)
at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder$$Lambda$578/715686037.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- None
"OkHttp ConnectionPool" #1206 daemon prio=5 os_prio=0 tid=0x00007fb76003c000 nid=0xf32b in Object.wait() [0x00007fb8155d4000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:460)
at org.testcontainers.shaded.okhttp3.internal.connection.RealConnectionPool.lambda$new$0(RealConnectionPool.java:62)
- locked <0x000000070f5aed10> (a org.testcontainers.shaded.okhttp3.internal.connection.RealConnectionPool)
at org.testcontainers.shaded.okhttp3.internal.connection.RealConnectionPool$$Lambda$550/880380434.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x000000073df35e20> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"spatialIndexComponent-258" #1051 daemon prio=5 os_prio=0 tid=0x00007fbd80139000 nid=0xf0cc waiting on condition [0x00007fb815ad9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000710880420> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- None
"spatialIndexComponent-257"
.
.
.
"spatialIndexComponent-0"
"DelegateRunnerWithTimeout-1" #28 prio=5 os_prio=0 tid=0x00007fbfdd9dd000 nid=0xca86 runnable [0x00007fbda08af000]
java.lang.Thread.State: RUNNABLE
at org.testcontainers.shaded.okio.Buffer.indexOf(Buffer.java:1463)
at org.testcontainers.shaded.okio.RealBufferedSource.indexOf(RealBufferedSource.java:352)
at org.testcontainers.shaded.okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:230)
at org.testcontainers.shaded.okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:224)
at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.readChunkSize(Http1ExchangeCodec.java:489)
at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.java:471)
at org.testcontainers.shaded.okhttp3.internal.Util.skipAll(Util.java:204)
at org.testcontainers.shaded.okhttp3.internal.Util.discard(Util.java:186)
at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.close(Http1ExchangeCodec.java:511)
at org.testcontainers.shaded.okio.ForwardingSource.close(ForwardingSource.java:43)
at org.testcontainers.shaded.okhttp3.internal.connection.Exchange$ResponseBodySource.close(Exchange.java:313)
at org.testcontainers.shaded.okio.RealBufferedSource.close(RealBufferedSource.java:476)
at org.testcontainers.shaded.okhttp3.internal.Util.closeQuietly(Util.java:139)
at org.testcontainers.shaded.okhttp3.ResponseBody.close(ResponseBody.java:192)
at org.testcontainers.shaded.okhttp3.Response.close(Response.java:290)
at org.testcontainers.shaded.com.github.dockerjava.okhttp.OkDockerHttpClient$OkResponse.close(OkDockerHttpClient.java:280)
at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.lambda$null$0(DefaultInvocationBuilder.java:272)
at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder$$Lambda$579/1129446976.close(Unknown Source)
at com.github.dockerjava.api.async.ResultCallbackTemplate.close(ResultCallbackTemplate.java:77)
at org.testcontainers.utility.ResourceReaper.start(ResourceReaper.java:177)
at org.testcontainers.DockerClientFactory.client(DockerClientFactory.java:203)
- locked <0x0000000710684178> (a [Ljava.lang.Object;)
at org.testcontainers.LazyDockerClient.getDockerClient(LazyDockerClient.java:14)
at org.testcontainers.LazyDockerClient.createContainerCmd(LazyDockerClient.java:12)
at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:355)
at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:322)
at org.testcontainers.containers.GenericContainer$$Lambda$534/2098911188.call(Unknown Source)
at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:320)
at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:308)
at com.uber.fievel.testing.mysql.UberMariaDBContainer.start(UberMariaDBContainer.java:56)
at org.testcontainers.containers.GenericContainer.starting(GenericContainer.java:1021)
at org.testcontainers.containers.FailureDetectingExternalResource$1.evaluate(FailureDetectingExternalResource.java:29)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 17
- Comments: 17 (3 by maintainers)
Commits related to this issue
- Upgrade testcontainers to work around a race testcontainers 1.15.? has a race condition that occassionally causes deadlocks. This can be worked around by upgrading to 1.15.2 and set transport type to... — committed to weiminyu/nomulus by weiminyu 3 years ago
- Upgrade testcontainers to work around a race (#1080) * Upgrade testcontainers to work around a race testcontainers 1.15.? has a race condition that occassionally causes deadlocks. This can be worked... — committed to google/nomulus by weiminyu 3 years ago
- [FLINK-22971][tests] Bump testcontainers to 1.16.0 This version uses http5 as the default transport, which is not affected by the race condition we run into. https://github.com/testcontainers/testco... — committed to zentol/flink by zentol 3 years ago
- [FLINK-22971][tests] Bump testcontainers to 1.16.0 This version uses http5 as the default transport, which is not affected by the race condition we run into. https://github.com/testcontainers/testco... — committed to zentol/flink by zentol 3 years ago
- [FLINK-22971][tests] Bump testcontainers to 1.16.0 This version uses http5 as the default transport, which is not affected by the race condition we run into. https://github.com/testcontainers/testco... — committed to apache/flink by zentol 3 years ago
- [FLINK-22971][tests] Bump testcontainers to 1.16.0 This version uses http5 as the default transport, which is not affected by the race condition we run into. https://github.com/testcontainers/testco... — committed to apache/flink by zentol 3 years ago
- [FLINK-22971][tests] Bump testcontainers to 1.16.0 This version uses http5 as the default transport, which is not affected by the race condition we run into. https://github.com/testcontainers/testco... — committed to niklassemmler/flink by zentol 3 years ago
This is a race condition in the
docker-javalibrary between two threads reading the sameokio.RealBufferedSourceobtained by aokhttp3.Response. The two threads are thread#1214(docker-java-stream-2071761001) and Thread#28(DelegateRunnerWithTimeout-1) in the thread dump of the issue description. Both threads reading simultaneously from the buffered source, which is not thread-safe, screws up the buffer’s internal bookkeeping leading to an infinite loop on both of them.The issue is triggered in the
ResourceReaperwith thelogCallbackinitialized here: https://github.com/testcontainers/testcontainers-java/blob/f588082636dc2749f4129e943b02ede62527b41c/core/src/main/java/org/testcontainers/utility/ResourceReaper.java#L114-L124The
.execinvocation internally spawns a thread nameddocker-java-stream-...in thedocker-javalibrary that starts streaming the response of the log container command: https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-core/src/main/java/com/github/dockerjava/core/DefaultInvocationBuilder.java#L262-L284The
callbackparameter contains the same object aslogCallbackin theResourceReaperabove. TheDockerHttpClient.Responseretrieved on line 269 is handed to thesourceConsumer. ThesourceConsumeris aFramedInputStreamConsumer(https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-core/src/main/java/com/github/dockerjava/core/FramedInputStreamConsumer.java) that was initialized with the same callback. The consumer starts reading the response stream and invokesonNext(ResourceReaperline 121 above) with each new frame read from the stream.Additionally, the
DockerHttpClient.Responseretrieved on line 269 is also used in the lambda that is passed into thecallbackvia theonStartmethod. This lambda will be called when theclosemethod of the callback is invoked (it instantiates theCloseablefunctional interface):https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-api/src/main/java/com/github/dockerjava/api/async/ResultCallbackTemplate.java#L36-L40 https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-api/src/main/java/com/github/dockerjava/api/async/ResultCallbackTemplate.java#L72-83
Now to close the circle, while the response is being streamed by the streaming thread, the callback is eventually closed in the
ResourceReaperby the thread that issued the log container command: https://github.com/testcontainers/testcontainers-java/blob/f588082636dc2749f4129e943b02ede62527b41c/core/src/main/java/org/testcontainers/utility/ResourceReaper.java#L205As seen above, this closes the
DockerHttpClient.Responsein the lambda which internally reads the buffered source (see stack trace of thread#28). Together with the streaming thread, this makes two threads simultaneously reading from the buffered source.The issue was most likely introduced by this PR: https://github.com/docker-java/docker-java/pull/1421.
I’m not entirely sure how to fix it though. I guess ideally, the streaming thread should close the response itself and the main thread should only cancel the request, such that the response stream is finished and the streaming thread stops reading.
(embedding code snippets from another repository apparently doesn’t work 😢)
I am using
1.15.1and also running into this issue. The behavior is basically the same to what @raviagarwal7 describes.For now, the workaround for us would be to downgrade to the
1.14.3, although that brings its own challenges.Can’t wait to see a fix coming out.
EDIT: Details of our setup: Running a Jenkins pipeline on a repo containing 70+ modules, many of them using testcontainers. Typical Jenkins’ node setup: CPUs: 4, RAM: 8192MB, Disk: 40GB, SWAP: 0MB, Ephemeral: 0GB. An alternative way to reproduce is to run a single module (the one we tested uses Postgresql test container) many times in a row. With 50 iterations, the incidence of the error raises above 80 %, I would guess.
@jsmrcka could you please try the httpclient5 transport? You can enable it by setting
transport.type=httpclient5in your config (see the available locations here: https://www.testcontainers.org/features/configuration/ )Updated to 1.15.2. Unfortunately, the issue still happens. Same setup as in my previous comment.