moby: Upgrade to 20.10 breaks log tailing: unexpected EOF errors
Description
Our E2E test suite uses testcontainers-java to run our microservices and their dependencies in containers for the duration of our tests. All of the tests were passing with 19.03.14 until we updated our build nodes to Docker 20.10.1, at which point we started repeatedly getting STDOUT: Error grabbing logs: unexpected EOF errors, and our Wait.forLogMessage calls started timing out.
Note, however, that the issue seems sporadic: about 1 out of 10 builds still passes. After we reverted to Docker 19.03.14 the issue went away. We’re using the latest version of Testcontainers (1.15.1 at the time of reporting the issue).
After contacting the developers I was told this is most likely a Docker issue.
Steps to reproduce the issue: Unfortunately our projects are not OSS, so I cannot give you the exact containers used; I’ll try to describe a matching scenario instead.
- Set up a project that uses the latest version of testcontainers-java (1.15.1).
- Set up a Docker image that produces a lot of standard output.
- Start the image using testcontainers-java and wait for a log message using Wait.forLogMessage() (a Go sketch of the equivalent wait-for-log-line flow follows this list, for readers unfamiliar with Testcontainers).
- In a huge percentage of runs this results in the error message STDOUT: Error grabbing logs: unexpected EOF, and the wait policy will not trigger even though the messages are present in the log.
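The pattern above boils down to "follow the container's log stream and block until a line matches a regex". For readers who don't use the Java library, here is a rough Go sketch of that flow against the Engine API; the container name, the regex, and the use of the Docker Go SDK are illustrative stand-ins for the reporter's (Java) setup, not their actual code.

// Hedged sketch: a Go-SDK analogue of Testcontainers' Wait.forLogMessage.
// "chatty-app" and the regex are placeholders, not values from this issue.
package main

import (
	"bufio"
	"context"
	"fmt"
	"regexp"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func waitForLogLine(ctx context.Context, containerID string, pattern *regexp.Regexp) error {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return err
	}
	// Follow the container's output, like `docker logs -f`. Assumes a TTY
	// container; without one the stream is multiplexed and would need stdcopy.
	rc, err := cli.ContainerLogs(ctx, containerID, types.ContainerLogsOptions{
		ShowStdout: true,
		ShowStderr: true,
		Follow:     true,
	})
	if err != nil {
		return err
	}
	defer rc.Close()

	scanner := bufio.NewScanner(rc)
	for scanner.Scan() {
		if pattern.MatchString(scanner.Text()) {
			return nil // the "wait strategy" fires
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	// On affected 20.10 daemons the stream can end early with
	// "Error grabbing logs: unexpected EOF"; that is where the timeout comes from.
	return fmt.Errorf("log stream ended before %q matched", pattern)
}

func main() {
	err := waitForLogLine(context.Background(), "chatty-app",
		regexp.MustCompile(`Started Application`))
	fmt.Println("wait result:", err)
}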
Describe the results you received:
The error message STDOUT: Error grabbing logs: unexpected EOF, and the wait policy for the log message not triggering even though the message was present in the log.
Describe the results you expected: No EOF error, and the wait policy for the log message working.
Additional information you deem important (e.g. issue happens only occasionally): As mentioned in the description, the issue seems sporadic, but it happens in more than 80% of our CI runs.
Output of docker version:
Previous version where the tests were passing:
Client: Docker Engine - Community
Version: 19.03.14
API version: 1.40
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:20:26 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.14
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:18:53 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.9
GitCommit: ea765aba0d05254012b0b9e595e995c09186427f
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
Current version where the issue is present:
Client:
Version: 20.10.1
API version: 1.41
Go version: go1.15.6
Git commit: 831ebeae96
Built: Tue Dec 15 22:25:01 2020
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.1
API version: 1.41 (minimum version 1.12)
Go version: go1.15.6
Git commit: f0014860c1
Built: Tue Dec 15 22:24:28 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b.m
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Output of docker info:
Previous version where the tests were passing:
Client:
Debug Mode: false
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 92
Server Version: 19.03.14
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ea765aba0d05254012b0b9e595e995c09186427f
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-58-generic
Operating System: Ubuntu 20.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 3
Total Memory: 15.64GiB
Name: build1
ID: QWYR:EE5C:DWTE:MY65:MQ7Y:FERG:F7J2:IDTT:7LVR:VHSE:REFY:6AQH
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Current version where the issue is present:
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.1-tp-docker)
Server:
Containers: 134
Running: 8
Paused: 0
Stopped: 126
Images: 1390
Server Version: 20.10.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b.m
runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 5.9.14-arch1-1
Operating System: Arch Linux
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.24GiB
Name: RE02WL03
ID: YC3H:JCWA:APFN:VOQ7:CWPG:B4VO:5FYW:UPVL:2H5B:PO4H:SXO7:LSGH
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No blkio weight support
WARNING: No blkio weight_device support
The current version and info parts were copied from an Arch box, while the previous version and info parts were copied from an Ubuntu box, but we could also reproduce the issue on Ubuntu after upgrading to 20.10.
Additional environment details (AWS, VirtualBox, physical, etc.):
- Could reproduce using both Ubuntu 20.10 and Arch Linux.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 30
- Comments: 57 (21 by maintainers)
Commits related to this issue
- Pin docker version (https://github.com/moby/moby/issues/41820) — committed to whole-tale/terraform_deployment by craig-willis 3 years ago
https://github.com/moby/moby/pull/42104 should take care of this, but will need to be backported after it is merged.
I’ve upgraded the docker version on a test machine and still have the same warnings:
Okay, I have come up with an easy way to reproduce it:
In my case, I get the output:
I checked a few of my instances at random and they are running version 20.10.6, where the bug should be fixed, but I still see lots of warning messages when journalctl -afu docker.service is executed: level=warning msg="got error while decoding json" error="unexpected EOF" retries=X
Maybe there’s something else causing it too?
@cpuguy83 did you run `docker logs -f`? Do you use the json-file logging driver?
We’ve hit the same bug in https://github.com/trinodb/trino while running our product tests (Testcontainers-based). When Trino is starting, it generates a lot of logs, and during startup 20% of runs fail with EOF.
It’s pretty straightforward to reproduce it locally:
I can’t see it in the 20.10.6 milestone, which is where I think a lot of us would hope to see it.
Thanks @tobiasstadler for the steps to reproduce. I’ve also tried your change and the log output no longer exits with the error 👍. The daemon log is still filled with errors, though that could be resolved by moving the logrus error log to the bottom of the function.
I’ve found that the issue won’t appear until the logs are queried for the running container. To do this I needed to run multiple terminals.
The following will reproduce the error in development (docker-dev:dry-run-test): Terminal 1:
make run
Terminal 2 (nyancat):
docker exec -it $(docker ps -q --filter ancestor=docker-dev:dry-run-test) bash
docker run -it --rm --name nyancat 06kellyjac/nyancat:alpine
Terminal 3 (docker logs):
docker exec -it $(docker ps -q --filter ancestor=docker-dev:dry-run-test) bash
docker logs --tail 1 -f nyancat
Maybe … should be … in func (d *decoder) Decode() (msg *logger.Message, err error)?
@cpuguy83 I think it is related to the json-file logging driver. Running
docker logs -f foo
against a container started with
docker run --log-driver=json-file -it --name foo ubuntu sh -c "apt-get update && apt-get install -y nyancat && nyancat"
reproduces the error, while
docker run --log-driver=local -it --name foo ubuntu sh -c "apt-get update && apt-get install -y nyancat && nyancat"
doesn’t.
Also I see a lot of level=warning msg="got error while decoding json" error="unexpected EOF" retries=... in ~/Library/Containers/com.docker.docker/Data/log/vm/dockerd.log (on macOS). It seems like the daemon is not able to recover from a JSON parsing error.
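The "cannot recover from a JSON parsing error" observation matches how encoding/json behaves when a log follower races the writer: once a Decoder sees a half-written entry it returns io.ErrUnexpectedEOF and keeps returning it, so a follower has to seek back and build a fresh decoder instead of reusing the old one. Below is a minimal, self-contained sketch of that failure mode; it is an illustration, not the actual moby decoder, and the file name, payload, and entry struct are made up.

// Sketch: simulate the json-file race — a reader that follows the log file
// can observe a half-written entry and must reset its decoder to recover.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os"
)

// entry mirrors the shape of a json-file log line, for illustration only.
type entry struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
}

func main() {
	// Stand-in for /var/lib/docker/containers/<id>/<id>-json.log.
	f, err := os.CreateTemp("", "json-log-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	// The writer (the container's chatty stdout) has flushed only part of an
	// entry at the moment the follower reads the file.
	io.WriteString(f, `{"log":"You have nyaned for 1 seconds","str`)

	r, _ := os.Open(f.Name())
	dec := json.NewDecoder(r)

	var msg entry
	err = dec.Decode(&msg)
	fmt.Println(errors.Is(err, io.ErrUnexpectedEOF)) // true: the "unexpected EOF"

	// The writer finishes the entry a moment later...
	io.WriteString(f, `eam":"stdout"}`+"\n")

	// ...but the errored decoder keeps returning the same error on every call,
	// so a follower must seek back to the start of the partial entry and retry
	// with a fresh decoder.
	fmt.Println(dec.Decode(&msg)) // still "unexpected EOF"

	r.Seek(0, io.SeekStart)
	dec = json.NewDecoder(r)
	fmt.Println(dec.Decode(&msg), msg.Log) // <nil> You have nyaned for 1 seconds
}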
I am still encountering these errors while running the latest Docker version, 20.10.6. Can somebody else confirm?
How to reproduce (tested on Ubuntu 20.04.2 LTS and macOS):
Terminal 1:
docker run -it --name foo ubuntu sh -c "apt-get update && apt-get install -y nyancat && nyancat"
Terminal 2:
docker logs -f foo
(expect the same “You have nyaned for x seconds” screen as in terminal 1)
Terminal 3 (Ubuntu):
journalctl -u docker -f
Terminal 3 (macOS):
tail -f ~/Library/Containers/com.docker.docker/Data/log/vm/dockerd.log
Terminal 3 (other OS): see https://stackoverflow.com/a/30970134/9103163 for how to view the logs of the docker daemon process on your OS.
The error shows up immediately.
It actually is in the 20.10.6 milestone: https://github.com/moby/moby/milestone/93?closed=1 https://github.com/moby/moby/pull/42174
+1 to @tobiasstadler’s last comment, the issue is only present if you run docker logs -f on an already running container.
Thanks, I’ll give the nyan reproducer a try tomorrow.
Thumbs down because it’s not “just output a lot of logs”, otherwise it would have been reproduced.
The thread is full of one-liners that people used to reproduce the issue; I’m not sure I can provide anything better. As multiple people pointed out, just use a container that produces a lot of logs, then tail its logs.
This issue has prevented us from updating Docker since early December, as it makes all of our tests flaky. Can we please get some feedback from the developer team on whether this has been successfully reproduced and whether it is being looked into?
I wasn’t able to reproduce it using @pete-woods’s approach either, but it always seems to happen within a few seconds if I run … in one terminal window and then … in a separate terminal window.
For what it’s worth, it also happens if I tail the logs over Docker’s HTTP API.
Running Docker 20.10.2 on macOS on a MacBook Pro.
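Tailing "over Docker's HTTP API" means hitting the engine's GET /containers/{name}/logs endpoint with follow enabled, which is what docker logs -f does under the hood. A rough Go sketch against the local unix socket, assuming the nyancat container from the reproducers above is running and that API version 1.41 (Docker 20.10) is available:

// Hedged sketch: follow a container's logs directly over the Engine API.
// The container name "nyancat" is taken from the reproducers above.
package main

import (
	"context"
	"io"
	"net"
	"net/http"
	"os"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			// Talk HTTP over /var/run/docker.sock instead of TCP.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}

	// Equivalent of `docker logs --tail 1 -f nyancat`.
	resp, err := client.Get("http://localhost/v1.41/containers/nyancat/logs?follow=true&stdout=true&stderr=true&tail=1")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The container was started with -t, so the stream is raw; without a TTY it
	// would be multiplexed with 8-byte frame headers. When the bug triggers, the
	// body ends early with the "Error grabbing logs: unexpected EOF" message.
	io.Copy(os.Stdout, resp.Body)
}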
Indeed we do handle the case where there can be a race between reading/writing a json entry. It seems like we have messed up the byte stream somehow.
Working on getting more debugging in here (at least adding container ID to these error logs from the daemon).
Do you have log rotation or anything else enabled? I have not been able to reproduce this, despite trying a few different ways.
I can also confirm this is happening. It worked fine before the upgrade of the Docker engine from 19 to 20.
I have multiple containers (about 7 running), 3 of which produce generous output.
Usually, for one or two containers, the logging fails with the EOF error during integration testing.
The dockerd logs are filled with the following:
Best regards, Sorin