moby: Random task failures with "non-zero exit (137)" since 18.09.1

Description

Since the upgrade from 18.09.0 to 18.09.1 (and likewise with 18.09.2) we have been experiencing random task failures.

The tasks die with non-zero exit (137) error messages, which indicates the process received a SIGKILL. A common cause is an OOM kill, but that is not the case for our containers: we monitor memory usage, and the affected containers are well below all limits (per container, and the host also has plenty of free memory). There is also no kernel OOM stack trace in the logs, and docker inspect on the dead containers shows "OOMKilled": false. When we deliberately provoke an OOM kill, we do get the expected stack trace and the OOMKilled flag is set to true.
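As a quick way to verify this on an affected container, the OOMKilled flag and exit code can be read directly from docker inspect; the container name below is just a placeholder for one of the dead containers:

# Print the OOM flag and exit code recorded for the dead container.
# "my-service.1.abc123" is a placeholder; use the failed container's name or ID.
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' my-service.1.abc123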

The containers are not supposed to shut down, and the health checks are not the culprit either. We experience this with practically all our containers, which differ widely; for example it also happens with the official nginx image serving only static files, so we don't believe our images are at fault, especially since with the very same images we don't see this issue on 18.09.0.

So the question is who is killing our containers?

We managed to capture the culprit sending the SIGKILL with auditd. Here is the relevant ausearch output:

type=PROCTITLE msg=audit(02/21/19 00:22:56.549:883) : proctitle=runc --root /var/run/docker/runtime-runc/moby --log-format json delete --force ad6e6f45bec2a3201313b02af4f038830aaad7ab052548608
type=OBJ_PID msg=audit(02/21/19 00:22:56.549:883) : opid=2488 oauid=unset ouid=root oses=-1 ocomm=wrapper-linux-x
type=SYSCALL msg=audit(02/21/19 00:22:56.549:883) : arch=x86_64 syscall=kill success=yes exit=0 a0=0x9b8 a1=SIGKILL a2=0x0 a3=0x0 items=0 ppid=795 pid=27127 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=runc exe=/usr/sbin/runc key=kill_signals
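For reference when reading the SYSCALL record: a0 is the kill() target PID in hex, and 0x9b8 is 2488 decimal, which matches opid=2488 in the OBJ_PID record above (the process being killed, comm wrapper-linux-x). For example:

# Convert the hex PID from the audit record to decimal.
printf '%d\n' 0x9b8
2488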

So the SIGKILL comes from runc delete --force, i.e. the container is being torn down by the Docker/containerd stack itself rather than by some external process. But why is it deleting a container that should still be running?

Steps to reproduce the issue:

  1. Set up a Docker swarm on 18.09.2 with a lot of running containers and wait a few days.
  2. Experience random task failures.

Describe the results you received: Tasks randomly failing:

dockerd[794]: time="2019-02-21T00:22:56.945715049Z" level=error msg="fatal task error" error="task: non-zero exit (137)" module=node/agent/taskmanager node.id=3whne2nbw3atm30sxp2w9elhx service.id=jjt0smq927b91b4s68lp9lquj task.id=p0kmz6uqsp7izlfkrr25534sc

Describe the results you expected: Tasks NOT randomly failing.

We have downgraded our production environment to 18.09.0 and have not experienced the failure in the weeks since, with everything else unchanged (images, configuration, kernel, etc.). So it is definitely a problem introduced in 18.09.1.

Additional information you deem important (e.g. issue happens only occasionally): We have a Docker swarm running with around 25 nodes, 35 services and 100 containers.

The issue happens around 2-5 times a day, with completely different containers and on all of the nodes. Since the upgrade there has not been a single day with fewer than 2 kills, and on one day it happened 6 times.

Every container seems equally likely to be affected.

Output of docker version:

Client:
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.6
 Git commit:        6247962
 Built:             Sun Feb 10 04:13:47 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 03:42:13 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 30
 Running: 17
 Paused: 0
 Stopped: 13
Images: 29
Server Version: 18.09.2
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: efs local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: 3whne2nbw3atm30sxp2w9elhx
 Is Manager: false
 Node Address: 10.1.148.157
 Manager Addresses:
  10.1.157.55:2377
  10.1.17.3:2377
  10.1.90.128:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 09c8266bf2fcf9519a651b04ae54c967b9ab86ec
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-1032-aws
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.691GiB
Name: frontend0.dev.ecosio.internal
ID: LOBR:KA6Q:CHBM:OB3V:Q4PS:MFI7:ARBE:XQRO:35PO:UFED:SLHZ:UTTE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
 10.0.0.0/10
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.): On AWS/EC2.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 14
  • Comments: 28 (10 by maintainers)

Most upvoted comments

I believe we used the following audit.rules:

-a exit,always -F arch=b64 -S kill -F a1=9 -k kill_signals
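In case it helps others, here is a minimal sketch of how a rule like this can be loaded and queried with the standard auditd tooling (the exact invocations below are mine, not from the original comment):

# Load the rule at runtime (or persist it under /etc/audit/rules.d/ and restart auditd):
# audit every kill() syscall from 64-bit processes whose signal argument (a1) is 9, i.e. SIGKILL.
auditctl -a exit,always -F arch=b64 -S kill -F a1=9 -k kill_signals

# Later, list the recorded events with fields interpreted into human-readable form:
ausearch -k kill_signals -i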

We have not experienced the issue anymore with 18.09.3 - 18.09.6. We only had it with 18.09.1 and 18.09.2.

We’re seeing an identical issue with 18.09.2 (6247962).

There’s also a ticket on the Docker forums which seems to reference the same bug: https://forums.docker.com/t/container-fails-with-error-137-but-no-oom-flag-set-and-theres-plenty-of-ram/69336/5

Perfect! Thanks for the update, @straurob 👍

@straurob docker 19.03.2 and containerd 1.2.6 are quite a few patch releases behind the latest; if you have a test setup where you can reproduce this, are you still seeing it on the latest patch releases of both? (Not sure whether anything in this area changed, but it would be useful to know whether it's still an issue or has since been fixed.)

Is this only with 18.09.2, or also with 18.09.1 (more accurately, with the versions of containerd and runc shipped with 18.09.2)? There was a fix in runc for a CVE that causes more memory to be used when starting a container.