moby: Logging long lines breaks container

Description

In a container which has a TTY allocated (docker run -t), if a process whose stdout/stderr is being collected by docker-containerd-shim sends a line several thousand characters long (observed to be anywhere in the range of about 5k-9k), docker-containerd-shim stops reading from the PTY master. This causes processes writing to the terminal to hang in the write(2) call, which quickly renders the container unserviceable.

Steps to reproduce the issue:

Here’s how I can get a reliable reproduction:

  1. Install docker-ce version 17.11.0~ce-0~ubuntu on an Ubuntu 16.04 x64 system.

  2. Start a container like this:

     docker run -t busybox /bin/sh -c 's="a"; while true; do echo $s | wc -c; echo $s; s="a$s"; done'
    
  3. The screen will quickly start scrolling successively longer lines of "a" characters; however, within a few seconds the terminal will stop scrolling, having printed a number in the mid-to-high thousands followed by a long line of "a"s. While the number (indicating the length of the string printed immediately afterwards) will vary on each run, I’ve never seen it reach 10k.

Describe the results you received:

The container’s logs stop displaying lines.

Describe the results you expected:

The container’s logs continue to display lines until the container is terminated.

Additional information you deem important (e.g. issue happens only occasionally):

The exact line length that the demo program stops at varies between runs by a considerable margin (5k to 9k in my observations), which suggests this isn’t a buffer size problem, but is instead a race condition somewhere.

The problem only occurs when a TTY is allocated, suggesting this is not a general problem with log I/O. If you re-run the same docker run command but without the -t option (which, I’ve noticed, feeds the container output via a pipe, rather than a pty), you can let this run as long as your patience will last. I’ve let it go to line lengths of 42k before getting bored and hitting ^C.

Downgrading to 17.10.0~ce-0~ubuntu (commit ID f4ffd25) and re-running the problematic command (with -t) also does not exhibit the unhelpful behaviour, suggesting this problem is a recent addition to the pantheon, and (if it is a problem in, say, the kernel) it’s being triggered by recent changes in Docker.

Output of docker version:

Client:
 Version:      17.11.0-ce
 API version:  1.34
 Go version:   go1.8.3
 Git commit:   1caf76c
 Built:        Mon Nov 20 18:36:37 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.11.0-ce
 API version:  1.34 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   1caf76c
 Built:        Mon Nov 20 18:35:10 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 6
 Running: 4
 Paused: 0
 Stopped: 2
Images: 1
Server Version: 17.11.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 11
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 992280e8e265f491f7a624ab82f3e238be086e49
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
 apparmor
Kernel Version: 3.13.0-135-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.955GiB
Name: docker-test
ID: MF2K:XQZL:SO7O:ZU6O:JIAR:AJZC:3EE2:YILN:PKN2:IXST:7KMB:J7K5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

WARNING: No swap limit support

Most upvoted comments

I’ll write down a summary (also in case others are interested). “Docker CE” consists of several components (currently the “cli”, the “engine”, and “packaging scripts”). Those components are assembled in the “docker-ce” repository (https://github.com/docker/docker-ce), which is used to build the packages; you’ll find branches in there for each release, as well as tags.

Collecting the code in a single repository has the advantage that everything used to build a release is tied together in one place, but the disadvantage is that commits no longer match upstream commits (the code is moved into subdirectories, so commits are rewritten), and that it’s not possible to “vendor” Go code from that repository (a limitation of how “go” expects packages to live at a fixed import path).
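As a hedged illustration of that last point (the snippet below is mine, not from this issue; it just consumes the engine's client library by its canonical path): any Go program that vendors the engine has to import github.com/docker/docker/..., which is exactly the path the docker-ce layout rewrites into components/engine/..., so vendoring from docker-ce doesn't work.

package main

import (
	"context"
	"fmt"

	// Canonical import path; the docker-ce repository relocates this code
	// under components/engine/..., so it cannot be vendored from there.
	"github.com/docker/docker/client"
)

func main() {
	// NewEnvClient was the usual constructor in the 17.x/18.x client library.
	cli, err := client.NewEnvClient()
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	info, err := cli.Info(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println("server version:", info.ServerVersion)
}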

To resolve that situation, starting with Docker 18.06, the following repositories are used:

(Note that release branches may be removed after a release reaches EOL, so when vendoring code it’s recommended to use a tag.)

Components are still collected/combined in the Docker CE repository (https://github.com/docker/docker-ce), but that should mainly be an implementation detail, and could change at some point (shipping separate packages for the “engine” (daemon) and the “cli” is still on the table).

Some information can also be found here: https://github.com/docker/docker.github.io/blob/master/install/index.md#releases

After further reports of problems from our customers running Docker 18.01, I managed to get a repro with the above steps on

Client:
 Version:       18.01.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    03596f5
 Built: Wed Jan 10 20:11:05 2018
 OS/Arch:       linux/amd64
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      18.01.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   03596f5
  Built:        Wed Jan 10 20:09:37 2018
  OS/Arch:      linux/amd64
  Experimental: false

However it took a lot longer this time (got up to 78619) before bombing out. So the bug is still there, it’s just sneakier now.

I’m deploying it this weekend

On Thu, Aug 2, 2018 at 10:22 AM Sebastiaan van Stijn <notifications@github.com> wrote:

@erikh https://github.com/erikh @mpalmer https://github.com/mpalmer have you been able to verify that 18.06 fixes the issue for you?


No, this does not help much, because the problem itself is in containerd-shim, not in docker, and not in dockerd either.

The original reporter said they never saw the repro get over 10k, and I easily got it to 26k before manually stopping it, with no blocking or issues.

cough

However it took a lot longer this time (got up to 78619) before bombing out. So the bug is still there, it’s just sneakier now.

So, FWIW, we just re-implemented kr/pty for some local processes and it seems the library is the issue, not Docker. We’re hitting long-line limits with non-Docker operations.

I’ll see what our team can dig up; maybe we can find something there, or talk to Keith or something.
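For anyone who wants to poke at this outside Docker, here is a minimal sketch (my own approximation, assuming github.com/kr/pty; it is not code from the shim) of the failure mode: drain the PTY master for a while, then stop reading, and the child eventually wedges in write(2) once the kernel’s PTY buffer fills.

package main

import (
	"fmt"
	"os/exec"
	"time"

	"github.com/kr/pty"
)

func main() {
	// The child writes ever-longer lines to its controlling terminal,
	// like the busybox loop in the original repro.
	cmd := exec.Command("/bin/sh", "-c", `s=a; while true; do echo "$s"; s="a$s"; done`)

	// We hold the master side of the PTY; the child gets the slave as its TTY.
	master, err := pty.Start(cmd)
	if err != nil {
		panic(err)
	}
	defer master.Close()

	// Drain the master for a couple of seconds, mimicking a healthy copy loop.
	buf := make([]byte, 4096)
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if _, err := master.Read(buf); err != nil {
			return
		}
	}

	// Now stop reading. Once the kernel's PTY buffer fills (a few KiB),
	// the child's write(2) blocks -- the same hang the container shows
	// when docker-containerd-shim stops reading from the master.
	fmt.Println("stopped reading from the master; the child should now block in write(2)")
	time.Sleep(time.Minute)
}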

On Tue, May 22, 2018 at 8:01 AM, Brian Goff notifications@github.com wrote:

Definitely an issue with how the pty is set up. I’ve found that this can be reproduced by just sending lots of data to stdout/stderr; line length doesn’t matter.

I’m not familiar (really at all) with how these things are allocated, so it’s been a bit of a slog to track down. As far as I can tell, everything in containerd/docker is working correctly. The copy routines continue to run correctly (side note: it would be nice if we could use something other than userspace copying here); something is just getting stuck.


Definitely an issue with how the pty is set up. I’ve found that this can be reproduced by just sending lots of data to stdout/stderr; line length doesn’t matter.

I’m not familiar (really at all) with how these things are allocated, so it’s been a bit of a slog to track down. As far as I can tell, everything in containerd/docker is working correctly. The copy routines continue to run correctly (side note: it would be nice if we could use something other than userspace copying here); something is just getting stuck.
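Since line length doesn’t matter, a hedged variant of the repro (my own, not from this thread) is simply to flood the TTY with bulk writes; the little program below can be built statically, bind-mounted into a container started with -t, and left running until the writes stop returning.

package main

import "os"

func main() {
	// 64 KiB of 'a' followed by a newline, written in a tight loop.
	chunk := make([]byte, 64*1024+1)
	for i := range chunk {
		chunk[i] = 'a'
	}
	chunk[len(chunk)-1] = '\n'

	for {
		if _, err := os.Stdout.Write(chunk); err != nil {
			// If the write hangs rather than returning an error, we've hit
			// the stalled-reader condition described in this issue.
			return
		}
	}
}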