moby: Starting container fails with 'System error: read parent: connection reset by peer'

On our CI server, we run tests in Docker containers using docker-compose. We link 2-15 containers during one run. We ensure that test jobs running concurrently have different docker-compose project names.

Since we upgraded to Docker 1.7.0 (from 1.6) and docker-compose to 1.3.1 (from 1.2), and started killing containers instead of stopping them (it’s faster, and we remove them anyway), we have twice had containers fail to start with the following message from compose:

Creating x_db_1...
Creating x_1...
Cannot start container 10bbc5af8ec0d3bb39b207a6474ec70a0954bff01ff94389684a8b9f52df6067: [8] System error: read parent: connection reset by peer

/var/log/docker.log contains the following:

time="2015-06-25T10:29:44.322521665+02:00" level=info msg="POST /v1.18/containers/19aec1ddb8a5cd771771f16a1f8929bb58eea2cf7e877425a7812f6c6e5756a2/start" 
time="2015-06-25T10:29:44.690044235+02:00" level=warning msg="signal: killed" 
time="2015-06-25T10:29:44.915997839+02:00" level=error msg="Handler for POST /containers/{name:.*}/start returned error: Cannot start container 19aec1ddb8a5cd771771f16a1f8929bb58eea2cf7e877425a7812f6c6e5756a2: [8] System error: read parent: connection reset by peer" 
time="2015-06-25T10:29:44.916111471+02:00" level=error msg="HTTP Error" err="Cannot start container 19aec1ddb8a5cd771771f16a1f8929bb58eea2cf7e877425a7812f6c6e5756a2: [8] System error: read parent: connection reset by peer" statusCode=500 

The container is created, but doesn’t start. Trying to manually start it using docker start fails with the same error message. Memory is available and the kernel log doesn’t show any message from the OOM killer.

Restarting docker temporarily solves the problem, so I assume this is a problem with docker itself, not with docker-compose.

docker version:

Client version: 1.7.0
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 0baf609
OS/Arch (client): linux/amd64
Server version: 1.7.0
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 0baf609
OS/Arch (server): linux/amd64

docker info:

Containers: 30
Images: 451
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 517
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.16.0-0.bpo.4-amd64
Operating System: Debian GNU/Linux 7 (wheezy)
CPUs: 4
Total Memory: 15.52 GiB
Name: <hostname>
ID: HYYT:WNZW:UPU7:VI2O:HUTP:EZVV:2MQ2:WCRJ:3SHJ:LZXF:MVLS:P3XC
WARNING: No memory limit support
WARNING: No swap limit support

uname -a:

<hostname> 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1~bpo70+1 (2015-04-27) x86_64 GNU/Linux

Environment details (AWS, VirtualBox, physical, etc.): Physical machine

Steps to Reproduce:

  1. Run docker-compose run ..., then kill and rm an estimated 50-200 linked containers using docker-compose
  2. At some point, docker fails to start a container.

Actual Results: Starting the container fails.

Expected Results: Container starts.

Additional info:

Kubernetes seems to have had a similar problem, apparently some sort of race. The issue also links to a few occurrences where people hit a similar problem (mainly on IRC).

Most upvoted comments

tl;dr: Data is being left in a pipe; https://github.com/docker/docker/commit/be7ef6370a4bbe1c333182cecf1b1d783d193aba is a workaround.

I believe I’ve found the root cause of this issue. The bug is not in Docker, but is in the opencontainers/runc library that docker depends on. As @burke mentioned, we’ve been seeing this a lot more in production in 1.9.1, and having a reproducible test case made tracking this down a lot easier. In our case the problem persisted across restarts of the docker daemon and even across reboots.

Attaching to the docker daemon with strace, I was able to find the syscall that generated the ECONNRESET:

[pid 30291] execve("/sbin/borginit", ["/sbin/borginit", "resque-worker"], [/* 29 vars */] <unfinished ...>
[pid 30295] +++ exited with 0 +++
[pid 30294] +++ exited with 0 +++
[pid 30293] +++ exited with 0 +++
[pid 30292] +++ exited with 0 +++
[pid 30291] <... execve resumed> )      = 0
[pid 26317] <... read resumed> 0xc20f88b600, 512) = -1 ECONNRESET (Connection reset by peer)

Not surprisingly, it’s a read call. Looking at the other operations performed on that file descriptor (18) revealed a bit more information:

[pid 26317] write(18, "{\"args\":[\"/sbin/borginit\",\"resqu"..., 7681 <unfinished ...>
[pid 30294] <... futex resumed> )       = 0
[pid 26317] <... write resumed> )       = 7681
[pid 30294] futex(0xc208020b58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 30291] socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0 <unfinished ...>
[pid 26317] shutdown(18, SHUT_WR <unfinished ...>
[pid 30291] <... socket resumed> )      = 4
[pid 26317] <... shutdown resumed> )    = 0
[pid 30291] setsockopt(4, SOL_SOCKET, SO_BROADCAST, [1], 4 <unfinished ...>
[pid 26317] read(18,  <unfinished ...>

Also, what exactly was file descriptor 18? After a bit of searching I found that it was being returned from a call to socketpair:

[pid 26245] socketpair(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0, [17, 18]) = 0

At this point I had enough information from the strace log (what created the fd, how and what order it was used) to start going through the docker source to try to find corresponding source code. After some searching I found where the socketpair syscall originated, and where operations are performed on the pipe. The code that generates the error is:

  // p.parentPipe is the daemon's end of the socketpair created for the
  // container's init process; the decode waits here for an error report
  // from the child (io.EOF means the child closed without reporting one).
  if err := json.NewDecoder(p.parentPipe).Decode(&ierr); err != nil && err != io.EOF {
    return newSystemError(err)
  }

With this information, I now tried to piece together logically how we ended up in this bad state:

  1. We create a pipe
  2. Write data to that pipe
  3. Shutdown the pipe for further writes
  4. Wait for a read response

However, step 4 is what fails. How can read fail with ECONNRESET? According to a Stack Overflow answer, on a local pipe/socket, if the socket is closed on the other side before all written data has been read, a subsequent read returns ECONNRESET. Going with this hypothesis, I tried to find what was reading on the other side of the pipe. To figure that out, I searched the strace log for the same bytes that were written to the pipe earlier, which yields:

[pid 30291] read(3, "{\"args\":[\"/sbin/borginit\",\"resqu"..., 512) = 512
[pid 30292] <... select resumed> )      = 0 (Timeout)
[pid 30291] read(3,  <unfinished ...>
[pid 30292] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 30291] <... read resumed> "2.36:8125\",\"STATSD_IMPLEMENTATIO"..., 1024) = 1024
[pid 30291] read(3,  <unfinished ...>
[pid 30292] <... select resumed> )      = 0 (Timeout)
[pid 30291] <... read resumed> "mount_cmds\":null},{\"source\":\"tmp"..., 2048) = 2048
[pid 30292] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 30291] read(3, "ation_flags\":null,\"data\":\"\",\"rel"..., 4096) = 4096

What immediately struck me as suspicious was that all of the calls to read were returning a power of 2. More digging through the code turned up where this data was consumed. From what I can gather, when docker starts a new container it creates a pipe to which the daemon writes a JSON configuration blob, which is then read by the spawned container.
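
To make this concrete, here is a minimal sketch of that handoff (my own illustration, not libcontainer’s actual code): the daemon side encodes the config onto one end of a local socketpair, shuts down its write half, and waits for a reply; the child side decodes the config, drains it completely, and closes. In this happy path the daemon’s Decode simply returns io.EOF, which the check quoted earlier treats as success:

// Illustration only: a stand-in for the daemon/container config handoff.
package main

import (
  "encoding/json"
  "fmt"
  "os"
  "syscall"
)

func main() {
  // The init pipe: a local stream socketpair, as in the strace output above.
  fds, err := syscall.Socketpair(syscall.AF_LOCAL, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
  if err != nil {
    panic(err)
  }
  daemonEnd := os.NewFile(uintptr(fds[0]), "parent")
  childEnd := os.NewFile(uintptr(fds[1]), "child")

  // "Container" side: decode the config, then close the pipe. With a small
  // config the decoder's single read drains the socket completely, so the
  // close leaves nothing unread behind.
  done := make(chan map[string]interface{}, 1)
  go func() {
    var cfg map[string]interface{}
    if err := json.NewDecoder(childEnd).Decode(&cfg); err != nil {
      panic(err)
    }
    childEnd.Close()
    done <- cfg
  }()

  // "Daemon" side: write the JSON blob (json.Encoder appends a '\n'), stop
  // writing, then wait for an error report from the child.
  cfg := map[string]interface{}{"args": []string{"/sbin/init"}}
  if err := json.NewEncoder(daemonEnd).Encode(cfg); err != nil {
    panic(err)
  }
  syscall.Shutdown(fds[0], syscall.SHUT_WR)

  var ierr interface{}
  err = json.NewDecoder(daemonEnd).Decode(&ierr)
  fmt.Println("daemon decode:", err)      // io.EOF: no error was reported
  fmt.Println("child got config:", <-done)
}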

One thing I had also done was run strace on a successful container launch (using the workaround of adding an extra environment variable). The difference was that the last call to read returned 11 bytes rather than filling the entire read buffer:

[pid 24238] read(3, "{\"args\":[\"/sbin/borginit\",\"resqu"..., 512) = 512
[pid 24239] <... select resumed> )      = 0 (Timeout)
[pid 24239] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 24238] read(3, "2.36:8125\",\"STATSD_IMPLEMENTATIO"..., 1024) = 1024
[pid 24239] <... select resumed> )      = 0 (Timeout)
[pid 24238] read(3,  <unfinished ...>
[pid 24239] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 24238] <... read resumed> "null,\"postmount_cmds\":null},{\"so"..., 2048) = 2048
[pid 24238] read(3,  <unfinished ...>
[pid 24239] <... select resumed> )      = 0 (Timeout)
[pid 24239] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 24238] <... read resumed> "80,\"propagation_flags\":null,\"dat"..., 4096) = 4096
[pid 24239] <... select resumed> )      = 0 (Timeout)
[pid 24239] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 24238] read(3, "_count\":0}\n", 8192) = 11

All of this led me to believe that there might be some data sitting in the pipe that was not being read before the pipe was closed, which leads to the connection reset error. A simple hack showed that this was in fact the case, AND what exactly had been left behind. After recompiling docker and running a container with strace attached to the daemon, we can see exactly what’s going on:

[pid 27263] read(3,  <unfinished ...>
[pid 27263] <... read resumed> "2.36:8125\",\"STATSD_IMPLEMENTATIO"..., 1024) = 1024
[pid 27263] read(3,  <unfinished ...>
[pid 27263] <... read resumed> "mount_cmds\":null},{\"source\":\"tmp"..., 2048) = 2048
[pid 27263] read(3,  <unfinished ...>
[pid 27263] <... read resumed> "ation_flags\":null,\"data\":\"\",\"rel"..., 4096) = 4096
[...snip...]
[pid 27263] read(3, "\n", 2048)         = 1

BINGO. There’s a newline in the pipe that isn’t being read! My assumption as to what’s happening (a minimal reproduction follows this list):

  1. The new container reads its configuration as JSON
  2. The JSON decoder does not read the final \n from the pipe, because as far as it’s concerned it has completed unmarshaling.
  3. The pipe is closed, which causes the parent to fail with ECONNRESET when it tries to read back a response.
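
The failing case can be reproduced outside Docker with a short program (again a sketch of my own, not Docker’s code) by simulating the unlucky buffer boundary: the child consumes exactly the JSON object, leaves the encoder’s trailing '\n' queued in the socket, and closes. The daemon’s waiting read then comes back with the familiar error:

// Illustration only: reproduce "read parent: connection reset by peer".
package main

import (
  "encoding/json"
  "fmt"
  "io"
  "os"
  "syscall"
)

func main() {
  fds, err := syscall.Socketpair(syscall.AF_LOCAL, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
  if err != nil {
    panic(err)
  }
  // Named "parent" so the error string matches the one in the logs above.
  daemonEnd := os.NewFile(uintptr(fds[0]), "parent")
  childEnd := os.NewFile(uintptr(fds[1]), "child")

  // Daemon side: json.Marshal plus an explicit '\n', which is exactly what
  // json.Encoder.Encode would produce.
  payload, _ := json.Marshal(map[string]interface{}{"args": []string{"/sbin/init"}})
  daemonEnd.Write(payload)
  daemonEnd.Write([]byte("\n"))
  syscall.Shutdown(fds[0], syscall.SHUT_WR)

  // Child side: simulate the boundary case by reading exactly up to the
  // closing '}', leaving the '\n' queued in the socket, then closing.
  buf := make([]byte, len(payload))
  if _, err := io.ReadFull(childEnd, buf); err != nil {
    panic(err)
  }
  childEnd.Close() // one unread byte is still in the receive queue

  // Daemon side: wait for the child's error report, as in the runc code
  // quoted earlier. The read fails because the peer closed with unread data.
  var ierr interface{}
  err = json.NewDecoder(daemonEnd).Decode(&ierr)
  fmt.Println(err) // read parent: connection reset by peer
}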

What’s the best way to fix this? I leave that up to the Docker team. I can think of two options off the top of my head (the second is sketched after this list):

  • Use my disgusting hack
  • Remove the newline from the JSON encoded configuration
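
To illustrate the second option (a sketch of my own, not the actual patch; sendConfig is a hypothetical helper): write the marshaled bytes yourself instead of going through json.Encoder, so no trailing newline ever enters the pipe for the child to leave behind:

// Hypothetical daemon-side helper, not a real Docker function.
package main

import (
  "encoding/json"
  "io"
  "os"
)

// sendConfig writes cfg as JSON without the trailing '\n' that
// json.NewEncoder(w).Encode(cfg) would append.
func sendConfig(w io.Writer, cfg interface{}) error {
  data, err := json.Marshal(cfg)
  if err != nil {
    return err
  }
  _, err = w.Write(data)
  return err
}

func main() {
  // Demo: the output ends at the closing '}', with no newline after it.
  sendConfig(os.Stdout, map[string]interface{}{"args": []string{"/sbin/init"}})
}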

@airhorns It has to do with the total size of the JSON configuration that’s passed down from the docker daemon to the container. This is influenced by the length of the command line; specifically, the reason the “add the environment variable” workaround works is that adding that environment variable changes the total size of the JSON.

The underlying data stream being fed into the JSON decoder is buffered (as demonstrated by the read syscalls). The reads start at 512 bytes, and the amount requested doubles each time the decoder needs more data. 99.9% of the time this works correctly, because the trailing newline at the end of the JSON is read from the pipe into the buffer along with everything else. The failure is the 0.1% of the time when the last character that fits in the underlying buffer (the buffer being filled by the read syscall) is the final }. At that point the JSON decoder says, “I’m done!” and doesn’t ask for any more data, even though the newline added by the encoder is still sitting in the pipe.
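
That read pattern is easy to reproduce in isolation. Here is a small experiment of my own (the exact request sizes depend on the encoding/json internals of the Go version in use) that wraps the decoder’s input in a reader which logs every Read call; with a config-sized document the requested sizes follow the same 512/1024/2048/4096 doubling seen in the strace above:

// Observe how json.Decoder grows its reads.
package main

import (
  "bytes"
  "encoding/json"
  "fmt"
  "io"
  "strings"
)

// loggingReader reports how much room the decoder asked for on each Read.
type loggingReader struct{ r io.Reader }

func (l *loggingReader) Read(p []byte) (int, error) {
  n, err := l.r.Read(p)
  fmt.Printf("decoder asked for %d bytes, got %d\n", len(p), n)
  return n, err
}

func main() {
  // A config-sized JSON document, with the trailing '\n' an encoder would add.
  var doc bytes.Buffer
  doc.WriteString(`{"args":["/sbin/init"],"padding":"`)
  doc.WriteString(strings.Repeat("x", 7000))
  doc.WriteString("\"}\n")

  var v map[string]interface{}
  if err := json.NewDecoder(&loggingReader{r: &doc}).Decode(&v); err != nil {
    panic(err)
  }
}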

IMO the real fix for this would be for the Go JSON decoder to consume the final newline when decoding, since the encoder is what adds that newline in the first place.
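
For completeness, a consumer-side mitigation is also conceivable (a sketch of my own; readConfig is hypothetical, and this is not necessarily what the workaround commit linked above does): after decoding, drain the pipe to EOF before closing it. This relies on the daemon having already shut down its write half (the shutdown(SHUT_WR) visible in the strace), so the drain terminates instead of blocking:

// Hypothetical child-side helper, not a real Docker function.
package main

import (
  "encoding/json"
  "fmt"
  "io"
  "strings"
)

// readConfig decodes the config and then consumes anything still queued in
// the pipe (such as the encoder's trailing '\n'), so that closing the pipe
// cannot reset the daemon's end.
func readConfig(pipe io.ReadCloser, cfg interface{}) error {
  if err := json.NewDecoder(pipe).Decode(cfg); err != nil {
    return err
  }
  if _, err := io.Copy(io.Discard, pipe); err != nil {
    return err
  }
  return pipe.Close()
}

func main() {
  // Demo with an in-memory stand-in for the init pipe.
  pipe := io.NopCloser(strings.NewReader("{\"args\":[\"/sbin/init\"]}\n"))
  var cfg map[string]interface{}
  if err := readConfig(pipe, &cfg); err != nil {
    panic(err)
  }
  fmt.Println(cfg)
}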