moby: docker top fails when large PIDs exceed 5 figures
Description
On a system whose pid_max allows PIDs longer than 5 digits, docker top
on a container fails with what appears to be a parsing failure of the returned values:
$ docker run -it --rm --name health health
$ docker top health
Error response from daemon: Unexpected pid '103005root': strconv.Atoi: parsing "103005root": invalid syntax
time="2017-07-27T07:50:58.385629919-04:00" level=error msg="Handler for GET /v1.30/containers/health/top returned error: Unexpected pid '103005root': strconv.Atoi: parsing \"103005root\": invalid syntax"
Steps to reproduce the issue:
- Check pid_max to see if your kernel supports a 6-figure number:
  $ cat /proc/sys/kernel/pid_max
  131072
  # if not, change it:
  $ echo 131072 > /proc/sys/kernel/pid_max
- Run enough processes to get to large PIDs; I’d expect some sort of for loop should do this pretty quickly
- Attempt to run docker top:
  Docker client output:
  $ docker top health
  Error response from daemon: Unexpected pid '103005root': strconv.Atoi: parsing "103005root": invalid syntax
Describe the results you received: Did not give me top output
Describe the results you expected: I should get docker top output
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
$ docker version
Client:
Version: 17.06.0-ce
API version: 1.30
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:15:15 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.0-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:51:55 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
$ docker info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 290
Server Version: 17.06.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.33
Operating System: Alpine Linux v3.6
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.94GiB
Name: alpine
ID: ANNE:4FUU:XCPW:GM7X:3OBN:E4KZ:VY63:ZB6H:NS3W:MAOI:54XZ:RBHL
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: mbentley
Registry: https://index.docker.io/v1/
Labels:
foo=bar
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.): VMware Fusion VM
About this issue
- State: closed
- Created 7 years ago
- Comments: 32 (24 by maintainers)
Two quick questions:
Today’s Docker Desktop Edge release should have the fix. It should also be in the next stable release.
For the time being, those who wish to work around it as I did can do the following:
Start a new privileged throwaway container:
Identify if your pid_max is over five digits long:
Change it to 99999:
This should allow use cases such as Jenkins running via Docker-in-Docker (on Windows) to work reliably afterwards.
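The pid_max check in the steps above could be sketched in Go roughly as follows. This is only an illustration: the helper names exceedsFiveDigits and parsePidMax are hypothetical, and the program only reads the value, it does not change it. It relies on the kernel treating pid_max as an exclusive upper bound, so the largest PID that can be allocated is pid_max minus one.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// exceedsFiveDigits reports whether a given pid_max value lets the kernel
// hand out PIDs with more than five digits. PIDs equal to or larger than
// pid_max are never allocated, so the largest possible PID is pidMax-1.
func exceedsFiveDigits(pidMax int) bool {
	return pidMax-1 > 99999
}

// parsePidMax parses the contents of /proc/sys/kernel/pid_max, which is a
// single decimal number followed by a newline.
func parsePidMax(contents string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(contents))
}

func main() {
	data, err := os.ReadFile("/proc/sys/kernel/pid_max")
	if err != nil {
		fmt.Println("could not read pid_max:", err)
		return
	}
	pidMax, err := parsePidMax(string(data))
	if err != nil {
		fmt.Println("could not parse pid_max:", err)
		return
	}
	fmt.Printf("pid_max=%d, PIDs can exceed five digits: %t\n",
		pidMax, exceedsFiveDigits(pidMax))
}
```

With the workaround value of 99999 the check reports false; with the 131072 used in the reproduction steps it reports true.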
It’s very likely due to a bug/presentation issue in busybox; docker uses the output of ps on the host (https://github.com/moby/moby/issues/34282#issuecomment-323077673), which has this issue if the host is using busybox (which is the case on docker for mac/linux), but not if it’s running on Ubuntu.
Until there’s a proper fix for the parsing in place, can’t the Linux VM (mobylinuxvm) just set /proc/sys/kernel/pid_max to 99999 on each boot to avoid this issue? I’ve been doing this by hand; it does work, but the fix has to be manually reapplied every time the host or the VM restarts.
I think that may actually be a good solution; one other option that may work is to customize the column headers. Not all variations of ps seem to support -o column:<width>, but custom names seem to be supported (even by busybox); from https://unix.stackexchange.com/a/313470
So doing something like;
Will make the columns wider, accommodating the underscores;
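The exact command from the original comment is not preserved here. Purely as a hypothetical sketch of the technique it describes, the Go snippet below builds ps arguments of the form -o name=HEADER, padding each header with underscores so that the column has to be at least as wide as its header. The helper names, the chosen columns and the width are assumptions for illustration, not what docker does, and whether a given ps binary honors the padded header still has to be verified per distro.

```go
package main

import (
	"fmt"
	"strings"
)

// padHeader pads a column header with underscores up to width characters.
// Headers that are already at least that long are returned unchanged.
func padHeader(header string, width int) string {
	if len(header) >= width {
		return header
	}
	return header + strings.Repeat("_", width-len(header))
}

// psArgs builds one "-o name=HEADER" pair per column. A separate -o is used
// for each column because a custom header runs to the end of its argument.
func psArgs(columns []string, width int) []string {
	args := make([]string, 0, 2*len(columns))
	for _, col := range columns {
		header := padHeader(strings.ToUpper(col), width)
		args = append(args, "-o", col+"="+header)
	}
	return args
}

func main() {
	// Column names and width are illustrative assumptions.
	args := psArgs([]string{"pid", "user", "args"}, 10)
	fmt.Println("ps", strings.Join(args, " "))
}
```

The resulting argument list could then be handed to exec.Command("ps", args...); the sketch stops short of running it because the available columns differ between ps implementations.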
Yes, this is where things could get complicated.
Assumptions:
- PID. Next column could start with a digit.
- ps -o pid so far seems to be a dependable way to get the pids on the few distros I’ve tried so far.
For this case, the delimiter with USER can be found with unicode.IsLetter(). That doesn’t help if the user specified arguments that would have values starting with a digit for the column after PID (for example TIME, CPU, etc).
One idea that I’m still working with is that ps -o pid will return only the pids. This will give something to compare the value at fieldsASCII(line)[pidIndex] to. If there is a match from ps -o pid with what’s in the broken line, we will be able to figure out at what index the PID field ends.
A quick update, if someone wants to look into this (don’t have time myself at this moment);
@mbentley sent me the output of ps -ef on the host, which is what docker is running behind the scenes to collect the information (see daemon/top_unix.go#L118-L155, and daemon/top_unix.go#L61-L111)
The problem here is that that output (at least by default) glues together the PID and USER columns, therefore docker fails to split them (which causes the error).
I did a quick search if there’s a portable way to change the format, so that we can guarantee that the columns are always separated. This ServerFault discussion provides some possible options; https://serverfault.com/a/157618
However, if there’s an option, we need to be sure it’s supported on all platforms/distros, given that docker shells out to the ps binary on the host, and we don’t know what that ps binary supports on each distro.
The docker top command is known to be problematic for the same reason, as the output (and options) are not standardised; ideally docker wouldn’t shell out, and would obtain the same information other ways, but that will be a bigger change.
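As a rough idea of what obtaining the same information without shelling out might look like on Linux, the Go sketch below parses the text of a /proc/<pid>/status file instead of ps output. It is an illustration only and not how docker implements top; the procInfo type and parseStatus helper are hypothetical, and only a handful of fields are extracted.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// procInfo holds a few of the fields that ps would normally print.
type procInfo struct {
	Name string
	PID  int
	PPID int
	UID  int // real UID, the first value on the "Uid:" line
}

// parseStatus extracts selected "Key:\tvalue" lines from the contents of a
// /proc/<pid>/status file. Unknown keys are ignored.
func parseStatus(text string) (procInfo, error) {
	var info procInfo
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), ":", 2)
		if len(parts) != 2 {
			continue
		}
		key, value := parts[0], strings.TrimSpace(parts[1])
		fields := strings.Fields(value)
		if len(fields) == 0 {
			continue
		}
		var err error
		switch key {
		case "Name":
			info.Name = value
		case "Pid":
			info.PID, err = strconv.Atoi(fields[0])
		case "PPid":
			info.PPID, err = strconv.Atoi(fields[0])
		case "Uid":
			info.UID, err = strconv.Atoi(fields[0])
		}
		if err != nil {
			return info, fmt.Errorf("parsing %s: %v", key, err)
		}
	}
	return info, sc.Err()
}

func main() {
	// Reads this process's own status file; this only works on Linux.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("could not read /proc/self/status:", err)
		return
	}
	info, err := parseStatus(string(data))
	if err != nil {
		fmt.Println("could not parse status:", err)
		return
	}
	fmt.Printf("%+v\n", info)
}
```

Because every field sits on its own labeled line, a six-digit PID cannot run into the next column the way it does in the ps output discussed above.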