moby: dockerd seems not to close file descriptors to /var/run/docker.sock and hits "too many open files" error
Description
I have a Docker swarm with 3 manager nodes. I am deploying a few services (fewer than 20) using the ‘docker stack deploy’ CLI. Several internal overlay networks are created and some services are exposed externally. Then multiple tests are run that basically access HTTP endpoints exposed by one of the services. After several hours, dockerd on one of the nodes hits the limit of open files. This issue started to appear once we began using 17.12.1-ce in our CI. Previously we used 17.06.2-ce for more than half a year and “too many open files” was never observed there.
Steps to reproduce the issue: As in description
Describe the results you received:
- /var/log/messages gets flooded with “too many open files” errors, making systemd-journal constantly consume 100% of one CPU core.
- docker CLI hangs
Describe the results you expected: Limit of file descriptors is not hit and docker CLI works.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Client:
Version: 17.12.1-ce
API version: 1.35
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:15:20 2018
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.1-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:17:54 2018
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
When this issue is encountered, no CLI commands work. Pasting output from another execution which has not hit this issue yet:
Containers: 12
Running: 8
Paused: 0
Stopped: 4
Images: 13
Server Version: 17.12.1-ce
Storage Driver: devicemapper
Pool Name: vg00-docker--pool
Pool Blocksize: 524.3kB
Base Device Size: 10.74GB
Backing Filesystem: xfs
Udev Sync Supported: true
Data Space Used: 7.73GB
Data Space Total: 21.47GB
Data Space Available: 13.75GB
Metadata Space Used: 1.704MB
Metadata Space Total: 218.1MB
Metadata Space Available: 216.4MB
Thin Pool Minimum Free Space: 2.147GB
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.107-RHEL7 (2015-10-14)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: k6zb3q7z4gz6v1qingmhlpg6f
Is Manager: true
ClusterID: n5y6hnmseoh9wxskc2237fbx6
Managers: 3
Nodes: 3
Orchestration:
Task History Retention Limit: 2
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.9.4.157
Manager Addresses:
10.9.4.121:2377
10.9.4.157:2377
10.9.4.177:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 3
Total Memory: 5.671GiB
Name: node-10-9-4-157
ID: KL7C:T2D7:OO2D:UKII:4NDF:KM7X:IEMK:WOCF:NTRY:XT44:6554:ADUR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 168
Goroutines: 337
System Time: 2018-05-14T19:27:01.265600845Z
EventsListeners: 9
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
docker-registry:5000
127.0.0.0/8
Registry Mirrors:
http://docker-registry:5001/
Live Restore Enabled: false
WARNING: bridge-nf-call-ip6tables is disabled
Additional environment details (AWS, VirtualBox, physical, etc.): 3 VMs with CentOS 7.2 on KVM.
First 1000 lines of warnings and errors from dockerd: messages.log.filtered.1000.log
[root@xxx]# cat lsof.docker.sock
(...)
dockerd 6368 root *526u unix 0xffff880148d21a40 0t0 2724573 /var/run/docker.sock
dockerd 6368 root *527u unix 0xffff880147495a40 0t0 2722306 /var/run/docker.sock
dockerd 6368 root *528u unix 0xffff880148d003c0 0t0 2720591 /var/run/docker.sock
dockerd 6368 root *529u unix 0xffff880147494000 0t0 2720592 /var/run/docker.sock
dockerd 6368 root *530u unix 0xffff880148d030c0 0t0 2720593 /var/run/docker.sock
dockerd 6368 root *531u unix 0xffff880148d22940 0t0 2747419 /var/run/docker.sock
dockerd 6368 root *532u unix 0xffff880148d2bc00 0t0 2720595 /var/run/docker.sock
dockerd 6368 root *533u unix 0xffff880147494b40 0t0 2722307 /var/run/docker.sock
dockerd 6368 root *534u unix 0xffff880148d230c0 0t0 2747420 /var/run/docker.sock
dockerd 6368 root *535u unix 0xffff880148d20780 0t0 2770958 /var/run/docker.sock
[root@xxx]# cat lsof.docker.sock | wc -l
65373
[root@xxx]# cat lsof.docker.sock | grep '/var/run/docker.sock' | wc -l
65372
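(Not part of the original report.) The same growth can also be tracked without lsof by listing /proc/<pid>/fd; the following is a minimal Go sketch, assuming 6368 is still dockerd's PID. Note that the fd symlinks only read back as socket:[inode], so mapping an inode to /var/run/docker.sock specifically still requires lsof or /proc/net/unix.

// fdcount.go: count open descriptors (and how many are sockets) of a
// process by listing /proc/<pid>/fd. Requires root when inspecting dockerd.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	pid := "6368" // assumed dockerd PID from the lsof output above; override via argv[1]
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	fdDir := "/proc/" + pid + "/fd"
	entries, err := os.ReadDir(fdDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	sockets := 0
	for _, e := range entries {
		// Each entry is a symlink such as "socket:[2724573]".
		target, err := os.Readlink(fdDir + "/" + e.Name())
		if err == nil && strings.HasPrefix(target, "socket:") {
			sockets++
		}
	}
	fmt.Printf("open fds: %d (sockets: %d)\n", len(entries), sockets)
}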
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 27 (11 by maintainers)
The thing is, in dockerd the client connection is handled by the stdlib HTTP server. These connections are closed out automatically except in a few specific cases (where the daemon hijacks the connection to enable two-way communication for container attach). The old metricsbeat code was definitely incorrect: in order to close out the response body, the docker client was waiting on a channel that metricsbeat never set, and receiving on a nil channel blocks forever.
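To make that concrete, here is a minimal sketch of the pattern described above (illustration only, not the actual metricsbeat or docker client code): a goroutine that receives on a nil channel blocks forever, so the response body is never closed and the connection to docker.sock is never released.

// leak_sketch.go: illustration of the described pattern. Receiving on a
// nil channel blocks forever, so the deferred Body.Close() never runs and
// each call pins one goroutine plus one connection to /var/run/docker.sock.
package main

import (
	"context"
	"net"
	"net/http"
	"time"
)

func fetch(client *http.Client, done chan struct{}) {
	resp, err := client.Get("http://docker/_ping") // host part is ignored by the unix-socket transport
	if err != nil {
		return
	}
	defer resp.Body.Close()
	<-done // done is nil: this blocks forever, so the deferred Close never runs
}

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}
	var done chan struct{} // never initialized or closed: the bug
	for i := 0; i < 10; i++ {
		go fetch(client, done) // each iteration leaks a goroutine and a socket
	}
	time.Sleep(time.Minute) // keep the process alive so the leak is observable, e.g. via lsof
}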
Maybe we can get a goroutine stack dump from docker to see what might still be running. You can get this from:
curl --unix-socket /var/run/docker.sock http://./debug/pprof/goroutine\?debug\=2
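For reference, the same dump can be fetched programmatically; below is a minimal Go sketch (not from the original thread) that queries the daemon's debug endpoint over the unix socket, assuming the daemon runs with debug enabled as the docker info output above shows.

// pprof_dump.go: fetch the daemon's goroutine stack dump over
// /var/run/docker.sock, equivalent to the curl command above.
package main

import (
	"context"
	"io"
	"net"
	"net/http"
	"os"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}
	resp, err := client.Get("http://docker/debug/pprof/goroutine?debug=2")
	if err != nil {
		os.Stderr.WriteString(err.Error() + "\n")
		os.Exit(1)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body) // the dump is plain text; write it straight out
}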