moby: dockerd seems not to close file descriptors to /var/run/docker.sock and hits "too many open files" error

Description I have a Docker swarm of 3 manager nodes. I am deploying a few services (fewer than 20) using the ‘docker stack deploy’ CLI. Several internal overlay networks are created and some services are exposed externally. Then multiple tests are run that basically access HTTP endpoints exposed by one of the services. After several hours dockerd on one of the nodes hits the limit of open files. This issue started to appear once we started using 17.12.1-ce in our CI. Previously we had used 17.06.2-ce for more than half a year and ‘too many open files’ was not observed there.

Steps to reproduce the issue: As in description

Describe the results you received:

  • /var/log/messages gets flooded with “too many open files” errors, making systemd-journal constantly consume 100% of one CPU core.
  • The docker CLI hangs.

Describe the results you expected: The file descriptor limit is not hit and the docker CLI keeps working.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:	17.12.1-ce
 API version:	1.35
 Go version:	go1.9.4
 Git commit:	7390fc6
 Built:	Tue Feb 27 22:15:20 2018
 OS/Arch:	linux/amd64

Server:
 Engine:
  Version:	17.12.1-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.4
  Git commit:	7390fc6
  Built:	Tue Feb 27 22:17:54 2018
  OS/Arch:	linux/amd64
  Experimental:	false

Output of docker info: Once this issue is encountered, no CLI commands work; pasting output from another run which has not hit the issue yet:

Containers: 12
 Running: 8
 Paused: 0
 Stopped: 4
Images: 13
Server Version: 17.12.1-ce
Storage Driver: devicemapper
 Pool Name: vg00-docker--pool
 Pool Blocksize: 524.3kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Udev Sync Supported: true
 Data Space Used: 7.73GB
 Data Space Total: 21.47GB
 Data Space Available: 13.75GB
 Metadata Space Used: 1.704MB
 Metadata Space Total: 218.1MB
 Metadata Space Available: 216.4MB
 Thin Pool Minimum Free Space: 2.147GB
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: k6zb3q7z4gz6v1qingmhlpg6f
 Is Manager: true
 ClusterID: n5y6hnmseoh9wxskc2237fbx6
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 2
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.9.4.157
 Manager Addresses:
  10.9.4.121:2377
  10.9.4.157:2377
  10.9.4.177:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 3
Total Memory: 5.671GiB
Name: node-10-9-4-157
ID: KL7C:T2D7:OO2D:UKII:4NDF:KM7X:IEMK:WOCF:NTRY:XT44:6554:ADUR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 168
 Goroutines: 337
 System Time: 2018-05-14T19:27:01.265600845Z
 EventsListeners: 9
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 docker-registry:5000
 127.0.0.0/8
Registry Mirrors:
 http://docker-registry:5001/
Live Restore Enabled: false

WARNING: bridge-nf-call-ip6tables is disabled

Additional environment details (AWS, VirtualBox, physical, etc.): 3 VMs with CentOS 7.2 on KVM.

First 1000 lines of warnings and errors from dockerd messages.log.filtered.1000.log

[root@xxx]# cat lsof.docker.sock
(...)
dockerd 6368 root *526u  unix 0xffff880148d21a40      0t0  2724573 /var/run/docker.sock
dockerd 6368 root *527u  unix 0xffff880147495a40      0t0  2722306 /var/run/docker.sock
dockerd 6368 root *528u  unix 0xffff880148d003c0      0t0  2720591 /var/run/docker.sock
dockerd 6368 root *529u  unix 0xffff880147494000      0t0  2720592 /var/run/docker.sock
dockerd 6368 root *530u  unix 0xffff880148d030c0      0t0  2720593 /var/run/docker.sock
dockerd 6368 root *531u  unix 0xffff880148d22940      0t0  2747419 /var/run/docker.sock
dockerd 6368 root *532u  unix 0xffff880148d2bc00      0t0  2720595 /var/run/docker.sock
dockerd 6368 root *533u  unix 0xffff880147494b40      0t0  2722307 /var/run/docker.sock
dockerd 6368 root *534u  unix 0xffff880148d230c0      0t0  2747420 /var/run/docker.sock
dockerd 6368 root *535u  unix 0xffff880148d20780      0t0  2770958 /var/run/docker.sock
[root@xxx]# cat lsof.docker.sock | wc -l
65373
[root@xxx]# cat lsof.docker.sock | grep '/var/run/docker.sock' | wc -l
65372
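
For tracking the descriptor count over time without the full lsof pipeline, here is a minimal Go sketch (not part of the original report) that walks /proc/<pid>/fd. It only distinguishes sockets from other descriptors, since mapping a socket back to /var/run/docker.sock would additionally require cross-referencing /proc/net/unix, which is what lsof does above.

// fdcount.go - rough alternative to the lsof pipeline above: counts the open
// file descriptors of a PID and how many of them are sockets.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
		os.Exit(1)
	}
	fdDir := filepath.Join("/proc", os.Args[1], "fd")

	// Needs the same privileges as the target process (root for dockerd).
	entries, err := os.ReadDir(fdDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	total, sockets := 0, 0
	for _, e := range entries {
		total++
		// Each entry is a symlink such as "socket:[2724573]" for sockets
		// or a filesystem path for regular files.
		target, err := os.Readlink(filepath.Join(fdDir, e.Name()))
		if err != nil {
			continue
		}
		if strings.HasPrefix(target, "socket:") {
			sockets++
		}
	}
	fmt.Printf("pid %s: %d open fds, %d sockets\n", os.Args[1], total, sockets)
}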

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 27 (11 by maintainers)

Most upvoted comments

The thing is, in dockerd the client connection is handled by the stdlib http server. These are automatically closed out except in a few specific cases (where the daemon hijacks the connection to enable 2-way communication for container attach). The old metricsbeat code was definitely incorrect: in order to close out the response body, the docker client was waiting on a channel that metricsbeat never set, i.e. a nil channel, which will always block.
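
To illustrate the pattern described above (a minimal, hypothetical Go sketch, not the actual metricsbeat or docker client code): a receive from a nil channel blocks forever, so a goroutine that waits on an unset channel before closing a response body never gets to close it, and the connection’s file descriptor stays open.

package main

import (
	"fmt"
	"time"
)

func main() {
	// The channel is declared but never created with make(), so it is nil.
	var done chan struct{}

	go func() {
		// Stand-in for the client goroutine that waits to be told it may
		// close the response body. A receive from a nil channel blocks
		// forever, so this goroutine never exits and the body (and its
		// connection/fd) would never be closed.
		<-done
		fmt.Println("never reached")
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("main exits while the goroutine is still blocked on the nil channel")
}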

Maybe we can get a goroutine stack dump from docker to see what might still be running. You can get this from curl --unix-socket /var/run/docker.sock http://./debug/pprof/goroutine\?debug\=2
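
The same dump can also be fetched programmatically; a small Go sketch (illustrative only, and the host name "docker" in the URL is an arbitrary placeholder because the dialer ignores it) that talks to the pprof endpoint over the unix socket:

package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
)

func main() {
	// Route every request over the unix socket instead of TCP.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}

	// Same endpoint as the curl command above; the host part is ignored.
	resp, err := client.Get("http://docker/debug/pprof/goroutine?debug=2")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close() // always close the body so the connection can be released

	io.Copy(os.Stdout, resp.Body) // write the goroutine stacks to stdout
}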

On Wed, May 16, 2018 at 3:15 PM Przemysław Strzelczak <notifications@github.com> wrote:

I might soon be asked to delete the environment with the instance of this problem. Could you recommend what kind of logs I should collect beforehand?
