moby: Swarm Nodes Become Unavailable When Swarm is Under Heavy Load


BUG REPORT INFORMATION

Description

When any of the services in the swarm is under heavy load, one or more services (possibly all of them) become unavailable, because the nodes they are running on become unavailable. The log excerpts below should clarify what this means.

Steps to reproduce the issue:

  1. Create a swarm and some services in the usual way
  2. Put one or more of those services under very heavy load (e.g. 99% CPU); a reproduction sketch follows below
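
A minimal reproduction sketch, assuming the swarm is already initialized. The network name appnet, the service names db-master and loadgen, and the use of the progrium/stress image to generate CPU load are illustrative placeholders, not copied verbatim from our setup:

# encrypted overlay network and a MySQL service, roughly as in our environment
docker network create --driver overlay --opt encrypted appnet
docker service create --name db-master --network appnet \
  -e MYSQL_ROOT_PASSWORD=changeme mysql:5.7

# put sustained CPU load on the same nodes (any CPU-heavy workload will do)
docker service create --name loadgen --network appnet --replicas 3 \
  progrium/stress --cpu 2 --timeout 600s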

Describe the results you received: Services become unreachable. For example, when trying to connect from another service to a service running MySQL, you get “ERROR 2005 (HY000): Unknown MySQL server host ‘db-master’ (0)”, where “db-master” is the name of the service running MySQL.
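
A quick way to confirm that it is name resolution that breaks is to check from inside a container of another service; the container name app.1.abc123 below is a placeholder:

docker exec -it app.1.abc123 getent hosts db-master   # prints nothing while the name cannot be resolved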

Docker Engine logs show:

... "memberlist: Marking ip-10-0-27-201-3b736f1d4651 as failed, suspect timeout reached"
... "memberlist: Failed TCP fallback ping: read tcp 10.0.25.121:52898->10.0.27.201:7946: i/o timeout"
... "memberlist: Suspect ip-10-0-27-201-3b736f1d4651 has failed, no acks received"
... "memberlist: Marking ip-10-0-27-201-3b736f1d4651 as failed, suspect timeout reached"
...

Describe the results you expected: Swarm services to stay up and discoverable even under heavy load. It looks as if there are not enough resources left for the nodes to exchange information about the state of the swarm effectively?
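
When this happens, the node and task state can also be observed from any manager with standard commands (nothing here is specific to our setup):

docker node ls                # affected nodes typically show STATUS Down while the load persists
docker service ps db-master   # shows tasks failing or being rescheduled off the affected node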

Additional information you deem important (e.g. issue happens only occasionally):

  • We have multiple masters
  • Experimental flag is set, but we are not using any experimental features
  • Problem never occurs on our staging environment, where load never reaches high levels
  • I’ve tried to alleviate the problem a little by adding --limit-cpu and --limit-memory options (set to around 90% of the node’s resources) to the services that cause the high load, but those services still sometimes hit 100% CPU (see the sketch after this list)
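
For reference, a sketch of the limits mentioned in the last point, assuming a 1-CPU / ~2 GB node like the one in the docker info output below; the service name app and the exact values are illustrative:

docker service update --limit-cpu 0.9 --limit-memory 1800M app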

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

Containers: 8
 Running: 4
 Paused: 0
 Stopped: 4
Images: 9
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 176
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: cfoixqufmzsoe58np7go5ftwj
 Is Manager: true
 ClusterID: ez0wwtl7pywsplpnygul8y2v8
 Managers: 4
 Nodes: 9
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.0.27.240
 Manager Addresses:
  10.0.25.121:2377
  10.0.25.171:2377
  10.0.27.240:2377
  10.0.27.59:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-45-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.952 GiB
Name: ip-10-0-27-240
ID: 6LVV:2FUY:E2TD:DKIR:LFAX:3HF6:6ZZI:65CM:F7YY:CQ2X:IVJO:F23E
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 type=web
 environment=production
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

  • On AWS inside VPC. Following ports are open:
    • TCP: 2377, 4789, 7946
    • UDP: 7946, 4789
    • ESP: All
    • Some additional TCP and UDP ports are also open, for other host services
  • We’re using an overlay network with the encrypted option enabled (a firewall and network sketch follows below)
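
For completeness, the reachability in this list expressed as host firewall rules; this is only an illustration (the ports are actually opened via AWS security groups), and the network name appnet is a placeholder:

# swarm control plane and data plane
iptables -A INPUT -p tcp --dport 2377 -j ACCEPT   # cluster management
iptables -A INPUT -p tcp --dport 7946 -j ACCEPT   # node discovery / gossip (TCP)
iptables -A INPUT -p udp --dport 7946 -j ACCEPT   # node discovery / gossip (UDP)
iptables -A INPUT -p udp --dport 4789 -j ACCEPT   # overlay (VXLAN) traffic
iptables -A INPUT -p esp -j ACCEPT                # IPSec ESP, needed for encrypted overlays

# the encrypted overlay network is created along these lines
docker network create --driver overlay --opt encrypted appnet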

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 13
  • Comments: 33 (6 by maintainers)

Most upvoted comments

Is anyone looking into this? We’ve migrated fully to swarm, and are now slowly rolling back piece by piece. We can’t run our platform like this, since as a result of the above symptoms we’re serving back 500s every so often.

Spoke too soon; we still have the problem, even with normal networking. It happens seemingly at random, but I’m sure it can’t really be random. Any suggestions on which logs I should collect, and from which types of nodes?
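
For anyone gathering data on this, engine log lines like the ones quoted in this issue can usually be collected on systemd-based hosts as follows (journald assumed; the time window is arbitrary):

journalctl -u docker.service --since "1 hour ago" > docker-engine.log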

Similar problem here; it occurs under heavy I/O load, with both integrated and external storage.

2017-07-12T12:48:49.941714411Z memberlist: Failed fallback ping: read tcp: i/o timeout
2017-07-12T12:48:49.941893525Z memberlist: Suspect xxx-79599d82f216 has failed, no acks received
2017-07-12T12:48:52.471544290Z Node join event for srv067171.nue2.bigpoint.net-79599d82f216/10.72.67.171
2017-07-12T12:48:52.847381111Z memberlist: Was able to connect to xxx but other probes failed, network may be miscon

This occurred on a 2-node setup with plenty of CPU (80 cores / 256 GB RAM / 10 Gbit network interconnect) while only disk I/O was at 100%. I haven’t checked whether consul waits for disk I/O or what the timeout is; it would be nice to be able to increase the timeout to avoid this.

We are now starting the containers manually on each node, still using swarm networking / macvlan, to avoid having the containers killed on timeout+rejoin events.
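
One possible shape of that workaround, run on each node individually; the subnet, gateway, parent interface, network name, and image below are placeholders rather than values taken from the comment above:

# per-node macvlan network and a manually started container (outside the orchestrator)
docker network create -d macvlan \
  --subnet=192.168.40.0/24 --gateway=192.168.40.1 \
  -o parent=eth0 app_macvlan
docker run -d --name app --network app_macvlan nginx:alpine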

I had similar problems to those described above (with Ubuntu 16.04.2 LTS, 4.4.0-78-generic kernel), though I never got a kernel panic and the VPS is not at DigitalOcean.

Installing linux-generic-hwe-16.04 (apt-get install linux-generic-hwe-16.04) brought in a 4.8 kernel (4.8.0-52-generic), and the problem seems to be gone since.