moby: Was our cluster broken?

One of our swarm clusters apparently failed recently. I’ve pasted a summary of the logs below. Bottom line : services did not move between hosts, nodes could not go into drain mode, and a simple docker service logs [SERVICE_NAME] would fail.

Was the cluster broken? Docker itself reported all nodes as reachable (and one leader). We ended up deleting the cluster and recreating it. Is this the only way to go? Could we have done anything else?

Oct 20 13:04:47 cphprodswarm1node-01 dockerd[30944]: 2017/10/20 13:04:47 grpc: Server.processUnaryRPC failed to write status stream error: code = 4 desc = "context deadline exceeded"
Oct 20 13:05:01 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:04:59.080100300+02:00" level=info msg="6c1948bceedf62d5 [term: 138] received a MsgHeartbeat message with higher term from 1165dcf1b976c8d3 [term: 140]" module=raft n$
Oct 20 13:05:05 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:05.013355200+02:00" level=info msg="6c1948bceedf62d5 became follower at term 140" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 20 13:05:09 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:08.157192700+02:00" level=info msg="raft.node: 6c1948bceedf62d5 changed leader from 6c1948bceedf62d5 to 1165dcf1b976c8d3 at term 140" module=raft node.id=dl0qsmsn$
Oct 20 13:05:13 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:12.955582900+02:00" level=warning msg="memberlist: failed to receive: read tcp 192.168.81.104:7946->192.168.81.106:52904: i/o timeout from=192.168.81.106:52904"
Oct 20 13:05:14 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:14.602878600+02:00" level=warning msg="memberlist: Push/Pull with cphprodswarm1node-03-dbadc84f9f54 failed: write tcp 192.168.81.104:55672->192.168.81.106:7946: i$
Oct 20 13:05:15 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:15.130800500+02:00" level=warning msg="memberlist: Failed to send ack: write tcp 192.168.81.104:7946->192.168.81.107:42268: i/o timeout from=192.168.81.107:42268"
Oct 20 13:05:21 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:21.005008900+02:00" level=warning msg="memberlist: Failed fallback ping: write tcp 192.168.81.104:50940->192.168.81.107:7946: i/o timeout"
Oct 20 13:05:21 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:21.005064500+02:00" level=info msg="memberlist: Suspect cphprodswarm1node-04-de254ed95721 has failed, no acks received"
Oct 20 13:05:22 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:22.185895800+02:00" level=error msg="failed to remove node" error="rpc error: code = 10 desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=dl0qs$
Oct 20 13:05:22 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:22.337908300+02:00" level=warning msg="memberlist: Refuting a suspect message (from: cphprodswarm1node-04-de254ed95721)"
Oct 20 13:05:22 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:21.986473200+02:00" level=error msg="dispatcher batch failed" error="raft: failed to process the request: node lost leader status" method="(*Dispatcher).processUp$

Quickly followed by a lot of these (for a couple of minutes)

Oct 20 13:06:41 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:06:41.126648300+02:00" level=warning msg="Peer operation failed:could not delete fdb entry into the sandbox: could not find the neighbor entry to delete op:&{2 k5byw$
Oct 20 13:06:41 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:06:41.155117700+02:00" level=warning msg="rmServiceBinding handleEpTableEvent utilities_logstash c665a2c2a25c2997224329a9bec7eee0aee765dd0e746693bdbe9fccce26de39 abo$

Then, a lot of these (for 12 hours or so)

Oct 20 13:07:09 cphprodswarm1node-01 kernel: [3033708.255168] aufs au_opts_verify:1597:dockerd[3829]: dirperm1 breaks the protection by the permission bits on the lower branch
Oct 20 13:07:09 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:07:09.753781700+02:00" level=error msg="fatal task error" error="starting container failed: Address already in use" module="node/agent/taskmanager" node.id=dl0qsmsnu$

Suddenly followed by

Oct 21 01:23:53 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:53.594562900+02:00" level=error msg="fatal task error" error="task: non-zero exit (1)" module="node/agent/taskmanager" node.id=dl0qsmsnuiwwrk5vwcde9i5jk service.i$
Oct 21 01:23:55 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:55.886511600+02:00" level=info msg="6c1948bceedf62d5 [term: 148] received a MsgVote message with higher term from 239615ad86d93d37 [term: 149]" module=raft node.i$
Oct 21 01:23:55 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:55.886954800+02:00" level=info msg="6c1948bceedf62d5 became follower at term 149" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:23:55 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:55.887176300+02:00" level=info msg="6c1948bceedf62d5 [logterm: 148, index: 225269, vote: 0] cast MsgVote for 239615ad86d93d37 [logterm: 148, index: 225269] at ter$
Oct 21 01:23:55 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:55.887386900+02:00" level=info msg="raft.node: 6c1948bceedf62d5 lost leader 67326aa72ce66aa0 at term 149" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:23:56 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:56.041822800+02:00" level=info msg="raft.node: 6c1948bceedf62d5 elected leader 239615ad86d93d37 at term 149" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:23:56 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:56.880694500+02:00" level=error msg="agent: session failed" error="rpc error: code = 1 desc = context canceled" module="node/agent" node.id=dl0qsmsnuiwwrk5vwcde9i$

20 minutes later, the log gets flooded with

Oct 21 01:39:28 cphprodswarm1node-01 dockerd[63999]: time=“2017-10-21T01:39:28.101515500+02:00” level=warning msg=“Deleting bridge mac mac 02:42:0b:01:08:0c failed, no such file or directory” Oct 21 01:39:28 cphprodswarm1node-01 dockerd[63999]: time=“2017-10-21T01:39:28.214081400+02:00” level=warning msg=“Deleting bridge mac mac 02:42:0b:01:08:05 failed, no such file or directory”

And the last meaningful entries in the log

Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293496700+02:00" level=info msg="6c1948bceedf62d5 is starting a new election at term 175" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293549300+02:00" level=info msg="6c1948bceedf62d5 became candidate at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293564500+02:00" level=info msg="6c1948bceedf62d5 received MsgVoteResp from 6c1948bceedf62d5 at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293578800+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279] sent MsgVote request to 1165dcf1b976c8d3 at term 176" module=raft node.id=dl0qs$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293592100+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279] sent MsgVote request to 239615ad86d93d37 at term 176" module=raft node.id=dl0qs$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293603100+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279] sent MsgVote request to 25b366c96514af76 at term 176" module=raft node.id=dl0qs$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293614100+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279] sent MsgVote request to 67326aa72ce66aa0 at term 176" module=raft node.id=dl0qs$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.299672700+02:00" level=info msg="6c1948bceedf62d5 received MsgVoteResp rejection from 25b366c96514af76 at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.299713300+02:00" level=info msg="6c1948bceedf62d5 [quorum:3] has received 1 MsgVoteResp votes and 1 vote rejections" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.299854800+02:00" level=info msg="6c1948bceedf62d5 received MsgVoteResp from 239615ad86d93d37 at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.299868800+02:00" level=info msg="6c1948bceedf62d5 [quorum:3] has received 2 MsgVoteResp votes and 1 vote rejections" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.300171300+02:00" level=info msg="6c1948bceedf62d5 received MsgVoteResp rejection from 1165dcf1b976c8d3 at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.300193600+02:00" level=info msg="6c1948bceedf62d5 [quorum:3] has received 2 MsgVoteResp votes and 2 vote rejections" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.423381700+02:00" level=warning msg="Deleting bridge mac mac 02:42:0b:01:08:05 failed, no such file or directory"
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.912212800+02:00" level=error msg="agent: session failed" error="rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few manager$
Oct 21 01:47:45 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:45.257173700+02:00" level=info msg="6c1948bceedf62d5 [term: 176] received a MsgVote message with higher term from 1165dcf1b976c8d3 [term: 177]" module=raft node.i$
Oct 21 01:47:45 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:45.257237500+02:00" level=info msg="6c1948bceedf62d5 became follower at term 177" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:45 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:45.257252300+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279, vote: 0] cast MsgVote for 1165dcf1b976c8d3 [logterm: 170, index: 229280] at ter$
Oct 21 01:47:45 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:45.261751300+02:00" level=info msg="raft.node: 6c1948bceedf62d5 elected leader 1165dcf1b976c8d3 at term 177" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:48 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:48.389482700+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 01:47:49 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:49.124746000+02:00" level=info msg="Node join event for cphprodswarm1node-05-35df577dce77/192.168.81.108"

And finally, the log got flooded with this final one

Oct 21 05:16:52 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:16:52.115078700+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:16:56 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:16:56.615764700+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:00 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:00.207821400+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:07 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:07.469317000+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:08 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:08.729252300+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:10 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:10.593714600+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:13 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:13.588298800+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:15 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:15.293011500+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 18 (3 by maintainers)

Most upvoted comments

We just had a complete cluster failure with similar symptoms to this and have found this open ticket. We had recently updated traefik and switched from host to routing mesh mode for the ingress posts only a few hours before, which may be completely coincidental.

Aug  3 00:00:06 zen dockerd[2089228]: time="2020-08-03T00:00:06.959175017Z" level=error msg="agent: session failed" backoff=8s error="rpc error: code = Aborted desc = dispatcher is stopped" module=node/agent node.id=klzq8ldc5e5ij8bt7kcogwigt
Aug  3 00:00:06 zen dockerd[2089228]: time="2020-08-03T00:00:06.959242503Z" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=klzq8ldc5e5ij8bt7kcogwigt
Aug  3 00:00:06 zen dockerd[2089228]: time="2020-08-03T00:00:06.959277548Z" level=info msg="waiting 6.519164787s before registering session" module=node/agent node.id=klzq8ldc5e5ij8bt7kcogwigt
Aug  3 00:00:07 zen dockerd[2089228]: time="2020-08-03T00:00:07.317005975Z" level=warning msg="Health check for container 150636d1fde82922f889e3ba7cd4c238e12784b29a86e93f9133bce46f693106 error: No such exec instance: 60ff42c6d4010b0cefb2a0e092450f7676e5846efb865c745f64b24b23091ec7"
Aug  3 00:00:07 zen dockerd[2089228]: time="2020-08-03T00:00:07.317160617Z" level=error msg="Error replicating health state for container 150636d1fde82922f889e3ba7cd4c238e12784b29a86e93f9133bce46f693106: open /var/lib/docker/containers/150636d1fde82922f889e3ba7cd4c238e12784b29a86e93f9133bce46f693106/.tmp-config.v2.json486493204: no such file or directory"
Aug  3 00:00:13 zen dockerd[2089228]: time="2020-08-03T00:00:13.478951521Z" level=error msg="agent: session failed" backoff=8s error="rpc error: code = Aborted desc = dispatcher is stopped" module=node/agent node.id=klzq8ldc5e5ij8bt7kcogwigt
Linux zen 5.4.0-1021-aws #21-Ubuntu SMP Fri Jul 24 09:42:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Client: Docker Engine - Community
 Version:           19.03.11
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        42e35e61f3
 Built:             Mon Jun  1 09:12:22 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.11
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       42e35e61f3
  Built:            Mon Jun  1 09:10:54 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Client:
 Debug Mode: false

Server:
 Containers: 8
  Running: 4
  Paused: 0
  Stopped: 4
 Images: 11
 Server Version: 19.03.11
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 122
  Dirperm1 Supported: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: klzq8ldc5e5ij8bt7kcogwigt
  Is Manager: true
  ClusterID: mv0nphvxo39miltqdlrnhi5qt
  Managers: 3
  Nodes: 3
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 172.31.198.182
  Manager Addresses:
   172.31.198.182:2377
   172.31.211.123:2377
   172.31.237.0:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-1021-aws
 Operating System: Ubuntu 20.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 953.4MiB
 Name: zen
 ID: VEHH:SXNS:RBTW:XTCS:ECAR:J6BZ:C53Y:WW7Z:I5GC:MK73:VCGW:JB5I
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
  provider=amazonec2
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false