moby: Was our cluster broken?
One of our swarm clusters apparently failed recently. I’ve pasted a summary of the logs below. Bottom line : services did not move between hosts, nodes could not go into drain mode, and a simple docker service logs [SERVICE_NAME]
would fail.
Was the cluster broken? Docker itself reported all nodes as reachable (and one leader). We ended up deleting the cluster and recreating it. Is this the only way to go? Could we have done anything else?
Oct 20 13:04:47 cphprodswarm1node-01 dockerd[30944]: 2017/10/20 13:04:47 grpc: Server.processUnaryRPC failed to write status stream error: code = 4 desc = "context deadline exceeded"
Oct 20 13:05:01 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:04:59.080100300+02:00" level=info msg="6c1948bceedf62d5 [term: 138] received a MsgHeartbeat message with higher term from 1165dcf1b976c8d3 [term: 140]" module=raft n$
Oct 20 13:05:05 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:05.013355200+02:00" level=info msg="6c1948bceedf62d5 became follower at term 140" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 20 13:05:09 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:08.157192700+02:00" level=info msg="raft.node: 6c1948bceedf62d5 changed leader from 6c1948bceedf62d5 to 1165dcf1b976c8d3 at term 140" module=raft node.id=dl0qsmsn$
Oct 20 13:05:13 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:12.955582900+02:00" level=warning msg="memberlist: failed to receive: read tcp 192.168.81.104:7946->192.168.81.106:52904: i/o timeout from=192.168.81.106:52904"
Oct 20 13:05:14 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:14.602878600+02:00" level=warning msg="memberlist: Push/Pull with cphprodswarm1node-03-dbadc84f9f54 failed: write tcp 192.168.81.104:55672->192.168.81.106:7946: i$
Oct 20 13:05:15 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:15.130800500+02:00" level=warning msg="memberlist: Failed to send ack: write tcp 192.168.81.104:7946->192.168.81.107:42268: i/o timeout from=192.168.81.107:42268"
Oct 20 13:05:21 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:21.005008900+02:00" level=warning msg="memberlist: Failed fallback ping: write tcp 192.168.81.104:50940->192.168.81.107:7946: i/o timeout"
Oct 20 13:05:21 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:21.005064500+02:00" level=info msg="memberlist: Suspect cphprodswarm1node-04-de254ed95721 has failed, no acks received"
Oct 20 13:05:22 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:22.185895800+02:00" level=error msg="failed to remove node" error="rpc error: code = 10 desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=dl0qs$
Oct 20 13:05:22 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:22.337908300+02:00" level=warning msg="memberlist: Refuting a suspect message (from: cphprodswarm1node-04-de254ed95721)"
Oct 20 13:05:22 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:05:21.986473200+02:00" level=error msg="dispatcher batch failed" error="raft: failed to process the request: node lost leader status" method="(*Dispatcher).processUp$
Quickly followed by a lot of these (for a couple of minutes)
Oct 20 13:06:41 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:06:41.126648300+02:00" level=warning msg="Peer operation failed:could not delete fdb entry into the sandbox: could not find the neighbor entry to delete op:&{2 k5byw$
Oct 20 13:06:41 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:06:41.155117700+02:00" level=warning msg="rmServiceBinding handleEpTableEvent utilities_logstash c665a2c2a25c2997224329a9bec7eee0aee765dd0e746693bdbe9fccce26de39 abo$
Then, a lot of these (for 12 hours or so)
Oct 20 13:07:09 cphprodswarm1node-01 kernel: [3033708.255168] aufs au_opts_verify:1597:dockerd[3829]: dirperm1 breaks the protection by the permission bits on the lower branch
Oct 20 13:07:09 cphprodswarm1node-01 dockerd[30944]: time="2017-10-20T13:07:09.753781700+02:00" level=error msg="fatal task error" error="starting container failed: Address already in use" module="node/agent/taskmanager" node.id=dl0qsmsnu$
Suddenly followed by
Oct 21 01:23:53 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:53.594562900+02:00" level=error msg="fatal task error" error="task: non-zero exit (1)" module="node/agent/taskmanager" node.id=dl0qsmsnuiwwrk5vwcde9i5jk service.i$
Oct 21 01:23:55 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:55.886511600+02:00" level=info msg="6c1948bceedf62d5 [term: 148] received a MsgVote message with higher term from 239615ad86d93d37 [term: 149]" module=raft node.i$
Oct 21 01:23:55 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:55.886954800+02:00" level=info msg="6c1948bceedf62d5 became follower at term 149" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:23:55 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:55.887176300+02:00" level=info msg="6c1948bceedf62d5 [logterm: 148, index: 225269, vote: 0] cast MsgVote for 239615ad86d93d37 [logterm: 148, index: 225269] at ter$
Oct 21 01:23:55 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:55.887386900+02:00" level=info msg="raft.node: 6c1948bceedf62d5 lost leader 67326aa72ce66aa0 at term 149" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:23:56 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:56.041822800+02:00" level=info msg="raft.node: 6c1948bceedf62d5 elected leader 239615ad86d93d37 at term 149" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:23:56 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:23:56.880694500+02:00" level=error msg="agent: session failed" error="rpc error: code = 1 desc = context canceled" module="node/agent" node.id=dl0qsmsnuiwwrk5vwcde9i$
20 minutes later, the log gets flooded with
Oct 21 01:39:28 cphprodswarm1node-01 dockerd[63999]: time=“2017-10-21T01:39:28.101515500+02:00” level=warning msg=“Deleting bridge mac mac 02:42:0b:01:08:0c failed, no such file or directory” Oct 21 01:39:28 cphprodswarm1node-01 dockerd[63999]: time=“2017-10-21T01:39:28.214081400+02:00” level=warning msg=“Deleting bridge mac mac 02:42:0b:01:08:05 failed, no such file or directory”
And the last meaningful entries in the log
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293496700+02:00" level=info msg="6c1948bceedf62d5 is starting a new election at term 175" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293549300+02:00" level=info msg="6c1948bceedf62d5 became candidate at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293564500+02:00" level=info msg="6c1948bceedf62d5 received MsgVoteResp from 6c1948bceedf62d5 at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293578800+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279] sent MsgVote request to 1165dcf1b976c8d3 at term 176" module=raft node.id=dl0qs$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293592100+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279] sent MsgVote request to 239615ad86d93d37 at term 176" module=raft node.id=dl0qs$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293603100+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279] sent MsgVote request to 25b366c96514af76 at term 176" module=raft node.id=dl0qs$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.293614100+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279] sent MsgVote request to 67326aa72ce66aa0 at term 176" module=raft node.id=dl0qs$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.299672700+02:00" level=info msg="6c1948bceedf62d5 received MsgVoteResp rejection from 25b366c96514af76 at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.299713300+02:00" level=info msg="6c1948bceedf62d5 [quorum:3] has received 1 MsgVoteResp votes and 1 vote rejections" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.299854800+02:00" level=info msg="6c1948bceedf62d5 received MsgVoteResp from 239615ad86d93d37 at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.299868800+02:00" level=info msg="6c1948bceedf62d5 [quorum:3] has received 2 MsgVoteResp votes and 1 vote rejections" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.300171300+02:00" level=info msg="6c1948bceedf62d5 received MsgVoteResp rejection from 1165dcf1b976c8d3 at term 176" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.300193600+02:00" level=info msg="6c1948bceedf62d5 [quorum:3] has received 2 MsgVoteResp votes and 2 vote rejections" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i$
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.423381700+02:00" level=warning msg="Deleting bridge mac mac 02:42:0b:01:08:05 failed, no such file or directory"
Oct 21 01:47:42 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:42.912212800+02:00" level=error msg="agent: session failed" error="rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few manager$
Oct 21 01:47:45 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:45.257173700+02:00" level=info msg="6c1948bceedf62d5 [term: 176] received a MsgVote message with higher term from 1165dcf1b976c8d3 [term: 177]" module=raft node.i$
Oct 21 01:47:45 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:45.257237500+02:00" level=info msg="6c1948bceedf62d5 became follower at term 177" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:45 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:45.257252300+02:00" level=info msg="6c1948bceedf62d5 [logterm: 170, index: 229279, vote: 0] cast MsgVote for 1165dcf1b976c8d3 [logterm: 170, index: 229280] at ter$
Oct 21 01:47:45 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:45.261751300+02:00" level=info msg="raft.node: 6c1948bceedf62d5 elected leader 1165dcf1b976c8d3 at term 177" module=raft node.id=dl0qsmsnuiwwrk5vwcde9i5jk
Oct 21 01:47:48 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:48.389482700+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 01:47:49 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T01:47:49.124746000+02:00" level=info msg="Node join event for cphprodswarm1node-05-35df577dce77/192.168.81.108"
And finally, the log got flooded with this final one
Oct 21 05:16:52 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:16:52.115078700+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:16:56 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:16:56.615764700+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:00 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:00.207821400+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:07 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:07.469317000+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:08 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:08.729252300+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:10 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:10.593714600+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:13 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:13.588298800+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
Oct 21 05:17:15 cphprodswarm1node-01 dockerd[63999]: time="2017-10-21T05:17:15.293011500+02:00" level=error msg="agent: session failed" error="rpc error: code = 10 desc = dispatcher is stopped" module="node/agent" node.id=dl0qsmsnuiwwrk5v$
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 18 (3 by maintainers)
We just had a complete cluster failure with similar symptoms to this and have found this open ticket. We had recently updated traefik and switched from host to routing mesh mode for the ingress posts only a few hours before, which may be completely coincidental.