moby: Containers on overlay network cannot reach other containers

Description

An overlay network intermittently ends up with specific nodes unreachable: containers on those nodes cannot reach containers on other nodes, and vice versa. This happens much more frequently on larger clusters, but we have seen it on smaller ones as well.

We are using Consul as our KV store.

Steps to reproduce the issue:

  1. Create a large overlay network spanning many nodes; I can trigger this ~80% of the time with 25 nodes (a minimal setup sketch follows this list).
  2. Launch containers on multiple nodes, which need to communicate. In our case, mongodb, kafka, elasticsearch, and others.
  3. Watch them fail.
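
For reference, a rough sketch of the setup in steps 1 and 2, assuming a daemon already configured with the external cluster store shown in daemon.json below; the network name, subnet, and image are placeholders rather than our exact commands:

# On any one node: create an overlay network backed by the external KV store (Consul).
docker network create -d overlay --subnet 10.50.48.0/22 app-overlay

# On each node: start a container attached to that overlay
# (a stand-in for the real mongodb/kafka/elasticsearch containers).
docker run -d --name mongodb-configsvr-host1 --network app-overlay mongo:3.4

# From inside any container, try to reach a container on another node by name.
docker exec -it mongodb-configsvr-host1 ping mongodb-configsvr-host22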

Describe the results you received: Containers on some nodes are unable to connect to containers on other nodes. They can connect to containers on the same node.

[exabeam@ip-10-10-2-39 ~]$ sudo docker exec -it mongodb-configsvr-host1 bash
(docker) [root@mongodb-configsvr-host1 /]# ping mongodb-configsvr-host22
PING mongodb-configsvr-host22 (10.50.48.23) 56(84) bytes of data.
From mongodb-configsvr-host1 (10.50.48.32) icmp_seq=1 Destination Host Unreachable
From mongodb-configsvr-host1 (10.50.48.32) icmp_seq=2 Destination Host Unreachable
From mongodb-configsvr-host1 (10.50.48.32) icmp_seq=3 Destination Host Unreachable
From mongodb-configsvr-host1 (10.50.48.32) icmp_seq=4 Destination Host Unreachable
^C
--- mongodb-configsvr-host22 ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4006ms
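
"Destination Host Unreachable" from inside the container suggests the overlay's ARP/FDB entries for the remote peer are missing on this host. One way to inspect that state is from the overlay's network namespace on the affected host; this is a sketch, and the namespace name under /var/run/docker/netns is illustrative (it varies per network):

# List overlay network namespaces on the affected host.
sudo ls /var/run/docker/netns

# Inspect the proxied ARP and VXLAN forwarding entries inside that namespace
# (vxlan0 is the interface name reported in the daemon logs below).
sudo nsenter --net=/var/run/docker/netns/1-dcf48923f1 ip neigh show
sudo nsenter --net=/var/run/docker/netns/1-dcf48923f1 bridge fdb show dev vxlan0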

Journal logs:

-- Logs begin at Fri 2017-12-15 00:32:45 UTC, end at Fri 2017-12-15 01:00:37 UTC. --
Dec 15 00:40:12 ip-10-10-2-39 systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has begun starting up.
Dec 15 00:40:12 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:12.125750601Z" level=info msg="libcontainerd: new containerd process, pid: 15804"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.168203097Z" level=warning msg="failed to rename /var/lib/docker/tmp for background deletion: rename /var/lib/docker/tmp /var/lib/docker/tmp-old: no such file or directory. Deleting synchronously"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.356143183Z" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man dockerd` to refer to dm.thinpooldev section."
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.415855355Z" level=info msg="devmapper: Creating filesystem xfs on device docker-202:1-352328870-base"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.551678692Z" level=info msg="devmapper: Successfully created filesystem xfs on device docker-202:1-352328870-base"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.674373157Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.674413916Z" level=info msg="Initializing discovery with TLS"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.722352220Z" level=info msg="Loading containers: start."
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.722438367Z" level=info msg="Option Initializing KV with TLS"
Dec 15 00:40:14 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:14.992434302Z" level=info msg="2017/12/15 00:40:14 [INFO] serf: EventMemberJoin: ip-10-10-2-39 10.10.2.39
Dec 15 00:40:14 ip-10-10-2-39 dockerd[15797]: "
Dec 15 00:40:14 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:14.999344260Z" level=error msg="joining serf neighbor 10.10.2.39 failed: Failed to join the cluster at neigh IP 10.10.2.127: 1 error(s) occurred:
Dec 15 00:40:14 ip-10-10-2-39 dockerd[15797]: * Failed to join 10.10.2.127: dial tcp 10.10.2.127:7946: getsockopt: connection refused"
Dec 15 00:40:15 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:15.000971589Z" level=info msg="2017/12/15 00:40:15 [INFO] serf: EventMemberJoin: ip-10-10-2-77 10.10.2.77
Dec 15 00:40:15 ip-10-10-2-39 dockerd[15797]: "
Dec 15 00:40:15 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:15.001047174Z" level=info msg="2017/12/15 00:40:15 [INFO] serf: EventMemberJoin: ip-10-10-2-122 10.10.2.122
Dec 15 00:40:15 ip-10-10-2-39 dockerd[15797]: "
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.351785574Z" level=info msg="Loading containers: done."
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.351834341Z" level=info msg="2017/12/15 00:40:15 [INFO] serf: EventMemberJoin: ip-10-10-2-186 10.10.2.186
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: "
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.364524778Z" level=info msg="Docker daemon" commit=19e2cf6 graphdriver(s)=devicemapper version=17.09.1-ce
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.364643045Z" level=info msg="Daemon has completed initialization"
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.372985688Z" level=info msg="2017/12/15 00:40:15 [INFO] serf: EventMemberJoin: ip-10-10-2-127 10.10.2.127
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: "
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.374887725Z" level=info msg="API listen on [::]:2376"
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal systemd[1]: Started Docker Application Container Engine.
-- Subject: Unit docker.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has finished starting up.
--
-- The start-up result is done.
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.374958646Z" level=info msg="API listen on /var/run/docker.sock"
Dec 15 00:43:40 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:40.581427400Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:41 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:41.581709777Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:42 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:42.582011510Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:43 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:43.582278440Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:44 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:44.582540380Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:45 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:45.582809896Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:46 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:46.584500194Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:47 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:47.586258805Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:47 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:47.586875346Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.26 02:42:0a:32:30:1a cardinality:2 db state:Set{{948b9200505f4f702d5c745b161c941537ff16d8735cced81ded356091209a1d 10.10.2.39 22 32 true}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:43:48 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:48.588299858Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:44:42 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:42.228925625Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:43 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:43.235079630Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:44 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:44.235249108Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:45 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:45.235502623Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:46 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:46.236353923Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:46 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:46.237015773Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.51 02:42:0a:32:30:33 cardinality:2 db state:Set{{a83296771927045209a3f4986b8d54819f861b70ae9357f813bbd6beffb4a350 10.10.2.39 22 32 true}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:44:47 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:47.238280566Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:45:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:21.813035003Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.51 02:42:0a:32:30:33 cardinality:1 db state:Set{{dummy 10.10.2.39 22 32 false}}"
Dec 15 00:45:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:21.813113462Z" level=warning msg="Neighbor entry already present for IP 10.50.48.51, mac 02:42:0a:32:30:33 neighbor:&{dstIP:[10 50 48 51] dstMac:[2 66 10 50 48 51] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:45:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:21.813139381Z" level=warning msg="Peer operation failed:could not add neighbor entry for nid:dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 eid:dummy into the sandbox:Search neighbor failed for IP 10.50.48.51, mac 02:42:0a:32:30:33, present in db:true op:&{2 dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 a83296771927045209a3f4986b8d54819f861b70ae9357f813bbd6beffb4a350 [10 50 48 51] [255 255 252 0] [2 66 10 50 48 51] [0 0 0 0 0 0 0 0 0 0 255 255 10 10 2 39] false false true Leave}"
Dec 15 00:45:24 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:24.555935217Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.26 02:42:0a:32:30:1a cardinality:1 db state:Set{{dummy 10.10.2.39 22 32 false}}"
Dec 15 00:45:29 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:29.061280932Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.26 02:42:0a:32:30:1a cardinality:2 db state:Set{{8384c22dd90f28304cbff4b50d59a22ea1e89cef573e3f68328d9238b35b3362 10.10.2.186 22 32 false}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:45:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:32.281175806Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.51 02:42:0a:32:30:33 cardinality:2 db state:Set{{16035419d19fdb10db6da6d90924f0c630ce279cc0fd7aa9003b3ce7df30e09a 10.10.2.122 22 32 false}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:45:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:32.281235119Z" level=warning msg="Neighbor entry already present for IP 10.50.48.51, mac 02:42:0a:32:30:33 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 50 48 51] dstMac:[2 66 10 50 48 51] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:45:36 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:36.037319650Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:37 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:37.037589884Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:38 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:38.037839646Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:39 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:39.038093412Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:40 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:40.038438277Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:41 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:41.038748530Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:42 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:42.039029344Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:43 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:43.040062307Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:44 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:44.040738594Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:45 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:45.041500982Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:45:46 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:46.041736518Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:47 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:47.041944529Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:48 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:48.042211415Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:49 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:49.042602371Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:45:50 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:50.042877823Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:51 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:51.043060362Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:52 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:52.043401234Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:53 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:53.043678780Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:45:54 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:54.044024680Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:55 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:55.044331507Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:45:56 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:56.044669511Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:45:57 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:57.044938297Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:45:58 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:58.045236075Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:59 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:59.045511083Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:46:00 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:00.045775091Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:01 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:01.046082853Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:02 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:02.046398348Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:46:03 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:03.046680381Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:46:04 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:04.046902046Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:05 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:05.047049988Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:05 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:05.047751298Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.32 02:42:0a:32:30:20 cardinality:2 db state:Set{{6e13374231d5792b3b16d8c551c9567b37b8e61744feb5f495fba7b66b368b6e 10.10.2.39 22 32 true}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:46:06 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:06.047943322Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:07 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:07.048273526Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:08 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:08.048642394Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:09 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:09.050425793Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:10 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:10.053972757Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:11 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:11.054348470Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:12 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:12.054840861Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:13 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:13.061006391Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:14 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:14.061270824Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:15.061522089Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:16 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:16.061839113Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:17 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:17.062076822Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:18 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:18.062385094Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:19 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:19.062696554Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:20 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:20.064790638Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.067539938Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068683195Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.32 02:42:0a:32:30:20 cardinality:2 db state:Set{{dummy 10.10.2.39 22 32 false}, {6e13374231d5792b3b16d8c551c9567b37b8e61744feb5f495fba7b66b368b6e 10.10.2.39 22 32 true}}"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068714041Z" level=warning msg="Entry already present in db: nid:dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 eid:dummy peerIP:10.50.48.32 peerMac:02:42:0a:32:30:20 isLocal:false vtep:10.10.2.39"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068752275Z" level=warning msg="Neighbor entry already present for IP 10.50.48.32, mac 02:42:0a:32:30:20 neighbor:&{dstIP:[10 50 48 32] dstMac:[2 66 10 50 48 32] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:true"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068868897Z" level=warning msg="Neighbor entry already present for IP 10.10.2.39, mac 02:42:0a:32:30:20 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 10 2 39] dstMac:[2 66 10 50 48 32] linkName:vx-000100-dcf48 linkDst:vxlan0 family:7} forceUpdate:false"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068903013Z" level=warning msg="Peer operation failed:could not add fdb entry for nid:dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 eid:dummy into the sandbox:Search neighbor failed for IP 10.10.2.39, mac 02:42:0a:32:30:20, present in db:true op:&{1 dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 dummy [10 50 48 32] [255 255 252 0] [2 66 10 50 48 32] [0 0 0 0 0 0 0 0 0 0 255 255 10 10 2 39] false true false watchMiss}"
Dec 15 00:46:22 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:22.070154351Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:22 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:22.070956149Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.72 02:42:0a:32:30:48 cardinality:2 db state:Set{{1a3ef204892e76e1cbfcca7576f28bbfba022cd1e49c6e897a62e2a0738911a3 10.10.2.39 22 32 true}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:46:23 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:23.086049272Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:24 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:24.086365853Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:25 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:25.086694403Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:26 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:26.087032562Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:27 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:27.090327610Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:28 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:28.100048345Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:29 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:29.112959868Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:47:23 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:23.283265485Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:2 db state:Set{{9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}, {76c8bf940ed1f0422345bcb9d37c8db2067a48161c8b6ff6689038cd6c05ebcb 10.10.2.127 22 32 false}}"
Dec 15 00:47:23 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:23.283330673Z" level=warning msg="Neighbor entry already present for IP 10.50.48.80, mac 02:42:0a:32:30:50 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 50 48 80] dstMac:[2 66 10 50 48 80] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:47:25 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:25.677710734Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:1 db state:Set{{9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}}"
Dec 15 00:47:28 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:28.854566986Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:2 db state:Set{{9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}, {78c2e91119b59a60216c8aea38319bbf7d70ca5b7288f427dfc4cd319cfc8fe1 10.10.2.186 22 32 false}}"
Dec 15 00:47:28 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:28.854643458Z" level=warning msg="Neighbor entry already present for IP 10.50.48.80, mac 02:42:0a:32:30:50 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 50 48 80] dstMac:[2 66 10 50 48 80] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:47:30 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:30.860523220Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:1 db state:Set{{9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}}"
Dec 15 00:47:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:32.068871701Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:2 db state:Set{{fe6cefb9b974800e2917a350b6357aceed271a6dbbbf4b4da45d380196180380 10.10.2.127 22 32 false}, {9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}}"
Dec 15 00:47:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:32.068921895Z" level=warning msg="Neighbor entry already present for IP 10.50.48.80, mac 02:42:0a:32:30:50 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 50 48 80] dstMac:[2 66 10 50 48 80] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:47:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:32.494377690Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:1 db state:Set{{fe6cefb9b974800e2917a350b6357aceed271a6dbbbf4b4da45d380196180380 10.10.2.127 22 32 false}}"
Dec 15 00:55:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:55:32.937897501Z" level=warning msg="2017/12/15 00:55:32 [WARN] serf: received old event jl 10.10.2.77 dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 92afa9f48852c73e1d8c41b904e2eed40e561ac9ed51966d1d7062397548118a from time 373 (current: 894)
Dec 15 00:55:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: "
Dec 15 00:56:14 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:56:14.118302987Z" level=error msg="Error running exec in container: oci runtime error: exec failed: container_linux.go:265: starting container process caused \"process_linux.go:84: executing setns process caused \\"exit status 15\\"\"
Dec 15 00:56:14 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: "

Describe the results you expected: Containers on all hosts should be able to connect to each other.

Additional information you deem important (e.g. issue happens only occasionally): The issue occurs intermittently/at random, and is much more frequent on larger clusters.

Output of docker version:

Tested with 17.05, 17.09.1, and 17.11.0. I don’t currently have a 17.05 host running to capture its docker version output.

Client:
 Version:      17.09.1-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   19e2cf6
 Built:        Thu Dec  7 22:23:40 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.1-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   19e2cf6
 Built:        Thu Dec  7 22:25:03 2017
 OS/Arch:      linux/amd64
 Experimental: false

Client:
 Version:      17.11.0-ce
 API version:  1.34
 Go version:   go1.8.3
 Git commit:   1caf76c
 Built:        Mon Nov 20 18:35:47 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.11.0-ce
 API version:  1.34 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   1caf76c
 Built:        Mon Nov 20 18:45:06 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 6
 Running: 5
 Paused: 0
 Stopped: 1
Images: 5
Server Version: 17.09.1-ce
Storage Driver: devicemapper
 Pool Name: docker-202:1-352328870-pool
 Pool Blocksize: 65.54kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 1.729GB
 Data Space Total: 107.4GB
 Data Space Available: 105.6GB
 Metadata Space Used: 3.641MB
 Metadata Space Total: 2.147GB
 Metadata Space Available: 2.144GB
 Thin Pool Minimum Free Space: 10.74GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: none
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.2.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 29.79GiB
Name: ip-10-10-2-39.us-west-2.compute.internal
ID: 73EN:WWPI:2QJ4:2GPV:4I4Z:BHMS:EZBX:BF75:HAJ7:M7CO:LLSA:FYJW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Cluster Store: consul://10.10.2.39:8080
Cluster Advertise: 10.10.2.39:2376
Insecure Registries:
 docker-exabeam.repo.exabeam.com
 127.0.0.0/8
Live Restore Enabled: true

WARNING: devicemapper: usage of loopback devices is strongly discouraged for production use.
         Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.

Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 17.11.0-ce
Storage Driver: devicemapper
 Pool Name: docker-202:1-486539717-pool
 Pool Blocksize: 65.54kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Udev Sync Supported: true
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Data Space Used: 1.23GB
 Data Space Total: 107.4GB
 Data Space Available: 106.1GB
 Metadata Space Used: 2.036MB
 Metadata Space Total: 2.147GB
 Metadata Space Available: 2.145GB
 Thin Pool Minimum Free Space: 10.74GB
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: none
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 992280e8e265f491f7a624ab82f3e238be086e49
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.2.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 29.79GiB
Name: ip-10-10-2-23.us-west-2.compute.internal
ID: QTRG:VGMN:HOW5:4IJ6:RHTG:3GNF:SVP6:OF53:KI4M:H647:CV64:45VR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Cluster Store: consul://10.10.2.23:8080
Cluster Advertise: 10.10.2.23:2376
Insecure Registries:
 docker-exabeam.repo.exabeam.com
 127.0.0.0/8
Live Restore Enabled: true

WARNING: devicemapper: usage of loopback devices is strongly discouraged for production use.
         Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.

Additional environment details (AWS, VirtualBox, physical, etc.): AWS and physical machines, running CentOS 7. Tested with Consul 0.7.2 and 1.0.1. Restarting the Docker daemon on an unreachable host sometimes fixes it for that host (see the sketch below).
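
The restart workaround itself is nothing more than the sketch below; with live-restore enabled (see daemon.json), running containers should stay up across the daemon restart:

# On an affected host:
sudo systemctl restart docker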

daemon.json

{
    "bip": "172.17.0.1/16",
    "cluster-advertise": "10.10.2.39:2376",
    "cluster-store": "consul://10.10.2.39:8080",
    "cluster-store-opts": {
          "kv.cacertfile": "/etc/docker/ssl/exa-ca.pem",
          "kv.certfile": "/etc/docker/ssl/docker-cert.pem",
          "kv.keyfile": "/etc/docker/ssl/docker-key.pem"
    },
    "disable-legacy-registry": true,
    "dns": [],
    "hosts": [
        "unix:///var/run/docker.sock",
        "tcp://0.0.0.0:2376"
    ],
    "insecure-registries": [
        "docker-exabeam.repo.exabeam.com"
    ],
    "live-restore": true,
    "log-driver": "none",
    "shutdown-timeout": 90,
    "storage-driver": "devicemapper",
    "tlsverify": true,
    "tlscacert": "/etc/docker/ssl/exa-ca.pem",
    "tlscert": "/etc/docker/ssl/docker-cert.pem",
    "tlskey": "/etc/docker/ssl/docker-key.pem"
}
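
As a quick sanity check that the daemon actually picked up these cluster-store settings, docker info can be grepped on each node (this matches the docker info output shown above):

docker info | grep -i cluster
# Expected:
#   Cluster Store: consul://10.10.2.39:8080
#   Cluster Advertise: 10.10.2.39:2376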

consul config file:

{
 "data_dir":"/var/lib/consul",
 "enable_syslog":false,
 "server":true,
 "node_name":"host1",
 "advertise_addr":"10.10.2.39",
 "bind_addr":"10.10.2.39",
 "bootstrap_expect": 5,
 "retry_join": ["10.10.2.39","10.10.2.43","10.10.2.52","10.10.2.58","10.10.2.69"],
 "client_addr": "0.0.0.0",
 "verify_outgoing": true,
 "verify_incoming": true,
 "verify_server_hostname": false,
 "encrypt": "BuZSrsHSx1MCQ9j1kFuQmA==",
 "key_file": "/opt/consul/etc/consul-key.pem",
 "cert_file": "/opt/consul/etc/consul-cert.pem",
 "ca_file": "/opt/consul/etc/exa-ca.pem",
 "ports": {
  "https": 8080,
  "http": -1
 }
}
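
Since the HTTP port is disabled and only HTTPS is exposed on 8080 (with verify_incoming requiring a client certificate), reachability of the KV store from a Docker node can be checked with curl using the same certificates the daemon uses. This is a sketch: it assumes docker-cert.pem is signed by the CA that Consul trusts, and the docker/nodes key prefix is what the daemon's discovery is generally expected to write, so treat it as an assumption:

# Confirm the TLS handshake works and that a Consul leader has been elected.
curl --cacert /etc/docker/ssl/exa-ca.pem \
     --cert /etc/docker/ssl/docker-cert.pem \
     --key /etc/docker/ssl/docker-key.pem \
     https://10.10.2.39:8080/v1/status/leader

# List the discovery keys the daemons have registered (prefix is an assumption).
curl --cacert /etc/docker/ssl/exa-ca.pem \
     --cert /etc/docker/ssl/docker-cert.pem \
     --key /etc/docker/ssl/docker-key.pem \
     "https://10.10.2.39:8080/v1/kv/docker/nodes?keys"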

Possibly related to https://github.com/moby/moby/issues/32841 , https://github.com/moby/moby/issues/32195 , or https://github.com/moby/moby/issues/33721 , but I’m not using swarm, so I figured it warranted its own ticket.

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 2
  • Comments: 22 (2 by maintainers)

Most upvoted comments

I also hit this from time to time, with no exact way to reproduce. We have 14 bare-metal hosts running 18.06, with one overlay network created across them. We are not using Swarm; we use the overlay manually.

Description

I start a container on that overlay on host 10.2.1.67, with its IP auto-assigned by Docker. Yet it cannot ping any other container on the same overlay, nor can any other container ping it.

The syslog on ALL of the other hosts says:

Jan 14 10:50:37 lb1 dockerd[22247]: time="2019-01-14T10:50:37.636470476+08:00" level=warning msg="peerDbAdd transient condition - Key:10.9.228.33 02:42:0a:09:e4:21 cardinality:2 db state:Set{{dummy 10.2.0.64 21 32 false}, {4650024a28f5f2adeff62f386fa32c0b853c165d5deb7bd147f1f58d1bf5f0ba 10.2.1.67 21 32 false}}"
Jan 14 10:50:37 lb1 dockerd[22247]: time="2019-01-14T10:50:37.636543836+08:00" level=warning msg="Neighbor entry already present for IP 10.9.228.33, mac 02:42:0a:09:e4:21 neighbor:&{dstIP:[10 9 228 33] dstMac:[2 66 10 9 228 33] linkName:vx-000100-f785e linkDst:vxlan0 family:0} forceUpdate:false"

I believe cardinality:2 means that a previous container with the same virtual IP 10.9.228.33 was not deleted correctly when the previous host 10.2.0.64 shut it down, so when I create a new container on the new host 10.2.1.67, it sees a duplicate entry in /var/lib/docker/network/files/local-kv.db.

docker network inspect shows the new container's MAC and virtual IP. My online KV store is Consul; I have also pulled all the KV data from it, and it says the correct host 10.2.1.67 owns the 10.9.228.33 container.

If I stop the Docker daemon, remove /var/lib/docker/network/files/local-kv.db, and restart it, everything works as expected, but only on the host where I restarted the daemon (a sketch of this follows). If I did the same on all 14 hosts I expect it would fix the problem, but doing that every time we hit this bug is not really feasible.
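
For anyone else hitting this, the per-host workaround described above boils down to the following (it only repairs the host it is run on):

sudo systemctl stop docker
sudo rm /var/lib/docker/network/files/local-kv.db
sudo systemctl start docker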

Suggestion

Maybe the old host didn't broadcast a peerDelete for some reason? The log says forceUpdate:false.

Is it possible to make the new container announcement use forceUpdate:true, so that it would still work even if a peerDelete was missed for any reason?