moby: Swarm is having occasional network connection problems between nodes.

A few times a day I have connection issues between nodes, and clients see an occasional “Bad request” error. My swarm setup (AWS) has the following services: nginx (global) and web (replicated=2), plus a separate overlay network. In nginx.conf I use proxy_pass http://web:5000 to route requests to the web service. Both services are running, are marked as healthy, and have not been restarted while these errors occur. The manager is a separate node (30sec-manager1).
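Since nginx reaches the web service through Swarm’s embedded DNS and (by default) a virtual IP, a quick first check is to see what “web” actually resolves to from inside an nginx task. A minimal sketch, assuming the service names above and that the image has nslookup available (otherwise getent ahosts works):

NGINX_CTR=$(docker ps --filter "name=nginx" --format '{{.ID}}' | head -n1)
docker exec "$NGINX_CTR" nslookup web        # the service VIP that proxy_pass connects to
docker exec "$NGINX_CTR" nslookup tasks.web  # the individual task IPs behind it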

A few times a day, for a few requests, I receive errors saying nginx couldn’t connect to the upstream, and the 10.0.0.6 IP address is always the one mentioned:

Here are the related nginx and Docker logs. Both web replicas run on the 30sec-worker3 and 30sec-worker4 nodes.

Nginx log:
----------
2017/03/29 07:13:18 [error] 7#7: *44944 connect() failed (113: Host is unreachable) while connecting to upstream, client: 104.154.58.95, server: 30seconds.com, request: "GET / HTTP/1.1", upstream: "http://10.0.0.6:5000/", host: "30seconds.com"

Around same time from docker logs (journalctl -u docker.service)

on node 30sec-manager1:
---------------------------
Mar 29 07:12:50 30sec-manager1 docker[30365]: time="2017-03-29T07:12:50.736935344Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54.659229055Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:10 30sec-manager1 docker[30365]: time="2017-03-29T07:13:10.302960985Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:11 30sec-manager1 docker[30365]: time="2017-03-29T07:13:11.055187819Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:17 30sec-manager1 docker[30365]: time="2017-03-29T07:13:17Z" level=info msg="Firewalld running: false"

on node 30sec-worker3:
-------------------------
Mar 29 07:12:50 30sec-worker3 docker[30362]: time="2017-03-29T07:12:50.613402284Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:12:55 30sec-worker3 docker[30362]: time="2017-03-29T07:12:55.614174704Z" level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:09 30sec-worker3 docker[30362]: time="2017-03-29T07:13:09.613368306Z" level=info msg="memberlist: Suspect 30sec-worker4-4ca6b1dcaa42 has failed, no acks received"
Mar 29 07:13:10 30sec-worker3 docker[30362]: time="2017-03-29T07:13:10.613972658Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:13:11 30sec-worker3 docker[30362]: time="2017-03-29T07:13:11.042788976Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:14 30sec-worker3 docker[30362]: time="2017-03-29T07:13:14.613951134Z" level=info msg="memberlist: Marking 30sec-worker4-4ca6b1dcaa42 as failed, suspect timeout reached"
Mar 29 07:13:25 30sec-worker3 docker[30362]: time="2017-03-29T07:13:25.615128313Z" level=error msg="Bulk sync to node 30sec-manager1-b1cbc10665cc timed out"

on node 30sec-worker4:
-------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"

syslog on 30sec-worker4:
--------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.048975] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.100691] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.130069] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.155859] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.180461] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.205707] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.230326] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.255597] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"]
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"]

I checked other cases when nginx can’t reach the upstream, and each time these three kinds of lines appear around that moment in the Docker logs:

level=info msg="memberlist:Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker3-054c94d39b58)

Searching other issues, I found these with similar errors, so they may be related: https://github.com/docker/docker/issues/28843 https://github.com/docker/docker/issues/25325

Is there anything I should check or debug further to spot the problem, or is it a bug? Thank you.

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 18
 Running: 3
 Paused: 0
 Stopped: 15
Images: 16
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 83
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: ck99cyhgydt8y1zn8ik2xmcdv
 Is Manager: true
 ClusterID: in0q54eh74ljazrprt0vza3wj
 Managers: 1
 Nodes: 5
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.31.31.146
 Manager Addresses:
  172.31.31.146:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 990.6 MiB
Name: 30sec-manager1
ID: 5IIF:RONB:Y27Q:5MKX:ENEE:HZWM:XYBV:O6KN:BKL6:AEUK:2VKB:MO5P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 provider=amazonec2
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.): Amazon AWS (Manager - t2.micro, rest of nodes - t2.small)

docker-compose.yml (there are more services and nodes in the setup, but I have posted only the ones involved)

version: "3"

services:

  nginx:
    image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/nginx:latest
    ports:
      - 80:80
      - 81:81
    networks:
      - thirtysec
    depends_on:
      - web
    deploy:
      mode: global
      update_config:
        delay: 2s
        monitor: 2s

  web:
    image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/os:latest
    command: sh -c "python manage.py collectstatic --noinput && daphne thirtysec.asgi:channel_layer -b 0.0.0.0 -p 5000"
    ports:
      - 5000:5000
    networks:
      - thirtysec
    deploy:
      mode: replicated
      replicas: 2
      labels: [APP=THIRTYSEC]
      update_config:
        delay: 15s
        monitor: 15s
      placement:
        constraints: [node.labels.aws_type == t2.small]

    healthcheck:
      test: goss -g deploy/swarm/checks/web-goss.yaml validate
      interval: 2s
      timeout: 3s
      retries: 15

networks:
    thirtysec:

web-goss.yaml

port:
  tcp:5000:
    listening: true
    ip:
    - 0.0.0.0
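Since the healthcheck stays green while the upstream errors happen, it can be worth running the same goss check by hand inside a web task at the time of an incident; a hedged sketch (looking the container up by name is an assumption):

WEB_CTR=$(docker ps --filter "name=web" --format '{{.ID}}' | head -n1)
docker exec "$WEB_CTR" goss -g deploy/swarm/checks/web-goss.yaml validate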

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 36
  • Comments: 250 (68 by maintainers)

Most upvoted comments

We have this problem with swarm too, and it also got to the point that we had to remove swarm from production. As I said before, we have 4 nodes, 3 of which are managers. Containers are running evenly over all four. We have about 15 containers per node, and the services need to connect to each other regardless of where they happen to be running. Usually webapp and redis, but we also have load balancers on the manager nodes.

We have used tcpdump in different network namespaces to follow pings and HTTP requests. We saw that the ping was received in the destination container’s network namespace, but the container could not send the reply back. What we noticed was that it tried to send out ARP requests for the source container’s IP but did not get any response. After some more digging we found that, in the network namespace for the Docker overlay network on the node with the destination container, the fdb entry is different from one that is working.

$ sudo nsenter --net=/var/run/docker/netns/1-s7xhhuwbex bridge fdb show dev vxlan1 | grep "02:42:ac:14:00:8c"
02:42:ac:14:00:8c master br0

and an example of one that is working:

02:42:ac:14:00:3f dst 10.135.24.176 link-netnsid 0 self permanent

The working one points the fdb entry to the right node’s IP. We think this shows that the Docker network on the node with the destination container thinks the source container is on the same node; therefore it does not get an answer to the ARP request or any other traffic. Somehow the creation/deletion of VXLANs and MACs creates this problem. This happens both during high load and not (but more often during high load).
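To double-check which node a given MAC should point to, the MAC can be mapped back to the task that owns it; a rough sketch (the service name is illustrative, and the fdb line shown is only the expected shape, not output from this cluster):

docker service ps web                # the NODE column shows where each task runs
docker inspect -f '{{range .NetworkSettings.Networks}}{{.MacAddress}} {{end}}' <container-id>
# a healthy remote fdb entry for that MAC should look like:
#   <mac> dev vxlanX dst <that node's IP> ... self permanent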

This bug seems to affect a lot of people, and there are also other issues that can maybe be linked to this bug. https://github.com/moby/moby/issues/32841

We think this issue should be seen as a showstopper and be prioritized accordingly. We want to get back to the “Docker Swarm is awesome” feeling and use it in production again.

We’re in production and facing this too. Had to pull all the critical services out of swarm, and was thinking about ways to migrate to something else, because it’s really hurting us.

Please P1 this!

Run “sudo service docker restart” on the host whose service container can’t be pinged from the good ones, and the problem is solved. Maybe only good for a while, until you create new services or update existing ones.

@kleptog the issue you describe should be taken care of by https://github.com/docker/libnetwork/pull/1935

@adityacs how are your tests ?

Just adding my two cents here since I’ve been monitoring this thread for a few weeks now.

We started to see these issues a few weeks ago on our production sites and weren’t able to fix them, so we had to roll back our services to a previous environment. Once we discovered this thread, we upgraded our test environments to RC3 when it came out and things seemed to work fine. In fact, we have a replay machine that replays everything that happens in our production environment, and we left it running continuously in our test environments.

Things were fine for almost a week, but then we started to see the connectivity issues again. Whenever there were problems, I’d SSH into the nodes, check the syslogs, and find the msg="Neighbor entry already present for IP 10.1.10.67, mac 02:42:0a:00:01:03" messages. If I don’t do anything, things eventually start to work again, but in my experience that can take up to 40 minutes.

To me, it seems that RC3 helped but didn’t take away the problem unfortunately. We are desperately looking for a solution here since we are growing and we don’t want to keep using our old infrastructure for running our operations.

FWIW We are running our services on Digital Ocean.

@DBLaci @svscorp the possible root cause for Unable to complete atomic operation, key modified has been identified and fix is being tested. You can expect a patch in 17.11

@sulphur There are many fixes going into 17.06 for these network issues.

@mavenugo @thaJeztah I’m monitoring this issue closely. I have to make a recommendation to higher management whether to adopt Swarm Mode or not. Currently we are just using standalone Swarm “Classic”, but it looks like we might need to wait and see. For our enterprise customers, sporadically dropping network connections is not an option.

@vieux Maybe we should keep this open until we get confirmation? It’s a pretty serious issue.

Hoi, we’ve also been having similar issues with some services not being able to see certain other services. After much debugging I finally tracked it down to the “bridge fdb” being wrong. I have written a script which can be used on a Docker swarm 17.03+ to verify whether all the FDBs on that host are actually correct w.r.t. the Swarm state. https://gist.github.com/kleptog/9a5aa56e8d2532032b6a7b32bf7cc3aa

It converts the mac addresses in the tables back to services and verifies that the traffic is being sent to the correct host. As it turns out our swarm has various inconsistencies. For example (irrelevant info snipped) (Docker 17.06.1):

=== Network r96o6pc2d1aqw1yd7f55zej03 vlan 4100
--- check /var/run/docker/netns/2-r96o6pc2d1
02:42:0a:00:03:0f -> 172.16.102.101 (no service for mac)
02:42:0a:00:03:11 -> 172.16.102.101 (no service for mac)
02:42:0a:00:03:12 -> local (veth0) = srv nam5jg872ix2nrfy2hrkk6m03 -> node kqfdomihm5sfwkypjh7doyxw8
02:42:0a:00:03:12 -> 172.16.102.101 = srv nam5jg872ix2nrfy2hrkk6m03  -> node kqfdomihm5sfwkypjh7doyxw8
^^^ WARN Remote reference to self?

--- orig fdb
02:42:0a:00:03:0f dev vxlan0 dst 172.16.102.101 self permanent
02:42:0a:00:03:11 dev vxlan0 dst 172.16.102.101 self permanent
02:42:0a:00:03:12 dev veth0 vlan 0 master br0 
02:42:0a:00:03:12 dev vxlan0 dst 172.16.102.101 self permanent

Here we see the fdb retains references to MAC addresses/services which no longer exist in the swarm. One MAC address is duplicated, but the address it is being sent to (172.16.102.101) is the IP of the node itself. It doesn’t seem to harm anything, but it looks wrong.

More problematic is when the entries are actually wrong, like in this example (now Docker 17.03.2):

=== Network lbxh1rhp51puzn4olhmjptofv vlan 4109
--- check /var/run/docker/netns/1-lbxh1rhp51
02:42:0a:00:0c:03 -> 10.54.54.104 = srv 6r4rqeoki0gtzfnbt6dkwhv9x -> node kj1g5m9yupokowyyj7ecneyz6 (dc2-worker4)
02:42:0a:00:0c:05 -> local (veth2) = srv tx4977yl3fwnm0fli4qvkyr5q -> node l7z7c2gy0fqgmh9zhsv8yxfmd (dc2-worker1)
02:42:0a:00:0c:05 -> 10.54.54.103 = srv tx4977yl3fwnm0fli4qvkyr5q  -> node l7z7c2gy0fqgmh9zhsv8yxfmd (dc2-worker1)
^^^ ERROR local != 10.54.54.103
02:42:0a:00:0c:07 -> 10.54.54.104 = srv eungwd9c3jo2w3e3x4yh2tsnu -> node kj1g5m9yupokowyyj7ecneyz6 (dc2-worker4)
02:42:0a:00:0c:09 -> 10.54.54.104 = srv 57atej7vjivl60a51t71wz3e6 -> node kj1g5m9yupokowyyj7ecneyz6 (dc2-worker4)
02:42:0a:00:0c:0b -> local (veth3) = srv id83n5qhup9vspg5v31lekqkw  -> node l7z7c2gy0fqgmh9zhsv8yxfmd (dc2-worker1)
02:42:0a:00:0c:0b -> 10.54.54.103 = srv id83n5qhup9vspg5v31lekqkw -> node l7z7c2gy0fqgmh9zhsv8yxfmd (dc2-worker1)
^^^ ERROR local != 10.54.54.103
02:42:0a:00:0c:0d -> 10.54.54.104 = srv vsh35l8ecfuy9myxjsfm8lmy6  -> node kj1g5m9yupokowyyj7ecneyz6 (dc2-worker4)
02:42:0a:00:0c:0d -> br0 ???

--- orig fdb
02:42:0a:00:0c:03 dev vxlan1 dst 10.54.54.104 link-netnsid 0 self permanent
02:42:0a:00:0c:05 dev veth2 master br0 
02:42:0a:00:0c:05 dev vxlan1 dst 10.54.54.103 link-netnsid 0 self permanent
02:42:0a:00:0c:07 dev vxlan1 dst 10.54.54.104 link-netnsid 0 self permanent
02:42:0a:00:0c:09 dev vxlan1 dst 10.54.54.104 link-netnsid 0 self permanent
02:42:0a:00:0c:0b dev veth3 master br0 
02:42:0a:00:0c:0b dev vxlan1 dst 10.54.54.103 link-netnsid 0 self permanent
02:42:0a:00:0c:0d dev vxlan1 dst 10.54.54.104 link-netnsid 0 self permanent
02:42:0a:00:0c:0d dev vxlan1 master br0 

Here we see that two MAC addresses are duplicated, but now the second entry refers to an IP address of another host.

We’re currently working with this script so we can track down the exact moment when things go wrong. I hope the script is useful to others trying to debug their own swarms.

Hi guys.

I had the same problem. The cause was the virtual IP address (VIP), which is enabled by default. I turned off the VIP everywhere and the network connection problems were resolved.

To disable the virtual IP address, you must start the services with the --endpoint-mode dnsrr option. For example:

$ docker service create \
  --replicas 3 \
  --name my-dnsrr-service \
  --network my-network \
  --endpoint-mode dnsrr \
  nginx

If you can’t use this option when starting your services (for example, if you start them with a docker-stack.yml), then you can access your services by the name tasks.my-service (not my-service). In that case the virtual IP address will not be used.

From the documentation:

$ nslookup my-service

Server:    127.0.0.11
Address 1: 127.0.0.11

Name:      my-service
Address 1: 10.0.9.2 ip-10-0-9-2.us-west-2.compute.internal

10.0.9.2 is the virtual IP address.

and:

$ nslookup tasks.my-service

Server:    127.0.0.11
Address 1: 127.0.0.11

Name:      tasks.my-service
Address 1: 10.0.9.4 my-web.2.6b3q2qbjveo4zauc6xig7au10.my-network
Address 2: 10.0.9.3 my-web.1.63s86gf6a0ms34mvboniev7bs.my-network
Address 3: 10.0.9.5 my-web.3.66u2hcrz0miqpc8h0y0f3v7aw.my-network

10.0.9.4, 10.0.9.3 and 10.0.9.5 are the “real” IP addresses of the containers running your services.

For more information read this https://docs.docker.com/engine/swarm/networking/#use-dns-round-robin-for-a-service
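To confirm which mode a service actually ended up with, the endpoint spec can be read back from the service definition; a hedged sketch (the service name is illustrative):

docker service inspect --format '{{.Spec.EndpointSpec.Mode}}' my-dnsrr-service   # prints "dnsrr" or "vip"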

@chris-fung It comes back on its own; it happens occasionally for 1-20 requests in a row. The problem is these small interruptions, when clients actually see a Bad request error in production. I can’t catch or predict when it happens. I just receive an error from the Papertrail nginx logs saying the upstream was unreachable, but the rest of the time everything works. (screenshot)

Thanks @fcrisciani. I’ll test and report back

@sharand sure.

  1. Do a docker network ls and identify the id of the network (1st column)
  2. Run nsenter --net=/var/run/docker/netns/1-<network id> (use tab completion for the last part)
  3. After that you will be inside the network namespace of the overlay network; there you should run bridge fdb show, which will show something like:
[...]
02:42:0a:00:00:0c dev veth1 master br0 
02:42:0a:00:00:12 dev vxlan0 master br0 
02:42:0a:00:00:12 dev vxlan0 dst 172.31.23.47 link-netnsid 0 self permanent
[...]
  4. Now comes the checking part. Entries like 02:42:0a:00:00:0c dev veth1 master br0, where the interface is a vethX, are the local containers, so you should be able to do a docker inspect <container id> and find the same mac address. The entries that have vxlan0 instead are remote containers, so you should check that the dst <IP> actually matches the destination node where the container is running.

The issue that we were tracking so far had missing remote entries, or mac addresses of containers that were local having entries pointing to other nodes, i.e. entries with vxlan0 as the interface.

My suggestion is to identify one pair of containers for which you are seeing the connectivity issue and validate, following the above steps, whether that is the failure signature. If so, we should have a fix in the next 17.10 RC; if that is not the case and the fdb entries are pointing to the correct endpoints, then we should gather further information and move forward from there.
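A condensed sketch of the same check (the “<n>-<truncated network id>” namespace naming is taken from examples earlier in this thread; exact names will differ):

NET_ID=$(docker network ls --filter name=<overlay-name> --format '{{.ID}}')
NS=$(ls /var/run/docker/netns/ | grep "${NET_ID:0:10}" | head -n1)
nsenter --net=/var/run/docker/netns/"$NS" bridge fdb show
# vethX entries  -> local containers: MAC should match docker inspect of a local container
# vxlan0 entries -> remote containers: "dst <IP>" should be the node reported by docker service ps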

Same problem for us on Azure or Openstack environment (Ubuntu 16.04, Docker version 17.07.0-ce, build 8784753).

I found a lot of kernel: IPVS: __ip_vs_del_service: enter messages in the log when the problem occurs.

This issue started to be very critical for us.

We use AWS ECS in production, and I have been running a docker swarm on bare metal, ubuntu 16.04.3 and docker 17.06.2-ce. We have been running the swarm since docker 1.11 I believe? We moved some less important production apps into swarm about 4 months ago. We have had this issue the entire time, and the issue used to be that the services would just not be reachable after a period, and restarting docker, etc would resolve, for a while. Now, we see requests that should take 400 ms, that sometimes take 10-30s. The biggest culprit seems to be redis running on our manager nodes, with apps running on worker nodes.

We have been using a VERY stable workaround for 3 months or so. On 2 of the manager nodes, we also run haproxy. All of our external traffic goes through this, so none of our apps are published externally, only internally. If I set one of my apps on the workers to use redis within the overlay network, I get the intermittent “hangs”. If I instead tell the worker app to use the external haproxy address that then maps through to the same overlay network, everything is solid.

I thought with this recent update to 17.06.2 that things seemed better with an initial test so I deployed a new app with direct overlay communication to redis. It definitely periodically hangs. So I switched my redis URI instead to route through haproxy (still going to the same redis service, just routed through haproxy outside the overlay) and no problems.

I stumbled onto this workaround after noticing that I had a couple of apps running outside the swarm hitting redis through haproxy, and they never burped. Here is a view of NewRelic response times over 3 hours. You can clearly see where I made the switch back to using the haproxy workaround. Now my connections are SUPER stable at 400ms or so.

[screenshot: NewRelic response times, 2017-09-10_15-17-04]

And here is just the last 60 mins.

[screenshot: NewRelic response times, 2017-09-10_15-20-23]

I would love to see this resolved, I love everything ELSE about swarm and want to replace our production AWS ECS clusters.

@odelreym The problem you are running into is because IPVS removes idle connections after 900 seconds. For long running sessions that can be idle for > 15 mins you have to tweak the tcp keepalive settings. Some more info here & here
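A minimal sketch of that host-level tweak (the values are examples, not recommendations; keepalives just need to fire well before IPVS’s ~900-second idle timeout):

sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=3
# persist by adding the same keys to /etc/sysctl.conf (or a file under /etc/sysctl.d/)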

We’re running:

  • Kernel 3.13.0 (Ubuntu 14.04)
  • Swarm managed overlay networks
  • Networks attachable but not secure
  • Docker engine 17.04.0~ce-0
  • Running on VMware VMs (not my choice - I’d prefer bare metal)

What I see is that over time, I start getting timeouts to swarm services. Restarting docker on the problematic nodes fixes it. Deleting all services deployed to those nodes also fixes it. My assumption is the latter works because it deletes and re-creates the networks.

This is kind of scary considering we only find out as services start degrading and failing.

Output of Docker version

Client:
 Version:      17.04.0-ce
 API version:  1.28
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 18:01:08 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.04.0-ce
 API version:  1.28 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 18:01:08 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info

Containers: 13
 Running: 13
 Paused: 0
 Stopped: 0
Images: 27
Server Version: 17.04.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 252
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: rkdhdv5htvf9699y4c80zz2wr
 Is Manager: false
 Node Address: 172.18.2.15
 Manager Addresses:
  172.17.0.25:2377
  172.17.0.36:2377
Runtimes: runc
Default Runtime: runc
Init Binary:
containerd version: 422e31ce907fd9c3833a38d7b8fdd023e5a76e73
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 apparmor
Kernel Version: 3.13.0-96-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.798GiB
Name: dev-web01
ID: UWFJ:K7VF:56PG:EYED:GVPT:L73C:3O6I:MT5F:EI4R:6ES6:XWEO:6W4H
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 stage=dev
 node=01
 role=web
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

We have the same issue on old Docker 1.12.6: the same kernel messages IPVS: __ip_vs_del_service: enter, and virtual and sometimes bare-metal node services restarting with “suspect timeout reached” in the Docker log. We are also considering leaving swarm because of that 😦((

@dtmistry @Nossnevs 17.10-rc2 is available, let me know if you guys can give it a try and report your results

Well, the best part is that Docker EE (the production-certified version) has exactly the same bug…

The other “temporary” workaround is to write a monitoring script that pings all containers from all containers. Mostly you will see one host affected, so then you need to drain or reboot it.

I know it is ugly, but a script that pings is very light and can be run every minute. I believe in the swarm team and Docker and I’m sure they will figure it out. ❤️
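A minimal sketch of that kind of monitoring (the service name and source container are assumptions, and the container must have getent and ping available); run it periodically on each node:

SRC_CTR=$(docker ps --filter "name=nginx" --format '{{.ID}}' | head -n1)
for ip in $(docker exec "$SRC_CTR" getent ahosts tasks.web | awk '{print $1}' | sort -u); do
  docker exec "$SRC_CTR" ping -c1 -W2 "$ip" >/dev/null 2>&1 || echo "unreachable from $(hostname): $ip"
done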

@odelreym thanks for sharing this, I’m sure it will help a lot of people seeing a similar issue. I would actually look into this to see if there is a variable that can be tweaked to change it at the host level

We have upgraded our nodes to 17.06 and at first glance it has been smooth sailing, with almost no issues (I will come back with more on this later). But we have removed a lot of the load on the nodes because we moved a lot of production code away from docker. So I’m trying to reproduce the problem in 17.06 on our swarm cluster with this python code https://github.com/Nossnevs/docker_swarm_test to add some of the stress that we had back when we first noticed the bug.

The code starts up two services, test_a and test_b, which talk to each other over http using the service name as the domain name. After a specified time it moves the containers to a random specified node and starts talking again. You can also specify the number of service pairs.

I started with 2 containers, which worked fine, and the only transmission errors I found in Kibana were when the container moved (which is to be expected). I increased the pairs to 10, and after 2 moves one of the pairs had continuous errors and unfortunately other services also started acting up. In the end I had to restart docker on all nodes.

I can’t repeat this on our cluster because we have some services that need to be up and running.

But if you would like to reproduce the problem, you may use this code at your own risk. Feel free to improve it.

I tested 17.06-rc3. I am running swarm cluster on top of ubuntu 16.04.2. As far as I tested I am not facing any issues now, services are able to communicate with each other. Hope I won’t be getting any network issues with 17.06 😃

Hi All

Our problem is like this:

A swarm mode cluster with two nodes. A service in global mode. On one node, let the container ping the container of the same service on the other node. On the other node, restart the docker engine every minute.

After a few minutes, there is a “Destination Host Unreachable” error while the two containers’ IPs remain unchanged. The situation is unrecoverable until the docker engine is restarted again.

the docker version is: Version: 17.03.1-ce

We’ve found that the ARP entry of the continuously restarted container has not been synchronized into the neighbor table of the overlay network namespace.

@muhammadwaheed @thaJeztah I debugged a lot and tried some different environments for my network loss issues between swarm nodes.

  1. VMware VM Ubuntu 16.04 LTS on slow storage (hugely overloaded storage system): sometimes causes network loss.
  2. VMware VM Ubuntu 16.04 LTS on a slow host (lots of VMs on the same host, huge RAM usage and 40% CPU avg): sometimes causes “no acks received” and other network-related issues.
  3. Physical host HPE BL460c Ubuntu 16.04 LTS on slow storage (using shared storage): works fine (no network issues), but slow service re-/starts etc.
  4. Physical host HPE BL460c Ubuntu 16.04 LTS with its own storage (integrated or network attached, same result): works fine and fast.

We tried multiple VMware hosts/clusters and hardware, and also tested Docker versions between 17.03 and 17.05. It looks like a virtualization environment causes some weird issues with docker; we switched to a physical environment and it works fine without any issues.

We have been using solution 4 since Monday and have received very positive feedback about the performance and stability of the applications running on it.

We have the same problem. We have 4 nodes that lose connection to each other, and it is not a network problem. It appears to happen randomly. After a couple of minutes the connections come back and the swarm heals itself, ending with all 4 nodes working.

Docker 17.03

@fcrisciani Oh 😄 I’m just trying to work out what net.ipv4.tcp_timestamps and net.ipv4.tcp_window_scaling have to do with the seemingly random networking issues I see in my setup after a few weeks of runtime 😃

I had a similar issue with

  • docker engine versions: 17.09.0, 17.09.1 (tried with both versions installed across the 3 nodes , one version at a time)
  • OS: centOS 7
  • kernel: 3.10.0-693.11.6.el7.x86_64
  • 3 nodes (3 of them managers)
  • services configured with 2 replicas

Everything started fine for a few minutes; then 1 of the 2 replicas stopped working and all of the outbound (_default network traffic as well) and ingress traffic got stuck (timed out) from the container and also from the (failing) host itself. After a while we did a tcpdump and realized that we had encountered this problem.

Running sudo sysctl -w net.ipv4.tcp_timestamps=0; sudo sysctl -w net.ipv4.tcp_window_scaling=0 on the hosts actually solved our problems. (Beware, it is temporary; to make it permanent I had to modify /etc/sysctl.conf.)

So I would recommend listing your tcp sysctl settings with sudo sysctl -p | grep tcp and checking whether those two are enabled.

UPDATE my final setup was like this:

  • tcp_tw_recycle: 0
  • tcp_timestamps: 0
  • tcp_window_scaling: 1
  • Docker upgraded to 17.12

It has been running for more than 7 days now in 16 swarm clusters of 3+ servers each, with no problems at all; all network glitches are gone. Hope you guys find it useful.
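For reference, a hedged sketch of persisting that final setup with a sysctl drop-in (the file name is arbitrary):

cat >/etc/sysctl.d/99-tcp-tuning.conf <<'EOF'
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_window_scaling = 1
EOF
sysctl --system   # re-apply all sysctl configuration files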

@DBLaci I had similar issues on 17.09. Upgrading to 17.10.0-ce solved the issue we had. It was related to Overlay fix for transient IP reuse docker/libnetwork#1935

I didn’t create a lot of networks manually; most of them were created using stack deploy.

You need to check ping as @fcrisciani mentioned to try and isolate the problem you have

@DBLaci once you hit the condition of connection refused, can you try to open a shell in one src and dst container and try to ping each other using the container ip directly not the VIP?

@fcrisciani Not an exact way to reproduce, because the problem appears somewhat randomly or after a time (possibly idle):

  • I create overlay networks manually (there are 10-20 of them!)
  • I deploy stacks that contains 1 or more services and communicate over the manually created overlay networks. Most of the services are in 2 replicas
  • After the deploy everything works fine (except the problems introduced in 17.10 and 17.11-rc #35417)
  • After a time (maybe the next day) the problem occurs as I described above. The service is running but cannot be accessed from another service (or at least from some) and gets connection refused until I restart it manually. Healthcheck and everything show the service itself is working.

Docker 17.11-rc3 (at the moment), Ubuntu 16.04 (default sysctl now) on AWS. CPU/memory is not a bottleneck; the swarm nodes (5) are on one subnet, with 3 of the 5 being managers.

We have the same problem on Ubuntu 16.06 Docker version 17.09.0-ce

@rgardam sorry to bust your theory, but I’ve got this problem in our private data center where docker runs on bare-metal Linux (no VMware or anything like that).

@omerh we are tracking the issue here https://github.com/moby/moby/issues/35310 and the PR for the fix is here https://github.com/docker/libnetwork/pull/2004

@fcrisciani, no, sorry. The issue was on a production environment. I still have the old masters running; I’ll try to reproduce it later in the coming week

During my upgrades I also encountered "Unable to complete atomic operation, key modified". I made the update on a drained node. Not sure if it was demoted, though (I was running an all-nodes-as-manager setup). Using 17.10 I encountered overlay networks not being visible, until I promoted the node (so back to an all-manager setup) (this seems to be another issue though)

@fcrisciani After upgrading to 17.10-rc2 we had issues starting the services which were attached to a user defined overlay network. Specifically the below error -

"Unable to complete atomic operation, key modified"

But I believe the way the upgrade happened was the problem.

Once 17.10 was available we upgraded each node by draining/promoting/demoting. And I’ve not seen any issues since

@fcrisciani Hmmm, it doesn’t work in my case. I was using endpoint mode dnsrr until the moment 17.10 was released, because that was helping (though nginx caches the IP, and on a service restart it was failing, still translating to the old IP).

Now I removed it again, and the situation returned: at a random time a tool (behind nginx) is not responsive anymore. In the logs, nginx times out connecting to ldap (which is on another node). Only a service restart helps.

But in 17.10, restarting the docker service leaves the node in the “Down” state in “docker node ls”, so I have to reboot the machine in order to make it “Ready” 😦 That is of course another issue, a non-blocking one for sure 😃

@SunDeryInc isn’t it mentioned there?

Overlay fix for transient IP reuse docker/libnetwork#1935

@dtmistry @Nossnevs the next RC of 17.10 has 2 fixes for that case: 1st, IP allocation will be sequential to avoid the possibility of IP overlap; plus there is a fix in the overlay driver to have a consistent configuration in case of IP overlap. It would be great if you guys could give feedback on that as it becomes available.

For what it’s worth, I swapped out our swarm stack for k8s weeks ago, and instead of crashing every few days it has never crashed. Of course YMMV, but k8s solved all of my instability issues so far.

On 9 Oct 2017 09:47, “djalal” notifications@github.com wrote:

@augusto-altman https://github.com/augusto-altman

unless you move to a managed k8s cluster with paid support, you should be aware that networking glitches are everywhere.

quick fact : 62 issues labeled “network”+“bug” in K8S (vs 75 in moby)

source : https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+label%3Asig%2Fnetwork+label%3Akind%2Fbug+is%3Aopen

I like this quote “different companies, same team”


@sanimej I have been testing this over the weekend, and with the keepalive parameter lowered to 100 seconds the problem is gone (I think a timeout parameter below 30 min is OK)

[root@Nsnode1 netns]# sysctl -a | grep keepalive
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 100

Parameter setup for the Hikari database driver

    <beans profile="database-postgresql">
        <bean id="postgresqlHikariConfig" class="com.zaxxer.hikari.HikariConfig">
           ----------- some params here for user credentials -----
            <property name="connectionTimeout" value="30000"/>
            <property name="idleTimeout" value="30000"/>
            <property name="maxLifetime" value="1800000"/>   
            <property name="maximumPoolSize" value="10"/>
            <property name="dataSourceProperties">
                <props>
                    <prop key="prepareThreshold">1</prop>
                </props>
            </property>
        </bean>

I can always see 10 established connections to the Postgres container (that’s what is defined in the previous config, so good)

[root@1bca83842811 /]# netstat -tanl| grep 5432| grep ESTABLISHED| wc -l
10

Every 900 seconds I see these kinds of messages in catalina.out:

- Connection is not available, request timed out after 927660ms.
- Connection is not available, request timed out after 925914ms.
- Connection is not available, request timed out after 925602ms.
- Connection is not available, request timed out after 929863ms.
- Connection is not available, request timed out after 927551ms.
- Connection is not available, request timed out after 930182ms.
- Connection is not available, request timed out after 928875ms.
- Connection is not available, request timed out after 926941ms.
- Connection is not available, request timed out after 927406ms.
- Connection is not available, request timed out after 929835ms.
- Connection is not available, request timed out after 930294ms.

but now it doesn’t affect the application (maybe it is something Hikari does to monitor whether the socket is alive or not, but it seems not to be important)

I don’t know if lowering the keepalive to 100 seconds can have side effects on other applications/containers (I don’t expect it, since the other services I have are pure http), but, so far, it is good for me.

Anyway, during this week I will continue testing this

Nacho.

Thanks @fcrisciani and the rest for the good job done.

I can confirm what @odelreym mentions. Any persistent/pooled connections broke after a certain amount of time for us. All our transient connections always worked fine. This behaviour was present even after an update to 17.06-rc. An alternative to lowering timeouts is to set endpoint_mode to dnsrr for the services with incoming persistent connections (if suitable for your services). After doing that, things are stable and working fine.

Hi there

I have been suffering from these types of problems with swarm for a while (I was on 17.05, CentOS 7.3)

 https://github.com/moby/moby/issues/27897#issuecomment-304220784

Last Friday I updated my 3-node test cluster to 17.06 and so far it is working fine, but the issue that I had with a Tomcat container connected to a PostgreSQL container, each deployed on a different docker host, was still there

The symptom was that each time I tried to log into the application (30/35 min after the stack was launched into the swarm), Tomcat wanted to reuse an established connection to Postgres (checked using netstat… the sockets were there) for checking the username/password, but the browser kept waiting for a response which never came back.

So, looking deeper into the logs (searching for values similar to 30 min/1800 seconds) I saw this

<property name="maxLifetime" value="1800000"/>

and after changing the connection timeout defined in Hikari to 900000 (15 min), the stack worked properly

I think IPVS hibernates idle connections 20-30 min after the last time they were used, and if the application is not able to bring them back to life, or a broken pipe happens, the service seems to hang… at least in my case.

I had already tested a stack ‘flow-proxy<->nginx’ running separately… and in 17.05, 30-40 min later, from flow-proxy I was able to ping the other container but I wasn’t able to telnet to the nginx service, for instance.

Now, I can confirm it works in 17.06

I hope this information helps to anyone

@fcrisciani

  1. There is no resource limit being hit AFAIK, this is something I have been monitoring closely, kernel logs are clean in those terms at the time of these errors.
  2. Well we are not using any network manager ourselves but since it’s virtual machines at DO there is probably something running with their hypervisors that is out of scope for us to configure. Are you saying that docker swarm might not be suitable to run in virtualized environments at this time?
  3. We have infrequent periodic operations but they are not correlated in time with the issues, even during these there is no resource starvation going on

I have been testing the network connectivity between the nodes and found it to be somewhat flaky at times, probably related to network congestion over at DO datacenters, but this is something I would expect to be quite a common scenario for cloud environments. I have mentioned this specifically here and here quite a while ago.

What troubles me is that even if there are brief connectivity issues between the nodes, it cannot be viable for the overlay network to become unstable for 30-60 minutes or more for multiple services. In my tests the connectivity issues can be a single ping request failure followed by immediately successful ones. As I mentioned in the linked comments, it doesn’t look like swarm is handling these connectivity issues in a reliable way, mostly because the nodes are kept in the active state and services keep running, but with a wide range of connectivity issues that linger in the overlay network even when node connectivity is back in business pretty much right away.

What do you mean by glitches? Do you mean that brief connectivity issues are expected to cause these rather serious issues for substantial times?

For reference here is an extract of kernel parameters used on these machines to cope with a lot of http traffic and some other stuff.

vm.swappiness=0
vm.vfs_cache_pressure=200
vm.dirty_background_ratio=5
vm.dirty_ratio=10
vm.overcommit_memory=1

fs.file-max = 500000

net.ipv4.ip_local_port_range=1024 65000

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=15

net.core.somaxconn=4096
net.core.netdev_max_backlog=65536

net.core.rmem_max=67108864
net.core.wmem_max=67108864
net.core.wmem_default=31457280
net.core.rmem_default=31457280
net.core.optmem_max=25165824

net.ipv4.tcp_max_syn_backlog=20480
net.ipv4.tcp_max_tw_buckets=400000
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_syn_retries=2
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_rmem=8192 87380 33554432
net.ipv4.tcp_wmem=8192 65536 33554432
net.ipv4.tcp_mem=384027 512036  768054
net.ipv4.udp_mem=768054 1024072 1536108
net.ipv4.udp_rmem_min=16384
net.ipv4.udp_wmem_min=16384

net.netfilter.nf_conntrack_max=262144
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_established=86400

net.ipv4.neigh.default.gc_thresh1=8096
net.ipv4.neigh.default.gc_thresh2=12288
net.ipv4.neigh.default.gc_thresh3=16384

net.ipv4.tcp_keepalive_time=600
net.ipv4.tcp_keepalive_intvl=30

Has anyone else experienced or heard of these issues on DigitalOcean specifically? I have reached out to them regarding the network congestion but it seems to be something they can’t really do much about.

I can confirm that this is still present to some extent in 17.06.

We are getting a lot of the same memberlist warnings and failed suspects, multiple times per day. We are not running on AWS. The machines are relatively high spec (12 CPUs, 32 GB memory), and they are not overloaded.

These memberlist warnings are generally interleaved with node join events, which looks kind of weird.

Jul  7 09:43:15 ada dockerd[1594]: time="2017-07-07T09:43:15.497723231Z" level=info msg="Node join event for alan-9ce9d334a406/10.132.80.107"
Jul  7 09:43:30 ada dockerd[1594]: time="2017-07-07T09:43:30.865030613Z" level=warning msg="memberlist: Refuting a suspect message (from: alan-9ce9d334a406)"
Jul  7 09:43:37 ada dockerd[1594]: time="2017-07-07T09:43:37.764122101Z" level=info msg="Node join event for alan-9ce9d334a406/10.132.80.107"

The whole suite of the previously mentioned memberlist weirdness is something we still see frequently.

Jul  6 06:28:07 alan dockerd[1649]: time="2017-07-06T06:28:07.260538965Z" level=warning msg="memberlist: Refuting a suspect message (from: ada-e841cd94904d)"
Jul  6 06:31:29 alan dockerd[1649]: time="2017-07-06T06:31:29.997129354Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:63718->10.132.76.139:7946: i/o timeout"
Jul  6 06:31:29 alan dockerd[1649]: time="2017-07-06T06:31:29.997210265Z" level=info msg="memberlist: Suspect ada-e841cd94904d has failed, no acks received"
Jul  6 06:31:30 alan dockerd[1649]: time="2017-07-06T06:31:30.000546159Z" level=warning msg="memberlist: Refuting a suspect message (from: marvin-bc32a4e47a50)"
Jul  6 09:05:34 alan dockerd[1649]: time="2017-07-06T09:05:34.316824120Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:16514->10.132.76.139:7946: i/o timeout"
Jul  6 09:05:34 alan dockerd[1649]: time="2017-07-06T09:05:34.316887960Z" level=info msg="memberlist: Suspect ada-e841cd94904d has failed, no acks received"
Jul  6 09:05:34 alan dockerd[1649]: time="2017-07-06T09:05:34.317450181Z" level=warning msg="memberlist: Refuting a suspect message (from: marvin-bc32a4e47a50)"
Jul  6 14:32:19 alan dockerd[1649]: time="2017-07-06T14:32:19.759783389Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:53740->10.132.76.139:7946: i/o timeout"
Jul  6 14:32:19 alan dockerd[1649]: time="2017-07-06T14:32:19.759849347Z" level=info msg="memberlist: Suspect ada-e841cd94904d has failed, no acks received"
Jul  6 16:13:30 alan dockerd[1649]: time="2017-07-06T16:13:30.988749298Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:18344->10.132.72.183:7946: i/o timeout"
Jul  6 16:13:30 alan dockerd[1649]: time="2017-07-06T16:13:30.988805307Z" level=info msg="memberlist: Suspect marvin-bc32a4e47a50 has failed, no acks received"
Jul  6 16:16:31 alan dockerd[1649]: time="2017-07-06T16:16:31.042404586Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:22026->10.132.72.183:7946: i/o timeout"
Jul  6 16:16:31 alan dockerd[1649]: time="2017-07-06T16:16:31.042461622Z" level=info msg="memberlist: Suspect marvin-bc32a4e47a50 has failed, no acks received"
Jul  6 16:16:31 alan dockerd[1649]: time="2017-07-06T16:16:31.043998950Z" level=warning msg="memberlist: Refuting a suspect message (from: marvin-bc32a4e47a50)"
Jul  6 19:31:30 alan dockerd[1649]: time="2017-07-06T19:31:30.734503779Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:56282->10.132.76.139:7946: i/o timeout"
Jul  6 19:31:30 alan dockerd[1649]: time="2017-07-06T19:31:30.734567924Z" level=info msg="memberlist: Suspect ada-e841cd94904d has failed, no acks received"
Jul  6 20:00:49 alan dockerd[1649]: time="2017-07-06T20:00:49.751892140Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:29316->10.132.76.139:7946: i/o timeout"
Jul  6 20:00:49 alan dockerd[1649]: time="2017-07-06T20:00:49.751935623Z" level=info msg="memberlist: Suspect ada-e841cd94904d has failed, no acks received"
Jul  6 21:08:30 alan dockerd[1649]: time="2017-07-06T21:08:30.675150927Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:51722->10.132.76.139:7946: i/o timeout"
Jul  7 06:55:36 alan dockerd[1649]: time="2017-07-07T06:55:36.985447101Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:19442->10.132.76.139:7946: i/o timeout"
Jul  7 06:55:36 alan dockerd[1649]: time="2017-07-07T06:55:36.985523603Z" level=info msg="memberlist: Suspect ada-e841cd94904d has failed, no acks received"
Jul  7 06:55:36 alan dockerd[1649]: time="2017-07-07T06:55:36.986110489Z" level=warning msg="memberlist: Refuting a suspect message (from: marvin-bc32a4e47a50)"
Jul  7 09:07:30 alan dockerd[1649]: time="2017-07-07T09:07:30.911333856Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:3142->10.132.76.139:7946: i/o timeout"
Jul  7 09:07:30 alan dockerd[1649]: time="2017-07-07T09:07:30.911389984Z" level=info msg="memberlist: Suspect ada-e841cd94904d has failed, no acks received"
Jul  7 09:43:30 alan dockerd[1649]: time="2017-07-07T09:43:30.838054956Z" level=warning msg="memberlist: Failed fallback ping: write tcp 10.132.80.107:50796->10.132.76.139:7946: i/o timeout"
Jul  7 09:43:30 alan dockerd[1649]: time="2017-07-07T09:43:30.838109395Z" level=info msg="memberlist: Suspect ada-e841cd94904d has failed, no acks received"
$ docker -v
Docker version 17.06.0-ce, build 02c1d87

After upgrading I actually destroyed the swarm, rebooted all machines, and recreated the whole swarm and all services and while that was coming back up I got a lot of these errors:

Jul 03 15:47:42 alan dockerd[1649]: time="2017-07-03T15:47:42.915953136Z" level=error msg="Failed to delete real server 10.0.0.79 for vip 10.0.0.78 fwmark 294 in s
Jul 03 15:47:42 alan dockerd[1649]: time="2017-07-03T15:47:42.916029723Z" level=error msg="Failed to delete service for vip 10.0.0.78 fwmark 294 in sbox 750c15a (e
Jul 03 15:47:42 alan dockerd[1649]: time="2017-07-03T15:47:42Z" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.78/32 -j MARK --set-mark 29
Jul 03 15:47:42 alan dockerd[1649]: time="2017-07-03T15:47:42.974067835Z" level=error msg="Failed to delete firewall mark rule in sbox 750c15a (eacdb15): reexec fa
Jul 03 15:47:44 alan dockerd[1649]: time="2017-07-03T15:47:44.912437124Z" level=warning msg="Deleting bridge mac mac 02:42:0a:00:00:59 failed, no such file or dire
Jul 03 15:47:45 alan dockerd[1649]: time="2017-07-03T15:47:45.452549659Z" level=warning msg="Deleting bridge mac mac 02:42:0a:00:00:4d failed, no such file or dire
Jul 03 15:48:00 alan dockerd[1649]: time="2017-07-03T15:48:00.633000696Z" level=error msg="Failed to delete real server 10.0.0.90 for vip 10.0.0.88 fwmark 302 in s
Jul 03 15:48:00 alan dockerd[1649]: time="2017-07-03T15:48:00.633118481Z" level=error msg="Failed to delete service for vip 10.0.0.88 fwmark 302 in sbox 3febdc5 (d
Jul 03 15:48:00 alan dockerd[1649]: time="2017-07-03T15:48:00.655211684Z" level=warning msg="Deleting bridge mac mac 02:42:0a:00:00:4d failed, no such file or dire
Jul 03 15:48:00 alan dockerd[1649]: time="2017-07-03T15:48:00Z" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.88/32 -j MARK --set-mark 30
Jul 03 15:48:00 alan dockerd[1649]: time="2017-07-03T15:48:00.671252899Z" level=error msg="Failed to delete firewall mark rule in sbox 3febdc5 (d839c65): reexec fa
Jul 03 15:48:00 alan dockerd[1649]: time="2017-07-03T15:48:00.671394576Z" level=error msg="Failed to delete real server 10.0.0.90 for vip 10.0.0.88 fwmark 302 in s
Jul 03 15:48:00 alan dockerd[1649]: time="2017-07-03T15:48:00.671432730Z" level=error msg="Failed to delete service for vip 10.0.0.88 fwmark 302 in sbox 6b9c2cb (6
Jul 03 15:48:00 alan dockerd[1649]: time="2017-07-03T15:48:00Z" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.88/32 -j MARK --set-mark 30
Jul 03 15:48:00 alan dockerd[1649]: time="2017-07-03T15:48:00.711298298Z" level=error msg="Failed to delete firewall mark rule in sbox 6b9c2cb (6a46317): reexec fa
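
(For reference, the destroy-and-recreate was roughly along these lines; this is a sketch from memory, not the exact commands, and the stack name and manager IP are placeholders:)

# on the manager: remove the stacks/services and leave the swarm
docker stack rm mystack
docker swarm leave --force
# on each worker
docker swarm leave --force
# reboot all machines, then re-initialise on the manager
docker swarm init --advertise-addr <manager-ip>
# join the workers using the token printed by "docker swarm init", then redeploy the stacks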

I attached some findings while debugging this stuff in a similar issue (on 17.05) but I haven’t really had any more feedback on this.

@fcrisciani Additionally, CloudWatch has metrics that show how many instance credits are being consumed and how many remain. I believe those are the best metrics to use here.
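
For example, something along these lines pulls the remaining credit balance for an instance (assuming the AWS CLI is configured; the instance ID and time window are placeholders):

aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time 2017-07-07T08:00:00Z --end-time 2017-07-07T09:00:00Z \
    --period 300 --statistics Average

CPUCreditUsage can be queried the same way to see how fast the credits are being burned.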

Quick heads-up for anyone who uses AWS. Check the description of the instance type carefully: https://aws.amazon.com/ec2/instance-types. If you are using T2 instances you have to be very careful, because they do not guarantee CPU resources. In short, once you use up all your CPU credits the VM stops receiving physical CPU time and effectively stops running. This creates connectivity issues between the nodes, where the distributed database marks other nodes as unreachable and cleans them up. There are several ways to identify the problem:

  1. Grep for memberlist in the Docker logs; messages like "node failed" or "suspect" mean the distributed database had keepalive problems with other nodes (see the sketch after this list).
  2. vmstat will show a high steal time, and uptime will show an increasing load.
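
A rough way to check both on a node might look like this (assuming systemd/journalctl, as in the logs above):

journalctl -u docker.service | grep -i memberlist | grep -iE "suspect|failed"
vmstat 5 3     # watch the "st" (steal) column
uptime         # load averages trending upwards point the same way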

I can also confirm that cluster stability has improved since 17.06-rc3 and the intermittent outages went away. This makes swarm mode and overlay networks usable for us again.

@sulphur I have been running 17.06-rc3 for a week now and have not faced any network issues. We have some 20 microservices deployed in the swarm cluster, and service-to-service communication is working fine as of now.

@eljrax I am deploying the same way as you mentioned, by declaring the network as external in the compose file. I'm not sure it's working only because of that, though.
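
For reference, that setup looks roughly like this (a minimal sketch; the network, service, and stack names are placeholders):

# create the overlay network outside the stack
docker network create --driver overlay --attachable appnet

# reference it as external from the compose file
cat > docker-compose.yml <<'EOF'
version: "3.2"
services:
  web:
    image: nginx:alpine
    networks:
      - appnet
networks:
  appnet:
    external: true
EOF

docker stack deploy -c docker-compose.yml mystack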

@vovimayhem Sorry, but time is something I don't have for this anymore, at least until I can show management that this is solved. If I find some time I will try, though. We have the old nodes up and running and they are experiencing the problem now too.

@fcrisciani @mavenugo https://github.com/docker/libnetwork/pull/1792 looks promising. But this issue also has a state where the problem is permanent, not something that disappears after 300 seconds. Is it possible that the Serf-like tool Docker uses has some problem, so that changes are not spread between nodes, as @BSWANG mentioned?

(When I say "we" I mean @Paxxi, @AndreasSundstrom, and me. We have worked together to analyze this issue.)

It’s happening all the time, after just a few minutes.

The cluster is deployed on AWS with docker-machine; Swarm and Docker are running with the experimental flag.

docker version

Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:10:54 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:10:54 2017
 OS/Arch:      linux/amd64
 Experimental: true

docker info

Containers: 13
 Running: 9
 Paused: 0
 Stopped: 4
Images: 1028
Server Version: 17.05.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 434
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: 0baa6iv3z4awyjg1rmcgqy2be
 Is Manager: true
 ClusterID: 5rcgshsz3ihxzv77prckizrji
 Managers: 6
 Nodes: 6
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 11.11.7.31
 Manager Addresses:
  11.11.7.149:2377
  11.11.7.163:2377
  11.11.7.31:2377
  11.11.8.113:2377
  11.11.8.35:2377
  11.11.8.43:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.795GiB
Name: staging-blue-node-1
ID: BDR5:XJ27:PBUZ:63IY:YQ56:VHTM:TNRU:EA3H:24SW:VKKB:EZXG:WUL2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 provider=amazonec2
Experimental: true
Insecure Registries:
 registry.****.co:5000
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

I’ve been seeing the same issue. Some (possibly all) overlay IPs stop responding; DNS still resolves the IP, but connections to a port on the IP hang indefinitely. Restarting just the Docker daemon sometimes solves the issue, but today we needed a full reboot to recover. Services are running inside swarm mode, the networks are created as “attachable”, and sometimes the target IP is a standalone container running outside of swarm mode. If it’s helpful, I also have daemon-data and goroutine-stacks dumps that were generated during this issue.
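
For context, the attachable setup is roughly the following (a sketch; the network and container names are placeholders):

docker network create --driver overlay --attachable app_net
docker run -d --name standalone --network app_net nginx:alpine
docker run --rm --network app_net alpine wget -q -O- http://standalone/   # the kind of request that hangs when the issue hits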

Docker version is 17.03.1-ce (a similar issue was seen with 1.13.1). The host OS is RHEL 7.3 with kernel 3.10.

Looking through my logs after restarting just dockerd, on host2 I see:

Apr 10 09:01:06 ***host2 dockerd: time="2017-04-10T09:01:06.257178027-06:00" level=debug msg="Neighbor entry already present for IP 10.1.10.67, mac 02:42:0a:00:01:03"
Apr 10 09:01:06 ***host2 dockerd: time="2017-04-10T09:01:06.264903885-06:00" level=debug msg="Neighbor entry already present for IP 10.0.1.3, mac 02:42:0a:00:01:03"
Apr 10 09:01:06 ***host2 dockerd: time="2017-04-10T09:01:06.264921849-06:00" level=debug msg="Neighbor entry already present for IP 10.1.10.67, mac 02:42:0a:00:01:03"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.562636938-06:00" level=debug msg="***host2.***.***-e2a59500b1df: Initiating bulk sync with nodes [***host1.***.***-0edb2b0f8955]"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.562678204-06:00" level=debug msg="***host2.***.***-e2a59500b1df: Initiating unsolicited bulk sync for networks [qd3qzz21s7jm398aq4paq5ke2 qww7tii19z86fug7vg8nsa4a2 dgruwla6gowe8jk8yzvcwcacz vdp1cqfnz5lnz071rt6p2kvgl sn93tbqpqsygur817q15mxd6b ydebbtvtjd3l5va7qvwjvi7ti] with node ***host1.***.***-0edb2b0f8955"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.670120679-06:00" level=debug msg="memberlist: TCP connection from=10.1.10.67:41288"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.677351582-06:00" level=debug msg="checkEncryption(dgruwla, 10.1.10.67, 4100, false)"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.677429689-06:00" level=debug msg="List of nodes: map[10.1.10.67:10.1.10.67]"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.677447405-06:00" level=debug msg="Programming encryption for vxlan 4100 between <nil> and 10.1.10.67"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.677490120-06:00" level=debug msg="/usr/sbin/iptables, [--wait -t mangle -C OUTPUT -p udp --dport 4789 -m u32 --u32 0>>22&0x3C@12&0xFFFFFF00=1049600 -j MARK --set-mark 13681891]"

On host1, I’m seeing:

Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.143150823-06:00" level=debug msg="Creating service for vip 10.0.2.6 fwMark 729 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth6242: link becomes ready
Apr 10 09:01:07 ***host1 kernel: br0: port 2(veth6242) entered forwarding state
Apr 10 09:01:07 ***host1 kernel: br0: port 2(veth6242) entered forwarding state
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.151639028-06:00" level=debug msg="Miss notification, l2 mac 02:42:0a:00:02:07"
Apr 10 09:01:07 ***host1 NetworkManager[1160]: <info>  [1491836467.1609] device (vethd28a2ce): driver 'veth' does not support carrier detection.
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.191785808-06:00" level=debug msg="Creating service for vip 10.0.2.4 fwMark 727 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.235685943-06:00" level=debug msg="Creating service for vip 10.0.2.2 fwMark 728 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.280173727-06:00" level=debug msg="Creating service for vip 10.0.2.8 fwMark 725 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.325045640-06:00" level=debug msg="Creating service for vip 10.0.2.5 fwMark 726 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.391084213-06:00" level=debug msg="Creating service for vip 10.0.1.11 fwMark 307 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth24804: link becomes ready
Apr 10 09:01:07 ***host1 kernel: br0: port 16(veth24804) entered forwarding state
Apr 10 09:01:07 ***host1 kernel: br0: port 16(veth24804) entered forwarding state
Apr 10 09:01:07 ***host1 NetworkManager[1160]: <info>  [1491836467.3996] device (vethd1009fd): driver 'veth' does not support carrier detection.
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.399638183-06:00" level=debug msg="Miss notification, l2 mac 02:42:0a:00:01:03"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.444175587-06:00" level=debug msg="Creating service for vip 10.0.1.36 fwMark 308 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.498370688-06:00" level=debug msg="Creating service for vip 10.0.1.14 fwMark 310 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.543398402-06:00" level=debug msg="Creating service for vip 10.0.1.32 fwMark 337 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.587459954-06:00" level=debug msg="memberlist: TCP connection from=10.1.10.68:54094"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.593656923-06:00" level=debug msg="Creating service for vip 10.0.1.23 fwMark 306 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.641793602-06:00" level=debug msg="Creating service for vip 10.0.1.20 fwMark 309 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.663756416-06:00" level=debug msg="***host1.***.***-0edb2b0f8955: Initiating  bulk sync for networks [qd3qzz21s7jm398aq4paq5ke2 qww7tii19z86fug7vg8nsa4a2 dgruwla6gowe8jk8yzvcwcacz vdp1cqfnz5lnz071rt6p2kvgl sn93tbqpqsygur817q15mxd6b ydebbtvtjd3l5va7qvwjvi7ti] with node ***host2.***.***-e2a59500b1df"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.693889010-06:00" level=debug msg="Creating service for vip 10.0.1.5 fwMark 311 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.739212582-06:00" level=debug msg="Creating service for vip 10.0.1.31 fwMark 312 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.783444768-06:00" level=debug msg="Creating service for vip 10.0.1.27 fwMark 305 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth4d8da54: link becomes ready
Apr 10 09:01:07 ***host1 kernel: docker_gwbridge: port 13(veth4d8da54) entered forwarding state
Apr 10 09:01:07 ***host1 kernel: docker_gwbridge: port 13(veth4d8da54) entered forwarding state
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.851440265-06:00" level=debug msg="sandbox set key processing took 1.920494571s for container d850b7ae4eb86f5d85ca3fab123cfd68c4e77dcf0cf7ce7d38ffd062937f5f2b"
Apr 10 09:01:07 ***host1 NetworkManager[1160]: <info>  [1491836467.8920] device (vethe9991e7): driver 'veth' does not support carrier detection.
Apr 10 09:01:07 ***host1 NetworkManager[1160]: <info>  [1491836467.8958] device (veth4d8da54): link connected
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.895955172-06:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-container\", Id:\"d850b7ae4eb86f5d85ca3fab123cfd68c4e77dcf0cf7ce7d38ffd062937f5f2b\", Status:0x0, Pid:\"\", Timestamp:(*timestamp.Timestamp)(0xc42b6e3d50)}"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.896865954-06:00" level=debug msg=OpenMonitorChannel
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.897184732-06:00" level=debug msg="waiting on events" module="node/agent/taskmanager" task.id=syg4ozkkvbli3gu4arbew4xnd
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.897219511-06:00" level=debug msg="libcontainerd: event unhandled: type:\"start-container\" id:\"d850b7ae4eb86f5d85ca3fab123cfd68c4e77dcf0cf7ce7d38ffd062937f5f2b\" timestamp:<seconds:1491836467 nanos:895640198 > "

I also updated the swarm to 17.03.1-ce and still hit the same problem just now. After restarting the Docker engine on the problem host everything goes back to normal, but it will happen again.
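
For the record, the temporary recovery on the problem host is just restarting the engine and checking that the node and its tasks converge again (assuming systemd; the service name is a placeholder for whichever service is affected):

systemctl restart docker
docker node ls                 # on a manager: the node should show Ready again
docker service ps <service>    # tasks should converge back to Running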