moby: Swarm is having occasional network connection problems between nodes.
A few times a day I see connection issues between nodes, and clients occasionally get a "Bad request" error. My Swarm setup (AWS) has the following services: nginx (global) and web (replicated=2), on a separate overlay network. In nginx.conf I use proxy_pass http://web:5000 to route requests to the web service. Both services are running, are marked healthy, and have not been restarted while these errors occur. The manager is a separate node (30sec-manager1).
A few times a day, for a few requests, I receive errors saying nginx couldn't connect to the upstream, and I always see the 10.0.0.6 IP address mentioned:
Here are the related nginx and docker logs. The two web replicas run on the 30sec-worker3 and 30sec-worker4 nodes.
Nginx log:
----------
2017/03/29 07:13:18 [error] 7#7: *44944 connect() failed (113: Host is unreachable) while connecting to upstream, client: 104.154.58.95, server: 30seconds.com, request: "GET / HTTP/1.1", upstream: "http://10.0.0.6:5000/", host: "30seconds.com"
Around the same time, from the docker logs (journalctl -u docker.service) on node 30sec-manager1:
---------------------------
Mar 29 07:12:50 30sec-manager1 docker[30365]: time="2017-03-29T07:12:50.736935344Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54.659229055Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:10 30sec-manager1 docker[30365]: time="2017-03-29T07:13:10.302960985Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:11 30sec-manager1 docker[30365]: time="2017-03-29T07:13:11.055187819Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:17 30sec-manager1 docker[30365]: time="2017-03-29T07:13:17Z" level=info msg="Firewalld running: false"
on node 30sec-worker3:
-------------------------
Mar 29 07:12:50 30sec-worker3 docker[30362]: time="2017-03-29T07:12:50.613402284Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:12:55 30sec-worker3 docker[30362]: time="2017-03-29T07:12:55.614174704Z" level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:09 30sec-worker3 docker[30362]: time="2017-03-29T07:13:09.613368306Z" level=info msg="memberlist: Suspect 30sec-worker4-4ca6b1dcaa42 has failed, no acks received"
Mar 29 07:13:10 30sec-worker3 docker[30362]: time="2017-03-29T07:13:10.613972658Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:13:11 30sec-worker3 docker[30362]: time="2017-03-29T07:13:11.042788976Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:14 30sec-worker3 docker[30362]: time="2017-03-29T07:13:14.613951134Z" level=info msg="memberlist: Marking 30sec-worker4-4ca6b1dcaa42 as failed, suspect timeout reached"
Mar 29 07:13:25 30sec-worker3 docker[30362]: time="2017-03-29T07:13:25.615128313Z" level=error msg="Bulk sync to node 30sec-manager1-b1cbc10665cc timed out"
on node 30sec-worker4:
-------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
syslog on 30sec-worker4:
--------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.048975] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.100691] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.130069] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.155859] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.180461] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.205707] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.230326] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.255597] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"]
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"]
I checked other cases where nginx can't find the upstream, and each time these three lines appear most often in the docker logs around the same moment:
level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker3-054c94d39b58)"
Searching other issues, I found some with similar errors, so they may be related: https://github.com/docker/docker/issues/28843 https://github.com/docker/docker/issues/25325
Is there anything I should check or debug further to spot the problem, or is it a bug? Thank you.
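One thing worth ruling out first is whether the Swarm control-plane ports are intermittently blocked between nodes: memberlist gossip uses 7946/tcp+udp, the overlay data plane uses 4789/udp, and manager traffic uses 2377/tcp. As a rough sketch (not from the thread; the node addresses are placeholders you would replace with your own), a quick TCP reachability probe could look like this:

```python
import socket

# Swarm control-plane TCP ports. The UDP ports (7946, 4789) need a
# separate check (e.g. tcpdump), since UDP has no handshake to observe.
SWARM_TCP_PORTS = [2377, 7946]

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_nodes(hosts):
    """Probe every swarm TCP port on every host; return the failing pairs."""
    return [(h, p) for h in hosts for p in SWARM_TCP_PORTS
            if not tcp_port_open(h, p)]

# Hypothetical node addresses -- replace with your own:
# print(check_nodes(["172.31.31.146", "172.31.31.147"]))
```

Running something like this in a loop from each node would at least tell you whether the memberlist "no acks received" messages line up with real TCP-level unreachability between hosts.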
Output of docker version:
Client:
Version: 17.03.0-ce
API version: 1.26
Go version: go1.7.5
Git commit: 60ccb22
Built: Thu Feb 23 11:02:43 2017
OS/Arch: linux/amd64
Server:
Version: 17.03.0-ce
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 60ccb22
Built: Thu Feb 23 11:02:43 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 18
Running: 3
Paused: 0
Stopped: 15
Images: 16
Server Version: 17.03.0-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 83
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: ck99cyhgydt8y1zn8ik2xmcdv
Is Manager: true
ClusterID: in0q54eh74ljazrprt0vza3wj
Managers: 1
Nodes: 5
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 172.31.31.146
Manager Addresses:
172.31.31.146:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 990.6 MiB
Name: 30sec-manager1
ID: 5IIF:RONB:Y27Q:5MKX:ENEE:HZWM:XYBV:O6KN:BKL6:AEUK:2VKB:MO5P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
provider=amazonec2
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.): Amazon AWS (Manager - t2.micro, rest of nodes - t2.small)
docker-compose.yml (there are more services and nodes in the setup, but I posted only the ones involved):
version: "3"
services:
  nginx:
    image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/nginx:latest
    ports:
      - 80:80
      - 81:81
    networks:
      - thirtysec
    depends_on:
      - web
    deploy:
      mode: global
      update_config:
        delay: 2s
        monitor: 2s
  web:
    image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/os:latest
    command: sh -c "python manage.py collectstatic --noinput && daphne thirtysec.asgi:channel_layer -b 0.0.0.0 -p 5000"
    ports:
      - 5000:5000
    networks:
      - thirtysec
    deploy:
      mode: replicated
      replicas: 2
      labels: [APP=THIRTYSEC]
      update_config:
        delay: 15s
        monitor: 15s
      placement:
        constraints: [node.labels.aws_type == t2.small]
    healthcheck:
      test: goss -g deploy/swarm/checks/web-goss.yaml validate
      interval: 2s
      timeout: 3s
      retries: 15
networks:
  thirtysec:
web-goss.yaml:
port:
  tcp:5000:
    listening: true
    ip:
      - 0.0.0.0
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 36
- Comments: 250 (68 by maintainers)
We have this problem with swarm too, and it also got to the point where we had to remove swarm from production. As I said before, we have 4 nodes, 3 of which are managers. Containers run evenly over all four: about 15 containers per node, and the services need to connect to each other regardless of where they happen to be running, usually webapp and redis, but we also have load balancers on the manager nodes.
We have used tcpdump in different network namespaces to follow pings and HTTP requests. We saw that the ping was received in the destination container's network namespace, but the container could not send the reply. The thing we noticed was that it tried to send out ARP requests for the source container's IP but did not get any response. After some more digging we found that, in the network namespace for the Docker overlay network on the node with the destination container, the fdb differs from one that works.
An example of an entry that is working:
02:42:ac:14:00:3f dst 10.135.24.176 link-netnsid 0 self permanent
The working entry points the fdb at the right node's IP. We think this shows that the Docker network on the node with the destination container thinks the source container is on the same node; therefore it does not get an answer to the ARP request, or any other traffic. Somehow the creation/deletion of VLANs and MACs creates this problem. It happens both under high load and not (but more often under high load).
This bug seems to affect a lot of people, and there are other issues that may be linked to it: https://github.com/moby/moby/issues/32841
We think this issue should be treated as a showstopper and prioritized accordingly. We want to get back the "Docker Swarm is awesome" feeling and use it in production again.
We’re in production and facing this too. Had to pull all the critical services out of swarm, and was thinking about ways to migrate to something else, because it’s really hurting us.
Please P1 this!
Run "sudo service docker restart" on the host whose service container can't be pinged from the good ones; problem solved. It may hold for a while, until you create new services or update existing ones.
@kleptog the issue you describe should be taken care of by https://github.com/docker/libnetwork/pull/1935
@adityacs how are your tests ?
Just adding my two cents here since I’ve been monitoring this thread for a few weeks now.
We started to see these issues few weeks ago on our production sites and weren’t able to fix them so we had to rollback our services to a previous environment. Once we discovered this thread, we upgraded our test environments to RC3 when it came out and things seemed to work fine. In fact, we have a replay machine that replays everything that happens in our production environment and we left it running continuously in our test environments.
Things were fine for almost a week, but then we started to see the connectivity issues again. Whenever there were problems, I'd SSH into the nodes, check the syslogs, and find the
msg="Neighbor entry already present for IP 10.1.10.67, mac 02:42:0a:00:01:03"
messages. If I don't do anything, things eventually start working again, but in my experience that can take up to 40 minutes. To me it seems RC3 helped but unfortunately didn't eliminate the problem. We are desperately looking for a solution here, since we are growing and don't want to keep running our operations on our old infrastructure.
FWIW We are running our services on Digital Ocean.
@DBLaci @svscorp the possible root cause for
Unable to complete atomic operation, key modified
has been identified and a fix is being tested. You can expect a patch in 17.11.
@sulphur There are many fixes going into 17.06 for these network issues.
@mavenugo @thaJeztah I’m monitoring this issue closely. I have to make a recommendation to higher management whether to adopt Swarm Mode or not. Currently, just using standalone Swarm “Classic”. But looks like we might need to wait and see. For our enterprise customers; dropping network connections sporadically is not an option.
@vieux Maybe we should keep this open until we get confirmation? It’s a pretty serious issue.
Hoi, we've also been having similar issues, with some services not being able to see certain other services. After much debugging I finally tracked it down to the "bridge fdb" being wrong. I have written a script that can be used on Docker Swarm 17.03+ to verify whether all the FDB entries on a host are actually correct w.r.t. the Swarm state: https://gist.github.com/kleptog/9a5aa56e8d2532032b6a7b32bf7cc3aa
It converts the MAC addresses in the tables back to services and verifies that the traffic is being sent to the correct host. As it turns out, our swarm has various inconsistencies. For example (irrelevant info snipped) (Docker 17.06.1):
Here we see the fdb retains references to MAC addresses/services that no longer exist in the swarm. One MAC address appears twice, but the address it is being sent to (172.16.102.101) is the IP of the node itself. It doesn't seem to harm anything, but it looks wrong.
More problematic is when the entries are actually wrong, like in this example (now Docker 17.03.2):
Here we see that two MAC addresses are duplicated, but now the second entry refers to an IP address of another host.
We're currently extending this script so we can track down the exact moment when things go wrong. But I hope it is useful to others trying to debug their own swarms.
Hi guys.
I had the same problem. The cause was the virtual IP address (VIP), which is enabled by default. I turned off the VIP everywhere, and the network connection problems were resolved.
To disable the virtual IP address, you must start the services with the --endpoint-mode dnsrr option. If you can't use this option when starting your services (for example, you start them from a docker-stack.yml), then you can reach your services by the name tasks.my-service (not my-service). In that case the virtual IP address will not be used.
From the documentation: 10.0.9.2 is the virtual IP address, and 10.0.9.4, 10.0.9.3 and 10.0.9.5 are the "real" IP addresses of the containers running your services.
For more information read https://docs.docker.com/engine/swarm/networking/#use-dns-round-robin-for-a-service
@chris-fung It comes back on its own; it happens occasionally, for 1-20 requests in a row. The problem is these small interruptions, when clients actually see a Bad request error in production. I can't catch it or predict when it happens. I just get an error from the Papertrail nginx logs saying the upstream was unreachable, but the rest of the time everything works.
Thanks @fcrisciani. I’ll test and report back
@sharand sure:
1. Run docker network ls and identify the id of the network (1st column).
2. Run nsenter --net=/var/run/docker/netns/1-<network id> (use auto-complete for the last part).
3. Run bridge fdb show, which prints something like 02:42:0a:00:00:0c dev veth1 master br0. Entries whose interface is a vethX are local containers, so you should be able to do a docker inspect <container id> and find the same MAC address. Entries with vxlan0 are remote containers, so you should check that the dst <IP> correctly matches the destination node where the container is running.
The issue we were tracking so far had missing remote entries, or MAC addresses of local containers with entries pointing to other nodes (i.e. entries with vxlan0 as the interface).
My suggestion is to identify one pair of containers for which you are seeing the connectivity issue, and validate with the above steps whether that is the failure signature. If so, we should have a fix in the next 17.10 RC; if that is not the case and the fdb entries point to the correct endpoints, then we should gather further information and move forward from there.
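Following those steps by eye gets tedious, so the cross-check can be partly automated by parsing the bridge fdb show output into something comparable. A small sketch (the helper names are mine; the sample line shapes match the ones quoted in this thread):

```python
import re

# Matches lines like:
#   02:42:0a:00:00:0c dev veth1 master br0
#   02:42:ac:14:00:3f dst 10.135.24.176 link-netnsid 0 self permanent
FDB_LINE = re.compile(
    r"^(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+"
    r"(?:dev\s+(?P<dev>\S+)\s+)?"
    r"(?:dst\s+(?P<dst>\S+))?"
)

def parse_fdb(output):
    """Map MAC -> (device, remote dst IP or None) from `bridge fdb show`."""
    entries = {}
    for line in output.splitlines():
        m = FDB_LINE.match(line.strip())
        if m:
            entries[m.group("mac")] = (m.group("dev"), m.group("dst"))
    return entries

def suspicious(entries, local_macs):
    """MACs of local containers that the fdb claims live on a remote node."""
    return [mac for mac, (dev, dst) in entries.items()
            if mac in local_macs and dst is not None]
```

Feeding it the output from inside the overlay network namespace, plus the MAC addresses of the containers docker inspect says are local, flags exactly the bad signature described above: a local container's MAC pointing at another node.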
Same problem for us on Azure or Openstack environment (Ubuntu 16.04, Docker version 17.07.0-ce, build 8784753).
I found a lot of :
kernel: IPVS: __ip_vs_del_service: enter
in the log when the problem occurs. This issue has become very critical for us.
We use AWS ECS in production, and I have been running a docker swarm on bare metal, ubuntu 16.04.3 and docker 17.06.2-ce. We have been running the swarm since docker 1.11 I believe? We moved some less important production apps into swarm about 4 months ago. We have had this issue the entire time, and the issue used to be that the services would just not be reachable after a period, and restarting docker, etc would resolve, for a while. Now, we see requests that should take 400 ms, that sometimes take 10-30s. The biggest culprit seems to be redis running on our manager nodes, with apps running on worker nodes.
We have been using a VERY stable workaround for about 3 months. On 2 of the manager nodes we also run haproxy. All of our external traffic goes through this, so none of our apps are published externally, only internally. If I point one of my apps on the workers at redis within the overlay network, I get the intermittent "hangs". If I instead point the worker app at the external haproxy address, which maps through to the same overlay network, everything is solid.
I thought with this recent update to 17.06.2 that things seemed better, based on an initial test, so I deployed a new app with direct overlay communication to redis. It definitely hangs periodically. So I switched my redis URI to route through haproxy instead (still going to the same redis service, just routed through haproxy outside the overlay), and no problems.
I stumbled onto this workaround after noticing that I had a couple of apps running outside the swarm hitting redis through haproxy, and they never burped. Here is a view of NewRelic response times over 3 hours. You can clearly see where I made the switch back to using the haproxy workaround. Now my connections are SUPER stable at 400ms or so.
And here is just the last 60 mins.
I would love to see this resolved, I love everything ELSE about swarm and want to replace our production AWS ECS clusters.
@odelreym The problem you are running into is because IPVS removes idle connections after 900 seconds. For long running sessions that can be idle for > 15 mins you have to tweak the tcp keepalive settings. Some more info here & here
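Besides host-level sysctls, keepalive can also be set per socket from the application, so that idle connections through IPVS send probes well before the 900-second expiry. A sketch in Python (the 600-second idle value is my assumption, chosen only because it is safely under 900; the TCP_KEEP* options are Linux-specific):

```python
import socket

def enable_keepalive(sock, idle=600, interval=30, count=3):
    """Send TCP keepalive probes after `idle` seconds of inactivity,
    then every `interval` seconds, giving up after `count` failed probes.
    The probes keep IPVS (which expires idle connections after 900s)
    from silently dropping long-lived, mostly-idle connections."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock

# Usage sketch: call enable_keepalive(socket.socket()) before connect();
# many DB drivers/pools expose equivalent knobs in their config instead.
```

This is the per-application equivalent of tweaking net.ipv4.tcp_keepalive_* on the host, and it avoids changing behaviour for every other service on the node.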
We’re running:
What I see is that over time I start getting timeouts to swarm services. Restarting docker on the problematic nodes fixes it. Deleting all services deployed to those nodes also fixes it; my assumption is the latter works because it deletes and re-creates the networks.
This is kind of scary considering we only find out as services start degrading and failing.
Output of Docker version
Output of docker info
We have the same issue on old Docker 1.12.6: the same IPVS: __ip_vs_del_service: enter kernel messages, and virtual (and sometimes bare-metal) node services restarting with "suspect timeout reached" in the docker log. We are also considering leaving swarm because of that 😦((
@dtmistry @Nossnevs 17.10-rc2 is available, let me know if you guys can give it a try and report your results
Well, the best part is that Docker EE (the production-certified version) has exactly the same bug…
The other "temporary" workaround is to write a monitoring script that pings all containers from all containers. Mostly you will see 1 host affected, so then you need to drain or reboot it.
I know it is ugly, but a script that pings is very light and can be run every minute. I believe in the Swarm team and Docker, and I'm sure they will figure it out. ❤️
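That kind of all-pairs ping monitor can be sketched roughly like this (hypothetical: the docker exec invocation and the container-name/IP mapping are assumptions you would fill in from your own task list):

```python
import itertools
import subprocess

def all_pairs(containers):
    """Every ordered (source, destination) pair of distinct containers."""
    return list(itertools.permutations(containers, 2))

def ping_from(src_container, dst_ip):
    """Ping dst_ip once from inside src_container via `docker exec`.
    Returns True when the probe gets a reply within 2 seconds."""
    result = subprocess.run(
        ["docker", "exec", src_container, "ping", "-c", "1", "-W", "2", dst_ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def sweep(containers, ips):
    """Report the (src, dst) pairs that cannot reach each other.
    `ips` maps container name -> overlay IP."""
    return [(src, dst) for src, dst in all_pairs(containers)
            if not ping_from(src, ips[dst])]
```

Run from cron every minute on each node, a non-empty sweep() result tells you which host to drain or reboot before clients start seeing errors.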
@odelreym thanks for sharing this; I'm sure it will help a lot of people seeing a similar issue. I would actually look into whether there is a variable that can be tweaked to change it at the host level.
We have upgraded our nodes to 17.06 and at first glance it has been smooth sailing with almost no issues (I will come back with more on this later). But we have removed a lot of the load on the nodes, because we moved a lot of production code away from Docker. So I'm trying to reproduce the problem on our 17.06 swarm cluster with this Python code https://github.com/Nossnevs/docker_swarm_test, to add some of the stress we had back when we first noticed the bug.
The code starts up two services, test_a and test_b, which talk to each other over HTTP using the service name as the domain name. After a specified time it moves the containers to a random specified node, and they start talking again. You can also specify the number of service pairs.
I started with 2 containers, which worked fine; the only transmission errors I found in Kibana were when a container moved (which is to be expected). I increased the pairs to 10, and after 2 moves one of the pairs had continuous errors; unfortunately, other services also started acting up. In the end I had to restart docker on all nodes.
I can’t repeat this on our cluster because we have some services that need to be up and running.
But if you would like to reproduce the problem, you may use this code at your own risk. Feel free to improve it.
I tested 17.06-rc3, running a swarm cluster on top of Ubuntu 16.04.2. As far as I have tested, I am not facing any issues now; services are able to communicate with each other. Hope I won't be getting any network issues with 17.06 😃
Hi All
Our problem is like this:
A swarm mode cluster with two nodes and a service in global mode. On one node, let the container ping the other container of the same service. On the other node, restart the docker engine every minute.
After a few minutes there is a "Destination Host Unreachable" error, while the two containers' IPs remain unchanged. The situation is unrecoverable until the docker engine is restarted again.
the docker version is: Version: 17.03.1-ce
We've found that the ARP entry for the continuously restarted container has not been synchronized into the neighbor table of the overlay network namespace.
@muhammadwaheed @thaJeztah I debugged a lot and tried some different environments for my network-loss issues between swarm nodes.
We tried multiple VMware hosts/clusters and hardware, and also tested docker versions between 17.03 and 17.05. It looks like a virtualization environment causes some weird issues with docker; we switched to a physical environment and it works fine without any issues.
We have been using solution 4 since Monday, and we have received very positive feedback about the performance and stability of the applications running on it.
We have the same problem. We have 4 nodes that lose connection to each other, and it is not a network problem. It seems to happen randomly. After a couple of minutes the connections come back and the swarm heals itself, ending with all 4 nodes working.
Docker 17.03
@fcrisciani Oh 😄 I'm just trying to work out what net.ipv4.tcp_timestamps and net.ipv4.tcp_window_scaling have to do with the seemingly random networking issues I see in my setup after a few weeks of runtime 😃
I had a similar issue: everything started fine for a few minutes, then 1 of the 2 replicas stopped working and all of the outbound (_default network traffic as well) and ingress traffic got stuck (timed out) from the container and also from the (failing) host itself. After a while we did a tcpdump and realized that we had run into this problem.
Running sudo sysctl -w net.ipv4.tcp_timestamps=0; sudo sysctl -w net.ipv4.tcp_window_scaling=0 on the hosts actually solved our problems (beware, this is temporary; to make it permanent I had to modify /etc/sysctl.conf). So I would recommend listing your TCP sysctl settings with sudo sysctl -p | grep tcp and checking whether those two are enabled.
UPDATE: my final setup was like this:
It has been running for more than 7 days now in 16 swarm clusters of 3+ servers each, no problem at all; all network glitches removed. Hope you guys find it useful.
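To check whether those two settings are still enabled across a fleet of hosts, the sysctl output can be parsed mechanically. A small sketch (the helper names are mine; the input lines mirror standard `sysctl` output):

```python
def parse_sysctl(output):
    """Parse `sysctl` output lines like `net.ipv4.tcp_timestamps = 1`
    into a {key: value} dict."""
    settings = {}
    for line in output.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

def suspect_flags(settings):
    """Return the two flags from this workaround that are still enabled."""
    watched = ("net.ipv4.tcp_timestamps", "net.ipv4.tcp_window_scaling")
    return [k for k in watched if settings.get(k) == "1"]
```

Piping each host's sysctl listing through parse_sysctl() and alerting on a non-empty suspect_flags() result makes it easy to spot nodes where the workaround has not been applied or did not survive a reboot.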
@DBLaci I had similar issues on 17.09. Upgrading to 17.10.0-ce solved the issue we had. It was related to the overlay fix for transient IP reuse, docker/libnetwork#1935.
I didn't create a lot of networks manually; most of them were created using stack deploy.
You need to check ping, as @fcrisciani mentioned, to try to isolate the problem you have.
@DBLaci once you hit the "connection refused" condition, can you try to open a shell in one src and one dst container and ping each other using the container IP directly, not the VIP?
@fcrisciani There is no exact way to reproduce it, because the problem appears somewhat randomly, or after a time (possibly idle):
Docker 17.11-rc3 (at the moment), Ubuntu 16.04 (default sysctl now) on AWS; CPU/memory is not a bottleneck. The swarm nodes (5) are on one subnet, with 3 manager nodes (3/5).
We have the same problem on Ubuntu 16.06 Docker version 17.09.0-ce
@rgardam sorry to bust your theory, but I've got this problem in our private data center, where docker runs on bare-metal Linux (no VMware or anything like that).
@omerh we are tracking the issue here https://github.com/moby/moby/issues/35310 and the PR for the fix is here https://github.com/docker/libnetwork/pull/2004
@fcrisciani, no, sorry. The issue was in a production environment. I still have the old masters running; I'll try to reproduce it later in the coming week.
During my upgrades I also encountered "Unable to complete atomic operation, key modified". I made the update on a drained node; not sure if it was demoted, though (I was running an all-nodes-as-managers setup). Using 17.10 I encountered overlay networks that were not visible, until I promoted the node (so back to an all-managers setup) (this seems to be another issue, though).
@fcrisciani After upgrading to 17.10-rc2 we had issues starting the services that were attached to a user-defined overlay network, specifically the error below:
"Unable to complete atomic operation, key modified"
But I believe the way the upgrade happened was the problem.
Once 17.10 was available we upgraded each node by draining/promoting/demoting, and I've not seen any issues since.
@fcrisciani Hmmm, it doesn’t work in my case. I was using endpoint_mode: dnsrr until the moment 17.10 was released, because that was helping (though nginx caches the IP, and on a service restart it kept resolving to the old IP).
Now I have removed it again, and the situation has returned: at a random time a tool (behind nginx) becomes unresponsive. In the logs, nginx times out connecting to LDAP (which is on another node). Only a service restart helps.
But in 17.10, restarting the docker service leaves the node in state “Down” in docker node ls, so I have to reboot the machine to make it “Ready” again 😦 That is of course another issue, and a non-blocking one for sure 😃
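For reference, the DNSRR setting discussed in this thread lives under deploy in a v3.3+ compose file; a sketch (service name and image are examples, and note that dnsrr services cannot use ingress-published ports):

```yaml
# docker-compose.yml excerpt — a sketch, assuming compose file version >= 3.3
version: "3.3"
services:
  web:
    image: myapp:latest        # example image
    deploy:
      endpoint_mode: dnsrr     # DNS round-robin task IPs instead of a VIP
```

With dnsrr, the service name resolves directly to the task IPs, bypassing the IPVS/VIP path that several commenters suspect here.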
@SunDeryInc isn’t it mentioned there?
@dtmistry @Nossnevs the next RC of 17.10 has 2 fixes for that case: first, IP allocation will be sequential, to avoid the possibility of IP overlap; second, there is a fix in the overlay driver to keep a consistent configuration in case of IP overlap. It would be great if you guys could give feedback on it as it becomes available.
For what it’s worth, I swapped out our swarm stack for k8s weeks ago, and instead of crashing every few days it has crashed never. Of course YMMV, but k8s solved all of my instability issues so far.
@sanimej I have been testing this over the weekend, and after lowering the keepalive parameter to 100 seconds the problem is gone (I think any timeout parameter below 30 min is OK).
Parameter setup for the Hikari database connection pool:
I can always see 10 established connections to the Postgres container (that’s what is defined in the previous config => good).
Every 900 seconds I can see these types of messages in catalina.out:
But now it doesn’t affect the application (maybe it is something that Hikari does to monitor whether the socket is alive or not, but it seems it is not important).
I don’t know if lowering the keepalive to 100 seconds can have collateral effects on another application/container (I don’t expect it, because the other services that I have are pure HTTP), but so far it is good for me.
Anyway, I will continue testing this over the week.
Nacho.
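For anyone wanting to try the same workaround: the knobs involved are the kernel’s TCP keepalive sysctls. A sketch of a drop-in file (the values follow what is described above, not a general recommendation):

```
# /etc/sysctl.d/99-tcp-keepalive.conf — example values only
net.ipv4.tcp_keepalive_time = 100    # start probing after 100 s of idle
net.ipv4.tcp_keepalive_intvl = 30    # interval between probes
net.ipv4.tcp_keepalive_probes = 5    # probes before declaring the peer dead
```

Apply with `sysctl --system`. The idea is that idle pooled connections get probed often enough that the IPVS connection table (default TCP timeout 900 s) never expires them.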
Thanks @fcrisciani and the rest for the good job done.
I can confirm what @odelreym mentions. Any persistent/pooled connections broke after a certain amount of time for us. All our transient connections always worked fine. This behaviour was present even after an update to 17.06-rc. An alternative to lowering timeouts is to set endpoint_mode to dnsrr for the services with incoming persistent connections (if suitable for your services). After doing that, things are stable and working fine.

Hi there,
I have been suffering from these types of problems with swarm for a while (I was on 17.05, CentOS 7.3).
Last Friday I updated my 3-node test cluster to 17.06 and so far it is working fine, but the issue that I had with a Tomcat container connected to a PostgreSQL container, each one deployed on a different docker host, was still there.
The symptom was that each time I tried to log into the application (30/35 min after the stack was launched into the swarm), Tomcat wanted to reuse an established connection to Postgres (checked using netstat: the sockets were there) to verify the username/password, but the browser kept waiting for a response that never came back.
So, looking deeper into the logs (searching for values similar to 30 min / 1800 seconds), I saw this:
and after changing the connection timeout defined in Hikari to 900000 ms (15 min) the stack worked properly.
I think IPVS expires idle connections 20-30 min after the last time they were used, and if the application is not able to bring them back to life, or a broken pipe happens, the service seems to hang… at least in my case.
I also tested a ‘flow-proxy <-> nginx’ stack running separately… and on 17.05, 30-40 min later, from the flow proxy I was able to ping the other container, but I wasn’t able to, for instance, telnet to the nginx service.
Now, I can confirm it works in 17.06
I hope this information helps to anyone
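The IPVS-timeout theory above can be checked directly on a swarm node (hedged: requires root and the `ipvsadm` tool; `/var/run/docker/netns/ingress_sbox` is where docker keeps the ingress sandbox namespace on typical installs, and the path may differ for user-defined overlay networks):

```shell
# Show the IPVS connection timeouts inside docker's ingress sandbox.
# The default output "Timeout (tcp tcpfin udp): 900 120 300" means idle
# TCP entries are dropped after 900 s, matching the ~15-30 min hangs
# described in this thread.
nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -L --timeout
```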
@fcrisciani
I have been testing the network connectivity between the nodes and found it to be somewhat flaky at times, probably related to network congestion in DO’s datacenters, but this is something I would expect to be quite a common scenario in cloud environments. I mentioned this specifically here and here quite a while ago.
What troubles me is that even if there are brief connectivity issues between the nodes, it cannot be viable for the overlay network to become unstable for 30-60 minutes or more across multiple services. In my tests a connectivity issue can be a single failed ping request followed by immediately successful ones. As I mentioned in the linked comments, it doesn’t look like swarm is handling these connectivity issues in a reliable way, mostly since the nodes are kept in the active state and services keep running, but with a wide range of connectivity issues lingering in the overlay network even when node connectivity is back in business pretty much right away.
What do you mean by glitches? Do you mean that brief connectivity issues are expected to cause these rather serious problems for substantial periods of time?
For reference here is an extract of kernel parameters used on these machines to cope with a lot of http traffic and some other stuff.
Anyone else that have experience or heard of these issues on digital ocean specifically? I have reached out to them regarding the network congestion but it seems to be something they can’t really do much about.
I can confirm that this is still present to some extent in 17.06.

We are getting a lot of the same memberlist warnings and failed suspects, multiple times per day. We are not running on AWS. The machines are relatively high spec (12 CPU, 32 GB memory), and they are not overloaded.
These memberlist warnings are generally interleaved with node join events, which looks kind of weird.
The whole suite of the previously mentioned memberlist weirdness is something we still see frequently.
After upgrading I actually destroyed the swarm, rebooted all machines, and recreated the whole swarm and all services, and while that was coming back up I got a lot of these errors:
I attached some findings while debugging this stuff in a similar issue (on 17.05) but I haven’t really had any more feedback on this.
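To quantify how often this churn happens, the memberlist events can be pulled out of the daemon logs; a sketch (the unit name `docker.service` may differ per distro):

```shell
# Count suspect/failed/refuted memberlist events per hour over the last day:
journalctl -u docker.service --since "24 hours ago" --no-pager \
  | grep -E 'memberlist.*(suspect|failed|Refuting)' \
  | awk '{print $1, $2, substr($3, 1, 2) ":00"}' \
  | sort | uniq -c
```

A steady trickle of these, correlated with the connection errors, points at gossip keepalive problems (CPU starvation, packet loss on 7946/tcp+udp) rather than at the data plane.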
@fcrisciani additionally, you have CloudWatch metrics that show the amount of consumed and remaining instance credits. I believe those are the best metrics to use.
Quick heads up for whoever uses AWS. Check the description of the instance type carefully: https://aws.amazon.com/ec2/instance-types. If you are using T2 instances you have to be very careful, because they do not give a guarantee on CPU resources. In two words: if you use up all your CPU credits, your VM won’t receive physical CPU and so won’t run. This creates connectivity issues between the nodes, where the distributed database marks other nodes as not reachable and cleans them up. There are several ways to identify the problem: grep for
memberlist
in the docker logs; if you see messages like "node failed" or "suspect", that means the distributed database had issues with the keepalive to other nodes.

I can also confirm that since 17.06-rc3, cluster stability has improved and the intermittent outages went away. This makes swarm mode and overlay networks usable for us again.
@sulphur I have been running 17.06-rc3 for a week now and haven’t faced any network issues. We have some 20 microservices deployed in the swarm cluster, and service-to-service communication is working fine as of now.
@eljrax I am deploying the same way as you mentioned, “by declaring external in the compose file.” Not sure it’s working only because of this.
@vovimayhem sorry, but time is something I don’t have for this anymore, until I can show management that this is solved. But if I find some time I will try. We have the old nodes up and running, and they are experiencing the problem now too.
@fcrisciani @mavenugo https://github.com/docker/libnetwork/pull/1792 looks promising. But this issue also has a state where the problem is permanent, not something that disappears after 300 seconds. Is it possible that the serf-like tool that docker uses has some problems, so that changes are not spread between nodes, as @BSWANG mentioned?
(When I say “we” I mean @Paxxi, @AndreasSundstrom and myself. We have worked together to analyze this issue.)
It’s happening all the time, after a few minutes.
Cluster deployed on AWS with
docker-machine
Swarm and Docker with the experimental flag.
docker version
docker info
I’ve been seeing the same issue. Some (possibly all) overlay IPs stop responding; DNS still resolves the IP, but connections to a port on the IP hang indefinitely. Restarting just the docker daemon sometimes solves the issue, but today we needed a full reboot to recover. Services are running inside swarm mode, the networks created are “attachable”, and sometimes the target IP is a standalone container running outside of swarm mode. If it’s helpful, I also have a daemon-data and goroutine-stacks dump that were generated during this issue.
Docker version is 17.03.1-ce (similar issue was seen with 1.13.1) Host OS is RHEL 7.3 with kernel 3.10.
Looking through my logs after restarting just dockerd, on host2 I see:
On host1, I’m seeing:
I also updated the swarm to 17.03.1-ce and still hit the same problem just now. After restarting the docker engine on the problem host everything goes back to normal, but it will happen again.