moby: Swarm routing failures

We are using swarm services with published ports to provide access to a number of services and web apps. After a service update, we see intermittent connection failures (no response / timeouts). The service being called does not appear to receive the request; it seems to get lost somewhere in the overlay network.

Steps to reproduce the issue:

  1. Deploy an HTTP-based service with a published port and 2 or more replicas (using docker service create)
  2. Upgrade the service to a new version (docker service update)
  3. Perform repeated requests against a machine using curl http://machine:port (a command sketch follows below)
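
A minimal command sketch of those steps (service name, image, and port below are placeholders, not taken from the original report):

# 1. create an HTTP service with a published port and 2 replicas
docker service create --name web --replicas 2 -p 8080:80 nginx:1.13

# 2. roll out a new version of the service
docker service update --image nginx:1.14 web

# 3. probe the published port repeatedly from a client machine
while true; do curl -sS -m 5 -o /dev/null -w '%{http_code}\n' http://machine:8080/; sleep 1; done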

Describe the results you received: In the worst case, every other request timed out or received no response.

Describe the results you expected: Every request is successfully routed to a running task of the service.

Additional information you deem important (e.g. issue happens only occasionally): The problem is generally (but not always) resolved via a docker service rm / docker service create to reset the service; however, this is not realistic in a production environment.
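
For reference, the reset workaround mentioned above amounts to something like this (names and options are illustrative):

# tear the service down and recreate it from scratch; not viable in production
docker service rm web
docker service create --name web --replicas 2 -p 8080:80 myorg/web:latest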

We are using labels to constrain specific services to certain nodes. We are only routing requests to the servers meeting these criteria, although I know that we could route to any server in the swarm if we wanted to.

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 08:01:32 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 08:01:32 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info: (from one of the swarm managers)

Containers: 14
 Running: 6
 Paused: 0
 Stopped: 8
Images: 16
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 212
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: 1wspyg3dm7eik0pq1ut9iaotg
 Is Manager: true
 ClusterID: 0liuti0ieescusyzoftqhmj8t
 Managers: 3
 Nodes: 14
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.17.2.11
 Manager Addresses:
  0.0.0.0:2377
  172.17.2.11:2377
  172.17.2.12:2377
  172.17.2.13:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-66-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 6.804 GiB
Name: stg-ne-mgr01
ID: YRT7:TRHX:5J33:Y2ZE:EBQT:4EHP:PWXS:V6RV:T6NU:CLKH:PYSM:F4PA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.): Running on Azure VMs. Host OS is Ubuntu 16.04 LTS

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 62 (16 by maintainers)

Most upvoted comments

Is there any workaround? I am comparing swarm and k8s to deploy our API right now. I’d love to use swarm but this issue is preventing us from moving forward. I am very new to docker so I’m not sure if swarm is even production ready. Any advice?

@diegito We have tested 17.06.2-ce and we are still facing timeout issues as well, which is pretty frustrating. Docker swarm is basically unusable for production. We have been testing it on and off since swarm mode was introduced and the issues always come back intermittently; it doesn’t take long for them to reappear. Also, there shouldn’t be any timeouts during a rolling update, otherwise what is the point of swarm mode?

Just to give some more input here:

  • On our end, things seemed to have stabilized with v17.06.2
  • We tried to upgrade to v17.09 but the timeouts came back so we did not move forward.
  • I’ll test with v17.12 and post here if anything new happens

@diegito I’ve updated the title. I’m surprised that we haven’t seen a fix for this too though, it’s an issue we’re battling with on a fairly regular basis at the moment.

I see the network routing failures at any time, not only on service update, and the issue lingers even after restarting a service. I fixed it with what I mentioned before, but… what’s the point of having a Swarm if you can’t use its features?

@ben-moore maybe we can change the title of the issue to simply “Swarm routing failures”?

It’s a really big issue actually… I’m surprised there is no hotfix or any update from the docker team on this. Or am I wrong?

Running into the same issue here with a 17.03 Windows Server worker and a 17.06 node running Ubuntu Server 16.04. I need to get this working within the next 5 business days or I am hosed.

@fcrisciani thank you so much, that was it: I added a health check command to my service and I can now do rolling updates without any interruption of service. I still have to test scaling up/down, but at least restarting the service as it is now works.

I can’t recall seeing any mention in the official documentation that a service health check command is a prerequisite for zero-loss rolling updates, so I think the documentation might need some adapting to make this clearer.
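
For readers hitting the same problem, a hedged sketch of what the health check change can look like (endpoint, image, and timings are illustrative, not taken from this thread); with a health check defined, the routing mesh should only send traffic to a task once it reports healthy:

docker service create --name web --replicas 2 -p 8080:80 \
  --health-cmd 'curl -fsS http://localhost/health || exit 1' \
  --health-interval 5s --health-timeout 3s --health-retries 3 \
  myorg/web:latest

The same --health-* flags are accepted by docker service update for an existing service.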

@fcrisciani Thank you, I went back to VIP mode after changing the keepalive and that seems to have done the trick. This is with 17.12.1. @ushuz You should try this out.

@fcrisciani Is this being tracked somewhere? There seem to be quite a few threads about timeout issues with swarm. Where do we need to enable the keepalive?
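
The exact keepalive change is not spelled out in these comments; the variant most often referenced in related threads is lowering the kernel TCP keepalive so that long-lived idle connections are refreshed before the IPVS idle timeout (roughly 900 seconds) expires and the load balancer drops their state. A hedged sketch, assuming that is what is meant here:

# on the hosts (or inside the containers holding long-lived idle connections)
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=10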

I tested this with 17.12.1 and we are still experiencing timeout issues. FYI, I tried vkozlovski’s suggestion here: https://github.com/moby/moby/issues/32738 and the timeouts seemed to take longer to reappear (from a few minutes to days).

I will try with the latest and see.

I have been struggling with very similar issues. Here’s my configuration: a 6-node Swarm Mode cluster running on Azure (3 managers, 3 workers).

I have defined an overlay network for all my services, called ‘mynet’.

I have an nginx service (3 tasks/replicas) running on manager nodes only (with label constraints), with port 80 published. The nginx service acts as a reverse proxy (layer 7 router) with the following config:

http {  
  server {
    listen       80;

    location /api/ {
      proxy_pass       http://my-api:80/;
      proxy_http_version 1.1;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection keep-alive;
      proxy_set_header Host $host;
      proxy_cache_bypass $http_upgrade;
    }
  }  
}

The service is created like this:

docker service create --with-registry-auth --replicas 3 --name rproxy -p 80:80 --network=mynet --detach --constraint 'node.role==manager' some/myproxy:v1

Now, my API app is running as a service too, pretty much like:

docker service create --with-registry-auth --replicas 1 --name my-api --network=mynet  --constraint 'node.role==worker' some/image:v1

This was working for quite a while; I could hit the manager nodes from outside (via an Azure load balancer) like:

http://loadbalancer.address.com/api/health

However, after a while we started getting 502 Bad Gateway errors when we pushed a new version of the API image. To troubleshoot, I ssh’d into the nginx containers and saw that I can’t curl a URL like http://my-api:80/.

It seems that Docker’s DNS is not resolving service names to IPs. The containers look fine otherwise (docker service ls, logs, ps, inspect, etc.).
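
A few checks that can help narrow down whether name resolution or forwarding is at fault (the service name my-api is taken from the commands above; the rest is a sketch, and nslookup may need to be installed in the image):

# from inside one of the nginx tasks on the overlay network
nslookup my-api          # should return the service VIP on the overlay network
nslookup tasks.my-api    # should return one A record per running task
curl -v --max-time 5 http://my-api:80/

# from a manager node, check what the swarm thinks the service endpoint is
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' my-api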

Docker version I have used:

:~$ docker version
Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:38 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:41:20 2017
 OS/Arch:      linux/amd64
 Experimental: false

I think it’s a hard-to-reproduce bug: often I scale the service (the API app) down to 0, wait a minute, scale it back up, and things start working again. This makes me suspect a bug in the Docker service name resolution process.
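
For reference, the scale-down/scale-up workaround described above is roughly:

docker service scale my-api=0
sleep 60          # wait a minute, as described above
docker service scale my-api=1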

Could anybody from the Docker team please shed a bit of light here? I have found plenty of people raising similar issues with Swarm, so I feel less confident about running Swarm in production. If this issue doesn’t get addressed, we might need to switch to K8s for production loads.

Inside dmesg I could not see anything abnormal besides one message that does not seem to be a concern: [22592778.495781] aufs au_opts_verify:1597:dockerd[30266]: dirperm1 breaks the protection by the permission bits on the lower branch. Our swarm spans 3 machines with 3 managers, and there were resources available on all of them. We also moved (using swarm filters and tags) the external-facing services (services with exposed ports) to a machine with very low load, but the result was the same: occasional timeouts. The number of requests hitting the server was not that high either, and file descriptors were not an issue as far as I could see.

These are some sample logs we collected from an nginx instance in front of the cluster; 192.168.51.* are our machines.

2017/07/28 06:37:37 [error] 22380#22380: *18939 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.108:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 06:38:37 [error] 22380#22380: *18939 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.109:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 06:39:37 [error] 22380#22380: *18939 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 06:45:50 [error] 22381#22381: *19002 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 06:47:57 [error] 22381#22381: *19036 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard"
2017/07/28 06:59:58 [error] 22384#22384: *19095 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.108:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard"
2017/07/28 07:00:58 [error] 22384#22384: *19095 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.109:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 07:01:58 [error] 22384#22384: *19095 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 07:30:07 [error] 22378#22378: *19263 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.109:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 07:31:07 [error] 22378#22378: *19263 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 07:32:07 [error] 22378#22378: *19263 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.108:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:04:09 [error] 22381#22381: *19730 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.108:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:05:09 [error] 22381#22381: *19730 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.109:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:06:09 [error] 22381#22381: *19730 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:11:10 [error] 22383#22383: *19780 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.108:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:12:10 [error] 22383#22383: *19780 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:13:10 [error] 22383#22383: *19780 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.109:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:24:32 [error] 22383#22383: *19979 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:25:32 [error] 22383#22383: *19979 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.109:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:26:32 [error] 22383#22383: *19979 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.108:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:31:09 [error] 22388#22388: *20073 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:32:09 [error] 22388#22388: *20073 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.108:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:33:09 [error] 22388#22388: *20073 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.109:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:41:21 [error] 22389#22389: *20275 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.109:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:42:21 [error] 22389#22389: *20275 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 08:43:21 [error] 22389#22389: *20275 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.108:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 09:22:56 [crit] 22385#22385: *20711 SSL_do_handshake() failed (SSL: error:14094085:SSL routines:ssl3_read_bytes:ccs received early) while SSL handshaking, client: xxx.xxx.xxx.xxx, server: 0.0.0.0:443
2017/07/28 09:22:56 [crit] 22385#22385: *20712 SSL_do_handshake() failed (SSL: error:14094085:SSL routines:ssl3_read_bytes:ccs received early) while SSL handshaking, client: xxx.xxx.xxx.xxx, server: 0.0.0.0:443
2017/07/28 10:08:00 [error] 22389#22389: *21109 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.108:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 10:09:00 [error] 22389#22389: *21109 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.107:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"
2017/07/28 10:10:00 [error] 22389#22389: *21109 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: test.domain.com, request: "DELETE /testurl HTTP/1.1", upstream: "http://192.168.51.109:7500/testurl", host: "test.domain.com", referrer: "https://test.domain.com/dashboard/"

Thanks @fcrisciani. I don’t think that issue relates to ours. What I see seems to happen even on a new instance that has never been started before, and the pattern looks random. If I have n instances of a container inside the Swarm and try to contact the service port from outside the Swarm, I sometimes get a request timeout. There’s no consistency in what happens. The same occurs for services with a single instance: they are unreachable / time out at times. We tried restarting them, and also restarting the Docker daemon, but it does not seem to help. External traffic runs into the same issue over and over.

Let me know if I can provide any other detail that would help you dig into this issue, or anything you might want me to test on my end.

I agree with @mvdstam; this looks like a dup of https://github.com/moby/moby/issues/30321

There seems to be a relevant fix here waiting to be merged.

Yep, I’m still seeing this issue on 17.06 as well. Can someone from the Docker team give feedback on this? It definitely doesn’t seem to be the same issue that was linked when I reported it.

Hi, I also have the same issue with published services, from 1.13 through 17.06. We see the problem on both Ubuntu 14.04 and 16.04.

Sometimes, scaling back to 1 fixes the issue.

FYI in front of my Swarm cluster, I have an Azure LB which balances the incoming requests to my 3 nodes.

This is really impacting us; we can’t scale…

We seem to be having the same problem. We’re on the latest version. We run a curl command to the service (nginx) and it will work sometimes, then stop working for a long time. If we leave it alone, it will start working again for a while.

We scaled down the nginx service to just 1 instance to troubleshoot, but the problems persist.

We are running the curl command with a “watch” command every 2 seconds.
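
A sketch of that kind of probe (host and port are placeholders):

# re-issue the request every 2 seconds and print the HTTP status (000 indicates a timeout/failure)
watch -n 2 "curl -sS -m 5 -o /dev/null -w '%{http_code}\n' http://node:8080/"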

Interestingly, we can reproduce this with many clients connecting to the service, but the connection timeouts are host-specific. In other words, while one client is being timed out, another client can still connect. Sometimes just one client can’t connect while all others can, and other times one or two can’t. We never have a situation where nobody can connect.

This occurs even if we run the curl command on the hardware node where the nginx service is running, which rules out the networking outside of the Docker host.

In all cases, if we wait a while, we can connect again.

Pings never stop working during the timeouts.

We can reproduce this on two swarms on different hardware, in different locations, and on different versions (17.05.0-ce and 1.12.5).

We also found that if we keep an SSL session open, it will keep working. This seems to be a problem only with new sessions.
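
One way to see the difference between a reused session and fresh connections with curl (URLs are placeholders): giving curl several URLs in one invocation reuses the underlying connection, while separate invocations each open a brand-new one, which is the failing case described above.

# single invocation: the TCP/TLS connection is reused for the second request
curl -v -o /dev/null -o /dev/null https://node:8443/ https://node:8443/

# separate invocations: each request opens a new connection
for i in $(seq 1 10); do curl -sS -m 5 -o /dev/null -w '%{http_code}\n' https://node:8443/; done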