traefik: Gateway Timeout when rolling-updating a scaled service in Docker swarm mode

What version of Traefik are you using (traefik version)?

v1.2.3

What is your environment & configuration (arguments, toml…)?

docker service create \
    --name traefik \
    --constraint=node.role==manager \
    --publish 80:80 --publish 8080:8080 \
    --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
    --network traefik-net \
    traefik \
    --docker \
    --docker.swarmmode \
    --docker.domain=traefik \
    --docker.watch \
    --web

What did you do?

I am following the swarm mode user guide (https://docs.traefik.io/user-guide/swarm-mode/) directly on my Docker manager node and have set up the whoami0 service, scaled up to 2 tasks for redundancy. I then wanted to test a rolling update of the service using docker service update --force --update-delay=10s whoami0 and noticed that during the rolling update Traefik returns a Gateway Timeout twice. As far as I understand rolling updates in swarm mode, there should be no downtime, because only one container/task gets stopped at a time; there is always one container running while the other one is being restarted.
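
For reference, a minimal sketch of the setup being tested, assuming the service name, image, and label from the swarm-mode guide:

# whoami service from the guide (image and label assumed from the guide)
docker service create \
    --name whoami0 \
    --label traefik.port=80 \
    --network traefik-net \
    emilevauge/whoami

# scale to 2 tasks for redundancy
docker service scale whoami0=2

# rolling update that triggers the Gateway Timeouts
docker service update --force --update-delay=10s whoami0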

What did you expect to see?

Zero downtime

What did you see instead?

Gateway Timeout

If applicable, please paste the log output in debug mode (--debug switch)

time="2017-04-21T21:47:22Z" level=warning msg="Error forwarding to http://10.0.0.5:80, err: dial tcp 10.0.0.5:80: i/o timeout" 

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 21 (9 by maintainers)

Most upvoted comments

@mvdstam as far as I know, from version 1.2+, Traefik by default does not use the swarm service VIP and instead uses the IPs of each task/container (that way it can support Traefik features like stickiness).

From encountering a similar issue in the past, I noticed that using the swarm service VIP load balancer (traefik.backend.loadbalancer.swarm=true) produces better results regarding availability during updates.
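
As a minimal sketch, switching an existing service to the VIP mode could look like this (label name as referenced above; whoami0 is the service from the guide):

# route through the swarm service VIP instead of per-task IPs
docker service update \
    --label-add traefik.backend.loadbalancer.swarm=true \
    whoami0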

If I remember correctly, combining it with image/service health checks actually achieved zero-downtime updates, but the load balancing itself didn't work so well (requests were not distributed equally between containers).
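
A sketch of the health-checking part, assuming the whoami image serves HTTP on port 80 and ships wget (the health-check command itself is an assumption):

# mark a task healthy only once it answers HTTP; swarm then avoids
# routing to tasks that are still starting up or already shutting down
docker service update \
    --health-cmd "wget -q --spider http://localhost/ || exit 1" \
    --health-interval 5s \
    --health-timeout 3s \
    --health-retries 3 \
    whoami0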

In order to achieve zero-downtime availability, we are currently using two services and changing the priority label between them to do rolling updates (blue-green deployment style), but the retry approach sounds interesting as well.
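
A rough sketch of that blue-green setup, assuming the hypothetical host whoami.traefik and the traefik.frontend.priority label from the Traefik v1 Docker labels (the frontend with the higher priority value wins):

# two identical services behind the same frontend rule
docker service create --name whoami-blue \
    --label traefik.port=80 \
    --label traefik.frontend.rule=Host:whoami.traefik \
    --label traefik.frontend.priority=10 \
    --network traefik-net \
    emilevauge/whoami

docker service create --name whoami-green \
    --label traefik.port=80 \
    --label traefik.frontend.rule=Host:whoami.traefik \
    --label traefik.frontend.priority=5 \
    --network traefik-net \
    emilevauge/whoami

# to deploy: update the idle service, then flip the priorities
docker service update --label-add traefik.frontend.priority=20 whoami-green

The retry approach mentioned above should correspond to Traefik's --retry option (e.g. --retry.attempts=3), though we haven't tried that here.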

@hostingnuggets I think the reason for the long delays is the implementation of Traefik's Docker provider: the backend endpoints are based on the swarm task list (which contains the status/desired status as well), and Traefik polls this list quite slowly. In https://github.com/containous/traefik/blob/v1.3/provider/docker/docker.go the interval is SwarmDefaultWatchTime = 15 * time.Second, which means it can take up to 15 seconds to get the right list of tasks and each task's status; for example, it might take several seconds to identify that a task is shutting down, that a new task was created, or that a task reached a healthy state.

When using traefik.backend.loadbalancer.swarm=true, updates can take effect instantly, as the backend endpoint does not change at all (it's the service VIP) and swarm itself is responsible for managing the backends.