traefik: Gateway errors during rolling deployment w/sticky sessions
Do you want to request a feature or report a bug?
Bug
What did you do?
I’m using Traefik as a load balancer for a Docker swarm so I can enable sticky sessions (loadBalancer.sticky.cookie=true). Everything works great except during a rolling update of my service. Docker swarm brings up my new containers and then shuts down the old ones.
During this process, there are a few seconds where the old containers have been stopped but Traefik is still trying to route requests to them (resulting in 502 and 504 errors). The default Docker swarm routing mesh behaves correctly (it stops sending requests to stopped containers), but I need to use Traefik’s routing to enable the sticky sessions.
I tried enabling Traefik’s retry middleware, but when I checked the debug logs, it just keeps retrying the IP address of the old container that has already been shut down. I can produce the errors in the browser as well as from the command line with curl or ab (where the sticky cookie isn’t actually included in the request).
What did you expect to see?
I expected that when a Docker container goes down, Traefik would immediately stop routing requests to it. If that’s not actually possible, it should at least detect that requests are failing, and the retry middleware should send subsequent calls to one of the instances that is working.
What did you see instead?
502 and 504 errors during rolling deployments. The retry middleware keeps retrying the same IP/container that isn’t running anymore. After a few seconds, the configuration updates and things start working again.
Output of traefik version:
traefik:v2.4.2 docker image
Version: 2.4.2
Codename: livarot
Go version: go1.15.7
Built: 2021-02-02T17:20:41Z
OS/Arch: linux/amd64
What is your environment & configuration?
Docker swarm configured with docker-compose deploy labels. Traefik is running with swarmMode=true, and all containers use the external "traefik_network".
App config labels:
- "traefik.http.routers.app.rule=PathPrefix(`/`)"
- "traefik.http.services.app.loadbalancer.server.port=8080"
- "traefik.http.services.app.loadBalancer.sticky.cookie=true"
- "traefik.http.middlewares.retry-mw.retry.attempts=10"
- "traefik.http.middlewares.retry-mw.retry.initialinterval=250ms"
- "traefik.http.routers.app.middlewares=retry-mw@docker"
- "traefik.http.services.app.loadBalancer.healthCheck.path=/healthcheck"
If applicable, please paste the log output in DEBUG level (--log.level=DEBUG switch)
time="2021-02-10T14:59:12Z" level=debug msg="New attempt 4 for request: /MyRequestPath" middlewareName=retry-mw@docker middlewareType=Retry
time="2021-02-10T14:59:12Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 10.0.2.OldContainerIpHere:8080: connect: no route to host"
I do not consider this issue resolved. I’m glad there’s a workaround, but the underlying issue should be fixed. Or at the very least, the workaround should be documented.
Here’s a simple setup to reproduce the errors.
docker-stack.yml file:
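(The stack file itself didn’t come through in this copy. A minimal version that matches the labels above should look roughly like the following; the whoami image, flag values, and Traefik command-line options are stand-ins for illustration, not the literal reproduction file.)

version: "3.8"

networks:
  traefik_network:
    external: true

services:
  traefik:
    image: traefik:v2.4.2
    command:
      - "--providers.docker.swarmMode=true"
      - "--providers.docker.network=traefik_network"
      - "--entryPoints.web.address=:80"
      - "--log.level=DEBUG"
    ports:
      - "80:80"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - traefik_network
    deploy:
      placement:
        constraints:
          - node.role == manager

  app:
    # Any HTTP server listening on 8080 works; traefik/whoami is used here as a stand-in.
    image: traefik/whoami
    command: ["--port=8080"]
    networks:
      - traefik_network
    deploy:
      replicas: 1
      update_config:
        order: start-first   # start the new task before stopping the old one
      labels:
        - "traefik.http.routers.app.rule=PathPrefix(`/`)"
        - "traefik.http.services.app.loadbalancer.server.port=8080"
        - "traefik.http.services.app.loadBalancer.sticky.cookie=true"
        - "traefik.http.middlewares.retry-mw.retry.attempts=10"
        - "traefik.http.middlewares.retry-mw.retry.initialinterval=250ms"
        - "traefik.http.routers.app.middlewares=retry-mw@docker"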
Setup commands:
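(These were also lost in this copy; assuming a single-node swarm and the stack file above, they would be something like:)

docker swarm init
docker network create --driver overlay traefik_network
docker stack deploy -c docker-stack.yml demo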
Testing commands:
docker service update --force demo_app
IMMEDIATELY after the previous command, run: ab -n 5000 -s 120 localhost/
Adjust the 5000 parameter based on how fast your computer handles the incoming requests. If you run those two commands close enough together, the requests will initially go to the first container, and then at some point they’ll switch over to the new one.
During the switchover, there will be errors where Traefik tries to send requests to the container that just shut down rather than using the new container that should already be working correctly. You can also see from the logs that the retry middleware just keeps trying the IP address of the old container rather than switching over to the new one.
Typically when I’ve been testing it, I see somewhere between 10 and 30 failed requests each time. If you inspect the docker logs for the Traefik container, you can find the errors easily by searching for "gateway". They’re typically a mixture of 502 and 504 errors. You can also find the retries easily by searching for "New attempt".
Let me know if you have any other questions or any difficulties with reproducing the issue.
@Alvise88 Can you elaborate on that at all? Why is it pure luck? And what side effects?
My naive assumption is that this change allows Traefik to more quickly see the updates from Docker, so it stops using the old container and starts using the new one fast enough that there are no errors.
That’s not how the delay feature works. If your swarm is running multiple replicas, the delay is how much time the swarm will wait in between starting up each new replica.
In my reproduction sample code above, there is only one replica, so adding a delay option doesn’t do anything at all. 100% uptime is supposed to be preserved by the start-first order, which ensures that the new container is up and running before the swarm takes the old one offline.
In my real code, I also have a docker healthcheck configured, so I know for sure that the new container is successfully accepting requests before the old ones are removed.
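(For anyone following along, the deploy/healthcheck combination being described looks roughly like this in a compose file; the replica count and timing values are illustrative, and the healthcheck assumes curl is available in the app image.)

    deploy:
      replicas: 2
      update_config:
        order: start-first   # bring the new task up before stopping the old one
        delay: 10s           # pause between starting successive replicas (only matters with >1 replica)
    healthcheck:
      # Docker-level healthcheck: with start-first, swarm waits for the new task
      # to report healthy before it stops the old one.
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthcheck"]
      interval: 10s
      timeout: 3s
      retries: 3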
@allanjackson Similar situation here: I also see "Bad Gateway" often, but not only when updating; it occurs independently of service updates and of task crashes. This happens with the latest 2.3 here.
I also use the retry middleware and stickiness basically the same way as you (without the Traefik LB healthCheck).
It happened around 260 times in 48 hours (50 req/s on average).