traefik: Gateway errors during rolling deployment w/sticky sessions
Do you want to request a feature or report a bug?
Bug
What did you do?
I’m using Traefik as a load balancer for a Docker swarm so I can enable sticky sessions (loadBalancer.sticky.cookie=true). Everything works great except during a rolling update of my service. Docker swarm brings up my new containers and then shuts down the old ones.
During this process, there are a few seconds where the old containers have been stopped but Traefik is still trying to route requests to them (resulting in 502 and 504 errors). The default Docker swarm routing mesh behaves correctly (it stops sending requests to stopped containers), but I need to use Traefik’s routing to enable the sticky sessions.
I tried enabling Traefik’s retry middleware, but when I checked the debug logs, it just keeps retrying the IP address of the old container that has already been shut down. I can produce the errors in the browser as well as from the command line with curl or ab (where the sticky cookie isn’t actually included in the request).
What did you expect to see?
I expected that when a Docker container goes down, Traefik would immediately stop routing requests to it. If that’s not actually possible, it should at least detect that requests are failing, and the retry middleware should send subsequent calls to one of the instances that is working.
What did you see instead?
502 and 504 errors during rolling deployments. The retry middleware keeps retrying the same IP/container that isn’t running anymore. After a few seconds, the configuration updates and things start working again.
Output of traefik version:
traefik:v2.4.2 docker image
Version: 2.4.2
Codename: livarot
Go version: go1.15.7
Built: 2021-02-02T17:20:41Z
OS/Arch: linux/amd64
What is your environment & configuration?
Docker swarm configured with docker-compose deploy labels. Traefik is running with swarmMode=true, and all containers use the external "traefik_network".
App config labels:
- "traefik.http.routers.app.rule=PathPrefix(`/`)"
- "traefik.http.services.app.loadbalancer.server.port=8080"
- "traefik.http.services.app.loadBalancer.sticky.cookie=true"
- "traefik.http.middlewares.retry-mw.retry.attempts=10"
- "traefik.http.middlewares.retry-mw.retry.initialinterval=250ms"
- "traefik.http.routers.app.middlewares=retry-mw@docker"
- "traefik.http.services.app.loadBalancer.healthCheck.path=/healthcheck"
If applicable, please paste the log output in DEBUG level (--log.level=DEBUG switch)
time="2021-02-10T14:59:12Z" level=debug msg="New attempt 4 for request: /MyRequestPath" middlewareName=retry-mw@docker middlewareType=Retry
time="2021-02-10T14:59:12Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 10.0.2.OldContainerIpHere:8080: connect: no route to host"
I do not consider this issue resolved. I’m glad there’s a workaround, but the underlying issue should be fixed. Or at the very least, the workaround should be documented.
Here’s a simple setup to reproduce the errors.
docker-stack.yml file:
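(The stack file itself didn’t come through in this copy. A minimal version that matches the labels above should look roughly like the following; the whoami image, flag values, and Traefik command-line options are stand-ins for illustration, not the literal reproduction file.)

version: "3.8"

networks:
  traefik_network:
    external: true

services:
  traefik:
    image: traefik:v2.4.2
    command:
      - "--providers.docker.swarmMode=true"
      - "--providers.docker.network=traefik_network"
      - "--entryPoints.web.address=:80"
      - "--log.level=DEBUG"
    ports:
      - "80:80"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - traefik_network
    deploy:
      placement:
        constraints:
          - node.role == manager

  app:
    # Any HTTP server listening on 8080 works; traefik/whoami is used here as a stand-in.
    image: traefik/whoami
    command: ["--port=8080"]
    networks:
      - traefik_network
    deploy:
      replicas: 1
      update_config:
        order: start-first   # start the new task before stopping the old one
      labels:
        - "traefik.http.routers.app.rule=PathPrefix(`/`)"
        - "traefik.http.services.app.loadbalancer.server.port=8080"
        - "traefik.http.services.app.loadBalancer.sticky.cookie=true"
        - "traefik.http.middlewares.retry-mw.retry.attempts=10"
        - "traefik.http.middlewares.retry-mw.retry.initialinterval=250ms"
        - "traefik.http.routers.app.middlewares=retry-mw@docker"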
Setup commands:
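(These were also lost in this copy; assuming a single-node swarm and the stack file above, they would be something like:)

docker swarm init
docker network create --driver overlay traefik_network
docker stack deploy -c docker-stack.yml demo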
Testing commands:
docker service update --force demo_app
IMMEDIATELY after the previous command, run: ab -n 5000 -s 120 localhost/
Adjust the 5000 parameter based on how fast your computer handles the incoming requests. If you run those two commands close enough together, the requests will initially go to the first container, and then at some point they’ll switch over to the new one.
During the switchover, there will be errors where Traefik tries to send requests to the container that just shut down rather than using the new container that should already be working correctly. You can also see from the logs that the retry middleware just keeps trying the IP address of the old container rather than switching over to the new one.
Typically when I’ve been testing it, I see somewhere between 10 and 30 failed requests each time. If you inspect the docker logs for the Traefik container, you can find the errors easily by searching for "gateway". They’re typically a mixture of 502 and 504 errors. You can also find the retries easily by searching for "New attempt".
Let me know if you have any other questions or any difficulties with reproducing the issue.
@Alvise88 Can you elaborate on that at all? Why is it pure luck? And what side effects?
My naive assumption is that this change allows Traefik to more quickly see the updates from Docker, so it stops using the old container and starts using the new one fast enough that there are no errors.
That’s not how the delay feature works. If your swarm is running multiple replicas, the delay is how much time the swarm will wait in between starting up each new replica.
In my reproduction sample code above, there is only one replica, so adding a delay option doesn’t do anything at all. 100% uptime is supposed to be preserved by the start-first order, which ensures that the new container is up and running before the swarm takes the old one offline.
In my real code, I also have a docker healthcheck configured, so I know for sure that the new container is successfully accepting requests before the old ones are removed.
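(For anyone following along, the deploy/healthcheck combination being described looks roughly like this in a compose file; the replica count and timing values are illustrative, and the healthcheck assumes curl is available in the app image.)

    deploy:
      replicas: 2
      update_config:
        order: start-first   # bring the new task up before stopping the old one
        delay: 10s           # pause between starting successive replicas (only matters with >1 replica)
    healthcheck:
      # Docker-level healthcheck: with start-first, swarm waits for the new task
      # to report healthy before it stops the old one.
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthcheck"]
      interval: 10s
      timeout: 3s
      retries: 3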
@allanjackson Similar situation here: I also see "Bad Gateway" often, but not only when updating; it occurs independently of service updates and of task crashes. This happens with the latest 2.3 here.
I also use the retry middleware and stickiness basically the same way as you (without the Traefik LB healthCheck).
It happened around 260 times in 48 hours (50 req/s on average).