traefik: Traefik won't add auto restarted container to backend
Do you want to request a feature or report a bug?
Bug
What did you do?
We have Traefik and a Drone server running in Docker (with Docker Compose).
After upgrading Traefik to version 1.7 (image: traefik:v1.7
), sometimes when the Drone server exits because of a mysterious bug, Drone is restarted automatically by Docker Compose. But Traefik would fail to add Drone to backend again.
At this time we have to restart Treafik manually.
What did you expect to see?
Traefik would see restarted container and add it to backend correctly.
Traefik logs:
time="2018-07-27T11:01:56Z" level=warning msg="Unable to find the IP address for the container \"/drone-server\": the server is ignored."
time="2018-07-27T11:01:56Z" level=error msg="undefined backend 'backend-drone-server' for frontend frontend-Host-ci-example-com-0. Skipping frontend frontend-Host-ci-example
What did you see instead?
Traefik failed to get the Drone container’s IP, and won’t add it to backend.
Output of traefik version
: (What version of Traefik are you using?)
v1.7.0-rc2
What is your environment & configuration (arguments, toml, provider, platform, …)?
logLevel = "INFO"
defaultEntryPoints = ["http", "https"]
checkNewVersion = false
InsecureSkipVerify = true
[metrics]
[metrics.prometheus]
entryPoint = "traefik"
buckets = [0.1,0.3,1.2,5.0]
[api]
entryPoint = "traefik"
dashboard = true
[docker]
exposedbydefault = false
[entryPoints]
[entryPoints.http]
address = ":80"
[entryPoints.http.redirect]
entryPoint = "https"
[entryPoints.https]
address = ":443"
[entryPoints.https.tls]
[[entryPoints.https.tls.certificates]]
certFile = "/etc/traefik/tls.cert"
keyFile = "/etc/traefik/tls.key"
[acme]
storage = "/data/acme.json"
email = "email@example.com"
entryPoint = "https"
onHostRule = true
ondemand = false
[acme.httpChallenge]
entryPoint = "http"
docker-compose.yml:
version: '3.4'
drone-server:
container_name: drone-server
image: drone/drone:0.8.5
restart: unless-stopped
ports:
- "9000:9000"
proxy:
container_name: proxy
# https://hub.docker.com/r/library/traefik/tags/
image: traefik:v1.7
restart: unless-stopped
ports:
- "80:80"
- "443:443"
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- ./traefik.toml:/etc/traefik/traefik.toml
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 9
- Comments: 35 (8 by maintainers)
We have also encountered this issue (as reported by @jortkoopmans) on the 1.6.6 release of Traefik when running on CentOS. It does not seem to be a regression in 1.7.
I think we have isolated the race condition to the following function: https://github.com/containous/traefik/blob/246b245959f1a16d78a17f13e9bc659bcafb3d47/server/server_configuration.go#L149-L179
If a Docker container is restarted, it sends a
die
and astart
provider event. These events are turned intoconfig.Message
types and they get processed by this function in quick succession. The race is between the read at line 164 and the write at line 178 which seem to be running in parallel for both config messages.The race condition happens if during the processing of the
start
event thedeepEqual
at line 164 sees the current configuration state before the remove triggered by thedie
message is applied (line 178). The result is that return on line 166 drops thestart
configuration change. This means only thedie
message is being processed, and the Docker container no longer exists according to Traefik.A simple workaround that works for us is to comment out the return on line 166, which means the
start
is processed. The throttle inproviderConfigUpdateCh
(applied on line 174) already prevents too many configuration changes.I just had it happen with latest 1.7 version in production a few times yesterday when upgrading containers.
I can also confirm that @jortkoopmans zip file broke after 11 restarts on my workstation which runs
Sorry for delay, I have finally got around to build traefik after disabling the ‘return’ line (I’m using up to date 1.7 branch), as suggested by @gregoryelmajjouti. I can confirm this solves the problem. However the unit-tests are failing;
I’m not a (go) developer, so I won’t be able to provide a PR which also addresses the unit tests.
I have tested on 1.7.14, the problem is unfortunately still there. Since its a race condition, you may not have hit it yet (i.e. run some more tests to break it).
Sam here. Seen this issue several times with 1.7.10. Restarting the container again usually fixes it.
Since it has been a while, I re-ran the test on traefik v2.1.4. To my pleasant surprise, it seems this issue has been resolved by other code refactoring.
I’ve done a superficial analysis and I believe it is related to the rework done here; https://github.com/containous/traefik/pull/5815 Effectively it is splitting the ‘Server’ into different parts, one is the ConfigurationWatcher, which is handling the changes in the configuration of container routing. Note that the ‘return’ line is still there (as mentioned by @gregoryelmajjouti ), but the race condition is not happening anymore in my tests. Presumably that is associated to the splitting of ‘Server’ / refactoring.
Therefore the problem is resolved for me, I encourage other participants of this thread to re-test their use-cases.
I’ve been struggling with this for a while, at least…I think it’s similar to what is reported in this issue by several of you. The failure behavior is intermittent. If for any reason a container is restarted, there is a chance that the traefik routes will not be re-established.
I can reproduce this intermittent behavior by simply setting up traefik and an nginx container, and then restarting the nginx container a couple of times (docker restart someapp). I’m suspecting a timing issue.
A successful, normal (working) restart, shows in Traefik as;
While a unsuccessful restart (broken) shows:
To me, it looks like the missing ‘reload’ of the Server configuration, after the container died, has something to do with it.
To facilitate further troubleshooting, I’ve attached various files (in zip): working.log (see also above) broken.log (see also above) docker-compose.yml traefik.toml docker-info.log traefik_3780.zip
Traefik image ‘latest’ Digest sha256:c25ac3c2736dc41b08c6f40ef3393d03dd7640a037cd53d681af459ccfa95c28
Thanks for your feedbacks!
The questions about OS and Docker versions are aimed at providing as much context as possible to reproduce the issue, which is the requirement to fix it in the future. Same for the
2.0
version, as the codebase changed quite a lot: it helps to gather conditions.@jortkoopmans @gregoryelmajjouti thanks for the deep technical elements. The developer team will look into this, but no ETA given that it’s hard for us to reproduce and analyse. As the proposed workaround is breaking tests, it means that it might solve this issue, but could lead to greater trouble somewhere else, hence the required analysis.
Thanks @jortkoopmans , we are currently trying to reproduce the issue.
Based on your feedback, I tried on Docker4Mac (with 1 CPU 1 Gb) with similar docker-compose.yml, but I did not see anything as you described. However the disk is an SSD, so I’m gonna try on slower machines on Digital Ocean and AWS to reproduce.
I also encountered the :
Unable to find the IP address for the container ... : the server is ignored.
error message.After some troubleshooting I realized that my container was misbehaving and was continuously restarting. As such traefik did not have the time to set up a backend for the container.