traefik: Traefik won't add auto-restarted container to backend

Do you want to request a feature or report a bug?

Bug

What did you do?

We have Traefik and a Drone server running in Docker (with Docker Compose). After upgrading Traefik to version 1.7 (image: traefik:v1.7), the Drone server sometimes exits because of a mysterious bug and is restarted automatically by Docker Compose, but Traefik then fails to add Drone back to the backend. When this happens, we have to restart Traefik manually.

What did you expect to see?

Traefik would see the restarted container and add it to the backend correctly.

Traefik logs:

time="2018-07-27T11:01:56Z" level=warning msg="Unable to find the IP address for the container \"/drone-server\": the server is ignored."
time="2018-07-27T11:01:56Z" level=error msg="undefined backend 'backend-drone-server' for frontend frontend-Host-ci-example-com-0. Skipping frontend frontend-Host-ci-example

What did you see instead?

Traefik failed to get the Drone container’s IP and would not add it back to the backend.

Output of traefik version: (What version of Traefik are you using?)

v1.7.0-rc2

What is your environment & configuration (arguments, toml, provider, platform, …)?

logLevel = "INFO"

defaultEntryPoints = ["http", "https"]

checkNewVersion = false
InsecureSkipVerify = true

[metrics]
  [metrics.prometheus]
    entryPoint = "traefik"
    buckets = [0.1,0.3,1.2,5.0]

[api]
  entryPoint = "traefik"
  dashboard = true

[docker]
exposedbydefault = false

[entryPoints]
  [entryPoints.http]
  address = ":80"
    [entryPoints.http.redirect]
    entryPoint = "https"
  [entryPoints.https]
  address = ":443"
    [entryPoints.https.tls]
      [[entryPoints.https.tls.certificates]]
      certFile = "/etc/traefik/tls.cert"
      keyFile = "/etc/traefik/tls.key"

[acme]
storage = "/data/acme.json"
email = "email@example.com"
entryPoint = "https"
onHostRule = true
ondemand = false
  [acme.httpChallenge]
  entryPoint = "http"

docker-compose.yml:

version: '3.4'

services:
  drone-server:
    container_name: drone-server
    image: drone/drone:0.8.5
    restart: unless-stopped
    ports:
      - "9000:9000"

  proxy:
    container_name: proxy
    # https://hub.docker.com/r/library/traefik/tags/
    image: traefik:v1.7
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./traefik.toml:/etc/traefik/traefik.toml

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 9
  • Comments: 35 (8 by maintainers)

Most upvoted comments

We have also encountered this issue (as reported by @jortkoopmans) on the 1.6.6 release of Traefik when running on CentOS. It does not seem to be a regression in 1.7.

I think we have isolated the race condition to the following function: https://github.com/containous/traefik/blob/246b245959f1a16d78a17f13e9bc659bcafb3d47/server/server_configuration.go#L149-L179

If a Docker container is restarted, it emits a die and then a start provider event. These events are turned into config.Message values and are processed by this function in quick succession. The race is between the read at line 164 and the write at line 178, which can run in parallel for the two config messages.

The race condition happens if, while the start event is being processed, the deepEqual at line 164 sees the configuration state from before the removal triggered by the die message has been applied (line 178). The return on line 166 then drops the start configuration change. This means only the die message takes effect, and as far as Traefik is concerned the Docker container no longer exists.

A simple workaround that works for us is to comment out the return on line 166, so the start is always processed. The throttle on providerConfigUpdateCh (applied on line 174) already prevents too many configuration changes.
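
To make the interleaving concrete, here is a minimal, self-contained Go sketch (an assumed simplification for illustration, not the actual Traefik code). Two goroutines run the read-compare-write sequence for the die and start messages; depending on goroutine scheduling, the start message is sometimes judged “unchanged” and dropped:

package main

import (
	"fmt"
	"reflect"
	"sync"
)

func main() {
	var mu sync.Mutex
	// State before the restart: the container's backend is configured.
	current := "backend-drone-server"

	// process mimics the per-message handling: read the shared state
	// (line 164), compare, then write it back (line 178).
	process := func(event, config string) {
		mu.Lock()
		last := current
		mu.Unlock()

		// Equivalent of the deepEqual check at line 164.
		if reflect.DeepEqual(last, config) {
			fmt.Printf("%s: skipping same configuration\n", event)
			return // the workaround above removes this early return
		}

		mu.Lock()
		current = config // equivalent of the write at line 178
		mu.Unlock()
		fmt.Printf("%s: applied %q\n", event, config)
	}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); process("die", "") }()                       // die publishes an empty config
	go func() { defer wg.Done(); process("start", "backend-drone-server") }() // start re-publishes the full one
	wg.Wait()

	// If the start goroutine read the state before the die goroutine's
	// write landed, the start config looked identical to the old state
	// and was dropped; the final configuration is then empty and Traefik
	// has forgotten the container.
	fmt.Printf("final configuration: %q\n", current)
}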

I just had it happen a few times yesterday with the latest 1.7 version in production when upgrading containers.

I can also confirm that @jortkoopmans’ zip file broke after 11 restarts on my workstation, which runs:

$ uname -a
Linux transwhale 5.0.0-29-generic #31-Ubuntu SMP Thu Sep 12 13:05:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:51 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:15 2018
  OS/Arch:          linux/amd64
  Experimental:     false

$ docker info
Containers: 308
 Running: 2
 Paused: 0
 Stopped: 306
Images: 1111
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 5.0.0-29-generic
Operating System: Ubuntu 19.04
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.34GiB
Name: transwhale
ID: NOGJ:AHFC:HRLY:HEOP:WTNX:D5KI:P6SY:ODLX:LBGU:B4B4:NRVO:OW4J
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: thomasf
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

@jortkoopmans, does the fix provided by @gregoryelmajjouti resolve your issue?

If so, we probably want to have a PR to mainline this fix, if it is still an issue…

Sorry for the delay; I have finally gotten around to building Traefik after disabling the ‘return’ line (I’m using the up-to-date 1.7 branch), as suggested by @gregoryelmajjouti. I can confirm this solves the problem. However, the unit tests are failing:

--- FAIL: TestListenProvidersSkipsSameConfigurationForProvider (0.12s)
    server_test.go:137: Same configuration should not be published multiple times

I’m not a (go) developer, so I won’t be able to provide a PR which also addresses the unit tests.

After updating to 1.7.14 and trying to update containers a few times, the issue disappeared. I hope 1.7.14 fixes this issue.

I have tested on 1.7.14; the problem is unfortunately still there. Since it’s a race condition, you may simply not have hit it yet (i.e. run some more tests to trigger it).

Same here. I’ve seen this issue several times with 1.7.10. Restarting the container again usually fixes it.

Since it has been a while, I re-ran the test on Traefik v2.1.4. To my pleasant surprise, it seems this issue has been resolved by other refactoring work.

I’ve done a superficial analysis and I believe it is related to the rework done here: https://github.com/containous/traefik/pull/5815. Effectively it splits the ‘Server’ into different parts; one of them is the ConfigurationWatcher, which handles changes in the container routing configuration. Note that the ‘return’ line is still there (as mentioned by @gregoryelmajjouti), but the race condition no longer occurs in my tests, presumably because of the ‘Server’ split / refactoring.
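
As a rough intuition for why that split could help (my own sketch, not the actual v2 code): if a single goroutine owns the configuration state and drains provider messages one at a time, the “same configuration?” check and the subsequent write can no longer interleave across two messages, so the die/start pair is always applied in order.

package main

import "fmt"

// watch serializes message handling: one goroutine owns current, so the
// comparison and the write can never race across two messages.
func watch(initial string, msgs <-chan string) {
	current := initial
	for config := range msgs { // one message at a time, in arrival order
		if config == current {
			fmt.Printf("skipping same configuration %q\n", config)
			continue
		}
		current = config
		fmt.Printf("applied %q\n", config) // publish to the router here
	}
}

func main() {
	msgs := make(chan string, 2)
	msgs <- ""                     // die: empty config, applied first
	msgs <- "backend-drone-server" // start: full config, applied second
	close(msgs)
	watch("backend-drone-server", msgs)
}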

Therefore the problem is resolved for me; I encourage other participants in this thread to re-test their use cases.

I’ve been struggling with this for a while; at least, I think it’s similar to what several of you have reported in this issue. The failure behavior is intermittent: if for any reason a container is restarted, there is a chance that the Traefik routes will not be re-established.

I can reproduce this intermittent behavior by simply setting up Traefik and an nginx container and then restarting the nginx container a couple of times (docker restart someapp). I’m suspecting a timing issue.
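
For reference, a minimal compose sketch matching that repro, reconstructed from the service and label names visible in the logs below (the exact files are in the attached zip, so treat this as an approximation):

version: '3'

services:
  traefik:
    image: traefik:1.7
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./traefik.toml:/etc/traefik/traefik.toml

  someapp:
    image: nginx
    labels:
      - "traefik.backend=someapp"
      - "traefik.port=80"
      - "traefik.frontend.rule=HostRegexp:{catchall:.*}"
      - "traefik.frontend.entryPoints=http,https"

Then run docker restart someapp a few times until the routes are no longer re-established.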

A successful, normal (working) restart shows in Traefik as:

time="2018-10-19T13:12:09Z" level=debug msg="Provider event received {Status:die ID:f3fde9c2dba7db19a0f93ddeb561886bf256be66dc5577543618007ae3618150 From:nginx Type:container Action:die Actor:{ID:f3fde9c2dba7db19a0f93ddeb561886bf256be66dc5577543618007ae3618150 Attributes:map[maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> traefik.frontend.entryPoints:http,https com.docker.compose.config-hash:b2b6bad273c2aab4e14fef131d833e586f292e4a56b35863fc0b12775ce5b53a com.docker.compose.oneoff:False com.docker.compose.version:1.22.0 traefik.frontend.rule:HostRegexp:{catchall:.*} com.docker.compose.project:temp image:nginx name:someapp com.docker.compose.container-number:1 com.docker.compose.service:someapp exitCode:0 traefik.backend:someapp traefik.port:80]} Scope:local Time:1539954729 TimeNano:1539954729940309118}"
time="2018-10-19T13:12:10Z" level=debug msg="Filtering disabled container /traefik"
time="2018-10-19T13:12:10Z" level=debug msg="Configuration received from provider docker: {}"
time="2018-10-19T13:12:10Z" level=debug msg="Creating entry point redirect http -> https"
time="2018-10-19T13:12:10Z" level=info msg="Server configuration reloaded on :8080"
time="2018-10-19T13:12:10Z" level=info msg="Server configuration reloaded on :80"
time="2018-10-19T13:12:10Z" level=info msg="Server configuration reloaded on :443"
time="2018-10-19T13:12:10Z" level=debug msg="Provider event received {Status:start ID:f3fde9c2dba7db19a0f93ddeb561886bf256be66dc5577543618007ae3618150 From:nginx Type:container Action:start Actor:{ID:f3fde9c2dba7db19a0f93ddeb561886bf256be66dc5577543618007ae3618150 Attributes:map[image:nginx name:someapp traefik.backend:someapp traefik.frontend.entryPoints:http,https com.docker.compose.config-hash:b2b6bad273c2aab4e14fef131d833e586f292e4a56b35863fc0b12775ce5b53a com.docker.compose.project:temp com.docker.compose.version:1.22.0 maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> traefik.frontend.rule:HostRegexp:{catchall:.*} traefik.port:80 com.docker.compose.container-number:1 com.docker.compose.oneoff:False com.docker.compose.service:someapp]} Scope:local Time:1539954730 TimeNano:1539954730278104338}"
time="2018-10-19T13:12:10Z" level=debug msg="Filtering disabled container /traefik"
time="2018-10-19T13:12:10Z" level=debug msg="originLabelsmap[com.docker.compose.config-hash:b2b6bad273c2aab4e14fef131d833e586f292e4a56b35863fc0b12775ce5b53a com.docker.compose.version:1.22.0 maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> traefik.frontend.rule:HostRegexp:{catchall:.*} com.docker.compose.container-number:1 com.docker.compose.oneoff:False com.docker.compose.project:temp com.docker.compose.service:someapp traefik.backend:someapp traefik.frontend.entryPoints:http,https traefik.port:80]"
time="2018-10-19T13:12:10Z" level=debug msg="allLabelsmap[:map[traefik.frontend.rule:HostRegexp:{catchall:.*} traefik.backend:someapp traefik.frontend.entryPoints:http,https traefik.port:80]]"
time="2018-10-19T13:12:10Z" level=debug msg="originLabelsmap[com.docker.compose.container-number:1 com.docker.compose.oneoff:False com.docker.compose.project:temp com.docker.compose.service:someapp traefik.backend:someapp traefik.frontend.entryPoints:http,https traefik.port:80 com.docker.compose.config-hash:b2b6bad273c2aab4e14fef131d833e586f292e4a56b35863fc0b12775ce5b53a com.docker.compose.version:1.22.0 maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> traefik.frontend.rule:HostRegexp:{catchall:.*}]"
time="2018-10-19T13:12:10Z" level=debug msg="allLabelsmap[:map[traefik.frontend.rule:HostRegexp:{catchall:.*} traefik.port:80 traefik.backend:someapp traefik.frontend.entryPoints:http,https]]"
time="2018-10-19T13:12:10Z" level=debug msg="Backend backend-someapp: no load-balancer defined, fallback to 'wrr' method"
time="2018-10-19T13:12:10Z" level=debug msg="Configuration received from provider docker: {\"backends\":{\"backend-someapp\":{\"servers\":{\"server-someapp-9121159f5bca979e8087544fd72045a9\":{\"url\":\"http://192.168.32.3:80\",\"weight\":1}},\"loadBalancer\":{\"method\":\"wrr\"}}},\"frontends\":{\"frontend-HostRegexp-catchall-0\":{\"entryPoints\":[\"http\",\"https\"],\"backend\":\"backend-someapp\",\"routes\":{\"route-frontend-HostRegexp-catchall-0\":{\"rule\":\"HostRegexp:{catchall:.*}\"}},\"passHostHeader\":true,\"priority\":0,\"basicAuth\":null}}}"
time="2018-10-19T13:12:12Z" level=debug msg="Creating entry point redirect http -> https"
time="2018-10-19T13:12:12Z" level=debug msg="Wiring frontend frontend-HostRegexp-catchall-0 to entryPoint http"
time="2018-10-19T13:12:12Z" level=debug msg="Creating backend backend-someapp"
time="2018-10-19T13:12:12Z" level=debug msg="Adding TLSClientHeaders middleware for frontend frontend-HostRegexp-catchall-0"
time="2018-10-19T13:12:12Z" level=debug msg="Creating load-balancer wrr"
time="2018-10-19T13:12:12Z" level=debug msg="Creating server server-someapp-9121159f5bca979e8087544fd72045a9 at http://192.168.32.3:80 with weight 1"
time="2018-10-19T13:12:12Z" level=debug msg="Creating route route-frontend-HostRegexp-catchall-0 HostRegexp:{catchall:.*}"
time="2018-10-19T13:12:12Z" level=debug msg="Wiring frontend frontend-HostRegexp-catchall-0 to entryPoint https"
time="2018-10-19T13:12:12Z" level=debug msg="Creating backend backend-someapp"
time="2018-10-19T13:12:12Z" level=debug msg="Adding TLSClientHeaders middleware for frontend frontend-HostRegexp-catchall-0"
time="2018-10-19T13:12:12Z" level=debug msg="Creating load-balancer wrr"
time="2018-10-19T13:12:12Z" level=debug msg="Creating server server-someapp-9121159f5bca979e8087544fd72045a9 at http://192.168.32.3:80 with weight 1"
time="2018-10-19T13:12:12Z" level=debug msg="Creating route route-frontend-HostRegexp-catchall-0 HostRegexp:{catchall:.*}"
time="2018-10-19T13:12:12Z" level=info msg="Server configuration reloaded on :8080"
time="2018-10-19T13:12:12Z" level=info msg="Server configuration reloaded on :80"
time="2018-10-19T13:12:12Z" level=info msg="Server configuration reloaded on :443"

While an unsuccessful (broken) restart shows:

time="2018-10-19T13:07:57Z" level=debug msg="Provider event received {Status:die ID:f3fde9c2dba7db19a0f93ddeb561886bf256be66dc5577543618007ae3618150 From:nginx Type:container Action:die Actor:{ID:f3fde9c2dba7db19a0f93ddeb561886bf256be66dc5577543618007ae3618150 Attributes:map[name:someapp traefik.frontend.rule:HostRegexp:{catchall:.*} com.docker.compose.config-hash:b2b6bad273c2aab4e14fef131d833e586f292e4a56b35863fc0b12775ce5b53a com.docker.compose.project:temp maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> traefik.backend:someapp traefik.port:80 com.docker.compose.container-number:1 com.docker.compose.version:1.22.0 exitCode:0 traefik.frontend.entryPoints:http,https com.docker.compose.oneoff:False com.docker.compose.service:someapp image:nginx]} Scope:local Time:1539954477 TimeNano:1539954477012610712}"
time="2018-10-19T13:07:57Z" level=debug msg="Filtering disabled container /traefik"
time="2018-10-19T13:07:57Z" level=debug msg="Configuration received from provider docker: {}"
time="2018-10-19T13:07:57Z" level=debug msg="Creating entry point redirect http -> https"
time="2018-10-19T13:07:57Z" level=debug msg="Provider event received {Status:start ID:f3fde9c2dba7db19a0f93ddeb561886bf256be66dc5577543618007ae3618150 From:nginx Type:container Action:start Actor:{ID:f3fde9c2dba7db19a0f93ddeb561886bf256be66dc5577543618007ae3618150 Attributes:map[com.docker.compose.container-number:1 com.docker.compose.service:someapp maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> name:someapp traefik.backend:someapp traefik.frontend.rule:HostRegexp:{catchall:.*} traefik.port:80 com.docker.compose.config-hash:b2b6bad273c2aab4e14fef131d833e586f292e4a56b35863fc0b12775ce5b53a com.docker.compose.oneoff:False com.docker.compose.project:temp com.docker.compose.version:1.22.0 image:nginx traefik.frontend.entryPoints:http,https]} Scope:local Time:1539954477 TimeNano:1539954477343667521}"
time="2018-10-19T13:07:57Z" level=debug msg="Filtering disabled container /traefik"
time="2018-10-19T13:07:57Z" level=debug msg="originLabelsmap[com.docker.compose.version:1.22.0 maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> traefik.backend:someapp com.docker.compose.config-hash:b2b6bad273c2aab4e14fef131d833e586f292e4a56b35863fc0b12775ce5b53a com.docker.compose.container-number:1 com.docker.compose.oneoff:False com.docker.compose.project:temp com.docker.compose.service:someapp traefik.frontend.rule:HostRegexp:{catchall:.*} traefik.frontend.entryPoints:http,https traefik.port:80]"
time="2018-10-19T13:07:57Z" level=debug msg="allLabelsmap[:map[traefik.frontend.entryPoints:http,https traefik.port:80 traefik.backend:someapp traefik.frontend.rule:HostRegexp:{catchall:.*}]]"
time="2018-10-19T13:07:57Z" level=debug msg="originLabelsmap[com.docker.compose.config-hash:b2b6bad273c2aab4e14fef131d833e586f292e4a56b35863fc0b12775ce5b53a com.docker.compose.container-number:1 com.docker.compose.oneoff:False com.docker.compose.project:temp com.docker.compose.service:someapp com.docker.compose.version:1.22.0 maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> traefik.backend:someapp traefik.frontend.rule:HostRegexp:{catchall:.*} traefik.frontend.entryPoints:http,https traefik.port:80]"
time="2018-10-19T13:07:57Z" level=debug msg="allLabelsmap[:map[traefik.backend:someapp traefik.frontend.rule:HostRegexp:{catchall:.*} traefik.frontend.entryPoints:http,https traefik.port:80]]"
time="2018-10-19T13:07:57Z" level=debug msg="Backend backend-someapp: no load-balancer defined, fallback to 'wrr' method"
time="2018-10-19T13:07:57Z" level=debug msg="Configuration received from provider docker: {\"backends\":{\"backend-someapp\":{\"servers\":{\"server-someapp-9121159f5bca979e8087544fd72045a9\":{\"url\":\"http://192.168.32.3:80\",\"weight\":1}},\"loadBalancer\":{\"method\":\"wrr\"}}},\"frontends\":{\"frontend-HostRegexp-catchall-0\":{\"entryPoints\":[\"http\",\"https\"],\"backend\":\"backend-someapp\",\"routes\":{\"route-frontend-HostRegexp-catchall-0\":{\"rule\":\"HostRegexp:{catchall:.*}\"}},\"passHostHeader\":true,\"priority\":0,\"basicAuth\":null}}}"
time="2018-10-19T13:07:57Z" level=info msg="Skipping same configuration for provider docker"
time="2018-10-19T13:07:57Z" level=info msg="Server configuration reloaded on :80"
time="2018-10-19T13:07:57Z" level=info msg="Server configuration reloaded on :443"
time="2018-10-19T13:07:57Z" level=info msg="Server configuration reloaded on :8080"

To me, it looks like the missing ‘reload’ of the server configuration after the container died (note the “Skipping same configuration for provider docker” line) has something to do with it.

To facilitate further troubleshooting, I’ve attached various files in traefik_3780.zip: working.log (see also above), broken.log (see also above), docker-compose.yml, traefik.toml, and docker-info.log.

uname -a:
Linux HOSTNAME 3.10.0-862.3.2.el7.x86_64 #1 SMP Mon May 21 23:36:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

Traefik image ‘latest’, digest: sha256:c25ac3c2736dc41b08c6f40ef3393d03dd7640a037cd53d681af459ccfa95c28

Thanks for your feedback!

The questions about the OS and Docker versions are aimed at gathering as much context as possible to reproduce the issue, which is a prerequisite for fixing it. The same goes for the question about version 2.0: the codebase has changed quite a lot, so it helps to narrow down the conditions.

@jortkoopmans @gregoryelmajjouti thanks for the deep technical details. The developer team will look into this, but there is no ETA, given that it’s hard for us to reproduce and analyse. As the proposed workaround breaks tests, it might solve this issue but could lead to greater trouble somewhere else, hence the required analysis.

Thanks @jortkoopmans, we are currently trying to reproduce the issue.

Based on your feedback, I tried on Docker4Mac (with 1 CPU and 1 GB) with a similar docker-compose.yml, but I did not see the behavior you described. However, the disk is an SSD, so I’m going to try slower machines on Digital Ocean and AWS to reproduce it.

I also encountered the “Unable to find the IP address for the container ...: the server is ignored.” error message.

After some troubleshooting, I realized that my container was misbehaving and continuously restarting, so Traefik did not have time to set up a backend for it.