traefik: traefik hangs - stops handling requests

traefik version: 1.0.2

Set open files limit to 1000000: ulimit -n 1000000

traefik config:

traefikLogsFile = "/var/log/traefik.log"
accessLogsFile = "/var/log/traefik-access.log"
logLevel = "DEBUG"

defaultEntryPoints = ["http"]

[entryPoints]
  [entryPoints.http]
  address = ":8000"

[retry]

[web]
address = ":8080"

[file]

[backends]
  [backends.backend1]
    [backends.backend1.servers.server1]
    url = "http://localhost:80"

[frontends]
  [frontends.frontend1]
  backend = "backend1"
    [frontends.frontend1.routes.test_1]
    rule = "Host:test-nginx.example.net"
# curl -s http://localhost:8080/api | jq .
{
  "file": {
    "frontends": {
      "frontend1": {
        "priority": 0,
        "routes": {
          "test_1": {
            "rule": "Host:test-nginx.example.net"
          }
        },
        "backend": "backend1",
        "entryPoints": [
          "http"
        ]
      }
    },
    "backends": {
      "backend1": {
        "loadBalancer": {
          "method": "wrr"
        },
        "servers": {
          "server1": {
            "weight": 0,
            "url": "http://localhost:80"
          }
        }
      }
    }
  }
}

After startup, everything works:

# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net
>
< HTTP/1.1 200 OK
< Content-Length: 612
< Content-Type: text/html
< Date: Fri, 02 Sep 2016 11:56:33 GMT
< Last-Modified: Tue, 04 Mar 2014 11:46:45 GMT
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
<
{ [data not shown]
* Connection #0 to host localhost left intact

Then let's test it with wrk: wrk -t30 -c400 -d30s -H "Host: test-nginx.example.net" http://localhost:8000

After some time (one or a few wrk runs), traefik becomes unresponsive on port 8000:

# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net

The connection is accepted, but no response ever arrives, so traffic is no longer processed.

The health API is also unresponsive:

# curl -svo /dev/null http://localhost:8080/health
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /health HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*

Interestingly, the dashboard is still responsive (but shows no data):

# curl -svo /dev/null http://localhost:8080/dashboard/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /dashboard/ HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Content-Length: 1388
< Content-Type: text/html; charset=utf-8
< Last-Modified: Tue, 02 Aug 2016 17:29:47 GMT
< Date: Fri, 02 Sep 2016 12:03:29 GMT
<
{ [data not shown]
* Connection #0 to host localhost left intact

The traefik access log simply stops being written:

127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23814 "frontend1" "http://localhost:80" 218.853406ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23813 "frontend1" "http://localhost:80" 219.370314ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23812 "frontend1" "http://localhost:80" 219.933964ms

The traefik log (DEBUG level) keeps recording round trips to the backend, even for requests that never get a reply:

time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 9.730448ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.423115ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.664671ms"

strace during attempt to send request to traefik (curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000): https://gist.github.com/r0bj/b618c74b1bc0db5c11f78db08c34fc15

So it seems that the request reaches the backend, but the response isn’t sent back to the original client.

There are many connections in CLOSE_WAIT status: https://gist.github.com/r0bj/c647c76fe65a562ffd2e024e11a260cd

Restarting the traefik daemon fixes this issue.

It’s easy to reproduce this issue:

  1. start vagrant host with ubuntu from official ubuntu/trusty64 image with default settings
  2. ulimit -n 1000000
  3. install nginx
  4. start traefik with above config
  5. hit it with wrk benchmark one or more times

The issue can also be reproduced by sending wrk requests that match no backend (resulting in 404 responses): wrk -t30 -c400 -d30 http://localhost:8000/

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 22 (9 by maintainers)

Most upvoted comments

Already released: https://github.com/containous/traefik/releases/tag/v1.0.3 😃 Just waiting for docker to merge: https://github.com/docker-library/official-images/pull/2169. You can use containous/traefik:v1.0.3 in the meantime.

I think I found the issue. It seems to be caused by a race condition in https://github.com/thoas/stats. It occurs when the /health endpoint of traefik’s web UI is accessed while requests are being made to the traefik reverse proxy.

Could you confirm that you are accessing /health during your tests (through a health check, or by having the web UI open)?

A workaround is to avoid accessing the web UI during tests and to point your health check at the /api endpoint instead.

I’m investigating if the issue is still present in the master branch.