traefik: traefik hangs - stops handling requests
traefik version: 1.0.2
Set open files limit to 1000000:
ulimit -n 1000000
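Since ulimit -n only applies to the current shell and the processes started from it, it may be worth confirming that the running traefik process really got the raised limit. A minimal sketch, assuming Linux with /proc mounted; the helper name soft_nofile is made up for illustration:

```shell
# Hypothetical helper: print the soft "Max open files" limit found in
# /proc/<pid>/limits-style text read from stdin.
soft_nofile() {
    # Line format: "Max open files  <soft>  <hard>  files"
    awk '/^Max open files/ { print $4 }'
}
```

Usage, assuming the process is named traefik: soft_nofile < "/proc/$(pidof traefik)/limits"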
traefik config:
traefikLogsFile = "/var/log/traefik.log"
accessLogsFile = "/var/log/traefik-access.log"
logLevel = "DEBUG"
defaultEntryPoints = ["http"]
[entryPoints]
[entryPoints.http]
address = ":8000"
[retry]
[web]
address = ":8080"
[file]
[backends]
[backends.backend1]
[backends.backend1.servers.server1]
url = "http://localhost:80"
[frontends]
[frontends.frontend1]
backend = "backend1"
[frontends.frontend1.routes.test_1]
rule = "Host:test-nginx.example.net"
# curl -s http://localhost:8080/api | jq .
{
  "file": {
    "frontends": {
      "frontend1": {
        "priority": 0,
        "routes": {
          "test_1": {
            "rule": "Host:test-nginx.example.net"
          }
        },
        "backend": "backend1",
        "entryPoints": [
          "http"
        ]
      }
    },
    "backends": {
      "backend1": {
        "loadBalancer": {
          "method": "wrr"
        },
        "servers": {
          "server1": {
            "weight": 0,
            "url": "http://localhost:80"
          }
        }
      }
    }
  }
}
After start everything works:
# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net
>
< HTTP/1.1 200 OK
< Content-Length: 612
< Content-Type: text/html
< Date: Fri, 02 Sep 2016 11:56:33 GMT
< Last-Modified: Tue, 04 Mar 2014 11:46:45 GMT
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
<
{ [data not shown]
* Connection #0 to host localhost left intact
Then let's test it with wrk:
wrk -t30 -c400 -d30s -H "Host: test-nginx.example.net" http://localhost:8000
After some time (one or a few runs), traefik becomes unresponsive on port 8000:
# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net
So traffic is no longer processed.
Health API is also unresponsive:
# curl -svo /dev/null http://localhost:8080/health
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /health HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*
Interestingly, the dashboard is still responsive (but shows no data):
# curl -svo /dev/null http://localhost:8080/dashboard/
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /dashboard/ HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Content-Length: 1388
< Content-Type: text/html; charset=utf-8
< Last-Modified: Tue, 02 Aug 2016 17:29:47 GMT
< Date: Fri, 02 Sep 2016 12:03:29 GMT
<
{ [data not shown]
* Connection #0 to host localhost left intact
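The three probes above can be scripted so that a hung endpoint is reported instead of blocking the terminal. This is only a sketch: the function names are made up, the 5 second timeout is an arbitrary assumption, and it relies on curl exiting with status 28 when --max-time expires.

```shell
# Map a curl exit status to a short verdict.
classify_probe() {
    case "$1" in
        0)  echo "ok" ;;     # some HTTP response arrived
        28) echo "hung" ;;   # curl exit 28: operation timed out
        *)  echo "down" ;;   # e.g. 7: could not connect at all
    esac
}

# Probe one URL; any extra arguments are passed through to curl.
probe() {
    url="$1"
    shift
    curl -s -o /dev/null --max-time 5 "$@" "$url"
    classify_probe "$?"
}
```

Usage: probe http://localhost:8000 -H "Host: test-nginx.example.net", probe http://localhost:8080/health and probe http://localhost:8080/dashboard/. In the state described here, the first two would be expected to print hung and the last one ok.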

The traefik access log simply stops being written:
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23814 "frontend1" "http://localhost:80" 218.853406ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23813 "frontend1" "http://localhost:80" 219.370314ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23812 "frontend1" "http://localhost:80" 219.933964ms
The traefik log (at DEBUG level), however, keeps recording round trips all the time, even for requests that never get a reply:
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 9.730448ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.423115ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.664671ms"
strace output captured while sending a request to traefik (curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000):
https://gist.github.com/r0bj/b618c74b1bc0db5c11f78db08c34fc15
So it seems that the request reaches the backend, but the response is never sent back to the original client.
There are many connections in CLOSE_WAIT state:
https://gist.github.com/r0bj/c647c76fe65a562ffd2e024e11a260cd
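To watch the CLOSE_WAIT sockets pile up during a benchmark, something like the following could be used. It is a sketch, not what produced the gist above: it assumes the ss tool from iproute2 (which prints the state as CLOSE-WAIT, with a hyphen) and a made-up helper name.

```shell
# Hypothetical helper: count CLOSE-WAIT sockets whose local port is $1,
# reading `ss -tan` output from stdin.
count_close_wait() {
    port="$1"
    awk -v port="$port" '
        # $1 is the state column, $4 is "Local Address:Port"
        $1 == "CLOSE-WAIT" && $4 ~ (":" port "$") { n++ }
        END { print n + 0 }
    '
}
```

Usage: ss -tan | count_close_wait 8000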
Restarting the traefik daemon fixes the issue.
It’s easy to replicate this issue:
- start a vagrant host with Ubuntu from the official ubuntu/trusty64 image with default settings
- ulimit -n 1000000
- install nginx
- start traefik with the above config
- hit it with the wrk benchmark one or more times
One can also replicate this issue by sending wrk requests to a non-existent backend (resulting in 404):
wrk -t30 -c400 -d30 http://localhost:8000/
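Because the hang shows up only after one or a few runs, the benchmark can be repeated in a loop that stops at the first run after which traefik no longer answers. A rough sketch under assumptions: the function names, the 5 second timeout and the round limit are all made up; the wrk and curl invocations are the ones used in this report.

```shell
# One benchmark round, same wrk invocation as in the report.
run_bench() {
    wrk -t30 -c400 -d30s -H "Host: test-nginx.example.net" \
        http://localhost:8000 >/dev/null
}

# Succeeds only if traefik answers within 5 seconds (assumed timeout).
is_responsive() {
    curl -s -o /dev/null --max-time 5 \
        -H "Host: test-nginx.example.net" http://localhost:8000
}

# Run up to $1 rounds; print the round after which traefik stopped
# answering, or 0 if it stayed responsive throughout.
first_hang_round() {
    max="$1"
    i=1
    while [ "$i" -le "$max" ]; do
        run_bench
        if ! is_responsive; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    echo 0
}
```

Usage: first_hang_round 10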
About this issue
- State: closed
- Created 8 years ago
- Comments: 22 (9 by maintainers)
Already released: https://github.com/containous/traefik/releases/tag/v1.0.3 😃 Just waiting for docker to merge: https://github.com/docker-library/official-images/pull/2169. You can use containous/traefik:v1.0.3 in the meantime.
I think I found the issue. It seems to be caused by a race condition in https://github.com/thoas/stats. It occurs when the /health endpoint of traefik’s web UI is accessed while requests are being made to the traefik reverse proxy at the same time.
Could you confirm that you are accessing /health during your tests (with a healthcheck, or with the web UI open)?
A workaround is to avoid accessing the web UI during tests and to change your healthcheck to the /api endpoint.
I’m investigating whether the issue is still present in the master branch.
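The healthcheck workaround could be scripted roughly like this. It is only a sketch: the function name and the 3 second timeout are assumptions, and the default base URL is the [web] address from the config in this report.

```shell
# Hypothetical healthcheck that queries /api instead of /health.
# Returns non-zero on timeout, connection failure, or HTTP status >= 400.
traefik_healthcheck() {
    base="${1:-http://localhost:8080}"
    curl -sf -o /dev/null --max-time 3 "$base/api"
}
```

Usage: traefik_healthcheck || echo "traefik web endpoint not answering"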