traefik: Sporadic 502 response only when running through traefik
Do you want to request a feature or report a bug?
Bug
What did you do?
I have a GraphQL API running on Node.js using Apollo and Express, with Traefik in front.
When proxying through Traefik I get sporadic 502 responses that I have not been able to resolve.
It never happens when I bypass the proxy and connect directly to the backend Node server.
I am running all tests locally on my dev machine.
My first attempt to force the error was load testing with the Locust framework. However, even when sending large numbers of requests through the proxy I was unable to replicate the error. It only happens when I use the frontend application in the browser.
After reading this oxy issue I started suspecting cancelled connections.
I added a custom HTTP header with a UUID to be able to trace all requests, which I print on the backend.
app.use((req, res, next) => {
  // Log the tracing header so each request can be correlated with the frontend.
  const id = req.headers['x-request-id'];
  if (id) {
    console.log(`Request id: ${id}`);
  }
  next();
});
Then I also added the following event listener to the Express server to detect cancelled requests:
app.use((req, res, next) => {
  // Fires when the underlying socket closes, which also covers requests
  // cancelled by the client (or by the proxy) before a response is sent.
  req.connection.on('close', () => {
    const id = req.headers['x-request-id'];
    console.log(`Cancelled request ${id}`);
  });
  next();
});
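For completeness, the client side of this tracing is just a small wrapper that attaches a fresh UUID to every request; a minimal sketch, assuming a fetch-based frontend (the wrapper name and the use of crypto.randomUUID are assumptions, not taken from the original setup):

// Hypothetical client-side wrapper: attach a fresh UUID to every request so it
// can be matched against the "Request id" lines in the backend log.
// crypto.randomUUID() is available in modern browsers; older environments can
// use the uuid package instead.
async function tracedFetch(url, options = {}) {
  const id = crypto.randomUUID();
  const headers = { ...(options.headers || {}), 'x-request-id': id };
  return fetch(url, { ...options, headers });
}

// e.g. tracedFetch('/graphql', { method: 'POST', body, headers: { 'content-type': 'application/json' } });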
What I can see is that I do get cancelled requests when running the application in the browser, and at some point I get a 502 response from Traefik. In the Traefik log this shows up as:
DEBU[2018-04-26T13:43:51+02:00] vulcand/oxy/forward/http: Round trip: http://localhost:6543, code: 502, Length: 11, duration: 66.352475ms
The Node.js backend log looks something like this:
...
Request id: 7455804b-490a-4361-98e5-43d12bf4aca8
Request id: 737f8d9d-3300-461b-858b-07006582a937
POST /graphql 200 83.542 ms - 310
POST /graphql 200 16.441 ms - 682
Request id: 096e0e39-90e6-475c-b8ad-0aa2dfd2e345
POST /graphql 200 5.338 ms - 163
Request id: 69f17cb2-cdf1-4db5-a9f5-08e46d795892
Request id: 50d3aec6-5cda-4e8b-ac0e-a30a57fa94c9
POST /graphql 200 58.596 ms - 310
POST /graphql 200 15.526 ms - 682
Request id: 1d051f3a-7d80-464b-b50f-6d8e733d1940
<------------- Here I get the 502
Cancelled request 2e0a8e14-9880-46e7-8e51-ad528d55a81d
Cancelled request b9489e71-7fc5-4f1c-b30a-668aac4652f9
Cancelled request 249c529c-b9cb-4b48-a491-8e38a7ee50d8
Cancelled request a5be4a66-9d43-4e30-a92d-862b355399a0
Cancelled request 3721fe71-fe18-4389-812a-a90cc2f4f0f1
Cancelled request 71b74750-8078-471e-91b8-a8119e5db797
Cancelled request 34fb6b91-9fa5-4d68-92da-c267089f5910
Cancelled request 692770b1-61c3-49c2-8309-8e7be629dca1
Cancelled request 05790579-8290-4787-a7b7-82596ad24520
Cancelled request c8edcc39-30c7-4812-941c-a1899298acf7
Cancelled request 2ba9e715-ab7c-48ee-9d35-b5609179de6e
Cancelled request b34f4725-665f-4b27-b3e1-cefec20c2ade
Cancelled request 04bd3718-f6aa-4318-a469-fa3e17f54a20
Cancelled request 4aedc60c-269a-420c-b083-1ea8f2e3243e
Cancelled request 25be7334-43f9-4135-9537-36b0e36e698c
Cancelled request 47bc1f9f-55c7-4f31-9957-7f0ad4285314
Cancelled request bae3237c-efc8-4831-8260-6edbcedef28f
Cancelled request 54685ecb-4d34-4698-b956-d0602b74a2e4
Cancelled request 965f6ff2-da91-423c-a8e4-c2f4252f25fc
Cancelled request 95c77d5c-230d-4875-8b25-fc0673c8e595
Cancelled request 01658960-4627-42f8-a496-d29408a9579b
Cancelled request 38221ac3-47ed-42f2-a56e-31deacdbfd62
Cancelled request e73bec6b-744c-47bc-b001-0d914f03e976
Cancelled request 73fade75-a943-45df-8b21-f8c50a480170
Cancelled request 02688ad9-e947-415f-b70c-3cda16c50cf2
Cancelled request 5d7d26c2-8c69-4083-a2d3-f0e1ae23bd0f
Cancelled request f81a0258-085d-462f-9fcb-8a8b47918d04
...
The failed request that gets the 502 response in the browser never reaches the Node server backend.
I get a whole bunch of cancelled requests after the 502 occurs. These request IDs were successfully served by the Node.js application at an earlier point.
The cancelling of the requests seems to indicate some kind of connection leak? Or maybe it is just a side effect of having the Chrome developer tools open?
In any case, I never get any error response when bypassing the Traefik instance.
As the oxy issue describes, if I could just get some response other than 502 for cancelled requests, I could handle this better on the client side.
Output of traefik version: (What version of Traefik are you using?)
I get the problem with the Docker release as well as with my Homebrew install.
Homebrew traefik version:
Version: dev
Codename: cheddar
Go version: go1.10
Built: I don't remember exactly
OS/Arch: darwin/amd64
Docker traefik version:
Version: v1.5.2
Codename: cancoillotte
Go version: go1.9.4
Built: 2018-02-12_10:56:31AM
OS/Arch: linux/amd64
What is your environment & configuration (arguments, toml, provider, platform, …)?
debug = true
logLevel = "DEBUG"
defaultEntryPoints = ["http"]

[entryPoints]
  [entryPoints.ping]
    address = ":8082"
  [entryPoints.api]
    address = ":8080"
  [entryPoints.http]
    address = ":80"

[retry]

[ping]
  entryPoint = "ping"

[api]
  entryPoint = "api"
  [api.statistics]

[file]

[backends]
  [backends.bct]
    [backends.bct.servers]
      [backends.bct.servers.server0]
        # url = "http://docker.for.mac.host.internal:6543"
        url = "http://localhost:6543"

[frontends]
  [frontends.bct]
    entryPoints = ["http"]
    backend = "bct"

[docker]
  endpoint = "unix:///var/run/docker.sock"
  # domain = "docker.for.mac.host.internal"
  domain = "localhost"
  watch = true
  exposedbydefault = false
About this issue
- State: closed
- Created 6 years ago
- Reactions: 20
- Comments: 49 (5 by maintainers)
May be useful: to disable keep-alives on the Traefik side, just set this in the Traefik config:
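Presumably something like the following; the exact parameter is an assumption, inferred from the mention of negative values and the linked Go transport code:

# Assumption: the option in question is the global maxIdleConnsPerHost setting
# in the static configuration. With a negative value Go's http.Transport
# refuses to keep idle connections (tryPutIdleConn returns
# errKeepAlivesDisabled), so Traefik never reuses backend connections.
maxIdleConnsPerHost = -1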
I was unable to find any documentation regarding negative values for this parameter, but it looks like this link explains why it works: https://github.com/golang/go/blob/release-branch.go1.12/src/net/http/transport.go#L743
This is how I fixed the integration with Sentry (it uses uWSGI in its official Docker image). This parameter works with Traefik v1.7.12 as well as v2.0.0-beta1. By the way, I could not fix the issue just by decreasing idleConnTimeout.
(from #4983) We had exactly the same problem. In our case, it was between Traefik v1.7.6 and multiple Node.js instances.
tcpdump shows that the connection for this request is closed by a TCP RST:
The connection (port 38934, from the proxy) is already open (keep-alive). The request arrives at Node, is ACKed by Node (with the correct seq) and is reset just after. We can see a second connection (port 38933), which has no problem.
I suspected a race condition when a request arrives at Node exactly at the moment Node decides to close the connection after its keep-alive timeout (5s on Node >= 8).
I tested this script in an environment with only one Node backend and only these requests (to have one request every ~5s).
Bingo! I got many 502 errors.
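A minimal sketch of that kind of pacing test in Node (hypothetical; the entrypoint address and path are assumptions, the only point is the ~5s spacing):

// Hypothetical reproduction script: fire one request through the proxy roughly
// every 5 seconds, so that each request arrives right around the backend's 5s
// keep-alive timeout and has a chance to hit the close/reuse race.
const http = require('http');

function fireRequest() {
  const req = http.request(
    { host: 'localhost', port: 80, path: '/', method: 'GET' },
    (res) => {
      console.log(`status: ${res.statusCode}`);
      res.resume(); // drain the body
    }
  );
  req.on('error', (err) => console.error(`request failed: ${err.message}`));
  req.end();
}

setInterval(fireRequest, 5000);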
The solution was to set the Node keep-alive timeout higher than Traefik's (90s, not configurable in Traefik, see #727). In this case Traefik will close the connection before Node does, and the race condition is no longer possible.
I am not sure, but I think this is more of a Node.js bug (it should not ACK the packet and then reset the connection just after). However, it would be nice to be able to configure the Traefik keep-alive timeout.
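A minimal sketch of that Node-side workaround (the port and timeout values are assumptions; the point is only that the backend's keep-alive timeout must exceed the proxy's idle timeout):

const express = require('express');

const app = express();
// ... routes ...

const server = app.listen(6543);

// Keep idle connections open longer than the proxy's ~90s idle timeout, so
// Traefik closes them first and never fires a request into a socket the
// backend is about to reset.
server.keepAliveTimeout = 120 * 1000;
// On newer Node versions headersTimeout should also be raised above
// keepAliveTimeout, otherwise connections can still be dropped early.
server.headersTimeout = 125 * 1000;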
My colleagues managed to fix this issue in our project. In essence Traefik doesn’t support HTTP keep-alive, which causes errors when the client tries to reuse connections. This looks like the same issue as #615 and #1046.
We're having the same issue with Traefik v1.7.4 (the same problem appeared in previous versions we've tried). Traefik is "consistently" throwing 502 for approximately 0.5%-2% of requests. The issue becomes more apparent with a lower number of backend instances.
I wasn't able to reproduce the issue with simple GET requests. With a more complex payload, 30 concurrent connections on 4 instances of Traefik with 2 backends and 1000 requests in total:
Same experiment with HAProxy:
The performance is somewhat comparable, but HAProxy is not randomly throwing 502s. With more backend instances there are fewer errors, but it still hits at least a 0.5% portion of traffic, which is a huge problem when you're targeting availability starting at 99.9%.
I had the same issue with Traefik 2.0.4. After setting --providers.docker.network=XXXX the issue didn't come up anymore.
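For reference, the same setting in the Traefik v2 static TOML configuration (the network name below is a placeholder; the original comment redacts it as XXXX):

[providers.docker]
  # Placeholder network name. Pinning the provider to one Docker network stops
  # Traefik from picking a container IP on a network it cannot reach, which
  # also shows up as sporadic 502s.
  network = "proxy-net"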
I fixed my problem by updating the Docker version. I was using 18.03.1; the problem was solved with version 18.09 CE.
We have fixed it, though we have no idea what the actual resolution was.
We reverted all our Traefik instances to v1.6.4 and isolated Traefik per namespace (--kubernetes.namespaces=xxx). We also rolled out the k8s cluster again to fix a potential network-proxy issue and added --retry.attempts=10.
Hope this may help someone.
BTW: our Docker version is 18.06 on k8s. We didn't touch that, as it has been working for a long time.
@jhinklew Same here. We use K8s v1.9.4 on AWS. We may have resolved this issue yesterday. What we did is:
--kubernetes.namespaces=xxx
--idletimeout=60s
--forwardingtimeouts.responseheadertimeout=60s
After that, all the "500" and "502" errors disappeared. We are still working on another issue atm.
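The file-configuration equivalents of those timeout flags in Traefik v1 would be roughly as follows (a sketch; values copied from the flags above):

[respondingTimeouts]
  # How long an idle keep-alive connection from a client is kept open.
  idleTimeout = "60s"

[forwardingTimeouts]
  # How long Traefik waits for the backend's response headers.
  responseHeaderTimeout = "60s"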
~Is this issue still ongoing in 1.7.3? I am experiencing similar behavior of sporadic 502’s~ I learned this was due to a separate issue.