traefik: Sporadic 502 response only when running through traefik

Do you want to request a feature or report a bug?

Bug

What did you do?

I have a GraphQL API running on Node.js using Apollo and Express, with Traefik in front.

When proxying through traefik I get sporadic 502 responses that I have not been able to resolve.

It never happens when I bypass the proxy and connect directly to the backend Node server.

I am running all tests locally on my dev machine.

My first attempt to force the error was load testing with the Locust framework. However, even when sending large numbers of requests through the proxy I was unable to replicate the error. It only happens when I use the frontend application in the browser.

After reading this oxy issue I started suspecting cancelled connections.

I added a custom HTTP header containing a UUID so I could trace all requests; the ID is printed on the backend:

  app.use((req, res, next) => {
    // Log the x-request-id header (set by the client) so requests can be correlated
    const id = req.headers['x-request-id'];
    if (id) {
      console.log(`Request id: ${id}`);
    }
    next();
  });
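
For reference, a minimal sketch of how such a header can be attached on the client side (this is not the actual frontend code, just an illustration of the pattern):

  // Attach a fresh UUID as `x-request-id` to every outgoing request so it can be
  // matched against the backend log lines produced by the middleware above.
  fetch('/graphql', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-request-id': crypto.randomUUID(), // available in modern browsers and Node >= 16.7
    },
    body: JSON.stringify({ query: '{ __typename }' }),
  });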

Then I also added the following middleware to the Express server to detect cancelled requests:

  app.use((req, res, next) => {
    // Note: req.connection is the underlying TCP socket. With keep-alive enabled,
    // several requests share one socket, so a 'close' listener is registered per
    // request and they all fire together when that socket is eventually closed.
    req.connection.on('close', () => {
      const id = req.headers['x-request-id'];
      console.log(`Cancelled request ${id}`);
    });

    next();
  });

What I can see is that I do get cancelled requests when running the application in the browser, and at some point I get a 502 response from Traefik. This is what the Traefik log shows:

DEBU[2018-04-26T13:43:51+02:00] vulcand/oxy/forward/http: Round trip: http://localhost:6543, code: 502, Length: 11, duration: 66.352475ms 

The Node.js backend log looks something like this:

...
Request id: 7455804b-490a-4361-98e5-43d12bf4aca8
Request id: 737f8d9d-3300-461b-858b-07006582a937
POST /graphql 200 83.542 ms - 310
POST /graphql 200 16.441 ms - 682
Request id: 096e0e39-90e6-475c-b8ad-0aa2dfd2e345
POST /graphql 200 5.338 ms - 163
Request id: 69f17cb2-cdf1-4db5-a9f5-08e46d795892
Request id: 50d3aec6-5cda-4e8b-ac0e-a30a57fa94c9
POST /graphql 200 58.596 ms - 310
POST /graphql 200 15.526 ms - 682
Request id: 1d051f3a-7d80-464b-b50f-6d8e733d1940
<------------- Here I get the 502
Cancelled request 2e0a8e14-9880-46e7-8e51-ad528d55a81d
Cancelled request b9489e71-7fc5-4f1c-b30a-668aac4652f9
Cancelled request 249c529c-b9cb-4b48-a491-8e38a7ee50d8
Cancelled request a5be4a66-9d43-4e30-a92d-862b355399a0
Cancelled request 3721fe71-fe18-4389-812a-a90cc2f4f0f1
Cancelled request 71b74750-8078-471e-91b8-a8119e5db797
Cancelled request 34fb6b91-9fa5-4d68-92da-c267089f5910
Cancelled request 692770b1-61c3-49c2-8309-8e7be629dca1
Cancelled request 05790579-8290-4787-a7b7-82596ad24520
Cancelled request c8edcc39-30c7-4812-941c-a1899298acf7
Cancelled request 2ba9e715-ab7c-48ee-9d35-b5609179de6e
Cancelled request b34f4725-665f-4b27-b3e1-cefec20c2ade
Cancelled request 04bd3718-f6aa-4318-a469-fa3e17f54a20
Cancelled request 4aedc60c-269a-420c-b083-1ea8f2e3243e
Cancelled request 25be7334-43f9-4135-9537-36b0e36e698c
Cancelled request 47bc1f9f-55c7-4f31-9957-7f0ad4285314
Cancelled request bae3237c-efc8-4831-8260-6edbcedef28f
Cancelled request 54685ecb-4d34-4698-b956-d0602b74a2e4
Cancelled request 965f6ff2-da91-423c-a8e4-c2f4252f25fc
Cancelled request 95c77d5c-230d-4875-8b25-fc0673c8e595
Cancelled request 01658960-4627-42f8-a496-d29408a9579b
Cancelled request 38221ac3-47ed-42f2-a56e-31deacdbfd62
Cancelled request e73bec6b-744c-47bc-b001-0d914f03e976
Cancelled request 73fade75-a943-45df-8b21-f8c50a480170
Cancelled request 02688ad9-e947-415f-b70c-3cda16c50cf2
Cancelled request 5d7d26c2-8c69-4083-a2d3-f0e1ae23bd0f
Cancelled request f81a0258-085d-462f-9fcb-8a8b47918d04
...

The failed request that gets a 502 response in the browser never reaches the Node backend.

I get a whole bunch of cancelled requests after the 502 occurs. These request IDs were all served successfully by the Node.js application at an earlier point.

The cancelling of the requests seems to indicate some kind of connection leak? Or maybe it is just a side effect of having the Chrome developer tools open?

In any case, I never get any error responses when bypassing the Traefik instance.

As the oxy issue describes, if I could just get some response other than 502 for cancelled requests, I could handle this better on the client side.

Output of traefik version: (What version of Traefik are you using?)

I get the problem with the Docker release as well as with my Homebrew install.

Homebrew traefik version:

Version:      dev
Codename:     cheddar
Go version:   go1.10
Built:        I don't remember exactly
OS/Arch:      darwin/amd64

Docker traefik version:

Version:      v1.5.2
Codename:     cancoillotte
Go version:   go1.9.4
Built:        2018-02-12_10:56:31AM
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, …)?

debug = true

logLevel = "DEBUG"
defaultEntryPoints = ["http"]

[entryPoints]
  [entryPoints.ping]
  address = ":8082"

  [entryPoints.api]
  address = ":8080"

  [entryPoints.http]
  address = ":80"

[retry]

[ping]
entryPoint = "ping"

[api]
entryPoint = "api"
  [api.statistics]

[file]
[backends]
  [backends.bct]
    [backends.bct.servers]
      [backends.bct.servers.server0]
        # url = "http://docker.for.mac.host.internal:6543"
        url = "http://localhost:6543"


[frontends]
  [frontends.bct]
    entryPoints = ["http"]
    backend = "bct"

[docker]
  endpoint = "unix:///var/run/docker.sock"
  # domain = "docker.for.mac.host.internal"
  domain = "localhost"
  watch = true
  exposedbydefault = false

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 20
  • Comments: 49 (5 by maintainers)

Most upvoted comments

This may be useful: to disable keep-alives on the Traefik side, just set this in the Traefik config:

MaxIdleConnsPerHost = -1

I was unable to find any documentation regarding negative values for this parameter, but it looks like this link explains why it works: https://github.com/golang/go/blob/release-branch.go1.12/src/net/http/transport.go#L743

This is how I fixed the integration with Sentry (it uses uWSGI in its official Docker image). This parameter works with Traefik v1.7.12 as well as v2.0.0-beta1. By the way, I could not fix the issue by just decreasing idleConnTimeout (from #4983).
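
For reference, in Traefik v1 this is a root-level option in the static configuration; a sketch of where it would sit in a traefik.toml like the one posted above (the rest of the file stays unchanged):

debug = true
logLevel = "DEBUG"
defaultEntryPoints = ["http"]

# -1 disables idle keep-alive connections to the backends
# (the option is documented in Traefik 1.7 as maxIdleConnsPerHost;
# see the Go transport source linked above for why a negative value works)
MaxIdleConnsPerHost = -1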

We had exactly the same problem. In our case, it was between Traefik v1.7.6 and multiple Node.js instances:

  • 502 in traefik logs
  • no trace of these requests on node instances logs

tcpdump shows that the connection carrying this request is closed by a TCP RST:

15:55:01.107956 IP proxy.38934 > node.20101: Flags [P.], seq 3508916203:3508916609, ack 1893329136, win 547, length 406
E.....@.@.:...d...d...N..%..p...P..#!...POST /someroute HTTP/1.1
Host: example.com
User-Agent: curl/7.54.0
Content-Length: 49
Accept: */*
Accept-Encoding: deflate, gzip
Content-Type: application/json
Counter: 1779114827
X-Forwarded-For: 1.2.3.4
X-Forwarded-Host: example.com
X-Forwarded-Port: 443
X-Forwarded-Proto: https
X-Forwarded-Server: proxy
X-Real-Ip: 1.2.3.4

15:55:01.108095 IP node.20101 > proxy.38934: Flags [.], ack 3508916609, win 555, length 0
E..(..@.@./...d...d.N...p....%..P..+.[........
15:55:01.108331 IP proxy.38934 > node.20101: Flags [P.], seq 3508916609:3508916658, ack 1893329136, win 547, length 49
E..Y..@.@.<M..d...d...N..%..p...P..# ...{"token": "token"}
15:55:01.108449 IP node.20101 > proxy.38934: Flags [.], ack 3508916658, win 555, length 0
E..(..@.@./...d...d.N...p....%..P..+.*........
15:55:01.471175 IP node.20101 > proxy.38933: Flags [F.], seq 2831110990, ack 34277271, win 530, length 0
E..(UO@.@..@..d...d.N.....SN....P....6........
15:55:01.471369 IP proxy.38933 > node.20101: Flags [F.], seq 34277271, ack 2831110991, win 530, length 0
E..(.7@.@..X..d...d...N.......SOP... [..
15:55:01.471471 IP node.20101 > proxy.38933: Flags [.], ack 34277272, win 530, length 0
E..(UP@.@..?..d...d.N.....SO....P....5........
15:55:01.478389 IP node.20101 > proxy.38934: Flags [R.], seq 1893329136, ack 3508916658, win 555, length 0
E..(..@.@./...d...d.N...p....%..P..+.&........

The connection (port 38934, from the proxy) is already open (keep-alive). The request arrives at Node, is ACKed by Node (with the correct seq), and the connection is reset right after. We can also see a second connection (port 38933), which has no problem.

I suspected a race condition when a request arrives at Node exactly at the moment Node decides to close the connection after its keep-alive timeout (5s on Node >= 8):

import requests
import time

# Sweep the pause between requests across Node's 5s keep-alive timeout
start = 4.982
end = 5.00
step = 0.0001

session = requests.Session()  # reuse one keep-alive connection for all requests
payload = '{"token": "token"}'

sleep_seconds = start
while sleep_seconds < end:
    r = session.post("https://example.com/someroute", data=payload)
    print(r.status_code)
    sleep_seconds += step
    print(f'wait {sleep_seconds}')
    time.sleep(sleep_seconds)

I tested this script in an environment with only one Node backend and only these requests (so there was one request every ~5s).

Bingo! I got many 502 errors.

The solution was to set the Node keep-alive timeout higher than Traefik's (90s, not configurable in Traefik, see #727). That way Traefik always closes the connection before Node does, and the race condition is no longer possible.
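
For a Node/Express backend, the fix can be applied roughly like this (a sketch; the exact values are only examples, the point is that Node's timeout must exceed Traefik's 90s idle timeout):

  const express = require('express');
  const app = express();

  // ... routes and middleware ...

  const server = app.listen(3000); // port is just an example

  // Keep idle connections open longer than Traefik's 90s idle timeout, so that
  // Traefik is always the side that closes a keep-alive connection first.
  server.keepAliveTimeout = 120 * 1000; // milliseconds
  // On newer Node versions, headersTimeout should stay above keepAliveTimeout.
  server.headersTimeout = 125 * 1000;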

I am not sure, but I think this is more of a Node.js bug (it should not ACK the packet and then reset the connection right after). However, it would be nice to be able to configure Traefik's keep-alive timeout.

My colleagues managed to fix this issue in our project. In essence Traefik doesn’t support HTTP keep-alive, which causes errors when the client tries to reuse connections. This looks like the same issue as #615 and #1046.

We’re having the same issue with Traefik v1.7.4 (the same problem appeared in the previous versions we tried). Traefik is “consistently” throwing 502 for approximately 0.5%-2% of requests. The issue becomes more apparent with a lower number of backend instances.

I wasn’t able to reproduce the issue with simple GET requests. With a more complex payload, 30 concurrent connections against 4 instances of Traefik with 2 backends, and 1000 requests in total:

Average request time: 0.8031490905199999
0. percentile: 0.0697
10. percentile: 0.0987
20. percentile: 0.1109
30. percentile: 0.1275
40. percentile: 0.5045
50. percentile: 0.8564
60. percentile: 1.0508
70. percentile: 1.0612
80. percentile: 1.0687
90. percentile: 1.7269
100. percentile: 3.0648
Errors: 20/1000 2.00%

The same experiment with HAProxy:

Average request time: 0.9934568203530013
0. percentile: 0.0741
10. percentile: 0.1035
20. percentile: 0.1366
30. percentile: 0.4125
40. percentile: 0.9118
50. percentile: 1.0552
60. percentile: 1.0616
70. percentile: 1.0667
80. percentile: 1.0824
90. percentile: 2.1481
100. percentile: 3.0761
Errors: 0/1000 0.00%

The performance is somewhat comparable, but HAProxy is not randomly throwing 502s.

With more backend instances there are fewer errors, but it still hits at least 0.5% of traffic, which is a huge problem when you are targeting availability of 99.9% or better.

Had the same issue with Traefik 2.0.4. After setting --providers.docker.network=XXXX the issue didn’t come up anymore.

I fixed my problem by updating the Docker version. I was using 18.03.1; the problem was solved with version 18.09 CE.

We have fixed it, although we have no idea what the actual resolution was.

We have now reverted all Traefik instances to v1.6.4 and isolated Traefik per namespace (--kubernetes.namespaces=xxx). We also rolled out the k8s cluster again to fix a potential network-proxy issue and added "--retry.attempts=10".

  • Upgrading to v1.6.6 fixed the 500 errors, but we still got 502s
  • Upgrading to v1.7.4 or v1.7.5 still threw 502s
  • Even a fresh deployment didn’t help

Hope this may help someone.

BTW: Our Docker version is 18.06 on k8s. We didn’t touch that, as it has worked for a long time.

@jhinklew Same here. We use K8S v1.9.4 on AWS. We may have resolved this issue yesterday. What we did:

    • Recreate the entire k8s cluster
    • Redeploy all the Traefik instances with the following parameters: --retry.attempts=10
      --kubernetes.namespaces=xxx --idletimeout=60s
      --forwardingtimeouts.responseheadertimeout=60s

After that, all the “500” and “502” errors disappeared. We are still working on another issue atm.

~Is this issue still ongoing in 1.7.3? I am experiencing similar behavior of sporadic 502’s~ I learned this was due to a separate issue.