traefik: Upstream golang HTTP2 bug hangs chromium-based browsers

Do you want to request a feature or report a bug?

Bug

What did you do?

We used Traefik as a k8s ingress controller back in the 1.6/1.7 days, but at some point a new version introduced a crippling bug: users on Chrome would encounter “lockups” in which all network communication between Chrome and Traefik ceased, and they could not use our app again until some time later (usually around 5-10 minutes). The lockup affected other browser tabs and windows as well; the only recourse for an end user was to close Chrome completely or use a different browser. Unfortunately, we didn’t have the time to debug further and switched to ingress-nginx, which did not exhibit the issue.

Recently we decided to give Traefik 2.4 another shot, hoping that the issue had already been resolved, but unfortunately it was still present. Being much better at debugging now, we were able to track it down to an upstream issue in Go’s net/http library.

This is the underlying issue: https://github.com/golang/go/issues/42534

I confirmed this by reproducing the hang and, while it was happening, opening the debug pprof endpoint to see where the goroutines were stuck: pprof-goroutines.log. It shows a very large number of goroutines all stuck in the same spot (writeHeaders) that the reporter of the Go issue observed. The exact offset is slightly different, likely because of slightly different Go versions.

Further confirming the bug report, adding the following to server_entrypoint_tcp.go at line 503 and building a new Traefik image makes the issue no longer reproducible:

http2.ConfigureServer(serverHTTP, nil)

Unfortunately, according to that issue report, the bugfix somehow missed Go 1.16, so this is likely something that should be fixed in Traefik until the upstream situation changes. I am going to hold off on submitting the above code change as a one-line PR, as a Traefik maintainer may have a better idea of how to fix this.

Given how long this issue has been present, it is very likely that other people have encountered it and put it down to general instability in Traefik.

Output of traefik version: (What version of Traefik are you using?)

Version:      2.4.6
Codename:     livarot
Go version:   go1.15.8
Built:        2021-03-01T18:25:05Z
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, …)?

--entryPoints.traefik.address=:9000/tcp
--entryPoints.web.address=:8000/tcp
--entryPoints.websecure.address=:8443/tcp
--api.dashboard=true
--ping=true
--providers.kubernetescrd
--providers.kubernetesingress
--entrypoints.web.http.redirections.entryPoint.to=:443
--entrypoints.web.http.redirections.entryPoint.scheme=https
--entrypoints.websecure.http.tls=true
--entrypoints.websecure.http.tls.options=default
--log.level=INFO
--metrics.prometheus=true
--providers.kubernetesingress.ingressendpoint.hostname=ingress-tr-i-o1.our.domain
--providers.kubernetesingress.labelselector=traffic=traefik-internal
--entryPoints.web.proxyProtocol.trustedIPs=100.64.0.0/10
--entryPoints.websecure.proxyProtocol.trustedIPs=100.64.0.0/10
--entryPoints.web.transport.lifeCycle.requestAcceptGraceTimeout=60
--entryPoints.websecure.transport.lifeCycle.requestAcceptGraceTimeout=60
--entryPoints.web.transport.respondingTimeouts.idleTimeout=65
--entryPoints.websecure.transport.respondingTimeouts.idleTimeout=65
--api.debug=true

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 19 (4 by maintainers)

Most upvoted comments

I’m seeing this issue on 2.7

That version of Traefik is end-of-life and has known vulnerabilities in the Go runtime as well as in various dependencies (including the net/http libraries). You really should be updating far more often than you are.

To answer your question: no, this is not an issue anymore, as Traefik switched to using the http2 library directly a few releases back.

Hello @ReillyBrogan @Marcus-Smallman,

According to the original comment, it looks like PR #8781 fixed this issue, and more specifically these changes: https://github.com/traefik/traefik/blob/8c56d1a3388bb1ce52e3fd6a76ecedde8e78de8b/pkg/server/server_entrypoint_tcp.go#L546-L558

It has been shipped with v2.8.

We are closing this issue accordingly. Closed by #8781.

@geekgonecrazy I spoke with Traefik about this issue a few months ago. They looked at it on the Traefik Enterprise side and gave me a quote for how much it would cost to get it fixed in the open source project. So I guess it’s still an issue that’s not being resolved.

I was hoping they would share their findings with the community so that this could potentially be fixed by one of the contributors, but I’ve not heard anything in months and we still struggle with this.