prometheus: Prometheus should use http2 health monitor/set http2 ReadIdleTimeout
What did you do?
After upgrade from Prometheus v2.18.1 to v2.19.2, we had massive issues with a central Prometheus federating a bunch of others. By bisecting the changes, I was able to nail it down to enabling http2 in #7258 (which was well hidden in the commit message and not mentioned in the changelog). Federation failed for up to 15 minutes, which we explain with the kernel getting hold of the broken connection after some time, but Prometheus still trying to use it for that long. Revoking the prometheus/common
upgrade resolved the issue with 2.19.2. We assume we have rare issues of long-lasting network connections breaking on infrastructure level.
The issue somewhat reminds me of kubernetes/kubernetes#87615, where golang/net#55 is proposed to early recognize and close broken connections. Investigating how to set the ReadIdleTimeout
value in Prometheus, I realized we’re lucky that Prometheus is vendoring x/net
and does not require Golang 1.15, but the value is not properly exposed the way htt2.Transport
is usually instantiated (also Kubernetes is doing it this way): golang/go#40201. A pull request to resolve this is (allow configuration through http2.ConfigureTransportWithOptions
) already proposed, but not merged or even discussed yet: golang/net#74. It even “will not make it into 1.15”.
From discussion with the implementer on golang side, there would be three options to resolve this issue until we can finally make the value configurable in an official way:
- Hacking the vendored
x/net/http2
code to include the variable. Easy, but probably worst method to implement it. I used it anyway to build an internal hotfix release and verify the issue and solution through http2 health checks. - Using reflection to access the inadvertently hidden fields by getting access to the
http2.Transport
(example in thegolang/net
PR’s test case). - Implementing a (hidden?) feature toggle to not upgrade the
http.Transport
for http2, ie. putting anif
statement around thehttp2.ConfigureTransport
call (but this would need to passed through from Prometheus to common somehow). Not using http2 also resolves the issue, as I tested by reverting the entire common upgrade.
We ran code with the fix for ~48h now, and did not have any federation down issues any more (we had in average one per hour before with 2.19.2).
What did you expect to see?
Broken http2 connections are recognized soon and reestablished.
What did you see instead? Under which circumstances?
Federation failing, paging oncall engineer quite often.
Environment
- System information:
$ uname -srm
Linux 5.3.0-46-generic x86_64
- Prometheus version:
v2.19.2 as from docker hub, and manually build/patched Prometheus as discussed above
- Prometheus configuration file:
global:
scrape_interval: 1m
scrape_timeout: 45s
evaluation_interval: 1m
[snip]
scrape_configs:
- job_name: some_federation
honor_timestamps: true
params:
match[]:
- '{__name__=~"kube_.*|kubelet_.*|pv_collector_bound_pvc_count"}'
scrape_interval: 1m
scrape_timeout: 45s
metrics_path: /api/v1/namespaces/kube-system/services/prometheus:9090/proxy/prometheus/federate // access through Kubernetes API server
scheme: https
static_configs:
- targets:
- 198.51.100.100:6443
labels:
tenant: some_cluster
tls_config:
ca_file: /tmp/prometheus/tls/some_cluster-ca.crt
cert_file: /tmp/prometheus/tls/some_cluster.crt
key_file: /tmp/prometheus/tls/some_cluster.key
insecure_skip_verify: false
[snip]
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (15 by maintainers)
Commits related to this issue
- Enable HTTP2 again We are re-enabling HTTP 2 again. There has been a few bugfixes upstream in go, and we have also enabled ReadIdleTimeout. Fix #7588 Fix #9068 Signed-off-by: Julien Pivotto <roidel... — committed to roidelapluie/prometheus by roidelapluie 3 years ago
- Enable HTTP2 again We are re-enabling HTTP 2 again. There has been a few bugfixes upstream in go, and we have also enabled ReadIdleTimeout. Fix #7588 Fix #9068 Signed-off-by: Julien Pivotto <roidel... — committed to roidelapluie/prometheus by roidelapluie 3 years ago
- Enable HTTP2 again (#9398) We are re-enabling HTTP 2 again. There has been a few bugfixes upstream in go, and we have also enabled ReadIdleTimeout. Fix #7588 Fix #9068 Signed-off-by: Julien P... — committed to prometheus/prometheus by roidelapluie 3 years ago
The present plan is that we’re going to disable HTTP/2 again, until this is all fixed in Go. HTTP/2 will remain for the blackbox exporter, as these issues shouldn’t affect it.