prometheus: Prometheus should use http2 health monitor/set http2 ReadIdleTimeout

What did you do?

After upgrading from Prometheus v2.18.1 to v2.19.2, we had massive issues with a central Prometheus federating a bunch of others. By bisecting the changes, I was able to nail it down to http2 being enabled in #7258 (which was well hidden in the commit message and not mentioned in the changelog). Federation failed for up to 15 minutes at a time, which we attribute to the kernel only tearing down the broken connection after some time, while Prometheus kept trying to use it for that long. Reverting the prometheus/common upgrade resolved the issue with 2.19.2. We assume we have rare cases of long-lived network connections breaking at the infrastructure level.

The issue somewhat reminds me of kubernetes/kubernetes#87615, where golang/net#55 is proposed to recognize and close broken connections early. Investigating how to set the ReadIdleTimeout value in Prometheus, I realized we are lucky that Prometheus vendors x/net and does not require Go 1.15, but the value is not exposed the way http2.Transport is usually instantiated (Kubernetes does it the same way): golang/go#40201. A pull request to resolve this (allowing configuration through http2.ConfigureTransportWithOptions) has already been proposed, but is neither merged nor even discussed yet: golang/net#74. It even “will not make it into 1.15”.
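
For context, the knobs in question are plain fields on x/net’s http2.Transport; they are only out of reach when the http2 transport is created implicitly via http2.ConfigureTransport. Below is a minimal sketch of setting them on a directly constructed http2.Transport; the function name and timeout values are illustrative, not anything Prometheus or prometheus/common currently does, and note that a transport built this way negotiates only HTTP/2 with no HTTP/1.1 fallback.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHTTP2Client builds a client on a directly constructed http2.Transport,
// where the health-check fields are reachable. Such a transport speaks only
// HTTP/2, unlike the ConfigureTransport upgrade path, which keeps HTTP/1.1
// as a fallback.
func newHTTP2Client(tlsCfg *tls.Config) *http.Client {
	t2 := &http2.Transport{
		TLSClientConfig: tlsCfg,
		// Send an HTTP/2 PING frame once the connection has been idle this long...
		ReadIdleTimeout: 30 * time.Second,
		// ...and close the connection if the PING is not answered in time,
		// instead of keeping a dead connection around for many minutes.
		PingTimeout: 15 * time.Second,
	}
	return &http.Client{Transport: t2}
}

func main() {
	_ = newHTTP2Client(&tls.Config{})
}
```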

From discussion with the implementer on the Go side, there are three options to resolve this issue until the value can finally be made configurable in an official way:

  • Hacking the vendored x/net/http2 code to expose the variable. Easy, but probably the worst way to implement it. I used it anyway to build an internal hotfix release and to verify the issue and the fix through http2 health checks.
  • Using reflection to access the inadvertently hidden fields by getting hold of the http2.Transport (there is an example in the golang/net PR’s test case).
  • Implementing a (hidden?) feature toggle to not upgrade the http.Transport to http2, i.e. putting an if statement around the http2.ConfigureTransport call (this would need to be passed through from Prometheus to common somehow; see the sketch after this list). Not using http2 also resolves the issue, as I verified by reverting the entire common upgrade.
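
A minimal sketch of the third option, assuming a hypothetical enableHTTP2 toggle that Prometheus would have to plumb down to common; the transport fields and function name are illustrative, not the actual prometheus/common code:

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newTransport mirrors how a library might build an *http.Transport, but
// gates the HTTP/2 upgrade behind a flag.
func newTransport(enableHTTP2 bool) (*http.Transport, error) {
	rt := &http.Transport{
		MaxIdleConns:        100,
		IdleConnTimeout:     90 * time.Second,
		TLSHandshakeTimeout: 10 * time.Second,
	}
	if enableHTTP2 {
		// http2.ConfigureTransport registers the ALPN "h2" upgrade on rt but
		// returns no handle on the underlying http2.Transport, so fields like
		// ReadIdleTimeout stay out of reach.
		if err := http2.ConfigureTransport(rt); err != nil {
			return nil, err
		}
	}
	// With enableHTTP2 == false the transport only speaks HTTP/1.1, which
	// sidesteps the stuck-connection problem entirely.
	return rt, nil
}

func main() {
	if _, err := newTransport(false); err != nil {
		panic(err)
	}
}
```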

We have now been running code with the fix for ~48h and have not had any federation outages since (before, with 2.19.2, we averaged roughly one per hour).

What did you expect to see?

Broken http2 connections are recognized quickly and re-established.

What did you see instead? Under which circumstances?

Federation failing, paging the on-call engineer quite often.

Environment

  • System information:
$ uname -srm
Linux 5.3.0-46-generic x86_64
  • Prometheus version:

v2.19.2 as pulled from Docker Hub, and the manually built/patched Prometheus discussed above

  • Prometheus configuration file:
global:
  scrape_interval: 1m
  scrape_timeout: 45s
  evaluation_interval: 1m
[snip]
scrape_configs:
- job_name: some_federation
  honor_timestamps: true
  params:
    match[]:
    - '{__name__=~"kube_.*|kubelet_.*|pv_collector_bound_pvc_count"}'
  scrape_interval: 1m
  scrape_timeout: 45s
  metrics_path: /api/v1/namespaces/kube-system/services/prometheus:9090/proxy/prometheus/federate  # access through Kubernetes API server
  scheme: https
  static_configs:
  - targets:
    - 198.51.100.100:6443
    labels:
      tenant: some_cluster
  tls_config:
    ca_file: /tmp/prometheus/tls/some_cluster-ca.crt
    cert_file: /tmp/prometheus/tls/some_cluster.crt
    key_file: /tmp/prometheus/tls/some_cluster.key
    insecure_skip_verify: false
[snip]

Most upvoted comments

The present plan is that we’re going to disable HTTP/2 again, until this is all fixed in Go. HTTP/2 will remain for the blackbox exporter, as these issues shouldn’t affect it.
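
For reference, the usual way to keep a Go client off HTTP/2 (documented in net/http) is a non-nil, empty TLSNextProto map on the http.Transport; whether Prometheus disables it exactly like this is not shown in this issue, so treat the following as a sketch:

```go
package main

import (
	"crypto/tls"
	"net/http"
)

// http1OnlyTransport returns an *http.Transport that never negotiates HTTP/2:
// per the net/http documentation, a non-nil, empty TLSNextProto map disables
// the automatic h2 upgrade.
func http1OnlyTransport() *http.Transport {
	return &http.Transport{
		TLSNextProto: make(map[string]func(string, *tls.Conn) http.RoundTripper),
	}
}

func main() {
	_ = &http.Client{Transport: http1OnlyTransport()}
}
```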