prometheus: Istio mTLS fails after Prometheus 2.20.1

What did you do?

Using Istio 1.6.14, I am mounting the Istio sidecar manually, without proxying any traffic, so that I can access the Istio mTLS certificates. I have a scrape configuration set up to use those certificates to scrape endpoints that have Istio sidecars.

Under Prometheus v2.20.1 this works perfectly. Under Prometheus v2.21.0 and above it fails with “connection reset by peer.”

You can follow along with my troubleshooting attempts in the newsgroup thread, but I’ve reached a point where I can’t figure it out, and I think there’s a bug in here somewhere.

What did you expect to see?

I expected v2.28.0 to continue scraping Istio pods just as v2.20.1 did, using the same scrape configuration and the same certificates.

What did you see instead? Under which circumstances?

In versions 2.21.0 through 2.28.0 any endpoint using Istio mTLS fails to be scraped with the message “connection reset by peer.” Here’s the debug log message under v2.28.0:

level=debug ts=2021-07-06T20:58:32.984Z caller=scrape.go:1236 component="scrape manager" scrape_pool=kubernetes-pods-istio-secure target=https://10.244.3.10:9102/metrics msg="Scrape failed" err="Get \"https://10.244.3.10:9102/metrics\": read tcp 10.244.4.89:36666->10.244.3.10:9102: read: connection reset by peer"

Environment

  • System information: Linux 5.4.0-1047-azure x86_64
  • Prometheus version:
prometheus, version 2.28.0 (branch: HEAD, revision: ff58416a0b0224bab1f38f949f7d7c2a0f658940)
  build user:       root@32b9079a2740
  build date:       20210621-15:45:36
  go version:       go1.16.5
  platform:         linux/amd64
  • Prometheus configuration file:

The relevant scrape job is below. The certificates are mounted at /etc/istio-certs; I have verified that the certificate files are present and properly mounted.

    - job_name: kubernetes-pods-istio-secure
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: keep
        regex: (([^;]+);([^;]*))|(([^;]*);(true))
        source_labels:
        - __meta_kubernetes_pod_annotation_sidecar_istio_io_status
        - __meta_kubernetes_pod_annotation_istio_mtls
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: keep
        regex: ([^:]+):(\d+)
        source_labels:
        - __address__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: pod_name
      scheme: https
      tls_config:
        ca_file: /etc/istio-certs/root-cert.pem
        cert_file: /etc/istio-certs/cert-chain.pem
        insecure_skip_verify: true
        key_file: /etc/istio-certs/key.pem
  • Logs:
level=debug ts=2021-07-06T20:58:32.984Z caller=scrape.go:1236 component="scrape manager" scrape_pool=kubernetes-pods-istio-secure target=https://10.244.3.10:9102/metrics msg="Scrape failed" err="Get \"https://10.244.3.10:9102/metrics\": read tcp 10.244.4.89:36666->10.244.3.10:9102: read: connection reset by peer"

Additional context / things I’ve tried:

I noticed in v2.21.0 that several things changed, and I’m not sure if any of them affect this issue.

  • The Go version was updated to 1.15
  • There were some challenges around HTTP/2, which caused it to be disabled (see the sketch after this list)
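
Since the HTTP/2 change seemed like the most likely culprit, here is a minimal Go sketch (an illustration of the standard-library behaviour, not Prometheus source) of how a client transport ends up with HTTP/2 disabled, and why that removes "h2" from the ALPN protocols offered during the TLS handshake:

    // A minimal sketch, not Prometheus source: two ways a Go http.Transport ends up
    // without HTTP/2. When HTTP/2 is off, the TLS client no longer offers "h2" via
    // ALPN, which the Istio sidecar on the other end can notice.
    package main

    import (
        "crypto/tls"
        "fmt"
        "net/http"
    )

    func newTransport(tlsCfg *tls.Config) *http.Transport {
        // Supplying a custom TLSClientConfig already disables the automatic HTTP/2
        // support unless ForceAttemptHTTP2 is set.
        t := &http.Transport{TLSClientConfig: tlsCfg}

        // Setting a non-nil, empty TLSNextProto map disables HTTP/2 explicitly,
        // regardless of other settings.
        t.TLSNextProto = map[string]func(string, *tls.Conn) http.RoundTripper{}
        return t
    }

    func main() {
        client := &http.Client{Transport: newTransport(&tls.Config{InsecureSkipVerify: true})}
        _ = client // used like any other client, e.g. client.Get("https://10.244.3.10:9102/metrics")
        fmt.Println("this transport will only offer http/1.1 via ALPN")
    }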

I have tried setting GODEBUG=x509ignoreCN=0 on the pod to see whether Go’s changed handling of certificate Common Names was causing the issue. It didn’t help.

I’ve verified that v2.20.1 definitely works and that none of the versions above it do; I’ve tried them all.

I’ve created a separate container with both curl and openssl in it and mounted the certificates there, just to make sure it wasn’t a weird mounting problem. Both curl and openssl work.

curl https://10.244.3.10:9102/metrics --cacert /etc/istio-certs/root-cert.pem --cert /etc/istio-certs/cert-chain.pem --key /etc/istio-certs/key.pem --insecure

openssl s_client -connect 10.244.3.10:9102 -cert /etc/istio-certs/cert-chain.pem  -key /etc/istio-certs/key.pem -CAfile /etc/istio-certs/root-cert.pem -alpn "istio"

I noticed openssl doesn’t work unless you set that -alpn flag. I saw #6910 and thought it might be related, but I’m unsure. The fix for that one says it’ll be out in 2.19.0, but that hasn’t been released yet.
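
To look at the ALPN angle from Go as well, the handshake can be probed with a small diagnostic program. This is only a sketch that reuses the certificate paths and target address from this report; it is not something Prometheus itself runs:

    package main

    import (
        "crypto/tls"
        "fmt"
        "log"
    )

    func main() {
        // Client certificate and key, mounted the same way as in the scrape config above.
        cert, err := tls.LoadX509KeyPair("/etc/istio-certs/cert-chain.pem", "/etc/istio-certs/key.pem")
        if err != nil {
            log.Fatal(err)
        }
        conn, err := tls.Dial("tcp", "10.244.3.10:9102", &tls.Config{
            Certificates:       []tls.Certificate{cert},
            InsecureSkipVerify: true, // mirrors --insecure / insecure_skip_verify above
            // Protocols offered via ALPN; drop entries here to see which ones the
            // sidecar is willing to accept.
            NextProtos: []string{"istio", "h2", "http/1.1"},
        })
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        fmt.Println("negotiated ALPN protocol:", conn.ConnectionState().NegotiatedProtocol)
    }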

Relevant curl output:

root@sleep-5f98748557-s4wh5:/# curl https://10.244.3.10:9102/metrics --cacert /etc/istio-certs/root-cert.pem --cert /etc/istio-certs/cert-chain.pem --key /etc/istio-certs/key.pem --insecure -v
*   Trying 10.244.3.10:9102...
* TCP_NODELAY set
* Connected to 10.244.3.10 (10.244.3.10) port 9102 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/istio-certs/root-cert.pem
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: [NONE]
*  start date: Jul  7 20:21:33 2021 GMT
*  expire date: Jul  8 20:21:33 2021 GMT
*  issuer: O=cluster.local
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x564d80d81e10)
> GET /metrics HTTP/2
> Host: 10.244.3.10:9102
> user-agent: curl/7.68.0
> accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 2147483647)!
< HTTP/2 200


Most upvoted comments

We are releasing Prometheus 2.31.0-rc.0 today, which will fix the issues with Istio.

This will work in Prometheus 2.31 without magic.

It was fun! I didn’t know anything about Istio until now.

Ahhhhhhhhh 🤦🏻‍♂️ I had changed the namespace in every single config except the one that worked.

Get "https://10.1.67.80:8080/metrics": read tcp 10.1.67.91:41460->10.1.67.80:8080: read: connection reset by peer

That’s a relief, glad it’s working for you now. For the time being, I guess you could use that custom container, it should be stable enough… maybe? You may want to make the same change on the 2.28.1 release tag just to be safe, although I don’t think anything big has been merged since.

@roidelapluie and I have been talking about ways to re-enable HTTP/2 and hopefully it will be out in the next release.

That would be great, thanks!
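
For context on the comment above: a Go http.Transport that is given a custom TLSClientConfig keeps HTTP/2 off unless it is explicitly requested. A minimal sketch of opting back in with the standard library (just an illustration of the mechanism, not necessarily the actual Prometheus change) looks like this:

    // A minimal sketch of the opposite direction: opting a transport with a custom
    // TLS config back into HTTP/2 using only the standard library. This illustrates
    // the Go mechanism, not necessarily what Prometheus 2.31 does.
    package main

    import (
        "crypto/tls"
        "fmt"
        "net/http"
    )

    func main() {
        t := &http.Transport{
            TLSClientConfig:   &tls.Config{InsecureSkipVerify: true}, // stand-in for the Istio certificate config
            ForceAttemptHTTP2: true,                                  // offer "h2" via ALPN despite the custom TLS config
        }
        client := &http.Client{Transport: t}
        _ = client // scrape requests made with this client can negotiate HTTP/2 again
        fmt.Println("h2 re-enabled on this transport")
    }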

I did just write up this Dockerfile for building on Go 1.14, which does work, so it might be worth a shot:

FROM golang:1.14

WORKDIR /go/src/prometheus
COPY . .

RUN go get -v ./...
RUN go install -v ./...

ENTRYPOINT ["./prometheus"]
CMD [ "--config.file=/etc/prometheus/prometheus.yml", \
      "--storage.tsdb.path=/prometheus", \
      "--web.console.libraries=/usr/share/prometheus/console_libraries", \
      "--web.console.templates=/usr/share/prometheus/consoles" ]