thanos: query: error DeadlineExceeded missing error details

Thanos, Prometheus and Golang version used: Docker images. Thanos: 0.23.0 and newer; go: go1.16.8

What happened: When the query component fails to connect to a store/sidecar, it no longer reports certificate issues as it did before. This makes debugging difficult.

level=warn ts=2021-12-03T16:11:21.556102388Z caller=endpointset.go:500 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901

What you expected to happen: The error should include more information when the certificate is invalid, as it did in Thanos v0.22.0.

level=warn ts=2021-12-03T16:30:14.55260649Z caller=endpointset.go:512 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901 after err: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local\": rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901

How to reproduce it (as minimally and precisely as possible):

  • Create an invalid certificate for the store/sidecar (for example, one issued for a different hostname, as in the log above)
  • Run Thanos 0.23.0 or newer with TLS enabled on the store/sidecar

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 10
  • Comments: 20 (8 by maintainers)

Most upvoted comments

Still relevant! Problem persists in version 0.28.0.

Still relevant, and rather annoying when trying to decipher why two Thanos components can’t talk.

Hello! I spent some time trying to put together a PR with a fix, only to find out that I was a bit off the mark.

While I still think the problem is between grpc-go versions v1.39.0 and v1.40.0, upon closer inspection the changes to internal/status/status.go look fine.

I did some more digging and found out I’m not proficient enough in all of the changes to make a definitive call, so I filed https://github.com/grpc/grpc-go/issues/5342 to have the experts look at it. I did find some solid anchors to what I think needs to be bubbled up, and I’ve dropped those hints onto the upstream bug report.

Sorry I wasn’t more helpful, and hopefully we get a fix soon!

Just wanted to add this here: I ran into this same issue during an upgrade.

level=warn ts=2022-06-09T03:31:27.679924722Z caller=endpointset.go:517 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from metrics.mycluster.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=metrics.mycluster.com:443

I’m not sure if others are experiencing this using the bitnami helm chart, but that is how I installed Thanos. The issue was that the querier parameter --grpc-client-server-name was missing after upgrading from the 9.x to 10.x version of the chart. At some point they changed the values structure from query.grpc.client.servername to query.grpc.client.serverName (capitalized N).

I faced the same issue today: I had forgotten to add the store port, and I could only see the errors after rolling back to version 0.22.
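
A missing port is the kind of mistake that can be caught before the address ever reaches the query component. As a rough sketch (checkStoreAddr is a hypothetical helper, and it assumes plain host:port addresses, so prefixed forms such as DNS service discovery entries are out of scope), the standard library is enough:

```go
package main

import (
	"fmt"
	"net"
)

// checkStoreAddr (hypothetical helper) returns an error unless addr has the
// plain form host:port with both parts non-empty.
func checkStoreAddr(addr string) error {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return err // e.g. the port is missing entirely
	}
	// SplitHostPort accepts "host:" and ":443", so check both parts explicitly.
	if host == "" || port == "" {
		return fmt.Errorf("address %q: host and port must both be set", addr)
	}
	return nil
}

func main() {
	// Addresses taken from the log above, once with and once without a port.
	for _, addr := range []string{"metrics.mycluster.com:443", "metrics.mycluster.com"} {
		fmt.Printf("%s: %v\n", addr, checkStoreAddr(addr))
	}
}
```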