thanos: query: error DeadlineExceeded missing error details
Thanos, Prometheus and Golang version used: Docker images. Thanos: 0.23.0 and newer; Go: go1.16.8
What happened: When the query component fails to connect to a store/sidecar, it no longer displays certificate issues as it did before. This makes debugging difficult.
level=warn ts=2021-12-03T16:11:21.556102388Z caller=endpointset.go:500 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901
What you expected to happen: The error should display more information if the certificate is invalid, as it does in Thanos v0.22.0:
level=warn ts=2021-12-03T16:30:14.55260649Z caller=endpointset.go:512 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901 after err: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local\": rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901
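The two log lines above differ only in whether the err field still carries the handshake cause. As a rough aid for scanning collected querier logs, here is a minimal Python sketch that parses a logfmt-style line and reports whether that detail is present. The helper names (parse_logfmt, has_tls_detail) and the marker strings are assumptions for illustration, the markers being taken from the v0.22.0 line; none of this is part of Thanos.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Parse a logfmt-style line (key=value, values optionally double-quoted)."""
    fields = {}
    # shlex handles double-quoted values and backslash-escaped quotes inside them.
    # Limitation: a stray apostrophe in a value would be read as an opening quote.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Assumed markers of a TLS handshake cause, taken from the v0.22.0 log line.
TLS_MARKERS = ("authentication handshake failed", "x509:")


def has_tls_detail(fields: dict) -> bool:
    """Return True if the err field mentions a TLS handshake or x509 cause."""
    err = fields.get("err", "")
    return any(marker in err for marker in TLS_MARKERS)
```

Applied to the lines above, the v0.22.0 entry would be flagged as carrying TLS detail, while the 0.23.0 entry, which only says "context deadline exceeded", would not.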
How to reproduce it (as minimally and precisely as possible):
- Create an invalid Cert for the Store/Sidecar
- Run Thanos 0.23.0 or newer with TLS enabled at the Store/Sidecar
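While reproducing this, the handshake error that the querier no longer prints can be surfaced independently of Thanos by attempting the TLS handshake directly against the store/sidecar gRPC port. The following is a minimal sketch using only the Python standard library. The function name probe_tls and its parameters are hypothetical, and it assumes server-side TLS only: if the endpoint requires a client certificate, the context would additionally need one loaded.

```python
import socket
import ssl
from typing import Optional


def probe_tls(host: str, port: int, server_name: str,
              cafile: Optional[str] = None, timeout: float = 5.0) -> Optional[str]:
    """Attempt a TLS handshake; return None on success, else the error text."""
    # Verifies the certificate chain and the hostname against server_name.
    context = ssl.create_default_context(cafile=cafile)
    # gRPC runs over HTTP/2, so offer "h2" via ALPN as a gRPC client would.
    context.set_alpn_protocols(["h2"])
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=server_name):
                return None
    except OSError as exc:
        # ssl.SSLError, timeouts and refused connections all derive from OSError.
        return f"{type(exc).__name__}: {exc}"
```

Calling it with the address from the log line and the server name the querier is configured to expect should return the certificate verification error as text when the certificate does not cover that name.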
About this issue
- State: closed
- Created 3 years ago
- Reactions: 10
- Comments: 20 (8 by maintainers)
Still relevant! Problem persists in version 0.28.0.
Still relevant, and rather annoying when trying to decipher why two Thanos components can’t talk.
Hello! So I spent some time trying to PR a fix, only to find out that I was a bit off the mark.
While I still think the problem is between grpc-go versions v1.39.0 and v1.40.0, upon closer inspection the changes to internal/status/status.go look fine.
I did some more digging and found out I’m not proficient enough in all of the changes to make a definitive call, so I filed https://github.com/grpc/grpc-go/issues/5342 to have the experts look at it. I did find some solid anchors to what I think needs to be bubbled up, and I’ve dropped those hints onto the upstream bug report.
Sorry I wasn’t more helpful, and hopefully we get a fix soon!
Just wanted to add this here, I ran into this same issue during an upgrade:
I’m not sure if others are experiencing this using the bitnami helm chart, but that is how I installed Thanos. The issue was that the querier parameter --grpc-client-server-name was missing after upgrading from the 9.x to the 10.x version of the chart. At some point they changed the values structure from query.grpc.client.servername to query.grpc.client.serverName (capitalized N).
I faced the same issue today: I had forgotten to add the store port, and I could only see the errors after rolling back to version 0.22.
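For the chart value rename described in the comment above (query.grpc.client.servername to query.grpc.client.serverName), a key that differs only by letter case is easy to miss in review. Below is a small Python sketch that flags such keys in a values structure before an upgrade. The function name find_case_mismatches is hypothetical, the hostname is a placeholder, and the list of expected paths has to be supplied by hand; only the one key named in the comment is taken from this thread.

```python
def find_case_mismatches(values: dict, expected_paths: list) -> list:
    """Return (found_path, expected_path) pairs that differ only by letter case."""
    def flatten(node, prefix=""):
        # Yield the dotted path of every leaf value in the nested dict.
        for key, child in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            if isinstance(child, dict):
                yield from flatten(child, path)
            else:
                yield path

    expected_by_lower = {p.lower(): p for p in expected_paths}
    mismatches = []
    for path in flatten(values):
        expected = expected_by_lower.get(path.lower())
        if expected is not None and expected != path:
            mismatches.append((path, expected))
    return mismatches


# Values written for the older chart layout (hostname is a placeholder).
old_values = {"query": {"grpc": {"client": {"servername": "sidecar.example.internal"}}}}
print(find_case_mismatches(old_values, ["query.grpc.client.serverName"]))
# prints [('query.grpc.client.servername', 'query.grpc.client.serverName')]
```

A check like this only catches case differences for the paths it is given; it does not know the chart's actual schema.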