thanos: query: error DeadlineExceeded missing error details
Thanos, Prometheus and Golang version used: Docker images with Thanos 0.23.0 and newer; Go: go1.16.8
What happened: When Thanos Query fails to connect to a store/sidecar, it no longer displays certificate issues as it did before. This makes debugging difficult.
level=warn ts=2021-12-03T16:11:21.556102388Z caller=endpointset.go:500 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901
What you expected to happen: The error should include more information when the certificate is invalid, as it did in Thanos v0.22.0:
level=warn ts=2021-12-03T16:30:14.55260649Z caller=endpointset.go:512 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901 after err: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local\": rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901
How to reproduce it (as minimally and precisely as possible):
- Create an invalid Cert for the Store/Sidecar
- Run Thanos 0.23.0 or newer with TLS enabled at the Store/Sidecar
About this issue
- State: closed
- Created 3 years ago
- Reactions: 10
- Comments: 20 (8 by maintainers)
Still relevant! Problem persists in version 0.28.0.
Still relevant, and rather annoying when trying to decipher why two Thanos components can’t talk.
Hello! I spent some time trying to put together a PR with a fix, only to find out that I was a bit off the mark.
While I still think the problem lies between grpc-go versions v1.39.0 and v1.40.0, upon closer inspection the changes to internal/status/status.go look fine.
I did some more digging and found that I'm not proficient enough in all of the changes to make a definitive call, so I filed https://github.com/grpc/grpc-go/issues/5342 to have the experts look at it. I did find some solid anchors for what I think needs to be bubbled up, and I've dropped those hints into the upstream bug report.
Sorry I wasn’t more helpful, and hopefully we get a fix soon!
Just wanted to add this here, I ran into this same issue during an upgrade:
I'm not sure if others are experiencing this with the bitnami helm chart, but that is how I installed Thanos. The issue was that the querier parameter --grpc-client-server-name was missing after upgrading from the 9.x to the 10.x version of the chart. At some point they changed the values structure from query.grpc.client.servername to query.grpc.client.serverName (capitalized N).
I faced the same issue today: I had forgotten to add the store port, and I could only see the errors after rolling back to version 0.22.