thanos: query: with grafana gives inconsistent dashboard with every refresh
Thanos, Prometheus and Golang version used:
- Thanos: 11.1.0, deployed through the Bitnami Helm chart (image quay.io/thanos/thanos:v0.25.2)
- Prometheus: 0.57.0, deployed through the kube-prometheus-stack Helm chart
- Grafana: v9.0.1, also through kube-prometheus-stack
Object Storage Provider:
What happened: When using the Thanos Querier as the data source in Grafana, the dashboard data becomes inconsistent on every refresh (see the attached GIF for an example). This does not happen if Grafana points directly to the Thanos sidecar. The Store Gateway is not in use.
It may well be a configuration issue, but we have no idea what could cause it.
What you expected to happen: Consistent data between refreshes
How to reproduce it (as minimally and precisely as possible): We haven’t tried to reproduce this in a minimal setup, but it happens on all of our environments (10+), all running the same configuration. I can supply the complete values.yaml privately if needed, but it boils down to:
```yaml
thanos:
  storegateway:
    enabled: false
  query:
    enabled: true
    replicaLabel:
      - prometheus_replica
    dnsDiscovery:
      sidecarsService: 'prometheus-stack-kube-prom-thanos-discovery'
      sidecarsNamespace: 'prometheus'
kube-prometheus-stack:
  grafana:
    sidecar:
      datasources:
        url: http://prometheus-stack-thanos-query:9090/
        initDatasources: true
      dashboards:
        searchNamespace: ALL
        labelValue: null # Needs to be null in order to load our dashboards
  prometheus:
    replicas: 3
    thanosService:
      enabled: true
    thanosServiceMonitor:
      enabled: true
    service:
      sessionAffinity: 'ClientIP'
    prometheusSpec:
      thanos:
        objectStorageConfig:
          key: config-file.yaml
          name: thanos-secret
```
Charts:
```yaml
- name: kube-prometheus-stack
  version: 36.6.2
  repository: https://prometheus-community.github.io/helm-charts
- name: thanos
  version: 10.5.5
  repository: https://charts.bitnami.com/bitnami
```
Full logs to relevant components:
- Grafana: no logging occurs
- Query: no logging occurs
Anything else we need to know:
Environment: K8S on AKS. First time deploying Thanos.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 8
- Comments: 23 (11 by maintainers)
Commits related to this issue
- api: fix race between Respond() and query/queryRange Fix a data race between Respond() and query/queryRange functions by returning an extra optional function from instrumented functions that releases... — committed to GiedriusS/thanos by GiedriusS 2 years ago
- api: fix race between Respond() and query/queryRange (#5583) * api: fix race between Respond() and query/queryRange Fix a data race between Respond() and query/queryRange functions by returning a... — committed to thanos-io/thanos by GiedriusS 2 years ago
- api: fix race between Respond() and query/queryRange Fix a data race between Respond() and query/queryRange functions by returning an extra optional function from instrumented functions that releases... — committed to vinted/thanos by GiedriusS 2 years ago
Can’t reproduce this anymore with https://github.com/thanos-io/thanos/commit/d00a713a9bd66e650418bde6bd80ac7f2dc67428, will close this issue in a few days if nothing comes up.
We saw the same behavior after upgrading from 0.25.2 to 0.27.0. Downgrading just the Querier to 0.26.0 resolved the issue.
In our case, Grafana was receiving out-of-order samples and illogical results (values in the billions when they should be in the hundreds).
We’ve tested 0.24.0, 0.25.0, and 0.26.0; none of them exhibit the issue. It only starts occurring with 0.27.0.
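As a stopgap, the affected component can be pinned to a known-good version. Since only the query component of the Bitnami chart is enabled in the values above, overriding the chart's image tag effectively downgrades just the Querier (a sketch; the `image.*` keys follow the Bitnami chart's usual image-override convention and should be checked against the chart version in use):

```yaml
thanos:
  image:
    registry: quay.io
    repository: thanos/thanos
    tag: v0.26.0   # last version without the symptom, per the reports above
```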
We’re seeing the same thing after upgrading Thanos from 0.24.0 to 0.27.0, with Grafana 9.0.3 and Prometheus 2.32.1.
Still catching up, but upgrading to https://github.com/thanos-io/thanos/releases/tag/v0.28.0 would be the fastest way forward to get unblocked. I’ll discuss whether someone can make a patch release.
Was able to reproduce this myself after updating. Reverting https://github.com/thanos-io/thanos/pull/5410 doesn’t help, so it must be something more serious. I’ll try to reproduce it locally and fix it; it must be some kind of race condition.
Hi, what we seem to be seeing is that thanos-query returns data belonging to other panels/queries on the same dashboard within a single query result. We can usually find another panel on the dashboard where the wrong graph does make sense, and on refresh the graph usually moves back to the correct panel.
If we inspect the query response, we can see that it contains data that makes no sense for the given query. For example, we have a query that should only return 1 or 0, yet among the 1s and 0s there are time series with values like 1945, which should not be possible. If we execute the query standalone from the Thanos Query UI, the correct result is returned, but when it is executed from a Grafana dashboard, a mixed result comes back.
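One way to pin down this symptom is to inspect the raw JSON from the querier's Prometheus-compatible `/api/v1/query` endpoint and flag any series whose value falls outside the expected set. A small Python sketch against a hypothetical response body (the labels and the "1945" sample are illustrative, mimicking the bogus series described above):

```python
import json

# Hypothetical response body in the shape returned by the
# Prometheus-compatible /api/v1/query endpoint.
resp = json.loads("""
{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"job": "probe", "instance": "a"}, "value": [1660000000, "1"]},
  {"metric": {"job": "probe", "instance": "b"}, "value": [1660000000, "0"]},
  {"metric": {"job": "probe", "instance": "c"}, "value": [1660000000, "1945"]}
]}}
""")

def suspicious(result, allowed=(0.0, 1.0)):
    """Return the series whose value falls outside the expected set."""
    return [r for r in result if float(r["value"][1]) not in allowed]

bad = suspicious(resp["data"]["result"])
print([r["metric"]["instance"] for r in bad])  # → ['c']
```

Running the same check against the sidecar's response for the same query should come back empty, which, consistent with the report above, localizes the corruption to the querier rather than to Prometheus or Grafana.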