thanos: Deduplication returning deduped and non-deduped results in 0.31.0+
Thanos, Prometheus and Golang version used:
Thanos Version: 0.31.0
Prometheus Version: 2.42.0
Object Storage Provider:
S3
What happened:
Slack Thread: https://cloud-native.slack.com/archives/CK5RSSC10/p1680534545839619
We recently upgraded Thanos from v0.28.1 to v0.31.0… and we’re seeing some odd behavior now with portions of our queries. Basically queries that pull data from the last few hours are all correct. However, longer queries will return duplicate data as long as “use deduplication” is checked (which has been the default setting forever). The really weird part is if we uncheck that box, we suddenly get the right data.
We have the following CLI flags configured on our thanos-query pods:
- --query.replica-label=replica
- --query.replica-label=prometheus_replica
- --query.replica-label=cluster
- --query.replica-label=__replica__
- --query.replica-label=tenant_id
- --query.replica-label=receive
- --query.replica-label=prometheus
- --query.replica-label=thanos_ruler_replica
- --query.auto-downsampling
- --query.max-concurrent-select=25
- --store.response-timeout=1m
- --query.max-concurrent=500
- --query.metadata.default-time-range=6h
What you expected to happen:
I expect to get the correct results, as we did in the 0.28.0, 0.29.0, 0.30.0, and 0.30.2 releases.
Anything else we need to know:
After discovering the issue, we rolled back to 0.28.0 and then upgraded one release at a time… the problem only occurs when we upgrade to the 0.31.0 release.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 10
- Comments: 37 (25 by maintainers)
Commits related to this issue
- e2e(query): Reproduce dedup issue from #6257 Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> — committed to douglascamata/thanos by douglascamata a year ago
- Merge pull request #6377 from douglascamata/dedup-error-e2e-test e2e(query): Reproduce dedup issue from #6257 — committed to thanos-io/thanos by fpetkovski a year ago
- apps/datalake-metrics: revert to 0.30.2 due to https://github.com/thanos-io/thanos/issues/6257 — committed to thaum-xyz/ankhmorpork by paulfantom a year ago
- e2e(query): Reproduce dedup issue from #6257 Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> — committed to databricks/thanos by douglascamata a year ago
- components/query: add a paragraph about filter Document the why/what from https://github.com/thanos-io/thanos/issues/6257 in the Querier documentation. Signed-off-by: Giedrius Statkevičius <giedrius... — committed to thanos-io/thanos by GiedriusS a year ago
- components/query: add a paragraph about filter (#6607) Document the why/what from https://github.com/thanos-io/thanos/issues/6257 in the Querier documentation. Signed-off-by: Giedrius Statkeviči... — committed to thanos-io/thanos by GiedriusS a year ago
- components/query: add a paragraph about filter (#6607) Document the why/what from https://github.com/thanos-io/thanos/issues/6257 in the Querier documentation. Signed-off-by: Giedrius Statkeviči... — committed to harsh-ps-2003/thanos by GiedriusS a year ago
- Upgrade to thanos 0.32.1 (#81) * Updates busybox SHA (#6283) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: fpetkovski <fpetkovski@users.noreply.github.com> * Upgrade prometheus to... — committed to stolostron/thanos by subbarao-meduri 10 months ago
- [release-2.9] Upgrade to thanos 0.32.3 (#84) * fix 0.31 changelog (#6278) Signed-off-by: junot <junotxiang@kubesphere.io> * Query: Switch Multiple Engines (#6234) * Query: Switch engines usi... — committed to stolostron/thanos by subbarao-meduri 9 months ago
Got some time to think about this and fix it so hopefully this will get done soon.
Can confirm fix is working in our setup. Thank you @fpetkovski !
Thanks for reporting @diranged and reproducing @douglascamata, I will take a look later this week too. Sounds like we missed some edge cases, sorry for that.
To recap: dedup used to work by saving all storepb.SeriesResponse into a slice and then re-sorting everything so that series would be sorted by replica labels first. In 0.31.0 we added withoutReplicaLabels as per the proposal: https://thanos.io/tip/proposals-accepted/20221129-avoid-global-sort.md/. The idea was to remove replica labels before getting the response so that we could use k-way merge to merge everything in order. However, we missed that a StoreAPI might start sending unsorted responses with this functionality. Consider the following:
a=1,b=1 a=1,b=2 a=2,b=1 a=2,b=2
If the replica label is a, then the response becomes:
b=1 b=2 b=1 b=2
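To make that concrete, here is a minimal Go sketch (not Thanos code; it only borrows the Prometheus model/labels package for comparisons) showing that the original stream is sorted, while the same stream with the replica label a removed is not:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

// isSorted reports whether consecutive label sets are in non-decreasing order.
func isSorted(sets []labels.Labels) bool {
	for i := 1; i < len(sets); i++ {
		if labels.Compare(sets[i-1], sets[i]) > 0 {
			return false
		}
	}
	return true
}

func main() {
	// Series as the StoreAPI stores them, sorted by the full label set.
	full := []labels.Labels{
		labels.FromStrings("a", "1", "b", "1"),
		labels.FromStrings("a", "1", "b", "2"),
		labels.FromStrings("a", "2", "b", "1"),
		labels.FromStrings("a", "2", "b", "2"),
	}

	// The same stream after the replica label "a" has been removed,
	// as requested via withoutReplicaLabels: b=1, b=2, b=1, b=2.
	stripped := []labels.Labels{
		labels.FromStrings("b", "1"),
		labels.FromStrings("b", "2"),
		labels.FromStrings("b", "1"),
		labels.FromStrings("b", "2"),
	}

	fmt.Println("full stream sorted:    ", isSorted(full))     // true
	fmt.Println("stripped stream sorted:", isSorted(stripped)) // false
}
```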
Because storage.SeriesSet is an iterator over a sorted series set, it means that to have lazy, in-order evaluation of the iterator we'd need to buffer everything that comes between the two b=1 series in the response from a StoreAPI.
If a is an external label then the situation changes a bit. This means that the response would look like:
a=1,b=1 a=1,b=2 a=1,b=1 a=1,b=2
Since external labels always have the same value, it means that such a response is not possible and it would look something like this instead:
a=1,b=1,c=1 a=1,b=2,c=1 a=1,b=1 a=1,b=2
Deduplication on external labels makes it so that even with the removal of the external label, the response is sorted and we can use k-way merge directly.
This is the difference between deduplication on internal labels and external labels. Due to this, it seems like we would inevitably have to buffer everything to provide a general deduplication algorithm. But, we should avoid buffering as much as possible. The small ad-hoc benchmark in the original proposal shows a huge performance improvement.
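For illustration, here is a minimal k-way merge over sorted streams of label sets written with container/heap. This is only a sketch of the technique the proposal relies on, not the actual ProxyResponseHeap/dedup code in Thanos, and it yields series in globally sorted order only when every input stream is itself sorted, which is exactly the property that breaks once an internal replica label is stripped:

```go
package main

import (
	"container/heap"
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

// stream is one already-sorted sequence of label sets, e.g. one StoreAPI response.
type stream struct {
	series []labels.Labels
	pos    int
}

// mergeHeap orders streams by their current head label set.
type mergeHeap []*stream

func (h mergeHeap) Len() int      { return len(h) }
func (h mergeHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h mergeHeap) Less(i, j int) bool {
	return labels.Compare(h[i].series[h[i].pos], h[j].series[h[j].pos]) < 0
}
func (h *mergeHeap) Push(x any) { *h = append(*h, x.(*stream)) }
func (h *mergeHeap) Pop() any {
	old := *h
	s := old[len(old)-1]
	*h = old[:len(old)-1]
	return s
}

// kWayMerge lazily merges the streams; correctness depends on each stream being sorted.
func kWayMerge(streams []*stream) []labels.Labels {
	h := &mergeHeap{}
	for _, s := range streams {
		if len(s.series) > 0 {
			heap.Push(h, s)
		}
	}
	var out []labels.Labels
	for h.Len() > 0 {
		s := heap.Pop(h).(*stream)
		out = append(out, s.series[s.pos])
		s.pos++
		if s.pos < len(s.series) {
			heap.Push(h, s)
		}
	}
	return out
}

func main() {
	// Two sorted responses whose series only differed by an already-removed external label.
	s1 := &stream{series: []labels.Labels{labels.FromStrings("b", "1"), labels.FromStrings("b", "2")}}
	s2 := &stream{series: []labels.Labels{labels.FromStrings("b", "1"), labels.FromStrings("b", "2")}}
	for _, l := range kWayMerge([]*stream{s1, s2}) {
		// Prints b=1, b=1, b=2, b=2: duplicates come out adjacent, so a streaming dedup can collapse them.
		fmt.Println(l)
	}
}
```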
Thus, my suggestion would be to:
I also suggest removing this https://github.com/thanos-io/thanos/issues/6257#issuecomment-1514160495 hack because a StoreAPI must always send a sorted series set. If it doesn’t then that’s a bug in the StoreAPI implementation.
I filed https://github.com/thanos-io/thanos/issues/6563 and https://github.com/prometheus/prometheus/issues/12605 for these bugs. But these are other problems that have impacted deduplication since the beginning of Thanos, I think. So, let's fix the original problem first.
@GiedriusS yesterday I was able to write some e2e tests that show the bug. I'm adding one more scenario to it and will get a PR opened. 👍
I believe this comes from the stores not re-ordering the series after adding the external labels. You can replicate it by adding the metrics:
And an external label ext="label". Before the external label, metric{} < metric{alocal="label"} because it has fewer labels. However, after adding it, both sidecar and ruler returned this when I tested them:
and since alocal < ext, the order is incorrect. If we have two stores returning the same duplicated data, ProxyResponseHeap will yield the following:
Making dedupResponseHeap ignore the duplication. I tried to work out a solution in the ProxyResponseHeap Less function, but I think this should probably be re-sorted in the stores instead. The same incorrect order is present in the Prometheus remote read response to the sidecar.
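For illustration, a small sketch of the ordering flip described above, using only the Prometheus model/labels package (not the sidecar or ruler code):

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// Without the external label: metric{} sorts before metric{alocal="label"}
	// because it runs out of labels first.
	bare := labels.FromStrings("__name__", "metric")
	withLocal := labels.FromStrings("__name__", "metric", "alocal", "label")
	fmt.Println(labels.Compare(bare, withLocal)) // negative: metric{} comes first

	// After attaching the external label ext="label" to both series,
	// the comparison reaches "alocal" vs "ext" and, since alocal < ext,
	// the relative order of the two series flips.
	bareExt := labels.FromStrings("__name__", "metric", "ext", "label")
	withLocalExt := labels.FromStrings("__name__", "metric", "alocal", "label", "ext", "label")
	fmt.Println(labels.Compare(bareExt, withLocalExt)) // positive: metric{alocal="label"} now comes first
}
```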
Can we then adapt https://github.com/thanos-io/thanos/pull/6317 to resort in stores instead of doing it in the querier?
I think you are right, this could be one way the issue happens. We should probably exclude external labels from the comparison because they interfere with the sorting order.
We are seeing this same issue without receive in the picture.
We’ve since rolled back but I can try and reproduce again in a dev environment next week.
We just have Prometheus + sidecar, store, and ruler. All behind the query servers.
@zhangrj do you have any replica labels that are stored in TSDB and are not external labels?