thanos: querier: Rate over deduplicated counter from many replicas can lead to double reset account.

Found by GitLab; we were investigating offline with @SuperQ.

Their issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9293

This is only reproducible with large rate windows ([30m] and above), which suggests it has to do with chunk ordering or overlaps.
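To illustrate why chunk ordering matters for a counter, here is a minimal Go sketch. It is not Thanos or Prometheus code: the `sample` type, the `increase` helper, and the data are hypothetical. It only mirrors the reset handling that PromQL's rate()/increase() applies (a drop in value is treated as a counter reset, and the previous value is added back) and leaves out extrapolation. The "merged" input assumes a faulty merge that lets one lower value from a lagging replica slip in; the actual root cause in the querier may differ.

```go
package main

import "fmt"

// sample is a hypothetical (timestamp, value) pair, not a Thanos type.
type sample struct {
	t int64   // timestamp in seconds
	v float64 // counter value
}

// increase returns the counter increase over the samples, treating every
// drop in value as a counter reset (as PromQL does) and compensating for it
// by adding the value seen just before the drop. No extrapolation is done.
func increase(s []sample) float64 {
	if len(s) < 2 {
		return 0
	}
	var correction float64
	prev := s[0].v
	for _, cur := range s[1:] {
		if cur.v < prev {
			// Value went down: assume the counter was reset.
			correction += prev
		}
		prev = cur.v
	}
	return s[len(s)-1].v - s[0].v + correction
}

func main() {
	// A single replica's view: the counter grows by 10 per scrape.
	clean := []sample{{0, 100}, {15, 110}, {30, 120}, {45, 130}}

	// Assumed faulty merge: a lower value (105) from a lagging replica
	// lands after 120, so it looks like a counter reset.
	merged := []sample{{0, 100}, {15, 110}, {30, 120}, {31, 105}, {45, 130}}

	fmt.Println(increase(clean))  // 30
	fmt.Println(increase(merged)) // 150
}
```

In this sketch a single misplaced sample inflates the increase from 30 to 150, because the value before the apparent reset (120) is added on top of the real growth. A longer window covers more chunk boundaries and replica switches, which would fit the observation that only large ranges reproduce the problem.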

About this issue

State: closed
Created 4 years ago
Comments: 22 (22 by maintainers)

Commits related to this issue

Added DebugLocalStore and repro test for querier counter reset bug. Reproduces: https://github.com/thanos-io/thanos/issues/2401 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 4 years ago
Added LocalStore and realistic data for querier counter reset bug. Tries to reproduces: https://github.com/thanos-io/thanos/issues/2401 I would still merge as it is a great test, and allows us to qu... — committed to thanos-io/thanos by bwplotka 4 years ago
Fixed and added more regressions tests for reset counter dedup bug. Fixes https://github.com/thanos-io/thanos/issues/2401 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 4 years ago
querier: Fixed and added more regressions tests for counter missed bug. Fixes https://github.com/thanos-io/thanos/issues/2401 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 4 years ago
Added LocalStore and realistic data for querier counter reset bug. (#2522) * Added LocalStore and realistic data for querier counter reset bug. Tries to reproduces: https://github.com/thanos-io/th... — committed to thanos-io/thanos by bwplotka 4 years ago
Added LocalStore and realistic data for querier counter reset bug. (#2522) (#2538) * Added LocalStore and realistic data for querier counter reset bug. Tries to reproduces: https://github.com/than... — committed to thanos-io/thanos by bwplotka 4 years ago
querier: Added regressions tests for counter missed bug. PR with just tests, not fix yet. Reproduces: https://github.com/thanos-io/thanos/issues/2401 * Added regressions tests for CounterSeriesIter... — committed to thanos-io/thanos by bwplotka 4 years ago

Most upvoted comments

It actually saddens me that Prometheus “by design” doesn’t really cope with scrape intervals >2m. I’d love to see future Prometheus versions lift that arbitrary limit, and I’d therefore prefer that Thanos not bake that limit into its own design, too.

Interestingly, I’d also love to see a future Prometheus version have first-class support for metric types. That would then also solve your problem of how to safely recognize a counter.

beorn7 on May 5, 2020

@SuperQ this repro is amazing. I can explore all the details. We definitely have overlapping and unsorted chunks. We should be able to find the problem in our algorithm soon, thanks!

BTW… I kind of overengineered (as you can imagine) and wrote thanos tools storeapi serve --json=<file x>, which can serve JSON (protobuf-based) as a Store API 🎉

So I can take your file (actually anything generated by grpcurl), put it into thanos tools storeapi serve --json, run the querier with storeapi serve connected as --store, and see your results:

bwplotka on Apr 24, 2020

(downside: What if scrape interval changes)

That’s not common, but you could depend on no one having a scrape interval over 2 minutes, as that’s not sane for other reasons.

brian-brazil on Apr 30, 2020