thanos: querier: Rate over deduplicated counter from many replicas can lead to double reset accounting.
Found by GitLab; we were investigating this offline with @SuperQ.
Their issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9293
This is reproducible only with large rate ranges [30m+],
which suggests it has to do with chunk ordering or overlaps.
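
To make the suspected failure mode concrete, here is a minimal sketch in Python. It is not the Thanos deduplication code. The sample values and the helper names (`increase_with_resets`, `merge_by_timestamp`) are made up for illustration, and it assumes both replicas scrape at identical timestamps, which real replicas do not. It only shows how overlapping chunks that are read back to back, instead of being merged by time, can look like a counter reset to a rate-style calculation:

```python
def increase_with_resets(values):
    """Total increase over counter samples, treating any drop as a reset to zero."""
    if len(values) < 2:
        return 0.0
    total = 0.0
    prev = values[0]
    for v in values[1:]:
        if v < prev:
            # The counter appears to have restarted, so count the whole new value.
            total += v
        else:
            total += v - prev
        prev = v
    return total


def merge_by_timestamp(*chunks):
    """Merge (timestamp, value) chunks in time order; the first chunk wins on ties."""
    merged = {}
    for chunk in chunks:
        for ts, value in chunk:
            merged.setdefault(ts, value)
    return [merged[ts] for ts in sorted(merged)]


# Hypothetical overlapping chunks from two replicas scraping the same counter.
replica_a = [(0, 100.0), (15, 110.0), (30, 120.0), (45, 130.0)]
replica_b = [(30, 119.0), (45, 129.0), (60, 140.0)]

# Chunks read one after another without merging: 130 -> 119 looks like a reset.
naive = [v for _, v in replica_a] + [v for _, v in replica_b]
# Chunks merged by timestamp: the series stays monotonic.
merged = merge_by_timestamp(replica_a, replica_b)

print(increase_with_resets(naive))   # 170.0
print(increase_with_resets(merged))  # 40.0
```

In this toy data the counter really grew by 40, but the unmerged read reports 170 because the drop from 130 to 119 is counted as a restart. A longer range covers more chunk boundaries, which would be consistent with the bug only showing up at 30m+.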
About this issue
- State: closed
- Created 4 years ago
- Comments: 22 (22 by maintainers)
Commits related to this issue
- Added DebugLocalStore and repro test for querier counter reset bug. Reproduces: https://github.com/thanos-io/thanos/issues/2401 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 4 years ago
- Added LocalStore and realistic data for querier counter reset bug. Tries to reproduces: https://github.com/thanos-io/thanos/issues/2401 I would still merge as it is a great test, and allows us to qu... — committed to thanos-io/thanos by bwplotka 4 years ago
- Fixed and added more regressions tests for reset counter dedup bug. Fixes https://github.com/thanos-io/thanos/issues/2401 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 4 years ago
- querier: Fixed and added more regressions tests for counter missed bug. Fixes https://github.com/thanos-io/thanos/issues/2401 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 4 years ago
- Added LocalStore and realistic data for querier counter reset bug. (#2522) * Added LocalStore and realistic data for querier counter reset bug. Tries to reproduces: https://github.com/thanos-io/th... — committed to thanos-io/thanos by bwplotka 4 years ago
- Added LocalStore and realistic data for querier counter reset bug. (#2522) (#2538) * Added LocalStore and realistic data for querier counter reset bug. Tries to reproduces: https://github.com/than... — committed to thanos-io/thanos by bwplotka 4 years ago
- querier: Added regressions tests for counter missed bug. PR with just tests, not fix yet. Reproduces: https://github.com/thanos-io/thanos/issues/2401 * Added regressions tests for CounterSeriesIter... — committed to thanos-io/thanos by bwplotka 4 years ago
It actually saddens me that Prometheus “by design” doesn’t really cope with scrape intervals >2m. I’d love to see future Prometheus versions lifting that arbitrary limit, and I’d therefore prefer if Thanos didn’t bake in that limit into its own design, too.
Interestingly, I’d also love to see a future Prometheus version with first-class support for metric types. That would then also solve your problem of how to safely recognize a counter.
@SuperQ this repro is amazing. I can explore all the details. We definitely have overlapping and unsorted chunks. We should be able to find the problem in our algorithm soon, thanks!
BTW… I kind of overengineered (as you can imagine) and wrote `thanos tools storeapi serve --json=<file x>`, which can serve JSON (protobuf based) as a Store API 🎉 So I can take your file (actually anything generated by `grpcurl`), put it into `thanos tools storeapi serve --json`, run the querier, connect `storeapi serve` as `--store`, and see your results.

That’s not common, but you could depend on no one having a scrape interval over 2 minutes, as that’s not sane for other reasons.