thanos: querier: Rate over deduplicated counter from many replicas can lead to double reset accounting.

Found by GitLab; we have been investigating this offline with @SuperQ.

Their issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9293

This is only reproducible with large rate windows ([30m] and above), which suggests it has to do with chunk ordering or overlaps.
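To illustrate why chunk ordering or overlaps could produce this, here is a minimal sketch. It is not Thanos's deduplication code or PromQL's actual rate implementation; `increase`, `merge_by_time`, and the sample values are hypothetical and only model the usual "a drop in value means the counter reset" rule. When two overlapping chunks that both contain the same reset are concatenated instead of merged by timestamp, the reset is accounted for twice and the rewind between chunks is counted as growth.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, counter value)


def increase(samples: List[Sample]) -> float:
    """Sum counter growth, treating any drop in value as a reset to zero."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarted from 0, so `cur` is the growth.
        total += cur if cur < prev else cur - prev
    return total


def merge_by_time(*chunks: List[Sample]) -> List[Sample]:
    """Sort all samples by timestamp, keeping one sample per timestamp."""
    merged = {}
    for chunk in chunks:
        for ts, value in chunk:
            merged.setdefault(ts, value)
    return sorted(merged.items())


# Hypothetical data: two overlapping chunks of the same counter,
# which resets between t=30 and t=60.
chunk_x = [(0, 100.0), (30, 110.0), (60, 5.0), (90, 15.0)]
chunk_y = [(30, 110.0), (60, 5.0), (90, 15.0), (120, 25.0)]

clean = merge_by_time(chunk_x, chunk_y)  # overlaps resolved, sorted by time
naive = chunk_x + chunk_y                # overlapping chunks concatenated as-is

print(increase(clean))  # 35.0: the reset is seen once
print(increase(naive))  # 145.0: the reset is seen twice, plus a bogus jump
```

A longer rate window spans more chunks, so it is more likely to cross a chunk boundary where such an overlap exists, which would be consistent with the issue only showing up at [30m] and above.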

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 22 (22 by maintainers)

Most upvoted comments

It actually saddens me that Prometheus “by design” doesn’t really cope with scrape intervals >2m. I’d love to see future Prometheus versions lift that arbitrary limit, and I’d therefore prefer that Thanos not bake that limit into its own design, too.

Interestingly, I’d also love to see a future Prometheus version have first-class support for metric types. That would then also solve your problem of how to safely recognize a counter.

@SuperQ this repro is so amazing, I can explore all the details. We definitely have overlapping and unsorted chunks. We should be able to find the problem in our algorithm soon, thanks!

BTW… I kind of overengineered this (as you can imagine) and wrote thanos tools storeapi serve --json=<file x>, which can serve a JSON file (protobuf-based) as a Store API 🎉

So I can take your file (actually anything generated by grpcurl), put it into thanos tools storeapi serve --json, run a querier, connect the storeapi serve instance as --store, and see your results.

(downside: what if the scrape interval changes?)

That’s not common, but you could depend on no one having a scrape interval over 2 minutes, as that’s not sane for other reasons.