thanos: Compaction failing due to out-of-order chunks

The global compactor is halting due to the following error:

invalid result block /app/data/compact/01CAAX3A54YPTXPMGMQTJT9T7F: 119381/1087562 series have an average of 1.340 out-of-order chunks

I haven’t dug into the issue yet, but I wonder if it’s related to @gouthamve’s comment that out-of-order appends are not prevented in the TSDB library when out-of-order samples are appended to the same appender: https://github.com/prometheus/tsdb/pull/258#issuecomment-378153699

About this issue

State: closed
Created 6 years ago
Comments: 33 (26 by maintainers)

Most upvoted comments

So the bug you mentioned before was observed in Prometheus?

No, I just meant that we are using the vanilla TSDB compactor and the input we give it is entirely non-overlapping, just like in Prometheus.

To be sure, do you suspect the issue is in the TSDB library?

That’s the question. Seems like there shouldn’t be anything Thanos-specific in that path, so a TSDB issue is possible.
That’s what we have to find out now, I guess 😃

If you can send me some pointers, I’m happy to debug and can maybe share some more specific metadata.

You are probably very familiar with the TSDB library, so that helps 😃 I think your best bet would be to pull the blocks in question onto your machine and write a standalone program that compacts them. Then run Thanos’ block verification (the check that makes the compactor halt) on the result, but a bit more verbosely, to see how the chunks are actually out of order. Inspecting the same chunks in the input blocks may reveal something. If everything looks normal, some simple debug printing in the compactor may help.

You can probably just focus on a single series that is affected to avoid the noise – it will likely be the same issue for all of them.
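
A minimal sketch of the kind of per-series check described above, in case it helps with the debugging. It is not Thanos' or the TSDB library's actual verification code: the ChunkMeta type and the countOutOfOrder and summarize helpers are hypothetical names made up for illustration. It also assumes that "out of order" means a chunk starts at or before the end of the chunk preceding it in the series, and that the reported average is taken over affected series only.

```go
package main

import "fmt"

// ChunkMeta is a hypothetical stand-in for a chunk's time range
// (milliseconds); it is NOT the TSDB library's own type.
type ChunkMeta struct {
	MinTime, MaxTime int64
}

// countOutOfOrder returns how many chunks of one series start at or before
// the end of the chunk directly preceding them (assumed definition).
func countOutOfOrder(chks []ChunkMeta) int {
	n := 0
	for i := 1; i < len(chks); i++ {
		if chks[i].MinTime <= chks[i-1].MaxTime {
			n++
		}
	}
	return n
}

// summarize reports how many series contain at least one out-of-order chunk
// and the average number of such chunks among those affected series.
func summarize(series map[string][]ChunkMeta) (affected int, avg float64) {
	total := 0
	for _, chks := range series {
		if c := countOutOfOrder(chks); c > 0 {
			affected++
			total += c
		}
	}
	if affected > 0 {
		avg = float64(total) / float64(affected)
	}
	return affected, avg
}

func main() {
	// Made-up sample data, keyed by a series label string.
	series := map[string][]ChunkMeta{
		`{job="a"}`: {{0, 99}, {100, 199}, {200, 299}}, // in order
		`{job="b"}`: {{0, 99}, {50, 149}, {150, 249}},  // one chunk overlaps its predecessor
		`{job="c"}`: {{100, 199}, {0, 99}, {50, 120}},  // two chunks out of order
	}
	affected, avg := summarize(series)
	fmt.Printf("%d/%d series have an average of %.3f out-of-order chunks\n",
		affected, len(series), avg)
}
```

Running the same check on one affected series in both the input blocks and the compacted block should show whether the ordering problem is already present before compaction or is introduced by it.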

fabxc on Apr 5, 2018

If we hit this issue, is there any way to recover the block?

alvinlin123 on Feb 3, 2021

We’ve since upgraded to v0.14.0, will let you know if it reproduces again…

XDex on Aug 4, 2020