thanos: Compactor: upgrading to Prometheus 2.47.0 breaks the compactor

Thanos, Prometheus and Golang version used: Prometheus 2.47.0, Thanos 0.32.2

Object Storage Provider: Azure Blob

What happened: After upgrading Prometheus to version 2.47.0, the compactor stopped working. All blocks created after the upgrade have out-of-order chunks. The compactor fails when it reaches such a block with the error:

err="compaction: group 0@6061977523826161203: blocks with out-of-order chunks are dropped from compaction:  /data/compact/0@6061977523826161203/01HA7QRGQDB0Z4SV0BM2S4687R: 1157/361848 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"

What you expected to happen: The compactor succeeds in compacting the blocks.

How to reproduce it (as minimally and precisely as possible): Run Prometheus 2.47.0 with the compactor enabled.
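For reference, this is roughly the setup that triggers the failure (a minimal sketch only; the data directory and bucket config path are placeholders):

# Prometheus 2.47.0 writes the blocks (e.g. uploaded via the Thanos sidecar),
# then the compactor runs against the same bucket and fails once it reaches
# a block produced by 2.47.0.
thanos compact \
  --data-dir=/data \
  --objstore.config-file=/conf/objstore.yml \
  --wait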

Full logs to relevant components:

k exec --stdin --tty prometheus-stack-thanos-storegateway-0 -- thanos tools bucket verify --objstore.config-file=/conf/objstore.yml

ts=2023-09-15T08:01:10.601379049Z caller=index_issue.go:61 level=warn verifiers=overlapped_blocks,index_known_issues verifier=index_known_issues msg="detected issue" id=01HA90YWB92D35CFAS57RQESJ7 err="538/230171 series have an average of 1.186 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
ts=2023-09-15T08:01:12.606537326Z caller=index_issue.go:61 level=warn verifiers=overlapped_blocks,index_known_issues verifier=index_known_issues msg="detected issue" id=01HA8CBPM19NR869TN69ANARQV err="674/229044 series have an average of 1.221 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
ts=2023-09-15T08:01:14.526769349Z caller=index_issue.go:61 level=warn verifiers=overlapped_blocks,index_known_issues verifier=index_known_issues msg="detected issue" id=01HA8K7DVGHQJ04JTEETER4TP6 err="596/228614 series have an average of 1.215 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
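The bucket configuration referenced by --objstore.config-file above looks roughly like this for Azure Blob (a sketch only; the storage account, key and container values are placeholders):

# Minimal Thanos objstore config for Azure Blob storage.
cat > /conf/objstore.yml <<'EOF'
type: AZURE
config:
  storage_account: "<storage-account-name>"
  storage_account_key: "<storage-account-key>"
  container: "<container-name>"
EOF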

Anything else we need to know: After reverting to 2.46.0, the newly created blocks no longer give the error, and the issue appears to be resolved.

This problem has already been discussed on the CNCF Slack: https://cloud-native.slack.com/archives/CK5RSSC10/p1694681247238809

About this issue

  • State: open
  • Created 10 months ago
  • Reactions: 15
  • Comments: 25 (12 by maintainers)

Most upvoted comments

@saswatamcode After taking another look, this seems to be a Prometheus-only issue. Thanos still uses an older version of Prometheus, so it is not affected by this bug.

It is just the bad blocks created by Prometheus that cause the compactor to fail.

Same issue for me, so I added log prints to find out what the problem was and found that it had already been fixed in https://github.com/prometheus/prometheus/pull/12874, but that fix has not been released yet.

I don’t want to delete blocks because of this problem; I want a tool that fixes them, so I’m going to add a command to the thanos tools CLI to handle it.

CC @saswatamcode I guess we can do a v0.32.4 release for this fix (https://github.com/prometheus/prometheus/pull/12874) and the previous fixes.

When you rolled back the versions, did you also delete the problematic chunks that halted the compaction?

We run the compactor with "--compact.skip-block-with-out-of-order-chunks", so the affected blocks are marked for no compaction; we did not delete anything, to prevent data loss.
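For completeness, a sketch of a compactor invocation with that flag (the other flags and paths are placeholders): instead of halting, the compactor skips blocks containing out-of-order chunks and marks them so they are excluded from future compactions.

thanos compact \
  --data-dir=/data \
  --objstore.config-file=/conf/objstore.yml \
  --compact.skip-block-with-out-of-order-chunks \
  --wait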

It’s a rollback, not a solution 😃