thanos: Thanos Compactor Failure: overlaps found while gathering blocks.
Thanos, Prometheus and Golang version used: Docker image master-2018-08-04-8b7169b (an older version was also affected).
What happened: The Thanos compactor is failing to run (it crashes).
What you expected to happen: The Thanos compactor to compact 😃
How to reproduce it (as minimally and precisely as possible): Good question. It was running fine before; when I checked the server I found it restarting in a loop (because of the Docker restart policy, which restarts the container on crash).
Full logs of relevant components:
level=info ts=2018-08-08T14:49:39.577203448Z caller=compact.go:231 msg="starting compact node"
level=info ts=2018-08-08T14:49:39.578379004Z caller=compact.go:126 msg="start sync of metas"
level=info ts=2018-08-08T14:49:39.579995202Z caller=main.go:243 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-08-08T14:49:50.947989923Z caller=compact.go:132 msg="start of GC"
level=error ts=2018-08-08T14:49:50.954091484Z caller=main.go:160 msg="running command failed" err="compaction: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1532649600000, maxt: 1532656800000, range: 2h0m0s, blocks: 2]: <ulid: 01CKCTVGR19AJBKHFC01YV2248, mint: 1532649600000, maxt: 1532656800000, range: 2h0m0s>, <ulid: 01CKCTVG2HS86DFDTZZ2GJM26W, mint: 1532649600000, maxt: 1532656800000, range: 2h0m0s>\n[mint: 1532685600000, maxt: 1532692800000, range: 2h0m0s, blocks: 2]: <ulid: 01CKDX64BMRYBJY5V9YY3WRYFS, mint: 1532685600000, maxt: 1532692800000, range: 2h0m0s>, <ulid: 01CKDX6534B4P3XF6EZ49MC4ND, mint: 1532685600000, maxt: 1532692800000, range: 2h0m0s>\n[mint: 1532613600000, maxt: 1532620800000, range: 2h0m0s, blocks: 2]: <ulid: 01CKBRGX8BN9S8SWG4KJ0XW3GM, mint: 1532613600000, maxt: 1532620800000, range: 2h0m0s>, <ulid: 01CKBRGVVF5RENZ12K5XYMPS3E, mint: 1532613600000, maxt: 1532620800000, range: 2h0m0s>\n[mint: 1532628000000, maxt: 1532635200000, range: 2h0m0s, blocks: 2]: <ulid: 01CKC68AZYW5Z1M2XZNW95KPXZ, mint: 1532628000000, maxt: 1532635200000, range: 2h0m0s>, <ulid: 01CKC68ACJ1VHATJ5Y39T61CNE, mint: 1532628000000, maxt: 1532635200000, range: 2h0m0s>\n[mint: 1532642400000, maxt: 1532649600000, range: 2h0m0s, blocks: 2]: <ulid: 01CKCKZRVJQ77NE05XB89R4VM8, mint: 1532642400000, maxt: 1532649600000, range: 2h0m0s>, <ulid: 01CKCKZSJ0QKEVE9N83HMT20T0, mint: 1532642400000, maxt: 1532649600000, range: 2h0m0s>\n[mint: 1532635200000, maxt: 1532642400000, range: 2h0m0s, blocks: 2]: <ulid: 01CKCD42YHRE9T7Y3XEJEBTM1M, mint: 1532635200000, maxt: 1532642400000, range: 2h0m0s>, <ulid: 01CKCD41MF1BBJQSC0QRQGWFEQ, mint: 1532635200000, maxt: 1532642400000, range: 2h0m0s>\n[mint: 1532656800000, maxt: 1532664000000, range: 2h0m0s, blocks: 2]: <ulid: 01CKD1Q7CGTFYRNG8ZQ88GYCJZ, mint: 1532656800000, maxt: 1532664000000, range: 2h0m0s>, <ulid: 01CKD1Q8250YVMT67JX14V9984, mint: 1532656800000, maxt: 1532664000000, range: 2h0m0s>\n[mint: 1532664000000, maxt: 1532671200000, range: 2h0m0s, blocks: 2]: <ulid: 01CKD8JZ827S7QXPM3B8B15R3N, mint: 1532664000000, maxt: 1532671200000, range: 2h0m0s>, <ulid: 01CKD8JYMFKQYBGEHST34F2VWS, mint: 1532664000000, maxt: 1532671200000, range: 2h0m0s>\n[mint: 1532671200000, maxt: 1532678400000, range: 2h0m0s, blocks: 2]: <ulid: 01CKDFENVK5JFKF7XCSXZYTADY, mint: 1532671200000, maxt: 1532678400000, range: 2h0m0s>, <ulid: 01CKDFEPJ0Y5K1KMJ1N2RWEQCR, mint: 1532671200000, maxt: 1532678400000, range: 2h0m0s>\n[mint: 1532678400000, maxt: 1532685600000, range: 2h0m0s, blocks: 2]: <ulid: 01CKDPADT2RQF99MTCX3KSFF2K, mint: 1532678400000, maxt: 1532685600000, range: 2h0m0s>, <ulid: 01CKDPAD3MV32M8YGN86WC0KH2, mint: 1532678400000, maxt: 1532685600000, range: 2h0m0s>\n[mint: 1532599200000, maxt: 1532606400000, range: 2h0m0s, blocks: 2]: <ulid: 01CKBDJ2R789A38KRMQFM7HQSG, mint: 1532599200000, maxt: 1532606400000, range: 2h0m0s>, <ulid: 01CKBASDZX4YKAKV42A86VTPM5, mint: 1532599200000, maxt: 1532606400000, range: 2h0m0s>\n[mint: 1532606400000, maxt: 1532613600000, range: 2h0m0s, blocks: 2]: <ulid: 01CKBHN60CPVCJW9R3BVNR8GD7, mint: 1532606400000, maxt: 1532613600000, range: 2h0m0s>, <ulid: 01CKBHN4KMZ3BNSB7EE4KH7ACT, mint: 1532606400000, maxt: 1532613600000, range: 2h0m0s>\n[mint: 1532620800000, maxt: 1532628000000, range: 2h0m0s, blocks: 2]: <ulid: 01CKBZCMG6BCZHT143FHEHV85X, mint: 1532620800000, maxt: 1532628000000, range: 2h0m0s>, <ulid: 01CKBZCK3FWKTZW8Z5PWH0PXSS, mint: 1532620800000, maxt: 1532628000000, range: 2h0m0s>"
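For what it's worth, one way to narrow this kind of error down is to pull the meta.json of a pair of blocks that the log flags as covering the same 2h window and compare their external labels and compaction sources. A minimal sketch, assuming gsutil access to the (redacted) xxx-thanos bucket from the configuration below, using two of the ULIDs from the log above:
# Fetch the metadata of two blocks that overlap on the same 2h window.
gsutil cat gs://xxx-thanos/01CKCTVGR19AJBKHFC01YV2248/meta.json > a.json
gsutil cat gs://xxx-thanos/01CKCTVG2HS86DFDTZZ2GJM26W/meta.json > b.json
# Identical "labels" on blocks coming from different producers would point at an
# external-labels misconfiguration; identical compaction "sources" would point at
# the same block having been uploaded or compacted twice.
grep -A 5 '"labels"' a.json b.json
grep -A 5 '"sources"' a.json b.json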
Related configuration from our DSC (Salt)
Pillar / Variables
prometheus:
retention: 1d
version: v2.3.2
thanos:
version: master-2018-08-04-8b7169b
Configuration
Make sure Prometheus is running in Docker:
docker_container.running:
- name: prometheus
- image: prom/prometheus:{{ pillar["prometheus"]["version"] }}
- user: 4000
- restart_policy: always
- network_mode: host
- command: >
--config.file=/etc/prometheus/prometheus-{{ grains["prometheus"]["environment"] }}.yml
--storage.tsdb.path=/prometheus
--web.enable-admin-api
--web.console.libraries=/etc/prometheus/console_libraries
--web.console.templates=/etc/prometheus/consoles
--storage.tsdb.retention {{ pillar["prometheus"]["retention"] }}
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=2h
--web.enable-lifecycle
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
- binds:
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- "/var/data/prometheus:/prometheus:Z"
Make sure Thanos Sidecar is running in Docker:
docker_container.running:
- name: thanos-sidecar
- image: improbable/thanos:{{ pillar["thanos"]["version"] }}
- user: 4000
- restart_policy: always
- network_mode: host
- command: >
sidecar
--prometheus.url http://localhost:9090
--tsdb.path /prometheus
--gcs.bucket xxx-thanos
--cluster.peers query.thanos.internal.xxx:10906
--cluster.peers store.thanos.internal.xxx:10903
--cluster.address 0.0.0.0:10900
--cluster.advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10900
--grpc-address 0.0.0.0:10901
--grpc-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10901
--http-address 0.0.0.0:10902
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
- binds:
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- "/var/data/prometheus:/prometheus:Z"
Make sure Thanos Query is running in Docker:
docker_container.running:
- name: thanos-query
- image: improbable/thanos:{{ pillar["thanos"]["version"] }}
- user: 4002
- restart_policy: always
- network_mode: host
- command: >
query
--cluster.peers store.thanos.internal.xxx:10903
--query.replica-label replica
--cluster.address 0.0.0.0:10906
--cluster.advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10906
--grpc-address 0.0.0.0:10907
--grpc-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10907
--http-address 0.0.0.0:10908
--http-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10908
- binds:
- "/var/data/thanos:/var/data/thanos:Z"
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
Make sure Thanos Store is running in Docker:
docker_container.running:
- name: thanos-store
- image: improbable/thanos:{{ pillar["thanos"]["version"] }}
- user: 4002
- restart_policy: always
- network_mode: host
- command: >
store
--tsdb.path /var/data/thanos/store
--cluster.peers query.thanos.internal.xxx:10900
--gcs.bucket xxx-thanos
--cluster.address 0.0.0.0:10903
--cluster.advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10903
--grpc-address 0.0.0.0:10904
--grpc-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10904
--http-address 0.0.0.0:10905
- binds:
- "/var/data/thanos:/var/data/thanos:Z"
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
Make sure Thanos Compactor is running in Docker:
docker_container.running:
- name: thanos-compactor
- image: improbable/thanos:{{ pillar["thanos"]["version"] }}
- user: 4002
- restart_policy: on-failure
- command: >
compact
--data-dir /var/data/thanos/compact
--gcs.bucket xxx-thanos
- binds:
- "/var/data/thanos:/var/data/thanos:Z"
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
Set Thanos compactor cron task:
cron.present:
- name: docker restart thanos-compactor
- user: root
- minute: 0
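For context on this cron entry: in this setup thanos compact processes the bucket once and then exits, so the container is bounced every hour to pick up new work. As a sketch of an alternative, assuming the --wait flag is available in the version being run, the compactor can instead be kept alive as a long-lived service:
# Same flags as the container above, plus --wait so the process does not exit
# after a single pass and keeps polling the bucket for new compaction work.
thanos compact \
  --data-dir /var/data/thanos/compact \
  --gcs.bucket xxx-thanos \
  --wait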
Let me know if you need any more information.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 24 (9 by maintainers)
Getting the same issue on versions v0.8.1 and v0.9.0.
Same here for Thanos version 0.6.0.
Hey everyone 👋🏼 I would like to explore the possibility of adding some sort of fixing command to the thanos tools bucket group of subcommands to facilitate handling this kind of issue.
@bwplotka Thanks for all the details shared on this issue! While I understand why the team decided against fixing the problem without knowing its origin, in the end users do need some way to clean up the bucket so the Thanos compactor can get back to work on the newly added data and everything else that isn’t overlapping, even if the root cause was a misconfiguration. Depending on the time window and the amount of data stored in the bucket, the cleanup effort can get pretty big, forcing users to write their own scripts and risking ending up in an even worse situation.
I would like to propose adding a complementary command, or evolving the current one (thanos tools bucket verify --repair), with the ability to move the affected blocks to a backup bucket (or something similar), so that users could get the compactor running again and then decide what to do with the affected data. We could also consider a way to move the data back once it’s sorted out 🤔
If that’s something that makes sense for the project, I would love to explore the topic and contribute. I would appreciate some feedback and guidance on this (should I open a new issue?)
Thanks
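For anyone arriving here, a rough sketch of how the existing command can be driven, assuming a GCS object-store configuration file (the names bucket.yml and backup.yml below are placeholders) and noting that exact flag names vary between releases, so thanos tools bucket verify --help for your version is authoritative:
# bucket.yml -- object store configuration for the affected bucket
type: GCS
config:
  bucket: xxx-thanos
# Dry run: report detected issues without modifying anything.
thanos tools bucket verify --objstore.config-file=bucket.yml
# Optionally attempt a repair; blocks removed by the repair logic can be backed
# up to a second bucket described by backup.yml.
thanos tools bucket verify --objstore.config-file=bucket.yml \
  --objstore-backup.config-file=backup.yml --repair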
@bwplotka I’d like to revisit this issue, if you have a moment. We have a very simple stack set up using this docker-compose configuration. It has:
After running for just a couple of days, we’re running into the “error executing compaction: compaction failed: compaction failed for group …: pre compaction overlap check: overlaps found while gathering blocks.” error.
The troubleshooting document suggests the following reasons for this error:
Misconfiguration of sidecar/ruler: Same external labels or no external labels across many block producers.
We only have a single sidecar (and no ruler); one way to check this against the bucket itself is sketched after this list.
Running multiple compactors for single block “stream”, even for short duration.
We only have a single compactor.
Manually uploading blocks to the bucket.
This never happened.
Eventually consistent block storage until we fully implement RW for bucket
I’m not entirely sure what this is suggesting. Given only a single producer of information and a single bucket for storage, I’m not sure how eventual consistency could be a problem.
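One way to sanity-check the first two points directly against the bucket is sketched below, assuming the same placeholder bucket.yml object-store configuration as above and that the tools bucket inspect subcommand is available in the release being used:
# Print one row per block with its time range, external labels, resolution and
# source. Pairs of 2h blocks sharing identical external labels (or carrying none
# at all) point at the misconfiguration case; duplicated higher-level blocks
# point at more than one compactor having acted on the same stream.
thanos tools bucket inspect --objstore.config-file=bucket.yml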
If you have a minute (or @daixiang0 or someone else) I would appreciate some insight into what could be causing this problem.
We’re running:
With: