thanos: Thanos Compactor Failure: overlaps found while gathering blocks.
Thanos, Prometheus and Golang version used: Docker image master-2018-08-04-8b7169b (an older version was also affected).
What happened: The Thanos compactor is failing to run (it crashes).
What you expected to happen: The Thanos compactor to compact 😃
How to reproduce it (as minimally and precisely as possible): Good question. It was running fine before; when I checked the server I found it restarting in a loop (because of the Docker restart policy, which restarts the container on crash).
Full logs of relevant components:
level=info ts=2018-08-08T14:49:39.577203448Z caller=compact.go:231 msg="starting compact node"
level=info ts=2018-08-08T14:49:39.578379004Z caller=compact.go:126 msg="start sync of metas"
level=info ts=2018-08-08T14:49:39.579995202Z caller=main.go:243 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-08-08T14:49:50.947989923Z caller=compact.go:132 msg="start of GC"
level=error ts=2018-08-08T14:49:50.954091484Z caller=main.go:160 msg="running command failed" err="compaction: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1532649600000, maxt: 1532656800000, range: 2h0m0s, blocks: 2]: <ulid: 01CKCTVGR19AJBKHFC01YV2248, mint: 1532649600000, maxt: 1532656800000, range: 2h0m0s>, <ulid: 01CKCTVG2HS86DFDTZZ2GJM26W, mint: 1532649600000, maxt: 1532656800000, range: 2h0m0s>\n[mint: 1532685600000, maxt: 1532692800000, range: 2h0m0s, blocks: 2]: <ulid: 01CKDX64BMRYBJY5V9YY3WRYFS, mint: 1532685600000, maxt: 1532692800000, range: 2h0m0s>, <ulid: 01CKDX6534B4P3XF6EZ49MC4ND, mint: 1532685600000, maxt: 1532692800000, range: 2h0m0s>\n[mint: 1532613600000, maxt: 1532620800000, range: 2h0m0s, blocks: 2]: <ulid: 01CKBRGX8BN9S8SWG4KJ0XW3GM, mint: 1532613600000, maxt: 1532620800000, range: 2h0m0s>, <ulid: 01CKBRGVVF5RENZ12K5XYMPS3E, mint: 1532613600000, maxt: 1532620800000, range: 2h0m0s>\n[mint: 1532628000000, maxt: 1532635200000, range: 2h0m0s, blocks: 2]: <ulid: 01CKC68AZYW5Z1M2XZNW95KPXZ, mint: 1532628000000, maxt: 1532635200000, range: 2h0m0s>, <ulid: 01CKC68ACJ1VHATJ5Y39T61CNE, mint: 1532628000000, maxt: 1532635200000, range: 2h0m0s>\n[mint: 1532642400000, maxt: 1532649600000, range: 2h0m0s, blocks: 2]: <ulid: 01CKCKZRVJQ77NE05XB89R4VM8, mint: 1532642400000, maxt: 1532649600000, range: 2h0m0s>, <ulid: 01CKCKZSJ0QKEVE9N83HMT20T0, mint: 1532642400000, maxt: 1532649600000, range: 2h0m0s>\n[mint: 1532635200000, maxt: 1532642400000, range: 2h0m0s, blocks: 2]: <ulid: 01CKCD42YHRE9T7Y3XEJEBTM1M, mint: 1532635200000, maxt: 1532642400000, range: 2h0m0s>, <ulid: 01CKCD41MF1BBJQSC0QRQGWFEQ, mint: 1532635200000, maxt: 1532642400000, range: 2h0m0s>\n[mint: 1532656800000, maxt: 1532664000000, range: 2h0m0s, blocks: 2]: <ulid: 01CKD1Q7CGTFYRNG8ZQ88GYCJZ, mint: 1532656800000, maxt: 1532664000000, range: 2h0m0s>, <ulid: 01CKD1Q8250YVMT67JX14V9984, mint: 1532656800000, maxt: 1532664000000, range: 2h0m0s>\n[mint: 1532664000000, maxt: 1532671200000, range: 2h0m0s, blocks: 2]: <ulid: 01CKD8JZ827S7QXPM3B8B15R3N, mint: 1532664000000, maxt: 1532671200000, range: 2h0m0s>, <ulid: 01CKD8JYMFKQYBGEHST34F2VWS, mint: 1532664000000, maxt: 1532671200000, range: 2h0m0s>\n[mint: 1532671200000, maxt: 1532678400000, range: 2h0m0s, blocks: 2]: <ulid: 01CKDFENVK5JFKF7XCSXZYTADY, mint: 1532671200000, maxt: 1532678400000, range: 2h0m0s>, <ulid: 01CKDFEPJ0Y5K1KMJ1N2RWEQCR, mint: 1532671200000, maxt: 1532678400000, range: 2h0m0s>\n[mint: 1532678400000, maxt: 1532685600000, range: 2h0m0s, blocks: 2]: <ulid: 01CKDPADT2RQF99MTCX3KSFF2K, mint: 1532678400000, maxt: 1532685600000, range: 2h0m0s>, <ulid: 01CKDPAD3MV32M8YGN86WC0KH2, mint: 1532678400000, maxt: 1532685600000, range: 2h0m0s>\n[mint: 1532599200000, maxt: 1532606400000, range: 2h0m0s, blocks: 2]: <ulid: 01CKBDJ2R789A38KRMQFM7HQSG, mint: 1532599200000, maxt: 1532606400000, range: 2h0m0s>, <ulid: 01CKBASDZX4YKAKV42A86VTPM5, mint: 1532599200000, maxt: 1532606400000, range: 2h0m0s>\n[mint: 1532606400000, maxt: 1532613600000, range: 2h0m0s, blocks: 2]: <ulid: 01CKBHN60CPVCJW9R3BVNR8GD7, mint: 1532606400000, maxt: 1532613600000, range: 2h0m0s>, <ulid: 01CKBHN4KMZ3BNSB7EE4KH7ACT, mint: 1532606400000, maxt: 1532613600000, range: 2h0m0s>\n[mint: 1532620800000, maxt: 1532628000000, range: 2h0m0s, blocks: 2]: <ulid: 01CKBZCMG6BCZHT143FHEHV85X, mint: 1532620800000, maxt: 1532628000000, range: 2h0m0s>, <ulid: 01CKBZCK3FWKTZW8Z5PWH0PXSS, mint: 1532620800000, maxt: 1532628000000, range: 2h0m0s>"
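For what it's worth, one way to narrow this kind of error down is to pull the meta.json of a pair of blocks that the log flags as covering the same 2h window and compare their external labels and compaction sources. A minimal sketch, assuming gsutil access to the (redacted) xxx-thanos bucket from the configuration below, using two of the ULIDs from the log above:
# Fetch the metadata of two blocks that overlap on the same 2h window.
gsutil cat gs://xxx-thanos/01CKCTVGR19AJBKHFC01YV2248/meta.json > a.json
gsutil cat gs://xxx-thanos/01CKCTVG2HS86DFDTZZ2GJM26W/meta.json > b.json
# Identical "labels" on blocks coming from different producers would point at an
# external-labels misconfiguration; identical compaction "sources" would point at
# the same block having been uploaded or compacted twice.
grep -A 5 '"labels"' a.json b.json
grep -A 5 '"sources"' a.json b.json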
Related configuration from our DSC (Salt)
Pillar / Variables
prometheus:
retention: 1d
version: v2.3.2
thanos:
version: master-2018-08-04-8b7169b
Configuration
Make sure Prometheus is running in Docker:
docker_container.running:
- name: prometheus
- image: prom/prometheus:{{ pillar["prometheus"]["version"] }}
- user: 4000
- restart_policy: always
- network_mode: host
- command: >
--config.file=/etc/prometheus/prometheus-{{ grains["prometheus"]["environment"] }}.yml
--storage.tsdb.path=/prometheus
--web.enable-admin-api
--web.console.libraries=/etc/prometheus/console_libraries
--web.console.templates=/etc/prometheus/consoles
--storage.tsdb.retention {{ pillar["prometheus"]["retention"] }}
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=2h
--web.enable-lifecycle
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
- binds:
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- "/var/data/prometheus:/prometheus:Z"
Make sure Thanos Sidecar is running in Docker:
docker_container.running:
- name: thanos-sidecar
- image: improbable/thanos:{{ pillar["thanos"]["version"] }}
- user: 4000
- restart_policy: always
- network_mode: host
- command: >
sidecar
--prometheus.url http://localhost:9090
--tsdb.path /prometheus
--gcs.bucket xxx-thanos
--cluster.peers query.thanos.internal.xxx:10906
--cluster.peers store.thanos.internal.xxx:10903
--cluster.address 0.0.0.0:10900
--cluster.advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10900
--grpc-address 0.0.0.0:10901
--grpc-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10901
--http-address 0.0.0.0:10902
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
- binds:
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- "/var/data/prometheus:/prometheus:Z"
Make sure Thanos Query is running in Docker:
docker_container.running:
- name: thanos-query
- image: improbable/thanos:{{ pillar["thanos"]["version"] }}
- user: 4002
- restart_policy: always
- network_mode: host
- command: >
query
--cluster.peers store.thanos.internal.xxx:10903
--query.replica-label replica
--cluster.address 0.0.0.0:10906
--cluster.advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10906
--grpc-address 0.0.0.0:10907
--grpc-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10907
--http-address 0.0.0.0:10908
--http-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10908
- binds:
- "/var/data/thanos:/var/data/thanos:Z"
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
Make sure Thanos Store is running in Docker:
docker_container.running:
- name: thanos-store
- image: improbable/thanos:{{ pillar["thanos"]["version"] }}
- user: 4002
- restart_policy: always
- network_mode: host
- command: >
store
--tsdb.path /var/data/thanos/store
--cluster.peers query.thanos.internal.xxx:10900
--gcs.bucket xxx-thanos
--cluster.address 0.0.0.0:10903
--cluster.advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10903
--grpc-address 0.0.0.0:10904
--grpc-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10904
--http-address 0.0.0.0:10905
- binds:
- "/var/data/thanos:/var/data/thanos:Z"
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
Make sure Thanos Compactor is running in Docker:
docker_container.running:
- name: thanos-compactor
- image: improbable/thanos:{{ pillar["thanos"]["version"] }}
- user: 4002
- restart_policy: on-failure
- command: >
compact
--data-dir /var/data/thanos/compact
--gcs.bucket xxx-thanos
- binds:
- "/var/data/thanos:/var/data/thanos:Z"
- "/etc/monitoring/prometheus:/etc/prometheus:Z"
- environment:
- GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
Set Thanos compactor cron task:
cron.present:
- name: docker restart thanos-compactor
- user: root
- minute: 0
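For context on this cron entry: in this setup thanos compact processes the bucket once and then exits, so the container is bounced every hour to pick up new work. As a sketch of an alternative, assuming the --wait flag is available in the version being run, the compactor can instead be kept alive as a long-lived service:
# Same flags as the container above, plus --wait so the process does not exit
# after a single pass and keeps polling the bucket for new compaction work.
thanos compact \
  --data-dir /var/data/thanos/compact \
  --gcs.bucket xxx-thanos \
  --wait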
Let me know if you need any more information.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 24 (9 by maintainers)
Getting the same issue on versions v0.8.1 and v0.9.0.
Same here for Thanos version 0.6.0.
Hey everyone 👋🏼 I would like to explore the possibility of adding some sort of fixing command to the thanos tools bucket group of subcommands to facilitate handling this kind of issue.
@bwplotka Thanks for all the details shared on this issue! While I understand why the team decided against fixing the problem without knowing its origin, in the end users do need some way to clean up the bucket so the Thanos compactor can get back to work on the newly added data and everything else that isn’t overlapping, even if the root cause was a misconfiguration. Depending on the time window and the amount of data stored in the bucket, the cleanup effort can get pretty big, forcing users to write their own scripts and risking ending up in an even worse situation.
I would like to propose adding a complementary command, or evolving the current one (thanos tools bucket verify --repair), with the ability to move the affected blocks to a backup bucket (or something similar), so that users could get the compactor running again and then decide what to do with the affected data. We could also consider a way to move the data back once it’s sorted out 🤔
If that’s something that makes sense for the project, I would love to explore the topic and contribute. I would appreciate some feedback and guidance on this (should I open a new issue?)
Thanks
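For anyone arriving here, a rough sketch of how the existing command can be driven, assuming a GCS object-store configuration file (the names bucket.yml and backup.yml below are placeholders) and noting that exact flag names vary between releases, so thanos tools bucket verify --help for your version is authoritative:
# bucket.yml -- object store configuration for the affected bucket
type: GCS
config:
  bucket: xxx-thanos
# Dry run: report detected issues without modifying anything.
thanos tools bucket verify --objstore.config-file=bucket.yml
# Optionally attempt a repair; blocks removed by the repair logic can be backed
# up to a second bucket described by backup.yml.
thanos tools bucket verify --objstore.config-file=bucket.yml \
  --objstore-backup.config-file=backup.yml --repair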
@bwplotka I’d like to revisit this issue, if you have a moment. We have a very simple stack set up using this docker-compose configuration. It has:
After running for just a couple of days, we’re running into the “error executing compaction: compaction failed: compaction failed for group …: pre compaction overlap check: overlaps found while gathering blocks.” error.
The troubleshooting document suggests the following reasons for this error:
Misconfiguration of sidecar/ruler: Same external labels or no external labels across many block producers.
We only have a single sidecar (and no ruler); one way to check this against the bucket itself is sketched after this list.
Running multiple compactors for single block “stream”, even for short duration.
We only have a single compactor.
Manually uploading blocks to the bucket.
This never happened.
Eventually consistent block storage until we fully implement RW for bucket
I’m not entirely sure what this is suggesting. Given only a single producer of information and a single bucket for storage, I’m not sure how eventual consistency could be a problem.
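One way to sanity-check the first two points directly against the bucket is sketched below, assuming the same placeholder bucket.yml object-store configuration as above and that the tools bucket inspect subcommand is available in the release being used:
# Print one row per block with its time range, external labels, resolution and
# source. Pairs of 2h blocks sharing identical external labels (or carrying none
# at all) point at the misconfiguration case; duplicated higher-level blocks
# point at more than one compactor having acted on the same stream.
thanos tools bucket inspect --objstore.config-file=bucket.yml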
If you have a minute (or @daixiang0 or someone else) I would appreciate some insight into what could be causing this problem.
We’re running:
With: