thanos: S3/GCS: Upload/Delete inconsistency (missing chunk file)
Thanos, Prometheus and Golang version used
- thanos: improbable/thanos:v0.4.0
- go: 1.12.4
- prometheus: quay.io/prometheus/prometheus:v2.7.2
- go: 1.11.5
What happened
We use an on-premise S3 store (configured in Thanos via type: S3 in the object store config) that experienced availability issues over a 3 to 15 hour period (3 hours of frequent connection issues, 15 hours of less frequent connection issues). Multiple Thanos components (shipper/sidecar, compactor) experienced timeouts while awaiting responses from the S3 service.
This resulted in the shipper/sidecar not writing complete blocks. We observed the following types of partially written blocks:
- block directory present, meta.json present, index and chunks directory missing.
- block directory present, meta.json present, index present, chunks directory missing.
- block directory present, meta.json present, index present, chunks directory present, but not all chunk files being present.
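To make these failure modes concrete, here is a minimal Go sketch (not Thanos code) of how such blocks can be flagged in a bucket. The BucketReader interface, the mapBucket test double, and the checkBlock helper are assumptions for the example, loosely modelled on a generic object-store client:

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// BucketReader is an assumed, minimal object-store reader for this sketch.
type BucketReader interface {
	Exists(ctx context.Context, name string) (bool, error)
	// Iter calls f for every object whose name starts with prefix.
	Iter(ctx context.Context, prefix string, f func(name string) error) error
}

// checkBlock reports whether a block with the given ULID has meta.json but
// is missing its index or all of its chunk files.
func checkBlock(ctx context.Context, bkt BucketReader, id string) (bool, error) {
	hasMeta, err := bkt.Exists(ctx, id+"/meta.json")
	if err != nil || !hasMeta {
		return false, err // no meta.json at all is a different (plain partial-upload) case
	}
	hasIndex, err := bkt.Exists(ctx, id+"/index")
	if err != nil {
		return false, err
	}
	hasChunks := false
	err = bkt.Iter(ctx, id+"/chunks/", func(string) error {
		hasChunks = true
		return nil
	})
	return !hasIndex || !hasChunks, err
}

// mapBucket is a toy in-memory bucket, only here to make the sketch runnable.
type mapBucket map[string]bool

func (b mapBucket) Exists(_ context.Context, name string) (bool, error) {
	return b[name], nil
}

func (b mapBucket) Iter(_ context.Context, prefix string, f func(string) error) error {
	for name := range b {
		if strings.HasPrefix(name, prefix) {
			if err := f(name); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	// meta.json and index present, chunks missing entirely: the second case
	// from the list above.
	bkt := mapBucket{
		"01DEABNMMQFGXKXZMTKJF9T42Z/meta.json": true,
		"01DEABNMMQFGXKXZMTKJF9T42Z/index":     true,
	}
	partial, err := checkBlock(context.Background(), bkt, "01DEABNMMQFGXKXZMTKJF9T42Z")
	fmt.Println(partial, err) // true <nil>
}
```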
Those partially written (one might call them corrupted) blocks caused subsequent issues:
- store components failed queries over any time range that included a partially written block
- the compactor crashed with a non-zero exit code (log messages from before the crash are included below)
What you expected to happen
- sidecars do not give up on completing partially written blocks
- compactors don’t crash on partially written blocks (either skipping them or removing them outright)
- stores don’t fail queries on partially written blocks (e.g. by ignoring those blocks)
How to reproduce it (as minimally and precisely as possible):
We have been running into issues trying to build a minimal reproducible scenario. It would seem that it should be enough to have blocks matching the criteria mentioned above:
- block directory present, meta.json present, index and chunks directory missing.
- block directory present, meta.json present, index present, chunks directory present, but not all chunk files being present.
When trying this, however, we ran into the situation that these blocks did not end up in the compaction plan (see here).
It seems that once a block is considered in the compaction plan and GatherIndexIssueStats is executed, the compactor will fail if the index file is not present; if chunk files are missing, it will fail later in the Prometheus TSDB compaction code.
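For anyone else attempting a reproduction, the sketch below shows one way to manufacture such a block artificially: copy only index and meta.json from an existing TSDB block directory into the bucket and leave chunks/ out, which should trigger the later TSDB compaction failure. The Uploader interface, the uploadPartialBlock helper, and the stdoutUploader dry-run client are illustrative and not part of Thanos:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// Uploader is an assumed, minimal object-store writer for this sketch.
type Uploader interface {
	Upload(ctx context.Context, name string, r io.Reader) error
}

// uploadPartialBlock copies only index and meta.json from an existing TSDB
// block directory into the bucket, deliberately leaving chunks/ out. The
// result is the "meta.json + index present, chunks missing" layout above.
// meta.json is uploaded last, mirroring how blocks normally become visible.
func uploadPartialBlock(ctx context.Context, bkt Uploader, blockDir string) error {
	blockID := filepath.Base(blockDir)
	for _, name := range []string{"index", "meta.json"} {
		f, err := os.Open(filepath.Join(blockDir, name))
		if err != nil {
			return err
		}
		err = bkt.Upload(ctx, blockID+"/"+name, f)
		f.Close()
		if err != nil {
			return err
		}
	}
	return nil
}

// stdoutUploader is a dry-run Uploader that only reports what it would send.
type stdoutUploader struct{}

func (stdoutUploader) Upload(_ context.Context, name string, r io.Reader) error {
	n, err := io.Copy(io.Discard, r)
	fmt.Printf("would upload %s (%d bytes)\n", name, n)
	return err
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: partial-block-repro <block-dir>")
		return
	}
	if err := uploadPartialBlock(context.Background(), stdoutUploader{}, os.Args[1]); err != nil {
		fmt.Println("error:", err)
	}
}
```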
Partial logs to relevant components
```
level=error ts=2019-07-03T13:02:34.85314565Z caller=main.go:182 msg="running command failed"
err="error executing compaction: compaction failed: compaction failed for group 0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}: compact blocks [/var/thanos/store/compact/0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}/01DE9Q2EWM7Y5JZPB7V4MC4BBC /var/thanos/store/compact/0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}/01DE9XY64NVKB7Y62B9DH5A0AP /var/thanos/store/compact/0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}/01DEA4SXCNKCZ06C83Q45N6C2C /var/thanos/store/compact/0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}/01DEABNMMQFGXKXZMTKJF9T42Z]: write compaction: chunk 8 not found: reference sequence 0 out of range"
```
Currently this is the only log we have. Should we run into the issue again, I'll make sure to attach more logs from the other components and cases.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 27 (15 by maintainers)
Commits related to this issue
- add stats for partially written blocks see https://github.com/thanos-io/thanos/issues/1331 This enables us to identify partially written blocks that observe these issues: * missing index * chunks r... — committed to 0robustus1/thanos by 0robustus1 5 years ago
- Fixed partial delete issues on compactor; Added Upload/Delete tests. Fixes https://github.com/thanos-io/thanos/issues/1331 Problem that we are fixing is explained in the linked issue. Signed-off-by... — committed to thanos-io/thanos by bwplotka 5 years ago
- Fixed partial delete issues on compactor; Added Upload/Delete tests. (#1525) Fixes https://github.com/thanos-io/thanos/issues/1331 Problem that we are fixing is explained in the linked issue. S... — committed to thanos-io/thanos by bwplotka 5 years ago
- Fixed partial delete issues on compactor; Added Upload/Delete tests. (#1525) Fixes https://github.com/thanos-io/thanos/issues/1331 Problem that we are fixing is explained in the linked issue. S... — committed to wbh1/thanos by bwplotka 5 years ago
- Fixed partial delete issues on compactor; Added Upload/Delete tests. (#1525) Fixes https://github.com/thanos-io/thanos/issues/1331 Problem that we are fixing is explained in the linked issue. S... — committed to brancz/objstore by bwplotka 5 years ago
I think I found the bug, guys.
The problem is most likely here: https://github.com/thanos-io/thanos/blob/2c5f2cde11f5cd100f147ad2e5d4dbeccbd604c5/pkg/objstore/objstore.go#L95
Thanos is resilient to partial uploads in most cases. This is based on the small meta.json file: if it is present and the block is older than X minutes, we consider the block ready to be used. The delay is there for eventually consistent buckets. If there is no meta.json after X minutes, we assume it is a partial upload and the compactor removes the block.
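A rough sketch of that heuristic (not the actual Thanos implementation): the block's creation time is taken from its ULID, consistencyDelay stands in for the "X minutes", and the classify helper and its constants are illustrative.

```go
package main

import (
	"fmt"
	"time"

	"github.com/oklog/ulid"
)

// consistencyDelay stands in for the "X minutes" above; the value here is
// illustrative, not what Thanos uses.
const consistencyDelay = 30 * time.Minute

type blockState int

const (
	blockTooFresh blockState = iota // younger than the delay: might still be uploading
	blockReady                      // meta.json present and old enough: safe to use
	blockPartial                    // no meta.json and old enough: assumed abandoned, safe to delete
)

// classify applies the heuristic: the block's creation time comes from its
// ULID, and only age plus the presence of meta.json decide its fate.
func classify(id string, hasMeta bool, now time.Time) (blockState, error) {
	u, err := ulid.Parse(id)
	if err != nil {
		return blockTooFresh, err
	}
	age := now.Sub(ulid.Time(u.Time()))
	switch {
	case age < consistencyDelay:
		return blockTooFresh, nil
	case hasMeta:
		return blockReady, nil
	default:
		return blockPartial, nil
	}
}

func main() {
	// A 2019-era block ID without meta.json classifies as a partial upload.
	state, err := classify("01DEABNMMQFGXKXZMTKJF9T42Z", false, time.Now())
	fmt.Println(state == blockPartial, err) // true <nil>
}
```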
Now, this works well because we always upload meta.json at the end. However, we don't do deletions in the proper order: in the linked code we delete in lexicographical order, which means chunks go first, then the index, then meta.json. If the compactor or sidecar is restarted in the middle of this, we end up with a block that chokes the compactor.
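A sketch of the fix being described, under the same kind of assumptions as above (illustrative interfaces and helpers, not the actual Thanos patch): delete meta.json first, so that an interrupted deletion leaves a block without meta.json, which the partial-upload cleanup can later remove, rather than a block with meta.json but missing chunks.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Bucket is an assumed, minimal object-store interface for this sketch; Iter
// is assumed to call f for every object name under the given prefix.
type Bucket interface {
	Iter(ctx context.Context, prefix string, f func(name string) error) error
	Delete(ctx context.Context, name string) error
}

// deleteBlock removes meta.json FIRST and only then the remaining objects.
// If the process dies half-way through, what is left is a block without
// meta.json, which the partial-upload cleanup removes later, instead of a
// block with meta.json but missing chunks that chokes the compactor.
func deleteBlock(ctx context.Context, bkt Bucket, blockID string) error {
	if err := bkt.Delete(ctx, blockID+"/meta.json"); err != nil {
		return err
	}
	var rest []string
	if err := bkt.Iter(ctx, blockID+"/", func(name string) error {
		rest = append(rest, name)
		return nil
	}); err != nil {
		return err
	}
	for _, name := range rest {
		if err := bkt.Delete(ctx, name); err != nil {
			return fmt.Errorf("delete %s: %w", name, err)
		}
	}
	return nil
}

// mapBucket is a toy in-memory bucket, only here to make the sketch runnable.
type mapBucket map[string]bool

func (b mapBucket) Iter(_ context.Context, prefix string, f func(string) error) error {
	for name := range b {
		if strings.HasPrefix(name, prefix) {
			if err := f(name); err != nil {
				return err
			}
		}
	}
	return nil
}

func (b mapBucket) Delete(_ context.Context, name string) error {
	delete(b, name)
	fmt.Println("deleted", name)
	return nil
}

func main() {
	bkt := mapBucket{
		"01DEABNMMQFGXKXZMTKJF9T42Z/meta.json":     true,
		"01DEABNMMQFGXKXZMTKJF9T42Z/index":         true,
		"01DEABNMMQFGXKXZMTKJF9T42Z/chunks/000001": true,
	}
	if err := deleteBlock(context.Background(), bkt, "01DEABNMMQFGXKXZMTKJF9T42Z"); err != nil {
		fmt.Println("error:", err)
	}
}
```

With deletion ordered this way, a crash at any point leaves the bucket in a state the meta.json-based heuristic already knows how to handle.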
Fixing this now.
This also means that NONE of the blocks that were blocking the compactor were important, i.e. removing them should drop no metrics overall. Let me know if that makes sense (:
And thanks for all the reports that helped us identify the problem, especially https://github.com/thanos-io/thanos/issues/1331#issuecomment-526001460
Sidecar logs are gone 😦 I'll capture them next time if the issue occurs again.