thanos: store: Fails to sync with object store and stops responding to queries
Thanos, Prometheus and Golang version used:
Thanos: thanosio/thanos:v0.10.0 & thanosio/thanos:master-2020-01-31-9b17ba18; Prometheus: 2.15.2; Golang: 1.13.1
Object Storage Provider: GCS
What happened:
Two things:
1 - Thanos Store fails to sync with object store (iter bucket operations fail continuously).
2 - Thanos Store stops responding to queries.
What you expected to happen:
Thanos Store reflects the current state of the blocks in object storage, and returns the metrics it is asked for.
How to reproduce it (as minimally and precisely as possible):
I have not found a reliable way to reproduce this so far. It seems to occur several hours after Store startup, and only on Stores fronting larger buckets (> 40k blocks).
Full logs to relevant components (Store):
level=warn ts=2020-02-05T01:14:46.520268878Z caller=store.go:272 msg="syncing blocks failed" err="MetaFetcher: iter bucket: Get https://storage.googleapis.com/storage/v1/b/thanos-01/o?alt=json&delimiter=%2F&pageToken=&prefix=&prettyPrint=false&projection=full&versions=false: write tcp 10.16.17.4:60590->209.85.145.128:443: write: broken pipe"
level=warn ts=2020-02-05T01:17:46.520149565Z caller=store.go:272 msg="syncing blocks failed" err="MetaFetcher: iter bucket: Get https://storage.googleapis.com/storage/v1/b/thanos-01/o?alt=json&delimiter=%2F&pageToken=&prefix=&prettyPrint=false&projection=full&versions=false: write tcp 10.16.17.4:60590->209.85.145.128:443: write: broken pipe"
level=warn ts=2020-02-05T01:20:46.52003521Z caller=store.go:272 msg="syncing blocks failed" err="MetaFetcher: iter bucket: Get https://storage.googleapis.com/storage/v1/b/thanos-01/o?alt=json&delimiter=%2F&pageToken=&prefix=&prettyPrint=false&projection=full&versions=false: write tcp 10.16.17.4:60590->209.85.145.128:443: write: broken pipe"
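The failing request is the top-level object listing that the MetaFetcher uses to iterate the bucket. To check whether that listing (and the reused connection behind it) misbehaves outside of Thanos as well, here is a minimal Go sketch that issues an equivalent request with the plain GCS client; the bucket name comes from the logs above, everything else is illustrative:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	// Uses Application Default Credentials, as Thanos Store does when no
	// service-account file is configured.
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("creating GCS client: %v", err)
	}
	defer client.Close()

	// Top-level listing with delimiter "/" mirrors the failing
	// ...o?delimiter=%2F&prefix= request from the logs above.
	it := client.Bucket("thanos-01").Objects(ctx, &storage.Query{Delimiter: "/"})
	n := 0
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalf("iterating bucket: %v", err)
		}
		n++
		// Synthetic "directory" entries (block ULIDs) come back in
		// attrs.Prefix, plain objects in attrs.Name.
		if attrs.Prefix != "" {
			fmt.Println(attrs.Prefix)
		} else {
			fmt.Println(attrs.Name)
		}
	}
	fmt.Printf("listed %d top-level entries\n", n)
}
```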
Anything else we need to know:
Thanos Store parameters:
--log.level=debug
--data-dir=/var/thanos/store
--index-cache-size=2GB
--chunk-pool-size=4GB
--experimental.enable-index-header
--objstore.config=type: GCS
config:
bucket: "thanos-01"
Once the Store enters this state, it doesn’t recover for at least 10 hours (I haven’t waited longer); a restart is the only workaround found so far.
The thanos_objstore_bucket_operation_failures_total{bucket="thanos-01",operation="iter"} counter steadily increases at the --sync-block-duration frequency (we’re running with the default 3m interval).
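That counter is incremented by the instrumented bucket wrapper Thanos puts around the raw GCS client. To exercise the same code path outside a full Store, here is a rough sketch using the objstore client factory with the config above; it assumes the pkg/objstore/client signatures as of roughly v0.10, which may differ on other versions:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/go-kit/kit/log"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/thanos-io/thanos/pkg/objstore/client"
)

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)

	// Same YAML as passed to --objstore.config above.
	conf := []byte(`type: GCS
config:
  bucket: "thanos-01"
`)

	// NewBucket wraps the GCS bucket with the instrumentation that feeds
	// thanos_objstore_bucket_operation_failures_total.
	bkt, err := client.NewBucket(logger, conf, prometheus.NewRegistry(), "repro")
	if err != nil {
		logger.Log("msg", "creating bucket client", "err", err)
		os.Exit(1)
	}
	defer bkt.Close()

	// A top-level Iter is what the MetaFetcher sync does every --sync-block-duration.
	n := 0
	err = bkt.Iter(context.Background(), "", func(name string) error {
		n++
		return nil
	})
	if err != nil {
		logger.Log("msg", "iter bucket failed", "err", err)
		os.Exit(1)
	}
	fmt.Printf("iterated %d top-level entries\n", n)
}
```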
Metrics
Store metrics captured when this occurred: https://dpaste.org/A4zA/raw
pprof
I collected all available profiling information from the affected Store instance; this may be a bit excessive, but it seemed prudent given that I’ve not found a way to reproduce this at will: store-sync-failure-debug.zip.
Let me know if there’s anything else I can add.
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (5 by maintainers)
Just experienced this same issue. Trying to sort out a solution.
We just experienced this exact issue with the GCS objstore client in Cortex (using the Thanos objstore client): https://github.com/cortexproject/cortex/issues/2703
Don’t scale down just yet. The baseline memory might be reduced significantly, but requests will take the same amount of memory as before, plus the bits that were previously cached, so expect spikes. We are working to improve the read path as well and make it more deterministic (:
Make sure you do some load tests before scaling down.