thanos: store: Fails to sync with object store and stops responding to queries

Thanos, Prometheus and Golang version used:

Thanos: thanosio/thanos:v0.10.0 and thanosio/thanos:master-2020-01-31-9b17ba18
Prometheus: 2.15.2
Golang: 1.13.1

Object Storage Provider: GCS

What happened:

Two things:

1 - Thanos Store fails to sync with the object store (iter bucket operations fail continuously).
2 - Thanos Store stops responding to queries.

What you expected to happen:

Thanos Store reflects the current state of the blocks in object storage. Thanos Store returns the metrics it is queried for.

How to reproduce it (as minimally and precisely as possible):

I haven't found a reliable way to reproduce this so far. It seems to occur several hours after Store startup, and only on Stores fronting larger buckets (> 40k blocks).

Full logs to relevant components (Store):


level=warn ts=2020-02-05T01:14:46.520268878Z caller=store.go:272 msg="syncing blocks failed" err="MetaFetcher: iter bucket: Get https://storage.googleapis.com/storage/v1/b/thanos-01/o?alt=json&delimiter=%2F&pageToken=&prefix=&prettyPrint=false&projection=full&versions=false: write tcp 10.16.17.4:60590->209.85.145.128:443: write: broken pipe"
level=warn ts=2020-02-05T01:17:46.520149565Z caller=store.go:272 msg="syncing blocks failed" err="MetaFetcher: iter bucket: Get https://storage.googleapis.com/storage/v1/b/thanos-01/o?alt=json&delimiter=%2F&pageToken=&prefix=&prettyPrint=false&projection=full&versions=false: write tcp 10.16.17.4:60590->209.85.145.128:443: write: broken pipe"
level=warn ts=2020-02-05T01:20:46.52003521Z caller=store.go:272 msg="syncing blocks failed" err="MetaFetcher: iter bucket: Get https://storage.googleapis.com/storage/v1/b/thanos-01/o?alt=json&delimiter=%2F&pageToken=&prefix=&prettyPrint=false&projection=full&versions=false: write tcp 10.16.17.4:60590->209.85.145.128:443: write: broken pipe"
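
For context, the failing "iter" operation is an objects.list request against the bucket with a "/" delimiter (visible as delimiter=%2F in the URL above), which the MetaFetcher uses to enumerate block directories. Below is a minimal sketch of that kind of listing using the official cloud.google.com/go/storage client directly; it is not the Thanos MetaFetcher code itself, and the bucket name is just taken from the config further down.

package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	// Create a GCS client; credentials come from the environment
	// (e.g. GOOGLE_APPLICATION_CREDENTIALS), as with Thanos itself.
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("create GCS client: %v", err)
	}
	defer client.Close()

	// Listing with a "/" delimiter returns the top-level block "directories"
	// as prefixes, mirroring the delimiter=%2F parameter in the failing request.
	it := client.Bucket("thanos-01").Objects(ctx, &storage.Query{Delimiter: "/"})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			// A stale connection surfaces here as e.g. "write: broken pipe"
			// and is counted by the Store as an iter operation failure.
			log.Fatalf("iter bucket: %v", err)
		}
		if attrs.Prefix != "" {
			fmt.Println(attrs.Prefix) // e.g. a block ULID directory
		}
	}
}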

Anything else we need to know:

Thanos Store parameters:
      --log.level=debug
      --data-dir=/var/thanos/store
      --index-cache-size=2GB
      --chunk-pool-size=4GB
      --experimental.enable-index-header
      --objstore.config=type: GCS
      config:
        bucket: "thanos-01"

Once the Store enters this state, it doesn’t recover for at least 10 hours (I haven’t waited longer). A restart is the only workaround.

The thanos_objstore_bucket_operation_failures_total{bucket="thanos-01",operation="iter"} counter increases steadily at the sync-block-duration frequency (we’re running with the default 3m interval).
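
For readers unfamiliar with this metric: it is incremented by the instrumented bucket wrapper whenever a bucket operation fails, and each block sync performs one iter per sync-block-duration interval, which is why the counter ticks up every 3 minutes while the connection stays broken. The following is a minimal sketch of that counting pattern, not the actual Thanos instrumentation code; iterFunc and instrumentedIter are hypothetical names used only for illustration.

package main

import (
	"context"
	"errors"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Counter with the same name and labels as the metric quoted above.
var bucketOpFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "thanos_objstore_bucket_operation_failures_total",
		Help: "Total number of operations against the bucket that failed.",
	},
	[]string{"bucket", "operation"},
)

// iterFunc stands in for the underlying bucket Iter call (hypothetical).
type iterFunc func(ctx context.Context, dir string, f func(name string) error) error

// instrumentedIter wraps an iter call so that every failure bumps the counter,
// the pattern that produces the steady 3m increase described above.
func instrumentedIter(bucket string, iter iterFunc) iterFunc {
	return func(ctx context.Context, dir string, f func(name string) error) error {
		if err := iter(ctx, dir, f); err != nil {
			bucketOpFailures.WithLabelValues(bucket, "iter").Inc()
			return err
		}
		return nil
	}
}

func main() {
	prometheus.MustRegister(bucketOpFailures)

	// Simulate one sync attempt against a bucket whose connection is broken.
	broken := func(ctx context.Context, dir string, f func(name string) error) error {
		return errors.New("write: broken pipe")
	}
	if err := instrumentedIter("thanos-01", broken)(context.Background(), "", nil); err != nil {
		log.Printf("sync failed: %v", err)
	}
}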

Metrics

Store metrics when this occurred: https://dpaste.org/A4zA/raw.

pprof

I collected all available profiling information from the affected Store instance. This may be a bit excessive, but it seemed prudent given that I’ve not found a way to reproduce this at will: store-sync-failure-debug.zip.

Let me know if there’s anything else I can add.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

Just experienced this same issue. Trying to sort out a solution.

We just experienced this exact issue with the GCS objstore client in Cortex (using the Thanos objstore client): https://github.com/cortexproject/cortex/issues/2703

Don’t scale down just yet. The baseline memory usage might be reduced significantly, but requests will take the same amount of memory as before, plus the bits that were previously cached, so expect spikes. We are working on improving the read path as well, to make it more deterministic (:

Make sure you do some load tests before scaling down.

On Fri, 21 Feb 2020 at 09:44, Žygis Škulteckis notifications@github.com wrote:

Closed #2098 https://github.com/thanos-io/thanos/issues/2098.
