cortex: store-gateway pods lose connection to memcached and end up crashing

Describe the bug: store-gateway pods fail to connect to memcached with an i/o timeout error, as seen in https://github.com/thanos-io/thanos/issues/1979:

{"caller":"memcached_client.go:382","err":"read tcp 192.168.68.132:59858->10.232.216.84:11211: i/o timeout","firstKey":"P:01F4AB8BMEH03WEXNMY516NMC8:mauaq0QeA5VHuyjZHSsF5P_DU_MPW7BLqAbRy42_Z2I","level":"warn","msg":"failed to fetch items from memcached","numKeys":2,"ts":"2021-04-27T23:27:10.227890794Z"}

To Reproduce Steps to reproduce the behavior:

  1. Start Cortex (1.8.0)
  2. Perform operations (read/write/others)
  3. Ingest 1.1 million metrics per second
  4. Run store-gateway as a StatefulSet backed by S3 (see the config sketch after this list)
  5. Wait for the pods to fail by losing their connection to memcached
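
For context, a minimal sketch of the blocks-storage and store-gateway setup referenced in step 4, assuming the Cortex 1.8 blocks_storage and store_gateway YAML blocks; the bucket name, S3 endpoint, and consul host are placeholders:

  # Hedged sketch only; bucket, endpoint, and consul host are placeholders.
  blocks_storage:
    backend: s3
    s3:
      endpoint: s3.dualstack.eu-west-1.amazonaws.com
      bucket_name: example-cortex-blocks

  store_gateway:
    sharding_enabled: true
    sharding_ring:
      kvstore:
        store: consul
        consul:
          host: consul-server.cortex.svc.cluster.local:8500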

Expected behavior: I expect store-gateway pods to run for a long time without crashing.

Environment:

  • Infrastructure: Kubernetes / EKS / S3
  • Deployment tool: helm

Storage Engine

  • Blocks
  • Chunks

Additional Context

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 19 (5 by maintainers)

Most upvoted comments

I am facing the same issue. When I run queries with a long duration, i.e. queries spanning >= 2 days, the store-gateways become unhealthy and the queries fail with the following logs:

[cortex-infra03-store-gateway-0] level=error ts=2022-02-05T13:20:51.953964508Z caller=client.go:241 msg="error getting path" key=collectors/store-gateway err="Get \"http://consul-infra03-consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=25381&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[cortex-infra03-store-gateway-4] level=error ts=2022-02-05T13:20:52.039745418Z caller=client.go:241 msg="error getting path" key=collectors/store-gateway err="Get \"http://consul-infra03-consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=25381&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[cortex-infra03-store-gateway-1] level=error ts=2022-02-05T13:20:58.368233783Z caller=client.go:241 msg="error getting path" key=collectors/store-gateway err="Get \"http://consul-infra03-consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=25381&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[cortex-infra03-store-gateway-2] level=warn ts=2022-02-05T13:21:00.509296919Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=138.245µs err="failed to wait for turn: context canceled" msg="gRPC\n"
[cortex-infra03-store-gateway-2] level=warn ts=2022-02-05T13:21:00.628533562Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=17.46441443s err="rpc error: code = Aborted desc = fetch series for block 01FTK5M869239D58YR1SH23TGG: expanded matching posting: get postings: read postings range: get range reader: Get \"https://eu01-infra03-cortex.s3.dualstack.eu-west-1.amazonaws.com/logistics/01FTK5M869239D58YR1SH23TGG/index\": context canceled" msg="gRPC\n"
[cortex-infra03-store-gateway-3] level=warn ts=2022-02-05T13:21:02.617874325Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=16.472239539s err="rpc error: code = Aborted desc = fetch series for block 01FTK5KFYPE25RARWDCK8D207D: expanded matching posting: get postings: read postings range: get range reader: Get \"https://eu01-infra03-cortex.s3.dualstack.eu-west-1.amazonaws.com/logistics/01FTK5KFYPE25RARWDCK8D207D/index\": context canceled" msg="gRPC\n"
[cortex-infra03-store-gateway-4] level=warn ts=2022-02-05T13:21:03.055869414Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=18.095855025s err="rpc error: code = Aborted desc = fetch series for block 01FTK5MYDJ63R1CJ5P4CSNX1NQ: expanded matching posting: get postings: read postings range: get range reader: Get \"https://eu01-infra03-cortex.s3.dualstack.eu-west-1.amazonaws.com/logistics/01FTK5MYDJ63R1CJ5P4CSNX1NQ/index\": context canceled" msg="gRPC\n"

@cabrinha I checked the compactor, but there is no issue with compaction. When I run a large query, the store-gateway always crashes, showing memcached timeouts and context deadline exceeded errors. I checked the memcached metrics, but there were no issues there either. Did you configure the query-frontend scheduler? We don’t have this component, so I will try to configure it.
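
If it helps anyone else trying the same thing, here is a hedged sketch of how the optional query-scheduler can be wired in; the service address is an assumption for illustration, and the scheduler itself runs as a separate deployment with -target=query-scheduler:

  # Hedged sketch; the scheduler address is a placeholder.
  # Query-frontend: enqueue queries on the scheduler instead of its internal queue.
  frontend:
    scheduler_address: query-scheduler.cortex.svc.cluster.local:9095

  # Queriers: pull work from the scheduler rather than connecting to the frontend.
  frontend_worker:
    scheduler_address: query-scheduler.cortex.svc.cluster.local:9095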

I’ll post some error messages tomorrow to help debug this issue.

I’m going to re-open this issue because it seems that the default memcached settings for index_cache aren’t as good as they could be.

I found this Thanos issue from last year: https://github.com/thanos-io/thanos/issues/1979

It seems like there are some suggestions about what to set the memcached_config settings to, but I’m not sure anyone reported back with a working example.

I’d like to review the defaults for each cache, make sure they are set up to handle what Cortex requires, and then provide an example of handling more load.

The current index-cache and metadata-cache pages lack working examples of a performant configuration.
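
In the meantime, here is a hedged sketch of what a tuned configuration could look like, using the Cortex 1.8 blocks_storage.bucket_store options; the memcached addresses are placeholders and the values are illustrative starting points rather than validated recommendations:

  # Hedged sketch; addresses are placeholders and values are starting points to tune.
  blocks_storage:
    bucket_store:
      index_cache:
        backend: memcached
        memcached:
          addresses: dnssrvnoa+memcached-index-queries.cortex.svc.cluster.local:11211
          timeout: 450ms                 # raise if the client logs i/o timeouts
          max_idle_connections: 150
          max_async_concurrency: 50
          max_get_multi_concurrency: 100
          max_get_multi_batch_size: 100  # cap batch size so large multi-gets don't time out
      metadata_cache:
        backend: memcached
        memcached:
          addresses: dnssrvnoa+memcached-metadata.cortex.svc.cluster.local:11211
          timeout: 450ms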

A little over 38,000 – I run compactors with 5 replicas. Replica 0 reports 38,000 blocks and replica 4 reports 380.

With the default config, we compact blocks up to 24h, so after compaction you should have 1 block per day per tenant. The number of compacted blocks depends on how many tenants you have in your cluster, but 38k looks suspicious to me unless you have a very large number of tenants. I would suggest investigating which tenant has the highest number of blocks and checking whether the compactor can catch up with compaction.
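
One hedged way to do that investigation, assuming the compactor exports a per-tenant cortex_bucket_blocks_count metric (the metric and its "user" label may differ between versions), is a simple Prometheus recording rule such as:

  # Hedged sketch; verify the metric and label names against your Cortex version.
  groups:
    - name: cortex-blocks-per-tenant
      rules:
        - record: tenant:cortex_bucket_blocks_count:top10
          expr: topk(10, max by (user) (cortex_bucket_blocks_count))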

Looks like compactions were failing for our largest tenant, causing a slew of problems.

We’re now alerting on the "block with not healthy index found" error, as this halts compaction for the tenant until the block is deleted.
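
For anyone who prefers a metrics-based signal alongside the log-based one, here is a hedged sketch of a related alert on compactor failures; the cortex_compactor_runs_failed_total metric and the thresholds are assumptions to validate for your setup:

  # Hedged sketch; metric name and thresholds should be validated for your version.
  groups:
    - name: cortex-compactor
      rules:
        - alert: CortexCompactorRunsFailing
          expr: increase(cortex_compactor_runs_failed_total[2h]) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Cortex compactor runs are failing; check for blocks with an unhealthy index.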