cortex: store-gateway pods lose connection to memcached and end up crashing
Describe the bug
Store-gateway pods fail to connect to memcached with an i/o timeout error, as seen in: https://github.com/thanos-io/thanos/issues/1979
{"caller":"memcached_client.go:382","err":"read tcp 192.168.68.132:59858->10.232.216.84:11211: i/o timeout","firstKey":"P:01F4AB8BMEH03WEXNMY516NMC8:mauaq0QeA5VHuyjZHSsF5P_DU_MPW7BLqAbRy42_Z2I","level":"warn","msg":"failed to fetch items from memcached","numKeys":2,"ts":"2021-04-27T23:27:10.227890794Z"}
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (1.8.0)
- Perform operations (read/write/others)
- Ingest 1.1 million metrics per second
- Run the store-gateways as a StatefulSet backed by S3 (a minimal config sketch follows this list)
- Wait for the pods to fail by losing their connection to memcached
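For reference, a minimal sketch of the blocks storage setup that the StatefulSet step assumes; bucket name, endpoint, region, and sync directory are placeholders:

```yaml
# Hypothetical Cortex config fragment for the store-gateway StatefulSet;
# every value is a placeholder.
blocks_storage:
  backend: s3
  s3:
    bucket_name: cortex-blocks
    endpoint: s3.us-east-1.amazonaws.com
    region: us-east-1
  bucket_store:
    # Local directory where the store-gateway syncs block metadata.
    sync_dir: /data/tsdb-sync
```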
Expected behavior
I expect store-gateway pods to run for a long time without crashing.
Environment:
- Infrastructure: Kubernetes / EKS / S3
- Deployment tool: Helm
Storage Engine
- Blocks
- Chunks
Additional Context
I am facing this same issue. When I run queries over a long duration, i.e. queries spanning >= 2 days, the store-gateways become unhealthy and the queries fail with the following logs.
@cabrinha I checked the compactor, but there is no issue with compaction. When I run a large query, the store-gateway always crashes, showing memcached timeouts and "context deadline exceeded" errors. I checked the memcached metrics, but there were no issues. Did you configure the query-frontend scheduler? We don’t have that component, so I will try to configure it.
I’ll post some error messages tomorrow to help debug this issue.
I’m going to re-open this issue because it seems that the default memcached settings for index_cache aren’t as good as they could be.
I found this issue from last year, Thanos: https://github.com/thanos-io/thanos/issues/1979
It seems like there are some suggestions around what to set memcached_config settings at, but I’m not sure anyone reported back a working example.
I’d like to review the defaults for each cache, make sure the defaults are set up to handle what Cortex requires, and then provide an example of handling more load.
The current index-cache and metadata-cache pages lack working examples of a performant configuration.
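As a concrete starting point for that review, here is a hedged sketch of the index-cache memcached client knobs that usually matter under heavy query load; every value below is illustrative, not a validated default or recommendation:

```yaml
# Hypothetical tuning sketch for the store-gateway index cache.
# Values are placeholders meant to show which knobs exist.
blocks_storage:
  bucket_store:
    index_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.memcached.cortex.svc.cluster.local
        timeout: 450ms                 # per-operation read/write timeout
        max_idle_connections: 150      # idle connections kept per memcached address
        max_async_concurrency: 50      # background write workers
        max_async_buffer_size: 25000   # queued async writes before they are dropped
        max_get_multi_concurrency: 100 # concurrent batched get operations
        max_get_multi_batch_size: 1000 # keys per underlying GetMulti request
```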
Looks like compactions were failing for our largest tenant, causing a slew of problems.
We’re now alerting on `block with not healthy index found`, as this halts compactions for the tenant until the block is deleted.
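For anyone who wants a similar alert, a sketch of a Prometheus alerting rule; this alerts on failed compactor runs via cortex_compactor_runs_failed_total rather than on the exact "block with not healthy index found" log line (which would need a log-based pipeline), and the window and threshold are assumptions:

```yaml
# Hypothetical alerting rule: catches failing compactor runs in general, not
# the specific unhealthy-index log line. Window and severity are placeholders.
groups:
  - name: cortex-compactor
    rules:
      - alert: CortexCompactorRunsFailing
        expr: increase(cortex_compactor_runs_failed_total[2h]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cortex compactor runs are failing"
          description: "Compactor runs have failed in the last 2h; check for blocks with unhealthy indexes."
```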