thanos: store: Store gateway consuming lots of memory / OOMing
Thanos, Prometheus and Golang version used: thanos v0.1.0rc2
What happened: thanos-store is consuming 50GB of memory during startup
What you expected to happen: thanos-store should not consume so much memory during startup
Full logs to relevant components: store:
level=debug ts=2018-07-27T15:51:21.415788856Z caller=cluster.go:132 component=cluster msg="resolved peers to following addresses" peers=100.96.232.51:10900,100.99.70.149:10900,100.110.182.241:10900,100.126.12.148:10900
level=debug ts=2018-07-27T15:51:21.416254389Z caller=store.go:112 msg="initializing bucket store"
level=warn ts=2018-07-27T15:52:05.28837034Z caller=bucket.go:240 msg="loading block failed" id=01CKE41VDSJMSAJMN6N6K8SABE err="new bucket block: load index cache: download index file: copy object to file: write /var/thanos/store/01CKE41VDSJMSAJMN6N6K8SABE/index: cannot allocate memory"
level=warn ts=2018-07-27T15:52:05.293692332Z caller=bucket.go:240 msg="loading block failed" id=01CKE41VE4XXTN9N55YPCJSPP2 err="new bucket block: load index cache: download index file: copy object to file: write /var/thanos/store/01CKE41VE4XXTN9N55YPCJSPP2/index: cannot allocate memory"
Anything else we need to know: some time after initialization the RAM usage goes down to normal levels, around 8GB.
Another thing that’s happening is that my thanos-compactor consumes way too much RAM as well; the last time it ran, it used up to 60GB of memory.
I run store with these args:
containers:
- args:
- store
- --log.level=debug
- --tsdb.path=/var/thanos/store
- --s3.endpoint=s3.amazonaws.com
- --s3.access-key=xxx
- --s3.bucket=xxx
- --cluster.peers=thanos-peers.monitoring.svc.cluster.local:10900
- --index-cache-size=2GB
- --chunk-pool-size=8GB
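For context on how these two flags relate to the container's memory budget, here is a back-of-the-envelope check. Only the index cache and the chunk pool are bounded by flags; the initial sync and general heap overhead come on top. The 16GiB container limit and the headroom heuristic are assumptions for illustration, not official Thanos guidance.

```go
package main

import "fmt"

// Rough sanity check of how much of the container's memory the two flags
// above actually bound. The container limit and the "half of the bounded
// size" headroom rule are illustrative assumptions only.
func main() {
	const gib = int64(1) << 30

	indexCacheSize := 2 * gib  // --index-cache-size=2GB
	chunkPoolSize := 8 * gib   // --chunk-pool-size=8GB
	containerLimit := 16 * gib // hypothetical Kubernetes memory limit

	// Only these two pools are bounded by flags; index downloads during the
	// initial sync and heap overhead are not covered by them.
	bounded := indexCacheSize + chunkPoolSize
	headroom := containerLimit - bounded

	fmt.Printf("bounded by flags: %d GiB\n", bounded/gib)
	fmt.Printf("headroom left for sync and heap overhead: %d GiB\n", headroom/gib)
	if headroom < bounded/2 {
		fmt.Println("warning: little headroom; expect OOM risk during initial sync")
	}
}
```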
Environment:
- OS (e.g. from /etc/os-release): kubernetes running on debian
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 9
- Comments: 41 (24 by maintainers)
Commits related to this issue
- docs: add references to `Remote Write Storage Wars` Also mention than VictoriaMetrics uses less RAM than Thanos Store Gateway - see https://github.com/thanos-io/thanos/issues/448 for details. — committed to VictoriaMetrics/VictoriaMetrics by valyala 4 years ago
I think it would already be a lot better just to provide guidance on sizing the chunk pool and index cache. If the provided Grafana dashboards also included enough information to figure out what was going on and how close one was to the limits, that would also be helpful.
☝️ Deleted the comment as it does not help to resolve this particular issue for the community (:
Let’s get back to this.
We need better OOM flow for our store gateway. Some improvements that need to be done:
- Memory usage can exceed chunk pool size + index cache size, which is unexpected. This means a “leak” somewhere else or byte ranges falling outside the chunk pool's hardcoded size ranges. We need to take a look at this as well.
Lots of work, so help is wanted (: In a separate thread we are working on Querier cache, but that’s just hiding the actual problem (:
cc @mjd95 @devnev
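For readers unfamiliar with the “hardcoded ranges” remark above: the chunk pool hands out byte slices from fixed-size buckets, and requests that don't fit any bucket bypass pooling entirely, which is one way memory can grow past the configured pool size. The sketch below illustrates that pattern with a simple bucketed sync.Pool; it is a simplified illustration, not the actual Thanos pkg/pool code.

```go
package pool

import "sync"

// BucketedBytes hands out byte slices from a fixed set of size buckets.
// Simplified sketch of the idea behind the store gateway's chunk pool.
type BucketedBytes struct {
	sizes []int        // ascending bucket capacities, e.g. 16KiB ... 16MiB
	pools []*sync.Pool // one pool per bucket
}

func NewBucketedBytes(sizes []int) *BucketedBytes {
	p := &BucketedBytes{sizes: sizes, pools: make([]*sync.Pool, len(sizes))}
	for i, sz := range sizes {
		sz := sz
		p.pools[i] = &sync.Pool{New: func() interface{} { return make([]byte, 0, sz) }}
	}
	return p
}

// Get returns a slice with capacity for at least n bytes. Requests larger
// than the biggest bucket fall through to a plain allocation and are never
// reused -- the "byte ranges getting out of the hardcoded ranges" case.
func (p *BucketedBytes) Get(n int) []byte {
	for i, sz := range p.sizes {
		if n <= sz {
			return p.pools[i].Get().([]byte)[:0]
		}
	}
	return make([]byte, 0, n) // unpooled
}

// Put hands a slice back to the largest bucket it can still serve; slices
// smaller than the smallest bucket (or oversized ones) are simply dropped.
func (p *BucketedBytes) Put(b []byte) {
	for i := len(p.sizes) - 1; i >= 0; i-- {
		if cap(b) >= p.sizes[i] {
			p.pools[i].Put(b[:0])
			return
		}
	}
}
```

A caller would typically Get a slice sized for the byte range it is about to read and Put it back once the chunks have been copied out.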
In case people still need this, you can now test with the container v0.11.0-rc.1. It’s working correctly for us on AWS.
Got this master-2020-01-25-cf4e4500 running for some time. 50% memory improvement. Great work and thanks to all people involved.
We need to move Thanos to Go 1.12.5: https://github.com/prometheus/prometheus/issues/5524
TL;DR - We are currently seeing thanos-store consuming incredibly large amounts of memory during initial sync and then being OOM killed. It is not releasing any memory as it performs the initial sync, and there is very likely a memory leak. The leak is likely occurring in https://github.com/improbable-eng/thanos/blob/v0.1.0/pkg/block/index.go#L105-L154
Thanos, Prometheus and Golang version used: thanos-store 0.1.0, Golang 1.11 (built with quay.io/prometheus/golang-builder:1.11-base)
What happened: thanos-store is consuming 32GB of memory during initial sync, then being OOM (out of memory) killed
What you expected to happen: thanos-store not to use this much memory on initial sync and to progress past it
Full logs to relevant components: no logs are emitted whilst the initial sync is occurring; see graphs below
Anything else we need to know: here is a graph of the total memory usage (cache + rss), rss memory usage and cache memory usage:
We have Kubernetes memory limits on the thanos-store container set to 32GB, which is why it is eventually killed when it reaches this point.
Our Thanos S3 bucket is currently roughly 488.5GB in size, with 15078 objects.
We’ve noticed that thanos-store doesn’t progress past the InitialSync function - https://github.com/improbable-eng/thanos/blob/v0.1.0/cmd/thanos/store.go#L113 - and exceeds the memory limits of the container before finishing.
We’ve modified the goroutine count for how many blocks are processed concurrently. It is currently hardcoded to 20, but by changing it to a much lower number, e.g. 1, we can have thanos-store last longer before being OOM killed, although it does take longer to do the InitialSync - https://github.com/improbable-eng/thanos/blob/v0.1.0/pkg/store/bucket.go#L231
The goroutine count for SyncBlocks should really be a configurable option as well; hardcoding it to 20 is not ideal.
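Making that concurrency configurable only needs a bounded worker pattern. The sketch below is illustrative, not the actual SyncBlocks code; syncBlocks and loadBlock are hypothetical names, and the block IDs are taken from the logs earlier in this issue.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// syncBlocks loads blocks with a configurable concurrency limit instead of a
// hardcoded 20 goroutines. loadBlock stands in for downloading and loading a
// block's index cache.
func syncBlocks(ids []string, concurrency int, loadBlock func(id string) error) {
	sem := make(chan struct{}, concurrency) // semaphore bounding in-flight loads
	var wg sync.WaitGroup
	for _, id := range ids {
		id := id
		wg.Add(1)
		sem <- struct{}{} // wait for a free slot
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			if err := loadBlock(id); err != nil {
				fmt.Printf("loading block failed id=%s err=%v\n", id, err)
			}
		}()
	}
	wg.Wait()
}

func main() {
	ids := []string{"01CKE41VDSJMSAJMN6N6K8SABE", "01CKE41VE4XXTN9N55YPCJSPP2"}
	// concurrency=1 mirrors the workaround described above; a CLI flag could expose it.
	syncBlocks(ids, 1, func(id string) error {
		time.Sleep(10 * time.Millisecond) // placeholder for the real work
		return nil
	})
}
```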
Through some debugging, we’ve identified the loading of the index cache as the location of the memory leak - https://github.com/improbable-eng/thanos/blob/v0.1.0/pkg/store/bucket.go#L1070
By commenting out that call in the newBucketBlock function, thanos-store is able to progress past the InitialSync (albeit without any index caches) and consumes very little memory.
We then ran some pprof heap analysis on thanos-store as the memory leak was occurring, and it identified block.ReadIndexCache as consuming a lot of memory; see the pprof heap graph below.
The function in question: https://github.com/improbable-eng/thanos/blob/v0.1.0/pkg/block/index.go#L105-L154. The heap graph suggests that the leak is in the JSON encoding/decoding of the index file, which for some reason is not releasing memory.
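For anyone who wants to reproduce this kind of heap analysis, the standard net/http/pprof setup is enough. The sketch below shows the generic approach in an arbitrary Go binary (the listen address is an assumption), not anything Thanos-specific.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

// Expose the pprof endpoints and snapshot the heap while memory is growing.
func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... run the workload under investigation here ...
	select {}
}
```

A heap snapshot can then be fetched with `go tool pprof http://localhost:6060/debug/pprof/heap` and explored with the `top` and `web` commands to see which call paths retain the most memory.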
Any update on this? We can’t use Thanos at the scale that we want to because of this.
Hi @Bplotka, I do actually; it also used tons of memory (~60GB) in the last run. Is this normal?
Try just a new release without the flag. This error, which is really the client not being able to talk to S3, does not have anything to do with the experimental feature (: It might be a misconfiguration.
I am getting the below error with thanos store when using the latest master branch docker image (quay.io/thanos/thanos:master-2020-01-25-cf4e4500) and enabling the --experimental.enable-index-header flag. I am using Kubernetes for the Thanos deployment.
I was using the same bucket before with the thanos-store docker image improbable/thanos:v0.3.2 and there was no access denied error, but the initial syncing got stuck and eventually the pod got OOM killed. 😦
@caarlos0 Hi, this feature is not included in the v0.10.1 release. You can use the latest master branch docker image to try it.
FYI: This issue was closed as the major rewrite happened on master above 0.10.0. It’s still experimental, but you can enable it via https://github.com/thanos-io/thanos/blob/master/cmd/thanos/store.go#L78 (--experimental.enable-index-header).
We are still working on various benchmarks, especially around query resource usage, but functionally it should work! (:
Please try it out on dev/testing/staging environments and give us feedback! ❤️