thanos: store: sometimes pkg/store.(*indexCache).ensureFits spins forever, blocking other goroutines
Thanos components: improbable/thanos:v0.1.0 Prometheus: prom/prometheus:v2.4.2
What happened
Number of goroutines in Thanos Store instances started growing inexorably. It never came back down and Thanos Store was essentially unresponsive to all queries during this.
What you expected to happen Goroutine number to stay around the same number i.e. have a low standard deviation, and Thanos Store to be responsive.
How to reproduce it (as minimally and precisely as possible): To be done. since this is hard to reproduce and depends on the data.
Full logs to relevant components Thanos Query started timing out:
level=error ts=2018-11-29T13:20:41.368751553Z caller=proxy.go:117 err="fetch series for [{monitor prtoss} {replica a}]: rpc error: code = Canceled desc = context canceled"
level=error ts=2018-11-29T13:20:41.368755808Z caller=proxy.go:117 err="fetch series for [{monitor prtoss} {replica b}]: rpc error: code = Canceled desc = context canceled"
level=error ts=2018-11-29T13:20:41.368808725Z caller=proxy.go:117 err="fetch series for []: rpc error: code = Canceled desc = context canceled"
level=warn ts=2018-11-29T13:20:41.368849077Z caller=proxy.go:126 err="No store matched for this query"
level=error ts=2018-11-29T13:20:41.368848275Z caller=proxy.go:117 err="fetch series for []: rpc error: code = Canceled desc = context canceled"
level=warn ts=2018-11-29T13:20:41.368909531Z caller=proxy.go:126 err="No store matched for this query"
level=error ts=2018-11-29T13:20:41.369833973Z caller=engine.go:499 msg="error expanding series set" err="context canceled"
No more series query processed
messages in Thanos Store whatsoever, only Blocks source resolution
. Only this one got through somehow in a period of about an hour:
Lap 29 14:07:42 foobar docker[21891]: level=debug ts=2018-11-29T13:07:42.784986336Z caller=bucket.go:802 msg="series query processed" stats="&{blocksQueried:0 postingsTouched:0 postingsTouchedSizeSum:0 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:0 seriesTouchedSizeSum:0 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:0 chunksTouchedSizeSum:0 chunksFetched:0 chunksFetchedSizeSum:0 chunksFetchCount:0 chunksFetchDurationSum:0 getAllDuration:877 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:1273}"
When something happened but exactly zero data was pulled down so maybe this is a separate issue?
The goroutine dump of Thanos Store at the moment of this freeze: goroutine_dmp.zip
Notice that exactly one goroutine is stuck in ensureFits
and it locks the lock, thus preventing other goroutines from performing their work.
Workaround: increased the index cache significantly and the problem went away. Perhaps this has to do with something that some index cache files are really large:
$ du -t 250M -hs /var/lib/thanos/*/index.cache.json
279M /var/lib/thanos/01CXD021G0JHNRQWJKVKCHD3QR/index.cache.json
1,6G /var/lib/thanos/01CXF4WQE8H27NZK0W4VEQB3XP/index.cache.json
1,5G /var/lib/thanos/01CXFJT3R8J59MD4YC9AC0TPE5/index.cache.json
...
Anything else we need to know Environment:
- OS: CentOS Linux 7 (Core)
- Kernel: 3.10.0-862.11.6.el7.x86_64
- Docker: 18.03.1-ce, build 9ee9f40
- Memory: 256 GB
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 16 (16 by maintainers)
Commits related to this issue
- Ensured index cache is best effort, refactored tests, validated edge cases. Fixes https://github.com/improbable-eng/thanos/issues/651 Signed-off-by: Bartek Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 5 years ago
- Ensured index cache is best effort, refactored tests, validated edge cases. Fixes https://github.com/improbable-eng/thanos/issues/651 Signed-off-by: Bartek Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 5 years ago
- Ensured index cache is best effort, refactored tests, validated edge cases. Fixes https://github.com/improbable-eng/thanos/issues/651 Signed-off-by: Bartek Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 5 years ago
- Ensured index cache is best effort, refactored tests, validated edge cases. Fixes https://github.com/improbable-eng/thanos/issues/651 Signed-off-by: Bartek Plotka <bwplotka@gmail.com> — committed to thanos-io/thanos by bwplotka 5 years ago
- Ensured index cache is best effort, refactored tests, validated edge cases. Fixes https://github.com/improbable-eng/thanos/issues/651 Current size also includes slice header. Signed-off-by: Bartek ... — committed to thanos-io/thanos by bwplotka 5 years ago
- Ensured index cache is best effort, refactored tests, validated edge cases. Fixes https://github.com/improbable-eng/thanos/issues/651 Current size also includes slice header. Signed-off-by: Bartek ... — committed to thanos-io/thanos by bwplotka 5 years ago
- Ensured index cache is best effort, refactored tests, validated edge cases. Fixes https://github.com/improbable-eng/thanos/issues/651 Current size also includes slice header. Signed-off-by: Bartek ... — committed to thanos-io/thanos by bwplotka 5 years ago
- Ensured index cache is best effort, refactored tests, validated edge cases. Fixes https://github.com/improbable-eng/thanos/issues/651 Current size also includes slice header. Signed-off-by: Bartek ... — committed to thanos-io/thanos by bwplotka 5 years ago
- Ensured index cache is best effort, refactored tests, validated edge cases. (#1073) Fixes https://github.com/improbable-eng/thanos/issues/651 Current size also includes slice header. Signed-off... — committed to thanos-io/thanos by bwplotka 5 years ago
- Rebase upstream 0.4.0 (#9) * query: cleanup store statuses as they come and go (#910) Signed-off-by: Adrien Fillon <adrien.fillon@cdiscount.com> * [docs] Example of using official prometheus he... — committed to smalldirector/thanos by deleted user 5 years ago
Weird, I’m on
master
and I’ve added some more tests: https://github.com/GiedriusS/thanos/commit/42449a820cd77aae54f65faa0e86e6d8744f0ac8, and they pass without an issue. Could you think of another edge case that we are missing? 😞