thanos: store: sometimes pkg/store.(*indexCache).ensureFits spins forever, blocking other goroutines

Thanos components: improbable/thanos:v0.1.0, Prometheus: prom/prometheus:v2.4.2

What happened

The number of goroutines in the Thanos Store instances started growing inexorably (goroutine-count graph omitted). It never came back down, and Thanos Store was essentially unresponsive to all queries during this period.

What you expected to happen

The goroutine count to stay roughly constant, i.e. have a low standard deviation, and Thanos Store to remain responsive.

How to reproduce it (as minimally and precisely as possible)

To be done, since this is hard to reproduce and depends on the data.

Full logs to relevant components

Thanos Query started timing out:

level=error ts=2018-11-29T13:20:41.368751553Z caller=proxy.go:117 err="fetch series for [{monitor prtoss} {replica a}]: rpc error: code = Canceled desc = context canceled"
level=error ts=2018-11-29T13:20:41.368755808Z caller=proxy.go:117 err="fetch series for [{monitor prtoss} {replica b}]: rpc error: code = Canceled desc = context canceled"
level=error ts=2018-11-29T13:20:41.368808725Z caller=proxy.go:117 err="fetch series for []: rpc error: code = Canceled desc = context canceled"
level=warn ts=2018-11-29T13:20:41.368849077Z caller=proxy.go:126 err="No store matched for this query"
level=error ts=2018-11-29T13:20:41.368848275Z caller=proxy.go:117 err="fetch series for []: rpc error: code = Canceled desc = context canceled"
level=warn ts=2018-11-29T13:20:41.368909531Z caller=proxy.go:126 err="No store matched for this query"
level=error ts=2018-11-29T13:20:41.369833973Z caller=engine.go:499 msg="error expanding series set" err="context canceled"

No more "series query processed" messages in Thanos Store whatsoever, only blocks source resolution ones. Only this one got through somehow over a period of about an hour:

Lap 29 14:07:42 foobar docker[21891]: level=debug ts=2018-11-29T13:07:42.784986336Z caller=bucket.go:802 msg="series query processed" stats="&{blocksQueried:0 postingsTouched:0 postingsTouchedSizeSum:0 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:0 seriesTouchedSizeSum:0 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:0 chunksTouchedSizeSum:0 chunksFetched:0 chunksFetchedSizeSum:0 chunksFetchCount:0 chunksFetchDurationSum:0 getAllDuration:877 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:1273}"

Something did happen there, but exactly zero data was pulled down, so maybe this is a separate issue?

The goroutine dump of Thanos Store at the moment of this freeze: goroutine_dmp.zip

Notice that exactly one goroutine is stuck in ensureFits while holding the lock, thus preventing all other goroutines from performing their work.
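
For illustration, here is a minimal sketch of how an eviction loop of this shape can spin forever while holding the cache mutex. This is not the actual Thanos code; the type, field, and method names are made up, but the failure mode matches the symptom above: one goroutine busy inside ensureFits, every other reader and writer parked on the lock.

package main

import (
	"container/list"
	"fmt"
	"sync"
)

// indexCache is a stand-in for the real pkg/store index cache: an LRU
// bounded by a byte budget, protected by a single mutex.
type indexCache struct {
	mtx     sync.Mutex
	lru     *list.List // element values are item sizes; oldest entry at the back
	curSize uint64
	maxSize uint64
}

// ensureFits evicts the oldest entries until an item of the given size fits.
// If curSize can no longer shrink -- the LRU is already empty, or the size
// accounting under-counts what an eviction frees -- the loop condition stays
// true forever. The caller holds mtx for the whole call, so every other
// Get/Set in the store parks on the lock behind this single goroutine, which
// is exactly what the goroutine dump shows.
func (c *indexCache) ensureFits(size uint64) {
	for c.curSize+size > c.maxSize {
		oldest := c.lru.Back()
		if oldest == nil {
			continue // nothing left to evict: spins forever, lock still held
		}
		c.curSize -= oldest.Value.(uint64)
		c.lru.Remove(oldest)
	}
}

func (c *indexCache) set(size uint64) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.ensureFits(size)
	c.lru.PushFront(size)
	c.curSize += size
}

func main() {
	c := &indexCache{lru: list.New(), maxSize: 6}
	c.set(5)
	c.set(3) // 5+3 > 6, so the 5-byte entry is evicted first, then 3 fits
	fmt.Println("cached bytes:", c.curSize) // prints 3
	// c.set(9) would never return: 9 is larger than the whole cache, so even
	// an empty LRU cannot make room and ensureFits loops forever.
}

If the real cache's size accounting ever drifts (for example, per-entry overhead counted on insert but not on evict), the same spin could in principle occur even for items that nominally fit, which would be consistent with the problem going away once the cache budget was raised.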

Workaround: increasing the index cache size significantly made the problem go away. Perhaps this is related to the fact that some of the index cache files are really large:

$ du -t 250M -hs /var/lib/thanos/*/index.cache.json
279M    /var/lib/thanos/01CXD021G0JHNRQWJKVKCHD3QR/index.cache.json
1,6G    /var/lib/thanos/01CXF4WQE8H27NZK0W4VEQB3XP/index.cache.json
1,5G    /var/lib/thanos/01CXFJT3R8J59MD4YC9AC0TPE5/index.cache.json
...

Anything else we need to know

Environment:

  • OS: CentOS Linux 7 (Core)
  • Kernel: 3.10.0-862.11.6.el7.x86_64
  • Docker: 18.03.1-ce, build 9ee9f40
  • Memory: 256 GB

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 16 (16 by maintainers)

Commits related to this issue

Most upvoted comments

Weird, I’m on master and I’ve added some more tests: https://github.com/GiedriusS/thanos/commit/42449a820cd77aae54f65faa0e86e6d8744f0ac8, and they pass without an issue. Could you think of another edge case that we are missing? 😞
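
One possible edge case to cover, sketched below against the simplified cache from earlier rather than the real indexCache (the constructor and setter names would need adjusting): an item larger than the whole cache budget, run under a deadline so a spinning eviction loop fails the test instead of hanging the test binary.

package main

import (
	"container/list"
	"testing"
	"time"
)

// TestSetLargerThanCacheDoesNotSpin guards against the eviction loop spinning
// with the lock held: the set runs in a separate goroutine and the test fails
// after a deadline instead of hanging forever. If the bug is present, the
// spinning goroutine leaks, but the test still reports the failure.
func TestSetLargerThanCacheDoesNotSpin(t *testing.T) {
	c := &indexCache{lru: list.New(), maxSize: 8}

	done := make(chan struct{})
	go func() {
		defer close(done)
		c.set(9) // larger than the entire cache: no amount of eviction helps
	}()

	select {
	case <-done:
	case <-time.After(2 * time.Second):
		t.Fatal("set() never returned; the eviction loop is spinning while holding the lock")
	}
}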