thanos: store (s3, gcs): invalid memory address or nil pointer dereference
Since I migrated the data from a 3-node Prometheus cluster to S3 buckets with the Thanos sidecar and started running the Thanos store node against that data, a strange issue has occurred:
panic: runtime error: invalid memory address or nil pointer dereference
after running a query over a long time range.
Since Prometheus was initially using the default min/max-block-duration options, the data I migrated was already compacted (mostly to level 5), so in order to migrate it I manually changed the meta.json files to:
"compaction": {
"level": 1,
This migrated the data successfully, but it may be part of the issue.
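For reference, that edit can be scripted. The sketch below is a hypothetical helper (not part of Thanos) that rewrites the compaction level in every block's meta.json under a local data directory before upload; it assumes the standard TSDB layout of one meta.json per block directory.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
)

func main() {
	blocksDir := os.Args[1] // e.g. the Prometheus data directory

	paths, err := filepath.Glob(filepath.Join(blocksDir, "*", "meta.json"))
	if err != nil {
		log.Fatal(err)
	}
	for _, path := range paths {
		raw, err := os.ReadFile(path)
		if err != nil {
			log.Fatal(err)
		}
		var meta map[string]interface{}
		if err := json.Unmarshal(raw, &meta); err != nil {
			log.Fatal(err)
		}
		// Force the compaction level to 1, as described above, so the
		// already-compacted blocks can be migrated.
		if c, ok := meta["compaction"].(map[string]interface{}); ok {
			c["level"] = 1
		}
		out, err := json.MarshalIndent(meta, "", "\t")
		if err != nil {
			log.Fatal(err)
		}
		if err := os.WriteFile(path, out, 0o644); err != nil {
			log.Fatal(err)
		}
	}
}
```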
When I executed this query:
sum(node_memory_MemTotal{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_MemFree{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_Buffers{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_Cached{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"})
For an 8w range, I got this error and the thanos store pod was restarted:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb6e912]
goroutine 921 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Next(0xc42e90e040, 0xc45b2f00c0)
<autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc47d64c900, 0xc4816f2d70)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc47d64c960, 0xc45b2f00c0)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf137a0, 0xc47d64c960, 0x0, 0x4, 0x4, 0xf137a0, 0xc47d64c960)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc42002ed90, 0xf12ce0, 0xc425b6a300, 0x9cb7c84a295c6301, 0xe320b7d6feb78d94, 0xc4203c9440, 0xc44b3faa10, 0xc42008c5a0, 0xc425b6a280, 0x4, ...)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:482 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x432e88)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:667 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc42008d200, 0xc44b3faa80, 0xc4237c6850)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8
On the second execution, however, the same query returned a satisfying result without any further panics.
I managed to find this pattern (however strange it may seem).
When I start graphing a small time range and use the “+” (Grow the time range) button to rapidly extend it, I get the nil pointer panic even for small ranges like 1d.
If I run the query directly for a large period of time (12w or more), I receive a prompt reply without an error.
About this issue
- State: closed
- Created 6 years ago
- Comments: 61 (39 by maintainers)
Commits related to this issue
- Added additional check for lazy Postings. Debugging #335. Potentially broken preloadPosting is missing some postings, because of bad partitioning. Unit tests are there, but we might not get all edge... — committed to thanos-io/thanos by bwplotka 6 years ago
- Added additional check for lazy Postings. (#627) Debugging #335. Potentially broken preloadPosting is missing some postings, because of bad partitioning. Unit tests are there, but we might not ge... — committed to thanos-io/thanos by bwplotka 6 years ago
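The commits above add a defensive check around the lazily loaded postings. The snippet below is only an illustrative sketch of that kind of guard, with a locally defined Postings interface standing in for the TSDB one; it is not the actual patch from #627.

```go
package store

// Postings is a minimal stand-in for the TSDB index.Postings interface.
type Postings interface {
	Next() bool
	At() uint64
	Err() error
}

// lazyPostings wraps a postings list that is only filled in once its offsets
// have been preloaded from the object store (simplified sketch).
type lazyPostings struct {
	Postings
}

func (p *lazyPostings) Next() bool {
	if p.Postings == nil {
		// Preloading missed this range (e.g. because of bad partitioning);
		// report the list as exhausted instead of calling Next on a nil
		// embedded interface, which is what produces the SIGSEGV above.
		return false
	}
	return p.Postings.Next()
}
```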
It happens here every hour or two, but no one wants to upload a core dump that is around 200 GB (at least in our case, since we have a bucket that is >1 TB), and I haven't had enough time to dig into the warts of Go's memory handling code myself. FWIW, running
sysctl vm.overcommit_memory=2
which disables overcommit, immediately leads to Thanos Store getting killed by the OOM killer here, since there is no memory left available. The default value is 0, where the Linux kernel uses a heuristic to decide whether there is enough memory available to fulfill an allocation request. And because the nil pointers show up in places where they shouldn't, it leads me to think that the allocations fail because the kernel decided there wasn't enough RAM. Also, Go's specification unfortunately doesn't say anything exact about what happens when make() or append() fails to create a new object underneath on the heap… so who knows? 😦 Plus, in the Thanos code we only check for failures that are reported through the error interface.
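To make that point concrete, here is a minimal, hypothetical sketch (not Thanos code): a caller can only react to failures returned through the error interface, while make() and append() have no failure path at all.

```go
package main

import (
	"errors"
	"fmt"
)

// loadChunk pretends to fetch and buffer a chunk of the given size. The only
// failures the caller can react to are the ones returned explicitly as error.
func loadChunk(n int) ([]byte, error) {
	if n <= 0 {
		return nil, errors.New("invalid chunk size")
	}
	// make has no error return: if the runtime cannot obtain memory here,
	// the whole process aborts (or is OOM-killed later under overcommit).
	buf := make([]byte, n)
	// The same applies to the reallocation append may do internally.
	buf = append(buf, 0xff)
	return buf, nil
}

func main() {
	if _, err := loadChunk(1 << 20); err != nil {
		// Only explicitly returned errors ever reach this branch;
		// out-of-memory conditions never do.
		fmt.Println("handled:", err)
	}
}
```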
Lastly, https://github.com/golang/go/issues/16843 has been open for almost 3 years by now, and funnily enough even some Prometheus developers have weighed in on it, which leads me to think that the language itself doesn't provide a way to control or check this. And this is why https://github.com/improbable-eng/thanos/pull/798 was born, since in our case random users sometimes send queries that overload the servers.

Same issue found here using the latest version. It seems to occur if I run a “big” query against thanos-store right after restarting it:
Yes, we checked Go's memory management and the behaviour you mention should not happen (a nil pointer on lack of memory).
We recently found a race condition here: https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L1627 It is now inside the lock, but it used to be before the lock, and we fixed that recently:
here: https://github.com/improbable-eng/thanos/commit/1b6f6dae946fb023710dbbd9e154630aadf623b2#diff-a75f50a9f5bf5b21a862e4e7c6bd1576
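For illustration, here is a minimal sketch of that class of bug (the names are made up, not the actual Thanos code): several per-block goroutines append to a shared result slice, and the append is only safe while the mutex is held.

```go
package main

import "sync"

type seriesSet struct{ id int }

func collect(blocks []int) []seriesSet {
	var (
		mtx sync.Mutex
		res []seriesSet
		wg  sync.WaitGroup
	)
	for _, b := range blocks {
		b := b // capture the loop variable for the goroutine
		wg.Add(1)
		go func() {
			defer wg.Done()
			s := seriesSet{id: b} // per-block work is safe outside the lock

			// Shared state must only be mutated while holding the lock;
			// appending before mtx.Lock() is the race described above.
			mtx.Lock()
			res = append(res, s)
			mtx.Unlock()
		}()
	}
	wg.Wait()
	return res
}

func main() { _ = collect([]int{1, 2, 3}) }
```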
Can you check whether this is still reproducible on master? It is highly plausible that we have fixed it.
@bwplotka, can those changes be merged into master so I can build against the latest version? It seems to have fixed the issue for now. We can reopen this if it occurs again.
Revision ca11572099cb4f64c916c13c7b411b97cccff24a is exactly the v0.2.0-rc.0-lazy-postings build:
https://github.com/improbable-eng/thanos/pull/627/commits/ca11572099cb4f64c916c13c7b411b97cccff24a
@davidhiendl you might want to update to the latest master. There were fixes pushed to it since RC2 was released.
Hello,
I’ve just hit the same issue (I think), my env is:
This always seems to be preceded by messages complaining about missing blocks:
Ok, will try to fix this finding for now.