thanos: store (s3, gcs): invalid memory address or nil pointer dereference
Since I migrated the data from a 3-node Prometheus cluster to S3 buckets with the Thanos sidecar and started running the Thanos store node against that data, a strange issue has occurred:
panic: runtime error: invalid memory address or nil pointer dereference
after running a query over a long time range.
Since Prometheus was initially using the default min/max-block-duration options, the data I migrated was already compacted (mostly to level 5), so in order to migrate it I manually changed the meta.json files to:
"compaction": {
"level": 1,
This migrated the data successfully, but it may be part of the issue.
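For reference, that edit can be scripted. The sketch below is a hypothetical helper (not part of Thanos) that rewrites the compaction level in every block's meta.json under a local data directory before upload; it assumes the standard TSDB layout of one meta.json per block directory.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
)

func main() {
	blocksDir := os.Args[1] // e.g. the Prometheus data directory

	paths, err := filepath.Glob(filepath.Join(blocksDir, "*", "meta.json"))
	if err != nil {
		log.Fatal(err)
	}
	for _, path := range paths {
		raw, err := os.ReadFile(path)
		if err != nil {
			log.Fatal(err)
		}
		var meta map[string]interface{}
		if err := json.Unmarshal(raw, &meta); err != nil {
			log.Fatal(err)
		}
		// Force the compaction level to 1, as described above, so the
		// already-compacted blocks can be migrated.
		if c, ok := meta["compaction"].(map[string]interface{}); ok {
			c["level"] = 1
		}
		out, err := json.MarshalIndent(meta, "", "\t")
		if err != nil {
			log.Fatal(err)
		}
		if err := os.WriteFile(path, out, 0o644); err != nil {
			log.Fatal(err)
		}
	}
}
```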
When I executed this query:
sum(node_memory_MemTotal{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_MemFree{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_Buffers{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_Cached{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"})
For an 8w range, I got this error and the thanos store pod was restarted:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb6e912]
goroutine 921 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Next(0xc42e90e040, 0xc45b2f00c0)
<autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc47d64c900, 0xc4816f2d70)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc47d64c960, 0xc45b2f00c0)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf137a0, 0xc47d64c960, 0x0, 0x4, 0x4, 0xf137a0, 0xc47d64c960)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc42002ed90, 0xf12ce0, 0xc425b6a300, 0x9cb7c84a295c6301, 0xe320b7d6feb78d94, 0xc4203c9440, 0xc44b3faa10, 0xc42008c5a0, 0xc425b6a280, 0x4, ...)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:482 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x432e88)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:667 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc42008d200, 0xc44b3faa80, 0xc4237c6850)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8
On the second execution, however, the same query returned a satisfying result without any further panics.
I managed to find this pattern (however strange it may seem).
When I start graphing a small time range and use the “+” (Grow the time range) button to rapidly extend it, I get the nil pointer panic even for small ranges like 1d.
If I run the query directly for a large period of time (12w or more), I receive a prompt reply without an error.
About this issue
- State: closed
- Created 6 years ago
- Comments: 61 (39 by maintainers)
Commits related to this issue
- Added additional check for lazy Postings. Debugging #335. Potentially broken preloadPosting is missing some postings, because of bad partitioning. Unit tests are there, but we might not get all edge... — committed to thanos-io/thanos by bwplotka 6 years ago
- Added additional check for lazy Postings. (#627) Debugging #335. Potentially broken preloadPosting is missing some postings, because of bad partitioning. Unit tests are there, but we might not ge... — committed to thanos-io/thanos by bwplotka 6 years ago
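The commits above add a defensive check around the lazily loaded postings. The snippet below is only an illustrative sketch of that kind of guard, with a locally defined Postings interface standing in for the TSDB one; it is not the actual patch from #627.

```go
package store

// Postings is a minimal stand-in for the TSDB index.Postings interface.
type Postings interface {
	Next() bool
	At() uint64
	Err() error
}

// lazyPostings wraps a postings list that is only filled in once its offsets
// have been preloaded from the object store (simplified sketch).
type lazyPostings struct {
	Postings
}

func (p *lazyPostings) Next() bool {
	if p.Postings == nil {
		// Preloading missed this range (e.g. because of bad partitioning);
		// report the list as exhausted instead of calling Next on a nil
		// embedded interface, which is what produces the SIGSEGV above.
		return false
	}
	return p.Postings.Next()
}
```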
It happens here every hour or two, but no one wants to upload a core dump that is around 200 GB (at least in our case, since we have a bucket that is >1 TB), and I haven't had enough time to dig into the warts of Go's memory handling code myself. FWIW, running
sysctl vm.overcommit_memory=2
which disables overcommit, immediately leads to Thanos Store getting killed by the OOM killer here, since there is no memory left available. The default value is 0, where the Linux kernel uses a heuristic to decide whether there is enough memory available to fulfill an allocation request. And because the nil pointers show up in places where they shouldn't, it leads me to think that the allocations fail because the kernel decided there wasn't enough RAM. Also, Go's specification unfortunately doesn't say anything exact about what happens when make() or append() fails to create a new object underneath on the heap… so who knows? 😦 Plus, in the Thanos code we only check for failures that are reported through the error interface.
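To make that point concrete, here is a minimal, hypothetical sketch (not Thanos code): a caller can only react to failures returned through the error interface, while make() and append() have no failure path at all.

```go
package main

import (
	"errors"
	"fmt"
)

// loadChunk pretends to fetch and buffer a chunk of the given size. The only
// failures the caller can react to are the ones returned explicitly as error.
func loadChunk(n int) ([]byte, error) {
	if n <= 0 {
		return nil, errors.New("invalid chunk size")
	}
	// make has no error return: if the runtime cannot obtain memory here,
	// the whole process aborts (or is OOM-killed later under overcommit).
	buf := make([]byte, n)
	// The same applies to the reallocation append may do internally.
	buf = append(buf, 0xff)
	return buf, nil
}

func main() {
	if _, err := loadChunk(1 << 20); err != nil {
		// Only explicitly returned errors ever reach this branch;
		// out-of-memory conditions never do.
		fmt.Println("handled:", err)
	}
}
```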
Lastly, https://github.com/golang/go/issues/16843 has been open for almost 3 years by now, and funnily enough even some Prometheus developers have weighed in on it, which leads me to think that the language itself doesn't provide a way to control or check this. And this is why https://github.com/improbable-eng/thanos/pull/798 was born, since in our case random users sometimes send queries that overload the servers.

Same issue found here using the latest version. It seems to occur if I run a “big” query against thanos-store right after restarting it:
Yes, we checked Go's memory management and the behaviour you mention should not happen (a nil pointer on lack of memory).
We recently found a race condition here: https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L1627 It is now inside the lock, but it used to be before the lock, and we fixed that recently:
here: https://github.com/improbable-eng/thanos/commit/1b6f6dae946fb023710dbbd9e154630aadf623b2#diff-a75f50a9f5bf5b21a862e4e7c6bd1576
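For illustration, here is a minimal sketch of that class of bug (the names are made up, not the actual Thanos code): several per-block goroutines append to a shared result slice, and the append is only safe while the mutex is held.

```go
package main

import "sync"

type seriesSet struct{ id int }

func collect(blocks []int) []seriesSet {
	var (
		mtx sync.Mutex
		res []seriesSet
		wg  sync.WaitGroup
	)
	for _, b := range blocks {
		b := b // capture the loop variable for the goroutine
		wg.Add(1)
		go func() {
			defer wg.Done()
			s := seriesSet{id: b} // per-block work is safe outside the lock

			// Shared state must only be mutated while holding the lock;
			// appending before mtx.Lock() is the race described above.
			mtx.Lock()
			res = append(res, s)
			mtx.Unlock()
		}()
	}
	wg.Wait()
	return res
}

func main() { _ = collect([]int{1, 2, 3}) }
```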
Can you check whether this is still reproducible on master? It is highly plausible that we have fixed it.
@bwplotka, can those changes be merged into master so I can build against the latest version? It seems to have fixed the issue for now. We can reopen this if it occurs again.
Revision ca11572099cb4f64c916c13c7b411b97cccff24a is exactly the v0.2.0-rc.0-lazy-postings build:
https://github.com/improbable-eng/thanos/pull/627/commits/ca11572099cb4f64c916c13c7b411b97cccff24a
@davidhiendl you might want to update to the latest master. There were fixes pushed to it since RC2 was released.
Hello,
I’ve just hit the same issue (I think), my env is:
This always seems to be preceded by messages complaining about missing blocks:
Ok, will try to fix this finding for now.