thanos: thanos compactor crashes with "write compaction: chunk 8 not found: reference sequence 0 out of range"

Thanos, Prometheus and Golang version used

thanos, version 0.5.0 (branch: HEAD, revision: 72820b3f41794140403fd04d6da82299f2c16447)
  build user:       circleci@eeac5eb36061
  build date:       20190606-10:53:12
  go version:       go1.12.5

What happened

thanos compactor crashes with "write compaction: chunk 8 not found: reference sequence 0 out of range"

What you expected to happen

Should work fine 😃

How to reproduce it (as minimally and precisely as possible):

Not sure 😕

Full logs to relevant components

Out of the list of objects dumped along with the error message, I found one without chunks:

$  gsutil ls -r gs://REDACTED/01DBZNNTM2557YW8T35RBM676P
gs://REDACTED/01DBZNNTM2557YW8T35RBM676P/:
gs://REDACTED/01DBZNNTM2557YW8T35RBM676P/index
gs://REDACTED/01DBZNNTM2557YW8T35RBM676P/meta.json

meta.json contents:

{
	"version": 1,
	"ulid": "01DBZNNTM2557YW8T35RBM676P",
	"minTime": 1557993600000,
	"maxTime": 1558022400000,
	"stats": {
		"numSamples": 35591778,
		"numSeries": 39167,
		"numChunks": 299446
	},
	"compaction": {
		"level": 2,
		"sources": [
			"01DB04S079GCEBKMTWZBH8HQA3",
			"01DB0BMQG3W7M12M8DE3V9QW5C",
			"01DB0JGEQD5RCZ50JAS2NENHQ6",
			"01DB0SC6071QW08JWQVG000AKF"
		],
		"parents": [
			{
				"ulid": "01DB04S079GCEBKMTWZBH8HQA3",
				"minTime": 1557993600000,
				"maxTime": 1558000800000
			},
			{
				"ulid": "01DB0BMQG3W7M12M8DE3V9QW5C",
				"minTime": 1558000800000,
				"maxTime": 1558008000000
			},
			{
				"ulid": "01DB0JGEQD5RCZ50JAS2NENHQ6",
				"minTime": 1558008000000,
				"maxTime": 1558015200000
			},
			{
				"ulid": "01DB0SC6071QW08JWQVG000AKF",
				"minTime": 1558015200000,
				"maxTime": 1558022400000
			}
		]
	},
	"thanos": {
		"labels": {
			"environment": "devint",
			"instance_number": "1",
			"location": "REDACTED",
			"prometheus": "monitoring/prometheus-operator-prometheus",
			"prometheus_replica": "prometheus-prometheus-operator-prometheus-1",
			"stack": "data"
		},
		"downsample": {
			"resolution": 0
		},
		"source": "compactor"
	}
}

Apparently this object was created by the compactor.
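In case other blocks ended up in the same state (meta.json and index present, but no objects under chunks/), here is a rough, dependency-free way to scan the bucket for them. This is only a sketch: the bucket path is the redacted one from above, and matching block directories by a 26-character ULID is an assumption about the layout.

# flag every block directory that has no objects under chunks/
$ for block in $(gsutil ls gs://REDACTED/ | grep -E '/[0-9A-Z]{26}/$'); do
    if ! gsutil ls "${block}chunks/" > /dev/null 2>&1; then
      echo "block without chunks: ${block}"
    fi
  done

The same idea works with any other object-store CLI (e.g. the Swift client mentioned further down); it is slow on large buckets but needs nothing beyond the listing command.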

We’ve been running the compactor for some time, but after a while (due to lack of local disk storage) it was crashing constantly. After an extended period of crashing, storage was added and the compactor was able to make progress until it encountered the problem described here.

I guess the important part is how such an object ended up in the bucket, although I wonder whether it is possible for Thanos to ignore such objects and keep processing the rest of the data (exposing information about bad blocks via metrics)?

I’d guess that it was somehow created during the constant crashes we had earlier, but I have nothing to support that.

Anything else we need to know

#688 describes a similar issue, although it concerns a much older Thanos version than the one we use here. We’ve been running 0.5, and 0.4 before that. I’m not sure, but it is possible that 0.3.2 (compactor) was used at the beginning.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21 (11 by maintainers)

Most upvoted comments

I’m seeing the same error myself, for blocks uploaded with thanos-compactor 0.6.0 (and then processed by 0.6.0). Backend storage is a Ceph cluster accessed via the Swift API.

thanos, version 0.6.0 (branch: HEAD, revision: c70b80eb83e52f5013ed5ffab72464989c132883)
  build user:       root@fd15c8276e5c
  build date:       20190718-11:11:58
  go version:       go1.12.5

thanos-compactor uploaded a compacted block yesterday:

Jul 29 14:03:47 thanos01 thanos[32452]: level=info ts=2019-07-29T14:03:47.456113025Z caller=compact.go:444 msg="compact blocks" count=4 mint=1561622400000 maxt=1561651200000 ulid=01DGZ0PDS2MFX25P9CKG1C1TA3 sources="[01DEC9F6B0BVQHFG7BZDS75N0V 01DECGAXK0CM9HC2QPZ7MMH8M0 01DECQ6MV0K0FR00HHY9906178 01DECY2C2ZX4F6N9GKHKM2PC61]" duration=9.524377767s
Jul 29 14:03:47 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:47.752332697Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json: file already closed"
Jul 29 14:03:55 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:55.115560765Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/chunks/000001: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/chunks/000001: file already closed"
Jul 29 14:03:56 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:56.472928348Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/index: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/index: file already closed"
Jul 29 14:03:56 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:56.571187127Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/index.cache.json: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/index.cache.json: file already closed"
Jul 29 14:03:56 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:56.72380147Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json: file already closed"
Jul 29 14:03:56 thanos01 thanos[32452]: level=debug ts=2019-07-29T14:03:56.723861375Z caller=compact.go:906 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="uploaded block" result_block=01DGZ0PDS2MFX25P9CKG1C1TA3 duration=8.994729984s

and now it’s choking on that block:

Jul 30 11:25:38 thanos01 thanos[22465]: level=debug ts=2019-07-30T11:25:38.310116816Z caller=compact.go:252 msg="download meta" block=01DGZ0PDS2MFX25P9CKG1C1TA3
Jul 30 11:25:47 thanos01 thanos[22465]: level=debug ts=2019-07-30T11:25:47.039193509Z caller=compact.go:840 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="downloaded and verified blocks" blocks="[/var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DEBE09AZXVCNWKYCWP43A7G4 /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3]" duration=7.277184353s
Jul 30 11:25:47 thanos01 thanos[22465]: level=error ts=2019-07-30T11:25:47.214649027Z caller=main.go:199 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}: compact blocks [/var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DEBE09AZXVCNWKYCWP43A7G4 /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3]: write compaction: chunk 8 not found: reference sequence 0 out of range"

First, I’d expect it to survive broken blocks, but what’s more concerning is that the block had apparently been uploaded successfully before (unless those warning messages are not just for show and something really did go wrong).

What’s uploaded:

$ swift list prometheus-storage | grep 01DGZ0PDS2MFX25P9CKG1C1TA3
01DGZ0PDS2MFX25P9CKG1C1TA3/chunks/000001
01DGZ0PDS2MFX25P9CKG1C1TA3/index
01DGZ0PDS2MFX25P9CKG1C1TA3/index.cache.json
01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json
debug/metas/01DGZ0PDS2MFX25P9CKG1C1TA3.json

and meta.json:

{
	"ulid": "01DGZ0PDS2MFX25P9CKG1C1TA3",
	"minTime": 1561622400000,
	"maxTime": 1561651200000,
	"stats": {
		"numSamples": 271799367,
		"numSeries": 141915,
		"numChunks": 2265057
	},
	"compaction": {
		"level": 2,
		"sources": [
			"01DEC9F6B0BVQHFG7BZDS75N0V",
			"01DECGAXK0CM9HC2QPZ7MMH8M0",
			"01DECQ6MV0K0FR00HHY9906178",
			"01DECY2C2ZX4F6N9GKHKM2PC61"
		],
		"parents": [
			{
				"ulid": "01DEC9F6B0BVQHFG7BZDS75N0V",
				"minTime": 1561622400000,
				"maxTime": 1561629600000
			},
			{
				"ulid": "01DECGAXK0CM9HC2QPZ7MMH8M0",
				"minTime": 1561629600000,
				"maxTime": 1561636800000
			},
			{
				"ulid": "01DECQ6MV0K0FR00HHY9906178",
				"minTime": 1561636800000,
				"maxTime": 1561644000000
			},
			{
				"ulid": "01DECY2C2ZX4F6N9GKHKM2PC61",
				"minTime": 1561644000000,
				"maxTime": 1561651200000
			}
		]
	},
	"version": 1,
	"thanos": {
		"labels": {
			"datacenter": "p24",
			"environment": "internal",
			"replica": "02"
		},
		"downsample": {
			"resolution": 0
		},
		"source": "compactor"
	}
}
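To narrow down whether the upload itself is at fault, one thing worth trying (a sketch, not something done in this thread) is to compare the uploaded chunk segment against expectations and let Thanos itself inspect the bucket. The container and object names below are the ones from this comment, and the `bucket verify` flags differ between releases, so check `thanos bucket verify --help` for your version:

# check the size/metadata of the uploaded chunk segment
$ swift stat prometheus-storage 01DGZ0PDS2MFX25P9CKG1C1TA3/chunks/000001
# let Thanos inspect the blocks in the bucket (flag set varies per release)
$ thanos bucket verify --objstore.config-file=bucket.yml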

Hi, I am having this issue with Thanos 0.8.1. I have tried moving the directories for the blocks it complains about out of the bucket, but every time I run the compactor it just finds some more to be sad about 😦

This is crashing the compactor with 0.8.1 even with the --wait flag.

Any help to further debug this would be appreciated! (@bwplotka maybe?)

You should delete the blocks which have duplicated data and leave only one copy. It’s up to you to decide which one that is (: It sounds like you need to delete the one you’ve mentioned, but please double check.
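For reference, a cautious way to do that with the GCS bucket from the original report is to copy the block aside before removing it. The bucket path and ULID below are the ones from the top of this issue, so substitute your own:

# keep a local copy in case the wrong block was picked
$ gsutil -m cp -r gs://REDACTED/01DBZNNTM2557YW8T35RBM676P ./block-backup/
# then remove it from the bucket so the compactor stops tripping over it
$ gsutil -m rm -r gs://REDACTED/01DBZNNTM2557YW8T35RBM676P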