thanos: Compactor/S3: sometimes segments are not being fully downloaded ("segment doesn't include enough bytes")

Thanos, Prometheus and Golang version used:

Thanos 0.13.0-rc.2; we are currently upgrading to 0.13.0, but the commits since the tag was cut do not indicate any relevant change.

Object Storage Provider:

Internal Cloudian S3 service

What happened:

The compactor halts with the following error message:

level=error ts=2020-06-25T07:57:55.183632304Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@9955244503410235132: compact blocks [/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]: write compaction: iterate compaction set: chunk 41943009 not found: segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040"

After restarting the compactor, compaction continues successfully. (The restart usually happens hours later: we only get alerted if the compactor has not run for quite some time, so I cannot tell whether this is a very short inconsistency window with a lot of bad luck involved.)
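Incidentally, the "available" size in the error (41943040 bytes) is exactly 40 MiB, which would fit the theory of a transfer being cut off at a round boundary. Below is a minimal sketch of the kind of bounds check that produces such a message (illustrative only, not the actual Prometheus TSDB code; offsets and lengths are derived from the error above):

```go
package main

import "fmt"

// chunkFitsSegment mimics, very roughly, the bounds check a chunk reader has
// to perform before decoding a chunk out of a segment file: the chunk's bytes
// must lie entirely within the bytes that were actually downloaded.
func chunkFitsSegment(segmentLen, chunkStart, chunkLen int) error {
	required := chunkStart + chunkLen
	if required > segmentLen {
		return fmt.Errorf("segment doesn't include enough bytes to read the chunk - required:%d, available:%d", required, segmentLen)
	}
	return nil
}

func main() {
	// 41943040 bytes is exactly 40 MiB; the chunk needs 4 bytes more.
	segmentLen := 40 * 1024 * 1024
	fmt.Println(chunkFitsSegment(segmentLen, 41943009, 35))
}
```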

What you expected to happen:

If this is really a transient problem caused by eventual consistency, the compactor should retry instead of halting.
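To make that expectation concrete, here is a rough sketch of what I mean (not Thanos' actual code; `downloadAndVerify` is a hypothetical placeholder for the compactor's download-and-verify step, and the string match is just a stand-in for proper error classification):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// downloadAndVerify is a hypothetical placeholder for the compactor's
// "download blocks, then verify chunk segments" step.
func downloadAndVerify(ctx context.Context) error {
	return errors.New("segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040")
}

// withRetry re-runs the step when the failure looks like a truncated
// download, treating it as retryable instead of critical/halting.
func withRetry(ctx context.Context, attempts int, f func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = f(ctx); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "doesn't include enough bytes") {
			return err // not the truncation symptom; keep the halting behaviour
		}
		time.Sleep(time.Duration(i+1) * time.Second) // crude backoff for the sketch
	}
	return fmt.Errorf("still failing after %d attempts: %w", attempts, err)
}

func main() {
	fmt.Println(withRetry(context.Background(), 3, downloadAndVerify))
}
```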

How to reproduce it (as minimally and precisely as possible):

Slightly overload your S3 service and wait. We see an estimated failure rate of about once a week for a single compactor. It mostly happens in installations with some network distance to the S3 endpoint (ping >= 200 ms).

Full logs to relevant components:

Logs

[...]
level=info ts=2020-06-25T07:42:56.091077077Z caller=block.go:157 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="block has been marked for deletion" block=01EAQYJKPJS17YPE28VZ48JF7Q
level=info ts=2020-06-25T07:42:56.11841702Z caller=compact.go:792 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="marking compacted block for deletion" old_block=01EAR5EAY80Z79JCDWQBAVVGJ4
level=info ts=2020-06-25T07:42:56.358365126Z caller=block.go:157 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="block has been marked for deletion" block=01EAR5EAY80Z79JCDWQBAVVGJ4
level=info ts=2020-06-25T07:42:57.695050512Z caller=compact.go:633 group="0@{monitor=\"prometheus-ha\", replica=\"0\"}" groupKey=0@9955244503410235132 msg="compaction available and planned; downloading blocks" plan="[/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]"
level=info ts=2020-06-25T07:55:17.496533315Z caller=compact.go:692 group="0@{monitor=\"prometheus-ha\", replica=\"0\"}" groupKey=0@9955244503410235132 msg="downloaded and verified blocks; compacting blocks" plan="[/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]" duration=12m19.801396575s
level=error ts=2020-06-25T07:57:55.183632304Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@9955244503410235132: compact blocks [/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]: write compaction: iterate compaction set: chunk 41943009 not found: segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040"

Anything else we need to know:

This is almost certainly related to #2128; that issue, prometheus/prometheus#6040, and our own observations make me think this might be a transient issue. It is hard to tell since we cannot really reproduce it. Does #2637 perhaps fix this already?

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 3
  • Comments: 16 (10 by maintainers)

Most upvoted comments

My best theory, for now, is that the transfer gets cut off abruptly and minio-go eats that error up as io.EOF.

From a (very) quick look at Minio internals, I suspect this is the case as well.
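A generic sketch of that theory (not minio-go's or Thanos' actual code): if the server cuts the connection at a point the client maps to a plain io.EOF, an io.Copy-style download finishes without error, and the only remaining signal is that fewer bytes than the expected object size arrived, which is exactly what a later "required > available" check trips over.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// truncatedBody simulates an object download whose connection is cut off
// early: the reader simply stops with io.EOF instead of a network error.
func truncatedBody(data string, cutAt int) io.Reader {
	return strings.NewReader(data[:cutAt])
}

func main() {
	const objectSize = 44 // what the object's metadata / stat says
	body := truncatedBody(strings.Repeat("x", objectSize), 40)

	// io.Copy has no idea how big the object should be; a premature EOF
	// looks exactly like a normal end of stream.
	n, err := io.Copy(io.Discard, body)
	fmt.Printf("copied %d bytes, err=%v\n", n, err) // copied 40 bytes, err=<nil>

	// The missing safeguard: compare what arrived against the expected size.
	if n < objectSize {
		fmt.Printf("truncated download: required:%d, available:%d\n", objectSize, n)
	}
}
```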