prometheus: [Intermittent] compaction failed after upgrade from 2.2.1 to 2.3.0

Bug Report

What did you do? Upgraded Prometheus from 2.2.1 to 2.3.0.

What did you expect to see? Prometheus restarting without any issues.

What did you see instead? Under which circumstances?

Jun 19 21:34:16 <host> prometheus[29758]: level=error ts=2018-06-19T21:34:16.143435625Z caller=db.go:277 component=tsdb msg="compaction failed"…

Environment

WAL file 000001: no idea why it was created. Also check its timestamp; something is not right.

-rw-r--r--. 1 root root    800655 Jun 19 19:00 000001
-rw-r--r--. 1 root root 268408703 Jun 19 18:34 000027
-rw-r--r--. 1 root root  13269564 Jun 19 20:24 000028
-rw-r--r--. 1 root root 268435456 Jun 19 21:23 000029
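
The two oddities in this listing (a file numbered 000001 next to 000027 through 000029, and a lower-numbered file with a later timestamp than a higher-numbered one) can be checked mechanically. A minimal sketch, assuming the files are named with zero-padded consecutive integers as in the listing; `find_sequence_gaps` and `mtime_inversions` are hypothetical helper names, not part of Prometheus, and the times are typed in by hand from the listing.

```python
def find_sequence_gaps(names):
    """Return (previous, current) pairs of segment numbers that are not adjacent."""
    numbers = sorted(int(n) for n in names if n.isdigit())
    return [(a, b) for a, b in zip(numbers, numbers[1:]) if b != a + 1]


def mtime_inversions(entries):
    """entries: (name, "HH:MM") pairs from the same day.

    Return (lower, higher) name pairs where the higher-numbered file
    was modified earlier than the lower-numbered one before it.
    """
    ordered = sorted(entries, key=lambda e: int(e[0]))
    return [(a[0], b[0]) for a, b in zip(ordered, ordered[1:]) if b[1] < a[1]]


# Names and modification times copied from the listing above.
listing = [
    ("000001", "19:00"),
    ("000027", "18:34"),
    ("000028", "20:24"),
    ("000029", "21:23"),
]

print(find_sequence_gaps(name for name, _ in listing))  # [(1, 27)]
print(mtime_inversions(listing))  # [('000001', '000027')]
```

Both checks flag 000001, which matches what looked wrong by eye.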

To work around the issue

systemctl stop prometheus
rm ./prometheus/wal/000001
systemctl start prometheus

After removing wal/000001 and restarting, everything was back to normal:

prometheus[4606]: level=info ts=2018-06-19T21:36:47.725951171Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=241.063549ms

After the restart, we had lost 1.5 hours of metrics.
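
The workaround above deletes the stray file outright, which leaves nothing to inspect afterwards. A minimal sketch of a variant that moves the file aside instead, assuming Prometheus has already been stopped; `quarantine_segment` and the `wal-quarantine` directory name are hypothetical, and nothing here is claimed to avoid the loss of metrics.

```python
import shutil
import tempfile
from pathlib import Path


def quarantine_segment(wal_dir, name, backup_dir):
    """Move wal_dir/name into backup_dir and return the new path.

    Assumes Prometheus is stopped, so nothing is writing to the file.
    """
    source = Path(wal_dir) / name
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    if target.exists():
        # Refuse to overwrite an earlier quarantined copy.
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(source), str(target))
    return target


# Demonstration against a throwaway directory, not a real data directory.
with tempfile.TemporaryDirectory() as tmp:
    wal = Path(tmp) / "wal"
    wal.mkdir()
    (wal / "000001").write_bytes(b"stray segment")
    moved = quarantine_segment(wal, "000001", Path(tmp) / "wal-quarantine")
    print(moved.name, (wal / "000001").exists())
```

Keeping the file means it can still be attached to a bug report after the restart.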

About this issue

State: closed
Created 6 years ago
Comments: 36 (20 by maintainers)

Most upvoted comments

~~Just had another look, and the web handler is stopped before the database is closed, which in a busy environment might take a while.~~

So in theory, if systemd sends a kill signal to the old instance and doesn't wait for it to complete the shutdown, the web listener will free port 9090 and the new Prometheus instance will be able to start before the old one has completely shut down.

Just tried it locally, and I was wrong about releasing the port. Although the web handler is stopped before the database, the port stays occupied until the Prometheus binary has exited completely, so it seems the bug is caused by something else.

krasi-georgiev on Jun 21, 2018
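
The local test described in the comment above comes down to checking whether anything is still listening on the port while the old process shuts down. A minimal sketch of such a probe, assuming the instance listens on 127.0.0.1:9090 (the port is from the comment, the address is an assumption); `port_in_use` is a hypothetical helper, not part of Prometheus.

```python
import socket


def port_in_use(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds,
    meaning some process is still listening there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Run repeatedly during shutdown to see when the port is released.
    print(port_in_use("127.0.0.1", 9090))
```

Polling this while stopping the service would show whether the port is freed before or after the process exits.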