prometheus: [Intermittent] compaction failed after upgrade from 2.2.1 to 2.3.0
Bug Report
What did you do? Upgraded Prometheus from 2.2.1 to 2.3.0.
What did you expect to see? Prometheus to restart without any issue.
What did you see instead? Under which circumstances? Jun 19 21:34:16 <host> prometheus[29758]: level=error ts=2018-06-19T21:34:16.143435625Z caller=db.go:277 component=tsdb msg="compaction failed"…
Environment
The WAL directory contains a segment named 000001, and I have no idea why it was created. Also check the timestamps, something is not right:
-rw-r--r--. 1 root root 800655 Jun 19 19:00 000001
-rw-r--r--. 1 root root 268408703 Jun 19 18:34 000027
-rw-r--r--. 1 root root 13269564 Jun 19 20:24 000028
-rw-r--r--. 1 root root 268435456 Jun 19 21:23 000029
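In the listing above, 000001 sits next to 000027 through 000029, and its modification time (19:00) falls between those of 000027 and 000028. A minimal sketch for spotting this kind of numbering gap is shown here. It assumes segments are plain files with six-digit names, as in the listing; the function name `wal_segment_gaps` and its output format are made up for illustration and are not part of Prometheus.

```shell
# Print every pair of consecutive WAL segment files whose numbers are not
# adjacent. Hypothetical helper; assumes six-digit segment file names.
wal_segment_gaps() (
    wal_dir=$1
    prev=
    # Globs expand in sorted order, so zero-padded names come out in
    # numeric order.
    for path in "$wal_dir"/[0-9][0-9][0-9][0-9][0-9][0-9]; do
        [ -f "$path" ] || continue
        name=${path##*/}
        # Strip leading zeros so the number is not read as octal.
        num=$(printf '%s' "$name" | sed 's/^0*//')
        num=${num:-0}
        if [ -n "$prev" ] && [ "$num" -ne $(($prev + 1)) ]; then
            printf 'gap: %06d -> %06d\n' "$prev" "$num"
        fi
        prev=$num
    done
)

# Example call (path taken from the workaround in this report):
#   wal_segment_gaps ./prometheus/wal
```

For the four files listed above this reports a single gap between 000001 and 000027. It only checks numbering; comparing modification times would need a separate check.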
To work around the issue
systemctl stop prometheus
rm ./prometheus/wal/000001
systemctl start prometheus
After removing wal/000001 and restarting, everything was back to normal:
prometheus[4606]: level=info ts=2018-06-19T21:36:47.725951171Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=241.063549ms
After the restart, we had lost 1.5 hours of metrics.
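The workaround above deletes the segment outright, so it cannot be inspected afterwards. A hedged variant is to move it aside instead. The sketch below is an assumption, not something suggested in this thread: the helper name `quarantine_wal_segment` and the backup path are hypothetical, and moving the file has the same effect on Prometheus as deleting it (the data in that segment is not replayed).

```shell
# Move one WAL segment out of the WAL directory instead of deleting it, so
# it can still be inspected later. Hypothetical helper, not part of
# Prometheus; run it only while Prometheus is stopped.
quarantine_wal_segment() (
    wal_dir=$1
    segment=$2
    backup_dir=$3
    src=$wal_dir/$segment
    if [ ! -f "$src" ]; then
        echo "no such segment: $src" >&2
        exit 1
    fi
    mkdir -p "$backup_dir" || exit 1
    # Timestamp the moved file so repeated runs do not overwrite each other.
    dest=$backup_dir/$segment.$(date +%Y%m%dT%H%M%S)
    mv -- "$src" "$dest" || exit 1
    # Print where the segment ended up.
    echo "$dest"
)

# Usage, mirroring the workaround above (backup path is an assumption):
#   systemctl stop prometheus
#   quarantine_wal_segment ./prometheus/wal 000001 /var/tmp/prometheus-wal-quarantine
#   systemctl start prometheus
```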
About this issue
- State: closed
- Created 6 years ago
- Comments: 36 (20 by maintainers)
Just had another look: the web handler is stopped before the database is closed, and in a busy environment closing the database might take a while. So in theory, if systemd sends a kill signal to the old instance and doesn't wait for it to complete the shutdown, the web listener will free port 9090 and the new Prometheus instance will be able to start before the old one has completely shut down.
Just tried it locally and I was wrong about releasing the port. Although the web handler is stopped before the database, the port stays occupied until the Prometheus binary has exited completely, so it seems the bug is caused by something else.
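The comment above concludes that an overlapping shutdown and startup is probably not the cause. For anyone who still wants to rule it out on their own host, a minimal sketch is to wait for the old process to disappear before starting the new one. The helper name `wait_for_exit` and the 60-second default are assumptions for illustration.

```shell
# Wait until the process with the given PID has exited, polling once per
# second, and give up after a timeout in seconds. Hypothetical helper.
wait_for_exit() (
    pid=$1
    timeout=${2:-60}
    elapsed=0
    # kill -0 sends no signal; it only checks that the PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "pid $pid still running after ${timeout}s" >&2
            exit 1
        fi
        sleep 1
        elapsed=$(($elapsed + 1))
    done
)

# Usage (29758 is the PID of the old instance in the log line of this report):
#   wait_for_exit 29758 120 && systemctl start prometheus
```

Note that `kill -0` also fails when the caller lacks permission to signal the process, so this should run as the same user as Prometheus or as root.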