prometheus: Prometheus 2.0.0-beta5 doesn't recover nicely when running out of disk space
**What did you do?**
I run a Prometheus server that briefly ran out of disk space earlier today. A colleague of mine made the volume larger.
**What did you expect to see?**
As soon as disk space becomes available again, Prometheus should continue its business.
**What did you see instead? Under which circumstances?**
Prometheus was unable to scrape any targets from then on. The targets page showed `WAL log samples: log series: write /prometheus/wal/000024: file already closed` next to every target in the table.
I tried restarting Prometheus, but it then refused to start at all, terminating almost immediately with the message below:
```
Oct 12 11:48:20 ... docker[2709]: level=error ts=2017-10-12T09:48:20.103205196Z caller=main.go:317 msg="Opening storage failed" err="validate meta \"/prometheus/wal/000025\": EOF"
```
`/prometheus/wal/000025` was a zero-byte file. After doing an `rm /prometheus/wal/000025`, Prometheus continued as usual.
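For reference, the manual cleanup can be scripted. This is only a sketch of the workaround I applied by hand, not anything Prometheus does itself; it assumes the WAL segments live directly under `/prometheus/wal` and that a zero-byte segment is safe to delete (it was in my case). Run it before starting Prometheus.

```go
// walcheck.go: remove zero-byte WAL segment files before startup (hypothetical helper).
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	walDir := "/prometheus/wal" // adjust to your storage path

	entries, err := os.ReadDir(walDir)
	if err != nil {
		log.Fatalf("reading %s: %v", walDir, err)
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		info, err := e.Info()
		if err != nil {
			log.Fatalf("stat %s: %v", e.Name(), err)
		}
		// A zero-byte segment is what made Prometheus fail with "validate meta ... EOF".
		if info.Size() == 0 {
			p := filepath.Join(walDir, e.Name())
			fmt.Printf("removing zero-byte WAL segment %s\n", p)
			if err := os.Remove(p); err != nil {
				log.Fatalf("remove %s: %v", p, err)
			}
		}
	}
}
```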
In short, there may be two issues here:
- Prometheus cannot recover after disk space becomes available again.
- Prometheus doesn't like empty files in the `wal/` directory.
**Environment**
- System information: Linux 3.16.0-4-amd64 x86_64
- Prometheus version: 2.0.0-beta5
About this issue
- State: closed
- Created 7 years ago
- Reactions: 4
- Comments: 21 (11 by maintainers)
Saw this as well today. Data dir filled up pretty fast with `.tmp` dirs at a rate of 200G per hour.