prometheus: Prometheus 2.0.0-beta5 doesn't recover nicely when running out of disk space

What did you do?

I run a Prometheus server that briefly ran out of disk space earlier today. A colleague of mine made the volume larger.

What did you expect to see?

As soon as disk space becomes available again, Prometheus should resume normal operation.

What did you see instead? Under which circumstances?

Prometheus was unable to scrape any targets from then on. The targets page showed “WAL log samples: log series: write /prometheus/wal/000024: file already closed” next to every target in the table.

I tried restarting Prometheus, but it then refused to start and terminated almost immediately with the message below:

Oct 12 11:48:20 ... docker[2709]: level=error ts=2017-10-12T09:48:20.103205196Z caller=main.go:317 msg="Opening storage failed" err="validate meta \"/prometheus/wal/000025\": EOF"

/prometheus/wal/000025 was a zero-byte file. After I removed it with rm /prometheus/wal/000025, Prometheus started and continued as usual.
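
For anyone hitting the same startup error, a small script can list zero-byte files in the WAL directory before anything is deleted by hand. This is only a sketch: find_empty_wal_segments is a hypothetical helper name, not part of Prometheus, and it assumes the WAL files sit directly inside the directory as they do in the log above. It only reports candidates and removes nothing.

```python
from pathlib import Path


def find_empty_wal_segments(wal_dir):
    """Return the sorted paths of zero-byte regular files directly inside wal_dir.

    Hypothetical helper for illustration; it only reads metadata and never
    modifies or deletes anything.
    """
    root = Path(wal_dir)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.stat().st_size == 0
    )
```

Calling find_empty_wal_segments("/prometheus/wal") on the setup described here would be expected to return the path of 000025. It is probably safest to stop Prometheus and take a copy of the directory before removing whatever the script reports.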

In short, there may be two issues here:

1. Prometheus does not recover once disk space becomes available again.
2. Prometheus refuses to start when the wal/ directory contains a zero-byte file.
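
To make the second point concrete, here is a rough sketch of one policy a WAL reader could follow: tolerate a zero-byte file only when it is the newest segment (which is what a full disk right after file creation would leave behind) and treat any other undersized file as corrupt. This is not Prometheus's actual code or on-disk format. classify_segments and HEADER_SIZE are hypothetical names, and the 8-byte header length is an assumption made purely for illustration.

```python
import os

# Hypothetical header length, used only for this illustration.
# It is NOT taken from Prometheus's real WAL format.
HEADER_SIZE = 8


def classify_segments(wal_dir):
    """Split numerically named segment files into (usable, empty_tail, corrupt).

    A zero-byte file is tolerated only if it is the newest segment; any other
    file shorter than the assumed header is reported as corrupt.
    """
    names = sorted((n for n in os.listdir(wal_dir) if n.isdigit()), key=int)
    usable, empty_tail, corrupt = [], [], []
    for i, name in enumerate(names):
        size = os.path.getsize(os.path.join(wal_dir, name))
        if size >= HEADER_SIZE:
            usable.append(name)
        elif size == 0 and i == len(names) - 1:
            empty_tail.append(name)
        else:
            corrupt.append(name)
    return usable, empty_tail, corrupt
```

Under this policy the directory from this report (a populated 000024 followed by an empty 000025) would classify 000025 as an empty tail that could be skipped or recreated, rather than aborting startup with an EOF error.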

Environment

System information:

Linux 3.16.0-4-amd64 x86_64

Prometheus version:

2.0.0-beta5

About this issue

State: closed
Created 7 years ago
Reactions: 4
Comments: 21 (11 by maintainers)

Most upvoted comments

Saw this as well today

version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
host_details="(Linux 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 x86_64 hostname (none))"

Data dir filled up pretty fast with .tmp dirs at a rate of 200G per hour.

Mattias- on Jan 12, 2018