prometheus: 'compaction failed' - prometheus suddenly ate up entire disk
What did you do?
Four or five days ago I upgraded to Prometheus v2, running in a 4-node Docker Swarm.
What did you expect to see?
Prometheus metrics data to grow fairly slowly, at roughly the same rate as with v1.8 (~1 GB/month).
What did you see instead? Under which circumstances?
In the past 24 hours, the size of my Prometheus data suddenly and inexplicably increased more than 1500x, from ~500 MB to 771 GB, completely filling up my disk.
Environment
I’m not sure what caused this, as I haven’t modified any of Prometheus’s configs since I got v2 up and running smoothly. I’m running Prometheus in a Docker container in swarm mode, so my best guess is that something got corrupted when its container was killed and subsequently restarted on another host. Prometheus’s data is stored on an NFS share available to all hosts, which is then mounted into the container. When checking the data folder, the vast majority of folders in it are <randomhash>.tmp folders; the only other contents besides the tmp folders are two folders with hashes for names (but no .tmp suffix), along with a wal folder and a lock file.
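For reference, a quick way to confirm the .tmp buildup and see how much space it is eating (a minimal sketch run on the host, assuming the data directory is bind-mounted at /docker/prometheus/data as in the compose file below):
# count leftover compaction temp directories
ls -d /docker/prometheus/data/*.tmp | wc -l
# size of the temp directories alone vs. the whole data directory
du -sch /docker/prometheus/data/*.tmp | tail -n 1
du -sh /docker/prometheus/data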
- System information:
Linux 4.13.0-1-amd64 x86_64
- Prometheus version:
/prometheus $ prometheus --version
prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0)
build user: root@615b82cb36b6
build date: 20171108-07:11:59
go version: go1.9.2
- Docker-compose configuration:
version: '3.3'
services:
  prom:
    image: prom/prometheus:v2.0.0
    volumes:
      - /docker/prometheus/config:/etc/prometheus
      - /docker/prometheus/data:/prometheus
    networks:
      - monitoring
    ports:
      - 9090:9090
- Prometheus configuration file:
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    monitor: 'prometheus'
rule_files:
  - "alert.rules_nodes.yml"
  - "alert.rules_tasks.yml"
  - "alert.rules_service-groups.yml"
alerting:
  alertmanagers:
    - dns_sd_configs:
        - names:
            - 'alerts'
          type: 'A'
          port: 9093
scrape_configs:
  - job_name: 'prometheus'
    dns_sd_configs:
      - names:
          - 'tasks.prom'
        type: 'A'
        port: 9090
  - job_name: 'cadvisor'
    dns_sd_configs:
      - names:
          - 'tasks.cadvisor'
        type: 'A'
        port: 8080
  - job_name: 'node-exporter'
    dns_sd_configs:
      - names:
          - 'tasks.node-exporter'
        type: 'A'
        port: 9100
  - job_name: 'docker-exporter'
    static_configs:
      - targets:
          - 'node1:4999'
          - 'node2:4999'
          - 'node3:4999'
          - 'node4:4999'
  - job_name: 'unifi-exporter'
    dns_sd_configs:
      - names:
          - 'tasks.unifi-exporter'
        type: 'A'
        port: 9130
- Logs:
(all logs retrieved using docker service logs monitor_prom)
Logs on Prometheus startup:
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.384460037Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.384521845Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.384544054Z caller=main.go:217 host_details="(Linux 4.13.0-1-amd64 #1 SMP Debian 4.13.4-2 (2017-10-15) x86_64 3434f87590e0 (none))"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.389948893Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.390175381Z caller=main.go:314 msg="Starting TSDB"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.40905656Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
monitor_prom.1.3u83u9j3hljv@node3 | level=warn ts=2017-11-16T23:02:34.148096492Z caller=head.go:317 component=tsdb msg="unknown series references in WAL samples" count=21956
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:02:34.181636971Z caller=main.go:326 msg="TSDB started"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:02:34.181753604Z caller=main.go:394 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:02:34.260471148Z caller=main.go:371 msg="Server is ready to receive requests."
An example of the countless 'compaction failed' errors from before the disk filled up:
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:27:04.730665346Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:27:05.400205406Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"go_gc_duration_seconds\\\",instance=\\\"10.0.0.244:9090\\\",job=\\\"prometheus\\\",quantile=\\\"0\\\"}\""
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:28:05.436936297Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:28:06.103396123Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"go_gc_duration_seconds\\\",instance=\\\"10.0.0.244:9090\\\",job=\\\"prometheus\\\",quantile=\\\"0\\\"}\""
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:29:06.135866736Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:29:06.827149013Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"go_gc_duration_seconds\\\",instance=\\\"10.0.0.244:9090\\\",job=\\\"prometheus\\\",quantile=\\\"0\\\"}\""
'compaction failed' errors from after the disk filled up:
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:55:18.555189787Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:55:18.555336942Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: mkdir /prometheus/01BZ3M464VD8YNKY27ZX8HKX2V.tmp: no space left on device"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:56:18.567555044Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:56:18.567783423Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: mkdir /prometheus/01BZ3M60R7BGV0TDYT9G4A3TRK.tmp: no space left on device"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:57:18.580361477Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:57:18.580538384Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: mkdir /prometheus/01BZ3M7VBM62PP3B0QAEBADCHK.tmp: no space left on device"
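For recovery: as far as I can tell, the <hash>.tmp directories are temporary output from the failed compaction attempts and are never read back, so they can be deleted to reclaim the disk while Prometheus is stopped. A rough sketch, assuming the service name and host paths used above (double-check the paths before removing anything):
# stop the Prometheus service
docker service scale monitor_prom=0
# remove only the temporary compaction directories, not the block dirs or the wal
rm -rf /docker/prometheus/data/*.tmp
# start it again
docker service scale monitor_prom=1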
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 5
- Comments: 25 (10 by maintainers)
Commits related to this issue
- Don't retry failed compactions. Fixes prometheus/prometheus#3487 Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in> — committed to gouthamve/tsdb by gouthamve 7 years ago
- Fdatasync on read to flush any unflushed data. This is to handle partial writes from a previous crash. Fixes prometheus/prometheus#3487 Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac... — committed to gouthamve/tsdb by gouthamve 7 years ago
@zegl I think 2.1 is coming in the next 1-2 weeks.
This happened to me today; the data directory went from ~3 GB to ~300 GB.
These are the first lines related to compaction:
I see that log 586 times, and there were 588 block.tmp directories.
Thanks for reporting. I plan to upgrade Prometheus from 1.8.2 to 2.0, but this issue is so critical that I have to put the upgrade on hold. Waiting for it to be solved.
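A quick way to make the same comparison of failed-compaction log lines versus leftover temp directories (a sketch using the service name and data path from the original report; adjust for your own setup):
docker service logs monitor_prom 2>&1 | grep -c 'compaction failed'
ls -d /docker/prometheus/data/*.tmp | wc -l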