prometheus: 'compaction failed' - prometheus suddenly ate up entire disk
What did you do?
Four or five days ago I upgraded to Prometheus v2, running in a 4-node Docker Swarm.
What did you expect to see?
Prometheus metrics data to grow fairly slowly, at roughly the same rate as with v1.8 (~1 GB/month).
What did you see instead? Under which circumstances?
In the past 24 hours, the size of my Prometheus data suddenly and inexplicably increased more than 1500x, from ~500 MB to 771 GB, completely filling up my disk.
Environment
I’m not sure what caused this, as I haven’t modified any of Prometheus’s configs since I got v2 up and running smoothly. I’m running Prometheus in a Docker container in swarm mode, so my best guess is that something got corrupted when its container was killed and subsequently restarted on another host. Prometheus’s data is stored on an NFS share available to all hosts, which is then mounted into the container. When checking the data folder, the vast majority of folders in it are <randomhash>.tmp folders; the only other contents besides the tmp folders are two folders with hashes for names (but no .tmp suffix), along with a wal folder and a lock file.
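For reference, a quick way to confirm the .tmp buildup and see how much space it is eating (a minimal sketch run on the host, assuming the data directory is bind-mounted at /docker/prometheus/data as in the compose file below):
# count leftover compaction temp directories
ls -d /docker/prometheus/data/*.tmp | wc -l
# size of the temp directories alone vs. the whole data directory
du -sch /docker/prometheus/data/*.tmp | tail -n 1
du -sh /docker/prometheus/data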
- System information:
Linux 4.13.0-1-amd64 x86_64
- Prometheus version:
/prometheus $ prometheus --version
prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0)
build user: root@615b82cb36b6
build date: 20171108-07:11:59
go version: go1.9.2
- Docker-compose configuration:
version: '3.3'
services:
  prom:
    image: prom/prometheus:v2.0.0
    volumes:
      - /docker/prometheus/config:/etc/prometheus
      - /docker/prometheus/data:/prometheus
    networks:
      - monitoring
    ports:
      - 9090:9090
- Prometheus configuration file:
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    monitor: 'prometheus'
rule_files:
  - "alert.rules_nodes.yml"
  - "alert.rules_tasks.yml"
  - "alert.rules_service-groups.yml"
alerting:
  alertmanagers:
    - dns_sd_configs:
        - names:
            - 'alerts'
          type: 'A'
          port: 9093
scrape_configs:
  - job_name: 'prometheus'
    dns_sd_configs:
      - names:
          - 'tasks.prom'
        type: 'A'
        port: 9090
  - job_name: 'cadvisor'
    dns_sd_configs:
      - names:
          - 'tasks.cadvisor'
        type: 'A'
        port: 8080
  - job_name: 'node-exporter'
    dns_sd_configs:
      - names:
          - 'tasks.node-exporter'
        type: 'A'
        port: 9100
  - job_name: 'docker-exporter'
    static_configs:
      - targets:
          - 'node1:4999'
          - 'node2:4999'
          - 'node3:4999'
          - 'node4:4999'
  - job_name: 'unifi-exporter'
    dns_sd_configs:
      - names:
          - 'tasks.unifi-exporter'
        type: 'A'
        port: 9130
- Logs:
(all logs retrieved using docker service logs monitor_prom)
Logs on Prometheus startup:
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.384460037Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.384521845Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.384544054Z caller=main.go:217 host_details="(Linux 4.13.0-1-amd64 #1 SMP Debian 4.13.4-2 (2017-10-15) x86_64 3434f87590e0 (none))"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.389948893Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.390175381Z caller=main.go:314 msg="Starting TSDB"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:01:46.40905656Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
monitor_prom.1.3u83u9j3hljv@node3 | level=warn ts=2017-11-16T23:02:34.148096492Z caller=head.go:317 component=tsdb msg="unknown series references in WAL samples" count=21956
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:02:34.181636971Z caller=main.go:326 msg="TSDB started"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:02:34.181753604Z caller=main.go:394 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:02:34.260471148Z caller=main.go:371 msg="Server is ready to receive requests."
An example of the countless 'compaction failed' errors from before the disk filled up:
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:27:04.730665346Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:27:05.400205406Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"go_gc_duration_seconds\\\",instance=\\\"10.0.0.244:9090\\\",job=\\\"prometheus\\\",quantile=\\\"0\\\"}\""
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:28:05.436936297Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:28:06.103396123Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"go_gc_duration_seconds\\\",instance=\\\"10.0.0.244:9090\\\",job=\\\"prometheus\\\",quantile=\\\"0\\\"}\""
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:29:06.135866736Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:29:06.827149013Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"go_gc_duration_seconds\\\",instance=\\\"10.0.0.244:9090\\\",job=\\\"prometheus\\\",quantile=\\\"0\\\"}\""
'compaction failed' errors from after the disk filled up:
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:55:18.555189787Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:55:18.555336942Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: mkdir /prometheus/01BZ3M464VD8YNKY27ZX8HKX2V.tmp: no space left on device"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:56:18.567555044Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:56:18.567783423Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: mkdir /prometheus/01BZ3M60R7BGV0TDYT9G4A3TRK.tmp: no space left on device"
monitor_prom.1.3u83u9j3hljv@node3 | level=info ts=2017-11-16T23:57:18.580361477Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1510704000000 maxt=1510711200000
monitor_prom.1.3u83u9j3hljv@node3 | level=error ts=2017-11-16T23:57:18.580538384Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: mkdir /prometheus/01BZ3M7VBM62PP3B0QAEBADCHK.tmp: no space left on device"
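For recovery: as far as I can tell, the <hash>.tmp directories are temporary output from the failed compaction attempts and are never read back, so they can be deleted to reclaim the disk while Prometheus is stopped. A rough sketch, assuming the service name and host paths used above (double-check the paths before removing anything):
# stop the Prometheus service
docker service scale monitor_prom=0
# remove only the temporary compaction directories, not the block dirs or the wal
rm -rf /docker/prometheus/data/*.tmp
# start it again
docker service scale monitor_prom=1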
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 5
- Comments: 25 (10 by maintainers)
Commits related to this issue
- Don't retry failed compactions. Fixes prometheus/prometheus#3487 Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in> — committed to gouthamve/tsdb by gouthamve 7 years ago
- Fdatasync on read to flush any unflushed data. This is to handle partial writes from a previous crash. Fixes prometheus/prometheus#3487 Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac... — committed to gouthamve/tsdb by gouthamve 7 years ago
@zegl I think 2.1 is coming in the next 1-2 weeks.
This happened to me today; the data directory went from ~3 GB to ~300 GB.
These are the first lines related to compaction:
I see that log 586 times, and there were 588 block.tmp directories.
Thanks for reporting. I plan to upgrade Prometheus from 1.8.2 to 2.0, but this issue is so critical that I have to put the upgrade on hold. Waiting for it to be solved.
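A quick way to make the same comparison of failed-compaction log lines versus leftover temp directories (a sketch using the service name and data path from the original report; adjust for your own setup):
docker service logs monitor_prom 2>&1 | grep -c 'compaction failed'
ls -d /docker/prometheus/data/*.tmp | wc -l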