prometheus: Apparent memory leak & WAL file accumulation after unclean shutdown
Bug Report
What did you do? Rebooted a host running Prometheus.
What did you expect to see? Normal operation.
What did you see instead? Under which circumstances? An apparent memory leak and WAL file accumulation.
I suspect that we had a few unclean shutdowns, and after the crash recovery (which incidentally eats a LOT of memory), things seemed to be back to normal, until the first WAL checkpoint.
We run with retention=30d, min-block-duration=1h, and max-block-duration=24h.
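For concreteness, those settings correspond roughly to the following flags (a sketch from memory, not our exact unit file; the data path is the Debian package default, and the min/max block duration flags are hidden flags in 2.4.x):

/usr/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/metrics2 \
  --storage.tsdb.retention=30d \
  --storage.tsdb.min-block-duration=1h \
  --storage.tsdb.max-block-duration=24h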
The WAL checkpoint appeared to become deadlocked somehow. Metrics were still being ingested and could still be queried, but RAM was being steadily consumed and WAL files were accumulating.
Prometheus inevitably crashed when it ran out of memory or disk space, whichever happened first.
The only workaround I found was to catch Prometheus in this state before it crashed and perform a clean shutdown and restart. Once restarted, it worked through the backlog of WAL files, disk usage receded, and eventually everything returned to normal.
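As a rough way of catching an instance in this state before it falls over (my own sketch; the data directory is the Debian default and an assumption), watching the number of WAL segment files and the process RSS is enough to spot the pattern:

# WAL segments keep growing once the checkpoint has wedged
ls /var/lib/prometheus/metrics2/wal | wc -l
# resident memory of the Prometheus process, in kB
ps -o rss= -C prometheus

Alerting on process_resident_memory_bytes of the Prometheus instance itself can be used to watch the same steady climb.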
Environment
- System information: Linux 4.18.0-0.bpo.1-amd64 x86_64
- Prometheus version: prometheus, version 2.4.3+ds (branch: debian/sid, revision: 2.4.3+ds-2) build user: pkg-go-maintainers@lists.alioth.debian.org build date: 20181022-04:09:52 go version: go1.10.4
- Alertmanager version: n/a
- Prometheus configuration file: n/a (but can be added if really necessary)
- Alertmanager configuration file: n/a
- Logs:
(originally responding to @krasi-georgiev in #4695)
I didn’t quite understand why you needed to lower the default block duration?
This was partly motivated by the article written by @PeterZaitsev (https://www.percona.com/blog/2018/09/20/prometheus-2-times-series-storage-performance-analyses/). We found that RAM usage was about 10-20% lower with the fixed, lower max-block-duration=24h than with the default of 10% calculated from our retention period. The cold startup time also appeared to be lower.
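For concreteness: with retention=30d, the 10% rule would give a max block duration of 0.1 × 30d = 3d (72h), whereas we pin it to 24h.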
any unusual log messages when this happens?
The systemd unit file in the Debian Prometheus package has TimeoutStopSec=20s, which I suspect is too short for a graceful shutdown of a large instance. It also has SendSIGKILL=false, so at least it will allow Prometheus to continue its shutdown despite overshooting the timeout. However, if this is happening as part of a reboot, the host is likely already trying to unmount filesystems and reboot, and I think that is where we are getting a corrupt DB / WAL.
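A possible mitigation (my sketch, not something shipped by the Debian package) is a systemd drop-in that gives a large instance more time to flush on shutdown, e.g. created with systemctl edit prometheus:

# /etc/systemd/system/prometheus.service.d/override.conf
[Service]
TimeoutStopSec=10min
SendSIGKILL=no

followed by systemctl daemon-reload if the file is created by hand. Whether that is enough during a full reboot is another question, but it at least removes the routine 20s cut-off.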
Upon restarting, we see the usual log messages, but nothing too scary:
Nov 5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834318726Z caller=main.go:231 msg="Starting Prometheus" version="(version=2.4.3+ds, branch=debian/sid, revision=2.4.3+ds-2)"
Nov 5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834442207Z caller=main.go:232 build_context="(go=go1.10.4, user=pkg-go-maintainers@lists.alioth.debian.org, date=20181022-04:09:52)"
Nov 5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834483654Z caller=main.go:233 host_details="(Linux 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 fra-prom1 (none))"
Nov 5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834519284Z caller=main.go:234 fd_limits="(soft=8192, hard=8192)"
Nov 5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834550991Z caller=main.go:235 vm_limits="(soft=unlimited, hard=unlimited)"
Nov 5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.835817375Z caller=web.go:392 component=web msg="Start listening for connections" address=0.0.0.0:9090
Nov 5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.835822928Z caller=main.go:547 msg="Starting TSDB ..."
Nov 5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.909715042Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538676000000 maxt=1538870400000 ulid=01CS67N83WJ7322TS7T0BTBGJT
Nov 5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.912787882Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538870400000 maxt=1539064800000 ulid=01CSC11QHWV4BZNNEP46DTZR30
...
Nov 5 14:56:17 fra-prom1 prometheus[465]: level=warn ts=2018-11-05T14:56:17.020693598Z caller=head.go:371 component=tsdb msg="unknown series references" count=25
Nov 5 14:56:17 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:56:17.951110468Z caller=main.go:557 msg="TSDB started"
Nov 5 14:56:17 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:56:17.951267769Z caller=main.go:617 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Nov 5 14:56:17 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:56:17.989881165Z caller=main.go:643 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
Nov 5 14:56:17 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:56:17.990048657Z caller=main.go:516 msg="Server is ready to receive web requests."
Even the next WAL checkpoint and compactions look normal:
Nov 5 15:31:28 fra-prom1 prometheus[465]: level=info ts=2018-11-05T15:31:28.199164694Z caller=compact.go:398 component=tsdb msg="write block" mint=1541426400000 maxt=1541430000000 ulid=01CVJ7XC6W73TH0TAK9JQS09QB
Nov 5 15:31:40 fra-prom1 prometheus[465]: level=info ts=2018-11-05T15:31:40.624026Z caller=head.go:446 component=tsdb msg="head GC completed" duration=6.465479154s
Nov 5 15:32:13 fra-prom1 prometheus[465]: level=info ts=2018-11-05T15:32:13.964443157Z caller=head.go:477 component=tsdb msg="WAL checkpoint complete" low=8290 high=8304 duration=33.339447257s
...
Nov 5 16:31:26 fra-prom1 prometheus[465]: level=info ts=2018-11-05T16:31:26.39099609Z caller=compact.go:398 component=tsdb msg="write block" mint=1541430000000 maxt=1541433600000 ulid=01CVJBB7TXVYDQ9AS7GPKFJWR8
Nov 5 16:31:36 fra-prom1 prometheus[465]: level=info ts=2018-11-05T16:31:36.482639095Z caller=head.go:446 component=tsdb msg="head GC completed" duration=4.919497363s
Nov 5 16:32:06 fra-prom1 prometheus[465]: level=info ts=2018-11-05T16:32:06.97950364Z caller=head.go:477 component=tsdb msg="WAL checkpoint complete" low=8305 high=8319 duration=30.496711647s
Nov 5 16:33:31 fra-prom1 prometheus[465]: level=info ts=2018-11-05T16:33:31.835765448Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1541419200000 maxt=1541430000000 ulid=01CVJBF8YW56AME8TYY1A9AVW0 sources="[01CVJ11MZFNB1ZSRWTTXP11Z1M 01CVJ4FGKEFY42CEADCAAZR88N 01CVJ7XC6W73TH0TAK9JQS09QB]"
But by this time, it is already starting to “leak” memory and the WAL files are piling up. Several hours later, Prometheus runs out of memory:
Nov 6 00:58:09 fra-prom1 prometheus[465]: fatal error: runtime: out of memory
Nov 6 00:58:09 fra-prom1 prometheus[465]: runtime stack:
Nov 6 00:58:09 fra-prom1 prometheus[465]: runtime.throw(0x1449017, 0x16)
Nov 6 00:58:09 fra-prom1 prometheus[465]: #011/usr/lib/go-1.10/src/runtime/panic.go:616 +0x81
...
Can you try to describe a minimal setup and steps to reproduce this, so I can spend some time finding the culprit for this weird behaviour?
I think you should be able to reproduce it if you SIGKILL Prometheus whilst it is still flushing / shutting down. It may take a while after restarting before you actually see the issue, however.
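Roughly something like this (a sketch; the service name and the sleep are assumptions, and the point is that the KILL has to land while the TSDB is still flushing the head):

# ask for a normal shutdown, then kill it mid-flush to simulate the unclean stop
systemctl stop prometheus &
sleep 2
pkill -KILL prometheus

# start it again, let the WAL replay / crash recovery finish,
# then watch memory and the wal/ directory over the next few hours
systemctl start prometheus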
fixed in https://github.com/prometheus/prometheus/pull/4922
@dswarbrick, @AndorCS thanks for the teamwork!