prometheus: Memory Leak in v2.27.1: tsdb.txRing consumes a large amount of heap memory
What did you do? We ran into the same problem as: https://github.com/prometheus/prometheus/issues/7120
What did you expect to see? Memory usage as stable as on our other Prometheus instances.
What did you see instead? Under which circumstances?
The available memory of the node hosting the Prometheus 2.27.1 instance keeps decreasing over time.
Environment

- System information:

Linux 4.9.65-netease x86_64

- Prometheus version:

{
  "status": "success",
  "data": {
    "version": "2.27.1",
    "revision": "db7f0bcec27bd8aeebad6b08ac849516efa9ae02",
    "branch": "HEAD",
    "buildUser": "root@fd804fbd4f25",
    "buildDate": "20210518-14:17:54",
    "goVersion": "go1.16.4"
  }
}

- Prometheus configuration file:
{
"status": "success",
"data": {
"alertmanager.notification-queue-capacity": "10000",
"alertmanager.timeout": "",
"config.file": "/etc/prometheus/prometheus.yml",
"enable-feature": "",
"log.format": "logfmt",
"log.level": "info",
"query.lookback-delta": "5m",
"query.max-concurrency": "20",
"query.max-samples": "50000000",
"query.timeout": "2m",
"rules.alert.for-grace-period": "10m",
"rules.alert.for-outage-tolerance": "1h",
"rules.alert.resend-delay": "1m",
"scrape.adjust-timestamps": "true",
"storage.exemplars.exemplars-limit": "0",
"storage.remote.flush-deadline": "1m",
"storage.remote.read-concurrent-limit": "10",
"storage.remote.read-max-bytes-in-frame": "1048576",
"storage.remote.read-sample-limit": "50000000",
"storage.tsdb.allow-overlapping-blocks": "false",
"storage.tsdb.max-block-chunk-segment-size": "0B",
"storage.tsdb.max-block-duration": "2h",
"storage.tsdb.min-block-duration": "2h",
"storage.tsdb.no-lockfile": "false",
"storage.tsdb.path": "/prometheus",
"storage.tsdb.retention": "0s",
"storage.tsdb.retention.size": "256GiB",
"storage.tsdb.retention.time": "1w",
"storage.tsdb.wal-compression": "true",
"storage.tsdb.wal-segment-size": "0B",
"web.config.file": "",
"web.console.libraries": "/usr/share/prometheus/console_libraries",
"web.console.templates": "/usr/share/prometheus/consoles",
"web.cors.origin": ".*",
"web.enable-admin-api": "false",
"web.enable-lifecycle": "false",
"web.external-url": "",
"web.listen-address": ":9090",
"web.max-connections": "512",
"web.page-title": "Prometheus Time Series Collection and Processing Server",
"web.read-timeout": "5m",
"web.route-prefix": "/",
"web.user-assets": ""
}
}
About this issue
- State: closed
- Created 3 years ago
- Comments: 36 (26 by maintainers)
We had two Prometheus OOM at 400GB of RAM yesterday, so I spent about five hours studying metrics and profiles, and I think I understand what is happening.
First, head compaction holds an `isolationState`, acquired here: https://github.com/prometheus/prometheus/blob/ec6c1f17d1be530d77482a9d9becfe0012239319/tsdb/head.go#L1058, called from `(*LeveledCompactor).populateBlock` (confirmed by a stack trace). This `isolationState` holds the 'low watermark' flat until compaction is finished, which shows up as a steadily widening gap in `prometheus_tsdb_isolation_high_watermark - prometheus_tsdb_isolation_low_watermark` on both of the affected Prometheus instances.
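To make that concrete, here is a minimal, hypothetical model of the isolation bookkeeping (a simplification for illustration, not the actual tsdb code): append IDs only ever increase, an open `isolationState` pins the low watermark at the ID that was current when it was opened, and the watermark cannot advance until that state is closed.

```go
// Hypothetical, heavily simplified model of the isolation bookkeeping described
// above -- not the actual Prometheus tsdb code. Append IDs only ever increase;
// an open state (such as the one head compaction holds for hours) pins the low
// watermark at the ID that was current when it was opened.
package main

import "fmt"

type isolation struct {
	lastAppendID uint64
	openStates   map[uint64]struct{} // IDs pinned by open readers/compaction
}

func newIsolation() *isolation {
	return &isolation{openStates: map[uint64]struct{}{}}
}

// newAppendID is taken by every Appender; it drives the high watermark.
func (i *isolation) newAppendID() uint64 {
	i.lastAppendID++
	return i.lastAppendID
}

// openState pins the current ID until closeState is called.
func (i *isolation) openState() uint64 {
	id := i.lastAppendID
	i.openStates[id] = struct{}{}
	return id
}

func (i *isolation) closeState(id uint64) { delete(i.openStates, id) }

// lowWatermark is the oldest ID an open state may still need to observe.
func (i *isolation) lowWatermark() uint64 {
	low := i.lastAppendID
	for id := range i.openStates {
		if id < low {
			low = id
		}
	}
	return low
}

func main() {
	iso := newIsolation()
	compaction := iso.openState() // head compaction keeps this open for hours

	for scrape := 0; scrape < 5; scrape++ {
		iso.newAppendID() // every scrape's Appender bumps the high watermark
	}
	// Mirrors the widening gap in
	// prometheus_tsdb_isolation_high_watermark - prometheus_tsdb_isolation_low_watermark.
	fmt.Println("high:", iso.lastAppendID, "low:", iso.lowWatermark())

	iso.closeState(compaction)
	fmt.Println("low after compaction finishes:", iso.lowWatermark())
}
```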
Each scrape creates an `Appender`. Any appenders created while this `isolationState` exists, i.e. while compaction is running, are added to the `txRing` of every series they touch, so the memory used by those `txRing`s grows bigger and bigger.
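A matching hypothetical sketch of the per-series transaction ring (again a simplification, not the real `txRing` implementation): each append records its ID, and entries can only be trimmed up to the isolation low watermark, so while compaction pins the watermark the ring grows by one entry per scrape for every series being appended to.

```go
// Hypothetical sketch of the per-series transaction ring described above -- a
// simplification, not the actual tsdb code. Every append records its ID in the
// series' ring, and the ring is only trimmed up to the isolation low watermark.
// If the low watermark is pinned for hours by a running compaction, the ring can
// only grow, one entry per scrape per series.
package main

import "fmt"

type txRing struct {
	txIDs []uint64 // append IDs still potentially visible to open readers
}

// add records an append ID for this series.
func (r *txRing) add(appendID uint64) {
	r.txIDs = append(r.txIDs, appendID)
}

// cleanupBelow drops IDs no open reader can need, i.e. those below the low watermark.
func (r *txRing) cleanupBelow(lowWatermark uint64) {
	i := 0
	for i < len(r.txIDs) && r.txIDs[i] < lowWatermark {
		i++
	}
	r.txIDs = r.txIDs[i:]
}

func main() {
	series := &txRing{}
	const pinnedLowWatermark = 1 // compaction holds the watermark here for hours

	// Example numbers: a 15s scrape interval over a 3-hour compaction is roughly
	// 720 appends per series.
	for appendID := uint64(1); appendID <= 720; appendID++ {
		series.add(appendID)
		series.cleanupBelow(pinnedLowWatermark) // trims nothing while pinned
	}
	fmt.Println("ring entries while compaction runs:", len(series.txIDs)) // 720

	// Once compaction finishes and the watermark advances, the ring empties again.
	series.cleanupBelow(721)
	fmt.Println("ring entries after watermark advances:", len(series.txIDs)) // 0
}
```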
Head compaction can take multiple hours. I found it quite difficult to pin down exactly when it is happening, but an increase in `prometheus_tsdb_compaction_duration_seconds_sum` followed by an increase in `prometheus_tsdb_head_gc_duration_seconds_sum` marks the end.

Chunks older than the most recent one have been written out to memory-mapped files and must be paged back in. As the rest of Prometheus gets bigger, the RAM available for paging in those memory-mapped files gets tighter, so the pages have to come from disk and compaction slows down dramatically. And compaction runs on a single goroutine. A chart of hard page faults per second, `rate(container_memory_failures_total{failure_type="pgmajfault"}[5m])`, shows this clearly.
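To illustrate what "paged back in" costs, here is a small, Linux-only demonstration (illustrative, with a placeholder file path, not part of Prometheus) that memory-maps a file and touches one byte per page; every page not already resident in the page cache has to be read from disk and is counted as a major page fault.

```go
// Illustrative only -- not Prometheus code. Memory-map a large file and touch one
// byte per 4 KiB page; every page that is not resident in the page cache must be
// read from disk, which is what container_memory_failures_total{failure_type="pgmajfault"}
// counts. The file path is a placeholder.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("/prometheus/some-large-chunk-file") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Read-only, shared mapping -- conceptually similar to how older head chunks
	// are memory-mapped rather than kept on the Go heap.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	const pageSize = 4096
	var sum byte
	for off := 0; off < len(data); off += pageSize {
		sum += data[off] // touching a non-resident page triggers a major page fault
	}
	fmt.Printf("touched %d pages (checksum byte %d)\n", (len(data)+pageSize-1)/pageSize, sum)
}
```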
When Prometheus OOMs before completing the head compaction, it restarts, re-reads everything from the WAL, and then starts the head compaction again from the same point as before. By now the data it wants to read is even older and even more likely to need paging in from disk, so this compaction will most likely take even longer.
Awesome investigation. Now we only need to find a fix. 😮