prometheus: Memory Leak in v2.27.1: tsdb.txRing consumes a large amount of heap memory
What did you do? We ran into the same problem as: https://github.com/prometheus/prometheus/issues/7120
What did you expect to see? Memory usage as stable as on our other Prometheus instances.
What did you see instead? Under which circumstances?
The available memory of the node hosting the Prometheus 2.27.1 instance keeps decreasing over time.
Environment

- System information:

Linux 4.9.65-netease x86_64

- Prometheus version:

{
  "status": "success",
  "data": {
    "version": "2.27.1",
    "revision": "db7f0bcec27bd8aeebad6b08ac849516efa9ae02",
    "branch": "HEAD",
    "buildUser": "root@fd804fbd4f25",
    "buildDate": "20210518-14:17:54",
    "goVersion": "go1.16.4"
  }
}

- Prometheus configuration file:
{
"status": "success",
"data": {
"alertmanager.notification-queue-capacity": "10000",
"alertmanager.timeout": "",
"config.file": "/etc/prometheus/prometheus.yml",
"enable-feature": "",
"log.format": "logfmt",
"log.level": "info",
"query.lookback-delta": "5m",
"query.max-concurrency": "20",
"query.max-samples": "50000000",
"query.timeout": "2m",
"rules.alert.for-grace-period": "10m",
"rules.alert.for-outage-tolerance": "1h",
"rules.alert.resend-delay": "1m",
"scrape.adjust-timestamps": "true",
"storage.exemplars.exemplars-limit": "0",
"storage.remote.flush-deadline": "1m",
"storage.remote.read-concurrent-limit": "10",
"storage.remote.read-max-bytes-in-frame": "1048576",
"storage.remote.read-sample-limit": "50000000",
"storage.tsdb.allow-overlapping-blocks": "false",
"storage.tsdb.max-block-chunk-segment-size": "0B",
"storage.tsdb.max-block-duration": "2h",
"storage.tsdb.min-block-duration": "2h",
"storage.tsdb.no-lockfile": "false",
"storage.tsdb.path": "/prometheus",
"storage.tsdb.retention": "0s",
"storage.tsdb.retention.size": "256GiB",
"storage.tsdb.retention.time": "1w",
"storage.tsdb.wal-compression": "true",
"storage.tsdb.wal-segment-size": "0B",
"web.config.file": "",
"web.console.libraries": "/usr/share/prometheus/console_libraries",
"web.console.templates": "/usr/share/prometheus/consoles",
"web.cors.origin": ".*",
"web.enable-admin-api": "false",
"web.enable-lifecycle": "false",
"web.external-url": "",
"web.listen-address": ":9090",
"web.max-connections": "512",
"web.page-title": "Prometheus Time Series Collection and Processing Server",
"web.read-timeout": "5m",
"web.route-prefix": "/",
"web.user-assets": ""
}
}
About this issue
- State: closed
- Created 3 years ago
- Comments: 36 (26 by maintainers)
We had two Prometheus OOM at 400GB of RAM yesterday, so I spent about five hours studying metrics and profiles, and I think I understand what is happening.
First, head compaction holds an `isolationState`, acquired here: https://github.com/prometheus/prometheus/blob/ec6c1f17d1be530d77482a9d9becfe0012239319/tsdb/head.go#L1058, called from `(*LeveledCompactor).populateBlock` (confirmed by a stack trace). This `isolationState` holds the 'low watermark' flat until compaction is finished, which shows up as a steadily widening gap in `prometheus_tsdb_isolation_high_watermark - prometheus_tsdb_isolation_low_watermark` on both of the affected Prometheus instances.
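To make that concrete, here is a minimal, hypothetical model of the isolation bookkeeping (a simplification for illustration, not the actual tsdb code): append IDs only ever increase, an open `isolationState` pins the low watermark at the ID that was current when it was opened, and the watermark cannot advance until that state is closed.

```go
// Hypothetical, heavily simplified model of the isolation bookkeeping described
// above -- not the actual Prometheus tsdb code. Append IDs only ever increase;
// an open state (such as the one head compaction holds for hours) pins the low
// watermark at the ID that was current when it was opened.
package main

import "fmt"

type isolation struct {
	lastAppendID uint64
	openStates   map[uint64]struct{} // IDs pinned by open readers/compaction
}

func newIsolation() *isolation {
	return &isolation{openStates: map[uint64]struct{}{}}
}

// newAppendID is taken by every Appender; it drives the high watermark.
func (i *isolation) newAppendID() uint64 {
	i.lastAppendID++
	return i.lastAppendID
}

// openState pins the current ID until closeState is called.
func (i *isolation) openState() uint64 {
	id := i.lastAppendID
	i.openStates[id] = struct{}{}
	return id
}

func (i *isolation) closeState(id uint64) { delete(i.openStates, id) }

// lowWatermark is the oldest ID an open state may still need to observe.
func (i *isolation) lowWatermark() uint64 {
	low := i.lastAppendID
	for id := range i.openStates {
		if id < low {
			low = id
		}
	}
	return low
}

func main() {
	iso := newIsolation()
	compaction := iso.openState() // head compaction keeps this open for hours

	for scrape := 0; scrape < 5; scrape++ {
		iso.newAppendID() // every scrape's Appender bumps the high watermark
	}
	// Mirrors the widening gap in
	// prometheus_tsdb_isolation_high_watermark - prometheus_tsdb_isolation_low_watermark.
	fmt.Println("high:", iso.lastAppendID, "low:", iso.lowWatermark())

	iso.closeState(compaction)
	fmt.Println("low after compaction finishes:", iso.lowWatermark())
}
```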
Each scrape creates an `Appender`. Any appenders created while this `isolationState` exists, i.e. while compaction is running, are added to the `txRing` of every series they touch, so the memory used by those `txRing`s grows bigger and bigger.
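A matching hypothetical sketch of the per-series transaction ring (again a simplification, not the real `txRing` implementation): each append records its ID, and entries can only be trimmed up to the isolation low watermark, so while compaction pins the watermark the ring grows by one entry per scrape for every series being appended to.

```go
// Hypothetical sketch of the per-series transaction ring described above -- a
// simplification, not the actual tsdb code. Every append records its ID in the
// series' ring, and the ring is only trimmed up to the isolation low watermark.
// If the low watermark is pinned for hours by a running compaction, the ring can
// only grow, one entry per scrape per series.
package main

import "fmt"

type txRing struct {
	txIDs []uint64 // append IDs still potentially visible to open readers
}

// add records an append ID for this series.
func (r *txRing) add(appendID uint64) {
	r.txIDs = append(r.txIDs, appendID)
}

// cleanupBelow drops IDs no open reader can need, i.e. those below the low watermark.
func (r *txRing) cleanupBelow(lowWatermark uint64) {
	i := 0
	for i < len(r.txIDs) && r.txIDs[i] < lowWatermark {
		i++
	}
	r.txIDs = r.txIDs[i:]
}

func main() {
	series := &txRing{}
	const pinnedLowWatermark = 1 // compaction holds the watermark here for hours

	// Example numbers: a 15s scrape interval over a 3-hour compaction is roughly
	// 720 appends per series.
	for appendID := uint64(1); appendID <= 720; appendID++ {
		series.add(appendID)
		series.cleanupBelow(pinnedLowWatermark) // trims nothing while pinned
	}
	fmt.Println("ring entries while compaction runs:", len(series.txIDs)) // 720

	// Once compaction finishes and the watermark advances, the ring empties again.
	series.cleanupBelow(721)
	fmt.Println("ring entries after watermark advances:", len(series.txIDs)) // 0
}
```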
Head compaction can take multiple hours. I found it quite difficult to pin down exactly when it is happening, but an increase in `prometheus_tsdb_compaction_duration_seconds_sum` followed by an increase in `prometheus_tsdb_head_gc_duration_seconds_sum` marks the end.

Chunks older than the most recent one have been written out to memory-mapped files and must be paged back in. As the rest of Prometheus gets bigger, the RAM available for paging in those memory-mapped files gets tighter, so the pages have to come from disk and compaction slows down dramatically. And compaction runs on a single goroutine. A chart of hard page faults per second, `rate(container_memory_failures_total{failure_type="pgmajfault"}[5m])`, shows this clearly.
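To illustrate what "paged back in" costs, here is a small, Linux-only demonstration (illustrative, with a placeholder file path, not part of Prometheus) that memory-maps a file and touches one byte per page; every page not already resident in the page cache has to be read from disk and is counted as a major page fault.

```go
// Illustrative only -- not Prometheus code. Memory-map a large file and touch one
// byte per 4 KiB page; every page that is not resident in the page cache must be
// read from disk, which is what container_memory_failures_total{failure_type="pgmajfault"}
// counts. The file path is a placeholder.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("/prometheus/some-large-chunk-file") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Read-only, shared mapping -- conceptually similar to how older head chunks
	// are memory-mapped rather than kept on the Go heap.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	const pageSize = 4096
	var sum byte
	for off := 0; off < len(data); off += pageSize {
		sum += data[off] // touching a non-resident page triggers a major page fault
	}
	fmt.Printf("touched %d pages (checksum byte %d)\n", (len(data)+pageSize-1)/pageSize, sum)
}
```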
When Prometheus OOMs before completing the head compaction, it restarts, re-reads everything from the WAL, and then starts the head compaction again from the same point as before. By now the data it wants to read is even older and even more likely to need paging in from disk, so this compaction will most likely take even longer.
Awesome investigation. Now we only need to find a fix. 😮