prometheus: Memory Leak in v2.27.1: tsdb txRing consumes excessive heap memory

What did you do? We ran into the same problem as: https://github.com/prometheus/prometheus/issues/7120

What did you expect to see? Memory usage as stable as on our other Prometheus instances.

What did you see instead? Under which circumstances? The available memory of the node hosting the Prometheus 2.27.1 instance keeps decreasing, as shown below: [charts: node available memory over time]

Environment

  • System information:

    Linux 4.9.65-netease x86_64

  • Prometheus version:

{
  "status": "success",
  "data": {
    "version": "2.27.1",
    "revision": "db7f0bcec27bd8aeebad6b08ac849516efa9ae02",
    "branch": "HEAD",
    "buildUser": "root@fd804fbd4f25",
    "buildDate": "20210518-14:17:54",
    "goVersion": "go1.16.4"
  }
}

  • Prometheus configuration file:

{
  "status": "success",
  "data": {
    "alertmanager.notification-queue-capacity": "10000",
    "alertmanager.timeout": "",
    "config.file": "/etc/prometheus/prometheus.yml",
    "enable-feature": "",
    "log.format": "logfmt",
    "log.level": "info",
    "query.lookback-delta": "5m",
    "query.max-concurrency": "20",
    "query.max-samples": "50000000",
    "query.timeout": "2m",
    "rules.alert.for-grace-period": "10m",
    "rules.alert.for-outage-tolerance": "1h",
    "rules.alert.resend-delay": "1m",
    "scrape.adjust-timestamps": "true",
    "storage.exemplars.exemplars-limit": "0",
    "storage.remote.flush-deadline": "1m",
    "storage.remote.read-concurrent-limit": "10",
    "storage.remote.read-max-bytes-in-frame": "1048576",
    "storage.remote.read-sample-limit": "50000000",
    "storage.tsdb.allow-overlapping-blocks": "false",
    "storage.tsdb.max-block-chunk-segment-size": "0B",
    "storage.tsdb.max-block-duration": "2h",
    "storage.tsdb.min-block-duration": "2h",
    "storage.tsdb.no-lockfile": "false",
    "storage.tsdb.path": "/prometheus",
    "storage.tsdb.retention": "0s",
    "storage.tsdb.retention.size": "256GiB",
    "storage.tsdb.retention.time": "1w",
    "storage.tsdb.wal-compression": "true",
    "storage.tsdb.wal-segment-size": "0B",
    "web.config.file": "",
    "web.console.libraries": "/usr/share/prometheus/console_libraries",
    "web.console.templates": "/usr/share/prometheus/consoles",
    "web.cors.origin": ".*",
    "web.enable-admin-api": "false",
    "web.enable-lifecycle": "false",
    "web.external-url": "",
    "web.listen-address": ":9090",
    "web.max-connections": "512",
    "web.page-title": "Prometheus Time Series Collection and Processing Server",
    "web.read-timeout": "5m",
    "web.route-prefix": "/",
    "web.user-assets": ""
  }
}

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 36 (26 by maintainers)

Most upvoted comments

We had two Prometheus instances OOM at 400GB of RAM yesterday, so I spent about five hours studying metrics and profiles, and I think I understand what is happening.

First, head compaction holds an isolationState, acquired here: https://github.com/prometheus/prometheus/blob/ec6c1f17d1be530d77482a9d9becfe0012239319/tsdb/head.go#L1058

called from (*LeveledCompactor).populateBlock; example stack trace:

#	0x231eb17	github.com/prometheus/prometheus/tsdb/chunks.(*Writer).WriteChunks+0x137		/app/tsdb/chunks/chunks.go:399
#	0x238eb6e	github.com/prometheus/prometheus/tsdb.(*instrumentedChunkWriter).WriteChunks+0x4e	/app/tsdb/compact.go:545
#	0x23915aa	github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).populateBlock+0x120a		/app/tsdb/compact.go:784
#	0x238f2e8	github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write+0x648			/app/tsdb/compact.go:597
#	0x238e584	github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).Write+0x224			/app/tsdb/compact.go:504
#	0x23969cc	github.com/prometheus/prometheus/tsdb.(*DB).compactHead+0x6c				/app/tsdb/db.go:970
#	0x2395e92	github.com/prometheus/prometheus/tsdb.(*DB).Compact+0x2b2				/app/tsdb/db.go:928
#	0x239564e	github.com/prometheus/prometheus/tsdb.(*DB).run+0x3ee					/app/tsdb/db.go:828

This isolationState holds the ‘low watermark’ flat until compaction is finished. Chart showing prometheus_tsdb_isolation_high_watermark - prometheus_tsdb_isolation_low_watermark for the two Prometheus instances:

[chart: isolation high watermark minus low watermark]
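
To make the watermark mechanics concrete, here is a minimal Go sketch; the names and shapes are made up for illustration, not the actual tsdb isolation code. The low watermark is simply the append ID recorded by the oldest still-open read state, so a read state held for the whole head compaction keeps it flat while every scrape keeps advancing the high watermark.

package main

import "fmt"

// Illustrative sketch only; not the real prometheus/tsdb implementation.
type isolationState struct{ maxAppendID uint64 }

type isolation struct {
	openStates []*isolationState // open read states, oldest first
	appendID   uint64            // high watermark: last append ID handed out
}

// State opens a read state bound to the current append ID. Head compaction
// opens one of these and holds it until the new block is fully written.
func (i *isolation) State() *isolationState {
	s := &isolationState{maxAppendID: i.appendID}
	i.openStates = append(i.openStates, s)
	return s
}

// Close releases a read state.
func (i *isolation) Close(s *isolationState) {
	for k, o := range i.openStates {
		if o == s {
			i.openStates = append(i.openStates[:k], i.openStates[k+1:]...)
			return
		}
	}
}

// NewAppendID is taken by every scrape's appender; it advances the high watermark.
func (i *isolation) NewAppendID() uint64 {
	i.appendID++
	return i.appendID
}

// LowWatermark is the oldest append ID an open reader may still need to see.
func (i *isolation) LowWatermark() uint64 {
	if len(i.openStates) > 0 {
		return i.openStates[0].maxAppendID
	}
	return i.appendID
}

func main() {
	iso := &isolation{}
	compacting := iso.State() // held for the whole head compaction

	for scrape := 0; scrape < 5; scrape++ {
		iso.NewAppendID()
	}
	// Prints 5: the gap grows with every scrape while compaction runs.
	fmt.Println("high - low while compacting:", iso.appendID-iso.LowWatermark())

	iso.Close(compacting)
	// Prints 0: the watermark catches up once the read state is released.
	fmt.Println("high - low after compaction:", iso.appendID-iso.LowWatermark())
}

That growing gap is what the chart above shows for the duration of head compaction.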

Each scrape creates an Appender. Every appender created while this isolationState exists, i.e. while compaction is running, is recorded in the txRing of each series it touches, so the memory used by txRing keeps growing.
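
Here is a rough, hypothetical sketch of why that growth happens; the real memSeries txRing is a circular buffer and differs in detail, but it grows for the same reason: entries can only be dropped once they fall below the isolation low watermark, which compaction keeps pinned.

package main

import "fmt"

// Hypothetical per-series transaction ring, simplified to a slice.
type series struct {
	txs []uint64 // append IDs of appenders that touched this series
}

// add records an appender's ID, growing the backing storage as needed.
func (s *series) add(appendID uint64) {
	s.txs = append(s.txs, appendID)
}

// cleanupAppendIDsBelow drops IDs no open reader can still need,
// i.e. everything below the isolation low watermark.
func (s *series) cleanupAppendIDsBelow(lowWatermark uint64) {
	kept := s.txs[:0]
	for _, id := range s.txs {
		if id >= lowWatermark {
			kept = append(kept, id)
		}
	}
	s.txs = kept
}

func main() {
	s := &series{}
	lowWatermark := uint64(1) // pinned by compaction's isolationState

	// One appender per scrape: with the watermark pinned, cleanup removes
	// nothing and the per-series ring keeps growing for hours of scrapes.
	for appendID := uint64(1); appendID <= 1000; appendID++ {
		s.add(appendID)
		s.cleanupAppendIDsBelow(lowWatermark)
	}
	fmt.Println("txRing entries retained in one series:", len(s.txs)) // 1000

	// Once compaction finishes and the watermark advances, the ring drains.
	s.cleanupAppendIDsBelow(1001)
	fmt.Println("after the low watermark advances:", len(s.txs)) // 0
}

Multiplied across millions of active series and hours of scrapes, those retained entries would account for the txRing heap growth reported here.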

Head compaction can take multiple hours. I found it quite difficult to pin down when exactly it is happening, but an increase in prometheus_tsdb_compaction_duration_seconds_sum followed by an increase in prometheus_tsdb_head_gc_duration_seconds_sum marks the end:

[chart: compaction duration and head GC duration metrics]

Chunks older than the most recent one have been written out to memory-mapped files, and must be paged back in. As the rest of Prometheus gets bigger, the RAM available for paging in memory-mapped files gets tighter, so pages must come from disk and compaction slows down dramatically. And compaction takes place on a single goroutine. Here is a chart of hard page faults per second (rate(container_memory_failures_total{failure_type="pgmajfault"}[5m])):

[chart: major page faults per second]

When Prometheus OOMs before completing the head compaction, it will restart and re-read everything from WAL, then start the head compaction again from the same point as before. Now the data it wants to read is even older and likely to need paging in from disk, so most likely this compaction will take even longer.

Awesome investigation. Now we only need to find a fix. 😮