prometheus: Memory leak in samples deletion procedure
Edits by @codesome: Actions to be taken
- [low hanging fruit] Call Delete() on Head only if delete time ranges overlap with it (code reference)
- [low hanging fruit] Head GC to also GC tombstones (code reference)
- [not really a low hanging fruit] Clean tombstones API to also clean tombstones from the Head (TBD if it is worth it, need to look for any historical reasons or if it is a regression. Not a requirement to fix this leak.)
What did you do?
Context: Basically we are trying to downsample the data. In order to reduce disk space consumption (we have pretty strict disk space limits for stats), we are “decimating” samples. In other words we are removing samples as they got older, but not all of them but we leave some samples there to have a longer history.
We implemented the described above thing by calling the /delete_series endpoint every minute with a specific start
and end
for all the metrics.
What did you expect to see?
We expected to see lower disk space usage with longer stats history and stable (flat) memory consumption.
What did you see instead? Under which circumstances?
Prometheus instance consumes more and more memory and eventually gets killed due to OOM after a few days of work. If I disable the “decimation” procedure (stop calling the /delete_series basically), the problem disappears and memory consumption plateaus.
On the staging instance in the aws we see it consumes roughly 3-4G per day and crashes at 25G. Number of time series there: 30819.
I reproduced it locally on mac by the following simple script:
$ cat ~/test/mem_issue.sh
#!/bin/bash
set -e
while true; do
timestamp=`date +"%s"`
startts="$((timestamp - 60))"
endts="$((timestamp - 52))"
url="http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance=\"localhost:9090\"}&start=$startts&end=$endts"
echo "$url"
curl -X POST -g "$url"
sleep 10
done
After one day of this script running the difference in mem consumption can be seen clearly:
With periodic deletions:
Without deletions:
Environment
It reproduced on aws linux and on darwin. It seems like it does not depend on the env much.
- System information:
$ uname -srm
Darwin 19.6.0 x86_64
- Prometheus version:
prometheus, version 2.27.1 (branch: HEAD, revision: db7f0bcec27bd8aeebad6b08ac849516efa9ae02)
build user: root@bb2fc99061d0
build date: 20210518-14:07:30
go version: go1.16.4
platform: darwin/amd64
- Prometheus configuration file:
Our staging instance has more complicated config but it can be reproduced on the simplest config like this (but it takes longer time as the number of metrics is lower):
$ cat prometheus.yml
global:
scrape_interval: 1s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- Logs:
On staging where it crashed because of OOM the last logs were:
57255 level=info ts=2021-07-15T10:16:44.792Z caller=compact.go:494 component=tsdb msg="write block" mint=1626220800000 maxt=1626285600000 ulid=01FAMTP8Y0X5RGTYZ5DAT4K5Y6 duration=1.59143539s
57256 level=info ts=2021-07-15T10:16:45.625Z caller=compact.go:494 component=tsdb msg="write block" mint=1626307200000 maxt=1626328800000 ulid=01FAMTPAFRPRE5QDBG2DAW9QQH duration=833.052171ms
57257 level=info ts=2021-07-15T10:16:46.246Z caller=compact.go:494 component=tsdb msg="write block" mint=1626328800000 maxt=1626336000000 ulid=01FAMTPB9SJ22GRQBDGR9VVD5K duration=620.793956ms
57258 level=info ts=2021-07-15T10:16:46.253Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTMGQ540BPX3TDP3Q9HBV5
57259 level=info ts=2021-07-15T10:16:46.270Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTMEB34YNCG5XZ2VKYBK8Z
57260 level=info ts=2021-07-15T10:16:46.277Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTMFWT0WHKWWN76BR0GVEK
57261 level=info ts=2021-07-15T10:17:44.757Z caller=compact.go:494 component=tsdb msg="write block" mint=1626220800000 maxt=1626285600000 ulid=01FAMTR3GW7WWJGKNF37FDXK60 duration=1.560656142s
57262 level=info ts=2021-07-15T10:17:45.579Z caller=compact.go:494 component=tsdb msg="write block" mint=1626307200000 maxt=1626328800000 ulid=01FAMTR51NKY7WPP56W2201MMV duration=821.192934ms
57263 level=info ts=2021-07-15T10:17:46.160Z caller=compact.go:494 component=tsdb msg="write block" mint=1626328800000 maxt=1626336000000 ulid=01FAMTR5VBSYVN4G5XEWFRXZPZ duration=581.230981ms
57264 level=info ts=2021-07-15T10:17:46.166Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTPB9SJ22GRQBDGR9VVD5K
57265 level=info ts=2021-07-15T10:17:46.173Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTPAFRPRE5QDBG2DAW9QQH
57266 level=info ts=2021-07-15T10:17:46.190Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTP8Y0X5RGTYZ5DAT4K5Y6
57267 level=info ts=2021-07-15T10:18:46.004Z caller=compact.go:494 component=tsdb msg="write block" mint=1626220800000 maxt=1626285600000 ulid=01FAMTSZAK06BNTRBMA8XSGHXH duration=1.569405117s
57268 level=info ts=2021-07-15T10:18:46.826Z caller=compact.go:494 component=tsdb msg="write block" mint=1626307200000 maxt=1626328800000 ulid=01FAMTT0VNVMV2149RSNW0M4GW duration=821.47751ms
57269 level=info ts=2021-07-15T10:18:47.407Z caller=compact.go:494 component=tsdb msg="write block" mint=1626328800000 maxt=1626336000000 ulid=01FAMTT1NAT3G7FPGE729R1PBR duration=580.965225ms
57270 level=info ts=2021-07-15T10:18:47.416Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTR51NKY7WPP56W2201MMV
57271 level=info ts=2021-07-15T10:18:47.420Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTR5VBSYVN4G5XEWFRXZPZ
57272 level=info ts=2021-07-15T10:18:47.436Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTR3GW7WWJGKNF37FDXK60
57273 fatal error: runtime: out of memory
Let me know we you need stacktraces that are printed right after the out of memory
error.
Locally (I didn’t wait long enough for it to crash) but logs look pretty ok to me:
level=info ts=2021-07-21T01:56:06.249Z caller=main.go:426 msg="Starting Prometheus" version="(version=2.27.1, branch=HEAD, revision=db7f0bcec27bd8aeebad6b08ac849516efa9ae02)"
level=info ts=2021-07-21T01:56:06.249Z caller=main.go:431 build_context="(go=go1.16.4, user=root@bb2fc99061d0, date=20210518-14:07:30)"
level=info ts=2021-07-21T01:56:06.249Z caller=main.go:432 host_details=(darwin)
level=info ts=2021-07-21T01:56:06.249Z caller=main.go:433 fd_limits="(soft=256, hard=unlimited)"
level=info ts=2021-07-21T01:56:06.249Z caller=main.go:434 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-07-21T01:56:06.253Z caller=web.go:540 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2021-07-21T01:56:06.254Z caller=main.go:803 msg="Starting TSDB ..."
level=info ts=2021-07-21T01:56:06.254Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-07-21T01:56:06.259Z caller=head.go:741 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-07-21T01:56:06.297Z caller=head.go:755 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=38.613571ms
level=info ts=2021-07-21T01:56:06.298Z caller=head.go:761 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-07-21T01:56:06.503Z caller=head.go:813 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=1
level=info ts=2021-07-21T01:56:06.503Z caller=head.go:813 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=1
level=info ts=2021-07-21T01:56:06.503Z caller=head.go:818 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=114.193µs wal_replay_duration=205.739985ms total_replay_duration=244.517237ms
level=info ts=2021-07-21T01:56:06.505Z caller=main.go:828 fs_type=19
level=info ts=2021-07-21T01:56:06.505Z caller=main.go:831 msg="TSDB started"
level=debug ts=2021-07-21T01:56:06.505Z caller=main.go:832 msg="TSDB options" MinBlockDuration=2h MaxBlockDuration=31d MaxBytes=1GiB NoLockfile=false RetentionDuration=1y WALSegmentSize=0B AllowOverlappingBlocks=false WALCompression=true
level=info ts=2021-07-21T01:56:06.505Z caller=main.go:957 msg="Loading configuration file" filename=./prometheus.yml
level=debug ts=2021-07-21T01:56:06.534Z caller=manager.go:195 component="discovery manager scrape" msg="Starting provider" provider=static/0 subs=[prometheus]
level=info ts=2021-07-21T01:56:06.534Z caller=main.go:988 msg="Completed loading of configuration file" filename=./prometheus.yml totalDuration=29.024755ms remote_storage=5.122µs web_handler=307ns query_engine=2.197µs scrape=28.645242ms scrape_sd=97.764µs notify=919ns notify_sd=1.947µs rules=3.782µs
level=info ts=2021-07-21T01:56:06.534Z caller=main.go:775 msg="Server is ready to receive web requests."
level=debug ts=2021-07-21T01:56:06.534Z caller=manager.go:213 component="discovery manager scrape" msg="Discoverer channel closed" provider=static/0
level=info ts=2021-07-21T02:03:16.139Z caller=compact.go:513 component=tsdb msg="write block" mint=1626822195592 maxt=1626825600000 ulid=01FB3CV18QVNQKCBA72YMH59H9 duration=531.926557ms
level=info ts=2021-07-21T02:03:16.141Z caller=head.go:925 component=tsdb msg="Head GC completed" duration=1.362425ms
level=info ts=2021-07-21T03:00:01.570Z caller=compact.go:513 component=tsdb msg="write block" mint=1626825600634 maxt=1626832800000 ulid=01FB3G2YG8FSF19F42ZXZPP04H duration=921.485765ms
level=info ts=2021-07-21T03:00:01.571Z caller=head.go:925 component=tsdb msg="Head GC completed" duration=1.230755ms
level=info ts=2021-07-21T05:00:01.394Z caller=compact.go:513 component=tsdb msg="write block" mint=1626832800602 maxt=1626840000000 ulid=01FB3PYNT89V6G68FEZQC3ZCHZ duration=682.470478ms
level=info ts=2021-07-21T05:00:01.396Z caller=head.go:925 component=tsdb msg="Head GC completed" duration=1.214479ms
level=info ts=2021-07-21T05:00:01.416Z caller=checkpoint.go:97 component=tsdb msg="Creating checkpoint" from_segment=0 to_segment=1 mint=1626840000000
level=info ts=2021-07-21T05:00:01.621Z caller=head.go:1022 component=tsdb msg="WAL checkpoint complete" first=0 last=1 duration=205.179238ms
level=info ts=2021-07-21T07:00:01.541Z caller=compact.go:513 component=tsdb msg="write block" mint=1626840000673 maxt=1626847200000 ulid=01FB3XTD4DMMNEEQZGP5XX8XTP duration=759.6467ms
level=info ts=2021-07-21T07:00:01.543Z caller=head.go:925 component=tsdb msg="Head GC completed" duration=1.11078ms
level=info ts=2021-07-21T09:00:01.623Z caller=compact.go:513 component=tsdb msg="write block" mint=1626847200743 maxt=1626854400000 ulid=01FB44P4EWAFHMPNYNXA98NHQN duration=763.291727ms
level=info ts=2021-07-21T09:00:01.625Z caller=head.go:925 component=tsdb msg="Head GC completed" duration=1.134786ms
level=info ts=2021-07-21T09:00:01.646Z caller=checkpoint.go:97 component=tsdb msg="Creating checkpoint" from_segment=2 to_segment=3 mint=1626854400000
level=info ts=2021-07-21T09:00:01.840Z caller=head.go:1022 component=tsdb msg="WAL checkpoint complete" first=2 last=3 duration=193.251952ms
level=info ts=2021-07-21T11:00:01.820Z caller=compact.go:513 component=tsdb msg="write block" mint=1626854400814 maxt=1626861600000 ulid=01FB4BHVS52B8QVQPDDK29SGK1 duration=887.303356ms
level=info ts=2021-07-21T11:00:01.822Z caller=head.go:925 component=tsdb msg="Head GC completed" duration=1.172295ms
level=info ts=2021-07-21T11:00:02.343Z caller=compact.go:454 component=tsdb msg="compact blocks" count=3 mint=1626825600634 maxt=1626847200000 ulid=01FB4BHWP4QSXTZ3GVV8MKQ12S sources="[01FB3G2YG8FSF19F42ZXZPP04H 01FB3PYNT89V6G68FEZQC3ZCHZ 01FB3XTD4DMMNEEQZGP5XX8XTP]" duration=483.185294ms
level=info ts=2021-07-21T11:00:02.348Z caller=db.go:1239 component=tsdb msg="Deleting obsolete block" block=01FB3G2YG8FSF19F42ZXZPP04H
level=info ts=2021-07-21T11:00:02.352Z caller=db.go:1239 component=tsdb msg="Deleting obsolete block" block=01FB3PYNT89V6G68FEZQC3ZCHZ
level=info ts=2021-07-21T11:00:02.360Z caller=db.go:1239 component=tsdb msg="Deleting obsolete block" block=01FB3XTD4DMMNEEQZGP5XX8XTP
level=info ts=2021-07-21T13:00:01.786Z caller=compact.go:513 component=tsdb msg="write block" mint=1626861600898 maxt=1626868800000 ulid=01FB4JDK3DMM2XHNQ2BDFE6FJE duration=780.957959ms
level=info ts=2021-07-21T13:00:01.788Z caller=head.go:925 component=tsdb msg="Head GC completed" duration=1.152235ms
level=info ts=2021-07-21T13:00:01.810Z caller=checkpoint.go:97 component=tsdb msg="Creating checkpoint" from_segment=4 to_segment=5 mint=1626868800000
level=info ts=2021-07-21T13:00:02.080Z caller=head.go:1022 component=tsdb msg="WAL checkpoint complete" first=4 last=5 duration=270.499468ms
level=info ts=2021-07-21T15:00:01.855Z caller=compact.go:513 component=tsdb msg="write block" mint=1626868800967 maxt=1626876000000 ulid=01FB4S9ADE7QFYQ10FSTM6ATK9 duration=784.542986ms
level=info ts=2021-07-21T15:00:01.857Z caller=head.go:925 component=tsdb msg="Head GC completed" duration=1.169772ms
level=info ts=2021-07-21T17:00:01.132Z caller=compact.go:513 component=tsdb msg="write block" mint=1626876000038 maxt=1626883200000 ulid=01FB5050RA8WA05N1CZ85C3816 duration=994.400081ms
level=info ts=2021-07-21T17:00:01.134Z caller=head.go:925 component=tsdb msg="Head GC completed" duration=1.377731ms
level=info ts=2021-07-21T17:00:01.209Z caller=checkpoint.go:97 component=tsdb msg="Creating checkpoint" from_segment=6 to_segment=7 mint=1626883200000
level=info ts=2021-07-21T17:00:01.482Z caller=head.go:1022 component=tsdb msg="WAL checkpoint complete" first=6 last=7 duration=273.022182ms
level=info ts=2021-07-21T17:00:02.380Z caller=compact.go:454 component=tsdb msg="compact blocks" count=3 mint=1626847200743 maxt=1626868800000 ulid=01FB50522B13G7WSRP6TC214W8 sources="[01FB44P4EWAFHMPNYNXA98NHQN 01FB4BHVS52B8QVQPDDK29SGK1 01FB4JDK3DMM2XHNQ2BDFE6FJE]" duration=897.574439ms
level=info ts=2021-07-21T17:00:02.386Z caller=db.go:1239 component=tsdb msg="Deleting obsolete block" block=01FB44P4EWAFHMPNYNXA98NHQN
level=info ts=2021-07-21T17:00:02.393Z caller=db.go:1239 component=tsdb msg="Deleting obsolete block" block=01FB4BHVS52B8QVQPDDK29SGK1
level=info ts=2021-07-21T17:00:02.404Z caller=db.go:1239 component=tsdb msg="Deleting obsolete block" block=01FB4JDK3DMM2XHNQ2BDFE6FJE
level=info ts=2021-07-21T17:00:03.307Z caller=compact.go:454 component=tsdb msg="compact blocks" count=3 mint=1626822195592 maxt=1626868800000 ulid=01FB5052Z4D34GFGDB9JZPE39W sources="[01FB3CV18QVNQKCBA72YMH59H9 01FB4BHWP4QSXTZ3GVV8MKQ12S 01FB50522B13G7WSRP6TC214W8]" duration=902.759909ms
level=info ts=2021-07-21T17:00:03.313Z caller=db.go:1239 component=tsdb msg="Deleting obsolete block" block=01FB50522B13G7WSRP6TC214W8
level=info ts=2021-07-21T17:00:03.325Z caller=db.go:1239 component=tsdb msg="Deleting obsolete block" block=01FB3CV18QVNQKCBA72YMH59H9
level=info ts=2021-07-21T17:00:03.336Z caller=db.go:1239 component=tsdb msg="Deleting obsolete block" block=01FB4BHWP4QSXTZ3GVV8MKQ12S
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 19 (9 by maintainers)
@darshanime thanks! Please see the issue description for the 3 different action items. You can ignore the 3rd one for now.
You found another issue! I just checked the code and looks like we are not skipping the Head block if the delete call does not overlap with it (while we take care of it properly for persistent blocks). So looks like tombstones will grow in memory for any delete call. I have updated the issue description with possible fixes to do.
If we fix the above (i.e. not touching Head if it does not overlap), then 4 hour is good enough to not touch Head block at all. By default min block size should be 2h.