prometheus: OOM-killed process (prometheus), is there a memory leak?

Hi, I have a single Prometheus server that scrapes about 50+ targets. It gets OOM-killed after running for several hours, and I'm not sure why. Details below:

  • dmesg
[3907506.014018] [50093]     0 50093  8556396  8490383   16703        0             0 prometheus
[3907506.014031] Out of memory: Kill process 50093 (prometheus) score 947 or sacrifice child
[3907506.014035] Killed process 50093 (prometheus) total-vm:34225584kB, anon-rss:33961532kB, file-rss:0kB
[3920674.254981] [63061]     0 63061  8506886  8492690   16614        0             0 prometheus
[3920674.260250] Out of memory: Kill process 63061 (prometheus) score 947 or sacrifice child
[3920674.262081] Killed process 63061 (prometheus) total-vm:34027544kB, anon-rss:33970760kB, file-rss:0kB
[3958788.455016] [105674]     0 105674  8547989  8492511   16685        0             0 prometheus
[3958788.460257] Out of memory: Kill process 105674 (prometheus) score 947 or sacrifice child
[3958788.462060] Killed process 105674 (prometheus) total-vm:34191956kB, anon-rss:33970044kB, file-rss:0kB
[3970678.851899] [117374]     0 117374  8505681  8494867   16616        0             0 prometheus
[3970678.855538] Out of memory: Kill process 117374 (prometheus) score 947 or sacrifice child
[3970678.857368] Killed process 117374 (prometheus) total-vm:34022724kB, anon-rss:33979468kB, file-rss:0kB
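For reference, the rss column in those task-dump lines is counted in 4 KiB pages, so it lines up with the anon-rss figures; a quick conversion for the first kill (plain shell arithmetic):

# rss is in 4 KiB pages; convert the first killed process to GiB
echo $(( 8490383 * 4 / 1024 / 1024 )) GiB   # ≈ 32 GiB, i.e. anon-rss: 33961532 kB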
  • system info
[15:23 root@prometheus-poc:/var/mwc/jobs] # cat /etc/redhat-release 
CentOS Linux release 7.1.1503 (Core) 
[15:23 root@prometheus-poc:/var/mwc/jobs] # uname -a
Linux prometheus-poc 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[15:23 root@prometheus-poc:/var/mwc/jobs] # free -bt
              total        used        free      shared  buff/cache   available
Mem:    35682078720 24212639744   372396032    94019584 11097042944 11069128704
Swap:             0           0           0
Total:  35682078720 24212639744   372396032
  • prometheus version
prometheus, version 0.17.0 (branch: release-0.17, revision: e11fab3)
  build user:       fabianreinartz@macpro
  build date:       20160302-17:48:43
  go version:       1.5.3
  • prometheus startup flags
prometheus -config.file=/var/mwc/jobs/prometheus/conf/prometheus.yml -storage.local.path=/mnt/prom_data -storage.local.memory-chunks=1048576 -log.level=debug -storage.remote.opentsdb-url=http://10.63.121.35:4242 -alertmanager.url=http://10.63.121.65:9093
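As a rough sanity check on those flags (a hedged back-of-envelope sketch: ~1 KiB per chunk and a roughly 3x total-memory multiplier are the usual rules of thumb for the old local storage, not exact figures), -storage.local.memory-chunks=1048576 should translate into only a few GB of resident memory:

chunks=1048576
echo $(( chunks * 1024 / 1024 / 1024 / 1024 )) GiB of raw chunk data     # = 1 GiB
echo $(( 3 * chunks * 1024 / 1024 / 1024 / 1024 )) GiB expected RSS      # ≈ 3 GiB, far below the ~34 GB being OOM-killed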
  • prometheus scrape config
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    scrape_timeout: 10s
    target_groups:
      - targets: ['localhost:9090']
  - job_name: 'node'
    scrape_interval: 5s
    scrape_timeout: 10s
    target_groups:
      - targets: ['localhost:9100']

  - job_name: 'overwritten-default'
    scrape_interval: 5s
    scrape_timeout: 10s
    consul_sd_configs:
      - server: <consul_server>
        datacenter: "consul_dc"

    relabel_configs:
      - source_labels: ['__meta_consul_service_id']
        regex:         '(.*)'
        target_label:  'job'
        replacement:   '$1'
        action:        'replace'
      - source_labels: ['__meta_consul_service_address','__meta_consul_service_port']
        separator:     ';'
        regex:         '(.*);(.*)'
        target_label:  '__address__'
        replacement:   '$1:$2'
        action:        'replace'
      - source_labels: ['__meta_consul_service_id']
        regex:         '^prometheus_.*'
        action:        'keep'
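Since the Consul job can pull in an arbitrary number of targets, it is worth comparing the number of series actually held in memory against what -storage.local.memory-chunks was sized for. A quick sketch (assumes the server is reachable on localhost:9090 and that the 0.x local-storage metric names below show up in your /metrics output):

# how many series and chunks the server is currently holding in memory
curl -s http://localhost:9090/metrics \
  | grep -E '^prometheus_local_storage_(memory_series|memory_chunks) '
# the targets discovered via Consul are listed on the status page:
#   http://localhost:9090/targets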
  • prometheus process status
Name:   prometheus
State:  S (sleeping)
Tgid:   130923
Ngid:   0
Pid:    130923
PPid:   1
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 512
Groups: 
VmPeak: 19548872 kB
VmSize: 19548872 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:  19486532 kB
VmRSS:  19486532 kB
VmData: 19532964 kB
VmStk:       136 kB
VmExe:      6776 kB
VmLib:         0 kB
VmPTE:     38184 kB
VmSwap:        0 kB
Threads:    19
SigQ:   2/136048
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: fffffffe7fc1feff
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
Seccomp:    0
Cpus_allowed:   ff
Cpus_allowed_list:  0-7
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    1165275
nonvoluntary_ctxt_switches: 234755
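Converting the VmRSS from that snapshot (reported in kB) shows the process already well on its way to exhausting the box:

echo $(( 19486532 / 1024 / 1024 )) GiB resident   # ≈ 18 GiB of the ~33 GiB of RAM on the machine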
  • graph of process_resident_memory_bytes (image not reproduced here)
  • graph of prometheus_local_storage_memory_chunks (image not reproduced here)

Thanks.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 21 (11 by maintainers)

Most upvoted comments

That Prometheus should only be using ~3GB of RAM, but it looks like it’ll top out at ~70GB.

Do you happen to have over 20M time series? If so, you need a bigger box and to increase -storage.local.memory-chunks.
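To make the arithmetic behind that explicit (a hedged sketch: ~1 KiB per chunk and keeping at least three chunks per in-memory series were the commonly cited rules of thumb for the old local storage; exact numbers depend on churn and chunk encoding):

series=20000000                        # hypothetical: ~20M time series
chunks=$(( 3 * series ))               # roughly 3 chunks per series held in memory
echo "-storage.local.memory-chunks=$chunks"
echo $(( chunks * 1024 / 1024 / 1024 / 1024 )) GiB of chunk data alone   # ≈ 57 GiB before per-series overhead

That is in the same ballpark as the ~70 GB ceiling mentioned above, which is why the advice is a bigger machine together with a larger -storage.local.memory-chunks, not just a higher chunk limit on the current box.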