prometheus: Prometheus OOM crash with 1TB of free memory remaining
Running a very large Prometheus install (1,952GB memory, 128 vCPUs), we observed Prometheus crash with a “runtime out of memory” error despite having almost 1TB of available memory. We actually had two machines crash at about the same time: one was legitimately out of memory, but the other was totally healthy.
Stack trace:
level=error ts=2020-08-11T13:08:40.215Z caller=consul.go:487 component="discovery manager scrape" discovery=consul msg="Error refreshing service" service=cloudwatch_exporter tags= err="Get \"http://127.0.0.1:8500/v1/health/service/cloudwatch_exporter?index=1650985810&stale=&wait=30000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
fatal error: runtime: out of memory
runtime stack:
runtime.throw(0x26d56d3, 0x16)
/usr/local/go/src/runtime/panic.go:1116 +0x72
runtime.sysMap(0x193f8000000, 0x8000000, 0x4254c78)
/usr/local/go/src/runtime/mem_linux.go:169 +0xc5
runtime.(*mheap).sysAlloc(0x423fdc0, 0x7c00000, 0x423fdc8, 0x3da0)
/usr/local/go/src/runtime/malloc.go:715 +0x1cd
runtime.(*mheap).grow(0x423fdc0, 0x3da0, 0x0)
/usr/local/go/src/runtime/mheap.go:1286 +0x11c
runtime.(*mheap).allocSpan(0x423fdc0, 0x3da0, 0x0, 0x4254c88, 0x415b46)
/usr/local/go/src/runtime/mheap.go:1124 +0x6a0
runtime.(*mheap).alloc.func1()
/usr/local/go/src/runtime/mheap.go:871 +0x64
runtime.(*mheap).alloc(0x423fdc0, 0x3da0, 0xc00bd00100, 0x427910)
/usr/local/go/src/runtime/mheap.go:865 +0x81
runtime.largeAlloc(0x7b40000, 0x460001, 0x193f523e000)
/usr/local/go/src/runtime/malloc.go:1152 +0x92
runtime.mallocgc.func1()
/usr/local/go/src/runtime/malloc.go:1047 +0x46
runtime.systemstack(0x0)
/usr/local/go/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1041
<snip 200MB of goroutine traces>
Environment
- System information:
$ uname -srm
Linux 4.15.0-1035-aws x86_64
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7869925
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7869925
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
- Prometheus version:
$ prometheus --version
prometheus, version 2.18.1 (branch: HEAD, revision: ecee9c8abfd118f139014cb1b174b08db3f342cf)
build user: root@2117a9e64a7e
build date: 20200507-16:51:47
go version: go1.14.2
About this issue
- State: closed
- Created 4 years ago
- Comments: 24 (11 by maintainers)
For future archeologists: we figured this out. It turns out we were running into the kernel limit vm.max_map_count, which controls how many memory map areas a process can request. The default limit of 65536 was too low, so doing sysctl -w vm.max_map_count=262144 fixed our issue and allowed Prometheus to continue consuming memory. There’s probably nothing you can do about this on the Prometheus side since the golang allocator manages this, but FWIW 65536 is a lot of memory maps and I’m kinda surprised that Prometheus keeps allocating them long after startup.
Also, maybe you should consider sharding at this stage.
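For anyone else hitting this, a minimal sketch of how to inspect and raise the limit on a running host; the use of pgrep to find the Prometheus PID is our assumption, not part of the original report:
$ sysctl vm.max_map_count                          # show the current kernel limit
$ sudo wc -l /proc/$(pgrep -x prometheus)/maps     # mappings the process holds right now (one per line)
$ sudo sysctl -w vm.max_map_count=262144           # raise the limit at runtime (not persistent across reboots)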
Yeah, Prometheus maps the TSDB block files into memory, and while it compacts multiple smaller blocks into bigger ones spanning longer timeframes, the total number of blocks keeps growing until the retention limit is reached. (Prometheus might also only map some older files lazily, once they are accessed, but I’m not sure about that one.)
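To confirm that the mappings really are TSDB chunk files, one can grep the process’s map list for the storage directory; this is a sketch that assumes the default data layout under --storage.tsdb.path:
$ sudo grep -c 'data/.*/chunks/' /proc/$(pgrep -x prometheus)/maps   # file-backed mappings of block chunk files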
In any case, congrats on the awesome troubleshooting. We should definitely keep this in mind, and once we write some guidelines about running very large Prometheus servers, we should include vm.max_map_count and not only the obvious ones like the limit for open files.
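A sketch of how such a guideline entry might look, assuming a systemd-managed unit named prometheus; the drop-in file name and the values are illustrative, not from this issue:
$ echo 'vm.max_map_count = 262144' | sudo tee /etc/sysctl.d/99-prometheus.conf
$ sudo sysctl --system                     # apply sysctl drop-ins without a reboot
$ sudo systemctl edit prometheus           # then, in the override file:
  [Service]
  LimitNOFILE=1048576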