prometheus: Prometheus OOM crash with 1TB of free memory remaining
Running a very large Prometheus install (1,952GB memory, 128 vCPUs), we observed Prometheus crash with a “runtime out of memory” error despite having almost 1TB of available memory. We actually had two machines crash at about the same time: one was legitimately out of memory, but the other was totally healthy.
Stack trace:
level=error ts=2020-08-11T13:08:40.215Z caller=consul.go:487 component="discovery manager scrape" discovery=consul msg="Error refreshing service" service=cloudwatch_exporter tags= err="Get \"http://127.0.0.1:8500/v1/health/service/cloudwatch_exporter?index=1650985810&stale=&wait=30000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
fatal error: runtime: out of memory
runtime stack:
runtime.throw(0x26d56d3, 0x16)
/usr/local/go/src/runtime/panic.go:1116 +0x72
runtime.sysMap(0x193f8000000, 0x8000000, 0x4254c78)
/usr/local/go/src/runtime/mem_linux.go:169 +0xc5
runtime.(*mheap).sysAlloc(0x423fdc0, 0x7c00000, 0x423fdc8, 0x3da0)
/usr/local/go/src/runtime/malloc.go:715 +0x1cd
runtime.(*mheap).grow(0x423fdc0, 0x3da0, 0x0)
/usr/local/go/src/runtime/mheap.go:1286 +0x11c
runtime.(*mheap).allocSpan(0x423fdc0, 0x3da0, 0x0, 0x4254c88, 0x415b46)
/usr/local/go/src/runtime/mheap.go:1124 +0x6a0
runtime.(*mheap).alloc.func1()
/usr/local/go/src/runtime/mheap.go:871 +0x64
runtime.(*mheap).alloc(0x423fdc0, 0x3da0, 0xc00bd00100, 0x427910)
/usr/local/go/src/runtime/mheap.go:865 +0x81
runtime.largeAlloc(0x7b40000, 0x460001, 0x193f523e000)
/usr/local/go/src/runtime/malloc.go:1152 +0x92
runtime.mallocgc.func1()
/usr/local/go/src/runtime/malloc.go:1047 +0x46
runtime.systemstack(0x0)
/usr/local/go/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1041
<snip 200MB of goroutine traces>
Environment
- System information:
$ uname -srm
Linux 4.15.0-1035-aws x86_64
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7869925
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7869925
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
- Prometheus version:
$ prometheus --version
prometheus, version 2.18.1 (branch: HEAD, revision: ecee9c8abfd118f139014cb1b174b08db3f342cf)
build user: root@2117a9e64a7e
build date: 20200507-16:51:47
go version: go1.14.2
About this issue
- State: closed
- Created 4 years ago
- Comments: 24 (11 by maintainers)
For future archeologists: we figured this out. It turns out we were running into the kernel limit vm.max_map_count, which controls how many memory map areas a process can request. The default limit of 65536 was too low, so doing sysctl -w vm.max_map_count=262144 fixed our issue and allowed Prometheus to continue consuming memory. There’s probably nothing you can do about this on the Prometheus side since the golang allocator manages this, but FWIW 65536 is a lot of memory maps and I’m kinda surprised that Prometheus keeps allocating them long after startup.
Also, maybe you should consider sharding at this stage.
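For anyone else hitting this, a minimal sketch of how to inspect and raise the limit on a running host; the use of pgrep to find the Prometheus PID is our assumption, not part of the original report:
$ sysctl vm.max_map_count                          # show the current kernel limit
$ sudo wc -l /proc/$(pgrep -x prometheus)/maps     # mappings the process holds right now (one per line)
$ sudo sysctl -w vm.max_map_count=262144           # raise the limit at runtime (not persistent across reboots)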
Yeah, Prometheus maps the TSDB block files into memory, and while it compacts multiple smaller blocks into bigger ones spanning longer timeframes, the total number of blocks keeps growing until the retention limit is reached. (Prometheus might also only map some older files lazily, once they are accessed, but I’m not sure about that one.)
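To confirm that the mappings really are TSDB chunk files, one can grep the process’s map list for the storage directory; this is a sketch that assumes the default data layout under --storage.tsdb.path:
$ sudo grep -c 'data/.*/chunks/' /proc/$(pgrep -x prometheus)/maps   # file-backed mappings of block chunk files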
In any case, congrats on the awesome troubleshooting. We should definitely keep this in mind, and once we write some guidelines about running very large Prometheus servers, we should include vm.max_map_count and not only the obvious ones like the limit for open files.
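A sketch of how such a guideline entry might look, assuming a systemd-managed unit named prometheus; the drop-in file name and the values are illustrative, not from this issue:
$ echo 'vm.max_map_count = 262144' | sudo tee /etc/sysctl.d/99-prometheus.conf
$ sudo sysctl --system                     # apply sysctl drop-ins without a reboot
$ sudo systemctl edit prometheus           # then, in the override file:
  [Service]
  LimitNOFILE=1048576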