go: runtime: huge page fragmentation on Linux leads to out of memory

I recently upgraded a moderately large production service, serving thousands of requests per second, from Go 1.4.2 to Go 1.5. This particular program has run on every Go version from 1.1 through (now) 1.5 and has been stable in production for over two years. After switching to Go 1.5 we saw dramatic improvements in GC pause time, as expected, but the processes are now intermittently exiting (after somewhere between 1 and 6 hours of uptime) with the panic below.

On each version, this has been running on Ubuntu 12.04.

fatal error: runtime: cannot allocate memory

runtime stack:
runtime.throw(0x8c8f20, 0x1f)
        /usr/local/go/src/runtime/panic.go:527 +0x90
runtime.persistentalloc1(0x4000, 0x8, 0xaa8a78, 0xa8d3a8)
        /usr/local/go/src/runtime/malloc.go:878 +0x2e3
runtime.persistentalloc.func1()
        /usr/local/go/src/runtime/malloc.go:831 +0x3b
runtime.systemstack(0x7ff1377fdc78)
        /usr/local/go/src/runtime/asm_amd64.s:278 +0xab
runtime.persistentalloc(0x4000, 0x0, 0xaa8a78, 0x5)
        /usr/local/go/src/runtime/malloc.go:832 +0x58
runtime.fixAlloc_Alloc(0xa95eb8, 0xa8d3a8)
        /usr/local/go/src/runtime/mfixalloc.go:67 +0xee
runtime.mHeap_AllocSpanLocked(0xa89ba0, 0x5, 0x412c1b)
        /usr/local/go/src/runtime/mheap.go:561 +0x1a7
runtime.mHeap_Alloc_m(0xa89ba0, 0x5, 0x100000000, 0xc8200205bc)
        /usr/local/go/src/runtime/mheap.go:425 +0x1ac
runtime.mHeap_Alloc.func1()
        /usr/local/go/src/runtime/mheap.go:484 +0x41
runtime.systemstack(0x7ff1377fddb8)
        /usr/local/go/src/runtime/asm_amd64.s:278 +0xab
runtime.mHeap_Alloc(0xa89ba0, 0x5, 0x10100000000, 0x923300)
        /usr/local/go/src/runtime/mheap.go:485 +0x63
runtime.largeAlloc(0x9c40, 0x0, 0x4d4f07)
        /usr/local/go/src/runtime/malloc.go:745 +0xb3
runtime.mallocgc.func3()
        /usr/local/go/src/runtime/malloc.go:634 +0x33
runtime.systemstack(0xc820020000)
        /usr/local/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
        /usr/local/go/src/runtime/proc1.go:674

goroutine 1145040165 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:216 fp=0xc8226375f0 sp=0xc8226375e8
runtime.mallocgc(0x9c40, 0x731f60, 0x0, 0x60)
        /usr/local/go/src/runtime/malloc.go:635 +0x9c4 fp=0xc8226376c0 sp=0xc8226375f0
runtime.newarray(0x731f60, 0x1388, 0x3)
        /usr/local/go/src/runtime/malloc.go:777 +0xc9 fp=0xc822637700 sp=0xc8226376c0
runtime.makechan(0x7320e0, 0x1388, 0xc93f800000)
        /usr/local/go/src/runtime/chan.go:72 +0x135 fp=0xc822637750 sp=0xc822637700
github.com/crashlytics/gusset/caching-proxy.(*CachingProxy).Request(0xc820110550, 0x7ff13c6c01b0, 0xc964463810, 0xc9644b30a4, 0x0, 0x0)
        /srv/go/src/github.com/crashlytics/gusset/caching-proxy/caching-proxy.go:445 +0x6d0 fp=0xc822637908 sp=0xc822637750
main.cachedRequestHandler(0x7ff13c6c0178, 0xc9644e2580, 0xc9642b6380)
        /srv/go/src/github.com/crashlytics/gusset/gusset.go:69 +0x437 fp=0xc822637b68 sp=0xc822637908
net/http.HandlerFunc.ServeHTTP(0x922e98, 0x7ff13c6c0178, 0xc9644e2580, 0xc9642b6380)
        /usr/local/go/src/net/http/server.go:1422 +0x3a fp=0xc822637b88 sp=0xc822637b68
net/http.(*ServeMux).ServeHTTP(0xc82010d530, 0x7ff13c6c0178, 0xc9644e2580, 0xc9642b6380)
        /usr/local/go/src/net/http/server.go:1699 +0x17d fp=0xc822637be0 sp=0xc822637b88
net/http.serverHandler.ServeHTTP(0xc82011c540, 0x7ff13c6c0178, 0xc9644e2580, 0xc9642b6380)
        /usr/local/go/src/net/http/server.go:1862 +0x19e fp=0xc822637c40 sp=0xc822637be0
net/http.(*conn).serve(0xc8aef18160)
        /usr/local/go/src/net/http/server.go:1361 +0xbee fp=0xc822637f98 sp=0xc822637c40
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc822637fa0 sp=0xc822637f98
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:1910 +0x3f6

About this issue

  • State: closed
  • Created 9 years ago
  • Comments: 71 (33 by maintainers)

Most upvoted comments

Yay! I’m very glad to see this resolved. It’s certainly been interesting trying to provide enough information as we tracked this down. Thanks for the support and help!

@aclements @davecheney @rsc Great news!

After running over the weekend with the system’s vm.max_map_count doubled, I observed 0 crashes in the past 72 hours, vs a minimum of 3 crashes per day since upgrading to 1.5.

[image: crashiness]
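
For anyone investigating a similar failure: the workaround above points at the kernel's per-process memory-mapping limit rather than actual memory exhaustion. Below is a minimal diagnostic sketch, not part of the original report, that compares a process's current mapping count (one line per mapping in /proc/<pid>/maps) against vm.max_map_count. The program name and its PID argument are illustrative only.

```go
// mapcheck: report how close a process is to the vm.max_map_count limit.
// If mappings keep getting split (for example by huge-page-related madvise
// calls), the count creeps toward the limit, further mmap calls fail with
// ENOMEM, and the Go runtime reports "fatal error: runtime: cannot allocate
// memory" as in the trace above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	pid := "self" // pass a PID as the first argument to inspect another process
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}

	limitRaw, err := os.ReadFile("/proc/sys/vm/max_map_count")
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading vm.max_map_count:", err)
		os.Exit(1)
	}
	limit := strings.TrimSpace(string(limitRaw))

	maps, err := os.Open("/proc/" + pid + "/maps")
	if err != nil {
		fmt.Fprintln(os.Stderr, "opening maps:", err)
		os.Exit(1)
	}
	defer maps.Close()

	// Each line of /proc/<pid>/maps describes one mapping.
	count := 0
	sc := bufio.NewScanner(maps)
	for sc.Scan() {
		count++
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "scanning maps:", err)
		os.Exit(1)
	}

	fmt.Printf("pid %s: %d mappings in use, vm.max_map_count = %s\n", pid, count, limit)
}
```

The current limit can also be read with `sysctl vm.max_map_count` and raised, as in the comment above, with `sysctl -w vm.max_map_count=<new value>` (add it to /etc/sysctl.conf to persist across reboots).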