go: runtime: ReadMemStats fatal error: mappedReady and other memstats are not equal
Go version
go version go1.21.4 linux/amd64 (also arm64)
Reproducibility
- Does this issue reproduce with the latest release?
What operating system and processor architecture are you using (go env)?
(NOTE: This is our build configuration; we also build for arm64 and see this crash there too)
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/gocache'
GOENV='/home/user/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY=''
GOROOT='/root/.gimme/versions/go1.21.4.linux.amd64'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/root/.gimme/versions/go1.21.4.linux.amd64/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.4'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD=''
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2851638515=/tmp/go-build -gno-record-gcc-switches'
What did you do?
We have a set of processes that periodically call expvar.Do, which calls runtime.ReadMemStats, to collect Go memory statistics. We are seeing occasional crashes with the message “mappedReady and other memstats are not equal” across multiple separate programs, on both amd64 and arm64. The error comes from this line of code in the Go runtime: https://github.com/golang/go/blob/master/src/runtime/mstats.go#L487.
These programs share some common metrics/monitoring code, so I suspect there is something all of these processes do that triggers the problem, but we have been unable to figure out what it might be. It seems to happen only after a process has been running for a few hours. Some of these processes use Cgo libraries, but some should be pure Go. Any suggestions for how to track this down would be appreciated.
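For reference, here is a minimal sketch of the kind of collection loop we run (the function name, interval, and output destination are illustrative, not our actual code). The built-in "memstats" expvar Var calls runtime.ReadMemStats whenever its String method is invoked, which is how fmt.Fprintf ends up in the stack traces below:

```go
package main

import (
	"expvar"
	"fmt"
	"os"
	"time"
)

// collect periodically dumps every published expvar. Formatting the
// built-in "memstats" Var with %s invokes its String method, which
// calls runtime.ReadMemStats under the hood.
func collect(interval time.Duration) {
	for range time.Tick(interval) {
		expvar.Do(func(kv expvar.KeyValue) {
			fmt.Fprintf(os.Stderr, "%s: %s\n", kv.Key, kv.Value)
		})
	}
}

func main() {
	go collect(30 * time.Second)
	select {} // block forever; the real programs do actual work here
}
```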
What did you expect to see?
No crashes.
What did you see instead?
Example crash from amd64
runtime: mappedReady=109281560
runtime: totalMapped=127983896
runtime: released=18767872
runtime: totalMapped-released=109216024
fatal error: mappedReady and other memstats are not equal
runtime stack:
runtime.throw({0x14be1ed?, 0x7fae80a11f88?})
/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/panic.go:1077 +0x5c fp=0x7faea7ff9b70 sp=0x7faea7ff9b40 pc=0x4844dc
runtime.readmemstats_m(0xc0017a3d88)
/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/mstats.go:493 +0x585 fp=0x7faea7ffa718 sp=0x7faea7ff9b70 pc=0x47e7e5
runtime.ReadMemStats.func1()
/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/mstats.go:364 +0x17 fp=0x7faea7ffa730 sp=0x7faea7ffa718 pc=0x47e237
traceback: unexpected SPWRITE function runtime.systemstack
runtime.systemstack()
/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/asm_amd64.s:509 +0x4a fp=0x7faea7ffa740 sp=0x7faea7ffa730 pc=0x4b8d6a
goroutine 14991247 [running]:
runtime.systemstack_switch()
/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0017a3d50 sp=0xc0017a3d40 pc=0x4b8d08
runtime.ReadMemStats(0xc0017a3d88)
/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/mstats.go:363 +0x49 fp=0xc0017a3d78 sp=0xc0017a3d50 pc=0x47e1e9
expvar.memstats()
/root/.gimme/versions/go1.21.4.linux.amd64/src/expvar/expvar.go:365 +0x36 fp=0xc0017a5428 sp=0xc0017a3d78 pc=0x967d96
expvar.Func.String(0xc000078508?)
/root/.gimme/versions/go1.21.4.linux.amd64/src/expvar/expvar.go:268 +0x16 fp=0xc0017a5450 sp=0xc0017a5428 pc=0x967516
fmt.(*pp).handleMethods(0xc003fee000, 0x14915a5?)
/root/.gimme/versions/go1.21.4.linux.amd64/src/fmt/print.go:673 +0x2b2 fp=0xc0017a56a8 sp=0xc0017a5450 pc=0x5404f2
fmt.(*pp).printArg(0xc003fee000, {0x12b5b00?, 0x1556250}, 0x73)
/root/.gimme/versions/go1.21.4.linux.amd64/src/fmt/print.go:756 +0x630 fp=0xc0017a5748 sp=0xc0017a56a8 pc=0x540f50
fmt.(*pp).doPrintf(0xc003fee000, {0x148e303, 0x6}, {0xc0017a5928?, 0x2, 0x2})
/root/.gimme/versions/go1.21.4.linux.amd64/src/fmt/print.go:1077 +0x39e fp=0xc0017a5840 sp=0xc0017a5748 pc=0x543a7e
fmt.Fprintf({0x165c6e0, 0xc000ea9500}, {0x148e303, 0x6}, {0xc003f3d928, 0x2, 0x2})
... our code calling expvar.Do omitted ...
Example crash from arm64
runtime: mappedReady=323105112
runtime: totalMapped=373518680
runtime: released=50479104
runtime: totalMapped-released=323039576
fatal error: mappedReady and other memstats are not equal
runtime stack:
runtime.throw({0x1a24762?, 0x40c34?})
/root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/panic.go:1077 +0x40 fp=0xffffcdb4ebe0 sp=0xffffcdb4ebb0 pc=0x4cd10
runtime.readmemstats_m(0x400f193d38)
/root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/mstats.go:493 +0x530 fp=0xffffcdb4f790 sp=0xffffcdb4ebe0 pc=0x46f60
runtime.ReadMemStats.func1()
/root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/mstats.go:364 +0x20 fp=0xffffcdb4f7b0 sp=0xffffcdb4f790 pc=0x46a10
traceback: unexpected SPWRITE function runtime.systemstack
runtime.systemstack()
/root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/asm_arm64.s:243 +0x6c fp=0xffffcdb4f7c0 sp=0xffffcdb4f7b0 pc=0x81fec
goroutine 42133645 [running]:
runtime.systemstack_switch()
/root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/asm_arm64.s:200 +0x8 fp=0x400f193cf0 sp=0x400f193ce0 pc=0x81f68
runtime.ReadMemStats(0x400f193d38)
/root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/mstats.go:363 +0x4c fp=0x400f193d20 sp=0x400f193cf0 pc=0x469cc
expvar.memstats()
/root/.gimme/versions/go1.21.3.linux.arm64/src/expvar/expvar.go:365 +0x4c fp=0x400f1953d0 sp=0x400f193d20 pc=0x665adc
expvar.Func.String(0x4008f9bce0?)
/root/.gimme/versions/go1.21.3.linux.arm64/src/expvar/expvar.go:268 +0x24 fp=0x400f195400 sp=0x400f1953d0 pc=0x665294
fmt.(*pp).handleMethods(0x400710a000, 0x19ddd30?)
/root/.gimme/versions/go1.21.3.linux.arm64/src/fmt/print.go:673 +0x21c fp=0x400f195660 sp=0x400f195400 pc=0x10ce8c
fmt.(*pp).printArg(0x400710a000, {0x16c0600?, 0x1ae4fc8}, 0x73)
/root/.gimme/versions/go1.21.3.linux.arm64/src/fmt/print.go:756 +0x5fc fp=0x400f195700 sp=0x400f195660 pc=0x10d8bc
fmt.(*pp).doPrintf(0x400710a000, {0x19d9db8, 0x6}, {0x400f195908?, 0x2, 0x2})
/root/.gimme/versions/go1.21.3.linux.arm64/src/fmt/print.go:1077 +0x2dc fp=0x400f195810 sp=0x400f195700 pc=0x1100bc
fmt.Fprintf({0x1ccc2c0, 0x400503c0e0}, {0x19d9db8, 0x6}, {0x4008afd908, 0x2, 0x2})
/root/.gimme/versions/go1.21.3.linux.arm64/src/fmt/print.go:224 +0x54 fp=0x400f195870 sp=0x400f195810 pc=0x10a404
... our code calling expvar.Do omitted ...
Change https://go.dev/cl/545277 mentions this issue: runtime: put ReadMemStats debug assertions behind a double-check mode
It seems like some of the historical data for this has moved outside its retention window 😞.
When the issue was first noticed internally, we actually suspected execution tracing, but ruled it out due to finding a crash that occurred 1 day prior to enabling execution tracing by default.
But given your comments above, and the fact that we don’t see this issue for our go1.20 services that don’t have execution tracing enabled, I suspect that we incorrectly ruled out execution tracing. Maybe the service had manually opted into this feature, or timezones, or *waves hand*.
So yeah, 👍 to putting these checks behind a double-check mode (is this the same as a debug mode?). Thank you so much for the quick help with this!
PS: If it’s possible to backport this to go1.21 that’d be awesome. Let me know if I can help with it.
Digging a little deeper on the tracer theory, I don’t think it can possibly be a preemption at a bad time. This is the entire window in which it could get preempted to cause skew, and that window definitely has zero preemption points (those method calls are inlined and intrinsified, and on top of that, all of this is always called on the system stack).
However, it’s definitely possible for sysmon to generate an event while the world is stopped, for example via injectglist. In the old tracer the event will get added to the global trace buffer. Given all the evidence, I’m fairly certain this is the root cause and the skew is transient. I’ll send a CL to put these checks behind a double-check mode.
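In case it helps clarify what “double-check mode” means (re: the debug-mode question above): the assertion stays in the runtime source but is skipped unless a constant is flipped by someone debugging the runtime. A minimal sketch of that pattern follows; the flag and function names are hypothetical, and this is not the actual CL:

```go
// Sketch of a double-check mode (illustrative only; see
// go.dev/cl/545277 for the real change). The consistency check still
// exists, but ordinary builds compile it out, so transient skew no
// longer crashes production binaries.
const doubleCheckReadMemStats = false // hypothetical flag; flip to true when debugging the runtime

func checkMemStatsConsistency(mappedReady, totalMapped, released uint64) {
	if !doubleCheckReadMemStats {
		return
	}
	if mappedReady != totalMapped-released {
		print("runtime: mappedReady=", mappedReady, "\n")
		print("runtime: totalMapped=", totalMapped, "\n")
		print("runtime: released=", released, "\n")
		print("runtime: totalMapped-released=", totalMapped-released, "\n")
		panic("mappedReady and other memstats are not equal")
	}
}
```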
We actually have automated execution tracing widely enabled in prod, and it’s possible that this is what started this problem (I’ll try to see if we have enough historical data to prove this).
Also: The go1.20 services mentioned by @evanj don’t have execution tracing enabled, so this might be a promising direction to investigate further, even if it’s not clear yet how this is possible.
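To poke at the tracer theory more directly, a harness along these lines could reproduce the conditions: enable execution tracing via runtime/trace (standing in for our prod automation) while a goroutine polls memstats through expvar. The file name, polling interval, and durations here are all made up:

```go
package main

import (
	"expvar"
	"log"
	"os"
	"runtime/trace"
	"time"
)

// Illustrative harness: run with execution tracing enabled while a
// goroutine polls memstats via expvar, then watch for the
// "mappedReady and other memstats are not equal" throw.
func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// The built-in "memstats" Var calls runtime.ReadMemStats every
	// time its String method is invoked.
	go func() {
		for range time.Tick(100 * time.Millisecond) {
			expvar.Do(func(kv expvar.KeyValue) {
				_ = kv.Value.String()
			})
		}
	}()

	time.Sleep(10 * time.Minute)
}
```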