go: runtime: memory corruption crashes with os/exec on Linux kernel 4.4
What version of Go are you using (`go version`)?
go version go1.8.1 linux/amd64
Some crashes were also observed with go version go1.6.3 linux/amd64, prompting an upgrade to go1.8.1.
What operating system and processor architecture are you using (`go env`)?
The affected servers are linux/amd64. We’ve seen the crashes when running Linux kernel version 4.4, and have not seen any crashes with Linux kernel version 3.2.
The `./go` command here was cross-compiled from darwin/amd64 and dropped onto an affected server:
```
$ GOROOT=/ ./go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/users/rhys/go"
GORACE=""
GOROOT="/"
GOTOOLDIR="/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
```
What did you do?
I have a monitoring daemon that periodically opens/uses/closes some unix domain sockets, reads files from virtual file systems e.g. under /sys/, uses os/exec to run short-lived child processes, sends UDP packets, makes HTTP requests, and provides some debug endpoints on an HTTP server.
It does not use its own cgo or unsafe (aside from the internals of github.com/golang/protobuf/proto), and the race detector has not detected any data races when run with a production workload. It is not specifically compiled with CGO_ENABLED=0 or with the netgo build tag, so the executables end up dynamically linked, probably due to the net package’s use of cgo.
As a test of cgo’s involvement we’ve built the program with CGO_ENABLED=0, resulting in a statically-linked executable. This version of the program still crashes occasionally on Linux kernel version 4.4.
The program is closed-source.
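For context, the child-process usage is the ordinary run-and-wait os/exec pattern. A minimal, purely illustrative sketch of that pattern (not the daemon’s actual code, which is closed-source):

```go
// Illustrative sketch only -- not the closed-source daemon. It shows the
// general shape of the usage described above: periodically run a
// short-lived child process and collect its result.
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	for range time.Tick(10 * time.Second) {
		// Run a short-lived child; /bin/true stands in for a real command.
		out, err := exec.Command("/bin/true").CombinedOutput()
		if err != nil {
			log.Printf("child process failed: %v (output: %q)", err, out)
		}
	}
}
```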
What did you expect to see?
Given the program’s track record of not crashing on Linux kernel version 3.2, I expected that running an identical binary on Linux kernel version 4.4 would also not crash.
What did you see instead?
The program crashes with a variety of “fatal error” messages and an occasional “panic”, each of which seems to point to memory corruption. Some examples follow; here’s what I think they indicate:
One of the “fatal error” crashes complains about the GC finding a pointer to an unallocated span. Others complain about unexpected fault addresses of small magnitude, like 0x18 or 0xffffffffffffffff. Either something is corrupting the memory allocator’s data structures (such as a data race invisible to the race detector, e.g. within the runtime package), or something is retaining memory in a way that’s invisible to the GC, allowing that memory to be reallocated with a different pointer map, which would allow small numbers to be written to words that are expected to hold pointers.
I don’t think it’s related to the Go 1.8 changes to argument liveness / runtime.KeepAlive, since there have been some crashes with go1.6.3.
I can’t say it’s related to crypto/rand’s use of getrandom(2), since one of the crashes below shows a goroutine in the code that reads from /dev/urandom.
All of the partial crash logs shown below are from go1.8.1 on Linux kernel version 4.4. We have a couple of examples of crashes with go1.6.3 on 4.4. We have not seen this code crash on Linux kernel version 3.2 with either go1.6.3 or go1.8.1.
/cc @aclements
Crash log excerpts follow:
```
fatal error: exitsyscall: syscall frame is no longer valid
fatal error: unexpected signal during runtime execution
fatal error: exitsyscall: syscall frame is no longer valid
fatal error: runtime: sudog with non-nil elem
fatal error: systemstack called from unexpected goroutine
[repeated lines removed]
fatal error: systemstack called from unexpected goroutine
runtime: newstack called from g=0xc4206196c0
m=0xc4204c8c00 m->curg=0xc420618ea0 m->g0=0xc420693a00 m->gsignal=0xc420693860
fatal error: unexpected signal during runtime execution
fatal error: systemstack called from unexpected goroutine
fatal error: systemstack called from unexpected goroutine
fatal error: systemstack called from unexpected goroutine
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state
fatal error: systemstack called from unexpected goroutine
[repeated lines removed]
fatal error: systemstack called from unexpected goroutine
fatal error: acquirep: already in go
fatal error: systemstack called from unexpected goroutine
fatal error: systemstack called from unexpected goroutine
runtime: newstack called from g=0xc4205edd40
m=0xc42006ec00 m->curg=0x0 m->g0=0xc4206a16c0 m->gsignal=0xc4206a1520
strings.genSplit(0xc4207b9416, 0xce, 0x88b0da, 0x1, 0x0, 0x53, 0x1, 0x88b0d0, 0x1)
/usr/local/go/src/strings/strings.go:245 +0x80
strings.Split(0xc4207b9416, 0xce, 0x88b0da, 0x1, 0x54, 0x54, 0x54)
/usr/local/go/src/strings/strings.go:283 +0x5b
fatal error: systemstack called from unexpected goroutine
[repeated lines removed]
fatal error: systemstack called from unexpected goroutine
runtime: newstack called from g=0xc4205beb60
m=0xc4202ee400 m->curg=0xc4200d5d40 m->g0=0xc42032c4e0 m->gsignal=0xc42032c340
runtime.SetFinalizer(0x87f880, 0xc42029a3f0, 0x7fa840, 0x8a8f90)
/usr/local/go/src/runtime/mfinal.go:407 +0x281
net.(*netFD).setAddr(0xc42029a3f0, 0xa735e0, 0xc420502780, 0xa735e0, 0xc420502760)
/usr/local/go/src/net/fd_unix.go:57 +0x93
fatal error: concurrent map writes
fatal error: runtime: releaseSudog with non-nil gp.param
goroutine 13 [running]:
runtime.throw(0x8a1c37, 0x2b)
/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc42013b068 sp=0xc42013b048
runtime.releaseSudog(0xc4202bd220)
/usr/local/go/src/runtime/proc.go:346 +0x3be fp=0xc42013b0f0 sp=0xc42013b068
runtime.selectgoImpl(0xc42013b4e8, 0x0, 0x18)
/usr/local/go/src/runtime/select.go:516 +0x157c fp=0xc42013b368 sp=0xc42013b0f0
unexpected fault address 0x0
fatal error: fault
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
runtime: newstack called from g=0xc4203ea340
m=0xc4202ea400 m->curg=0xc4201264e0 m->g0=0xc4202bbd40 m->gsignal=0xc4202bbba0
time.startTimer(0xc420155108)
/usr/local/go/src/runtime/time.go:68 +0x2b
time.AfterFunc(0x3b9ac0b0, 0xc420385210, 0xc4201a8b58)
/usr/local/go/src/time/sleep.go:165 +0xa6
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
runtime: empty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: MHeap_AllocLocked - MSpan not free
fatal error: workbuf is not empty
runtime: pointer 0xc42097c000 to unallocated span idx=0x4be span.base()=0xc4209a2000 span.limit=0xc4209a4000 span.state=0
runtime: found in object at *(0xc420358300+0x10)
object=0xc420358300 k=0x62101ac s.base()=0xc420358000 s.limit=0xc420359fe0 s.sizeclass=4 s.elemsize=48 s.state=_MSpanInUse
*(object+0) = 0x9
*(object+8) = 0x29d53e9900000100
*(object+16) = 0xc42097c000 <==
*(object+24) = 0x0
*(object+32) = 0x1
*(object+40) = 0x0
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)
fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0xffffffffffffffff pc=0x41466b]
fatal error: workbuf is not empty
fatal error: workbuf is empty
[repeated lines removed]
fatal error: workbuf is empty
fatal error: netpollUnblock: already closing
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45be9b]
goroutine 20 [running]:
panic(0x8143c0, 0xa9abd0)
/usr/local/go/src/runtime/panic.go:531 +0x1cf fp=0xc4204afb50 sp=0xc4204afab8
runtime.panicmem()
/usr/local/go/src/runtime/panic.go:63 +0x5e fp=0xc4204afb70 sp=0xc4204afb50
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:290 +0x29f fp=0xc4204afbc0 sp=0xc4204afb70
runtime.memmove(0x0, 0xc4207b9000, 0x8)
/usr/local/go/src/runtime/memmove_amd64.s:169 +0x6ab fp=0xc4204afbc8 sp=0xc4204afbc0
bufio.(*Reader).Read(0xc420896900, 0x0, 0x8, 0x8, 0x1000, 0x1000, 0xc4200474c0)
/usr/local/go/src/bufio/bufio.go:224 +0x442 fp=0xc4204afc58 sp=0xc4204afbc8
crypto/rand.(*devReader).Read(0xc420090510, 0x0, 0x8, 0x8, 0x0, 0x0, 0x0)
/usr/local/go/src/crypto/rand/rand_unix.go:64 +0x11a fp=0xc4204afd28 sp=0xc4204afc58
io.ReadAtLeast(0xa700e0, 0xc420090510, 0x0, 0x8, 0x8, 0x8, 0x7f8aa0, 0x1, 0x0)
/usr/local/go/src/io/io.go:307 +0xa9 fp=0xc4204afd88 sp=0xc4204afd28
io.ReadFull(0xa700e0, 0xc420090510, 0x0, 0x8, 0x8, 0x8a9738, 0x2, 0xc420276400)
/usr/local/go/src/io/io.go:325 +0x58 fp=0xc4204afde0 sp=0xc4204afd88
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x43245f]
runtime stack:
runtime.throw(0x8a1634, 0x2a)
/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0x7faf26ffcdd0 sp=0x7faf26ffcdb0
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:274 +0x2db fp=0x7faf26ffce20 sp=0x7faf26ffcdd0
```
GCC is doing a 4K stack probe here. Two things can happen: a segfault if that page is unmapped, or, in the general case, nothing, since the probe is just an OR with 0 (a read-modify-write that leaves the value unchanged). But since it’s not an atomic/locked op, if it races with another thread it can corrupt data.
So yeah, 104 bytes is not enough. Whether the kernel should be doing stack probes in the vDSO is another question entirely… but this explains why only Go, with its tight split stacks, breaks, and why this whole mess is so nondeterministic and the crashes are random, hard to reproduce, and require multiple threads.
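To make that hazard concrete, here is a contrived Go sketch, not from the thread, of the lost-update problem: one goroutine stands in for the probe’s non-atomic OR-with-0, the other for a thread that legitimately owns the memory being probed.

```go
// Deliberately racy demonstration of a lost update; run without -race.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var word uint64 // stands in for a word of another thread's memory
	var wg sync.WaitGroup
	wg.Add(2)

	// "Prober": a non-atomic read-modify-write that leaves the bits
	// unchanged, like GCC's OR-with-0 stack probe.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000000; i++ {
			v := word
			word = v | 0 // may write back a stale value
		}
	}()

	// Legitimate owner of the word.
	go func() {
		defer wg.Done()
		for i := uint64(1); i <= 1000000; i++ {
			word = i
		}
	}()

	wg.Wait()
	fmt.Println(word) // may print a stale value instead of 1000000
}
```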
Turns out my second time debugging an issue in the go runtime wound up being a bug in Go->C interop again. Whee.
Edit: just confirmed that hexediting those probes from 0x1020 to 0x20 bytes on one of the crash-inducing kernels cures the crashes.
Turns out my repro is related to the compiler and compiler flags used to compile the kernel. Will now try to narrow down which of the changes introduced by flipping CONFIG_OPTIMIZE_INLINING causes the issue.
Of course the issue can be fixed in the kernel (and that might even make sense). But right now it’s Go making the undocumented assumptions. So there are two options here (not mutually exclusive):
1. Fix Go to stop assuming that vDSO calls consume almost no stack.
2. Fix the kernel so that the vDSO doesn’t touch stack memory beyond what the ABI guarantees.
Doing 2 and not 1 of course means accepting that Go code will potentially break on kernels that predate the change.
That is already an impressive job of fault isolation.
@marcan: Awesome debugging. Thanks!
The 10ns slowdown is unfortunate, but if that’s what we have to do, that’s what we have to do. I made some suggestions on Ian’s CL that may speed things up, but we’ll see.
I emailed the x86 vDSO maintainer (cc: x86, linux-kernel, kernel-hardening) to see what their take on all this is (but I still think Go should fix this, e.g. somehow allocate more stack space or disable vDSO usage on current kernels if the overhead of cgo/stack munging is greater than just using a syscall).
Edit: thread
Note that I came here through node_exporter crashing, and it doesn’t use os/exec. I think that reproducer just so happens to be good at triggering this issue by coincidence, even though it has nothing to do with exec directly.

Also keep in mind that CONFIG_OPTIMIZE_INLINING is somewhat coincidental too; for some reason, flipping that flag off (which is supposed to force more functions to be inlined) makes GCC not inline the vread_tsc function. That function isn’t marked inline, so GCC is perfectly within its rights to do so. But there’s no guarantee that it won’t happen under some other combination of GCC version and flags. Basically, any GCC with -fstack-check enabled by default can trigger this bug if the vDSO function makes any function calls at all as compiled, which could happen for any reason, since the source code does indeed make function calls. You’re depending on everything getting inlined to not crash, even though the kernel source makes no attempt to guarantee that.

@marcan Thanks very much for the detailed analysis.
That is an odd result, though, because as far as I can tell it doesn’t have anything to do with forking. I think you’ve found a code path that could cause a random crash in approximately any amd64 GNU/Linux Go program at approximately any time, if running on a kernel compiled with CONFIG_OPTIMIZE_INLINING turned off (fortunately that is not the default). We clearly need to fix that, somehow. But now I’m confused as to why it happens with this particular reproducer.

Upgrading to go1.8.3 did not resolve the issue, and we’ve narrowed it down to being related to os/exec.
A coworker reminded me of the FreeBSD os/exec memory corruption bug—#15658—and indeed, the first reproducer there (from May 11, 2016) leads to crashes on a Linux 4.4 system.
Here are the beginnings of the three crashes I’ve seen with the May 11, 2016 reproducer so far (from spending around 100–250 process-hours of execution):
I tried the March 8, 2017 reproducer with seven variants, but none crashed within a few seconds. I haven’t seen variant (1) crash after 60 process-hours of execution.
My very slow and not terribly reliable reproducer (version c7141caadf in a private repo) is below:
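(The actual source is private, so what follows is only a rough, hypothetical sketch of the shape such a reproducer might take, inferred from the `-cmd`, `-workers`, and `-cmd_iter` flags mentioned in the next paragraph; none of this is the real code.)

```go
// Hypothetical sketch of an os/exec stress reproducer; the flag names are
// taken from the invocation described below, everything else is guessed.
package main

import (
	"flag"
	"log"
	"os/exec"
	"sync"
)

func main() {
	cmd := flag.String("cmd", "/bin/true", "command to run repeatedly")
	workers := flag.Int("workers", 10, "number of concurrent exec loops")
	cmdIter := flag.Int("cmd_iter", 300, "exec calls per worker")
	flag.Parse()

	var wg sync.WaitGroup
	for w := 0; w < *workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < *cmdIter; i++ {
				if err := exec.Command(*cmd).Run(); err != nil {
					log.Fatalf("%s: %v", *cmd, err)
				}
			}
		}()
	}
	wg.Wait()
}
```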
It crashed with the following error while running under `perf trace record` with the arguments `-cmd=/bin/true -workers=10 -cmd_iter=300`. That was one crash out of 4480 attempts.

The “stress” command here is golang.org/x/tools/cmd/stress.
I don’t see an obvious smoking gun in the `perf trace` output. I’ve shared a portion of it at https://gist.github.com/rhysh/be58b5a5f3a41e218a05f1d8c3522dc0: Line 7 shows thread 16347 beginning a call to clone; lines 671 and 672 show that call resulting in the creation of process 86746. Lines 677 and 688 show the execve call for /bin/true, and line 727 shows the exit_group call that ends the /bin/true process. Line 984 shows thread 16347 “returning” from rt_sigreturn (I don’t understand how this works) with value 13, which happens to be the number of bytes in "fatal error: ", followed by a 42-byte write to fd 2 (probably “unexpected signal during runtime execution”) and a 1-byte write to fd 2 (probably “\n”).

Thank you all for the ideas. I’ll upgrade to go1.8.3 soon after it’s out; it looks like it’s slated to include cherry-picks of both CL 43294 and CL 43311.
This program doesn’t use GOEXPERIMENT=preemptibleloops, and I agree that we can likely rule out inter-runtime signal handler disputes since it’s built with CGO_ENABLED=0.

The program uses crypto/rand.Read, which tries to use getrandom(2) when available (on Linux 3.17+). Should we be suspicious of this feature?
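For reference, the code path in question is reachable with a plain crypto/rand.Read call; a minimal sketch:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

func main() {
	buf := make([]byte, 8)
	// On Linux 3.17+ this uses getrandom(2); on older kernels it falls
	// back to reading /dev/urandom, the path visible in the
	// crypto/rand.(*devReader).Read frames in one of the traces above.
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	fmt.Printf("%x\n", buf)
}
```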
I got the following crash from an instance of the program that was running with GODEBUG=gcrescanstacks=1,gccheckmark=1 (and GOGC=1). The crash doesn’t look like it’s from the GC assertions that those settings enable. Setting GOGC=1 on the test instance doesn’t seem to have made the program crash more often.

This program uses Unix domain sockets much more actively than most of our others. Some of the crashes are encountered near the netpoller (like the one below), but I don’t know if that’s because of memory corruption coming from the netpoller, or if the program is just as likely to crash on any active goroutine and a lot of the active goroutines are working with UDS.