go: runtime: memory corruption crashes with os/exec on Linux kernel 4.4

What version of Go are you using (go version)?

go version go1.8.1 linux/amd64

Some crashes were also observed with go version go1.6.3 linux/amd64, prompting an upgrade to go1.8.1.

What operating system and processor architecture are you using (go env)?

The affected servers are linux/amd64. We’ve seen the crashes when running Linux kernel version 4.4, and have not seen any crashes with Linux kernel version 3.2.

The ./go command here was cross-compiled from darwin/amd64 and dropped onto an affected server:

$ GOROOT=/ ./go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/users/rhys/go"
GORACE=""
GOROOT="/"
GOTOOLDIR="/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

I have a monitoring daemon that periodically opens/uses/closes some unix domain sockets, reads files from virtual file systems (e.g. under /sys/), uses os/exec to run short-lived child processes, sends UDP packets, makes HTTP requests, and provides some debug endpoints on an HTTP server.

It does not use its own cgo or unsafe (aside from the internals of github.com/golang/protobuf/proto), and the race detector has not detected any data races when run with a production workload. It is not specifically compiled with CGO_ENABLED=0 or with the netgo build tag, so the executables end up dynamically linked, probably due to the net package’s use of cgo.

As a test of cgo’s involvement we’ve built the program with CGO_ENABLED=0, resulting in a statically-linked executable. This version of the program still crashes occasionally on Linux kernel version 4.4.

The program is closed-source.

What did you expect to see?

Given the program’s track record of not crashing on Linux kernel version 3.2, I expected that running an identical binary on Linux kernel version 4.4 would also not crash.

What did you see instead?

The program crashes with a variety of “fatal error” messages and an occasional “panic”, each of which seems to point to memory corruption. Some examples follow; here’s what I think they indicate:

One of the “fatal error” crashes complains about the GC finding a pointer to an unallocated span. Others complain about unexpected fault addresses with small magnitudes, like 0x18 or 0xffffffffffffffff. Either something is corrupting the memory allocator’s data structures (for example, a data race that is invisible to the race detector, such as one within the runtime package), or something is retaining memory in a way that’s invisible to the GC, allowing that memory to be reallocated with a different pointer map, which would let small numbers be written to words that are expected to hold pointers.
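As an illustration of that second failure mode (purely hypothetical here, since this program avoids unsafe): keeping only an integer copy of a pointer hides the allocation from the GC, and once the memory is reused under a different type, a stale write can plant a small integer in a word the GC expects to hold a pointer.

package main

import (
	"runtime"
	"unsafe"
)

func main() {
	// Keep only the integer value of the pointer; the GC no longer sees a
	// reference to this 48-byte allocation.
	p := uintptr(unsafe.Pointer(new([6]uintptr)))

	runtime.GC() // the allocation is now unreachable and its memory may be reused

	// Resurrecting the pointer and writing through it is a use-after-free. If
	// the memory was reallocated for an object whose third word is a pointer,
	// the GC can later trip over 0x18 as a bad pointer or fault on it.
	q := (*[6]uintptr)(unsafe.Pointer(p))
	q[2] = 0x18
}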

I don’t think it’s related to the Go 1.8 changes to argument liveness / runtime.KeepAlive, since there have been some crashes with go1.6.3.

I can’t say it’s related to crypto/rand’s use of getrandom(2), since one of the crashes below shows a goroutine in the code that reads from /dev/urandom.

All of the partial crash logs shown below are from go1.8.1 on Linux kernel version 4.4. We have a couple of examples of crashes with go1.6.3 on 4.4. We have not seen this code crash on Linux kernel version 3.2 with either go1.6.3 or go1.8.1.

/cc @aclements

Crash log excerpts follow:

fatal error: exitsyscall: syscall frame is no longer valid
fatal error: unexpected signal during runtime execution
fatal error: exitsyscall: syscall frame is no longer valid
fatal error: runtime: sudog with non-nil elem
fatal error: systemstack called from unexpected goroutine
[repeated lines removed]
fatal error: systemstack called from unexpected goroutine
runtime: newstack called from g=0xc4206196c0
    m=0xc4204c8c00 m->curg=0xc420618ea0 m->g0=0xc420693a00 m->gsignal=0xc420693860
fatal error: unexpected signal during runtime execution
fatal error: systemstack called from unexpected goroutine
fatal error: systemstack called from unexpected goroutine
fatal error: systemstack called from unexpected goroutine
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state
fatal error: systemstack called from unexpected goroutine
[repeated lines removed]
fatal error: systemstack called from unexpected goroutine
fatal error: acquirep: already in go
fatal error: systemstack called from unexpected goroutine
fatal error: systemstack called from unexpected goroutine
runtime: newstack called from g=0xc4205edd40
    m=0xc42006ec00 m->curg=0x0 m->g0=0xc4206a16c0 m->gsignal=0xc4206a1520
strings.genSplit(0xc4207b9416, 0xce, 0x88b0da, 0x1, 0x0, 0x53, 0x1, 0x88b0d0, 0x1)
    /usr/local/go/src/strings/strings.go:245 +0x80
strings.Split(0xc4207b9416, 0xce, 0x88b0da, 0x1, 0x54, 0x54, 0x54)
    /usr/local/go/src/strings/strings.go:283 +0x5b
fatal error: systemstack called from unexpected goroutine
[repeated lines removed]
fatal error: systemstack called from unexpected goroutine
runtime: newstack called from g=0xc4205beb60
    m=0xc4202ee400 m->curg=0xc4200d5d40 m->g0=0xc42032c4e0 m->gsignal=0xc42032c340
runtime.SetFinalizer(0x87f880, 0xc42029a3f0, 0x7fa840, 0x8a8f90)
    /usr/local/go/src/runtime/mfinal.go:407 +0x281
net.(*netFD).setAddr(0xc42029a3f0, 0xa735e0, 0xc420502780, 0xa735e0, 0xc420502760)
    /usr/local/go/src/net/fd_unix.go:57 +0x93
fatal error: concurrent map writes
fatal error: runtime: releaseSudog with non-nil gp.param

goroutine 13 [running]:
runtime.throw(0x8a1c37, 0x2b)
    /usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc42013b068 sp=0xc42013b048
runtime.releaseSudog(0xc4202bd220)
    /usr/local/go/src/runtime/proc.go:346 +0x3be fp=0xc42013b0f0 sp=0xc42013b068
runtime.selectgoImpl(0xc42013b4e8, 0x0, 0x18)
    /usr/local/go/src/runtime/select.go:516 +0x157c fp=0xc42013b368 sp=0xc42013b0f0
unexpected fault address 0x0
fatal error: fault
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
runtime: newstack called from g=0xc4203ea340
    m=0xc4202ea400 m->curg=0xc4201264e0 m->g0=0xc4202bbd40 m->gsignal=0xc4202bbba0
time.startTimer(0xc420155108)
    /usr/local/go/src/runtime/time.go:68 +0x2b
time.AfterFunc(0x3b9ac0b0, 0xc420385210, 0xc4201a8b58)
    /usr/local/go/src/time/sleep.go:165 +0xa6
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
runtime: empty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: MHeap_AllocLocked - MSpan not free
fatal error: workbuf is not empty
runtime: pointer 0xc42097c000 to unallocated span idx=0x4be span.base()=0xc4209a2000 span.limit=0xc4209a4000 span.state=0
runtime: found in object at *(0xc420358300+0x10)
object=0xc420358300 k=0x62101ac s.base()=0xc420358000 s.limit=0xc420359fe0 s.sizeclass=4 s.elemsize=48 s.state=_MSpanInUse
 *(object+0) = 0x9
 *(object+8) = 0x29d53e9900000100
 *(object+16) = 0xc42097c000 <==
 *(object+24) = 0x0
 *(object+32) = 0x1
 *(object+40) = 0x0
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)
fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0xffffffffffffffff pc=0x41466b]
fatal error: workbuf is not empty
fatal error: workbuf is empty
[repeated lines removed]
fatal error: workbuf is empty
fatal error: netpollUnblock: already closing
 panic: runtime error: invalid memory address or nil pointer dereference
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45be9b]

 goroutine 20 [running]:
 panic(0x8143c0, 0xa9abd0)
    /usr/local/go/src/runtime/panic.go:531 +0x1cf fp=0xc4204afb50 sp=0xc4204afab8
 runtime.panicmem()
    /usr/local/go/src/runtime/panic.go:63 +0x5e fp=0xc4204afb70 sp=0xc4204afb50
 runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:290 +0x29f fp=0xc4204afbc0 sp=0xc4204afb70
 runtime.memmove(0x0, 0xc4207b9000, 0x8)
    /usr/local/go/src/runtime/memmove_amd64.s:169 +0x6ab fp=0xc4204afbc8 sp=0xc4204afbc0
 bufio.(*Reader).Read(0xc420896900, 0x0, 0x8, 0x8, 0x1000, 0x1000, 0xc4200474c0)
    /usr/local/go/src/bufio/bufio.go:224 +0x442 fp=0xc4204afc58 sp=0xc4204afbc8
 crypto/rand.(*devReader).Read(0xc420090510, 0x0, 0x8, 0x8, 0x0, 0x0, 0x0)
    /usr/local/go/src/crypto/rand/rand_unix.go:64 +0x11a fp=0xc4204afd28 sp=0xc4204afc58
 io.ReadAtLeast(0xa700e0, 0xc420090510, 0x0, 0x8, 0x8, 0x8, 0x7f8aa0, 0x1, 0x0)
    /usr/local/go/src/io/io.go:307 +0xa9 fp=0xc4204afd88 sp=0xc4204afd28
 io.ReadFull(0xa700e0, 0xc420090510, 0x0, 0x8, 0x8, 0x8a9738, 0x2, 0xc420276400)
    /usr/local/go/src/io/io.go:325 +0x58 fp=0xc4204afde0 sp=0xc4204afd88
 fatal error: unexpected signal during runtime execution
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x43245f]

 runtime stack:
 runtime.throw(0x8a1634, 0x2a)
    /usr/local/go/src/runtime/panic.go:596 +0x95 fp=0x7faf26ffcdd0 sp=0x7faf26ffcdb0
 runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:274 +0x2db fp=0x7faf26ffce20 sp=0x7faf26ffcdd0

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 5
  • Comments: 78 (74 by maintainers)

Most upvoted comments

TEXT runtime·walltime(SB),NOSPLIT,$16
	// Be careful. We're calling a function with gcc calling convention here.
	// We're guaranteed 128 bytes on entry, and we've taken 16, and the
	// call uses another 8.
	// That leaves 104 for the gettime code to use. Hope that's enough!

0000000000000030 <__vdso_clock_gettime>:
  30:	55                   	push   %rbp
  31:	48 89 e5             	mov    %rsp,%rbp
  34:	48 81 ec 20 10 00 00 	sub    $0x1020,%rsp
  3b:	48 83 0c 24 00       	orq    $0x0,(%rsp)
  40:	48 81 c4 20 10 00 00 	add    $0x1020,%rsp

GCC is doing a 4K stack probe here. Two things could happen: a segfault if that page is unmapped, or, in the general case, nothing, since it’s just an OR with 0. But the probe touches memory roughly 4 KB below the stack pointer on entry, which on a goroutine’s small stack can belong to something else entirely, and since it’s not an atomic/locked op, if it races with another thread’s write to that memory it can corrupt data.

So yeah, 104 bytes is not enough. Whether the kernel should be doing stack probes in the vDSO is another question entirely… but this explains why only Go, with its tight split stacks, breaks, and why this whole mess is so nondeterministic and the crashes are random, hard to reproduce, and require multiple threads.

Turns out my second time debugging an issue in the go runtime wound up being a bug in Go->C interop again. Whee.

Edit: just confirmed that hexediting those probes from 0x1020 to 0x20 bytes on one of the crash-inducing kernels cures the crashes.
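If you want to check what your own kernel’s vDSO looks like, here’s a quick hypothetical helper (my sketch, not part of any reproducer): it dumps the vDSO image mapped into the current process so you can run objdump -d on it and look for the sub $0x1020,%rsp / orq $0x0,(%rsp) probe sequence.

package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"strings"
)

func main() {
	maps, err := ioutil.ReadFile("/proc/self/maps")
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(string(maps), "\n") {
		if !strings.Contains(line, "[vdso]") {
			continue
		}
		// Each maps line begins with "start-end perms offset ...".
		var start, end uint64
		if _, err := fmt.Sscanf(line, "%x-%x", &start, &end); err != nil {
			panic(err)
		}
		// Read the mapping back out through /proc/self/mem rather than via
		// unsafe pointer arithmetic.
		mem, err := os.Open("/proc/self/mem")
		if err != nil {
			panic(err)
		}
		buf := make([]byte, end-start)
		if _, err := mem.ReadAt(buf, int64(start)); err != nil {
			panic(err)
		}
		mem.Close()
		if err := ioutil.WriteFile("vdso.so", buf, 0644); err != nil {
			panic(err)
		}
		fmt.Println("wrote vdso.so; inspect with: objdump -d vdso.so")
	}
}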

Turns out my repro is related to the compiler and compiler flags used to compile the kernel.

  • linux-4.13.9-gentoo + gcc (Gentoo Hardened 5.4.0-r3 p1.3, pie-0.6.5) 5.4.0 + CONFIG_OPTIMIZE_INLINING=n - no crash
  • linux-4.13.9-gentoo + gcc (Gentoo Hardened 6.4.0 p1.0) 6.4.0 + CONFIG_OPTIMIZE_INLINING=n - crash
  • linux-4.13.9-gentoo + gcc (Gentoo Hardened 6.4.0 p1.0) 6.4.0 + CONFIG_OPTIMIZE_INLINING=y - no crash

Will now try to narrow down which change introduced by flipping CONFIG_OPTIMIZE_INLINING causes the issue.

Of course the issue can be fixed in the kernel (and that might even make sense). But right now it’s Go making the undocumented assumptions. So there are two options here (not mutually exclusive):

  1. Fix Go so it stops assuming vDSO stack usage limits
  2. Convince the kernel guys to change the vDSO spec to guarantee a maximum stack usage, which may include disabling stack probing and/or inlining everything

Doing 2 and not 1 of course means accepting that Go code will potentially break on certain kernels before the change.
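For a sense of what option 1 costs if Go simply stops trusting the vDSO, the fallback is the ordinary clock_gettime syscall. A minimal sketch of issuing it directly from Go (illustration only; the runtime’s own fallback lives in its assembly, not in user code):

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

func main() {
	// CLOCK_MONOTONIC is 1 on Linux. Entering the kernel here skips the vDSO
	// fast path entirely, so no assumptions about its stack usage apply; the
	// price is a full syscall on every call.
	var ts syscall.Timespec
	_, _, errno := syscall.Syscall(syscall.SYS_CLOCK_GETTIME, 1, uintptr(unsafe.Pointer(&ts)), 0)
	if errno != 0 {
		panic(errno)
	}
	fmt.Printf("monotonic: %d.%09d\n", ts.Sec, ts.Nsec)
}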

That is already an impressive job of fault isolation.

@marcan: Awesome debugging. Thanks!

The 10ns slowdown is unfortunate, but if that’s what we have to do, that’s what we have to do. I made some suggestions on Ian’s CL that may speed things up, but we’ll see.

I emailed the x86 vDSO maintainer (cc: x86, linux-kernel, kernel-hardening) to see what their take on all this is (but I still think Go should fix this, e.g. somehow allocate more stack space or disable vDSO usage on current kernels if the overhead of cgo/stack munging is greater than just using a syscall).

Edit: thread

Note that I came here through node_exporter crashing, and it doesn’t use os/exec. I think that reproducer just so happens to be good at triggering this issue by coincidence, even though it has nothing to do with exec directly.

Also keep in mind that CONFIG_OPTIMIZE_INLINING is somewhat coincidental too; for some reason flipping that flag off (which is supposed to force more functions to be inlined) is making GCC not inline the vread_tsc function. That function isn’t marked inline, so GCC is perfectly within its rights to do so. But there’s no guarantee that it won’t happen under some other combination of GCC version and flags, too. Basically, any GCC with -fstack-check enabled by default can trigger this bug if the vDSO function makes any function calls at all as compiled, which could happen for any reason, since the source code does indeed make function calls. You’re depending on everything getting inlined to not crash, even though the kernel source makes no attempt to guarantee that.

@marcan Thanks very much for the detailed analysis.

That is an odd result, though, because as far as I can tell it doesn’t have anything to do with forking. I think you’ve found a code path that could cause a random crash in approximately any amd64 GNU/Linux Go program at approximately any time, if running on a kernel compiled with CONFIG_OPTIMIZE_INLINING turned off (fortunately that is not the default). We clearly need to fix that, somehow. But now I’m confused as to why it happens with this particular reproducer.

Upgrading to go1.8.3 did not resolve the issue, and we’ve narrowed it down to being related to os/exec.

A coworker reminded me of the FreeBSD os/exec memory corruption bug—#15658—and indeed, the first reproducer there (from May 11, 2016) leads to crashes on a Linux 4.4 system.

$ uname -a
Linux redacted 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
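The #15658 reproducer boils down to hammering os/exec from a batch of goroutines, each repeatedly starting and waiting on a trivial child. A rough sketch of its shape (not the actual /tmp/15658.go, whose line numbers appear in the traces below):

package main

import (
	"fmt"
	"os/exec"
)

func run(done chan<- error) {
	for i := 0; i < 1000; i++ {
		if err := exec.Command("/bin/true").Run(); err != nil {
			done <- err
			return
		}
	}
	done <- nil
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")
	done := make(chan error)
	const workers = 16
	for i := 0; i < workers; i++ {
		go run(done)
	}
	for i := 0; i < workers; i++ {
		if err := <-done; err != nil {
			panic(err)
		}
	}
}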

Here are the beginnings of the three crashes I’ve seen with the May 11, 2016 reproducer so far (from spending around 100–250 process-hours of execution):

Starting a bunch of goroutines...
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x461df2]

goroutine 189069 [running]:
runtime.throw(0x4cd41c, 0x5)
	/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc420067a18 sp=0xc4200679f8
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:297 +0x28c fp=0xc420067a68 sp=0xc420067a18
syscall.ByteSliceFromString(0x6962732f7273752f, 0x657572742f6e, 0xc4202ce130, 0xc4202ce070, 0x9, 0xc4202ce110, 0xa)
	/usr/local/go/src/syscall/syscall.go:51 +0x32 fp=0xc420067ac8 sp=0xc420067a68
syscall.BytePtrFromString(0x6962732f7273752f, 0x657572742f6e, 0x2, 0xc4202ce130, 0x2)
	/usr/local/go/src/syscall/syscall.go:71 +0x39 fp=0xc420067b10 sp=0xc420067ac8
syscall.SlicePtrFromStrings(0xc4202ce000, 0x1, 0x1, 0x0, 0x0, 0x7f2bde7d3113, 0x7f2bde7d3140, 0x1e)
	/usr/local/go/src/syscall/exec_unix.go:87 +0x97 fp=0xc420067b80 sp=0xc420067b10
syscall.forkExec(0xc4202ce070, 0x9, 0xc4202ce000, 0x1, 0x1, 0xc420067d90, 0x0, 0x0, 0x0)
	/usr/local/go/src/syscall/exec_unix.go:154 +0x10d fp=0xc420067ca8 sp=0xc420067b80
syscall.StartProcess(0xc4202ce070, 0x9, 0xc4202ce000, 0x1, 0x1, 0xc420067d90, 0x2, 0x4, 0x0, 0xc420067d60)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc420067d00 sp=0xc420067ca8
os.startProcess(0xc4202ce070, 0x9, 0xc4202ce000, 0x1, 0x1, 0xc420067f38, 0xc4203f84e0, 0x1a, 0x1a)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3 fp=0xc420067de8 sp=0xc420067d00
os.StartProcess(0xc4202ce070, 0x9, 0xc4202ce000, 0x1, 0x1, 0xc420067f38, 0x0, 0x0, 0xc4202ce070)
	/usr/local/go/src/os/exec.go:94 +0x64 fp=0xc420067e40 sp=0xc420067de8
os/exec.(*Cmd).Start(0xc4203e0000, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:359 +0x3d2 fp=0xc420067f90 sp=0xc420067e40
main.run(0xc420092060)
	/tmp/15658.go:11 +0x68 fp=0xc420067fd8 sp=0xc420067f90
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420067fe0 sp=0xc420067fd8
created by main.main
	/tmp/15658.go:35 +0x141

goroutine 1 [runnable]:
main.main()
	/tmp/15658.go:34 +0x11f

[...]
Starting a bunch of goroutines...
fatal error: forEachP: P did not run fn
acquirep: p->m=842353238016(37) p->status=1
fatal error: acquirep: invalid p state
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4d09a1, 0x1a)
	/usr/local/go/src/runtime/panic.go:596 +0x95
runtime.forEachP(0x4d4460)
	/usr/local/go/src/runtime/proc.go:1276 +0x2df
runtime.gcMarkDone.func1()
	/usr/local/go/src/runtime/mgc.go:1153 +0x2d
runtime.systemstack(0xc420436d40)
	/usr/local/go/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1132

goroutine 400850 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:281 fp=0xc42012ef30 sp=0xc42012ef28
runtime.gcMarkDone()
	/usr/local/go/src/runtime/mgc.go:1154 +0xe9 fp=0xc42012ef50 sp=0xc42012ef30
runtime.gcBgMarkWorker(0xc42001b300)
	/usr/local/go/src/runtime/mgc.go:1581 +0x303 fp=0xc42012efd8 sp=0xc42012ef50
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc42012efe0 sp=0xc42012efd8
created by runtime.gcBgMarkStartWorkers
	/usr/local/go/src/runtime/mgc.go:1412 +0x98

goroutine 1 [chan receive]:
main.main()
	/tmp/15658.go:34 +0x11f

[...]
Starting a bunch of goroutines...
fatal error: workbuf is not empty

runtime stack:
runtime.throw(0x4cfa21, 0x14)
	/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc420195df0 sp=0xc420195dd0
runtime.(*workbuf).checkempty(0x7f9ead2609c0)
	/usr/local/go/src/runtime/mgcwork.go:320 +0x4e fp=0xc420195e10 sp=0xc420195df0
runtime.getempty(0xc420023828)
	/usr/local/go/src/runtime/mgcwork.go:332 +0x96 fp=0xc420195e48 sp=0xc420195e10
runtime.(*gcWork).init(0xc420022528)
	/usr/local/go/src/runtime/mgcwork.go:90 +0x22 fp=0xc420195e60 sp=0xc420195e48
runtime.(*gcWork).tryGet(0xc420022528, 0x27)
	/usr/local/go/src/runtime/mgcwork.go:156 +0xdb fp=0xc420195e88 sp=0xc420195e60
runtime.gcDrain(0xc420022528, 0x2)
	/usr/local/go/src/runtime/mgcmark.go:1075 +0x2eb fp=0xc420195ec0 sp=0xc420195e88
runtime.gchelper()
	/usr/local/go/src/runtime/mgc.go:1912 +0x110 fp=0xc420195ef0 sp=0xc420195ec0
runtime.stopm()
	/usr/local/go/src/runtime/proc.go:1653 +0xdd fp=0xc420195f18 sp=0xc420195ef0
runtime.gcstopm()
	/usr/local/go/src/runtime/proc.go:1854 +0xb7 fp=0xc420195f40 sp=0xc420195f18
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2186 +0x110 fp=0xc420195f80 sp=0xc420195f40
runtime.park_m(0xc4202ec000)
	/usr/local/go/src/runtime/proc.go:2285 +0xab fp=0xc420195fb8 sp=0xc420195f80
runtime.mcall(0x0)
	/usr/local/go/src/runtime/asm_amd64.s:269 +0x5b fp=0xc420195fc8 sp=0xc420195fb8

goroutine 1 [chan receive]:
runtime.gopark(0x4d45a8, 0xc4200940b8, 0x4ce3a3, 0xc, 0x17, 0x3)
	/usr/local/go/src/runtime/proc.go:271 +0x13a fp=0xc420064e00 sp=0xc420064dd0
runtime.goparkunlock(0xc4200940b8, 0x4ce3a3, 0xc, 0x44b417, 0x3)
	/usr/local/go/src/runtime/proc.go:277 +0x5e fp=0xc420064e40 sp=0xc420064e00
runtime.chanrecv(0x4ab280, 0xc420094060, 0x0, 0xc420064f01, 0x49a6f1)
	/usr/local/go/src/runtime/chan.go:513 +0x371 fp=0xc420064ee0 sp=0xc420064e40
runtime.chanrecv1(0x4ab280, 0xc420094060, 0x0)
	/usr/local/go/src/runtime/chan.go:395 +0x35 fp=0xc420064f18 sp=0xc420064ee0
main.main()
	/tmp/15658.go:34 +0x11f fp=0xc420064f88 sp=0xc420064f18
runtime.main()
	/usr/local/go/src/runtime/proc.go:185 +0x20a fp=0xc420064fe0 sp=0xc420064f88
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420064fe8 sp=0xc420064fe0

[...]

I tried the March 8, 2017 reproducer with seven variants, but none crashed within a few seconds. I haven’t seen variant (1) crash after 60 process-hours of execution.


My very slow and not terribly reliable reproducer (version c7141caadf in a private repo) is below:

package main

import (
	"context"
	"flag"
	"log"
	"os/exec"
	"sync"
	"sync/atomic"
)

func main() {
	var (
		workers = flag.Int("workers", 10, "worker count")
		cmdIter = flag.Int("cmd_iter", 10000, "iterations per command worker")
		cmd     = flag.String("cmd", "/bin/true", "command to run")
	)
	flag.Parse()

	ctx := context.Background()

	var wg sync.WaitGroup
	ctx, cancel := context.WithCancel(ctx)
	errs := make(chan error, 1)
	for i := 0; i < *workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			for j := 0; j < *cmdIter; j++ {
				sink.Store(make([]*byte, 1<<20))

				err := doCmd(ctx, *cmd)

				if err != nil {
					cancel()
					select {
					case errs <- err:
					default:
					}
					return
				}
			}
		}()
	}

	wg.Wait()
	cancel()
	close(errs)
	err := <-errs
	if err != nil {
		log.Fatalf("%v", err)
	}
}

var sink atomic.Value

func doCmd(ctx context.Context, command string) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	cmd := exec.Command(command)

	err := cmd.Start()
	if err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() {
		err := cmd.Wait()
		done <- err
	}()

	select {
	case <-ctx.Done():
		err := cmd.Process.Kill()
		if err != nil {
			return err
		}
		return ctx.Err()
	case err = <-done:
		return err
	}
}

It crashed with the following error while running under perf trace record with the arguments -cmd=/bin/true -workers=10 -cmd_iter=300. That was one crash out of 4480 attempts.

The “stress” command here is golang.org/x/tools/cmd/stress

$ sudo ./stress -timeout=1h -p=10 -- bash -c 'trace_file="/tmp/perf-trace-$(date +%s)-$$" && echo "$trace_file" && perf trace record --quiet --output="$trace_file" -- ./20427.c7141caadf -cmd=/bin/true -workers=10 -cmd_iter=300 && (rm -- "$trace_file" || true)'
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x42d1bf]

runtime stack:
runtime.throw(0x4e755d, 0x2a)
	/usr/local/go/src/runtime/panic.go:596 +0x95
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:274 +0x2db
runtime.execute(0xc4238c4000, 0xc42001b300)
	/usr/local/go/src/runtime/proc.go:1874 +0x17f
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2239 +0x127
runtime.exitsyscall0(0xc420001a00)
	/usr/local/go/src/runtime/proc.go:2772 +0xec
runtime.mcall(0x0)
	/usr/local/go/src/runtime/asm_amd64.s:269 +0x5b

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc42001230c)
	/usr/local/go/src/runtime/sema.go:47 +0x34
sync.(*WaitGroup).Wait(0xc420012300)
	/usr/local/go/src/sync/waitgroup.go:131 +0x7a
main.main()
	main.go:47 +0x27e

[...]

I don’t see an obvious smoking gun in the perf trace output. I’ve shared a portion of it at https://gist.github.com/rhysh/be58b5a5f3a41e218a05f1d8c3522dc0: Line 7 shows thread 16347 beginning a call to clone, and lines 671 and 672 show that call resulting in the creation of process 86746. Lines 677 and 688 show the execve call for /bin/true, and 727 shows the exit_group call that ends the /bin/true process. Line 984 shows thread 16347 “returning” from rt_sigreturn (I don’t understand how this works) with value 13, which happens to be the number of bytes in “fatal error: ”, followed by a 42-byte write to fd 2 (probably “unexpected signal during runtime execution”) and a 1-byte write to fd 2 (probably “\n”).

Thank you all for the ideas. I’ll upgrade to go1.8.3 soon after it’s out—it looks like it’s slated to include cherry-picks of both CL 43294 and CL 43311.

This program doesn’t use GOEXPERIMENT=preemptibleloops, and I agree that we can likely rule out inter-runtime signal handler disputes since it’s built with CGO_ENABLED=0.

The program uses crypto/rand.Read, which tries to use getrandom(2) when available (on Linux 3.17+). Should we be suspicious of this feature?
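(As a quick way to check whether a given kernel takes the getrandom path at all, here’s a hypothetical probe that issues the raw syscall, number 318 on linux/amd64; on kernels older than 3.17 it fails with ENOSYS and crypto/rand falls back to /dev/urandom.)

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

func main() {
	var buf [8]byte
	// 318 is getrandom(2) on linux/amd64, spelled out here rather than
	// relying on a syscall-package constant.
	n, _, errno := syscall.Syscall(318, uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)), 0)
	switch errno {
	case 0:
		fmt.Printf("getrandom available, read %d bytes\n", n)
	case syscall.ENOSYS:
		fmt.Println("no getrandom; crypto/rand will read /dev/urandom")
	default:
		fmt.Println("getrandom error:", errno)
	}
}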

I got the following crash from an instance of the program that was running with GODEBUG=gcrescanstacks=1,gccheckmark=1 (and GOGC=1). The crash doesn’t look like it comes from the GC assertions that those settings enable. Setting GOGC=1 on the test instance doesn’t seem to have made the program crash more often.

unexpected fault address 0xd01ed0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x2 addr=0xd01ed0 pc=0xd01ed0]

goroutine 1153267 [running]:
runtime.throw(0xa56736, 0x5)
     /usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc42059fb08 sp=0xc42059fae8
runtime.sigpanic()
     /usr/local/go/src/runtime/signal_unix.go:297 +0x28c fp=0xc42059fb58 sp=0xc42059fb08
created by redacted
     /redacted.go:155 +0x46a

goroutine 1 [chan receive, 126 minutes]:
runtime.gopark(0xa73c38, 0xc4200e8298, 0xa5bd35, 0xc, 0xc4202b8717, 0x3)
     /usr/local/go/src/runtime/proc.go:271 +0xfd fp=0xc420062c38 sp=0xc420062c08
runtime.goparkunlock(0xc4200e8298, 0xa5bd35, 0xc, 0x42fd17, 0x3)
     /usr/local/go/src/runtime/proc.go:277 +0x5e fp=0xc420062c78 sp=0xc420062c38
runtime.chanrecv(0x9b1f60, 0xc4200e8240, 0x0, 0xc42027c201, 0xc4200e8240)
     /usr/local/go/src/runtime/chan.go:513 +0x24b fp=0xc420062d10 sp=0xc420062c78
runtime.chanrecv1(0x9b1f60, 0xc4200e8240, 0x0)
     /usr/local/go/src/runtime/chan.go:395 +0x35 fp=0xc420062d48 sp=0xc420062d10
[snip]

This program uses Unix domain sockets much more actively than most of our others. Some of the crashes are encountered near the netpoller (like the one below), but I don’t know if that’s because of memory corruption coming from the netpoller, or if the program is just as likely to crash on any active goroutine and a lot of the active goroutines are working with UDS.

fatal error: netpollblock: corrupted state

goroutine 340603 [running]:
runtime.throw(0x894220, 0x1d)
	/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc420a68ae0 sp=0xc420a68ac0
runtime.netpollblock(0x7f7c182d1e00, 0x72, 0xb)
	/usr/local/go/src/runtime/netpoll.go:350 +0x105 fp=0xc420a68b28 sp=0xc420a68ae0
net.runtime_pollWait(0x7f7c182d1e00, 0x72, 0x14)
	/usr/local/go/src/runtime/netpoll.go:164 +0x59 fp=0xc420a68b50 sp=0xc420a68b28
net.(*pollDesc).wait(0xc4201724c8, 0x72, 0xa68080, 0xa643c8)
	/usr/local/go/src/net/fd_poll_runtime.go:75 +0x38 fp=0xc420a68b78 sp=0xc420a68b50
net.(*pollDesc).waitRead(0xc4201724c8, 0xc420a08a00, 0x200)
	/usr/local/go/src/net/fd_poll_runtime.go:80 +0x34 fp=0xc420a68ba8 sp=0xc420a68b78
net.(*netFD).Read(0xc420172460, 0xc420a08a00, 0x200, 0x200, 0x0, 0xa68080, 0xa643c8)
	/usr/local/go/src/net/fd_unix.go:250 +0x1b7 fp=0xc420a68c10 sp=0xc420a68ba8
net.(*conn).Read(0xc4205cc0c8, 0xc420a08a00, 0x200, 0x200, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:181 +0x70 fp=0xc420a68c80 sp=0xc420a68c10
bytes.(*Buffer).ReadFrom(0xc420a68d20, 0x7f7c184f1108, 0xc4205cc0c8, 0xc420a08a00, 0x0, 0x200)
	/usr/local/go/src/bytes/buffer.go:179 +0x160 fp=0xc420a68cf0 sp=0xc420a68c80
io/ioutil.readAll(0x7f7c184f1108, 0xc4205cc0c8, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/ioutil/ioutil.go:33 +0x150 fp=0xc420a68d98 sp=0xc420a68cf0
io/ioutil.ReadAll(0x7f7c184f1108, 0xc4205cc0c8, 0xc4205cc0c8, 0x7f7c184f1108, 0xc4205cc0c8, 0x0, 0x0)
	/usr/local/go/src/io/ioutil/ioutil.go:42 +0x3e fp=0xc420a68de8 sp=0xc420a68d98
[snip]