go: runtime: sporadic but frequent "signal: illegal instruction" during all.bash on darwin/arm64 at CL 272258 PS 1
CL 272258 has made it easy to run all.bash
, so I ran it a few times. It has passed at least once, but most of the some test will fail due to “signal: illegal instruction”.
This is on a MacBook Air (M1, 2020), macOS 11.0.1 (20B29).
This may get resolved as part of finishing work on #42684, but reporting this in case these failures are helpful to see sooner, since I don’t think of any existing macOS/ARM64 issues cover this failure mode specifically.
What version of Go are you using (go version
)?
$ go version go version devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64
What operating system and processor architecture are you using (go env
)?
go env
Output
$ go env GO111MODULE="" GOARCH="arm64" GOBIN="" GOCACHE="/Users/dmitri/Library/Caches/go-build" GOENV="/Users/dmitri/Library/Application Support/go/env" GOEXE="" GOFLAGS="" GOHOSTARCH="arm64" GOHOSTOS="darwin" GOINSECURE="" GOMODCACHE="/Users/dmitri/go/pkg/mod" GONOPROXY="" GONOSUMDB="" GOOS="darwin" GOPATH="/Users/dmitri/go" GOPRIVATE="" GOPROXY="https://proxy.golang.org,direct" GOROOT="/Users/dmitri/go116r2" GOSUMDB="sum.golang.org" GOTMPDIR="" GOTOOLDIR="/Users/dmitri/go116r2/pkg/tool/darwin_arm64" GOVCS="" GOVERSION="devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500" GCCGO="gccgo" AR="ar" CC="clang" CXX="clang++" CGO_ENABLED="1" GOMOD="/dev/null" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-build446541792=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
Built Go at commit 7716a2fbb7 (CL 272258 PS 1) successfully, pointed GOROOT_BOOTSTRAP to it, then checked out the same commit elsewhere and ran all.bash
repeatedly.
What did you expect to see?
$ ./all.bash
Building Go cmd/dist using /Users/dmitri/go116r2. (devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64)
Building Go toolchain1 using /Users/dmitri/go116r2.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for darwin/arm64.
[...]
ALL TESTS PASSED
---
Installed Go for darwin/arm64 in /Users/dmitri/gotip
Installed commands in /Users/dmitri/gotip/bin
*** You need to add /Users/dmitri/gotip/bin to your PATH.
What did you see instead?
Most of the time, a failure like:
##### ../test
# go run run.go -- fixedbugs/issue5856.go
exit status 1
go tool compile: signal: illegal instruction
FAIL fixedbugs/issue5856.go 0.015s
2020/11/22 15:17:43 Failed: exit status 1
##### ../test
# go run run.go -- fixedbugs/issue16095.go
exit status 1
go tool link: signal: illegal instruction
FAIL fixedbugs/issue16095.go 0.027s
2020/11/22 15:49:41 Failed: exit status 1
##### ../test
# go run run.go -- fixedbugs/issue29264.go
exit status 1
go tool compile: signal: illegal instruction
FAIL fixedbugs/issue29264.go 0.020s
2020/11/22 16:02:34 Failed: exit status 1
$ ./all.bash
Building Go cmd/dist using /Users/dmitri/gotip. (devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64)
Building Go toolchain1 using /Users/dmitri/gotip.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
go build internal/testlog: /Users/dmitri/go116r2/pkg/tool/darwin_arm64/compile: signal: illegal instruction
go tool dist: FAILED: /Users/dmitri/go116r2/pkg/tool/darwin_arm64/go_bootstrap install -gcflags=all= -ldflags=all= -i cmd/asm cmd/cgo cmd/compile cmd/link: exit status 1
--- FAIL: TestCover (1.48s)
cover_test.go:141: running [go -build -o /var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-testcover103682725/toolexec.exe testdata/toolexec.go]
cover_test.go:130: running [go build -o /var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-testcover103682725/cover.exe]
cover_test.go:173: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestCoverHTML (1.48s)
cover_test.go:390: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestFuncWithDuplicateLines (1.48s)
cover_test.go:537: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestHtmlUnformatted (1.48s)
cover_test.go:454: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestDirectives (1.48s)
cover_test.go:249: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestCoverFunc (1.48s)
cover_test.go:368: failed to build testcover or toolexec program: signal: illegal instruction
FAIL
FAIL cmd/cover 1.728s
--- FAIL: TestDWARF (0.85s)
--- FAIL: TestDWARF/testprog (1.75s)
dwarf_test.go:99: symbols testprog.exe: exit status 255: /var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-link-TestDWARF652205567/testprog.exe [arm64, 0.054295 seconds]:
null-uuid /private/var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-link-TestDWARF652205567/testprog.exe [AOUT, PIE, FaultedFromDisk]
0x0000000000000000 (0x100000000) __PAGEZERO SEGMENT
0x0000000100000000 ( 0xf0000) __TEXT SEGMENT
0x0000000100000000 ( 0x1000) MACH_HEADER
0x0000000100001000 ( 0xc6c90) __TEXT __text
0x0000000100001000 ( 0x70) go.buildid [FUNC, NameNList, MangledNameNList, Merged, NList]
0x0000000100001070 ( 0x50) internal/cpu.Initialize [FUNC, NameNList, MangledNameNList, NList]
[...]
0x00000001002683a8 ( 0x8) p [EXT, NameNList, MangledNameNList, NList]
0x00000001002683b0 ( 0x4) threadExited [EXT, NameNList, MangledNameNList, NList]
0x000000010026c000 ( 0x402d2) __LINKEDIT SEGMENT
symbols[31119]: [fatal] child process status could not be determined; exit code unavailable.
FAIL
FAIL cmd/link 20.081s
(Full log for TestDWARF/testprog
failure.)
CC @cherrymui.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 49 (35 by maintainers)
I filed FB8922558 with Apple about these uncatchable spurious
SIGILL
s.I’ve come up with a reduced C testcase that exhibits the bug fairly readily:
t_fault_sigaltstack.cc
. It should be possible to use this to prototype other possible workarounds, and for filing with Apple.If it doesn’t crash with
SIGILL
shortly after starting it, startwhile true; do sudo purge; done
(either before or whilet_fault_sigaltstack
is running) and it should crash.Other times, it will show the thread calling
pthread_kill
as the crash thread.Apparently, it more likely to fail when the system is under load (e.g. running multiple instances of the test in parallel). So, one assumption is that it likely to fail when the signal stack is not faulted in when the signal happens. So I tried to mlock the signal stack, and it does seem to fix this.
CL https://golang.org/cl/273686 does the mlock. For people who want to try, let me know if that works on your machine. Thanks!
I kept looking anyway, and as you may have found, the T8020 and T8027 kernels in those early releases were identical, so I was able to use the T8020
.dSYM
from the 20A5299w KDK. Nothing much came of that, it basically confirmed your symbolization. But I was looking at what seemed to be the most promising stack you shared:(
0xff68000
computed as0xfffffe001765d038
minus the address shown bynm kernel.release.t8020.dSYM/Contents/Resources/DWARF/kernel.release.t8020 | grep ' [Ss] _postsig_locked$'
.)Nothing too new or interesting there, and we can’t do much with the filenames and line numbers without a source drop. So I looked for other sources of
SIGILL
in xnu that aren’t strictly hardware traps, and I found one, right in 10.15.6 xnu-6153.141.1/bsd/dev/arm/unix_signal.csendsig
. Incidentally, that’s exactly where bsd/kern/kern_sig.cpostsig_locked
calls into.At the bottom of
sendsig
, at the labelbad
, you’ll see a hard-codedSIGILL
.bad
is used when an error occurs while attempting to fulfill a request to send a signal to a user-space handler. Here, the errors can be in getting or setting thread state, or in copying data to user-space, so as to establish the user-space handler’s stack frame. ThisSIGILL
is forced toSIG_DFL
, which also agrees with your observation that the user-space handler never ran to show a backtrace.(x86_64 need not feel left out, it has the same
SIGILL
behavior in bsd/dev/i386/unix_signal.csendsig
.)Here’s a sample:
t_impossible_sigaltstack
. It sets up asigaltstack
to handleSIGABRT
, but sets the protection on it toPROT_NONE
, preventing the kernel from setting up the stack frame, before callingabort()
to crash. You were expectingSIGABRT
? In reality:(Another note:
sigaltstack
is broken for arm64 in the 10.15.6 xnu sources, but it’s working in 11.0. Doesn’t actually matter for this bug, though: the 10.15.6 xnu source shows everything you need to see how thisSIGILL
is generated.)How to dig deeper from here? Well, the Mach exception type that was recorded (here,
EXC_CRASH
) will actually be the true exception type from the original hardware fault if there was one, and will beEXC_CRASH
if it was a software fault. Since your reports don’t showEXC_BAD_INSTRUCTION
, they’re not real illegal instructions at all, they’re just failures to invoke the user-space signal handler. In fact yours are showingEXC_CRASH
, just like my example here.Since golang is a heavy user of
SIGURG
internally, and since you mentioned that the__pthread_kill
calls I originally suggested might be relevant were raisingSIGURG
, I think we have a very strong theory that the problem is a failure to invoke the user-spaceSIGURG
signal handler. You don’t normally think ofSIGURG
as being a fatal signal, because its default disposition (atSIG_DFL
) is to be discarded, but in fact when a user-space handler can’t be invoked, the default disposition ofSIGILL
will be the controlling behavior.The remaining question concerns why the user-space signal handler couldn’t be invoked. It could be something like my example here, with an unusable signal stack. Conceivably, it could be stack exhaustion or some other memory problem, or it could be a problem with the user thread state or signal handler thread state. But I think that this new lead will be valuable in troubleshooting this further.
Okay, I can reproduce the SIGILL with running the io benchmarks in a loop on the DTK, although with a much lower failure rate. I’ll take it from there. (I’ll be OOO this afternoon, though, so I probably won’t get back to this very soon.)
Preemption signal is not the only issue here. It can be any signals, like user generated signals. So I don’t think disabling preemption is preferable.
We could use sigaltstack+mlock on iOS >=14. But before we do that, I think we would want to understand better if #35851 is the same cause or not.
I tried to manually symbolize a kernel stack trace (by comparing the kernel memory with the symbol information of a kernel with a nearby version):
The fifth frame is weird, as it is not a call instruction. Maybe a faked return address? I’ll see if I can make sense of it…
Crash report for this error: https://gist.github.com/tonistiigi/23fbd4be8d1562ff935aa36b2dcdb299
With
GODEBUG=asyncpreemptoff=1
./io.test benchmark doesn’t reproduce issue for me either.I’ve also seen this happen a few times. Interestingly the place I ran into it most often is when running the
io
benchmarks, particularlyBenchmarkCopyNLarge-8
.