go: runtime: sporadic but frequent "signal: illegal instruction" during all.bash on darwin/arm64 at CL 272258 PS 1

CL 272258 has made it easy to run all.bash, so I ran it a few times. It has passed at least once, but most of the some test will fail due to “signal: illegal instruction”.

This is on a MacBook Air (M1, 2020), macOS 11.0.1 (20B29).

This may get resolved as part of finishing work on #42684, but reporting this in case these failures are helpful to see sooner, since I don’t think of any existing macOS/ARM64 issues cover this failure mode specifically.

What version of Go are you using (go version)?

$ go version
go version devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/dmitri/Library/Caches/go-build"
GOENV="/Users/dmitri/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/dmitri/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/dmitri/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/dmitri/go116r2"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/dmitri/go116r2/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-build446541792=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Built Go at commit 7716a2fbb7 (CL 272258 PS 1) successfully, pointed GOROOT_BOOTSTRAP to it, then checked out the same commit elsewhere and ran all.bash repeatedly.

What did you expect to see?

$ ./all.bash
Building Go cmd/dist using /Users/dmitri/go116r2. (devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64)
Building Go toolchain1 using /Users/dmitri/go116r2.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for darwin/arm64.

[...]

ALL TESTS PASSED
---
Installed Go for darwin/arm64 in /Users/dmitri/gotip
Installed commands in /Users/dmitri/gotip/bin
*** You need to add /Users/dmitri/gotip/bin to your PATH.

What did you see instead?

Most of the time, a failure like:

##### ../test
# go run run.go -- fixedbugs/issue5856.go
exit status 1
go tool compile: signal: illegal instruction

FAIL	fixedbugs/issue5856.go	0.015s
2020/11/22 15:17:43 Failed: exit status 1
##### ../test
# go run run.go -- fixedbugs/issue16095.go
exit status 1
go tool link: signal: illegal instruction

FAIL	fixedbugs/issue16095.go	0.027s
2020/11/22 15:49:41 Failed: exit status 1
##### ../test
# go run run.go -- fixedbugs/issue29264.go
exit status 1
go tool compile: signal: illegal instruction

FAIL	fixedbugs/issue29264.go	0.020s
2020/11/22 16:02:34 Failed: exit status 1
$ ./all.bash
Building Go cmd/dist using /Users/dmitri/gotip. (devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64)
Building Go toolchain1 using /Users/dmitri/gotip.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
go build internal/testlog: /Users/dmitri/go116r2/pkg/tool/darwin_arm64/compile: signal: illegal instruction
go tool dist: FAILED: /Users/dmitri/go116r2/pkg/tool/darwin_arm64/go_bootstrap install -gcflags=all= -ldflags=all= -i cmd/asm cmd/cgo cmd/compile cmd/link: exit status 1
--- FAIL: TestCover (1.48s)
    cover_test.go:141: running [go -build -o /var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-testcover103682725/toolexec.exe testdata/toolexec.go]
    cover_test.go:130: running [go build -o /var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-testcover103682725/cover.exe]
    cover_test.go:173: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestCoverHTML (1.48s)
    cover_test.go:390: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestFuncWithDuplicateLines (1.48s)
    cover_test.go:537: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestHtmlUnformatted (1.48s)
    cover_test.go:454: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestDirectives (1.48s)
    cover_test.go:249: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestCoverFunc (1.48s)
    cover_test.go:368: failed to build testcover or toolexec program: signal: illegal instruction
FAIL
FAIL	cmd/cover	1.728s
--- FAIL: TestDWARF (0.85s)
    --- FAIL: TestDWARF/testprog (1.75s)
        dwarf_test.go:99: symbols testprog.exe: exit status 255: /var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-link-TestDWARF652205567/testprog.exe [arm64, 0.054295 seconds]:
                null-uuid                            /private/var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-link-TestDWARF652205567/testprog.exe [AOUT, PIE, FaultedFromDisk]  
                    0x0000000000000000 (0x100000000) __PAGEZERO SEGMENT
                    0x0000000100000000 ( 0xf0000) __TEXT SEGMENT
                        0x0000000100000000 (  0x1000) MACH_HEADER
                        0x0000000100001000 ( 0xc6c90) __TEXT __text
                            0x0000000100001000 (    0x70) go.buildid [FUNC, NameNList, MangledNameNList, Merged, NList] 
                            0x0000000100001070 (    0x50) internal/cpu.Initialize [FUNC, NameNList, MangledNameNList, NList] 
[...]
                            0x00000001002683a8 (     0x8) p [EXT, NameNList, MangledNameNList, NList] 
                            0x00000001002683b0 (     0x4) threadExited [EXT, NameNList, MangledNameNList, NList] 
                    0x000000010026c000 ( 0x402d2) __LINKEDIT SEGMENT
            symbols[31119]: [fatal]  child process status could not be determined; exit code unavailable.
FAIL
FAIL	cmd/link	20.081s

(Full log for TestDWARF/testprog failure.)

CC @cherrymui.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 49 (35 by maintainers)

Most upvoted comments

I filed FB8922558 with Apple about these uncatchable spurious SIGILLs.

I’ve come up with a reduced C testcase that exhibits the bug fairly readily: t_fault_sigaltstack.cc. It should be possible to use this to prototype other possible workarounds, and for filing with Apple.

If it doesn’t crash with SIGILL shortly after starting it, start while true; do sudo purge; done (either before or while t_fault_sigaltstack is running) and it should crash.

Process:               t_fault_sigaltstack [80089]
[…]
Code Type:             ARM-64 (Native)
[…]
Crashed Thread:        1

Exception Type:        EXC_CRASH (SIGILL)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Illegal instruction: 4
Termination Reason:    Namespace SIGNAL, Code 0x4
Terminating Process:   t_fault_sigaltstack [80089]

Application Specific Information:
dyld2 mode

Thread 0:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib              0x000000018ff43cec __pthread_kill + 8
1   libsystem_pthread.dylib             0x000000018ff74c24 pthread_kill + 292
2   t_fault_sigaltstack                 0x0000000102b9dd60 main + 360
3   libdyld.dylib                       0x000000018ff90f54 start + 4

Thread 1 Crashed:
0   libsystem_kernel.dylib              0x000000018ff3f284 __semwait_signal + 8
1   libsystem_c.dylib                   0x000000018feb93a0 nanosleep + 216
2   t_fault_sigaltstack                 0x0000000102b9dfb4 (anonymous namespace)::ThreadMain(void*) + 420
3   libsystem_pthread.dylib             0x000000018ff7506c _pthread_start + 320
4   libsystem_pthread.dylib             0x000000018ff6fda0 thread_start + 8
[…]

Other times, it will show the thread calling pthread_kill as the crash thread.

Apparently, it more likely to fail when the system is under load (e.g. running multiple instances of the test in parallel). So, one assumption is that it likely to fail when the signal stack is not faulted in when the signal happens. So I tried to mlock the signal stack, and it does seem to fix this.

CL https://golang.org/cl/273686 does the mlock. For people who want to try, let me know if that works on your machine. Thanks!

I kept looking anyway, and as you may have found, the T8020 and T8027 kernels in those early releases were identical, so I was able to use the T8020 .dSYM from the 20A5299w KDK. Nothing much came of that, it basically confirmed your symbolization. But I was looking at what seemed to be the most promising stack you shared:

mark@sweet16 zsh% pbpaste
              kernel.release.t8027`0xfffffe001765d038+0x244 // postsig_locked
              kernel.release.t8027`0xfffffe001765d540+0x494 // bsd_ast
              kernel.release.t8027`0xfffffe001716b9f4+0xdc  // ast_taken_user
              kernel.release.t8027`0xfffffe001712fd9c+0xc   // _arm64_thread_exception_return (label user_take_ast)
              kernel.release.t8027`0xfffffe001727a714+0x564 // sleh_synchronous ???
              kernel.release.t8027`0xfffffe001712f858+0x28  // fleh_synchronous
              0x102387f84 // user code
mark@sweet16 zsh% pbpaste | sed -En -e 's/^.*kernel\.release\.t8027`([^ ]+).*$/print(hex(\1))/p' | python3 | atos -o kernel.release.t8020.dSYM/Contents/Resources/DWARF/kernel.release.t8020 -s 0xff68000
postsig_locked (in kernel.release.t8020) (kern_sig.c:3359)
bsd_ast (in kernel.release.t8020) (kern_sig.c:3573)
ast_taken_user (in kernel.release.t8020) (ast.c:219)
user_take_ast (in kernel.release.t8020) + 12
sleh_synchronous (in kernel.release.t8020) (sleh.c:594)
fleh_synchronous (in kernel.release.t8020) + 40

(0xff68000 computed as 0xfffffe001765d038 minus the address shown by nm kernel.release.t8020.dSYM/Contents/Resources/DWARF/kernel.release.t8020 | grep ' [Ss] _postsig_locked$'.)

Nothing too new or interesting there, and we can’t do much with the filenames and line numbers without a source drop. So I looked for other sources of SIGILL in xnu that aren’t strictly hardware traps, and I found one, right in 10.15.6 xnu-6153.141.1/bsd/dev/arm/unix_signal.c sendsig. Incidentally, that’s exactly where bsd/kern/kern_sig.c postsig_locked calls into.

At the bottom of sendsig, at the label bad, you’ll see a hard-coded SIGILL. bad is used when an error occurs while attempting to fulfill a request to send a signal to a user-space handler. Here, the errors can be in getting or setting thread state, or in copying data to user-space, so as to establish the user-space handler’s stack frame. This SIGILL is forced to SIG_DFL, which also agrees with your observation that the user-space handler never ran to show a backtrace.

(x86_64 need not feel left out, it has the same SIGILL behavior in bsd/dev/i386/unix_signal.c sendsig.)

Here’s a sample: t_impossible_sigaltstack. It sets up a sigaltstack to handle SIGABRT, but sets the protection on it to PROT_NONE, preventing the kernel from setting up the stack frame, before calling abort() to crash. You were expecting SIGABRT? In reality:

mark@arm-and-hammer zsh% clang++ -Wall -Werror t_impossible_sigaltstack.cc -o t_impossible_sigaltstack
mark@arm-and-hammer zsh% ./t_impossible_sigaltstack 
zsh: illegal hardware instruction  ./t_impossible_sigaltstack
mark@arm-and-hammer zsh% cat less Library/Logs/DiagnosticReports/t_impossible_sigaltstack_*.crash
Process:               t_impossible_sigaltstack [72514]
[…]
Code Type:             ARM-64 (Native)
[…]
Exception Type:        EXC_CRASH (SIGILL)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Illegal instruction: 4
Termination Reason:    Namespace SIGNAL, Code 0x4
Terminating Process:   t_impossible_sigaltstack [72514]

Application Specific Information:
dyld2 mode
abort() called

(Another note: sigaltstack is broken for arm64 in the 10.15.6 xnu sources, but it’s working in 11.0. Doesn’t actually matter for this bug, though: the 10.15.6 xnu source shows everything you need to see how this SIGILL is generated.)

How to dig deeper from here? Well, the Mach exception type that was recorded (here, EXC_CRASH) will actually be the true exception type from the original hardware fault if there was one, and will be EXC_CRASH if it was a software fault. Since your reports don’t show EXC_BAD_INSTRUCTION, they’re not real illegal instructions at all, they’re just failures to invoke the user-space signal handler. In fact yours are showing EXC_CRASH, just like my example here.

Since golang is a heavy user of SIGURG internally, and since you mentioned that the __pthread_kill calls I originally suggested might be relevant were raising SIGURG, I think we have a very strong theory that the problem is a failure to invoke the user-space SIGURG signal handler. You don’t normally think of SIGURG as being a fatal signal, because its default disposition (at SIG_DFL) is to be discarded, but in fact when a user-space handler can’t be invoked, the default disposition of SIGILL will be the controlling behavior.

The remaining question concerns why the user-space signal handler couldn’t be invoked. It could be something like my example here, with an unusable signal stack. Conceivably, it could be stack exhaustion or some other memory problem, or it could be a problem with the user thread state or signal handler thread state. But I think that this new lead will be valuable in troubleshooting this further.

Okay, I can reproduce the SIGILL with running the io benchmarks in a loop on the DTK, although with a much lower failure rate. I’ll take it from there. (I’ll be OOO this afternoon, though, so I probably won’t get back to this very soon.)

Preemption signal is not the only issue here. It can be any signals, like user generated signals. So I don’t think disabling preemption is preferable.

We could use sigaltstack+mlock on iOS >=14. But before we do that, I think we would want to understand better if #35851 is the same cause or not.

I tried to manually symbolize a kernel stack trace (by comparing the kernel memory with the symbol information of a kernel with a nearby version):

  6   2606 get_system_inshutdown:signal-send pid=670 tid=51032 io.test send signal 4 to 670 51030

              kernel.release.t8027`0xfffffe001765d038+0x244 // postsig_locked
              kernel.release.t8027`0xfffffe001765d540+0x494 // bsd_ast
              kernel.release.t8027`0xfffffe001716b9f4+0xdc  // ast_taken_user
              kernel.release.t8027`0xfffffe001712fd9c+0xc   // _arm64_thread_exception_return (label user_take_ast)
              kernel.release.t8027`0xfffffe001716bd24       // ???
              kernel.release.t8027`0xfffffe00171387c0+0x7c  // _Call_continuation

The fifth frame is weird, as it is not a call instruction. Maybe a faked return address? I’ll see if I can make sense of it…

Crash report for this error: https://gist.github.com/tonistiigi/23fbd4be8d1562ff935aa36b2dcdb299

With GODEBUG=asyncpreemptoff=1 ./io.test benchmark doesn’t reproduce issue for me either.

I’ve also seen this happen a few times. Interestingly the place I ran into it most often is when running the io benchmarks, particularly BenchmarkCopyNLarge-8.

~/go/src % ../bin/go test ./io -run=XXX -bench=. -count=10
goos: darwin
goarch: arm64
pkg: io
BenchmarkCopyNSmall-8   	 5467614	       425.9 ns/op
BenchmarkCopyNSmall-8   	 3287067	       428.3 ns/op
BenchmarkCopyNSmall-8   	 5085464	       740.8 ns/op
BenchmarkCopyNLarge-8   	signal: illegal instruction
FAIL	io	24.012s
FAIL