go: runtime: corrupt binary export data seen after signal preemption CL
$ go version
go version devel +d2c039fb21 Sun Nov 3 01:44:46 2019 +0000 linux/amd64
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/mvdan/.cache/go-build"
GOENV="/home/mvdan/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY="brank.as/*"
GONOSUMDB="brank.as/*"
GOOS="linux"
GOPATH="/home/mvdan/go"
GOPRIVATE="brank.as/*"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/mvdan/tip"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/mvdan/tip/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/mvdan/src/gio/cmd/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build819987201=/tmp/go-build -gno-record-gcc-switches"
After building Go from master, I sometimes see errors like:
$ go test -race
# encoding/json
vet: /home/mvdan/tip/src/encoding/json/decode.go:13:2: could not import fmt (cannot import "fmt" (unknown bexport format version -1 ("\x80\x16\x13\x00\xc0\x00\x00\x00\x80\x17\x13\x00\xc0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06format\x01a\x00\x04esc:\x05esc:\x02\x18$GOROOT/src/fmt/print.go\x05Write\x01b\x01n\x03err\x05Width\x03wid\x02ok\tPrecision\x04prec\x04Flag\x01c\x06Format\x01f\x05State\x06String\bGoString\x01w\x06Writer\x02io")), possibly version skew - reinstall package)
PASS
ok gioui.org/cmd/gogio 4.559s
Here’s another crash from earlier today, with a slightly modified (and freshly built) Go master - you can see the error mentions a different std package:
$ go test
# mime
vet: /home/mvdan/tip/src/mime/encodedword.go:12:2: could not import io (cannot import "io" (unknown bexport format version -1 ("DX\xcdq㦔d_\xbf\x97\xa64h\xf7\x8f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00d\x01p\x01n\x03err\x05Write\x05Close\x04Seek\x06offset\x06whence\x06Reader\x06Writer\x06Closer\x06Seeker\bReadFrom\x01r\aWriteTo\x01w\x06ReadAt\x03off\aWriteAt\bReadByte")), possibly version skew - reinstall package)
PASS
ok gioui.org/cmd/gogio 7.199s
@heschik correctly points out that this could be a bad version of vet in play, since the bexport format has been replaced with the iexport format. However, I already nuked my $GOBIN, and go test -x seems to be running /home/mvdan/tip/pkg/tool/linux_amd64/vet, which is freshly built.
Usually I’d assume this is an issue with my setup, but I can’t find anything wrong with it, and I’ve only started seeing these errors today.
I just got the C reproducer working. I’m working on tidying it up and I’ll post it. Both madvising and mmaping the sigaltstack work to clear the pages (and that is necessary). The other missing ingredient was just running lots of the processes simultaneously.
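For readers following along, here is a minimal, hypothetical sketch of just that page-clearing ingredient. It is not the actual reproducer (which also runs many copies of the process in parallel), and all names in it are illustrative:

```c
/*
 * Minimal sketch of the page-clearing ingredient described above; NOT
 * the actual reproducer (which also runs many processes in parallel).
 * All names here are illustrative.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define STK_SIZE (8 * 4096)

static char *stk;

static void on_sig(int sig) { (void)sig; /* runs on the alt stack */ }

int main(void) {
	/* Page-aligned, lazily faulted alternate signal stack. */
	stk = mmap(NULL, STK_SIZE, PROT_READ | PROT_WRITE,
	           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (stk == MAP_FAILED) { perror("mmap"); return 1; }

	stack_t ss = { .ss_sp = stk, .ss_size = STK_SIZE, .ss_flags = 0 };
	if (sigaltstack(&ss, NULL)) { perror("sigaltstack"); return 1; }

	struct sigaction sa;
	memset(&sa, 0, sizeof sa);
	sa.sa_handler = on_sig;
	sa.sa_flags = SA_ONSTACK; /* deliver signals on the alt stack */
	if (sigaction(SIGWINCH, &sa, NULL)) { perror("sigaction"); return 1; }

	for (int i = 0; i < 100000; i++) {
		/* Drop the stack's backing pages so the next signal faults
		 * in fresh zero pages; mmap'ing over the region works too. */
		if (madvise(stk, STK_SIZE, MADV_DONTNEED)) { perror("madvise"); return 1; }
		raise(SIGWINCH);
	}
	return 0;
}
```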
Here’s the C reproducer. This fails almost instantly on 5.3.0-1008-gcp, and torvalds/linux@d9c9ce34ed5c892323cbf5b4f9a4c498e036316a (5.1.0-rc3+). It does not fail at the parent of that commit (torvalds/linux@a5eff7259790d5314eff10563d6e59d358cce482).
I’ll work on filing this upstream with Linux.
I’ve filed the upstream kernel bug here: https://bugzilla.kernel.org/show_bug.cgi?id=205663
I set out to bisect the kernel and discovered that it also depends on the GCC version.
So this may happen in earlier kernel versions but be masked by their incompatibility with GCC 9. My guess is that GCC 9 started using AVX registers for something that GCC 8 didn’t?
I’m going to see if a later v5.2.x kernel builds with GCC 9.
I’m finally ready to actually bisect. Setting CONFIG_RETPOLINE to n works around the GCC 9 incompatibility. With this configuration, I’ve been able to reproduce the failure with v5.2, but NOT with v5.1. I’m now bisecting between the two.
For reference, here’s the script I’m using to configure, build, and load the kernel. This is based on the ubuntu-1910-eoan-v20191114 GCE image (grown to 50 GiB to fit the kernel build, sigh).
I’ve bisected the issue to kernel commit torvalds/linux@d9c9ce34ed5c892323cbf5b4f9a4c498e036316a. I haven’t dug into this commit yet, but it appears to be a fix for a bug that was introduced earlier (or possibly a redo of an earlier commit).
For reference, torvalds/linux@a352a3b7b7920212ee4c45a41500c66826318e92, earlier in that same commit series, first introduced another bug that produced similar failure types, but at a far higher rate (I couldn’t even run cmd/go). However, that bug was fixed later in the series, somewhere between torvalds/linux@e0d3602f933367881bddfff310a744e6e61c284c and torvalds/linux@1d731e731c4cd7cbd3b1aa295f0932e7610da82f.
My first bisect log is below. This led to the fail-fast failure.
After this I backed up and looked for the original failure. This bisect also happened to reveal that the fast failure had been fixed before the original failure was introduced.
Linux kernel version-specific signal handling behavior sounds fuuuuun.
@ianlancetaylor, I just tried with Go 1.13 and sending SIGWINCH in a loop and was able to reproduce the same vet version header corruption.
The change you mentioned in the kernel suggests that the kernel is now using AVX registers and probably wasn’t before. I could definitely see that introducing bugs around save/restore (especially if that’s done lazily; I’m not sure if it is). But it’s a good question why we’re seeing this.
Well, at least you’ve proved that this is a pre-existing bug that has nothing to do with signal preemption, so clearly we don’t have to fix it for 1.14.
(Edit: this is a joke.)
I’ve now reproduced this on another Fedora 30 linux/amd64 machine, also with kernel 5.3.7-200.fc30.x86_64, but this time with a Ryzen 1800X (not overclocked, 8/16 cores) with 32 GB.

I’ve git bisected this, and on both of my Fedora 30 machines the commit where the failures start is 62e53b79227dafc6afcd92240c89acb8c0e1dd56, ‘runtime: use signals to preempt Gs for suspendG’, which is targeted at #10958 and #24543. I think this means this issue should CC @aclements.
@dr2chase One difference I see in the disassembly is that GCC 9 is caching the address of the thread-local variable checked by `test_thread_flag` across the function, while GCC 8 is reloading it each time. If the retry loop can cause a change between threads, then the call to `test_thread_flag(TIF_NEED_FPU_LOAD)` will be looking at the wrong thread if the retry occurs (see the sketch below).

@ianlancetaylor, I tried your suggestion and it does indeed still crash with a signal completely unrelated to preemption, delivered from outside the process.
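To make the hazard described above concrete, here is a deliberately simplified, self-contained illustration. It is not kernel code; `struct task` and `TIF_NEED_FPU_LOAD` are stand-ins for the kernel's current task and thread flag:

```c
/*
 * Deliberately simplified illustration of the retry-loop hazard
 * described above; NOT kernel code. The point: the "current task"
 * must be re-derived on every pass of the retry loop, or a retry
 * that resumes as a different task tests the wrong task's flag.
 */
#include <stdbool.h>
#include <stdio.h>

#define TIF_NEED_FPU_LOAD 14

struct task { unsigned long flags; };

static struct task task_a = { 1UL << TIF_NEED_FPU_LOAD };
static struct task task_b = { 0 };
static struct task *current_task = &task_a;

/* Stand-in for get_current(); the volatile access forbids the
 * compiler from hoisting the load out of the loop below, which is
 * the reload GCC 8 emitted and GCC 9 optimized away. */
static struct task *get_current_stub(void) {
	return *(struct task *volatile *)&current_task;
}

static bool need_fpu_load(void) {
	return get_current_stub()->flags & (1UL << TIF_NEED_FPU_LOAD);
}

int main(void) {
	int retries = 0;
	while (need_fpu_load()) {   /* flag owner re-read every pass */
		current_task = &task_b; /* simulate resuming as another task */
		retries++;
	}
	printf("retried %d time(s); flag re-derived each pass\n", retries);
	return 0;
}
```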
I ran `while GODEBUG=asyncpreemptoff=1 ./bin/go vet all; do true; done` for 10 minutes with no failures. I then started `(while true; do killall -WINCH vet; done)` (we don't ignore XCPU, but do ignore WINCH), and the vet loop failed three times in 10 minutes (one header corruption, one index out of range, and one weird nil dereference).

For your nerd-sniping amusement (and not mine), the bug is somewhere in the differences between these two disassembled files. Link Linux with the gcc8 one and all is well; link with the other and it goes bad.
signal.8.dis.txt signal.9.dis.txt
There’s a ridiculous amount of inlining going on; I’m not sure it helps to have the source annotation in the disassembly, but here it is:
signal.8.il.dis.txt signal.9.il.dis.txt
The source file in question, `arch/x86/kernel/fpu/signal.o`, is from this commit: https://github.com/torvalds/linux/commit/d9c9ce34ed5c892323cbf5b4f9a4c498e036316a

I verified this by building two kernels: one compiled entirely by gcc8 except for `arch/x86/kernel/fpu/signal.o`, which was compiled by gcc9. It fails. The other kernel was built entirely by gcc9, except for `arch/x86/kernel/fpu/signal.o`, built by gcc8. It does not fail. Enjoy!
I assume that the corruption does not occur if you disable preemption. You could try disabling preemption, then run the program and have another program hit it regularly with some signal that the Go runtime will catch but ignore, like SIGXCPU (see the sketch below). If that still fails, there is something wrong with the way the AVX registers are being saved and restored when a signal occurs, and the problem has nothing to do with preemption as such.

@mdempsky There is some discussion of workarounds over at #35777 (I forget why there are two different issues).
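For anyone who wants to try that experiment without killall, a tiny helper along those lines could look like the following. This is a hypothetical sketch, not a tool from this thread:

```c
/*
 * Hypothetical helper for the experiment suggested above: flood a
 * target process with a signal the Go runtime catches but ignores
 * (SIGXCPU here), so the kernel's signal save/restore path runs
 * constantly. Usage: ./hammer <pid>
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv) {
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 2;
	}
	pid_t pid = (pid_t)atol(argv[1]);
	for (;;) {
		if (kill(pid, SIGXCPU) != 0) {
			perror("kill"); /* e.g. the target exited */
			return 1;
		}
	}
}
```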
FYI, I reproduced, at Linux tip (5.4 plus a little), the dependence of the bug on how `arch/x86/kernel/fpu/signal.o` is compiled.
Would somebody like to try this C program on a 5.3 kernel and see whether it reports any memory corruption? It runs without error on my 4.19 kernel.
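The program itself is attached to the original comment. Purely as a rough illustration of the kind of self-check such a program performs, here is a hypothetical stand-in; it is not the attached program, and by itself it is unlikely to trigger the bug:

```c
/*
 * Hypothetical stand-in, NOT the attached reproducer: keep a buffer
 * filled with a known pattern, take a stream of signals, and report
 * if the pattern ever changes. The real trigger also needs the
 * sigaltstack page-clearing and parallelism discussed earlier.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>

#define N (1 << 16)

static unsigned char buf[N], ref[N];
static volatile sig_atomic_t nsigs;

static void on_sig(int sig) { (void)sig; nsigs++; }

int main(void) {
	memset(buf, 0xa5, N);
	memset(ref, 0xa5, N);
	signal(SIGWINCH, on_sig);

	for (int iter = 0; iter < 10000; iter++) {
		raise(SIGWINCH); /* exercise the kernel's FPU signal path */
		if (memcmp(buf, ref, N) != 0) {
			fprintf(stderr, "memory corruption after %d signals\n",
			        (int)nsigs);
			return 1;
		}
	}
	puts("no corruption observed");
	return 0;
}
```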
Sorry for the confusion, I was being sarcastic.
Of course we have to fix it, one way or another.
Change https://golang.org/cl/208218 mentions this issue:
runtime: stress testing for non-cooperative preemption