go: runtime,cmd/compile: frequent memory corruption on NetBSD and OpenBSD since 2021-10-11

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 45 (39 by maintainers)

Most upvoted comments

Reproduced on a bare metal Ryzen 3600 running NetBSD 9.99.92

Using the quicker reproducer at #34988 on NetBSD with the help of several NetBSD folk:

  • AMD 10h: OK (Turion II Neo N40L)
  • AMD 15h: OK
  • AMD 17h: NOT OK (Zen 1950X, Zen2 3600)
  • AMD 19h: NOT OK (Zen3 5950X)

We have narrowed this down to being specifically related to AMD CPUs. The E2 instances we switched to are a mix of Intel and AMD machines. https://golang.org/cl/367534 added explicit Intel-only (-n2) and AMD-only (-n2d) builders, and we found:

  • Near 100% failure rate for AMD openbsd-386, netbsd-386, netbsd-amd64
  • Near 0% failure rate for AMD openbsd-amd64 (but the failures that do occur are memory corruption)
  • 0% failure rate for Intel openbsd-386, openbsd-amd64, netbsd-386, netbsd-amd64

Thus far we’ve only been able to test on GCE instances, but would love to know if these crashes reproduce on OpenBSD/NetBSD on bare-metal AMD machines.

cc @4a6f656c @bsiegert @tklauser or anyone else who may have an OpenBSD or NetBSD AMD machine: just running GOARCH=386 ./all.bash (perhaps a couple of times) should be sufficient to reproduce some kind of memory corruption crash.
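
Since the crash may not show up on the first run, repeating the build in a loop can save some manual effort. The following is a minimal sketch, not something from this thread: run_repeatedly is a hypothetical helper name, and the attempt count is an arbitrary choice.

```sh
#!/bin/sh
# Hypothetical helper (not part of the Go tree): run a command up to
# MAX times and stop at the first failing attempt.
run_repeatedly() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        echo "attempt $i of $max: $*"
        if ! "$@"; then
            echo "failed on attempt $i" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max attempts"
    return 0
}

# Example, run from the src directory of a Go checkout
# (the count of 5 is an assumption, not a recommendation from the thread):
#   run_repeatedly 5 env GOARCH=386 ./all.bash
```

A non-zero return only means that some attempt failed; the output of that attempt still needs to be read to confirm it is a memory corruption crash and not an unrelated test failure.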

This is not a bug in Go. The failing builders will be annotated with a known issue until it is resolved. Because of this, it is no longer a release blocker.

I have yet (after about a dozen tries) to reproduce this on an openbsd/amd64 host using 1.17.3 as bootstrap to build the 386 dist.

$ sysctl hw.model kern.version          
hw.model=AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
kern.version=OpenBSD 7.0-current (GENERIC.MP) #133: Tue Nov 30 00:53:23 MST 2021
    deraadt@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

Per https://github.com/golang/go/issues/49209#issuecomment-982057154 you would need to be running OpenBSD i386 (not amd64) to be able to reproduce the issue (OpenBSD amd64 does not run i386 binaries, which would presumably be needed to trigger the problem).
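
For OpenBSD specifically, the point above suggests checking the kernel architecture and CPU vendor before spending time on a reproduction attempt. The sketch below assumes the values come from sysctl -n hw.machine and sysctl -n hw.model; is_candidate_host is a hypothetical name, and the check only encodes the claim made in this comment (i386 kernel on an AMD CPU), nothing more.

```sh
#!/bin/sh
# Hypothetical helper: succeed only if the host looks like the
# configuration described above (OpenBSD i386 kernel on an AMD CPU).
#   $1: machine architecture, e.g. the output of: sysctl -n hw.machine
#   $2: CPU model string,     e.g. the output of: sysctl -n hw.model
is_candidate_host() {
    machine=$1
    model=$2
    # An amd64 kernel cannot run the 386 test binaries.
    [ "$machine" = "i386" ] || return 1
    case "$model" in
        *AMD*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example on OpenBSD:
#   if is_candidate_host "$(sysctl -n hw.machine)" "$(sysctl -n hw.model)"; then
#       echo "host matches the reported failing configuration"
#   fi
```

This check does not apply to NetBSD, where the amd64 builders were also reported as failing on AMD hardware.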

GOARCH=386 ./all.bash only starts erroring with exec format error when tests begin to run, which is expected.

That said, I have long suspected memory corruption errors specifically related to Go, OpenBSD, AMD, and forking; see issue #34988.

I’m not sure this error is entirely a regression, as I’ve seen it with Go releases before 1.18 on NetBSD. But perhaps something is making it much more frequent.