go: syscall: memory corruption when forking on OpenBSD, NetBSD, AIX, and Solaris
#!watchflakes
default <- `fatal error: (?:.*\n\s*)*syscall\.forkExec` && (goos == "aix" || goos == "netbsd" || goos == "openbsd" || goos == "solaris")
What version of Go are you using (go version)?
$ go version
go version go1.13.2 openbsd/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jrick/.cache/go-build"
GOENV="/home/jrick/.config/go/env"
GOEXE=""
GOFLAGS="-tags=netgo -ldflags=-extldflags=-static"
GOHOSTARCH="amd64"
GOHOSTOS="openbsd"
GONOPROXY=""
GONOSUMDB=""
GOOS="openbsd"
GOPATH="/home/jrick/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/jrick/src/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/jrick/src/go/pkg/tool/openbsd_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
What did you do?
I observed these issues in one of my applications, and assumed it was a race or invalid unsafe.Pointer usage or some other fault of the application code. When the 1.13.2 release dropped yesterday I built it from source and observed a similar issue running the regression tests. The failed regression test does not look related to the memory corruption, but I can reproduce the problem by repeatedly running the test in a loop:
$ cd test # from go repo root
$ while :; do go run run.go -- fixedbugs/issue27829.go || break; done >go.panic 2>&1
It can take several minutes to observe the issue but here are some of the captured panics and fatal runtime errors:
Additionally, I observed go run hanging (no runtime failure due to deadlock) and it had to be killed with SIGABRT to get a trace: https://gist.githubusercontent.com/jrick/d4ae1e4355a7ac42f1910b7bb10a1297/raw/54e408c51a01444abda76dc32ac55c2dd217822b/gistfile1.txt
It may not matter which regression test is run as the errors also occur in run.go.
About this issue
- State: open
- Created 5 years ago
- Comments: 55 (39 by maintainers)
Commits related to this issue
- Attempt to guarantee that on copy-on-write faulting, the new copy can't be written to while any thread can see the original version of the page via a not-yet-flushed stale TLB entry: pmaps can indicat... — committed to openbsd/src by deleted user 2 years ago
- os/exec: parallelize more tests This cuts the wall duration for 'go test os/exec' and 'go test -race os/exec' roughly in half on my machine, which is an even more significant speedup with a high '-co... — committed to golang/go by bcmills 2 years ago
- uvm: prevent TLB invalidation races during COW resolution When a thread takes a page fault which results in COW resolution, other threads in the same process can be concurrently accessing that same m... — committed to NetBSD/src by deleted user a year ago
- Pull up following revision(s) (requested by chs in ticket #327): sys/uvm/uvm_fault.c: revision 1.234 uvm: prevent TLB invalidation races during COW resolution When a thread takes a page fault whic... — committed to NetBSD/src by MartinHusemann a year ago
- Pull up following revision(s) (requested by chs in ticket #1714): sys/uvm/uvm_fault.c: revision 1.234 uvm: prevent TLB invalidation races during COW resolution When a thread takes a page fault whi... — committed to NetBSD/src by MartinHusemann a year ago
- uvm: prevent TLB invalidation races during COW resolution When a thread takes a page fault which results in COW resolution, other threads in the same process can be concurrently accessing that same m... — committed to IIJ-NetBSD/netbsd-src by deleted user a year ago
- Pull up following revision(s) (requested by chs in ticket #327): sys/uvm/uvm_fault.c: revision 1.234 uvm: prevent TLB invalidation races during COW resolution When a thread takes a page fault whic... — committed to IIJ-NetBSD/netbsd-src by MartinHusemann a year ago
- Pull up following revision(s) (requested by chs in ticket #1714): sys/uvm/uvm_fault.c: revision 1.234 uvm: prevent TLB invalidation races during COW resolution When a thread takes a page fault whi... — committed to IIJ-NetBSD/netbsd-src by MartinHusemann a year ago
I managed to reproduce this on NetBSD with a C program!
forkstress.c:
Build and run:
Notes:
- The program calls the fork syscall directly rather than using fork() from libc. I think this is simply because libc fork() is significantly slower to return than syscall(SYS_fork), and we seem to have a small race window.
- I looked at _malloc_prefork/postfork and (I believe) all of the registered atfork callbacks (here, here, and here), and none of them seem important, as neither thread is interacting with pthread or malloc.
The summarized behavior we see is:
1. page->b = 102
2. fork()
3. page->b = 2
4. Read page->b, observe 102 instead of 2.
5. Read page->b again, which typically observes 2 again.
All while another thread is spinning writing to page->c (unrelated word in the same page).
While debugging oxidecomputer/omicron#1146 I saw that this bug mentions Solaris and wondered if it might affect illumos as well, since the failure modes look the same for my issue. For the record, I don’t think my issue was caused by this one. I ran the Go and C test programs for several days without issue, and I ultimately root-caused my issue to illumos#15254. I mention this in case anyone in the future is wondering if illumos is affected by this. I don’t know whether Solaris (or any other system) has the same issue with preserving the %ymm registers across signal handlers, but that can clearly cause the same failure modes shown here.
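For readers without the original forkstress.c at hand, the store/fork/store/load pattern from the summary in the NetBSD reproduction comment can be sketched roughly as below. This is not the original program: the names shared_page, spinner, raw_fork, and run_forkstress are hypothetical, the fallback to libc fork() where SYS_fork is undefined is an assumption made for portability, and the child check assumes that a raw fork syscall may return a nonzero value in the child on some kernels.

```c
/* Hypothetical sketch of the access pattern; NOT the original forkstress.c. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Two words sharing one private page, which becomes copy-on-write after fork. */
struct shared_page {
    volatile uint64_t b; /* written and re-read by the forking thread */
    volatile uint64_t c; /* unrelated word hammered by the spinner thread */
};

static struct shared_page *page;
static atomic_int stop_spinner;

/* Second thread: keeps writing to page->c until told to stop. */
static void *spinner(void *arg) {
    (void)arg;
    for (uint64_t n = 0; !atomic_load(&stop_spinner); n++)
        page->c = n;
    return NULL;
}

/* Raw fork syscall where SYS_fork exists; libc fork() otherwise (assumption). */
static pid_t raw_fork(void) {
#ifdef SYS_fork
    return (pid_t)syscall(SYS_fork);
#else
    return fork();
#endif
}

/* Runs `iters` rounds and returns how many times the first read of page->b
 * after the second store was not 2, or -1 on a setup or fork failure. */
long run_forkstress(long iters) {
    long stale = 0;
    long pagesize = sysconf(_SC_PAGESIZE);
    assert(pagesize > 0 && sizeof(struct shared_page) <= (size_t)pagesize);

    void *mem = mmap(NULL, (size_t)pagesize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    page = mem;

    pthread_t tid;
    atomic_store(&stop_spinner, 0);
    if (pthread_create(&tid, NULL, spinner, NULL) != 0) {
        munmap(mem, (size_t)pagesize);
        return -1;
    }

    pid_t parent = getpid();
    for (long i = 0; i < iters; i++) {
        page->b = 102;                     /* step 1 */
        pid_t pid = raw_fork();            /* step 2 */
        if (pid == 0 || getpid() != parent)
            _exit(0);                      /* child: exit immediately */
        if (pid < 0) {
            stale = -1;
            break;
        }
        page->b = 2;                       /* step 3 */
        uint64_t first = page->b;          /* step 4: must be 2 */
        uint64_t second = page->b;         /* step 5 */
        if (first != 2) {
            stale++;
            fprintf(stderr, "round %ld: read b=%llu, then b=%llu\n", i,
                    (unsigned long long)first, (unsigned long long)second);
        }
        waitpid(pid, NULL, 0);             /* reap the child */
    }

    atomic_store(&stop_spinner, 1);
    pthread_join(tid, NULL);
    munmap(mem, (size_t)pagesize);
    return stale;
}
```

A caller would invoke run_forkstress with a large iteration count from main and report the result, building with thread support enabled (for example with -pthread). On an unaffected system the function is expected to return 0. Going by the thread, an affected system may need minutes of running, possibly with several instances in parallel, before a stale read shows up.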
OpenBSD applied the following change:
https://github.com/openbsd/src/commit/43687ba57c7d88063c6fa2df2386adbd1a0cf241
Based on a cursory skim (without having thought much about the details of the change), I suspect that this change is designed to fix another copy-on-write bug just like https://reviews.freebsd.org/D14347 and like I described in https://github.com/golang/go/issues/34988#issuecomment-997000571.
However, I have no theory for how the FreeBSD fix or the NetBSD fix could affect the problem we detected here, because that copy-on-write bug – and the fix in FreeBSD and NetBSD – makes sense only if the TLB IPI handler runs between the store and load of the memory location at issue, whereas our measurements on NetBSD indicate that that’s physically implausible for the issue in this thread.
In NetBSD, we have not yet committed a fix for that COW bug because our draft fixes (like the one in https://github.com/golang/go/issues/34988#issuecomment-997000571) seem to have the side effect of suppressing the issue in this thread. I’m still hoping to hear from AMD with an idea about what could be going wrong in the 10ns window we observed but they haven’t gotten back to me yet. However, if we don’t hear anything before NetBSD 10.0 is ready we’ll probably just apply the unrelated-COW fix.
I tried the linked C program on AIX with a couple of modifications. I had to modify it to use fork, because I am not sure whether AIX can fork without going through libc, so the test might not really indicate anything in this case. It did not reproduce anything after running for 5 or so minutes.
FWIW, the NetBSD issue I filed is at http://gnats.netbsd.org/56535.
@bokunodev the issue you are describing is not related to this bug. You perhaps accidentally replied to this bug, or need to file a new issue.
spoke too soon:
https://gist.githubusercontent.com/jrick/a071767cde2d2d71b210135cf8282b04/raw/6fcd814e5a93a6a1d204c2d00b0a1f4195664d61/gistfile1.txt