go: runtime: FreeBSD memory corruption involving fork system call
Please answer these questions before submitting your issue. Thanks!
- What version of Go are you using (go version)?
go version go1.6.2 freebsd/amd64
- What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
- What did you do?
package main

/* stdlib includes */
import (
	"fmt"
	"os/exec"
)

func run(done chan struct{}) {
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		goto finished
	}
	cmd.Wait()
finished:
	done <- struct{}{}
	return
}

func main() {
	fmt.Println("Starting a bunch of goroutines...")
	// 8 & 16 are arbitrary
	done := make(chan struct{}, 16)
	for i := 0; i < 8; i++ {
		go run(done)
	}

	for {
		select {
		case <-done:
			go run(done)
		}
	}
}
- What did you expect to see?
I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.
- What did you see instead?
Various types of panics caused by what looks to be corruption within the finalizer lists, which I assume stems from race conditions. These panics can happen in as little as 2 minutes, or take much longer; 10 minutes seems a good round number.
Occasionally addspecial gets stuck in an infinite loop while holding the lock, and the process wedges. This is illustrated in log 1462933614, where x.next points to x, which appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.
I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.
Here are some sample messages:
fatal error: runtime.SetFinalizer: finalizer already set

runtime: nonempty check fails
b.log[0]= 0
b.log[1]= 0
b.log[2]= 0
b.log[3]= 0
fatal error: workbuf is empty

Attached logs:
- 1462926841-SetFinalizer-ex1.txt
- 1462926969-SetFinalizer-ex2.txt
- 1462933295-nonempty-check-fails.txt
- 1462933614-wedged.txt
This was run on an 8-core processor and on a 4-core/8-thread processor with ECC RAM, with similar results.
Additionally, while this example is extreme, it also represents the core functionality of a project I have been working on part-time for many months. I am happy to provide any further assistance diagnosing this issue - I am very invested!
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 1
- Comments: 243 (161 by maintainers)
Commits related to this issue
- runtime: clear signal stack on main thread This is a workaround for a FreeBSD kernel bug. It can be removed when we are confident that all people are using the fixed kernel. See #15658. Fixes #15658... — committed to fancybits/go by ianlancetaylor 7 years ago
- runtime: clear signal stack on main thread This is a workaround for a FreeBSD kernel bug. It can be removed when we are confident that all people are using the fixed kernel. See #15658. Updates #156... — committed to golang/go by ianlancetaylor 7 years ago
- Eliminate use of sysctl command on FreeBSD In order to improve performance and help prevent crashes due to the outstanding fork crash bug: https://github.com/golang/go/issues/15658 Replace string pa... — committed to stevenh/gopsutil by stevenh 7 years ago
- Eliminate use of sysctl command on FreeBSD In order to improve performance and help prevent crashes due to the outstanding fork crash bug: https://github.com/golang/go/issues/15658 Replace string pa... — committed to ContinuumLLC/godep-shirou-gopsutil by stevenh 7 years ago
- Eliminate call to swapinfo on FreeBSD Improve performance by eliminating the fork out to swapinfo on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug: golang/go#... — committed to stevenh/gopsutil by stevenh 6 years ago
- Eliminate call to uname on FreeBSD Improve performance by eliminating the fork out to uname on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug: golang/go#15658 ... — committed to stevenh/gopsutil by stevenh 6 years ago
Ok reproduced it in C:
Produces:
As you can see the second main output, which is straight after an execv, already has a signal stack which it shouldn’t have.
This is corroborated by execve(2):
The signal stack is reset to be undefined (see sigaction(2) for more information).
So now to go kernel bug hunting…
Thanks to @kostikbel who knew exactly where to look for this, here’s the kernel patch for people to test:
My test here, which usually crashes in a few seconds, has been running for 20 minutes now with no issues, so I am hopeful this is the fix.
Ok so I may have a lead on this.
Given that go 1.4 appeared to be unaffected, I tried go 1.5, which also appears to be unaffected; switching to go 1.6, however, crashed within seconds.
After some bisecting I found that reverting https://github.com/golang/go/commit/a7cad52e04dd1890420452b38563997930edb2ca from go 1.6 made it stable again.
Digging some more, I see that after execve, sigaltstack is returning the details of a signal stack set before the execve, which should never happen.
Ok, still no issues triggered on my test box, and it's been running forks in a loop since the 18th and 20th of November.
Given this I'm tempted to say that the issue has been fixed either in FreeBSD or in the Go runtime.
I’ve been running the usual tests on 11.2-RELEASE-p3 with go1.11.1 and have not managed to reproduce after a weeks run.
Has anyone else seen a crash with these updated versions?
Wow, incredible work @stevenh.
Looks like that patch was committed this morning: https://github.com/freebsd/freebsd/commit/0350bb5e92a7d5e3867a00def3b358d1983a4b2a
Does it make sense to work around this bug in go until the upstream fix propagates out? (By reverting the os1_freebsd.go changes in a7cad52e04dd1890420452b38563997930edb2ca.)
Hi Matt, my current reproduction case is:
Then this in the lib:
It’s not quick and not guaranteed.
Having applied the two patches, I've unfortunately managed to confirm it hasn't solved the issue, as I had a crash:
Just wanted to confirm this is still an issue with 1.9.1, which includes the signal handling fixes:
After speaking to @kostikbel this afternoon we may yet still have an outstanding issue.
It's possible that the repro case identified by @tmm1 was independent of the original issue.
I’m rebuilding the original fork only test case and retesting.
A huge thank you to @stevenh and everyone involved in tracking down this issue!
The FreeBSD change will make it into the stable/10 and stable/11 branches next week (by end of Mar 2017), but it will take more time to appear in new minor or errata releases; I suspect that it will be around August when we can count on fixed FreeBSD releases being widely available.
Is it possible to perform a run-time check, with the change in a7cad52 applying only to known fixed versions?
(Comment updated to clarify that it’s the FreeBSD change that’s making it into branches next week, and to provide a non-relative “next week”.)
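As an illustration only (not something the runtime actually does), the run-time check suggested above could in principle read kern.osreldate and enable the old behaviour only on known-fixed kernels. A minimal user-space sketch, using a placeholder threshold rather than the real fixed __FreeBSD_version:

package main

import (
	"fmt"
	"syscall"
)

// hasFixedKernel is a hypothetical helper: it reports whether kern.osreldate
// is at least some known-fixed __FreeBSD_version. The constant below is a
// placeholder, not the actual version that carries the fix.
func hasFixedKernel() bool {
	const fixedOsreldate = 1100000 // placeholder threshold, not the real one
	v, err := syscall.SysctlUint32("kern.osreldate")
	if err != nil {
		return false // if we cannot tell, assume the kernel is unfixed
	}
	return v >= fixedOsreldate
}

func main() {
	fmt.Println("kernel appears to have the fix:", hasFixedKernel())
}

The runtime itself cannot use the syscall package, so it would have to issue the equivalent sysctl directly, but the decision logic would be the same.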
Thanks @bradfitz. I only just realized that the numbers in freebsd-amd64-gce93 and freebsd-amd64-gce101 are FreeBSD versions. I thought they were GCE versions somehow.
Anyway, reproduced reliably on freebsd-amd64-110 gomote (FreeBSD 11.0). That’s progress.
To see what happens when we eliminate new goroutines and channels, I've been testing with the following:
After 4 days of running it wedged using very little CPU.
ktrace showed it was waiting periodically, but not much else. The /debug/pprof/ endpoint was unresponsive, and there was one goroutine waiting to be run in Wait4:
It seems like the runtime scheduler is failing to run either the one goroutine which holds the fork lock (above) or the pprof handler, so it is somehow wedged.
This may be irrelevant and just another symptom of memory corruption but I wanted to share as I’ve not seen it before.
I have the full goroutine trace after -ABRT and the gdb threads trace before that in case it might shed some additional light.
Just wanted to give a quick status-update:
As has been mentioned, the fork appears to be critical. In testing, I've removed all of the pipe plumbing in the parent and exited the child immediately after fork (using SYS_EXIT), for my /doesnotexist example. This reliably crashes the process with similar panics. With the fork removed, everything is fine. This implies that the exec syscall is not implicated.
What is surprising to me is the nature of the call: it is surprising that a child might affect a parent process, but if the syscall itself is to blame, that is also surprising.
I can make this work available if anyone is interested.
I feel that my next step is to circumvent the regular runtime path that /doesnotexist takes and write a second test using, say, the RawSyscall interface alone.
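For illustration, a raw-syscall test along those lines might look like the sketch below. This is hypothetical, not the actual test program; it deliberately bypasses os/exec and the runtime's normal fork path, and it assumes FreeBSD's raw fork(2) returns 0 in the child, as the syscall package's own exec support relies on.

package main

import "syscall"

func main() {
	for {
		// Raw fork, bypassing os/exec and the runtime's usual fork/exec
		// locking; the point is only to exercise the bare syscall.
		pid, _, errno := syscall.RawSyscall(syscall.SYS_FORK, 0, 0, 0)
		if errno != 0 {
			panic(errno)
		}
		if pid == 0 {
			// Child: exit immediately without running any Go code.
			syscall.RawSyscall(syscall.SYS_EXIT, 0, 0, 0)
		}
		// Parent: reap the child so zombies do not accumulate.
		var ws syscall.WaitStatus
		syscall.Wait4(int(pid), &ws, 0, nil)
	}
}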
While updating our test box to 12.0 from 11.2 I noticed that this latest test was running with a potentially relevant kernel patch on top of 11.2-RELEASE.
If anyone can reproduce this issue on 11.2 then they should try applying the following to see if was indeed the fix: https://svnweb.freebsd.org/base?view=revision&revision=335171
This is already included in 12.0-RELEASE and was MFC’ed to 11-stable as of: https://svnweb.freebsd.org/base?view=revision&revision=335508
Thanks very much for following up on this. I'm going to close the issue in the hope that it is indeed fixed. Anyone, please comment if you find future failures.
Not sure how, as this appears to be random memory corruption of the forking process. I've tried running it through truss before, but it adds enough delay that it doesn't reproduce.
I can see if tracing the fork syscall with dtrace shows anything.
From the description I got from alc of how the issue this fix addresses exhibits itself, it sounds like it's not our cause, but I'm running a test nonetheless.
Are the folks experiencing this running on Skylake or Kaby Lake hardware? If so, this might be relevant: https://lists.debian.org/debian-devel/2017/06/msg00308.html . Please try updating your CPU microcode to the latest release.
@rsc: Thanks. Agreed that there may be other things going on here.
Having said that, the issue I'm referencing is having the same crashes (fatal error: workbuf is not empty, fatal error: acquirep: invalid p state), in the same places, under Linux running the same test as on FreeBSD. I suspect this should be enough to invalidate the change to FreeBSD-only, yet perhaps not enough to merge the two issues. I haven't seen the fatal error: unexpected signal during runtime execution referenced there, for example, and I've seen a lot of crashes.
Confirmed that patch fixes the crash on FreeBSD 11.0-RELEASE-p8.
Another data point for this: after 8 hours of testing the program from https://github.com/golang/go/issues/15658#issuecomment-284940367 under go 1.4, it has yet to fail.
The same test under go 1.8 fails within a matter of minutes at most.
Oh, and has anyone tried reproducing this on a single-core box, versus SMP?
(You can use cpuset to push the process and its descendants onto a single core.)
-adrian
Here’s an updated version of the reproduction program, now with no external dependencies or goroutines:
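The updated program itself is not reproduced here. Purely as a sketch of the shape described (standard library only, no extra goroutines), it might look something like the following; this is hypothetical, not the original code, with /doesnotexist standing in for any missing binary:

package main

import "os/exec"

func main() {
	// Fork/exec a non-existent binary in a tight loop from the main
	// goroutine only; the path is illustrative.
	for {
		cmd := exec.Command("/doesnotexist")
		if err := cmd.Start(); err != nil {
			continue
		}
		cmd.Wait()
	}
}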
I'd add that it also seems to need GC: with GOGC=100 it always takes many hours here, with GOGC=2-5 it's usually between 20 minutes and 1 hour, and I've never had it happen with GOGC=off.
I’m going to leave a GOGC=off test running over the weekend to try and confirm that.
Had it hang on a run last night; here's a full trace with GOTRACEBACK=crash, so it includes the runtime routines. I also have the crash dump if it's of use: https://gist.github.com/stevenh/4b6d5bde0b81503c1a4bca461a5322ae