go: runtime: FreeBSD memory corruption involving fork system call

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.2 freebsd/amd64

  2. What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
  3. What did you do?
package main

/* stdlib includes */
import (
        "fmt"
        "os/exec"
)

func run(done chan struct{}) {
        cmd := exec.Command("true")
        if err := cmd.Start(); err != nil {
                goto finished
        }

        cmd.Wait()

finished:
        done <- struct{}{}
        return
}

func main() {
        fmt.Println("Starting a bunch of goroutines...")

        // 8 & 16 are arbitrary
        done := make(chan struct{}, 16)

        for i := 0; i < 8; i++ {
                go run(done)
        }

        for {
                select {
                case <-done:
                        go run(done)
                }
        }
}
  4. What did you expect to see?

I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.

  5. What did you see instead?

Various types of panics, caused by what looks to be corruption within the finalizer lists, which I am assuming stems from race conditions. These panics can happen in as little as 2 minutes, or take much longer; 10 minutes seems a good round number.

Occasionally addspecial gets stuck in an infinite loop while holding the lock, and the process wedges. This is illustrated in log 1462933614, where x.next points to x, which appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.

I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.
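
For context, the wedge described above is what you get when a sorted-list insertion like the following hits a node whose next pointer refers back to the node itself (a simplified sketch of the symptom only; the type and field names are illustrative, not the actual runtime code):

package main

import "fmt"

type special struct {
	offset uint16
	next   *special
}

// insertSorted mimics the kind of sorted-insert scan addspecial performs:
// walk the list to find the insertion point, then splice the node in. If
// corruption leaves a node's next pointer aimed at itself, the walk never
// advances and the caller spins forever while holding its lock.
func insertSorted(head **special, s *special) {
	prev := head
	for x := *head; x != nil; x = x.next {
		if s.offset < x.offset {
			break
		}
		prev = &x.next
	}
	s.next = *prev
	*prev = s
}

func main() {
	var head *special
	insertSorted(&head, &special{offset: 8})
	insertSorted(&head, &special{offset: 4})
	// A corrupted list would have head.next == head; inserting a node with
	// a larger offset would then never terminate.
	fmt.Println(head.offset, head.next.offset)
}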

Here are some sample messages:

  • fatal error: runtime.SetFinalizer: finalizer already set
  • runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0 fatal error: workbuf is empty

  • 1462926841-SetFinalizer-ex1.txt
  • 1462926969-SetFinalizer-ex2.txt
  • 1462933295-nonempty-check-fails.txt
  • 1462933614-wedged.txt

This was run on an 8-core processor, and on a 4-core/8-thread processor with ECC RAM, with similar results.

Additionally, while this example is extreme, it also represents the core functionality of a project I’ve been working on part-time for many months. I’m happy to provide any further assistance diagnosing this issue - I’m very invested!

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 1
  • Comments: 243 (161 by maintainers)

Most upvoted comments

OK, reproduced it in C:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

extern char **environ;

static void
die(const char *s)
{
       perror(s);
       exit(EXIT_FAILURE);
}

static void
printsigstack(const char *prefix)
{
        stack_t ss;

        if (sigaltstack(NULL, &ss) < 0)
                die("sigaltstack get");
        printf("%s - ss_sp=%p ss_size=%lu ss_flags=0x%x\n", prefix, ss.ss_sp,
            (unsigned long)ss.ss_size, (unsigned int)ss.ss_flags);
}

static void
setstack(void *arg)
{
        stack_t ss;

        ss.ss_sp = malloc(SIGSTKSZ);
        if (ss.ss_sp == NULL)
                die("malloc");

        ss.ss_size = SIGSTKSZ;
        ss.ss_flags = 0;
        if (sigaltstack(&ss, NULL) < 0)
                die("sigaltstack set");
}


static void *
thread(void *arg)
{
        char *argv[] = { "./test-sigs", "no-more", 0 };

        printsigstack("thread - before setstack");
        setstack(arg);
        printsigstack("thread - after setstack + before execve");

        execve(argv[0], &argv[0], environ);
        die("exec failed");

        return NULL;
}

int
main(int argc, char** argv)
{
        int j;
        pthread_t tid;

        printf("argc: %d, %s\n", argc, argv[0]);
        printsigstack("main");
        if (argc != 1)
                return 0;

        j = pthread_create(&tid, NULL, thread, NULL);
        if (j != 0) {
                errno = j;
                die("pthread_create");
        }

        j = pthread_join(tid, NULL);
        if (j != 0) {
                errno = j;
                die("pthread_join");
        }

        return 0;
}

Produces:

argc: 1, ./test-sigs
main - ss_sp=0x0 ss_size=0 ss_flags=0x4
thread - before setstack - ss_sp=0x0 ss_size=0 ss_flags=0x4
thread - after setstack + before execve - ss_sp=0x8014106c0 ss_size=34816 ss_flags=0x0
argc: 2, ./test-sigs
main - ss_sp=0x8014106c0 ss_size=34816 ss_flags=0x0

As you can see from the second main output, which comes straight after the execve, the process already has a signal stack, which it shouldn’t have.

This is corroborated by execve(2): "The signal stack is reset to be undefined (see sigaction(2) for more information)."

So now to go kernel bug hunting…

Thanks to @kostikbel who knew exactly where to look for this, here’s the kernel patch for people to test:

Index: sys/kern/kern_sig.c
===================================================================
--- sys/kern/kern_sig.c (revision 315303)
+++ sys/kern/kern_sig.c (working copy)
@@ -976,7 +976,6 @@ execsigs(struct proc *p)
         * and are now ignored by default).
         */
        PROC_LOCK_ASSERT(p, MA_OWNED);
-       td = FIRST_THREAD_IN_PROC(p);
        ps = p->p_sigacts;
        mtx_lock(&ps->ps_mtx);
        while (SIGNOTEMPTY(ps->ps_sigcatch)) {
@@ -1007,6 +1006,8 @@ execsigs(struct proc *p)
         * Reset stack state to the user stack.
         * Clear set of signals caught on the signal stack.
         */
+       td = curthread;
+       MPASS(td->td_proc == p);
        td->td_sigstk.ss_flags = SS_DISABLE;
        td->td_sigstk.ss_size = 0;
        td->td_sigstk.ss_sp = 0;

My test here, which usually crashes in a few seconds, has now been running for 20 minutes with no issues, so I’m hopeful this is the fix for this.

OK, so I may have a lead on this.

Given that Go 1.4 appeared to be unaffected, I tried Go 1.5, which also appears to be unaffected; however, switching to Go 1.6 crashed within seconds.

After some bisecting I found that reverting https://github.com/golang/go/commit/a7cad52e04dd1890420452b38563997930edb2ca from go 1.6 made it stable again.

Digging some more, I see that after execve, sigaltstack returns the details of a signal stack that was set before the execve, which should never happen.

OK, still no issues triggered on my test box, which has been running forks in a loop since the 18th and 20th of November:

root  50215  273.0  0.0   104384   4988  9  S+J  20Nov18  134201:59.00 ./test
root  46993  259.4  0.0   104640   3204  8  S+J  18Nov18  151748:35.09 ./test2

Given this, I’m tempted to say that the issue has been fixed either in FreeBSD or in the Go runtime.

I’ve been running the usual tests on 11.2-RELEASE-p3 with go1.11.1 and have not managed to reproduce after a week’s run.

Has anyone else seen a crash with these updated versions?

Wow, incredible work @stevenh.

Looks like that patch was committed this morning: https://github.com/freebsd/freebsd/commit/0350bb5e92a7d5e3867a00def3b358d1983a4b2a

Does it make sense to work around this bug in Go until the upstream fix propagates out? (By reverting the os1_freebsd.go changes in a7cad52e04dd1890420452b38563997930edb2ca.)

Hi Matt, my current reproduction case is:

package main
  
import (
        "fmt"
        "log"
        "runtime"
        "syscall"
)

var (
        forkRoutines = 16
)

func run(done chan struct{}) {
        if err := syscall.ForkOnlyBSDTest(); err != nil {
                log.Fatal(err)
        }

        done <- struct{}{}
}

func main() {
        fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        done := make(chan struct{}, forkRoutines*2)

        for i := 0; i < forkRoutines; i++ {
                go run(done)
        }

        for range done {
                go run(done)
        }
}

Then this in the syscall package:

// +build darwin dragonfly freebsd netbsd openbsd
  
package syscall

func ForkOnlyBSDTest() (err error) {
        var r1 uintptr
        var pid int
        var err1 Errno
        var wstatus WaitStatus

        ForkLock.Lock()
        runtime_BeforeFork()

        r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
        if err1 != 0 {
                runtime_AfterFork()
                ForkLock.Unlock()
                return err1
        }

        if r1 == 0 {
                // in child, die die die my darling
                for {
                        RawSyscall(SYS_EXIT, 253, 0, 0)
                }
        }

        runtime_AfterFork()
        ForkLock.Unlock()

        pid = int(r1)

        // Prime directive, exterminate
        // Whatever stands left
        _, err = Wait4(pid, &wstatus, 0, nil)
        for err == EINTR {
                _, err = Wait4(pid, &wstatus, 0, nil)
        }

        return
}

It’s not quick and not guaranteed.

Having applied the two patches, I’ve unfortunately managed to confirm it hasn’t solved the issue, as I had a crash:

acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state

Just wanted to confirm this is still an issue with 1.9.1, which includes the signal handling fixes:

2017/10/10 22:57:08 Starting 8 forking goroutines...
2017/10/10 22:57:08 GOMAXPROCS: 24 version: go1.9.1
acquirep: p->m=842351920128(9) p->status=1
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x7a8101, 0x19)
        /usr/local/go/src/runtime/panic.go:605 +0x95
runtime.acquirep1(0xc42001ec00)
        /usr/local/go/src/runtime/proc.go:3689 +0x16d
runtime.acquirep(0xc42001ec00)
        /usr/local/go/src/runtime/proc.go:3661 +0x2b
runtime.findrunnable(0xc420021300, 0x0)
        /usr/local/go/src/runtime/proc.go:2068 +0x74f
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2245 +0x12c
runtime.park_m(0xc420152900)
        /usr/local/go/src/runtime/proc.go:2308 +0xb6
runtime.mcall(0x800a0b040)
        /usr/local/go/src/runtime/asm_amd64.s:286 +0x5b
...
1 runs completed, 1 failures, over 6h47m54s

After speaking to @kostikbel this afternoon we may yet still have an outstanding issue.

It’s possible that the repro case identified by @tmm1 was independent of the original issue.

I’m rebuilding the original fork only test case and retesting.

A huge thank you to @stevenh and everyone involved in tracking down this issue!

The FreeBSD change will make it into the stable/10 and stable/11 branches next week (by end of Mar 2017), but it will take more time to appear in new minor or errata releases; I suspect that it will be around August when we can count on fixed FreeBSD releases being widely available.

Is it possible to perform a run-time check, with the change in a7cad52 applying only to known fixed versions?

(Comment updated to clarify that it’s the FreeBSD change that’s making it into branches next week, and to provide a non-relative “next week”.)
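
For illustration, such a run-time gate could key off the kernel version reported by the kern.osreldate sysctl (a minimal sketch under that assumption; the cutoff constant below is hypothetical, and this is not the change that was ultimately merged):

package main

import (
	"fmt"
	"syscall"
)

// hypotheticalFixedOSRelDate is a placeholder __FreeBSD_version cutoff; the
// real value would have to come from the commit that fixed execsigs().
const hypotheticalFixedOSRelDate = 1200000

// kernelHasSigstackFix reports whether the running kernel claims to be new
// enough to carry the execsigs() fix, falling back to "no" if the sysctl
// cannot be read.
func kernelHasSigstackFix() bool {
	v, err := syscall.SysctlUint32("kern.osreldate")
	if err != nil {
		return false
	}
	return v >= hypotheticalFixedOSRelDate
}

func main() {
	fmt.Println("kernel has signal stack fix:", kernelHasSigstackFix())
}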

Thanks @bradfitz. I only just realized that the numbers in freebsd-amd64-gce93 and freebsd-amd64-gce101 are FreeBSD versions. I thought they were GCE versions somehow.

Anyway, reproduced reliably on freebsd-amd64-110 gomote (FreeBSD 11.0). That’s progress.

To see what happens when we eliminate new goroutines and channels, I’ve been testing with the following:

package main

import (
        "log"
        "net/http"
        _ "net/http/pprof"
        "runtime"
        "syscall"
)

var (
        forkRoutines = 16
)

func run() {
        for {
                if err := syscall.ForkOnlyBSDTest(); err != nil {
                        log.Fatal(err)
                }
        }
}

func main() {
        log.Printf("Starting %v forking goroutines...\n", forkRoutines)
        log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        for i := 0; i < forkRoutines; i++ {
                go run()
        }

        log.Println(http.ListenAndServe(":8080", nil))
}

After 4 days of running it wedged, using very little CPU.

ktrace showed it was waiting periodically, but not much else. The /debug/pprof/ endpoint was unresponsive, and there was one goroutine waiting to be run in Wait4:

goroutine 71 [runnable]:
syscall.Wait4(0x13e31, 0xc42008d73c, 0x0, 0x0, 0x13e31, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall_bsd.go:127 fp=0xc42008d700 sp=0xc42008d6f8
syscall.ForkOnlyBSDTest(0x0, 0x0)
        /usr/local/go/src/syscall/exec_bsd_debug.go:35 +0xf1 fp=0xc42008d780 sp=0xc42008d700
main.run()
        /data/go/src/github.com/multiplay/go/apps/test/main.go:17 +0x2e fp=0xc42008d7c0 sp=0xc42008d780
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42008d7c8 sp=0xc42008d7c0
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:28 +0x1c4

It seems like the runtime scheduler is failing to run either the one goroutine which holds the fork lock (above) or the pprof handler, so it is somehow wedged.

This may be irrelevant and just another symptom of the memory corruption, but I wanted to share it as I’ve not seen it before.

I have the full goroutine trace after -ABRT and the gdb threads trace before that in case it might shed some additional light.
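
For anyone wanting the same dump without attaching gdb or killing the process, the test program can also be made to dump all goroutine stacks itself on a signal (a sketch, separate from the GOTRACEBACK=crash route used above; SIGUSR1 is an arbitrary choice):

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"runtime"
	"syscall"
)

func main() {
	// On SIGUSR1, write every goroutine's stack to stderr and keep running,
	// rather than aborting the process with SIGABRT.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		buf := make([]byte, 1<<20)
		for range sigs {
			n := runtime.Stack(buf, true)
			os.Stderr.Write(buf[:n])
		}
	}()

	// The usual fork/exec load.
	for {
		exec.Command("true").Run()
	}
}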

Just wanted to give a quick status-update:

As has been mentioned, the fork appears to be critical. In testing, I’ve removed all the pipe plumbing in the parent and exited the child immediately after the fork (using SYS_EXIT) for my /doesnotexist example. This reliably crashes the process with similar panics. With the fork removed, everything is fine. This implies that the exec syscall is not implicated.

What is surprising to me is the nature of the call: it is surprising that a child might affect a parent process, but if the syscall itself is to blame, well, that is also surprising.

I can make this work available if anyone is interested.

I feel that my next step is to circumvent the regular runtime path that /doesnotexist takes, and write a second test using, say, the RawSyscall interface alone.

While updating our test box to 12.0 from 11.2 I noticed that this latest test was running with a potentially relevant kernel patch on top of 11.2-RELEASE.

If anyone can reproduce this issue on 11.2 then they should try applying the following to see if was indeed the fix: https://svnweb.freebsd.org/base?view=revision&revision=335171

This is already included in 12.0-RELEASE and was MFC’ed to 11-stable as of: https://svnweb.freebsd.org/base?view=revision&revision=335508

Thanks very much for following up on this. I’m going to close the issue in the hope that it is indeed fixed. Anyone, please comment if you find future failures.

Not sure how, as this appears to be random memory corruption of the forking process. I’ve tried running it through truss before, but it adds enough delay that the problem doesn’t reproduce.

I can see if tracing the fork syscall with DTrace shows anything.

On Mon, 18 Jun 2018 at 14:39, Bartek Rutkowski notifications@github.com wrote:

@justinclift Wouldn’t DTrace (https://www.freebsd.org/doc/handbook/dtrace.html) be enough to investigate this problem?


From the description I got from alc of how the issue this fix addresses exhibits itself, it sounds like it’s not our cause, but I’m running a test nonetheless.

Are the folks experiencing this running on Skylake or Kaby Lake hardware? If so, this might be relevant: https://lists.debian.org/debian-devel/2017/06/msg00308.html. Please try updating your CPU microcode to the latest release.

@rsc: Thanks. Agreed that there may be other things going on here.

Having said that, the issue I’m referencing is seeing the same crashes (fatal error: workbuf is not empty, fatal error: acquirep: invalid p state), in the same places, under Linux running the same test as on FreeBSD. I suspect this should be enough to invalidate the change to FreeBSD-only, yet perhaps not enough to merge the two issues.

I haven’t seen the fatal error: unexpected signal during runtime execution referenced there, for example, and I’ve seen a lot of crashes.

I think we can work around the problem without requiring a version check. Can someone with an affected system please test whether https://golang.org/cl/38325 fixes the problem? Thanks.

Confirmed that patch fixes the crash on FreeBSD 11.0-RELEASE-p8

Another data point for this: after 8 hours of testing the program from https://github.com/golang/go/issues/15658#issuecomment-284940367 under Go 1.4, it has yet to fail.

The same test under Go 1.8 fails within a matter of minutes at most.

Oh, and has anyone tried reproducing this on a single-core box, versus SMP?

(You can use cpuset to push the process and its descendants onto a single core.)

-adrian
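
An in-process approximation is to drop GOMAXPROCS to 1, though that only serializes Go execution; the OS can still migrate the threads across cores, so cpuset remains the authoritative way to test the single-core case. A minimal sketch:

package main

import (
	"os/exec"
	"runtime"
)

func main() {
	// Only one OS thread runs Go code at a time; the kernel may still move
	// that thread between cores, unlike a cpuset pin.
	runtime.GOMAXPROCS(1)

	for {
		exec.Command("true").Run()
	}
}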

Here’s an updated version of the reproduction program, now with no external dependencies or goroutines:

package main

import (
	"bufio"
	"log"
	"os"
	"os/exec"
	"regexp"
	"syscall"
	"time"
)

func init() {
	exec.Command("true").Run()

	// do some work after the fork/exec, to expose memory corruption if it occurred
	f, _ := os.Open("/var/run/dmesg.boot")
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		regexp.MustCompile(`Origin\s*=\s*"(.+)"\s+Id\s*=\s*(.+)\s+Family\s*=\s*(.+)\s+Model\s*=\s*(.+)\s+Stepping\s*=\s*(.+)`).FindStringSubmatch(line)
	}
}

func main() {
	log.SetFlags(log.LstdFlags)
	log.Printf("started process")
	time.Sleep(1 * time.Millisecond)
	log.Printf("running exec...")
	syscall.Exec(os.Args[0], os.Args, os.Environ())
	log.Printf("exec failed!")
}

I’d add that it also seems to need GC: with GOGC=100 it always takes many hours here, with GOGC=2-5 it’s usually between 20 minutes and 1 hour, and I’ve never had it happen with GOGC=off.

I’m going to leave a GOGC=off test running over the weekend to try and confirm that.
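
The GC pressure can also be raised from inside the repro program with runtime/debug.SetGCPercent, which has the same effect as the GOGC environment variable (a sketch; the value 5 simply mirrors the GOGC=2-5 range above):

package main

import (
	"os/exec"
	"runtime"
	"runtime/debug"
)

func main() {
	// Roughly equivalent to running with GOGC=5: collect far more often so
	// that any corruption introduced around fork/exec surfaces sooner.
	debug.SetGCPercent(5)

	for {
		exec.Command("true").Run()
		runtime.GC() // force an extra collection after every fork/exec
	}
}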

Had it hang on a run last night; here’s a full trace with GOTRACEBACK=crash, so it includes the runtime routines. I also have the crash dump if it’s of use: https://gist.github.com/stevenh/4b6d5bde0b81503c1a4bca461a5322ae