go: syscall: memory corruption when forking on OpenBSD, NetBSD, AIX, and Solaris

#!watchflakes
default <- `fatal error: (?:.*\n\s*)*syscall\.forkExec` && (goos == "aix" || goos == "netbsd" || goos == "openbsd" || goos == "solaris")

What version of Go are you using (go version)?

$ go version
go version go1.13.2 openbsd/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jrick/.cache/go-build"
GOENV="/home/jrick/.config/go/env"
GOEXE=""
GOFLAGS="-tags=netgo -ldflags=-extldflags=-static"
GOHOSTARCH="amd64"
GOHOSTOS="openbsd"
GONOPROXY=""
GONOSUMDB=""
GOOS="openbsd"
GOPATH="/home/jrick/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/jrick/src/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/jrick/src/go/pkg/tool/openbsd_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"

What did you do?

I observed these issues in one of my applications and assumed it was a race, invalid unsafe.Pointer usage, or some other fault in the application code. When the 1.13.2 release dropped yesterday I built it from source and observed a similar failure while running the regression tests. The failed regression test does not look related to the memory corruption, but I can reproduce the problem by repeatedly running the test in a loop:

$ cd test # from go repo root
$ while :; do go run run.go -- fixedbugs/issue27829.go || break; done >go.panic 2>&1

It can take several minutes to observe the issue, but here are some of the captured panics and fatal runtime errors:

https://gist.githubusercontent.com/jrick/f8b21ecbfbe516e1282b757d1bfe4165/raw/6cf0efb9ba47ba869f98817ce945971f2dff47d6/gistfile1.txt

https://gist.githubusercontent.com/jrick/9a54c085b918aa32910f4ece84e5aa21/raw/91ec29275c2eb1be49f62ad8a01a5317ad168c94/gistfile1.txt

https://gist.githubusercontent.com/jrick/8faf088593331c104cc0da0adb3f24da/raw/7c92e7e7d60d426b2156fd1bdff42e0717b708f1/gistfile1.txt

https://gist.githubusercontent.com/jrick/4645316444c12cd815fb71874f6bdfc4/raw/bffac2a448b07242a538b77a2823c9db34b6ef6f/gistfile1.txt

https://gist.githubusercontent.com/jrick/3843b180670811069319e4122d32507a/raw/0d1f897aa25d91307b04ae951f1b260f33246b61/gistfile1.txt

https://gist.githubusercontent.com/jrick/99b7171c5a49b4b069edf06884ad8e17/raw/740c7b9e8fa64d9ad149fd2669df94e89c466927/gistfile1.txt

Additionally, I observed go run hanging (with no runtime deadlock failure), and it had to be killed with SIGABRT to get a trace: https://gist.githubusercontent.com/jrick/d4ae1e4355a7ac42f1910b7bb10a1297/raw/54e408c51a01444abda76dc32ac55c2dd217822b/gistfile1.txt

It may not matter which regression test is run as the errors also occur in run.go.

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 55 (39 by maintainers)

Most upvoted comments

I managed to reproduce this on NetBSD with a C program!

forkstress.c:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

void __attribute__((noinline)) spin(uint64_t loops) {
  for (volatile uint64_t i = 0; i < loops; i++) {
  }
}

struct thing {
  uint64_t b;
  uint32_t c;  // Making this (plus 'sink' below) uint64_t may make repro take longer?
};

volatile struct thing* page;

volatile uint32_t sink;

int ready;

void* thread(void* arg) {
  __atomic_store_n(&ready, 1, __ATOMIC_SEQ_CST);

  while (1) {
    // Spin not strictly required, but it speeds up repro in my case.
    spin(40*1000);

    // Atomic not required, this works too:
    // page->c = sink;
    __atomic_store_n(&page->c, sink, __ATOMIC_SEQ_CST);
    sink++;
  }
}

int main(void) {
  page = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  if (page == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  pthread_t thread_id;
  int ret = pthread_create(&thread_id, NULL, &thread, NULL);
  if (ret != 0) {
    perror("pthread_create");
    return 1;
  }

  // Wait for child thread to start.
  //
  // This is not required to repro, but eliminates racing fork+thread create as
  // a possibility.
  while (!__atomic_load_n(&ready, __ATOMIC_SEQ_CST)) {
  }

  int64_t i = 0;
  while (1) {
    i++;
    if (i % 10000 == 0) {
      printf("Loop %lld...\n", (long long)i);
    }

    page->b = 102;
    // Does not work with libc fork(). libc fork() is significantly slower,
    // which may be the problem.
    uint64_t pid = syscall(SYS_fork);
    if (pid == 0) {
      /* Child */
      _exit(0);
    }

    /* Parent */
    /* spin(40*1000); may speed up repro. */
    page->b = 2;
    uint64_t pb = page->b;
    if (pb != 2) {
      printf("Corruption! pb, page->b = %lu, %lu\n", pb, page->b);
      _exit(1);
    }

    int status;
    ret = waitpid(pid, &status, 0);
    if (ret < 0) {
      perror("waitpid");
      return 1;
    }
    if (WEXITSTATUS(status) != 0) {
      printf("Bad child status %#x\n", status);
      return 1;
    }
  }
}

Build and run:

$ cc -pthread forkstress.c
$ ./a.out & ./a.out & ./a.out  & ./a.out
Loop 10000...
Corruption! pb, page->b = 102, 2
Loop 10000...
Loop 10000...
Loop 10000...
Corruption! pb, page->b = 102, 2
Loop 20000...
Loop 20000...
Corruption! pb, page->b = 102, 2

Notes:

  • Anecdotally, running several instances at the same time seems to speed up repro more than a linear speedup (system under more load?). That said, a single instance will eventually fail.
  • Sometimes I get several failures immediately, sometimes it takes >60s to get a failure.
  • The most immediately interesting part here is that I call the fork syscall directly rather than using fork() from libc. I think this is simply because libc fork() is significantly slower to return than syscall(SYS_fork), and we seem to have a small race window.
    • I’ve looked through _malloc_prefork/postfork and (I believe) all of the registered atfork callbacks (here, here, and here), and none of them seem important, as neither thread is interacting with pthread or malloc.

The summarized behavior we see is:

  1. page->b = 102
  2. fork()
  3. page->b = 2
  4. Read page->b, observe 102 instead of 2.
  5. When logging the corruption, we load page->b again, which typically observes 2 again.

All while another thread is spinning writing to page->c (unrelated word in the same page).

While debugging oxidecomputer/omicron#1146 I saw that this bug mentions Solaris and wondered if it might affect illumos as well, since the failure modes look the same for my issue. For the record, I don’t think my issue was caused by this one. I ran the Go and C test programs for several days without issue, and I ultimately root-caused my issue to illumos#15254. I mention this in case anyone in the future is wondering if illumos is affected by this. I don’t know whether Solaris (or any other system) has the same issue with preserving the %ymm registers across signal handlers, but that can clearly cause the same failure modes shown here.

OpenBSD applied the following change:

https://github.com/openbsd/src/commit/43687ba57c7d88063c6fa2df2386adbd1a0cf241

Based on a cursory skim (without having thought much about the details of the change), I suspect that this change is designed to fix another copy-on-write bug just like https://reviews.freebsd.org/D14347 and like I described in https://github.com/golang/go/issues/34988#issuecomment-997000571.

However, I have no theory for how the FreeBSD fix or the NetBSD fix could affect the problem we detected here, because that copy-on-write bug – and the fix in FreeBSD and NetBSD – makes sense only if the TLB IPI handler runs between the store and load of the memory location at issue, whereas our measurements on NetBSD indicate that that’s physically implausible for the issue in this thread.

In NetBSD, we have not yet committed a fix for that COW bug because our draft fixes (like the one in https://github.com/golang/go/issues/34988#issuecomment-997000571) seem to have the side effect of suppressing the issue in this thread. I’m still hoping to hear from AMD with an idea about what could be going wrong in the 10ns window we observed but they haven’t gotten back to me yet. However, if we don’t hear anything before NetBSD 10.0 is ready we’ll probably just apply the unrelated-COW fix.

I tried the linked C program on AIX with a couple of modifications. I had to modify it to use libc fork(), since I am not sure AIX can fork without going through libc, so the test might not really be indicative in this case. It did not reproduce anything after running for five or so minutes.

FWIW, the NetBSD issue I filed is at http://gnats.netbsd.org/56535.

@bokunodev the issue you are describing is not related to this bug. You perhaps accidentally replied to this bug, or need to file a new issue.

spoke too soon:

package main

import "os/exec"

func main() {
	loop := func() {
		for {
			err := exec.Command("/usr/bin/true").Run()
			if err != nil {
				panic(err)
			}
		}
	}
	for i := 0; i < 100; i++ {
		go loop()
	}
	select {}
}

https://gist.githubusercontent.com/jrick/a071767cde2d2d71b210135cf8282b04/raw/6fcd814e5a93a6a1d204c2d00b0a1f4195664d61/gistfile1.txt