go: runtime: system stack and heap corruption when interacting with cgo on Windows

What version of Go are you using (go version)?

$ go version
go version go1.20.2 windows/amd64

I also tried this with 1.8, 1.12, and 1.18

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\Stephen\AppData\Local\go-build
set GOENV=C:\Users\Stephen\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\msys64\mingw64\pkg\mod
set GONOPROXY=github.com/vkngwrapper/arsenal
set GONOSUMDB=github.com/vkngwrapper/arsenal
set GOOS=windows
set GOPATH=C:/msys64/mingw64
set GOPRIVATE=github.com/vkngwrapper/arsenal
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.20.2
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=C:\Users\Stephen\projects\cgotest\go.mod
set GOWORK=
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=C:\msys64\tmp\go-build1601121073=/tmp/go-build -gno-record-gcc-switches

What did you do?

Ran this program a few times in a row: https://github.com/CannibalVox/heapcorruptrepro

What did you expect to see?

Successful completion

What did you see instead?

It often succeeds, but it fails roughly a third to half of the time on Windows, and never on any other operating system I tried. The failure mode varies from run to run. Often, I’ll see crashes on exit when attempting to run the exit syscall because the system stack has been corrupted. Sometimes the program will exit prematurely with exit code 0xc0000374 (corrupted heap). At other times, I will see access violation panics when calling C functions.
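When running the program repeatedly, the failure modes above can be roughly separated by process exit code. The following is only a sketch: `classifyExit` and `tally` are hypothetical helper names, not part of the linked repro. It assumes that 0xC0000374 is the Windows status for heap corruption (as reported above) and that an unrecovered Go panic or fatal error exits with status 2.

```go
package main

import "fmt"

// Exit codes of interest. statusHeapCorruption is the Windows
// STATUS_HEAP_CORRUPTION value mentioned in the report; goPanicExit is the
// status the Go runtime uses when it dies from a panic or fatal error.
const (
	statusHeapCorruption uint32 = 0xC0000374
	goPanicExit          uint32 = 2
)

// classifyExit (hypothetical helper) maps a process exit code to a label.
func classifyExit(code uint32) string {
	switch code {
	case 0:
		return "success"
	case statusHeapCorruption:
		return "heap corruption"
	case goPanicExit:
		return "go panic or fatal error"
	default:
		return fmt.Sprintf("other (0x%08x)", code)
	}
}

// tally (hypothetical helper) counts how often each outcome occurred
// across a series of runs.
func tally(codes []uint32) map[string]int {
	counts := make(map[string]int)
	for _, c := range codes {
		counts[classifyExit(c)]++
	}
	return counts
}
```

Feeding `tally` the exit codes from, say, twenty consecutive runs gives a quick picture of how often each kind of failure shows up.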

Because Vulkan is involved (and I can’t tell exactly which aspect of Vulkan triggers the issue, so I can’t reproduce it with a different library), it’s easy to point the finger at Vulkan. However, I do not believe Vulkan per se is responsible:

  • The Vulkan loader code is very similar between Linux and Windows, and we cannot reproduce this on Linux.
  • On Windows, this reproduces with both AMD and NVIDIA drivers.
  • In testing, reproducing this issue depends very deeply on Go behaviors that Vulkan has no knowledge of.

This issue will only repro if the following four things all happen on the same goroutine. Moving any of them elsewhere or doing them in a different order works properly.

  1. Create Vulkan structures (instance and device) via cgo.
  2. Spin up some goroutines to do CPU-bound work. If this is done before step 1, the issue will not repro.
  3. Wait for the goroutines to complete in a way that forces a change of OS thread. If we force a change of OS thread without spinning up goroutines (for instance, using time.Sleep) or wait for the goroutines without changing the OS thread (by using runtime.LockOSThread), this issue will not repro.
  4. Destroy the Vulkan structures via cgo.
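The shape of this sequence can be sketched in plain Go. This is not the code from the linked repro: `createResources` and `destroyResources` are hypothetical pure-Go placeholders standing in for the cgo calls that create and destroy the Vulkan instance and device, and `burn` is a made-up CPU-bound task. Blocking in `wg.Wait()` gives the scheduler an opportunity to resume the goroutine on a different OS thread, but does not guarantee it.

```go
package main

import "sync"

// resources is a hypothetical stand-in for the Vulkan instance and device.
type resources struct{ alive bool }

// createResources and destroyResources are placeholders for the cgo calls.
func createResources() *resources    { return &resources{alive: true} }
func destroyResources(r *resources) { r.alive = false }

// burn performs some CPU-bound work and returns a checksum.
func burn(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// reproSequence performs the four steps in order on the calling goroutine.
func reproSequence(workers int) (*resources, int) {
	r := createResources() // step 1: create via (placeholder) cgo call

	var wg sync.WaitGroup
	results := make([]int, workers)
	for i := 0; i < workers; i++ { // step 2: CPU-bound goroutines
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = burn(1000000)
		}(i)
	}

	// step 3: block until the workers finish; the goroutine may be
	// rescheduled onto another OS thread when it wakes up.
	wg.Wait()

	destroyResources(r) // step 4: destroy via (placeholder) cgo call

	total := 0
	for _, v := range results {
		total += v
	}
	return r, total
}
```

Calling `reproSequence(8)` from `main` walks through all four steps; with real cgo calls in place of the placeholders, this is the ordering that the report describes as failing on Windows.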

This is not simply a case of Vulkan using thread context (for one thing, it doesn’t). We can perform create and destroy operations on any arbitrary goroutines all day long if we want to, as long as we don’t follow the above steps to the letter. Likewise, we can do the above on Linux without difficulty.

Here are the 5 scenarios that I was able to try:

  • SUCCEED - Ubuntu 22.04.2 LTS - GeForce RTX 4090 v525.105
  • SUCCEED - Ubuntu 22.10 - Intel® UHD Graphics v22.2.5
  • SUCCEED - Ubuntu 22.10 - GeForce RTX 3070 v525.105.17
  • FAIL - Windows 10 - GeForce RTX 3070 v531.61
  • FAIL - Windows 10 - Radeon 6800M v21.20.01.24

It’s difficult to reproduce (given that I can only figure out how to get Vulkan to trigger it), but I believe that this is an issue with the Go runtime. Vulkan would not be able to tell the difference between one goroutine on thread A creating objects and another goroutine on thread B destroying them, versus a single goroutine creating objects, switching to thread B, and then destroying them. And it certainly can’t tell whether Go has spun up goroutines performing unrelated tasks between the two points. I’m concerned that this may indicate deeper issues with cgo on Windows in the Go runtime.

About this issue

  • State: open
  • Created a year ago
  • Comments: 37 (13 by maintainers)

Most upvoted comments

In triage, we think that trying to reproduce with LockOSThread would give us some useful information, even though in theory it shouldn’t be necessary. Also might be helpful to try and reproduce with the GC disabled (GOGC=off) to see what happens.

It cannot be reproduced with LockOSThread, as I mention in the bug description, and I turn the GC off in the linked code using debug.SetGCPercent(-1). A running GC will also repro this, by the way, since it spins up goroutines and waits on them; that’s what originally caused the problem and led me to start investigating this.

Doh, apologies.