go: runtime,cmd/compile: `exit status 0xc0000374` (`STATUS_HEAP_CORRUPTION`) on windows-amd64-longtest

#!watchflakes
post <- builder ~ `windows` && `0xc0000374`
XXXBANNERXXX:Test execution environment.
# GOARCH: amd64
# CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
# GOOS: windows
# OS Version: 10.0.14393
go tool compile: exit status 0xc0000374

go tool dist: FAILED: go list -f={{if .Stale}}	STALE {{.ImportPath}}: {{.StaleReason}}{{end}} std: exit status 1

According to https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55, this exit code means:

0xC0000374 STATUS_HEAP_CORRUPTION A heap has been corrupted.
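
For anyone scripting around these logs, here is a minimal sketch of recognizing this status from a process exit code. The names `statusHeapCorruption` and `isHeapCorruption` are hypothetical helpers for illustration, not part of the Go tree. The only fact taken from the documentation above is the value 0xC0000374.

```go
package main

import "fmt"

// statusHeapCorruption is the NTSTATUS value that the Microsoft
// documentation linked above lists as STATUS_HEAP_CORRUPTION.
const statusHeapCorruption uint32 = 0xC0000374

// isHeapCorruption reports whether a 32-bit process exit code equals
// STATUS_HEAP_CORRUPTION. (Hypothetical helper, for illustration only.)
func isHeapCorruption(exitCode uint32) bool {
	return exitCode == statusHeapCorruption
}

func main() {
	// Print the classification for a few sample exit codes.
	for _, code := range []uint32{0, 1, 0xC0000374} {
		fmt.Printf("exit status %#x: heap corruption = %v\n", code, isHeapCorruption(code))
	}
}
```

With os/exec, the code would typically come from `(*exec.ExitError).ExitCode()`, converted with `uint32(...)` before the comparison.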


greplogs --dashboard -md -l -e \(\?ms\)\\Awindows-.\*0xc0000374

2022-04-27T14:23:28-f0c0e0f/windows-amd64-longtest

Since this has only been seen once, leaving on the backlog to see whether this is a recurring pattern or a one-off fluke. (CC @golang/runtime)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 38 (35 by maintainers)

Most upvoted comments

I wrote a simple program that ran $GOROOT/pkg/tool/$GOOS_$GOARCH/compile -V=full four times in parallel. I ran that program 100,000 times on a gomote, while running go test std cmd in a loop in a separate terminal. I never saw the STATUS_HEAP_CORRUPTION failure.
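
The stress program itself was not posted, so the following is only a rough sketch of what such a program might look like. The helper `runParallel`, the `rounds` constant, and taking the compiler path as a command-line argument are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sync"
)

// runParallel starts n copies of the command at once, waits for all of
// them, and returns one error slot per copy (nil where the copy succeeded).
func runParallel(path string, args []string, n int) []error {
	errs := make([]error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out, err := exec.Command(path, args...).CombinedOutput()
			if err != nil {
				// Keep the underlying error (and its exit status) inspectable.
				errs[i] = fmt.Errorf("%w: %s", err, out)
			}
		}(i)
	}
	wg.Wait()
	return errs
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: stress /path/to/compile")
		return
	}
	// Assumption: loop inside the program instead of rerunning it externally.
	const rounds = 1000
	for r := 0; r < rounds; r++ {
		for _, err := range runParallel(os.Args[1], []string{"-V=full"}, 4) {
			if err != nil {
				fmt.Printf("round %d: %v\n", r, err)
			}
		}
	}
}
```

Each round launches four compiler processes at the same instant, which is the overlap the experiment was trying to provoke. Any failing copy is reported with its exit status.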

No idea what is happening here.

OK, I see where the go tool compile error message is coming from: (*Builder).toolID in cmd/go/internal/work/buildid.go. That code is invoked as, among other things, b.toolID("compile"). It runs the compiler binary directly; it does not run go tool compile. However, if the compiler fails, it prints a message as go tool compile: followed by the error message.

This is invoked while running the go list command. I’m fairly confident that this is why we are seeing these error messages.

The toolID method will always run the compiler with -V=full. So it appears that, very occasionally, running the compiler with -V=full causes it to exit with STATUS_HEAP_CORRUPTION.

One thing I haven’t tried is testing at exactly one of the commits that previously failed. To that end, I’ll test at f0c0e0f255c59c8ee6e463103d0b8491b8f9b1af (commit from the 2022-04-27 failure).

I’ve instrumented checkNotStale and you are right that we don’t run it very often in standard all.bash (once per ##### test block). With sharding, I believe it should be running every few packages. So I can try increasing the number of staleness checks. That said, by my back-of-the-envelope calculations I think I’ve done ~5000 all.bash runs, so I’ve still run the staleness check quite a bit. (I have 578 other windows test failure logs sitting in /tmp!)

Three days of continuous testing on 25 windows gomotes has gotten me zero of these failures, so I suspect I am missing some required component of the failure.