go: cmd/compile: compiler built with PGO crashes occasionally on ppc64{le}

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22" && date < "2023-06-01"

https://go.dev/cl/495596 added a default.pgo profile for cmd/compile, enabling a PGO build of the compiler (as long as -pgo=none is not set explicitly).

This caused a variety of crashes in the compiler on ppc64{le} builders:

2023-05-18T16:55:07-88f89d8/linux-ppc64le-power9osu
2023-05-18T13:41:27-33a601b/aix-ppc64
2023-05-18T12:52:14-7b0835d/aix-ppc64
2023-05-18T10:23:17-75add1c/aix-ppc64
2023-05-18T09:16:07-774f602/linux-ppc64-sid-power10
2023-05-18T09:16:07-774f602/linux-ppc64le-buildlet
2023-05-18T09:15:25-27906bb/aix-ppc64
2023-05-18T09:15:25-27906bb/linux-ppc64-sid-power10
2023-05-18T01:40:37-6ed8474/linux-ppc64-sid-buildlet
2023-05-18T01:40:37-6ed8474/linux-ppc64-sid-power10
2023-05-18T00:35:53-956d31e/linux-ppc64le-buildlet
2023-05-17T22:11:31-0b86a04/linux-ppc64le-buildlet
2023-05-17T21:53:11-c426c87/linux-ppc64-sid-buildlet
2023-05-17T21:53:11-c426c87/linux-ppc64le-power10osu
2023-05-17T21:44:30-2693ade/linux-ppc64-sid-power10

That CL also caused #60263. Since it caused several issues, the CL was reverted in https://go.dev/cl/496185. #60263 has since been fixed.

Given that these failures are all on ppc64{le}, @dr2chase and I suspect that they are due to a bad ppc64-specific optimization (an SSA rule, for example) that is tickled by the additional inlining caused by PGO.

I have had some success reproducing these crashes. Running all.bash in a loop on three linux-ppc64-sid-power10 builders concurrently with GOGC=5 usually gets me a failure in <30 minutes. Not stellar, but I think workable.

We should then be able to bisect down to a bad function with GOSSAHASH applied in inlineCostOK to enable/disable PGO-based inlining. (Also set -d=pgodevirtualize=0 to disable PGO-based devirtualization, which was submitted in https://go.dev/cl/492436, after the bad CL above was reverted).
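To illustrate the idea behind hash-based bisection, here is a simplified sketch. It is not the compiler's implementation: FNV-1a stands in for whatever hash the compiler uses, the names hashBits and bisect are hypothetical, and it assumes a single culprit whose presence alone makes the build fail.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// hashBits returns a 64-bit hash of name as a binary string.
// FNV-1a is a stand-in here, not the hash the compiler uses.
func hashBits(name string) string {
	h := fnv.New64a()
	h.Write([]byte(name))
	return fmt.Sprintf("%064b", h.Sum64())
}

// bisect grows a hash suffix one bit at a time, keeping the half of the
// candidates that still fails. fails reports whether the build is bad when
// the optimization is enabled for exactly the given names.
func bisect(names []string, fails func(enabled []string) bool) []string {
	suffix := ""
	cands := names
	for len(cands) > 1 && len(suffix) < 64 {
		var next []string
		for _, bit := range []string{"0", "1"} {
			try := bit + suffix
			var sub []string
			for _, n := range cands {
				if strings.HasSuffix(hashBits(n), try) {
					sub = append(sub, n)
				}
			}
			if len(sub) > 0 && fails(sub) {
				suffix, next = try, sub
				break
			}
		}
		if next == nil {
			break // neither half fails alone: more than one site is needed
		}
		cands = next
	}
	return cands
}

func main() {
	names := []string{"pkg.a", "pkg.b", "pkg.c", "pkg.d", "pkg.e"}
	const culprit = "pkg.d" // hypothetical bad inline site
	fails := func(enabled []string) bool {
		for _, n := range enabled {
			if n == culprit {
				return true
			}
		}
		return false
	}
	fmt.Println(bisect(names, fails))
}
```

With a flaky failure like this one, each call to fails would in practice have to repeat the stress run enough times to be confident that a passing result is not a false negative.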

Given that there is a path forward to debugging ppc64, and we’d like more soak time on the primary ports, I intend to resubmit https://go.dev/cl/495596, which will make ppc64 flaky until this issue is resolved. (GOARCH=ppc64{le} could also temporarily change the default of -pgo from auto to none if necessary).

cc @golang/ppc64 @dr2chase @aclements @cherrymui

About this issue

  • State: closed
  • Created a year ago
  • Comments: 38 (18 by maintainers)

Most upvoted comments

If the commit time is after the issue closure time, watchflakes should reopen the issue; otherwise it just posts. So if the issue is not reopened, we can assume the failures are old.

Change https://go.dev/cl/499679 mentions this issue: runtime: preserve R29 in the write barrier flush path on ppc64

@laboger PROGRESS!

I ran a binary search on the flake, and isolated it to a single PGO inline.

You can compare two versions of ssa.html with the following full recipe:

First, check out the experiment CL, build a known-good (no PGO-inlining) compiler, and save it:

cd whatever/go/src
git fetch https://go.googlesource.com/go refs/changes/57/497557/1 && git checkout -b change-497557 FETCH_HEAD
GOCOMPILEDEBUG=gossahash=10101010101010110101001011 ./make.bash
go install golang.org/x/tools/cmd/toolstash@latest
toolstash save

Then you can build a bad compiler, or a good compiler, with output to a nearby directory (…/bad or …/good):

GOSSADIR=`pwd`/../bad  GOSSAFUNC='cmd/compile/internal/ssa.(*expandState).rewriteSelect' GOCOMPILEDEBUG=gossahash=v1100101001011 toolstash go install cmd/compile

GOSSADIR=`pwd`/../good  GOSSAFUNC='cmd/compile/internal/ssa.(*expandState).rewriteSelect' GOCOMPILEDEBUG=gossahash=10101010101010110101001011 toolstash go install cmd/compile

Note that gossahash=10101010101010110101001011 matches nothing, but is quieter than gossahash=n, which logs a lot about what it isn't doing. When you build the bad compiler with the command line above, it should print:

# cmd/compile/internal/ssa
cmd/compile/internal/ssa/expand_calls.go:550:22 [bisect-match 0xa6214988bb18594b]
gossahash triggered cmd/compile/internal/ssa/expand_calls.go:550:22 000110000101100101001011
dumped SSA to ../bad/cmd/compile/internal/ssa.(*expandState).rewriteSelect.html
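The output above is consistent with gossahash patterns being matched against the low bits of the site's hash. The sketch below checks that against the values shown in this thread. It assumes the hex value on the bisect-match line is the full 64-bit hash and that matching is a plain binary-suffix comparison; the helper name matches is hypothetical and this is not the compiler's code.

```go
package main

import (
	"fmt"
	"strings"
)

// matches reports whether the 64-bit binary form of hash ends with pattern.
func matches(hash uint64, pattern string) bool {
	return strings.HasSuffix(fmt.Sprintf("%064b", hash), pattern)
}

func main() {
	const h uint64 = 0xa6214988bb18594b // from the bisect-match line above

	// Pattern used for the bad build (without the leading "v").
	fmt.Println(matches(h, "1100101001011"))
	// Pattern used for the known-good build.
	fmt.Println(matches(h, "10101010101010110101001011"))
}
```

This prints true and then false: the 13-bit pattern selects the inline site at expand_calls.go:550:22, while the 26-bit pattern does not.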

I have not had a chance to dig into this yet. It seems likely that you will be able to make faster progress than I can, unless something obvious leaps out at me.

We have done some experimenting with this. I found that if we build the default.pgo file natively instead of using the one built on amd64, we are not able to get the problem to occur. When using the default.pgo file built on amd64, more inlining happens within cmd/compile/internal/ssagen.(*state).stmt, and according to the stack traces, that seems to be where many of the problems occur.

I was going to attempt to force more aggressive inlining and see if that triggers the problem.