go: cmd/compile: compiler built with PGO crashes occasionally on ppc64{le}
#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22" && date < "2023-06-01"
https://go.dev/cl/495596 added a default.pgo profile for cmd/compile, enabling a PGO build of the compiler (as long as -pgo=none is not set explicitly).
This caused a variety of crashes in the compiler on ppc64{le} builders:
- 2023-05-18T16:55:07-88f89d8/linux-ppc64le-power9osu
- 2023-05-18T13:41:27-33a601b/aix-ppc64
- 2023-05-18T12:52:14-7b0835d/aix-ppc64
- 2023-05-18T10:23:17-75add1c/aix-ppc64
- 2023-05-18T09:16:07-774f602/linux-ppc64-sid-power10
- 2023-05-18T09:16:07-774f602/linux-ppc64le-buildlet
- 2023-05-18T09:15:25-27906bb/aix-ppc64
- 2023-05-18T09:15:25-27906bb/linux-ppc64-sid-power10
- 2023-05-18T01:40:37-6ed8474/linux-ppc64-sid-buildlet
- 2023-05-18T01:40:37-6ed8474/linux-ppc64-sid-power10
- 2023-05-18T00:35:53-956d31e/linux-ppc64le-buildlet
- 2023-05-17T22:11:31-0b86a04/linux-ppc64le-buildlet
- 2023-05-17T21:53:11-c426c87/linux-ppc64-sid-buildlet
- 2023-05-17T21:53:11-c426c87/linux-ppc64le-power10osu
- 2023-05-17T21:44:30-2693ade/linux-ppc64-sid-power10
That CL also caused #60263. Since it caused several issues, the CL was reverted in https://go.dev/cl/496185. #60263 has since been fixed.
Given these failures are all on ppc64{le}, @dr2chase and I suspect that they are due to a bad ppc64-specific optimization (SSA rule, e.g.) that is tickled by the additional inlining caused by PGO.
I have had some success reproducing these crashes. Running all.bash in a loop on three linux-ppc64-sid-power10 builders concurrently with GOGC=5 usually gets me a failure in <30 minutes. Not stellar, but I think workable.
We should then be able to bisect down to a bad function with GOSSAHASH applied in inlineCostOK to enable/disable PGO-based inlining. (Also set -d=pgodevirtualize=0 to disable PGO-based devirtualization, which was submitted in https://go.dev/cl/492436, after the bad CL above was reverted).
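The idea behind hash-based bisection can be sketched as follows. This is an illustration of the technique only, not the compiler's implementation: the SHA-1 hash, the helper names (hashOf, matches, bisect), the function names in main, and the single-culprit assumption are all choices made for this example. A change is enabled only for functions whose name hash ends in a given bit pattern, and the pattern is lengthened one bit at a time until a single function remains.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// hashOf maps a function name to 64 bits. The hash choice is an
// assumption for this sketch; the real compiler may hash differently.
func hashOf(name string) uint64 {
	sum := sha1.Sum([]byte(name))
	return binary.BigEndian.Uint64(sum[len(sum)-8:])
}

// matches reports whether the low bits of name's hash spell pattern,
// a string of '0' and '1'. The empty pattern matches every name.
func matches(name, pattern string) bool {
	h := hashOf(name)
	for i := len(pattern) - 1; i >= 0; i-- {
		if pattern[i] != byte('0'+h&1) {
			return false
		}
		h >>= 1
	}
	return true
}

// count returns how many names match pattern.
func count(names []string, pattern string) int {
	n := 0
	for _, name := range names {
		if matches(name, pattern) {
			n++
		}
	}
	return n
}

// bisect extends the pattern by one leading bit per step until at most
// one name matches. fails(pattern) must report whether the build still
// misbehaves when the change is enabled only for matching functions.
// It assumes exactly one culprit among names.
func bisect(names []string, fails func(pattern string) bool) string {
	pattern := ""
	for len(pattern) < 64 && count(names, pattern) > 1 {
		if fails("0" + pattern) {
			pattern = "0" + pattern
		} else {
			pattern = "1" + pattern
		}
	}
	return pattern
}

func main() {
	// Hypothetical function names; culprit stands in for the bad inline.
	names := []string{"pkg.a", "pkg.b", "pkg.c", "pkg.d", "pkg.e"}
	culprit := "pkg.d"
	fails := func(p string) bool { return matches(culprit, p) }

	p := bisect(names, fails)
	for _, name := range names {
		if matches(name, p) {
			fmt.Printf("pattern %q isolates %s\n", p, name)
		}
	}
}
```

In a real bisection each call to fails would be a compiler rebuild plus a test run, so for a flaky failure like this one each step may need many repetitions before a pass can be trusted.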
Given that there is a path forward to debugging ppc64, and we’d like more soak time on the primary ports, I intend to resubmit https://go.dev/cl/495596, which will make ppc64 flaky until this issue is resolved. (GOARCH=ppc64{le} could also temporarily change the default of -pgo from auto to none if necessary).
About this issue
- State: closed
- Created a year ago
- Comments: 38 (18 by maintainers)
Commits related to this issue
- cmd/compile: build compiler with PGO Reapplies CL 495596, which was reverted at CL 496185. The x/tools failure, #60263, has been resolved. The ppc64 failures, #60368, have _not_ been resolved, but are... — committed to golang/go by cherrymui a year ago
- runtime: preserve R29 in the write barrier flush path on ppc64 Surprisingly, it usually survived the call to flush a write barrier. Usually. Fixes #60368 Change-Id: I4792a57738e5829c79baebae4d13b6... — committed to golangFame/go by dr2chase a year ago
If the commit time is after the issue closure time, watchflakes should reopen the issue, otherwise it just posts. So if the issue is not reopened we can assume the failures are old.
Change https://go.dev/cl/499679 mentions this issue: runtime: preserve R29 in the write barrier flush path on ppc64
@laboger PROGRESS!
I ran a binary search on the flake, and isolated it to a single PGO inline.
You can compare two versions of ssa.html as follows (full recipe):
First, check out the experiment CL, build a known-good (no PGO-inlining) compiler, and save it:
Then you can build a bad compiler, or a good compiler, with output to a nearby directory (…/bad or …/good):
Note that gossahash=10101010101010110101001011 matches nothing, but is quieter than n, which logs a lot about what it isn’t doing. When you build the bad compiler with the command line above, it should print:
I have not had a chance to dig into this yet; it seems likely that you might be able to make faster progress than me, unless something obvious leaps out at me.
We have done some experimenting with this. I found that if we build the default.pgo file natively, instead of using the one built on amd64, we are not able to get the problem to occur. When using the default.pgo file built on amd64, more inlining happens within cmd/compile/internal/ssagen.(*state).stmt, and according to the stack traces that seems to be where many of the problems occur.
I was going to attempt to force more aggressive inlining and see if that forces the problem.