go: runtime: broken work conservation due to CL 310850
##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
SIGQUIT: quit
PC=0x4747c1 m=0 sigcode=0
goroutine 0 [idle]:
runtime.futex()
/workdir/go/src/runtime/sys_linux_amd64.s:519 +0x21 fp=0x7ffffcc5a8e8 sp=0x7ffffcc5a8e0 pc=0x4747c1
runtime.futexsleep(0xc000024000, 0x7f6ae0, 0xc000036800)
/workdir/go/src/runtime/os_linux.go:44 +0x36 fp=0x7ffffcc5a938 sp=0x7ffffcc5a8e8 pc=0x435776
runtime.notesleep(0x7f6c30)
/workdir/go/src/runtime/lock_futex.go:160 +0x87 fp=0x7ffffcc5a970 sp=0x7ffffcc5a938 pc=0x40da47
runtime.mPark()
/workdir/go/src/runtime/proc.go:1441 +0x2a fp=0x7ffffcc5a990 sp=0x7ffffcc5a970 pc=0x43ef2a
runtime.stoplockedm()
/workdir/go/src/runtime/proc.go:2600 +0x65 fp=0x7ffffcc5a9e0 sp=0x7ffffcc5a990 pc=0x440b85
runtime.schedule()
/workdir/go/src/runtime/proc.go:3287 +0x3d fp=0x7ffffcc5aa30 sp=0x7ffffcc5a9e0 pc=0x44267d
runtime.park_m(0xc0000001a0)
/workdir/go/src/runtime/proc.go:3504 +0x14d fp=0x7ffffcc5aa60 sp=0x7ffffcc5aa30 pc=0x442dcd
runtime.mcall()
/workdir/go/src/runtime/asm_amd64.s:307 +0x43 fp=0x7ffffcc5aa70 sp=0x7ffffcc5aa60 pc=0x470783
…
goroutine 23533 [running]:
goroutine running on other thread; stack unavailable
created by runtime_test.testGoroutineParallelism2
/workdir/go/src/runtime/proc_test.go:174 +0x325
goroutine 23532 [runnable]:
runtime_test.testGoroutineParallelism2.func1()
/workdir/go/src/runtime/proc_test.go:174 fp=0xc00016cfe0 sp=0xc00016cfd8 pc=0x5aa140
runtime.goexit()
/workdir/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00016cfe8 sp=0xc00016cfe0 pc=0x4729c1
created by runtime_test.testGoroutineParallelism2
/workdir/go/src/runtime/proc_test.go:174 +0x325
goroutine 23531 [running]:
goroutine running on other thread; stack unavailable
created by runtime_test.testGoroutineParallelism2
/workdir/go/src/runtime/proc_test.go:174 +0x325
goroutine 23511 [runnable]:
runtime.gopark(0xc0000623b8, 0x0, 0x0, 0x0, 0x0)
/workdir/go/src/runtime/proc.go:366 +0xd6 fp=0xc000474ee8 sp=0xc000474ec8 pc=0x43c5d6
runtime.chansend(0x272, 0x657a90, 0x1, 0x0)
/workdir/go/src/runtime/chan.go:257 +0x435 fp=0xc000474f78 sp=0xc000474ee8 pc=0x407435
runtime.chansend1(0x0, 0x0)
/workdir/go/src/runtime/chan.go:143 +0x1d fp=0xc000474fa8 sp=0xc000474f78 pc=0x406fdd
runtime_test.testGoroutineParallelism2.func2.1(0xce)
/workdir/go/src/runtime/proc_test.go:212 +0x5a fp=0xc000474fc8 sp=0xc000474fa8 pc=0x5aa0ba
runtime_test.testGoroutineParallelism2.func2·dwrap·87()
/workdir/go/src/runtime/proc_test.go:213 +0x2d fp=0xc000474fe0 sp=0xc000474fc8 pc=0x5aa02d
runtime.goexit()
/workdir/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000474fe8 sp=0xc000474fe0 pc=0x4729c1
created by runtime_test.testGoroutineParallelism2.func2
/workdir/go/src/runtime/proc_test.go:205 +0x46
goroutine 23505 [chan receive]:
runtime.gopark(0x40f25b, 0xc0003680c0, 0x20, 0x0, 0x7fd204238c40)
/workdir/go/src/runtime/proc.go:366 +0xd6 fp=0xc0000afd20 sp=0xc0000afd00 pc=0x43c5d6
runtime.chanrecv(0x0, 0x0, 0x1)
/workdir/go/src/runtime/chan.go:576 +0x54c fp=0xc0000afdb8 sp=0xc0000afd20 pc=0x4082cc
runtime.chanrecv1(0xffffffff, 0xc000051120)
/workdir/go/src/runtime/chan.go:439 +0x18 fp=0xc0000afde0 sp=0xc0000afdb8 pc=0x407d18
runtime_test.testGoroutineParallelism2(0xbf0b36aee8, 0x1, 0x0)
/workdir/go/src/runtime/proc_test.go:183 +0x20c fp=0xc0000aff48 sp=0xc0000afde0 pc=0x5a9b0c
runtime_test.TestGoroutineParallelism2(0x406fdd)
/workdir/go/src/runtime/proc_test.go:146 +0x25 fp=0xc0000aff70 sp=0xc0000aff48 pc=0x5a98a5
testing.tRunner(0xc0002201a0, 0x6593d0)
/workdir/go/src/testing/testing.go:1242 +0x102 fp=0xc0000affc0 sp=0xc0000aff70 pc=0x4dff02
testing.(*T).Run·dwrap·21()
/workdir/go/src/testing/testing.go:1289 +0x2a fp=0xc0000affe0 sp=0xc0000affc0 pc=0x4e0c4a
runtime.goexit()
/workdir/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000affe8 sp=0xc0000affe0 pc=0x4729c1
created by testing.(*T).Run
/workdir/go/src/testing/testing.go:1289 +0x365
…
*** Test killed with quit: ran too long (6m0s).
FAIL runtime 360.133s
FAIL
2021-04-30T00:32:42-303b194/linux-amd64-noopt
2020-06-26T17:07:58-d1015f3/freebsd-386-11_2
Also seen on a recent TryBot: https://storage.googleapis.com/go-build-log/eb7ebec0/freebsd-amd64-12_2_913d0c87.log
About this issue
- State: closed
- Created 3 years ago
- Comments: 41 (38 by maintainers)
Commits related to this issue
- runtime: call runtime.GC in several tests that disable GC These tests disable GC because of the potential for a deadlock, but don't consider that a GC could be in progress due to other tests. The lik... — committed to golang/go by mknyszek 3 years ago
- runtime: fix comments on the behavior of SetGCPercent Fixes for #49680, #49695, #45867, and #49370 all assumed that SetGCPercent(-1) doesn't block until the GC's mark phase is done, but it actually d... — committed to golang/go by mknyszek 3 years ago
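The first commit above describes a pattern worth spelling out: a test that disables the GC can still be bitten by a collection that an earlier test already started, so it should finish that cycle before disabling GC. A minimal, hypothetical sketch of the pattern (the test name and body are illustrative, not taken from the Go tree):

```go
// Hedged sketch of the pattern described in the first commit above.
// The test name and body are hypothetical.
package example_test

import (
	"runtime"
	"runtime/debug"
	"testing"
)

func TestNoGCDuringBody(t *testing.T) {
	// A GC started by an earlier test may still be in progress; wait for
	// it to finish before disabling GC, otherwise the body below could
	// deadlock against the in-flight cycle.
	runtime.GC()

	// Disable automatic GC for the duration of the test and restore the
	// previous setting on exit.
	defer debug.SetGCPercent(debug.SetGCPercent(-1))

	// ... GC-sensitive test body would go here ...
}
```

The second commit revisits the assumption about whether SetGCPercent(-1) blocks until the current mark phase is done; the sketch above is conservative either way, since runtime.GC blocks until a full collection completes.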
@aclements pointed out that “[M 4] drops P 1 … and goes directly into blocking netpoll.” is new behavior in https://go.dev/cl/310850, which was submitted just before the first of these test failures.
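For readers less familiar with the scheduler, “work conservation” here means that a runnable goroutine should not sit in a run queue while a P is idle. A heavily simplified, hypothetical sketch of the window being described (names and structure are illustrative only, not the actual proc.go code):

```go
// Hypothetical sketch of the window @aclements describes above; it is not
// the real runtime/proc.go code.
package sketch

// Stand-ins for the real operations.
func haveRunnableWork() bool { return false }
func releaseP()              {}
func blockingNetpoll()       {}

// Work conservation requires that after an M releases its P, it re-check the
// run queues (and reacquire a P if work appeared) before it blocks
// indefinitely. Per the comment above, going straight from dropping the P
// into a blocking netpoll is new in CL 310850; a goroutine readied in the
// marked window can then be stranded in a run queue with no one to run it.
func findRunnableSketch() {
	if !haveRunnableWork() {
		releaseP()
		// <-- window: another thread readies a goroutine here and finds
		//     no idle M to wake, because this M is committed to netpoll.
		blockingNetpoll() // blocks while runnable work sits in a queue
	}
}
```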
Correct, the test should no longer fail.
To be clear, this isn’t quite “only a performance issue”. The failing tests were deadlocked and would never recover. You are correct that “requires a pretty specific and subtle setup to trigger.”
I think dropping this as a release blocker is OK, as this issue is new in 1.17, not 1.18. We may still want to backport the eventual fix to both 1.17 and 1.18.
@prattmic , CL 380058 fixes the test problem, right? If it’s just the work conservation problem remaining, I think we should punt this to 1.19 because that’s only a performance issue that requires a pretty specific and subtle setup to trigger.
Interesting. That explains why, when I tried to run that test in isolation, I never reproduced the failure…
#50749 explains why async preemption did not kick in to allow forward progress. Disabling TestFutexsleep makes this failure disappear. But we still have a work conservation bug here. A few ideas I’ve brainstormed:
To summarize, what I see in the trace above is:
I have not been able to reproduce this yet, but I did make an interesting observation: even though it passes, this test takes a very long time on freebsd-arm64-dmgk.
linux-amd64: …
linux-arm64: …
freebsd-arm64: …
That is a rather extreme delay, which I will be investigating further.
ppc64 and ppc64le, to my understanding, do indeed have a loose memory model.
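For illustration only (this is generic Go, not code from this issue): on a weakly ordered architecture such as ppc64/ppc64le, a plain store to a flag can become visible before an earlier store to the data it guards. The usual fix is the sync/atomic message-passing pattern below, which establishes the happens-before edge the Go memory model requires, so a failure that depends on such reordering may only surface on these machines.

```go
// Generic illustration of weak memory ordering, unrelated to the runtime
// internals in this issue. With plain loads and stores the reader could, on
// a weakly ordered machine, observe ready == 1 while still seeing the old
// value of data; the atomic store/load pair rules that out.
package main

import (
	"fmt"
	"sync/atomic"
)

var (
	data  int
	ready uint32
)

func writer() {
	data = 42
	atomic.StoreUint32(&ready, 1) // publishes data to readers
}

func reader(done chan<- struct{}) {
	for atomic.LoadUint32(&ready) == 0 {
		// spin until the writer publishes
	}
	fmt.Println(data) // prints 42, even on ppc64/ppc64le
	close(done)
}

func main() {
	done := make(chan struct{})
	go reader(done)
	writer()
	<-done
}
```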
Hmmm… The Dec 7th fix was intended to target a problem introduced by the lower minimum heap size (which is now disabled across the board), but upon further investigation it’s clear that we’ve been seeing failures of this form since August.
That being said, I suppose it is possible that somehow this extra call to runtime.GC is enough to mask (or fix) the issue entirely on strong memory architectures. You do make a good point that there aren’t any amd64 failures in recent memory. I’ll take 2 of the 5 ppc64 buildlet builders and give it a try again instead of hammering on linux/amd64.