go: runtime: scheduler change causes Delve's function call injection to fail intermittently

Starting with change ad943066f63c6945e92fa00c83c7cac6a78f793b (https://go.dev/cl/501976) I see frequent (but intermittent) failures of TestCallFunction of github.com/go-delve/delve/pkg/proc. The failure has a few different presentations, a frequent one is this one:

    variables_test.go:1353: call "getAStruct(3).VRcvr(1)"
2023-08-03T10:27:16+02:00 info kind=fncall,layer=proc function call initiated <nil> frame size 65535 goroutine 1 (thread 550064)
2023-08-03T10:27:16+02:00 info kind=fncall,layer=proc findCallInjectionStateForThread thread=550069 goroutine=59
    variables_test.go:1367: call "getAStruct(3).VRcvr(1)": error "could not recover call injection state for goroutine 59 (thread 550069)"
--- FAIL: TestCallFunction (0.53s)

Produced passing -v -log=fncall to TestCallFunction.

Delve keeps a map from goroutine IDs to call injection states, wherever it sees a SIGINT in one of the debugCall functions (runtime.debugCallV2, debugCall32, &c) it reads the goroutine id and searches the map for the corresponding call injection state. If none is found it looks for a call injection state that has just been started on the same thread.

The assumption here is that when the runtime spawns a new goroutine to handle the call injection it will use the same thread to run it, initially. This was agreed upon in https://go-review.googlesource.com/c/go/+/229299:

More generally, though, it’s guaranteed to keep the whole call injection protocol on one OS thread and it won’t let anything else happen on that thread during the protocol, so if you know what G you’re on when you start the injection and associate that with the OS thread, you can check which G it’s on when it reaches debugCallNNNN and you know they’re tied together.

The logs above suggest that this invariant is being violated, the injection starts on thread 550064 goroutine 1 but the first stop is on thread 550069 goroutine 59.

I’m not 100% sure that the bug was introduced by this commit, I think I saw a similar failure in CI before but I could never reproduce it locally (or see it frequently enough in CI). It could be that the bug already existed and this change made it orders of magnitude more likely.

cc @prattmic @mknyszek @aclements

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 20 (17 by maintainers)

Commits related to this issue

Most upvoted comments

@mknyszek I haven’t tried patchset 3 yet but if it’s equivalent to patchset 2 with lockOSThread(); defer unlockOSThread() added at the start of debugCallWrap then it works. I’ll try it tomorrow.