go: net/http/pprof: TestDeltaProfile failures missing mutexHog2 on ARM architectures

--- FAIL: TestDeltaProfile (53.30s)
    pprof_test.go:206: want mutexHog2 but no mutexHog1 in the profile, and non-zero p.DurationNanos, got PeriodType: contentions count
        Period: 1
        Time: 2021-12-16 19:51:59.42219359 +1100 AEDT
        Duration: 32.041535781s
        Samples:
        contentions/count delay/nanoseconds
        Locations
        Mappings
        1: 0x0/0x0/0x0   [FN]
    pprof_test.go:214: want both mutexHog1 and mutexHog2 in the profile, got PeriodType: contentions count
        Period: 1
        Time: 2021-12-16 19:51:59.43107526 +1100 AEDT
        Samples:
        contentions/count delay/nanoseconds
                  9   17621016: 1 2 3 
                  1     149081: 1 4 3 
        Locations
             1: 0x851d3 M=1 sync.(*Mutex).Unlock /home/gopher/build/go/src/sync/mutex.go:214 s=0
             2: 0x2d249f M=1 net/http/pprof.mutexHog1 /home/gopher/build/go/src/net/http/pprof/pprof_test.go:107 s=0
             3: 0x2d2937 M=1 net/http/pprof.mutexHog.func1 /home/gopher/build/go/src/net/http/pprof/pprof_test.go:148 s=0
             4: 0x2d24ab M=1 net/http/pprof.mutexHog1 /home/gopher/build/go/src/net/http/pprof/pprof_test.go:108 s=0
        Mappings
        1: 0x0/0x0/0x0   [FN]
FAIL
FAIL	net/http/pprof	57.491s

greplogs --dashboard -md -l -e 'FAIL: TestDeltaProfile ' --since=2021-01-01

2021-12-16T00:34:10-7f23145/openbsd-arm-jsing 2021-11-25T00:02:52-b2a5a37/openbsd-arm-jsing 2021-06-04T17:33:24-831f937/openbsd-arm-jsing

(Forked from #38544, which was mitigated in CL 229498.)

CC @4a6f656c @hyangah

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 19 (17 by maintainers)

Commits related to this issue

Most upvoted comments

Ha, OK. I missed that when I was walking back over the commit log. Thanks @rhysh. I’m going to reopen this. Still not certain this should block the release at this point, but it looks like there’s a real problem and it’s currently being masked.

If I recall correctly, https://go.dev/cl/384239 shouldn’t affect this failure. In that case, the failure was that you’d end up with multiple Locations for things that should be the same Location. e.g., one Location including all of the inlined frames, and 2 separate Locations with one containing the inlined frame and the other containing the parent frame.

But all of the frames should still appear in the profile. In this case, the mutexHog2 frame is missing entirely, so this seems different.

Still, I think it is OK to wait and see if this occurs again.

I’m not opposed, I wanted to give others a chance to chime in, but it’s probably less work to just close it and move on. 😃 If it happens again we can take some of this information forward.

So, this hasn’t happened since March. I have a feeling this is related to a few things that were fixed with respect to profiles recently. Curiously, the failures stopped happening on March 7th, and on March 8th https://go.dev/cl/384239 landed to fix #50996. There aren’t any other pprof or profile related CLs around this one. The previous one landed in February, and the following one in May. That CL, I think, only affected proto output, but then again that’s exactly what the net/http/pprof test is checking, because it obtains the profile via query.

This test is checking a mutex profile, but I imagine it uses the same stack frame information that anything else does. I think it’s possible that what’s happening in this test is call frames are dropped because these NOPs are actually (created when inlining for a “fake” PC for the inlined function) weren’t handled correctly.

However, I’m not totally clear on what the failure mode for #50996 is. Specifically, neither mutexHop1 nor mutexHog2 are inline-able, but they have a few calls that are (like the Lock calls). So is it the parent frame that could get lost? Just the inlined call? The entire stack? I suspect it’s the entire stack because if we land on one of these NOPs and don’t handle them correctly (but only sometimes?). It’s also not clear whether these NOPs play a role here. They probably play some role, given that this is a mutex profile and the Lock and Unlock calls are inlined.

I’m inclined to call this fixed. I haven’t fully figured out the details, but I think there is a plausible connection here.

Sorry, I didn’t get around to it. I’ll do it right now.

@golang/runtime needs to decide whether to fix and unskip this test, or to drop the test.