go: runtime/pprof: TestVMInfo failures

#!watchflakes
default <- pkg == "runtime/pprof" && test == "TestVMInfo"

Issue created automatically to collect these failures.

Example (log):

--- FAIL: TestVMInfo (0.37s)
    vminfo_darwin_test.go:59: exit status 255

watchflakes

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

OK, so if I read your note correctly, the suspicion is that the runner is not actually overloaded; rather, this is a case of #18146. That seems at least plausible to me, provided there were no other issues with the test runner when this test failed. I’ll make the test retry when it sees this error rather than skip.

Cheers, Cos.

On Wed, Nov 8, 2023 at 8:58 AM Bryan C. Mills @.***> wrote:

I’d worry that the test could fail silently for extended periods and hence be rendered useless without us ever knowing

Agreed; that’s a problem for flaky tests in general. Making the skip as precise as possible helps somewhat (it prevents the skip from masking other failures), but it would still be possible for the test to skip due to this exact failure mode unexpectedly.

I could retry the test, but that seems like the wrong thing to do on an overloaded test runner.

I agree that a retry is unfortunate, but it might be the least-bad option. It seems better to waste a small amount of resources on retrying a single test, instead of wasting a much larger amount of resources retrying an entire failed TryBot run or causing a contributor to have to re-run go test (or, even worse, all.bash) after a spurious failure on their local machine.

Was this the only test run that failed at that time on that runner? That is, was the runner sufficiently overloaded for other tests to fail?

At least in the last log that was posted, no other tests failed.

Note that the Go runtime explicitly retries EAGAIN errors from pthread_create, which seems likely to stem from essentially the same resource exhaustion problem:

https://cs.opensource.google/go/go/+/master:src/runtime/os_darwin.go;l=244-246;drc=4be921d888d3a68c51e38d4c615a4438c7b2cb30

The history of that retry seems to be related to #18146 <https://github.com/golang/go/issues/18146>, in which calling pthread_create concurrently with exec caused a spurious EAGAIN error. Perhaps the error in vmmap could have a related cause?

Has it happened again, and if so, how often does it happen?

The logs posted on this issue are the failures I am aware of.

I’d be happy to make the change, but I’m not sure it’s solving a problem beyond having a clean dashboard.

Alert fatigue <https://en.wikipedia.org/wiki/Alarm_fatigue> is a problem in its own right: if we train people that test failures are just something that occasionally happens in normal operation, then they will dismiss intermittent test failures that may indicate a deeper, systematic problem. (In the case of macOS in particular, we currently have #63937 <https://github.com/golang/go/issues/63937> and #60449 <https://github.com/golang/go/issues/60449> to contend with on that front.)
