go: os/exec: failures with "netpollBreak write failed" on linux-amd64 since 2021-11-10

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

I’ve also seen two failures of the form:

runtime: epollctl failed with 22
fatal error: runtime: epollctl failed

goroutine 4 [running]:
runtime.throw({0x74bc90?, 0x3?})
        /workdir/go/src/runtime/panic.go:992 +0x71 fp=0xc000033de0 sp=0xc000033db0 pc=0x436c91
runtime.netpollinit()
        /workdir/go/src/runtime/netpoll_epoll.go:54 +0x165 fp=0xc000033e30 sp=0xc000033de0 pc=0x432d05
runtime.netpollGenericInit()
        /workdir/go/src/runtime/netpoll.go:127 +0x3a fp=0xc000033e48 sp=0xc000033e30 pc=0x43239a
runtime.doaddtimer(0xc000022500, 0xc000068000)
        /workdir/go/src/runtime/time.go:287 +0x30 fp=0xc000033ea0 sp=0xc000033e48 pc=0x455650
runtime.modtimer(0xc000068000, 0x6726dc4c38, 0x0, 0x777938, {0x0?, 0x0}, 0x0)
        /workdir/go/src/runtime/time.go:493 +0x366 fp=0xc000033ee8 sp=0xc000033ea0 pc=0x456186
runtime.resettimer(...)
        /workdir/go/src/runtime/time.go:540
time.resetTimer(...)
        /workdir/go/src/runtime/time.go:230
runtime.scavengeSleep(0x4792a1c0)
        /workdir/go/src/runtime/mgcscavenge.go:243 +0x73 fp=0xc000033f38 sp=0xc000033ee8 pc=0x422a53
runtime.bgscavenge(0x0?)
        /workdir/go/src/runtime/mgcscavenge.go:381 +0x18f fp=0xc000033fc8 sp=0xc000033f38 pc=0x422c8f
runtime.gcenable.func2()
        /workdir/go/src/runtime/mgc.go:178 +0x26 fp=0xc000033fe0 sp=0xc000033fc8 pc=0x41acc6
runtime.goexit()
        /workdir/go/src/runtime/asm_amd64.s:1579 +0x1 fp=0xc000033fe8 sp=0xc000033fe0 pc=0x468961
created by runtime.gcenable
        /workdir/go/src/runtime/mgc.go:178 +0xaa

goroutine 1 [running, locked to thread]:
        goroutine running on other thread; stack unavailable

This is a failure while attempting to add netpollBreakRd to epoll, which seems to be in the same vein.

I’m going to fix the races in that init function and we’ll see if that gets rid of the crashes. I think it is worthwhile either way.

I did manage to reproduce this on a linux-amd64-fedora gomote after ~30min with:

$ gomote run -dir /workdir/go/src/os/exec $(gomote-instance) /workdir/go/bin/go test -c os/exec
$ gomote run -dir /workdir/go/src/os/exec $(gomote-instance) /bin/bash -c 'while /workdir/go/src/os/exec/exec.test -test.short; do true; done'

I think there is a race condition here. The call to f.Stat could occur before the pipe is created, so it fails. Then the pipe could be created, perhaps with the same descriptor as f. Then the init function calls f.Close, closing the pipe descriptor. This doesn’t seem like a likely race, but it doesn’t seem impossible.

errno 32 is EPIPE, meaning the other end of the pipe (netpollBreakRd) is closed. The runtime never closes that FD, so I suspect the bug is something else closing an FD it doesn’t own.