go: runtime/netpoll: randomly hanging Accept on macOS since go1.18

Context

We ship Go binaries that run a net/http.Server to a number of users across different platforms.

Recently we’ve started receiving reports from macOS users, mostly on M1 Mac minis, of an issue where the HTTP server stops responding.

The reports started shortly after we shipped an upgrade from go1.17 to go1.18.

Observations

The issue is very intermittent; some users experience it at least once a week.

At the same time, my own M1 Mac mini has never experienced this issue.

We have been trying to debug this for a couple of months and have been able to gather some information from affected users:

  • does not seem to be ulimit-related, as other parts of the binary continue to function properly, creating files and making network requests

  • the listening socket is still open, because curl gets a connection timeout instead of connection refused:

    curl: (28) Failed to connect to 127.0.0.1 port 8089 after 25926 ms: Operation timed out
    
  • a Go stack trace shows TCPListener.Accept sitting idle in netpoll:

    goroutine 69 [IO wait, 771 minutes]:
    internal/poll.runtime_pollWait(0x12bdee198, 0x72)
        runtime/netpoll.go:302 +0xa4
    internal/poll.(*pollDesc).wait(0x140000c2780?, 0x4?, 0x0)
        internal/poll/fd_poll_runtime.go:83 +0x2c
    internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:88
    internal/poll.(*FD).Accept(0x140000c2780)
        internal/poll/fd_unix.go:614 +0x1d0
    net.(*netFD).accept(0x140000c2780)
        net/fd_unix.go:172 +0x28
    net.(*TCPListener).accept(0x1400066e0c0)
        net/tcpsock_posix.go:139 +0x2c
    net.(*TCPListener).Accept(0x1400066e0c0)
        net/tcpsock.go:288 +0x30
    

I’m looking for any suggestions on how to debug this further, or if anyone has seen anything similar.

Currently I am wondering whether this issue affects just the single socket or is something process-wide. To find out, we’ve spun up another listener on a secondary port and are waiting to hear whether that port keeps working when the primary port stops responding.
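
For reference, here is a minimal sketch of that secondary “canary” listener; the address, port, and handler are placeholders rather than what we actually ship:

    package main

    import (
        "log"
        "net/http"
    )

    // startCanaryListener serves a trivial handler on a secondary port so we
    // can tell whether a hang affects only the primary socket or the whole
    // process: if the primary port times out while this one still answers,
    // the problem is per-socket.
    func startCanaryListener(addr string) {
        mux := http.NewServeMux()
        mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
        })
        go func() {
            if err := http.ListenAndServe(addr, mux); err != nil {
                log.Printf("canary listener exited: %v", err)
            }
        }()
    }

    func main() {
        startCanaryListener("127.0.0.1:8090") // placeholder port
        select {} // the real primary server would run here instead
    }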

As a workaround, we are considering adding an internal health check to the process that tries to connect back to itself over loopback. If the connection times out, we could close and re-create the TCPListener.
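
A rough sketch of what that watchdog could look like, assuming a placeholder address and intervals (the actual restart hook depends on how the http.Server is wired up):

    package main

    import (
        "log"
        "net"
        "time"
    )

    // probeLoopback dials the server's own listening address over loopback
    // with a short timeout; a timeout here mirrors what affected users see
    // with curl.
    func probeLoopback(addr string, timeout time.Duration) bool {
        conn, err := net.DialTimeout("tcp", addr, timeout)
        if err != nil {
            return false
        }
        conn.Close()
        return true
    }

    func main() {
        const addr = "127.0.0.1:8089" // placeholder: primary listener address
        failures := 0
        for range time.Tick(30 * time.Second) {
            if probeLoopback(addr, 5*time.Second) {
                failures = 0
                continue
            }
            failures++
            if failures >= 3 {
                log.Printf("loopback probe failed %d times; restarting listener", failures)
                // Here we would close and re-create the TCPListener; the
                // mechanics depend on how the server is constructed.
                failures = 0
            }
        }
    }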

What version of Go are you using (go version)?

$ go version
go version go1.18.5 darwin/arm64

Does this issue reproduce with the latest release?

I have not tried with go1.19.

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 20 (18 by maintainers)

Most upvoted comments

I filed FB11405981 with Apple

A user reports that removing the Little Snitch network extension fixed the issue. More details here: https://community.getchannels.com/t/channels-dvr-stops-responding-randomly-on-macos/32418/83

Another thought: it could be that your somaxconn is too small to accept more sockets and your listener was overwhelmed. Have you tried netstat -nt | grep {listener_port} to find out how many connections were stuck in the SYN_RECV state during the issue? Or checked how many times the listen queue overflowed (netstat -s | grep overflow) and how many connections it dropped (netstat -s | grep dropped)?

Take a look at the Send-Q and Recv-Q of the TCP listener.

What’s the best way to do so on macOS? Does netstat show this info?

EDIT: The answer is yes, a simple netstat -an will work.

You may want to add -L to the command as well, to show the sizes of the SYN queue and the accept queue. @tmm1

I’ve gone through the netpoll code roughly but didn’t catch anything. Maybe I missed something; I will keep browsing the source code.