go: runtime: netpoll port_getn() fails with impossible errno on illumos

What version of Go are you using (go version)?

$ go version
go version go1.16.3 illumos/amd64

Does this issue reproduce with the latest release?

I do not know, I found the issue running Navidrome https://github.com/navidrome/navidrome which enforces building only with Go 1.16 The relevant bits of netpoll_solaris.go do not seem to be changed against latest.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/whorfin/.cache/go-build"
GOENV="/home/whorfin/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="illumos"
GOINSECURE=""
GOMODCACHE="/home/whorfin/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="illumos"
GOPATH="/home/whorfin/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/ooce/go-1.16"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/ooce/go-1.16/pkg/tool/illumos_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/whorfin/navidrome/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3292860960=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Having built the latest navidrome (master from https://github.com/navidrome/navidrome), it will sporadically crash with:

runtime: port_getn on fd 4 failed (errno=0)
fatal error: runtime: netpoll failed

runtime stack:
runtime.throw(0xe01def, 0x17)
        /opt/ooce/go-1.16/src/runtime/panic.go:1117 +0x72
runtime.netpoll(0x71c11d, 0x12202ce5e8f688)
        /opt/ooce/go-1.16/src/runtime/netpoll_solaris.go:249 +0x505
runtime.findrunnable(0xc000034000, 0x0)
        /opt/ooce/go-1.16/src/runtime/proc.go:2879 +0x3ee
runtime.schedule()
        /opt/ooce/go-1.16/src/runtime/proc.go:3125 +0x2d7
runtime.park_m(0xc000001080)
        /opt/ooce/go-1.16/src/runtime/proc.go:3274 +0x9d
runtime.mcall(0x57eac8)
        /opt/ooce/go-1.16/src/runtime/asm_amd64.s:327 +0x64

goroutine 1 [chan receive, 26 minutes]:
github.com/oklog/run.(*Group).Run(0xc000052d20, 0xc0003fa0b0, 0x1)
        /home/whorfin/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:43 +0xed
github.com/navidrome/navidrome/cmd.runNavidrome()
        /home/whorfin/navidrome/cmd/root.go:59 +0x107
github.com/navidrome/navidrome/cmd.glob..func2(0x25e8540, 0x266d098, 0x0, 0x0)
        /home/whorfin/navidrome/cmd/root.go:31 +0x25
github.com/spf13/cobra.(*Command).execute(0x25e8540, 0xc000092020, 0x0, 0x0, 0x25e8540, 0xc000092020)
        /home/whorfin/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x25e8540, 0xc000000180, 0x200000003, 0xc000000180)
        /home/whorfin/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /home/whorfin/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
github.com/navidrome/navidrome/cmd.Execute()
        /home/whorfin/navidrome/cmd/root.go:39 +0x65
main.main()
        /home/whorfin/navidrome/main.go:11 +0x2b

That (errno=0) appears to be impossible.

What did you expect to see?

Ideally, no crashes. If there were a crash, I’d expect a meaningful errno.

What did you see instead?

A crash with errno=0

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 27 (12 by maintainers)

Most upvoted comments

Here is a script that will trace some of the port_getn() calls for a process:

#!/usr/sbin/dtrace -qs

pid$target::port_getn:entry
{
        self->x = timestamp;
        self->fd = args[0];
        self->ngetp = args[3];
        self->nget = *self->ngetp;

        printf("%u [%d/%d:%s] port_getn(%d, nget %p = %d)...\n",
            timestamp, pid, tid, execname,
            self->fd, self->ngetp, self->nget);
}

syscall::portfs:entry
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs %d entry\n",
            timestamp, pid, tid, execname,
            (int)arg0);
}

syscall::portfs:return
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs return %d (errno %d)\n",
            timestamp, pid, tid, execname,
            (int)arg0, errno);
}

pid$target::port_getn:return
/self->x/
{
        this->dur = (timestamp - self->x) / 1000;

        /*
         * Get the per-thread errno location for this thread:
         */
        this->fsbase = curthread->t_lwp->lwp_pcb.pcb_fsbase;
        this->ulwp = (userland struct pid`ulwp *)this->fsbase;
        this->errnop = this->ulwp->ul_errnop;

        printf("%u [%d/%d:%s] -> %d (errnop %p = %d, nget = %d) %dus\n",
            timestamp, pid, tid, execname,
            (int)arg1, this->errnop, *this->errnop, *self->ngetp, this->dur);

        if ((int)arg1 < 0 && *this->errnop == 0) {
                printf("WARNING: oddball return detected\n");
                ustack();
                printf("\n");
        }

        self->x = 0;
        self->fd = 0;
        self->ngetp = 0;
        self->nget = 0;
}

e.g.,

some output from an acme-dns process:

dns2 # /var/tmp/portgetn.d -p $(pgrep acme-dns)
13334996135948040 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13334996135962747 [597/6:acme-dns] SYSCALL portfs 6 entry
13334996135968032 [597/6:acme-dns] SYSCALL portfs return 0 (errno 0)
13334996135973638 [597/6:acme-dns] -> 0 (errnop fffffc7feec8234c = 0, nget = 0) 25us

13334996135986941 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13334996135991899 [597/6:acme-dns] SYSCALL portfs 6 entry
13335017554884999 [597/6:acme-dns] SYSCALL portfs return 1 (errno 0)
13335017554908883 [597/6:acme-dns] -> 0 (errnop fffffc7feec8234c = 0, nget = 1) 21418921us

13335017555435247 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13335017555446540 [597/6:acme-dns] SYSCALL portfs 6 entry
13335017555451788 [597/6:acme-dns] SYSCALL portfs return 0 (errno 0)
13335017555456734 [597/6:acme-dns] -> 0 (errnop fffffc7feec8234c = 0, nget = 0) 21us

13335017555465954 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13335017555470793 [597/6:acme-dns] SYSCALL portfs 6 entry

Critically this script looks at two things:

  • the libc-level port_getn() call, including the per-thread errno value that Go should be collecting via errno()
  • the underlying portfs system call made by port_getn() which is what truss is showing you as well, including the errno value that the kernel put in place for the system call return (this is separate but related to the libc errno value)

There is a conditional print if we detect a failure return (-1) where errno is 0, which should print something like:

WARNING: oddball return detected

              libc.so.1`port_getn+0x51
              acme-dns`runtime.netpoll+0x1ef
              acme-dns`runtime.startTheWorldWithSema+0x236
              acme-dns`runtime.gcMarkTermination.func3+0x26
              acme-dns`runtime.systemstack+0x73
              acme-dns`runtime.gcMarkDone+0x1ff
              acme-dns`runtime.gcBgMarkWorker+0x29d
              acme-dns`runtime.goexit+0x1

It would be good to capture this sort of trace output while the problem is occurring, so that we can determine whether the C library is mishandling the error numbers here, or if it is some part of the Go machinery for fetching thread local errno after a C library call, or something else.