go: net: deadlock in TestDialParallelSpuriousConnection on darwin-arm*

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 18 (15 by maintainers)

Commits related to this issue

Most upvoted comments

Cool, thanks. I sent an e-mail to a couple of contacts at Apple.

This appears to be a macOS bug. I’ve reproduced it in C.

Client:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

int main() {
        int i = 0;
        while (1) {
                int sock = socket(PF_INET, SOCK_STREAM, 0);
                if (sock < 0) {
                        perror("socket");
                        return 1;
                }

                struct sockaddr_in raddr;
                raddr.sin_family = AF_INET;
                raddr.sin_port = htons(2020);
                if (inet_aton("127.0.0.1", &raddr.sin_addr) <= 0) {
                        perror("inet_aton");
                        return 1;
                }

                if (connect(sock, (struct sockaddr *)&raddr, sizeof(raddr)) < 0) {
                        perror("connect");
                        return 1;
                }

                close(sock);

                printf("%d\n", i);
                i++;
        }
}

Server:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

int main() {
        int sock = socket(PF_INET, SOCK_STREAM, 0);
        if (sock < 0) {
                perror("socket");
                return 1;
        }

        struct sockaddr_in laddr;
        laddr.sin_family = AF_INET;
        laddr.sin_port = htons(2020);
        laddr.sin_addr.s_addr = 0;
        if (bind(sock, (struct sockaddr *)&laddr, sizeof(laddr)) < 0) {
                perror("bind");
                return 1;
        }

        if (listen(sock, 10) < 0) {
                perror("listen");
                return 1;
        }

        while (1) {
                struct sockaddr_in sin;
                socklen_t slen = sizeof(sin);
                int c = accept(sock, (struct sockaddr *)&sin, &slen);
                if (c < 0) {
                        perror("accept");
                        return 1;
                }

                char buf[1];
                if (read(c, buf, 1) < 0) {
                        perror("read");
                        return 1;
                }

                close(c);
        }
}

Run server. Run client. After a small number of iterations (takes milliseconds on my laptop), the server will hang in the read() call for 60 seconds before the call fails with ECONNRESET.

This looks like a real failure.

I’ve reproduced it in a simpler test case:

func TestDialRace(t *testing.T) {
        handlerc := make(chan struct{})
        time.Sleep(10 * time.Millisecond) // slow down to avoid address exhaustioin

        readDeadline := time.Now().Add(70 * time.Second)
        handler := func(ls *localServer, ln Listener) {
                defer close(handlerc)
                c, err := ln.Accept()
                if err != nil {
                        t.Fatal(err)
                }
                var b [1]byte
                if _, err := c.Read(b[:]); err != io.EOF {
                        t.Errorf("got %v; want %v", err, io.EOF)
                }
                c.Close()
        }
        ls := newLocalServer(t, "tcp")
        defer ls.teardown()
        if err := ls.buildup(handler); err != nil {
                t.Fatal(err)
        }

        d := Dialer{}
        c, err := d.Dial("tcp", ls.Listener.Addr().String())
        if err != nil {
                t.Fatal(err)
        }
        c.Close()

        <-handlerc
}

After running for a few minutes on my darwin/arm64 laptop:

=== RUN   TestDialRace
    dial_test.go:530: got read tcp4 127.0.0.1:54421->127.0.0.1:54423: read: connection reset by peer; want EOF
--- FAIL: TestDialRace (60.01s)

The client dials and immediately closes a connection. The listener accepts and reads from the connection. On rare occasions, the read hangs for 60 seconds before returning “connection reset by peer”.

Running the test 100’000 times on a darwin/amd64 machine did not reproduce the problem:

$ go test -count=100000 -timeout=10h -run=TestDialParallelSpuriousConnection net
ok  	net	10214.424s

Running the same on a darwin/arm64 machine seems to reproduce after a while:

$ go test -count=100000 -timeout=10h -run=TestDialParallelSpuriousConnection net
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:63672->127.0.0.1:63674: read: connection reset by peer; want EOF
[...]
FAIL
FAIL	net	14609.803s
FAIL
Complete output
$ go test -count=100000 -timeout=10h -run=TestDialParallelSpuriousConnection net
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:63672->127.0.0.1:63674: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:54481->[::1]:54482: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:55077->[::1]:55078: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:57926->127.0.0.1:57928: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:63669->[::1]:63670: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:51744->[::1]:51746: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:55708->127.0.0.1:55710: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.15s)
    dial_test.go:445: got read tcp6 [::1]:65484->[::1]:65485: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:49742->127.0.0.1:49744: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:51501->[::1]:51503: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:52425->127.0.0.1:52427: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:60940->127.0.0.1:60942: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:54599->127.0.0.1:54600: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:51920->127.0.0.1:51922: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:59242->127.0.0.1:59244: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:56513->[::1]:56514: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.13s)
    dial_test.go:445: got read tcp6 [::1]:58734->[::1]:58735: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:59301->[::1]:59302: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.55s)
    dial_test.go:445: got read tcp4 127.0.0.1:63063->127.0.0.1:63065: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:64137->[::1]:64138: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:65499->[::1]:65501: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:53488->127.0.0.1:53489: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:53839->[::1]:53840: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:60746->[::1]:60747: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:55986->[::1]:55988: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:57475->127.0.0.1:57477: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.19s)
    dial_test.go:445: got read tcp4 127.0.0.1:59726->127.0.0.1:59728: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:60149->127.0.0.1:60151: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.57s)
    dial_test.go:445: got read tcp4 127.0.0.1:60743->127.0.0.1:60745: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:62183->[::1]:62184: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:49745->[::1]:49746: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:51823->127.0.0.1:51824: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:52795->127.0.0.1:52796: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:56493->[::1]:56495: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:56652->127.0.0.1:56653: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:57838->[::1]:57839: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:59977->127.0.0.1:59979: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:62461->[::1]:62462: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:53526->[::1]:53528: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:50751->127.0.0.1:50753: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:51924->127.0.0.1:51926: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:55363->[::1]:55365: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:58250->[::1]:58251: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:60293->[::1]:60295: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:61091->127.0.0.1:61092: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.36s)
    dial_test.go:445: got read tcp4 127.0.0.1:63470->127.0.0.1:63472: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:63614->127.0.0.1:63616: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:49644->127.0.0.1:49646: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:50178->127.0.0.1:50179: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:54815->127.0.0.1:54817: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:56958->127.0.0.1:56960: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:57606->[::1]:57607: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:62890->127.0.0.1:62891: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:64451->[::1]:64452: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:51542->127.0.0.1:51544: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:52580->127.0.0.1:52582: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:52769->[::1]:52770: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:53588->[::1]:53590: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:57702->127.0.0.1:57704: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.14s)
    dial_test.go:445: got read tcp6 [::1]:59139->[::1]:59140: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:60891->127.0.0.1:60892: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:61188->127.0.0.1:61189: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:61320->[::1]:61321: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:63618->127.0.0.1:63619: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:63942->127.0.0.1:63944: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:65073->[::1]:65074: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:62988->127.0.0.1:62990: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:53143->[::1]:53144: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:57219->[::1]:57220: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:62580->[::1]:62581: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:63111->[::1]:63112: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:55112->127.0.0.1:55113: read: connection reset by peer; want EOF
FAIL
FAIL	net	14609.803s
FAIL

Tested with Go 1.18.2, both with macOS 12 (though slightly different minor versions).

greplogs --dashboard -md -l -e 'FAIL: TestDialParallelSpuriousConnection.*\n.*i/o timeout' --since=2020-10-20

2020-10-20T00:59:23-a505312/ios-arm64-corellium

Looks fixed to me: none of these in over a year.