grpc-go: gRPC connection closing with unexpected EOF on high-latency connections

What version of gRPC are you using?

1.46.2

What version of Go are you using (go version)?

go version go1.18 darwin/amd64

What operating system (Linux, Windows, …) and version?

Client is running macOS (ProductName: macOS, ProductVersion: 11.6, BuildVersion: 20G165)

Server is running Linux (CentOS 7)

What did you do?

Note: I have posted a small repo that can be used to reproduce this bug; please check it out here: https://github.com/brianneville/grpcbug


For high-latency connections, I noticed that RPCs were having their connection terminated with an internal error: code = Internal desc = unexpected EOF

This hits on both unary and streaming RPCs, and how easily it can be reproduced seems to vary with the client-server latency. For high-latency connections (such as a 150ms ping), the error hits pretty much every time an RPC is called. For low-latency connections (<2ms ping) the error is much less frequent, and hundreds of thousands of messages may be streamed over the connection before it is hit even once.

I did a bit of digging through the grpc library code, and found two ways that the error can be prevented:

1. Client side - configure window sizes

The error can be prevented from the client side by disabling the dynamic window and BDP estimation for flow control when dialing the gRPC server. That is, set the DialOptions on the client side to use:

    opts := []grpc.DialOption{
       grpc.WithInitialWindowSize(largerWindowSize),
       grpc.WithInitialConnWindowSize(largerWindowSize),
    }

Where both:

  • largerWindowSize is greater than 65535 (so that dynamic window/flow estimation is turned off)
  • largerWindowSize is greater than the size of the largest RPC response messages (with a bit of overhead for some reason).
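
For reference, here is a minimal sketch of what that workaround looks like at the dial site. The target address, the 1 MiB window value, and the use of insecure credentials are illustrative assumptions, not taken from the repro repo:

    package main

    import (
        "log"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    func main() {
        // Any value above 65535 turns off BDP estimation and dynamic windows;
        // it must also exceed the largest expected RPC response message.
        const largerWindowSize int32 = 1 << 20 // 1 MiB, arbitrary example value

        opts := []grpc.DialOption{
            grpc.WithTransportCredentials(insecure.NewCredentials()), // assumption: plaintext for the example
            grpc.WithInitialWindowSize(largerWindowSize),
            grpc.WithInitialConnWindowSize(largerWindowSize),
        }

        conn, err := grpc.Dial("server.example.com:50051", opts...)
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()

        // Create the generated client stub from conn and issue RPCs as usual.
    }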

2. Server-side - delay ending the stream

The error can also be prevented from the server side by delaying the call that writes the OK status into the transport. Specifically, the error goes away if the header carrying END_STREAM is delayed from being put into the controlBuf in google.golang.org/grpc/internal/transport/http2_server.go#finishStream. That is, you can make either of the following changes to *http2Server.finishStream and the error will not occur:

    time.Sleep(1 * time.Second) // <-- sleeping before put = EOF error is prevented
    t.controlBuf.put(hdr)

or

    go func() {         // <-- allowing finishStream to progress and sleep before put = EOF error is prevented
      time.Sleep(1 * time.Second)
      t.controlBuf.put(hdr)
    }()

When the t.controlBuf.put(hdr) line is delayed in this way, the RPC is allowed to complete normally, and the client will see the response as intended.

Note that if you add the sleep after t.controlBuf.put(hdr), the error is still present (i.e. delaying the finishStream function itself is not what prevents the error):

    t.controlBuf.put(hdr)
    time.Sleep(1 * time.Second) // <-- sleeping after put = EOF error still occurs 

Would anyone know what might be going on here, or be able to give me some insight/advice for continuing to debug this issue?

What did you expect to see?

Connection latency should not affect whether RPCs complete successfully.

What did you see instead?

High-latency connections (such as ~150ms ping) reliably hit the error code = Internal desc = unexpected EOF.

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 4
  • Comments: 38 (21 by maintainers)

Most upvoted comments

I also get the same EOF message, rpc error: code = Internal desc = unexpected EOF, in etcdctl.

When I use the get command in etcdctl on a Mac, it always outputs 'unexpected EOF' (ping to endpoint: 15ms). But it's OK for another Linux client (ping to endpoint: < 2ms), or for short results such as etcdctl get --prefix "/blabla" --limit 100.

I modified the etcdctl source as below and got all of the results. https://github.com/etcd-io/etcd/blob/87e271701b3449c35d9ef41f855c7792488773ce/client/v3/client.go#L285

    opts := append(dopts, grpc.WithResolvers(c.resolver), grpc.WithInitialWindowSize(1_000_000), grpc.WithInitialConnWindowSize(1_000_000))

I compiled and ran the server and client for the repo that reproduces this bug (https://github.com/brianneville/grpcbug) using go1.17.6 and the issue is still present, so I’d be hesitant to say that https://github.com/golang/go/issues/53253 is at fault here.

Just to explore further though, @hexfusion could you try appending the following DialOptions to the client?

    var opts []grpc.DialOption

    // add any other DialOptions...

    const maxWindowSize int32 = (1 << 20) * 16
    opts = append(opts,
        grpc.WithInitialWindowSize(maxWindowSize),
        grpc.WithInitialConnWindowSize(maxWindowSize),
    )

These DialOptions are a workaround for this bug (https://github.com/grpc/grpc-go/issues/5358) so it would be interesting to see if they have any effect in your case.

@zasweq curious if we think it could be related to this bug in net/http2[1]? Could you help drive priority? This is causing huge problems for us in production with golang 1.18.

Unlikely, since grpc-go uses its own HTTP server implementation, and just the framer part of the net/http2 library. But your link to the net/http2 issue is a link back to this issue, so I can’t see what you are referencing.

EDIT: Oh sorry, I also see your back-reference from that bug to this one. Still unlikely for the same reason, but I’ll take a look.

One thing we suspect now is that the server might not be setting the linger socket option (need to check on this), which could lead to this kind of behavior if there is any TCP error after the server finishes writing the RPC data and closes the connection.

Update: it looks like Go sets this by default, so I don’t suspect this has anything to do with it anymore.

I will create a new issue for this, but the whole reason we are using ServerMaxConnectionAge is that we would overflow the stream ID every now and then (invalid stream ID). So maybe it's related.

Interesting. Apparently gRPC-Java will automatically reconnect when stream IDs are 50% exhausted. We don’t do that in gRPC-Go, but it would be a good idea.

If the connection is lost in the middle of an RPC then it’s expected to get RPC failures.

You could use a MaxConnectionAgeGrace setting to help work around that issue when it's caused by MaxConnectionAge connection termination from the server.
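
For context, a hedged sketch of how these keepalive settings are usually wired up on a grpc-go server; the specific durations below are arbitrary example values, not recommendations:

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // newServer recycles connections after roughly 30 minutes, but gives
    // in-flight RPCs up to 5 more minutes to finish before the connection
    // is forcibly closed. Both durations are example values only.
    func newServer() *grpc.Server {
        return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
            MaxConnectionAge:      30 * time.Minute,
            MaxConnectionAgeGrace: 5 * time.Minute,
        }))
    }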

The grpc.WithInitialWindowSize and grpc.WithInitialConnWindowSize options are independent of the grpc.MaxCallRecvMsgSize and grpc.MaxCallSendMsgSize options. You can combine them if you wish (I'd say try it, because I'm curious to see whether the workaround has any effect in your case).

Setting grpc.WithInitialWindowSize and grpc.WithInitialConnWindowSize disables BDP estimation and forces the window size to whatever value you provide. By comparison, grpc.MaxCallRecvMsgSize and grpc.MaxCallSendMsgSize only set the maximum size of the overall message that can be received/sent (with that message being sent across multiple data frames, using a dynamic window size unless BDP is disabled).
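
To make the distinction concrete, here is a sketch that combines both kinds of options on the dial side; the sizes are arbitrary values chosen for illustration:

    const maxWindowSize int32 = (1 << 20) * 16 // 16 MiB fixed window; disables BDP estimation
    const maxMsgSize = (1 << 20) * 64          // 64 MiB cap on any single message

    opts := []grpc.DialOption{
        // Flow control: how many bytes may be in flight per stream/connection.
        grpc.WithInitialWindowSize(maxWindowSize),
        grpc.WithInitialConnWindowSize(maxWindowSize),
        // Message-size limits: the largest single message this client will receive/send.
        grpc.WithDefaultCallOptions(
            grpc.MaxCallRecvMsgSize(maxMsgSize),
            grpc.MaxCallSendMsgSize(maxMsgSize),
        ),
    }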

https://github.com/golang/go/issues/53253 appears completely unrelated: 1. we don't use the same client/server implementation, and 2. the symptoms are different; that issue was a 5s delay caused by delayed receipt of a frame with END_STREAM set, whereas this appears to be streams failing due to flow-control violations (either the sender using more quota than it had, or the recipient incorrectly believing that was the case).

Anecdotal evidence only, but I don’t recall seeing these failures prior to our Go 1.18 upgrade…