grpc-go: gRPC connection closing with unexpected EOF on high-latency connections

What version of gRPC are you using?

1.46.2

What version of Go are you using (go version)?

go version go1.18 darwin/amd64

What operating system (Linux, Windows, …) and version?

Client is running macOS (ProductName: macOS, ProductVersion: 11.6, BuildVersion: 20G165)

Server is running Linux (CentOS 7)

What did you do?

Note: I have posted a small repo that can be used to reproduce this bug; please check it out here: https://github.com/brianneville/grpcbug


For high-latency connections, I noticed that RPCs were having their connection terminated with an internal error: code = Internal desc = unexpected EOF

This hits on both unary and streaming RPCs, and how easily it can be reproduced seems to vary with the client-server latency. For high-latency connections (such as a 150ms ping), the error hits pretty much every time an RPC is called. For low-latency connections (<2ms ping) the error is much less frequent, and hundreds of thousands of messages may be streamed over the connection before it is hit even once.

I did a bit of digging through the grpc library code, and found two ways that the error can be prevented:

1. Client side - configure window sizes

The error can be prevented from the client side by disabling the dynamic window and BDP estimation for flow control when dialing the gRPC server. That is, set the DialOptions on the client side to use:

    opts := []grpc.DialOption{
       grpc.WithInitialWindowSize(largerWindowSize),
       grpc.WithInitialConnWindowSize(largerWindowSize),
    }

Where both:

  • largerWindowSize is greater than 65535 (so that dynamic window/flow estimation is turned off)
  • largerWindowSize is greater than the size of the largest RPC response messages (with a bit of overhead for some reason).
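
For reference, here is a minimal sketch of what that workaround looks like at the dial site. The target address, the 1 MiB window value, and the use of insecure credentials are illustrative assumptions, not taken from the repro repo:

    package main

    import (
        "log"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    func main() {
        // Any value above 65535 turns off BDP estimation and dynamic windows;
        // it must also exceed the largest expected RPC response message.
        const largerWindowSize int32 = 1 << 20 // 1 MiB, arbitrary example value

        opts := []grpc.DialOption{
            grpc.WithTransportCredentials(insecure.NewCredentials()), // assumption: plaintext for the example
            grpc.WithInitialWindowSize(largerWindowSize),
            grpc.WithInitialConnWindowSize(largerWindowSize),
        }

        conn, err := grpc.Dial("server.example.com:50051", opts...)
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()

        // Create the generated client stub from conn and issue RPCs as usual.
    }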

2. Server-side - delay ending the stream

The error can also be prevented from the server side by delaying the call that writes the OK status into the transport. Specifically, the error goes away if the header carrying END_STREAM is delayed from being put into the controlBuf in google.golang.org/grpc/internal/transport/http2_server.go#finishStream. That is, you can make either of the following changes to *http2Server.finishStream and the error will not occur:

    time.Sleep(1 * time.Second) // <-- sleeping before put = EOF error is prevented
    t.controlBuf.put(hdr)

or

    go func() {         // <-- allowing finishStream to progress and sleep before put = EOF error is prevented
      time.Sleep(1 * time.Second)
      t.controlBuf.put(hdr)
    }()

When the t.controlBuf.put(hdr) line is delayed in this way, the RPC is allowed to complete normally, and the client will see the response as intended.

Note that if you add the sleep after t.controlBuf.put(hdr), the error is still present (i.e. delaying the finishStream function itself is not what prevents the error):

    t.controlBuf.put(hdr)
    time.Sleep(1 * time.Second) // <-- sleeping after put = EOF error still occurs 

Would anyone know what might be going on here, or be able to give me some insight/advice for continuing to debug this issue?

What did you expect to see?

Connection latency should not affect whether RPCs complete successfully.

What did you see instead?

High-latency connections (such as ~150ms ping) reliably hit the error code = Internal desc = unexpected EOF.

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 4
  • Comments: 38 (21 by maintainers)

Most upvoted comments

I also get the same EOF message, rpc error: code = Internal desc = unexpected EOF, in etcdctl.

When I use the get command in etcdctl on a Mac, it always outputs 'unexpected EOF' (ping to endpoint: 15ms). But it's OK for another Linux client (ping to endpoint: < 2ms), or for short results such as etcdctl get --prefix "/blabla" --limit 100.

I modified the etcdctl source as below and got all of the results. https://github.com/etcd-io/etcd/blob/87e271701b3449c35d9ef41f855c7792488773ce/client/v3/client.go#L285

    opts := append(dopts, grpc.WithResolvers(c.resolver), grpc.WithInitialWindowSize(1_000_000), grpc.WithInitialConnWindowSize(1_000_000))

I compiled and ran the server and client for the repo that reproduces this bug (https://github.com/brianneville/grpcbug) using go1.17.6 and the issue is still present, so I’d be hesitant to say that https://github.com/golang/go/issues/53253 is at fault here.

Just to explore further though, @hexfusion could you try appending the following DialOptions to the client?

    var opts []grpc.DialOption

    // add any other DialOptions...

    const maxWindowSize int32 = (1 << 20) * 16
    opts = append(opts,
        grpc.WithInitialWindowSize(maxWindowSize),
        grpc.WithInitialConnWindowSize(maxWindowSize),
    )

These DialOptions are a workaround for this bug (https://github.com/grpc/grpc-go/issues/5358) so it would be interesting to see if they have any effect in your case.

@zasweq curious if we think it could be related to this bug in net/http2[1]? Could you help drive priority? This is causing huge problems for us in production with golang 1.18.

Unlikely, since grpc-go uses its own HTTP server implementation, and just the framer part of the net/http2 library. But your link to the net/http2 issue is a link back to this issue, so I can’t see what you are referencing.

EDIT: Oh sorry, I also see your back-reference from that bug to this one. Still unlikely for the same reason, but I’ll take a look.

One thing we suspect now is that the server might not be setting the linger socket option (need to check on this), which could lead to this kind of behavior if there is any TCP error after the server finishes writing the RPC data and closes the connection.

Update: it looks like Go sets this by default, so I don’t suspect this has anything to do with it anymore.

I will create a new issue for this, but the whole reason we are using ServerMaxConnectionAge is that we would overflow the stream ID every now and then (invalid stream ID). So maybe it's related.

Interesting. Apparently gRPC-Java will automatically reconnect when stream IDs are 50% exhausted. We don’t do that in gRPC-Go, but it would be a good idea.

If the connection is lost in the middle of an RPC then it’s expected to get RPC failures.

You could use a MaxConnectionAgeGrace setting to help work around that issue when it's caused by MaxConnectionAge connection termination from the server.
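
For context, a hedged sketch of how these keepalive settings are usually wired up on a grpc-go server; the specific durations below are arbitrary example values, not recommendations:

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // newServer recycles connections after roughly 30 minutes, but gives
    // in-flight RPCs up to 5 more minutes to finish before the connection
    // is forcibly closed. Both durations are example values only.
    func newServer() *grpc.Server {
        return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
            MaxConnectionAge:      30 * time.Minute,
            MaxConnectionAgeGrace: 5 * time.Minute,
        }))
    }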

The grpc.WithInitialWindowSize and grpc.WithInitialConnWindowSize options are independent of the grpc.MaxCallRecvMsgSize and grpc.MaxCallSendMsgSize options. You can combine them if you wish (I'd say try it, because I'm curious to see whether the workaround has any effect in your case).

Setting grpc.WithInitialWindowSize and grpc.WithInitialConnWindowSize disables BDP estimation and forces the window size to whatever value you provide. By comparison, grpc.MaxCallRecvMsgSize and grpc.MaxCallSendMsgSize only set the maximum size of the overall message that can be received/sent (with that message being sent across multiple data frames, using a dynamic window size unless BDP is disabled).
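
To make the distinction concrete, here is a sketch that combines both kinds of options on the dial side; the sizes are arbitrary values chosen for illustration:

    const maxWindowSize int32 = (1 << 20) * 16 // 16 MiB fixed window; disables BDP estimation
    const maxMsgSize = (1 << 20) * 64          // 64 MiB cap on any single message

    opts := []grpc.DialOption{
        // Flow control: how many bytes may be in flight per stream/connection.
        grpc.WithInitialWindowSize(maxWindowSize),
        grpc.WithInitialConnWindowSize(maxWindowSize),
        // Message-size limits: the largest single message this client will receive/send.
        grpc.WithDefaultCallOptions(
            grpc.MaxCallRecvMsgSize(maxMsgSize),
            grpc.MaxCallSendMsgSize(maxMsgSize),
        ),
    }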

https://github.com/golang/go/issues/53253 appears completely unrelated: 1. we don't use the same client/server implementation, and 2. the symptoms are different; that issue was a 5s delay caused by delayed receipt of a frame with END_STREAM set, whereas this appears to be streams failing due to flow-control violations (either the sender using more quota than it had, or the recipient incorrectly believing that was the case).

Anecdotal evidence only, but I don’t recall seeing these failures prior to our Go 1.18 upgrade…