grpc-go: gRPC connection closing with unexpected EOF on high-latency connections
What version of gRPC are you using?
1.46.2
What version of Go are you using (`go version`)?
go version go1.18 darwin/amd64
What operating system (Linux, Windows, …) and version?
Client is running macOS (ProductName: macOS, ProductVersion: 11.6, BuildVersion: 20G165)
Server is running Linux (CentOS 7)
What did you do?
Note: I have posted a small repo that can be used to reproduce this bug, please check it out here: https://github.com/brianneville/grpcbug
For high-latency connections, I noticed that RPCs were having their connection terminated with an internal error:
code = Internal desc = unexpected EOF
This hits on both unary and streaming RPCs, and how easily it can be reproduced seems to vary with the client-server latency. For high-latency connections (such as a ~150ms ping), this error hits pretty much every time an RPC is called. For low-latency connections (<2ms ping) the error is much more infrequent, and hundreds of thousands of messages may be streamed over the connection before it is hit even once.
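(For reference, the failure can be detected programmatically on the client with the standard status/codes packages; the helper below is only a sketch I'm adding here, not something from the repro repo.)

```go
import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isUnexpectedEOF reports whether an RPC error matches the failure described
// here: a status with code Internal and an "unexpected EOF" description.
func isUnexpectedEOF(err error) bool {
	st, ok := status.FromError(err)
	return ok && st.Code() == codes.Internal &&
		strings.Contains(st.Message(), "unexpected EOF")
}
```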
I did a bit of digging through the grpc library code, and found two ways that the error can be prevented:
1. Client side - configure window sizes
The error can be prevented from the client side by disabling the dynamic window and BDP estimation for flow control when dialing the grpc server. That is, setting the DialOptions on the client side to use:
opts := []grpc.DialOption{
	grpc.WithInitialWindowSize(largerWindowSize),
	grpc.WithInitialConnWindowSize(largerWindowSize),
}
Where `largerWindowSize`:
- is greater than 65535 (so that dynamic window/flow estimation is turned off), and
- is greater than the size of the largest RPC response message (with a bit of overhead, for some reason).
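For context, here is a minimal dial sketch with this workaround applied; the 1 MiB window size, the insecure credentials, and the target address are placeholders chosen only for illustration:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Any value above 65535 disables BDP estimation and dynamic flow-control
	// windows; it should also exceed the largest expected response message.
	const largerWindowSize = 1 << 20 // 1 MiB, illustrative only

	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithInitialWindowSize(largerWindowSize),
		grpc.WithInitialConnWindowSize(largerWindowSize),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// ... create stubs and issue RPCs on conn as usual ...
}
```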
2. Server-side - delay ending the stream
The error can be prevented by delaying the calls that write the `StatusOk` into the transport. Specifically, the error does not occur if the END_STREAM header is delayed from being put into the `controlBuf` at google.golang.org/grpc/internal/transport/http2_server.go#finishStream. That is, you can make either of the following changes to `*http2Server.finishStream`, and the error will not be present:
time.Sleep(1 * time.Second) // <-- sleeping before put = EOF error is prevented
t.controlBuf.put(hdr)
or
go func() { // <-- allowing finishStream to progress and sleep before put = EOF error is prevented
	time.Sleep(1 * time.Second)
	t.controlBuf.put(hdr)
}()
When the `t.controlBuf.put(hdr)` line is delayed in this way, the RPC is allowed to complete normally, and the client sees the response as intended.
Note that if you add the sleep after `t.controlBuf.put(hdr)`, the error is still present (i.e. delaying the finishStream function itself is not what prevents the error):
t.controlBuf.put(hdr)
time.Sleep(1 * time.Second) // <-- sleeping after put = EOF error still occurs
Would anyone know what might be going on here, or be able to give me some insight/advice for continuing to debug this issue?
What did you expect to see?
RPCs complete successfully regardless of connection latency.
What did you see instead?
High-latency connections (such as ~150ms ping) reliably hit the error `code = Internal desc = unexpected EOF`.
About this issue
- State: open
- Created 2 years ago
- Reactions: 4
- Comments: 38 (21 by maintainers)
Commits related to this issue
- gnmi/client: add bdp flag Setting this flag to false will disable BDP and dynamic windows for the grpc connection that the client uses. This is a workaround for https://github.com/grpc/grpc-go/issue... — committed to aristanetworks/goarista by brianneville 2 years ago
- Workaround for GRPC unexpected EOF A number of our GRPC tests have been failing with an unexpected EOF error. It appears to be related to grpc/grpc-go#5358. Apply the suggested window size workaroun... — committed to gravitational/teleport by zmb3 2 years ago
- Update grpc-go Bumps google.golang.org/grpc to v1.49.0 which includes a fix for https://github.com/grpc/grpc-go/issues/5358 which has been causing tests to fail. — committed to gravitational/teleport by rosstimothy 2 years ago
- Update grpc-go (#15926) Bumps google.golang.org/grpc to v1.49.0 which includes a fix for https://github.com/grpc/grpc-go/issues/5358 which has been causing tests to fail. — committed to gravitational/teleport by rosstimothy 2 years ago
- Update grpc-go (#15926) (#16207) Bumps google.golang.org/grpc to v1.49.0 which includes a fix for https://github.com/grpc/grpc-go/issues/5358 which has been causing tests to fail. — committed to gravitational/teleport by rosstimothy 2 years ago
- Update grpc-go (#15926) Bumps google.golang.org/grpc to v1.49.0 which includes a fix for https://github.com/grpc/grpc-go/issues/5358 which has been causing tests to fail. — committed to gravitational/teleport by rosstimothy 2 years ago
- Update grpc-go (#15926) (#16206) Bumps google.golang.org/grpc to v1.49.0 which includes a fix for https://github.com/grpc/grpc-go/issues/5358 which has been causing tests to fail. — committed to gravitational/teleport by rosstimothy 2 years ago
- Update grpc-go (#16199) Bumps google.golang.org/grpc to v1.49.0 which includes a fix for https://github.com/grpc/grpc-go/issues/5358 which has been causing tests to fail. — committed to gravitational/teleport by rosstimothy 2 years ago
I also have the same EOF message
rpc error: code = Internal desc = unexpected EOF
in etcdctl. When I use the `get` command in etcdctl on a Mac, it always outputs 'unexpected EOF' (ping to endpoint: 15ms). But it's OK from another Linux client (ping to endpoint: < 2ms), or for short results like `etcdctl get --prefix "/blabla" --limit 100`.
I modified the etcdctl source like below and I got all the results: https://github.com/etcd-io/etcd/blob/87e271701b3449c35d9ef41f855c7792488773ce/client/v3/client.go#L285
opts := append(dopts, grpc.WithResolvers(c.resolver), grpc.WithInitialWindowSize(1_000_000), grpc.WithInitialConnWindowSize(1_000_000))
I compiled and ran the server and client for the repo that reproduces this bug (https://github.com/brianneville/grpcbug) using go1.17.6 and the issue is still present, so I’d be hesitant to say that https://github.com/golang/go/issues/53253 is at fault here.
Just to explore further though, @hexfusion could you try appending the following DialOptions to the client?
These DialOptions are a workaround for this bug (https://github.com/grpc/grpc-go/issues/5358) so it would be interesting to see if they have any effect in your case.
Unlikely, since grpc-go uses its own HTTP/2 server implementation, and only the framer part of the net/http2 library.
But your link to the net/http2 issue is a link back to this issue, so I can't see what you are referencing.
EDIT: Oh sorry, I also see your back-reference from that bug to this one. Still unlikely for the same reason, but I'll take a look.
One thing we suspect now is that the server might not be setting the linger socket option (need to check on this), which could lead to this kind of behavior if there is any TCP error after the server finishes writing the RPC data and closes the connection.
Update: it looks like Go sets this by default, so I don’t suspect this has anything to do with it anymore.
Interesting. Apparently gRPC-Java will automatically reconnect when stream IDs are 50% exhausted. We don’t do that in gRPC-Go, but it would be a good idea.
If the connection is lost in the middle of an RPC then it’s expected to get RPC failures.
You could use a `MaxAgeGrace` setting to help work around that issue when it's caused by `MaxAge` connection termination from the server.
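A minimal sketch of that kind of server-side keepalive configuration, assuming this refers to `keepalive.ServerParameters` (`MaxConnectionAge`/`MaxConnectionAgeGrace`); the durations are illustrative only:

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer recycles connections after MaxConnectionAge, while
// MaxConnectionAgeGrace gives in-flight RPCs time to finish before
// the connection is forcibly closed.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      30 * time.Minute, // illustrative
		MaxConnectionAgeGrace: 5 * time.Minute,  // illustrative
	}))
}
```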
The `grpc.WithInitialWindowSize` and `grpc.WithInitialConnWindowSize` options are independent of the `grpc.MaxCallRecvMsgSize` and `grpc.MaxCallSendMsgSize` options. You can combine these options if you wish (I'd say try it, because I'm curious to see if the workaround has any effect in your case; see the sketch below). Setting `grpc.WithInitialWindowSize` and `grpc.WithInitialConnWindowSize` will disable BDP and manually force the window size to whatever value you provide. By comparison, `grpc.MaxCallRecvMsgSize` and `grpc.MaxCallSendMsgSize` only set the maximum size of the overall message that can be received/sent (with this message being sent in multiple data frames, using a dynamic window size unless BDP is disabled).

https://github.com/golang/go/issues/53253 appears completely unrelated: 1. we don't use the same client/server implementation, and 2. the symptoms are different; that was a 5s delay caused by a delay in receipt of a frame with END_STREAM set, whereas this appears to be streams failing due to flow-control violations (either the sender using more quota than it had, or the recipient incorrectly believing that was the case).
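A short sketch of combining the two sets of options on the client, as suggested above; all numeric values and the insecure credentials are assumptions chosen only for illustration:

```go
import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialWithWorkaround combines the fixed-window workaround with explicit
// per-call message-size limits.
func dialWithWorkaround(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Fixed flow-control windows: any value above 65535 disables BDP estimation.
		grpc.WithInitialWindowSize(1<<20),     // 1 MiB
		grpc.WithInitialConnWindowSize(1<<20), // 1 MiB
		// Per-call caps on the total serialized message size, independent of the
		// flow-control window used to transfer the message in DATA frames.
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(64*1024*1024), // 64 MiB
			grpc.MaxCallSendMsgSize(64*1024*1024), // 64 MiB
		),
	)
}
```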
Anecdotal evidence only, but I don’t recall seeing these failures prior to our Go 1.18 upgrade…