grpc-go: Transport is closing and transient failure errors since upgrading to 1.18
Please answer these questions before submitting your issue.
What version of gRPC are you using?
1.18.0
What version of Go are you using (go version
)?
1.11.5
What operating system (Linux, Windows, …) and version?
Alpine Linux 3.9 running on Google Kubernetes Engine
What did you do?
Since upgrading, I’ve noticed a large increase in grpc errors leading to failed requests and service disruptions.
We don’t use any connection settings on the clients, nor do we have any special settings on the server (keepalives, fastfail, etc). The only thing we have on the connections is Mutual TLS. It’s roughly configured like this:
Client:
creds := grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{
RootCAs: certPool,
Certificates: []tls.Certificate{cert},
MinVersion: tls.VersionTLS12,
}))
grpc.Dial("foo.foo.svc.cluster.local:80", creds)
Server:
creds := grpc.Creds(credentials.NewTLS(&tls.Config{
Certificates: []tls.Certificate{cert},
ClientCAs: certPool,
ClientAuth: tls.RequireAndVerifyClientCert,
MinVersion: tls.VersionTLS12,
NextProtos: []string{"h2"},
PreferServerCipherSuites: true,
CurvePreferences: []tls.CurveID{
tls.CurveP256,
tls.X25519,
},
CipherSuites: []uint16{
tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,
tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
},
}))
grpc.NewServer(creds)
This may be related to one of these other two issues https://github.com/grpc/grpc-go/issues/2653 https://github.com/grpc/grpc-go/issues/2636
Based on the feedback in one of the above linked issues, I set these ENV vars on one service that had a lot of failures: GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info
I added (__GRPC CALL HERE__)
at the end of the log lines below where the error was logged as a failed grpc call that we made.
WARNING: 2019/03/01 04:41:42 grpc: addrConn.createTransport failed to connect to {foo.foo.svc.cluster.local:80 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting...
INFO: 2019/03/01 04:41:42 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:41:42 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
timestamp=2019-03-01T04:41:42.887562062Z level=error error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\"" (__GRPC CALL HERE))
INFO: 2019/03/01 04:41:43 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:41:43 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 04:41:44 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:41:44 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, READY
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:25 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:35:25 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:27 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:35:32 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:59 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:59 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T05:38:19.972966567Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:21 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:13:05 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T16:13:05.92908361Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, READY
INFO: 2019/03/01 16:20:17 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:20:17 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:20:18 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:20:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:42:24 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:42:24 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:20:47 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:20:47 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:20:49 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:20:54 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:43:01 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:43:01 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:43:02 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:43:07 Subchannel Connectivity change to READY
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 18 (11 by maintainers)
Links to this issue
Commits related to this issue
- client: reset backoff to 0 after a connection is established (#2669) #2663 #2636 — committed to grpc/grpc-go by menghanl 5 years ago
- client: reset backoff to 0 after a connection is established (#2669) #2663 #2636 — committed to grpc/grpc-go by menghanl 5 years ago
- client: reset backoff to 0 after a connection is established (#2669) #2663 #2636 — committed to grpc/grpc-go by menghanl 5 years ago
- vendor: upgrade grpc from 1.13.0 to 1.21.2 This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this upgrade is to eliminate the disconnections caused by https://github.com/grpc/gr... — committed to ajwerner/cockroach by ajwerner 5 years ago
- Merge #39041 39041: vendor: upgrade grpc from 1.13.0 to 1.21.2 r=ajwerner a=ajwerner This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this upgrade is to eliminate the disconn... — committed to cockroachdb/cockroach by deleted user 5 years ago
@menghanl I have experienced same issue here. Here is a reproduce procedure.
Example program
client
server
Results
With above programs, I run the server in v1.19.0 and the client with v1.17, 1.18, 1.19. The server kills the connection from the client 10 second after connected. The client should reconnect to the server but its behavior is different in client versions. With 1.18 and 1.19 the client takes over 1 second to recconect the server.
There’s a bug in the reconnect logic, the backoff time is not reset after creating a connection.
The fix is in #2669. Can you guys try it and see if it fixes the problems you are seeing? Thanks!