grpc-node: Intermittently the client enters a state where it doesn't receive responses sent by the server
Problem description
Intermittently, our gRPC client enters a state where the server sends a response, but the client doesn't receive it and throws a DEADLINE_EXCEEDED error. The error persists on retries until the server or client is restarted.
Reproduction steps
Unknown - the client appears to eventually enter this state in longer-lived environments.
Environment
- OS name, version and architecture: Alpine
- Node version: 18
- Package name and version: grpc-js 1.8.14
Additional context
Client Logs
D 2023-07-12T20:00:37.761Z | resolving_call | [4] Created
D 2023-07-12T20:00:37.761Z | channel | (43) dns:<redacted ip> createResolvingCall [4] method="<redacted method>", deadline=2023-07-12T20:01:22.760Z
D 2023-07-12T20:00:37.762Z | resolving_call | [4] start called
D 2023-07-12T20:00:37.762Z | resolving_call | [4] Deadline will be reached in 44998ms
D 2023-07-12T20:00:37.762Z | resolving_call | [4] Deadline: 2023-07-12T20:01:22.760Z
D 2023-07-12T20:00:37.763Z | resolving_call | [4] startRead called
D 2023-07-12T20:00:37.764Z | resolving_call | [4] halfClose called
D 2023-07-12T20:00:37.764Z | resolving_call | [4] write() called with message of length 38
D 2023-07-12T20:00:37.764Z | resolving_call | [4] Created child [5]
D 2023-07-12T20:00:37.764Z | channel | (43) dns:<redacted ip> createRetryingCall [5] method="<redacted method>"
D 2023-07-12T20:00:37.765Z | load_balancing_call | [6] start called
D 2023-07-12T20:00:37.765Z | retrying_call | [5] Created child call [6] for attempt 1
D 2023-07-12T20:00:37.765Z | channel | (43) dns:<redacted ip> createLoadBalancingCall [6] method="<redacted method>"
D 2023-07-12T20:00:37.765Z | retrying_call | [5] start called
D 2023-07-12T20:00:37.766Z | load_balancing_call | [6] Pick called
D 2023-07-12T20:00:37.766Z | load_balancing_call | [6] Pick result: COMPLETE subchannel: (44) <redacted ip> status: undefined undefined
D 2023-07-12T20:00:37.766Z | retrying_call | [5] startRead called
D 2023-07-12T20:00:37.770Z | load_balancing_call | [6] Created child call [7]
D 2023-07-12T20:00:37.770Z | transport_internals | (45) <redacted ip> session.closed=false session.destroyed=false session.socket.destroyed=false
D 2023-07-12T20:00:37.770Z | transport_flowctrl | (45) <redacted ip> local window size: 65535 remote window size: 65535
D 2023-07-12T20:00:37.771Z | retrying_call | [5] write() called with message of length 43
D 2023-07-12T20:00:37.771Z | subchannel_call | [7] sending data chunk of length 43
D 2023-07-12T20:00:37.771Z | subchannel_call | [7] write() called with message of length 43
D 2023-07-12T20:00:37.771Z | load_balancing_call | [6] write() called with message of length 43
D 2023-07-12T20:00:37.772Z | retrying_call | [5] halfClose called
D 2023-07-12T20:00:37.773Z | subchannel_call | [7] calling end() on HTTP/2 stream
D 2023-07-12T20:00:37.773Z | subchannel_call | [7] end() called
D 2023-07-12T20:00:37.773Z | load_balancing_call | [6] halfClose called
D 2023-07-12T20:01:22.760Z | resolving_call | [4] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2023-07-12T20:01:22.760Z | retrying_call | [5] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2023-07-12T20:01:22.761Z | retrying_call | [5] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.761Z | load_balancing_call | [6] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2023-07-12T20:01:22.761Z | subchannel_call | [7] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2023-07-12T20:01:22.761Z | subchannel_call | [7] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.762Z | retrying_call | [5] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [6] in state ACTIVE
D 2023-07-12T20:01:22.762Z | retrying_call | [5] Received status from child [6]
D 2023-07-12T20:01:22.762Z | load_balancing_call | [6] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.762Z | subchannel_call | [7] close http2 stream with code 8
D 2023-07-12T20:01:22.763Z | resolving_call | [4] Received status
D 2023-07-12T20:01:22.763Z | load_balancing_call | [6] Received status
D 2023-07-12T20:01:22.763Z | resolving_call | [4] Received status
D 2023-07-12T20:01:22.763Z | resolving_call | [4] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.763Z | retrying_call | [5] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.864Z | subchannel_call | [7] HTTP/2 stream closed with code 8
Server Logs
D 2023-07-12T20:00:37.774Z | server | (1) Received call to method <redacted method> at address null
D 2023-07-12T20:00:37.774Z | server_call | Request to <redacted method> received headers {"trackingid":["<trackingId>"],"grpc-accept-encoding":["identity,deflate,gzip"],"accept-encoding":["identity"],"grpc-timeout":["44993m"],"user-agent":["grpc-node-js/1.8.14"],"content-type":["application/grpc"],"te":["trailers"]}
D 2023-07-12T20:00:37.777Z | server_call | Request to method <redacted method> stream closed with rstCode 0
D 2023-07-12T20:00:37.777Z | server_call | Request to method <redacted method> ended with status code: OK details: OK
As you can see, the server responds well within the deadline, but the client never receives the response. I know transient failures can happen, but since the error persists on retries, it appears something deeper is going on here.
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 5
- Comments: 30 (15 by maintainers)
Based on the information in https://github.com/nodejs/node/issues/49147#issuecomment-1679515331, I published a change in grpc-js version 1.9.1 that defers all actions in the write callback using `process.nextTick`. Please try it out to see if it improves anything.
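A minimal sketch of that pattern (an illustration of the general technique only, not the actual grpc-js code): work that previously ran directly inside a socket write callback is deferred to the next event-loop tick.

```ts
import { Socket } from 'net';

// Illustration of deferring write-callback work with process.nextTick.
// `sendFrame` and `onSent` are hypothetical names, not grpc-js APIs.
function sendFrame(socket: Socket, chunk: Buffer, onSent: (err?: Error) => void) {
  socket.write(chunk, (err) => {
    // Before: onSent(err) was invoked synchronously inside the write callback.
    // After: the action is deferred so it runs only after the current I/O
    // callback has fully unwound.
    process.nextTick(() => onSent(err));
  });
}
```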
Sorry about that; that's standard practice for most of our internal logs. https://gist.github.com/krrose27/b9e31023bccfbcda02fb828c5f6317d7
@murgatroid99 Yes, passing `"grpc.keepalive_time_ms": 10000`. I am seeing this under very similar conditions to the ones linked in that other thread. Since I can get this within a day consistently on 1.9.1, it shouldn't be a problem for me to get a tcpdump without SSL enabled.
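For context, a minimal sketch of where that channel option goes in a plain @grpc/grpc-js client; the proto path, package, and service names below are placeholders, and only the option value stated above is used.

```ts
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Placeholder proto file, package, and service names.
const packageDefinition = protoLoader.loadSync('example.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

// Channel options are passed as the third constructor argument.
// 'grpc.keepalive_time_ms': 10000 makes the client send an HTTP/2 PING
// every 10 seconds, so a dead connection can be detected and dropped.
const client = new proto.example.ExampleService(
  'server.example.com:443',
  grpc.credentials.createSsl(),
  { 'grpc.keepalive_time_ms': 10000 }
);
```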
I managed to strace the client side of the reproduction with all gRPC tracers active, and the output is interesting and might help with a Node issue. Here are the most relevant lines (not all consecutive):

So, FD 23 corresponds to the socket that appears to connect. Then writing to it results in EPIPE, and then the next epoll_wait call tries to watch that FD anyway. Immediately after that, an epoll_ctl call deletes FD 23 from the poll set. After that there are no more references to FD 23 in the strace output that I could see. So, it looks like something knows that that FD is unusable, but that information doesn't propagate up to the parts of the Node API that we see.
Please try updating your dependencies so that you pick up @grpc/grpc-js version 1.8.19, and then enable keepalives. As far as I understand, you can do this with pubsub by constructing the instance like this:
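A sketch of such a constructor call, assuming the commonly suggested keepalive values (a ping every 10 seconds, with 5 seconds to wait for the ack); the exact numbers are not prescribed:

```ts
import { PubSub } from '@google-cloud/pubsub';

// The keepalive channel options are forwarded through google-gax to grpc-js.
// The values here are assumed suggestions; adjust them if necessary.
const pubsub = new PubSub({
  'grpc.keepalive_time_ms': 10000,   // send an HTTP/2 PING every 10 seconds
  'grpc.keepalive_timeout_ms': 5000, // drop the connection if the ping is not acked within 5 seconds
} as any);
```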
The `as any` cast is only needed if you are using TypeScript. If you are already passing other options to that constructor, these options can simply be added to the existing options object. The specific numbers there are suggested values; you can change them if necessary. If that doesn't help, we can look into investigating further with trace logs.
The DEADLINE_EXCEEDED error you linked from google-gax has the error text "Total timeout of API ${apiName} exceeded ${retry.backoffSettings.totalTimeoutMillis} milliseconds before any response was received.", but the stack trace you shared has the error text "Deadline exceeded". It should only be one or the other, so can you clarify what you are seeing there?

The channelz error you are seeing may indicate a channelz bug in the client. That component is not well tested, so I wouldn't be surprised. Can you double-check that you cannot get info on any of the listed subchannels?
I’ll look into this more on Monday.