grpc-node: Intermittently the client enters a state where it doesn't receive responses sent by the server
Problem description
Intermittently, our gRPC client enters a state where the server sends a response, but the client doesn't receive it and throws a DEADLINE_EXCEEDED error. The error persists on retries until the server or client is restarted.
Reproduction steps
Unknown - the client appears to eventually enter this state in longer-lived environments.
Environment
- OS name, version and architecture: Alpine
- Node version: 18
- Package name and version: grpc-js 1.8.14
Additional context
Client Logs
D 2023-07-12T20:00:37.761Z | resolving_call | [4] Created
D 2023-07-12T20:00:37.761Z | channel | (43) dns:<redacted ip> createResolvingCall [4] method="<redacted method>", deadline=2023-07-12T20:01:22.760Z
D 2023-07-12T20:00:37.762Z | resolving_call | [4] start called
D 2023-07-12T20:00:37.762Z | resolving_call | [4] Deadline will be reached in 44998ms
D 2023-07-12T20:00:37.762Z | resolving_call | [4] Deadline: 2023-07-12T20:01:22.760Z
D 2023-07-12T20:00:37.763Z | resolving_call | [4] startRead called
D 2023-07-12T20:00:37.764Z | resolving_call | [4] halfClose called
D 2023-07-12T20:00:37.764Z | resolving_call | [4] write() called with message of length 38
D 2023-07-12T20:00:37.764Z | resolving_call | [4] Created child [5]
D 2023-07-12T20:00:37.764Z | channel | (43) dns:<redacted ip> createRetryingCall [5] method="<redacted method>"
D 2023-07-12T20:00:37.765Z | load_balancing_call | [6] start called
D 2023-07-12T20:00:37.765Z | retrying_call | [5] Created child call [6] for attempt 1
D 2023-07-12T20:00:37.765Z | channel | (43) dns:<redacted ip> createLoadBalancingCall [6] method="<redacted method>"
D 2023-07-12T20:00:37.765Z | retrying_call | [5] start called
D 2023-07-12T20:00:37.766Z | load_balancing_call | [6] Pick called
D 2023-07-12T20:00:37.766Z | load_balancing_call | [6] Pick result: COMPLETE subchannel: (44) <redacted ip> status: undefined undefined
D 2023-07-12T20:00:37.766Z | retrying_call | [5] startRead called
D 2023-07-12T20:00:37.770Z | load_balancing_call | [6] Created child call [7]
D 2023-07-12T20:00:37.770Z | transport_internals | (45) <redacted ip> session.closed=false session.destroyed=false session.socket.destroyed=false
D 2023-07-12T20:00:37.770Z | transport_flowctrl | (45) <redacted ip> local window size: 65535 remote window size: 65535
D 2023-07-12T20:00:37.771Z | retrying_call | [5] write() called with message of length 43
D 2023-07-12T20:00:37.771Z | subchannel_call | [7] sending data chunk of length 43
D 2023-07-12T20:00:37.771Z | subchannel_call | [7] write() called with message of length 43
D 2023-07-12T20:00:37.771Z | load_balancing_call | [6] write() called with message of length 43
D 2023-07-12T20:00:37.772Z | retrying_call | [5] halfClose called
D 2023-07-12T20:00:37.773Z | subchannel_call | [7] calling end() on HTTP/2 stream
D 2023-07-12T20:00:37.773Z | subchannel_call | [7] end() called
D 2023-07-12T20:00:37.773Z | load_balancing_call | [6] halfClose called
D 2023-07-12T20:01:22.760Z | resolving_call | [4] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2023-07-12T20:01:22.760Z | retrying_call | [5] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2023-07-12T20:01:22.761Z | retrying_call | [5] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.761Z | load_balancing_call | [6] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2023-07-12T20:01:22.761Z | subchannel_call | [7] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2023-07-12T20:01:22.761Z | subchannel_call | [7] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.762Z | retrying_call | [5] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [6] in state ACTIVE
D 2023-07-12T20:01:22.762Z | retrying_call | [5] Received status from child [6]
D 2023-07-12T20:01:22.762Z | load_balancing_call | [6] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.762Z | subchannel_call | [7] close http2 stream with code 8
D 2023-07-12T20:01:22.763Z | resolving_call | [4] Received status
D 2023-07-12T20:01:22.763Z | load_balancing_call | [6] Received status
D 2023-07-12T20:01:22.763Z | resolving_call | [4] Received status
D 2023-07-12T20:01:22.763Z | resolving_call | [4] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.763Z | retrying_call | [5] ended with status: code=4 details="Deadline exceeded"
D 2023-07-12T20:01:22.864Z | subchannel_call | [7] HTTP/2 stream closed with code 8
Server Logs
D 2023-07-12T20:00:37.774Z | server | (1) Received call to method <redacted method> at address null
D 2023-07-12T20:00:37.774Z | server_call | Request to <redacted method> received headers {"trackingid":["<trackingId>"],"grpc-accept-encoding":["identity,deflate,gzip"],"accept-encoding":["identity"],"grpc-timeout":["44993m"],"user-agent":["grpc-node-js/1.8.14"],"content-type":["application/grpc"],"te":["trailers"]}
D 2023-07-12T20:00:37.777Z | server_call | Request to method <redacted method> stream closed with rstCode 0
D 2023-07-12T20:00:37.777Z | server_call | Request to method <redacted method> ended with status code: OK details: OK
As you can see, the server responds well within the deadline, but the client never receives the response. I know transient failures can happen, but since the error persists on retries, it appears something deeper is going on here.
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 5
- Comments: 30 (15 by maintainers)
Based on the information in https://github.com/nodejs/node/issues/49147#issuecomment-1679515331, I published a change in grpc-js version 1.9.1 that defers all actions in the write callback using `process.nextTick`. Please try it out to see if it improves anything.
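A minimal sketch of that pattern (an illustration of the general technique only, not the actual grpc-js code): work that previously ran directly inside a socket write callback is deferred to the next event-loop tick.

```ts
import { Socket } from 'net';

// Illustration of deferring write-callback work with process.nextTick.
// `sendFrame` and `onSent` are hypothetical names, not grpc-js APIs.
function sendFrame(socket: Socket, chunk: Buffer, onSent: (err?: Error) => void) {
  socket.write(chunk, (err) => {
    // Before: onSent(err) was invoked synchronously inside the write callback.
    // After: the action is deferred so it runs only after the current I/O
    // callback has fully unwound.
    process.nextTick(() => onSent(err));
  });
}
```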
Sorry about that; that's standard practice for most of our internal logs. https://gist.github.com/krrose27/b9e31023bccfbcda02fb828c5f6317d7
@murgatroid99 Yes, passing `"grpc.keepalive_time_ms": 10000`. I am seeing this under very similar conditions to the ones linked in that other thread. Since I can get this within a day consistently on 1.9.1, it shouldn't be a problem for me to get a tcpdump without SSL enabled.
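For context, a minimal sketch of where that channel option goes in a plain @grpc/grpc-js client; the proto path, package, and service names below are placeholders, and only the option value stated above is used.

```ts
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Placeholder proto file, package, and service names.
const packageDefinition = protoLoader.loadSync('example.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

// Channel options are passed as the third constructor argument.
// 'grpc.keepalive_time_ms': 10000 makes the client send an HTTP/2 PING
// every 10 seconds, so a dead connection can be detected and dropped.
const client = new proto.example.ExampleService(
  'server.example.com:443',
  grpc.credentials.createSsl(),
  { 'grpc.keepalive_time_ms': 10000 }
);
```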
I managed to strace the client side of the reproduction with all gRPC tracers active, and the output is interesting and might help with a Node issue. Here are the most relevant lines (not all consecutive):

So, FD 23 corresponds to the socket that appears to connect. Then writing to it results in EPIPE, and then the next epoll_wait call tries to watch that FD anyway. Immediately after that, an epoll_ctl call deletes FD 23 from the poll set. After that there are no more references to FD 23 in the strace output that I could see. So, it looks like something knows that that FD is unusable, but that information doesn't propagate up to the parts of the Node API that we see.
Please try updating your dependencies so that you pick up @grpc/grpc-js version 1.8.19, and then enable keepalives. As far as I understand, you can do this with pubsub by constructing the instance like this:
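A sketch of such a constructor call, assuming the commonly suggested keepalive values (a ping every 10 seconds, with 5 seconds to wait for the ack); the exact numbers are not prescribed:

```ts
import { PubSub } from '@google-cloud/pubsub';

// The keepalive channel options are forwarded through google-gax to grpc-js.
// The values here are assumed suggestions; adjust them if necessary.
const pubsub = new PubSub({
  'grpc.keepalive_time_ms': 10000,   // send an HTTP/2 PING every 10 seconds
  'grpc.keepalive_timeout_ms': 5000, // drop the connection if the ping is not acked within 5 seconds
} as any);
```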
The `as any` cast is only needed if you are using TypeScript. If you are already passing other options to that constructor, these options can simply be added to the existing options object. The specific numbers there are suggested values; you can change them if necessary. If that doesn't help, we can look into investigating further with trace logs.
The DEADLINE_EXCEEDED error you linked from google-gax has the error text "Total timeout of API ${apiName} exceeded ${retry.backoffSettings.totalTimeoutMillis} milliseconds before any response was received.", but the stack trace you shared has the error text "Deadline exceeded". It should only be one or the other, so can you clarify what you are seeing there?

The channelz error you are seeing may indicate a channelz bug in the client. That component is not well tested, so I wouldn't be surprised. Can you double-check that you cannot get info on any of the listed subchannels?
I’ll look into this more on Monday.