grpc-java: "GOAWAY received","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":1083,"http2_error":11,"raw_bytes":"too_many_pings"

We are running streaming grpc connection to transfer a lot of messages and data. We constantly see that the gRPC server fails after sometime with “too_many_pings” errors.

The error is very consistent happening roughly at the same time and data volume.

Code for server creation https://github.com/apache/beam/blob/04a647d8c90ecf70d77aaa9f16a93f3cc568f9d1/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/ServerFactory.java#L247

Code for client creation https://github.com/apache/beam/blob/04a647d8c90ecf70d77aaa9f16a93f3cc568f9d1/sdks/python/apache_beam/runners/worker/data_plane.py#L353

Below are the relevant grpc debug logs. Please let me know if more logs are needed.

What version of gRPC are you using?

Server: Java 1.13.1 Client: Python 1.12.0

What did you expect to see?

Long running GRPC connection with no error. Relevant log on server

2018-12-19 11:41:51,017 DEBUG org.apache.beam.vendor.grpc.v1_13_1.io.grpc.netty.NettyServerHandler  - [id: 0xd975c621, L:/0:0:0:0:0:0:0:1:37459 - R:/0:0:0:0:0:0:0:1:54512] INBOUND PING: ack=false bytes=10
2018-12-19 11:41:51,017 DEBUG org.apache.beam.vendor.grpc.v1_13_1.io.grpc.netty.NettyServerHandler  - [id: 0xd975c621, L:/0:0:0:0:0:0:0:1:37459 - R:/0:0:0:0:0:0:0:1:54512] OUTBOUND PING: ack=true bytes=10
2018-12-19 11:41:51,017 DEBUG org.apache.beam.vendor.grpc.v1_13_1.io.grpc.netty.NettyServerHandler  - [id: 0xd975c621, L:/0:0:0:0:0:0:0:1:37459 - R:/0:0:0:0:0:0:0:1:54512] OUTBOUND GO_AWAY: lastStreamId=1 errorCode=11 length=14 bytes=746f6f5f6d616e795f70696e6773
2018-12-19 11:41:51,031 DEBUG org.apache.beam.vendor.grpc.v1_13_1.io.grpc.netty.NettyServerHandler  - [id: 0xd975c621, L:/0:0:0:0:0:0:0:1:37459 ! R:/0:0:0:0:0:0:0:1:54512] OUTBOUND RST_STREAM: streamId=1 errorCode=8
2018-12-19 11:41:51,031 ERROR org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer         - Failed to handle for unknown endpoint
org.apache.beam.vendor.grpc.v1_13_1.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.Status.asRuntimeException(Status.java:517)
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onCancel(ServerCalls.java:272)
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.PartialForwardingServerCallListener.onCancel(PartialForwardingServerCallListener.java:40)
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.ForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:23)
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:40)
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.Contexts$ContextualizedServerCallListener.onCancel(Contexts.java:96)
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closed(ServerCallImpl.java:293)
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1Closed.runInContext(ServerImpl.java:738)
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at org.apache.beam.vendor.grpc.v1_13_1.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Relevant logs on client

I1219 20:16:42.822471431     220 ev_epoll1_linux.cc:928]     PS:0x55beb3b22bc8 END_WORKER:0x7f84ee563d90
I1219 20:16:42.822475861      62 call_combiner.cc:120]         checking queue
I1219 20:16:42.822503665     220 ev_epoll1_linux.cc:908]      .. mark pollset 0x55beb3b22bc8 inactive
I1219 20:16:42.822517153      62 call_combiner.cc:134]         EXECUTING FROM QUEUE: closure=0x7f84c00030c0 error="No Error"
I1219 20:16:42.822527045     220 ev_epoll1_linux.cc:879]      .. choose next poller to be 0x7f84d11f5d90
I1219 20:16:42.822540820      62 call_combiner.cc:69]        ==> grpc_call_combiner_start() [0x7f84c0001e60] closure=0x7f84c00030f8 [continue recv_trailing_metadata] error="No Error"
I1219 20:16:42.822564311     220 ev_epoll1_linux.cc:990]      .. remove worker
I1219 20:16:42.822571738      58 ev_epoll1_linux.cc:836]     PS:0x55beb3695478 BEGIN_DONE:0x7f84d11f5d90 kick_state=DESIGNATED_POLLER shutdown=0 kicked_without_poller: 0
I1219 20:16:42.822564482      62 call_combiner.cc:78]          size: 1 -> 2
I1219 20:16:42.822600785     220 completion_queue.cc:1063]   RETURN_EVENT[0x55beb3b22b00]: OP_COMPLETE: tag:0x7f84f78f3d70 OK
I1219 20:16:42.822614871      62 call_combiner.cc:93]          QUEUING
I1219 20:16:42.822636361      62 call_combiner.cc:106]       ==> grpc_call_combiner_stop() [0x7f84c0001e60] [recv_initial_metadata_ready]
I1219 20:16:42.822653326      62 call_combiner.cc:113]         size: 2 -> 1
I1219 20:16:42.822668732      62 call_combiner.cc:120]         checking queue
I1219 20:16:42.822687261      62 call_combiner.cc:134]         EXECUTING FROM QUEUE: closure=0x7f84c00030f8 error="No Error"
I1219 20:16:42.822711978      62 completion_queue.cc:699]    cq_end_op_for_next(cq=0x55beb3b22b00, tag=0x7f83ee0befb0, error="No Error", done=0x7f84f835d3f0, done_arg=0x7f84c0004390, storage=0x7f84c00043e0)
D1219 20:16:42.822741423      62 ev_epoll1_linux.cc:1081]    PS:0x55beb3b22bc8 KICK:(nil) curps=0x7f84bc003258 curworker=(nil) root=(nil)
I1219 20:16:42.822757894      62 ev_epoll1_linux.cc:1092]     .. kicked_without_poller
I1219 20:16:42.822778130      62 call_combiner.cc:106]       ==> grpc_call_combiner_stop() [0x7f84c0001e60] [recv_trailing_metadata_ready]
I1219 20:16:42.822796773      62 call_combiner.cc:113]         size: 1 -> 0
I1219 20:16:42.822816656      62 call_combiner.cc:141]         queue empty
D1219 20:16:42.822844119      62 call.cc:719]                set_final_status CLI
D1219 20:16:42.822876527      62 call.cc:720]                {"created":"@1545250602.822833237","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"Socket closed","grpc_status":14}
I1219 20:16:42.822907338      62 completion_queue.cc:699]    cq_end_op_for_next(cq=0x55beb3b22b00, tag=0x7f83ee245470, error="No Error", done=0x7f84f835d3f0, done_arg=0x7f84c0002d20, storage=0x7f84c0002d70)
D1219 20:16:42.822931677      62 flow_control.cc:113]        0x7f84c000af18[0][cli] | t updt sent | trw:                       1047380, ttw:                       4194309, taw:                       4194309, srw:                              , slw:                              , saw:                              
I1219 20:16:42.822955643      62 chttp2_transport.cc:852]    W:0x7f84c00098d0 CLIENT state WRITING -> WRITING [begin write in current thread]
I1219 20:16:42.823004118     220 completion_queue.cc:963]    grpc_completion_queue_next(cq=0x55beb3b22b00, deadline=gpr_timespec { tv_sec: 1545250603, tv_nsec: 22984696, clock_type: 1 }, reserved=(nil))
I1219 20:16:42.823046652      58 ev_epoll1_linux.cc:728]     ps: 0x55beb3695478 poll got 1 events
I1219 20:16:42.823048633      62 subchannel_list.h:447]      [pick_first 0x7f84c40024e0] subchannel list 0x7f84c4013ff0 index 0 of 2 (subchannel 0x7f84c4003800): connectivity changed: state=TRANSIENT_FAILURE, error={"created":"@1545250602.821364969","description":"GOAWAY received","file":"src/core/ext/transport/chttp
2/transport/chttp2_transport.cc","file_line":1147,"http2_error":11,"raw_bytes":"too_many_pings"}, shutting_down=0
D1219 20:16:42.823063371     220 ev_epoll1_linux.cc:1081]    PS:0x55beb3b22bc8 KICK:(nil) curps=(nil) curworker=(nil) root=(nil)
E1219 20:16:42.823092992      58 lockfree_event.cc:213]      LockfreeEvent::SetReady: 0x7f84c4000c88 curr=(nil)
I1219 20:16:42.823101846      62 pick_first.cc:461]          Pick First 0x7f84c40024e0 connectivity changed for selected subchannel
I1219 20:16:42.823121047      62 connectivity_state.cc:164]  SET: 0x7f84c40026c8 pick_first: READY --> IDLE [selected_changed+reresolve] error=(nil) "No Error"
I1219 20:16:42.823114973     220 ev_epoll1_linux.cc:1092]     .. kicked_without_poller
E1219 20:16:42.823115591      58 lockfree_event.cc:213]      LockfreeEvent::SetReady: 0x7f84c4000c78 curr=0x2
I1219 20:16:42.823137959      62 connectivity_state.cc:190]  NOTIFY: 0x7f84c40026c8 pick_first: 0x7f84c0007c18
I1219 20:16:42.823150194     220 completion_queue.cc:1063]   RETURN_EVENT[0x55beb3b22b00]: OP_COMPLETE: tag:0x7f83ee0befb0 OK
E1219 20:16:42.823167604      58 lockfree_event.cc:213]      LockfreeEvent::SetReady: 0x7f84c4000c80 curr=0x2
I1219 20:16:42.823177467      62 lb_policy.cc:47]            pick_first 0x7f84c40024e0: scheduling re-resolution closure with error="No Error".
I1219 20:16:42.823189882      58 ev_epoll1_linux.cc:928]     PS:0x55beb3695478 END_WORKER:0x7f84d11f5d90
I1219 20:16:42.823195980      62 subchannel_list.h:374]      [pick_first 0x7f84c40024e0] subchannel list 0x7f84c4013ff0 index 0 of 2 (subchannel 0x7f84c4003800): stopping connectivity watch
I1219 20:16:42.823220322      58 ev_epoll1_linux.cc:908]      .. mark pollset 0x55beb3695478 inactive
I1219 20:16:42.823234335      62 client_channel.cc:214]      chand=0x55beb3b09e78: lb_policy=0x7f84c40024e0 state changed to IDLE
I1219 20:16:42.823238624      58 ev_epoll1_linux.cc:990]      .. remove worker
I1219 20:16:42.823261347      62 client_channel.cc:196]      chand=0x55beb3b09e78: setting connectivity state to IDLE
I1219 20:16:42.823281386      58 ev_epoll1_linux.cc:748]     PS:0x55beb3695478 BEGIN_STARTS:0x7f84d11f5d90
I1219 20:16:42.823298138      62 connectivity_state.cc:164]  SET: 0x55beb3b09f00 client_channel: READY --> IDLE [lb_changed] error=(nil) "No Error"
I1219 20:16:42.823298643      58 ev_epoll1_linux.cc:767]     PS:0x55beb3695478 BEGIN_REORG:0x7f84d11f5d90 kick_state=UNKICKED is_reassigning=1
I1219 20:16:42.823340813      62 connectivity_state.cc:116]  CONWATCH: 0x7f84c40026c8 pick_first: from IDLE [cur=IDLE] notify=0x7f8490001b48
I1219 20:16:42.823354538      58 ev_epoll1_linux.cc:836]     PS:0x55beb3695478 BEGIN_DONE:0x7f84d11f5d90 kick_state=DESIGNATED_POLLER shutdown=0 kicked_without_poller: 0
I1219 20:16:42.823358721      62 client_channel.cc:306]      chand=0x55beb3b09e78: started name re-resolving
D1219 20:16:42.823394577      62 dns_resolver.cc:279]        Start resolving.
I1219 20:16:42.823418368      62 executor.cc:243]            EXECUTOR (resolver-executor) try to schedule 0x7f8490001000 (short) to thread 0
I1219 20:16:42.823451488      62 chttp2_transport.cc:852]    W:0x7f84c00098d0 CLIENT state WRITING -> IDLE [finish writing]
I1219 20:16:42.823470502      44 executor.cc:188]            EXECUTOR (resolver-executor) [0]: execute
I1219 20:16:42.823473482      62 chttp2_transport.cc:2790]   set connectivity_state=4
I1219 20:16:42.823518606      44 executor.cc:73]             EXECUTOR (resolver-executor) run 0x7f8490001000
I1219 20:16:42.823586626      62 connectivity_state.cc:164]  SET: 0x7f84c0009b78 client_transport: TRANSIENT_FAILURE --> SHUTDOWN [close_transport] error=0x7f84900014f0 {"created":"@1545250602.822223145","description":"Delayed close due to in-progress write","file":"src/core/ext/transport/chttp2/transport/chttp2_tran
sport.cc","file_line":609,"referenced_errors":[{"created":"@1545250602.821928707","description":"Endpoint read failed","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1545250602.821703024","description":"Socket closed","
fd":6,"file":"src/core/lib/iomgr/tcp_posix.cc","file_line":465,"grpc_status":14,"target_address":"ipv6:[::1]:44439"},{"created":"@1545250602.821364969","description":"GOAWAY received","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":1147,"http2_error":11,"raw_bytes":"too_many_pings"}]}
]}
I1219 20:16:42.823608583      62 timer_generic.cc:467]       TIMER 0x7f84c000b128: CANCEL pending=true
I1219 20:16:42.823626241      62 ev_posix.cc:269]            (fd-trace) fd_shutdown(6)
E1219 20:16:42.823656433      62 lockfree_event.cc:164]      LockfreeEvent::SetShutdown: 0x7f84c4000c78 curr=0x2 err={"created":"@1545250602.822223145","description":"Delayed close due to in-progress write","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":609,"referenced_errors":[{"cre
ated":"@1545250602.821928707","description":"Endpoint read failed","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1545250602.821703024","description":"Socket closed","fd":6,"file":"src/core/lib/iomgr/tcp_posix.cc","file
_line":465,"grpc_status":14,"target_address":"ipv6:[::1]:44439"},{"created":"@1545250602.821364969","description":"GOAWAY received","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":1147,"http2_error":11,"raw_bytes":"too_many_pings"}]}]}
E1219 20:16:42.823691701      62 lockfree_event.cc:164]      LockfreeEvent::SetShutdown: 0x7f84c4000c80 curr=0x2 err={"created":"@1545250602.822223145","description":"Delayed close due to in-progress write","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":609,"referenced_errors":[{"cre
ated":"@1545250602.821928707","description":"Endpoint read failed","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1545250602.821703024","description":"Socket closed","fd":6,"file":"src/core/lib/iomgr/tcp_posix.cc","file
_line":465,"grpc_status":14,"target_address":"ipv6:[::1]:44439"},{"created":"@1545250602.821364969","description":"GOAWAY received","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":1147,"http2_error":11,"raw_bytes":"too_many_pings"}]}]}
I1219 20:16:42.823703027      58 ev_epoll1_linux.cc:728]     ps: 0x55beb3695478 poll got 1 events
E1219 20:16:42.823713652      62 lockfree_event.cc:164]      LockfreeEvent::SetShutdown: 0x7f84c4000c88 curr=0x2 err={"created":"@1545250602.822223145","description":"Delayed close due to in-progress write","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":609,"referenced_errors":[{"cre
ated":"@1545250602.821928707","description":"Endpoint read failed","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1545250602.821703024","description":"Socket closed","fd":6,"file":"src/core/lib/iomgr/tcp_posix.cc","file
_line":465,"grpc_status":14,"target_address":"ipv6:[::1]:44439"},{"created":"@1545250602.821364969","description":"GOAWAY received","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":1147,"http2_error":11,"raw_bytes":"too_many_pings"}]}]}

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 16 (10 by maintainers)

Commits related to this issue

Most upvoted comments

@pkwarren, sent out #5795 to remove the API.