grpc: Need additional support to debug why a stream is terminated in a Python client?

What version of gRPC and what language are you using?

Python client: gRPC 1.18.0 and 1.7.0
C++ server: gRPC 1.18.0

What operating system (Linux, Windows,…) and version?

Client: CentOS 7.2 and Windows 10
Server: CentOS 7.2

What runtime / compiler are you using (e.g., python version or version of gcc)?

Client - python 2.7.16

What did you do?

I have a simple gRPC subscription client like:

try:
    responses = stub.Subscribe(req_iterator, options.timeout, metadata=metadata)
    for response in responses:
        if response.HasField('sync_response'):
            pass  # do something
        elif response.HasField('update'):
            pass  # do something
except grpc.RpcError:
    pass  # handle stream errors
The client works as expected with gRPC 1.7.0 against the server (C++, first 1.7 then 1.18). But once we upgraded the client to 1.18, the subscription stream always ends right after receiving a sync_response and an update protobuf message.

Debugging led me to _Rendezvous._next(), where its _state.code has been changed to grpc.StatusCode.OK. Unless I missed something, this is set by the wrapped C++ implementation in a separate thread. The server team looked at their log and concluded that the client closed the stream. But the client isn't doing anything except waiting, so I am stuck; I can't determine what actually happened in the wrapped C++ code. Turning on GRPC_TRACE does not help much, since the output is not specific enough to be understood at the application level. Looking at the output, I just wonder why it is so difficult for gRPC to let the application know (1) who closed the stream and (2) if it was the client, why.
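For reference, the tracing mentioned above is controlled by gRPC's documented runtime environment variables, and from Python they can be set in code as long as this happens before the grpc module is first imported (the C core reads them once at startup). The specific tracer names below are examples, not the ones used in this issue:

```python
import os

# Must be set BEFORE `import grpc`: the gRPC C core reads these once at startup.
os.environ["GRPC_VERBOSITY"] = "DEBUG"
# Comma-separated list of tracers. "api" and "call_error" tend to be more
# readable at the application level than low-level transport tracers like "http".
os.environ["GRPC_TRACE"] = "api,call_error"

# import grpc  # the import must come only after the variables are set
```

Even so, as noted above, these tracers describe transport operations rather than application-level events such as "who closed the stream".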

For now, we have to keep running the client under 1.7.0.

What did you expect to see?

Critical information about stream state changes and operations (who and what) taking place in the wrapped gRPC implementation should be made meaningful to applications, e.g.:

  • stream termination detected (i.e., the server closed the stream)
  • closing the stream because <reason>

What did you see instead?

Lots of trace output like:

I0430 15:16:24.858246379   18320 channel_stack.cc:226]       OP[connected:0x8cee418]:  RECV_MESSAGE
I0430 15:16:24.858252866   18320 chttp2_transport.cc:1683]   perform_stream_op[s=0x8ceef78]:  RECV_MESSAGE
I0430 15:16:24.858259090   18320 call_combiner.cc:153]       ==> grpc_call_combiner_stop() [0x8ced320] [passed batch to transport]
I0430 15:16:24.858264920   18320 call_combiner.cc:160]         size: 1 -> 0
I0430 15:16:24.858269876   18320 call_combiner.cc:189]         queue empty
I0430 15:16:24.858276621   18320 chttp2_transport.cc:1379]   perform_stream_op_locked:  RECV_MESSAGE; on_complete = (nil)
I0430 15:16:24.858288782   18320 stream_lists.cc:125]        0x8eef950[3][cli]: add to writable
I0430 15:16:24.858312429   18320 chttp2_transport.cc:1228]   complete_closure_step: t=0x8eef950 0x7f5db0001fe8 refs=0 flags=0x0000 desc=op->on_complete err="No Error" write_state=IDLE

See TROUBLESHOOTING.md for how to diagnose problems better.

Anything else we should know about your project / environment?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21 (8 by maintainers)

Most upvoted comments

@lidizheng As stated earlier, only "the last line" (and not all three) of the code snippet was replaced. I kept 3 lines there so the right context can be identified.

@lidizheng Got a break! Looking back at the changes we made over the past years, I realized we had this problem when the gRPC subscription interface was initially added to our product. At the time, we encountered the same problem with gRPC v1.7.0. Not knowing enough then (and still not now), we experimented and fixed it by replacing the last line of the following in _consume_request_iterator() in _channel.py:

                            state.condition.wait(condition_wait_timeout)
                            cygrpc.block_if_fork_in_progress(state)
                            if state.code is None:

to

                            if state.code is not None:

I just did not remember this until today. After making the same change for v1.18.0, the problem is gone. Still, I don't know whether we should have done something different in our own code. I hope to get some feedback.
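For anyone hitting something similar: in the Python client, a streaming call is half-closed from the client side once the request iterator raises StopIteration, so a long-lived subscription generally needs a request iterator that blocks rather than returns. A minimal, hypothetical sketch (class and method names are illustrative, not from this issue) using a queue-backed iterator:

```python
import queue


class BlockingRequestIterator(object):
    """Request iterator that blocks until a request (or a stop sentinel)
    is enqueued, so the client never half-closes the stream early."""

    _STOP = object()

    def __init__(self):
        self._queue = queue.Queue()

    def send(self, request):
        """Enqueue a request to be sent on the stream."""
        self._queue.put(request)

    def close(self):
        """Signal the end of the request stream (client half-close)."""
        self._queue.put(self._STOP)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks; never ends the stream by accident
        if item is self._STOP:
            raise StopIteration
        return item

    next = __next__  # Python 2.7 compatibility, as used by the reporter
```

Such an iterator would be handed to stub.Subscribe(...); the stream can then only be half-closed by an explicit close() call, which makes it obvious in application code when (and why) the client ends the stream.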

Also, the original intent of this issue stands: better information should be provided for applications to understand such situations.