google-cloud-python: Uncaught exceptions within the streaming pull code.
This comes from a Stack Overflow question. There are internal exceptions that are not being caught and result in the client library no longer delivering messages.
Exception in thread Thread-LeaseMaintainer:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
return callable_(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 549, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 466, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "channel is in state TRANSIENT_FAILURE"
debug_error_string = "{"created":"@1554568036.075280756","description":"channel is in state TRANSIENT_FAILURE","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 179, in retry_target
return target()
File "/usr/local/lib/python3.6/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.ServiceUnavailable: 503 channel is in state TRANSIENT_FAILURE
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/leaser.py", line 146, in maintain_leases
[requests.ModAckRequest(ack_id, p99) for ack_id in ack_ids]
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py", line 152, in modify_ack_deadline
self._manager.send(request)
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 268, in send
self._send_unary_request(request)
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 259, in _send_unary_request
ack_deadline_seconds=deadline,
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 45, in <lambda>
fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw) # noqa
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/gapic/subscriber_client.py", line 723, in modify_ack_deadline
request, retry=retry, timeout=timeout, metadata=metadata
File "/usr/local/lib/python3.6/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
return wrapped_func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 270, in retry_wrapped_func
on_error=on_error,
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 199, in retry_target
last_exc,
File "<string>", line 3, in raise_from
google.api_core.exceptions.RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x7f86228cd400>, subscription: "projects/xxxxx-dev/subscriptions/telemetry-sub"
ack_deadline_seconds: 10
ack_ids: "QBJMJwFESVMrQwsqWBFOBCEhPjA-RVNEUAYWLF1GSFE3GQhoUQ5PXiM_NSAoRRoHIGoKOUJdEmJoXFx1B1ALEHQoYnxvWRYFCEdReF1YHQdodGxXOFUEHnN1Y3xtWhQDAEFXf3f8gIrJ38BtZho9WxJLLD5-LDRFQV4"
, metadata=[('x-goog-api-client', 'gl-python/3.6.8 grpc/1.19.0 gax/1.8.2 gapic/0.40.0')]), last exception: 503 channel is in state TRANSIENT_FAILURE
Thread-ConsumeBidirectionalStream caught unexpected exception Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x7f86228cda60>, subscription: "projects/xxxxx-dev/subscriptions/telemetry-sub"
ack_deadline_seconds: 10
ack_ids: "QBJMJwFESVMrQwsqWBFOBCEhPjA-RVNEUAYWLF1GSFE3GQhoUQ5PXiM_NSAoRRoHIGoKOUJdEmJoXFx1B1ALEHQoYnxvWRYFCEdReF1YHAdodGxXOFUEHnN1aXVoWxAIBEdXeXf8gIrJ38BtZho9WxJLLD5-LDRFQV4"
, metadata=[('x-goog-api-client', 'gl-python/3.6.8 grpc/1.19.0 gax/1.8.2 gapic/0.40.0')]), last exception: 503 channel is in state TRANSIENT_FAILURE and will exit.
The user who reported the error was using the following versions:
python == 3.6.5
google-cloud-pubsub == 0.40.0  # but this has behaved similarly for at least the last several versions
google-api-core == 1.8.2
google-api-python-client == 1.7.8
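For reference, a minimal streaming pull subscriber along these lines exercises the code path shown in the traceback above; the project ID, subscription name, and callback below are placeholders, not the reporter's actual code.

```python
# Minimal streaming pull subscriber sketch; names are placeholders.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "telemetry-sub")

def callback(message):
    # Process the message, then ack it so its lease is released.
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback)
# Block the main thread; lease maintenance (ModifyAckDeadline) and the
# bidirectional stream run in background threads managed by the client.
streaming_pull_future.result()
```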
About this issue
- State: closed
- Created 5 years ago
- Reactions: 6
- Comments: 22 (11 by maintainers)
@jakeczyz AFAIK there have been several independent reports of the same (or similar) bug in the past, including in non-Python clients, but it was (very) difficult to reproduce. I could not reproduce it either, and thus can only suspect that this is the true cause kicking in on random occasions. The tracebacks are very similar, though, which is promising.
I do not have an ETA yet, but expect to discuss this with others next week - will post more when I know more.
I believe this error occurs if the underlying channel enters the TRANSIENT_FAILURE state and remains in it for too long, i.e. longer than the `total_timeout_millis` setting of the subscriber client. I was not able to reproduce the bug with a sample pub/sub application running on Kubernetes, but I did manage to trigger the reported scenario locally by patching the installed `grpc` dependency. The patch fakes a channel error during particular minutes in an hour (adjust as necessary).
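A rough illustration of that kind of fault injection is sketched below; it is not the patch actually used, and it monkeypatches grpc's private `_UnaryUnaryMultiCallable` (an internal that may change between grpc releases), raising `google.api_core.exceptions.ServiceUnavailable` directly so that the retry layer keeps retrying until its 600 s deadline expires.

```python
# Hypothetical fault-injection sketch (not the patch actually used).
# Only unary RPCs (e.g. ModifyAckDeadline, Acknowledge) are affected; the
# streaming pull itself uses a stream-stream callable and keeps running.
import datetime

from google.api_core import exceptions
from grpc import _channel

# Inject failures during these minutes of the hour; the window must exceed
# the client's total retry deadline (600 s) for a RetryError to surface.
FAILURE_MINUTES = range(20, 35)  # adjust as necessary

_original_call = _channel._UnaryUnaryMultiCallable.__call__

def _failing_call(self, *args, **kwargs):
    if datetime.datetime.utcnow().minute in FAILURE_MINUTES:
        # Pretend the channel is stuck in TRANSIENT_FAILURE.
        raise exceptions.ServiceUnavailable(
            "channel is in state TRANSIENT_FAILURE (injected)"
        )
    return _original_call(self, *args, **kwargs)

_channel._UnaryUnaryMultiCallable.__call__ = _failing_call
```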
Result: eventually a `RetryError` is raised, and several threads exit (e.g. `Thread-ConsumeBidirectionalStream`). The main thread keeps running, but no messages are received and processed anymore; the subscriber must be stopped manually with <kbd>Ctrl</kbd>+<kbd>C</kbd>.

What happens is that if the subscriber has been retrying for too long, a `RetryError` is raised in the retry wrapper. This error is considered non-retryable, and the subscriber ceasing to pull messages is actually expected behavior IMO. Will look into it.
What should happen, however, is propagating the error to the main thread (and shutting everything down cleanly in the background), giving users a chance to catch the error and react to it as they see fit.
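For illustration only, once the error is surfaced on the streaming pull future, user code could react roughly like this (assuming `streaming_pull_future` is the future returned by `subscribe()`):

```python
# Sketch of handling the propagated error in user code.
from google.api_core import exceptions

try:
    streaming_pull_future.result()
except exceptions.RetryError:
    # The channel stayed in TRANSIENT_FAILURE past the total retry deadline;
    # cancel the subscriber, then restart or alert as appropriate.
    streaming_pull_future.cancel()
```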
Will discuss whether this is the expected way of handling this, and then work on a fix. Thank you for reporting the issue!
@sreetamdas I actually did manage to reproduce the reported behavior, but I still appreciate your willingness to help!
Since this is a synchronous pull (as opposed to the asynchronous streaming pull this issue is about), I will open a separate issue for easier traceability.
Update: Issue created - https://github.com/googleapis/google-cloud-python/issues/9822
Just as a quick update, it appears to me that in order to propagate the `RetryError` to the user code, a change might be necessary in one of the Pub/Sub client dependencies (API core, specifically).

Right now the background consumer thread does not propagate any errors and assumes that all error handling is done through the underlying RPC. However, if a `RetryError` occurs, the consumer thread terminates, but the underlying gRPC channel does not terminate (it's in the TRANSIENT_FAILURE state, after all). The subscriber client shuts itself down when the channel terminates, but since the latter does not happen, the client shutdown does not happen either, and the future result never gets set, despite the consumer thread not running anymore.

Changes to `bidi.BackgroundConsumer` might be needed, although that will have to be coordinated with the teams working on other libraries that could be affected by that.

Update: API core changes will not be needed after all; the subscriber client can properly respond to retry errors on its own.