google-cloud-python: Uncaught exceptions within the streaming pull code.
This comes from a Stack Overflow question. There are internal exceptions that are not being caught and result in the client library no longer delivering messages.
Exception in thread Thread-LeaseMaintainer:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
return callable_(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 549, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 466, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "channel is in state TRANSIENT_FAILURE"
debug_error_string = "{"created":"@1554568036.075280756","description":"channel is in state TRANSIENT_FAILURE","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 179, in retry_target
return target()
File "/usr/local/lib/python3.6/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.ServiceUnavailable: 503 channel is in state TRANSIENT_FAILURE
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/leaser.py", line 146, in maintain_leases
[requests.ModAckRequest(ack_id, p99) for ack_id in ack_ids]
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py", line 152, in modify_ack_deadline
self._manager.send(request)
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 268, in send
self._send_unary_request(request)
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 259, in _send_unary_request
ack_deadline_seconds=deadline,
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 45, in <lambda>
fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw) # noqa
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/gapic/subscriber_client.py", line 723, in modify_ack_deadline
request, retry=retry, timeout=timeout, metadata=metadata
File "/usr/local/lib/python3.6/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
return wrapped_func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 270, in retry_wrapped_func
on_error=on_error,
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 199, in retry_target
last_exc,
File "<string>", line 3, in raise_from
google.api_core.exceptions.RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x7f86228cd400>, subscription: "projects/xxxxx-dev/subscriptions/telemetry-sub"
ack_deadline_seconds: 10
ack_ids: "QBJMJwFESVMrQwsqWBFOBCEhPjA-RVNEUAYWLF1GSFE3GQhoUQ5PXiM_NSAoRRoHIGoKOUJdEmJoXFx1B1ALEHQoYnxvWRYFCEdReF1YHQdodGxXOFUEHnN1Y3xtWhQDAEFXf3f8gIrJ38BtZho9WxJLLD5-LDRFQV4"
, metadata=[('x-goog-api-client', 'gl-python/3.6.8 grpc/1.19.0 gax/1.8.2 gapic/0.40.0')]), last exception: 503 channel is in state TRANSIENT_FAILURE
Thread-ConsumeBidirectionalStream caught unexpected exception Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x7f86228cda60>, subscription: "projects/xxxxx-dev/subscriptions/telemetry-sub"
ack_deadline_seconds: 10
ack_ids: "QBJMJwFESVMrQwsqWBFOBCEhPjA-RVNEUAYWLF1GSFE3GQhoUQ5PXiM_NSAoRRoHIGoKOUJdEmJoXFx1B1ALEHQoYnxvWRYFCEdReF1YHAdodGxXOFUEHnN1aXVoWxAIBEdXeXf8gIrJ38BtZho9WxJLLD5-LDRFQV4"
, metadata=[('x-goog-api-client', 'gl-python/3.6.8 grpc/1.19.0 gax/1.8.2 gapic/0.40.0')]), last exception: 503 channel is in state TRANSIENT_FAILURE and will exit.
The user who reported the error was using the following versions:
python == 3.6.5
google-cloud-pubsub == 0.40.0  # but this has behaved similarly for at least the last several versions
google-api-core == 1.8.2
google-api-python-client == 1.7.8
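For reference, a minimal streaming pull subscriber along these lines exercises the code path shown in the traceback above; the project ID, subscription name, and callback below are placeholders, not the reporter's actual code.

```python
# Minimal streaming pull subscriber sketch; names are placeholders.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "telemetry-sub")

def callback(message):
    # Process the message, then ack it so its lease is released.
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback)
# Block the main thread; lease maintenance (ModifyAckDeadline) and the
# bidirectional stream run in background threads managed by the client.
streaming_pull_future.result()
```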
About this issue
- State: closed
- Created 5 years ago
- Reactions: 6
- Comments: 22 (11 by maintainers)
@jakeczyz AFAIK there have been several independent reports of the same (or similar) bug in the past, including in non-Python clients, but it was (very) difficult to reproduce. I could not reproduce it either, and thus can only suspect that this is the true cause kicking in on random occasions. The tracebacks are very similar, though, which is promising.
I do not have an ETA yet, but expect to discuss this with others next week - will post more when I know more.
I believe this error occurs if the underlying channel enters the TRANSIENT_FAILURE state and remains in it for too long, i.e. longer than the `total_timeout_millis` setting of the subscriber client. I was not able to reproduce the bug with a sample pub/sub application running on Kubernetes, but I did manage to trigger the reported scenario locally by patching the installed `grpc` dependency. The patch fakes a channel error during particular minutes in an hour (adjust as necessary).
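A rough illustration of that kind of fault injection is sketched below; it is not the patch actually used, and it monkeypatches grpc's private `_UnaryUnaryMultiCallable` (an internal that may change between grpc releases), raising `google.api_core.exceptions.ServiceUnavailable` directly so that the retry layer keeps retrying until its 600 s deadline expires.

```python
# Hypothetical fault-injection sketch (not the patch actually used).
# Only unary RPCs (e.g. ModifyAckDeadline, Acknowledge) are affected; the
# streaming pull itself uses a stream-stream callable and keeps running.
import datetime

from google.api_core import exceptions
from grpc import _channel

# Inject failures during these minutes of the hour; the window must exceed
# the client's total retry deadline (600 s) for a RetryError to surface.
FAILURE_MINUTES = range(20, 35)  # adjust as necessary

_original_call = _channel._UnaryUnaryMultiCallable.__call__

def _failing_call(self, *args, **kwargs):
    if datetime.datetime.utcnow().minute in FAILURE_MINUTES:
        # Pretend the channel is stuck in TRANSIENT_FAILURE.
        raise exceptions.ServiceUnavailable(
            "channel is in state TRANSIENT_FAILURE (injected)"
        )
    return _original_call(self, *args, **kwargs)

_channel._UnaryUnaryMultiCallable.__call__ = _failing_call
```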
Result: eventually a `RetryError` is raised, and several threads exit (e.g. `Thread-ConsumeBidirectionalStream`). The main thread keeps running, but no messages are received and processed anymore; the subscriber must be stopped manually with <kbd>Ctrl</kbd>+<kbd>C</kbd>.

What happens is that if the subscriber has been retrying for too long, a `RetryError` is raised in the retry wrapper. This error is considered non-retryable, and the subscriber ceasing to pull messages is actually expected behavior IMO. Will look into it.
What should happen, however, is propagating the error to the main thread (and shutting everything down cleanly in the background), giving users a chance to catch the error and react to it as they see fit.
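For illustration only, once the error is surfaced on the streaming pull future, user code could react roughly like this (assuming `streaming_pull_future` is the future returned by `subscribe()`):

```python
# Sketch of handling the propagated error in user code.
from google.api_core import exceptions

try:
    streaming_pull_future.result()
except exceptions.RetryError:
    # The channel stayed in TRANSIENT_FAILURE past the total retry deadline;
    # cancel the subscriber, then restart or alert as appropriate.
    streaming_pull_future.cancel()
```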
Will discuss whether this is the expected way of handling this, and then work on a fix. Thank you for reporting the issue!
@sreetamdas I actually did manage to reproduce the reported behavior, but I still appreciate your willingness to help!
Since this is a synchronous pull (as opposed to the asynchronous streaming pull this issue is about), I will open a separate issue for easier traceability.
Update: Issue created - https://github.com/googleapis/google-cloud-python/issues/9822
Just as a quick update, it appears to me that in order to propagate the `RetryError` to the user code, a change might be necessary in one of the Pub/Sub client dependencies (API core, specifically).

Right now the background consumer thread does not propagate any errors and assumes that all error handling is done through the underlying RPC. However, if a `RetryError` occurs, the consumer thread terminates, but the underlying gRPC channel does not terminate (it's in the TRANSIENT_FAILURE state, after all). The subscriber client shuts itself down when the channel terminates, but since the latter does not happen, the client shutdown does not happen either, and the future result never gets set, despite the consumer thread not running anymore.

Changes to `bidi.BackgroundConsumer` might be needed, although that will have to be coordinated with the teams working on other libraries that could be affected by that.

Update: API core changes will not be needed after all; the subscriber client can properly respond to retry errors on its own.