google-cloud-python: Pub/Sub Subscriber does not catch & retry UNAVAILABLE errors

A basic Pub/Sub message consumer stops consuming messages after a retryable error (see the stack trace below; in short, _Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, The service was unable to fulfill your request. Please try again. [code=8a75])>). The app does not crash, but the stream never recovers and never resumes receiving messages. Interesting observations:

If I simply turn off WiFi on my laptop and run the same code, it keeps retrying until the machine is connected to the network and then functions as expected. This tells me that the failure is a reaction to the specific StatusCode.
The exception sometimes happens on startup, sometimes mid-stream.

Expected behavior:

The application code would continue retrying to build the streamingPull connection and eventually recover and receive messages.
This would be handled and surfaced as a warning, rather than a thread-killing exception.
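
The expected behavior above can be sketched as a reconnect loop with backoff. This is only an illustration under stated assumptions, not the client library's implementation: `Unavailable`, `open_stream`, and `consume_with_retry` are hypothetical names, and the backoff values are arbitrary.

```python
import logging
import time

logger = logging.getLogger(__name__)


class Unavailable(Exception):
    """Hypothetical stand-in for an RPC that ended with StatusCode.UNAVAILABLE."""


def consume_with_retry(open_stream, callback, max_attempts=5,
                       initial_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Consume messages, rebuilding the stream after retryable errors.

    open_stream is a hypothetical zero-argument callable that returns an
    iterable of messages; it stands in for building the streamingPull
    connection. Returns the number of messages handled.
    """
    handled = 0
    attempts = 0
    delay = initial_delay
    while True:
        try:
            for message in open_stream():
                callback(message)
                handled += 1
                # Progress was made, so reset the backoff state.
                attempts = 0
                delay = initial_delay
            return handled  # The stream ended without an error.
        except Unavailable as exc:
            attempts += 1
            if attempts >= max_attempts:
                raise
            # Surface the failure as a warning instead of killing the thread.
            logger.warning("Stream unavailable (%s); retrying in %.1fs", exc, delay)
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The `sleep` parameter exists only so the loop can be exercised without real delays; a caller would normally leave it at the default.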

This might be the same issue as 2683. This comment, in particular, seems like the solution that I would expect the client library to implement.

Answers to standard questions:

OS type and version MacOS Sierra 10.12.6
Python version and virtual environment information python --version Python 2.7.10 (running in virtualenv)
google-cloud-python version pip show google-cloud, pip show google-<service> or pip freeze

$ pip show google-cloud
Name: google-cloud
Version: 0.27.0
$ pip show google-cloud-pubsub
Name: google-cloud-pubsub
Version: 0.28.4

Stacktrace if available

Exception in thread Consumer helper: consume bidirectional stream:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/kir/cloud/env/lib/python2.7/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 248, in _blocking_consume
    self._policy.on_exception(exc)
  File "/Users/kir/cloud/env/lib/python2.7/site-packages/google/cloud/pubsub_v1/subscriber/policy/thread.py", line 140, in on_exception
    raise exception
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, The service was unable to fulfill your request. Please try again. [code=8a75])>

Steps to reproduce

I was not able to reproduce this consistently. But it would happen ~1 in 10 times I ran the code.

Code example

import sys
import time

from google.cloud import pubsub_v1 as pubsub

# Arguments: <project> <subscription> [sleep_time_ms]
subscription_name = "projects/%s/subscriptions/%s" % (sys.argv[1], sys.argv[2])

# Optional third argument: artificial per-message processing delay in ms.
sleep_time_ms = 0
try:
    sleep_time_ms = int(sys.argv[3])
except Exception:
    print("Could not parse custom sleep time.")
print("Using sleep time %g ms" % sleep_time_ms)


def callback(message):
    # Simulate work, report how long the message was held, then ack it.
    t = time.time()
    time.sleep(float(sleep_time_ms) / 1000)
    print("Message %s acked in %g second" % (message.data, time.time() - t))
    message.ack()


subscription = pubsub.SubscriberClient().subscribe(subscription_name).open(callback=callback)
# Keep the main thread alive while the consumer threads run.
time.sleep(10000)

About this issue

State: closed
Created 7 years ago
Reactions: 1
Comments: 59 (24 by maintainers)

Commits related to this issue

Making `thread.Policy.on_exception` more robust. - Adding special handling for API core exceptions - Retrying on both types of idempotent error - Also doing a "drive-by" hygiene fix changing a global... — committed to dhermes/google-cloud-python by dhermes 7 years ago
PubSub: Making `thread.Policy.on_exception` more robust. (#4444) - Adding special handling for API core exceptions - Retrying on both types of idempotent error Towards #4234. — committed to googleapis/google-cloud-python by dhermes 7 years ago
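
The traceback in the report shows the consumer handing the exception to the policy's `on_exception`, which re-raises it. The idea in these commit messages can be sketched as follows. This is a simplified illustration, not the committed code: the exception classes are local stand-ins, `SketchPolicy` is hypothetical, and treating DEADLINE_EXCEEDED and UNAVAILABLE as the "both types of idempotent error" is an assumption.

```python
class DeadlineExceeded(Exception):
    """Hypothetical stand-in for a DEADLINE_EXCEEDED error."""


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an UNAVAILABLE error."""


# Assumption: these are the two retryable error types the commit refers to.
RETRYABLE_STREAM_ERRORS = (DeadlineExceeded, ServiceUnavailable)


class SketchPolicy(object):
    """Illustrative policy; not the library's thread.Policy."""

    def __init__(self):
        self.retries = 0

    def on_exception(self, exception):
        """Return normally for retryable errors; re-raise everything else.

        Returning tells the (hypothetical) consumer loop to re-open the
        stream, so a retryable error no longer kills the consumer thread.
        """
        if isinstance(exception, RETRYABLE_STREAM_ERRORS):
            self.retries += 1
            return
        raise exception
```

Keeping the retryable types in one tuple makes the decision a single `isinstance` check and leaves unknown errors fatal by default.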

Most upvoted comments

As far as I’m aware, until this issue is fixed, the python pub/sub library is fundamentally broken for consumers.

The docs’ basic examples themselves lead to code that will break within minutes of deployment
It breaks when run locally on my Mac
It breaks in a docker container
It breaks on GKE
It breaks on a linux workstation that my coworker and I share.

I’m not currently aware of any environment that can use this release of the code and have it not hit this essentially fatal error within minutes, completely stalling any subscriber workers. If anyone does, I’m all ears.

Having code this broken in master and having it been broken for a month since report is truly puzzling. I attempted to implement the hack listed earlier (which really should not be a recommended fix for a production service…), but it leaves tons of dangling threads which eat up CPU resources, and I’ve found that a number of my messages were not being properly ack’d, leading to lots of duplication of work, making my message queues very messy.

As @gcbirzan noted - their first message 28 days ago - having this lib in pip in this state is not really acceptable considering that we are paying for a service with an unusable library implementation. Recommending in #4286 to use the most recent version to @ilyanesterov is irresponsible and demonstrates what appears to be a lack of understanding of how severe this issue is.

I’ve gone back to using the gcloud pip library for now because it actually works. Fortunately I had old code lying around as an example otherwise I don’t know what I’d do besides choose a completely different language, as any documentation before the massive API change seems to be gone with the wind. Would have been nice to use v0.24.0 but the examples in google official python docs no longer line up with that, and I’m not too keen on reading the source code to figure out which parts of the API to use.

+14

rclough on Nov 20, 2017

Everything said in the above comment is true, sadly. 😦

> Having code this broken in master and having it been broken for a month since report is truly puzzling.

This is on me. It is a bug in the flow control code that I wrote. Essentially what has been going on is that this bug has spent weeks in the “this needs to be the very next thing that I tackle” spot, with one competing priority always winning. I did look at it some, but as you note, it is a somewhat thorny issue. What I should have done is effectively call in the cavalry (I am not the only person who commits code here, after all), and I did not do that.

I have since asked @dhermes to rescue me on this one, and he is looking into it (today) and hopefully a fix should emerge soon. (Do not throw any tomatoes at him on how long this bug has been open though; throw those at me, it is my fault.)

lukesneeringer on Nov 20, 2017

Fixed by #4444. I ran a reproducible test case against the current master for 659 seconds and it did not fail with UNAVAILABLE (typically had failed within 90-200s):

...
00638401:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :The current p99 value is 10 seconds.
00638401:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :Renewing lease for 0 ack IDs.
00638401:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :Snoozing lease management for 8.997289 seconds.
00647408:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :The current p99 value is 10 seconds.
00647408:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :Renewing lease for 0 ack IDs.
00647408:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :Snoozing lease management for 4.038803 seconds.
00651451:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :The current p99 value is 10 seconds.
00651452:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :Renewing lease for 0 ack IDs.
00651452:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :Snoozing lease management for 7.508992 seconds.
00658687:DEBUG:google.cloud.pubsub_v1.subscriber._consumer:Thread-41             :Sending initial request: subscription: "projects/precise-truck-742/subscriptions/s-djh-local-1511820764773"
stream_ack_deadline_seconds: 10

00658966:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :The current p99 value is 10 seconds.
00658966:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :Renewing lease for 0 ack IDs.
00658966:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-9              :Snoozing lease management for 5.914027 seconds.

dhermes on Nov 27, 2017

@lukesneeringer I’m guessing the update isn’t coming this week. Would’ve been nice to let people know. Or, pull this lib from pypi. Or handle this in any way like you’re a company we’re paying for a service.

gcbirzan on Nov 18, 2017

@ericbbraga and I are also seeing this issue.

edgartanaka on Nov 16, 2017

I just started using the new pubsub today. I had originally used the old pip gcloud API and everything was working fine. I realized that was an old package, so I switched to the new API and basically used the examples from the docs out of the box; I have a pretty textbook implementation. Almost immediately I started hitting this issue when I deployed to GKE. My apps that subscribe die within 5 minutes with StatusCode.UNAVAILABLE, The service was unable to fulfill your request. Please try again. [code=8a75]. Basically completely unusable/unreliable for me.

rclough on Nov 14, 2017

FYI for all those following this issue, I have pushed a release (0.29.1) and it is getting built right now (will end with a push to PyPI).

I don’t think all issues have been resolved, but this at least will “gracefully” handle inactivity. (The UNAVAILABLE in the case of inactivity is almost certainly caused by the local gRPC client, not the backend.)

dhermes on Nov 27, 2017

Thanks @jonparrott for references to the previous documentation. It’s unclear to me at which version this particular bug crept in, but useful to have in case others want to attempt to move forward with implementations.

Thanks @lukesneeringer for gracefully taking the heat, communicating, and calling in for help.

I apologize for the severity of my criticism, as I understand there are always plenty of bugs to go around and that anger doesn’t fix things faster. It is just disappointing to see such a crippling bug in public release for so long, rather than seeing a revert until the actual usability is addressed. It’s not even just about production- when your consumer threads die within minutes, how can you develop or test your system in the first place?

Fair enough that it’s a beta library, but as far as I’m aware in the GC docs, this is the library chosen for reference implementation of pub/sub in Python, and there is no indication that the Python library is beta unless you visit the GitHub page (which I had only done when I encountered this bug), or I guess extrapolate from semver when you get the pip package. It’s not clear whether ANY of the libraries in any language are in or out of beta for that matter.

If there is a library implementation that can be trusted to be stable then it should perhaps be noted in the documentation for the product. Otherwise it would seem unfair to me that libraries referenced in official docs could be effectively categorically broken for 50% of the services usefulness.

Anywho, I’m going to try to stop being mad, thank you for your attention to the issue, and I look forward to the fixes 😅

rclough on Nov 21, 2017

Why is this issue closed? With the new version I still see a lot of dangling threads and 100% consumption of one core on my machines. This behavior appears after the workers have been running for some time. I don’t really want to restart my workers periodically to reduce useless CPU consumption.

makrusak on Nov 29, 2017
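
One way to check for the dangling threads described above is to count live threads by name from inside the worker process. This is a small diagnostic sketch using only the standard library; `thread_census` is a hypothetical helper, and the thread name used in the note below is taken from the traceback in the original report.

```python
import collections
import threading


def thread_census():
    """Return a Counter mapping thread name to the number of live threads."""
    return collections.Counter(t.name for t in threading.enumerate())
```

Logging `thread_census()` periodically and watching whether the count for a name such as "Consumer helper: consume bidirectional stream" keeps growing would indicate whether threads are being leaked on each reconnect.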

@kir-titievsky, the solution you mentioned seems to work at first, but the problem with this approach seems to be that the CPU usage goes up every time the UNAVAILABLE exception is handled this way. Thus, a fix seems to be necessary.

murataksoy on Oct 25, 2017

We’re able to reproduce this consistently within 1-2 minutes of starting a snippet virtually identical to the one on https://cloud.google.com/pubsub/docs/pull#pubsub-pull-messages-async-python. There are no messages published on the topic. Our stacktrace is a bit different, though:

Exception in thread Consumer helper: consume bidirectional stream:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/gcbirzan/.virtualenvs/pubsub/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 248, in _blocking_consume
    self._policy.on_exception(exc)
  File "/home/gcbirzan/.virtualenvs/pubsub/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/policy/thread.py", line 140, in on_exception
    raise exception
  File "/home/gcbirzan/.virtualenvs/pubsub/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 234, in _blocking_consume
    for response in response_generator:
  File "/home/gcbirzan/.virtualenvs/pubsub/lib/python3.6/site-packages/grpc/_channel.py", line 363, in __next__
    return self._next()
  File "/home/gcbirzan/.virtualenvs/pubsub/lib/python3.6/site-packages/grpc/_channel.py", line 357, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, The service was unable to fulfill your request. Please try again. [code=8a75])>

gcbirzan on Oct 23, 2017