google-cloud-python: Pub/Sub Subscriber does not catch & retry UNAVAILABLE errors
A basic Pub/Sub message consumer stops consuming messages after a retryable error (see the stack trace below; in short, _Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, The service was unable to fulfill your request. Please try again. [code=8a75])>). The app does not crash, but the stream never recovers and never resumes receiving messages. Interesting observations:
- If I simply turn off WiFi on my laptop and run the same code, it keeps retrying until the machine is connected to the network and then functions as expected. This tells me that the failure is a reaction to the specific StatusCode.
- The exception sometimes happens on startup sometimes mid-stream.
Expected behavior:
- The application code would continue retrying to build the streamingPull connection and eventually recover and receive messages.
- This would be handled and surfaced as a warning, rather than a thread-killing exception.
This might be the same issue as 2683. This comment, in particular, seems like the solution that I would expect the client library to implement.
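The expected behavior described above can be sketched as a retry loop with exponential backoff. This is a minimal illustration, not the client library's implementation: `RetryableError`, `consume_with_retry`, and `open_stream` are hypothetical names standing in for a transient UNAVAILABLE failure and for whatever call opens the streaming pull.

```python
import logging
import time

log = logging.getLogger(__name__)


class RetryableError(Exception):
    """Hypothetical stand-in for a transient error such as UNAVAILABLE."""


def consume_with_retry(open_stream, max_attempts=5, initial_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call open_stream(), retrying transient failures with backoff.

    open_stream is an assumed callable that blocks while the stream is
    healthy and raises RetryableError when the connection drops.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return open_stream()
        except RetryableError as exc:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the failure.
                raise
            # Surface the problem as a warning instead of killing the thread.
            log.warning("stream attempt %d failed (%s); retrying in %.2fs",
                        attempt, exc, delay)
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The `sleep` parameter exists only so the backoff schedule can be observed or replaced in tests; a real consumer would probably retry indefinitely rather than give up after a fixed number of attempts.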
Answers to standard questions:
- OS type and version: MacOS Sierra 10.12.6
- Python version and virtual environment information (python --version): Python 2.7.10 (running in virtualenv)
- google-cloud-python version (pip show google-cloud, pip show google-<service> or pip freeze):
$ pip show google-cloud
Name: google-cloud
Version: 0.27.0
$ pip show google-cloud-pubsub
Name: google-cloud-pubsub
Version: 0.28.4
- Stacktrace if available
Exception in thread Consumer helper: consume bidirectional stream:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/kir/cloud/env/lib/python2.7/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 248, in _blocking_consume
    self._policy.on_exception(exc)
  File "/Users/kir/cloud/env/lib/python2.7/site-packages/google/cloud/pubsub_v1/subscriber/policy/thread.py", line 140, in on_exception
    raise exception
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, The service was unable to fulfill your request. Please try again. [code=8a75])>
- Steps to reproduce
- I was not able to reproduce this consistently. But it would happen ~1 in 10 times I ran the code.
- Code example
import time, datetime, sys
from google.cloud import pubsub_v1 as pubsub

# Usage: <script> PROJECT SUBSCRIPTION [SLEEP_TIME_MS]
subscription_name = "projects/%s/subscriptions/%s" % (sys.argv[1], sys.argv[2])

# Optional third argument: artificial per-message processing delay.
sleep_time_ms = 0
try:
    sleep_time_ms = int(sys.argv[3])
except Exception:
    print "Could not parse custom sleep time."
print "Using sleep time %g ms" % sleep_time_ms

def callback(message):
    # Simulate work, then acknowledge the message.
    t = time.time()
    time.sleep(float(sleep_time_ms) / 1000)
    print "Message " + message.data + " acked in %g second" % (time.time() - t)
    message.ack()

subscription = pubsub.SubscriberClient().subscribe(subscription_name).open(callback=callback)
# Keep the main thread alive while the consumer threads run.
time.sleep(10000)
About this issue
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 59 (24 by maintainers)
Commits related to this issue
- Making `thread.Policy.on_exception` more robust. - Adding special handling for API core exceptions - Retrying on both types of idempotent error - Also doing a "drive-by" hygiene fix changing a global... — committed to dhermes/google-cloud-python by dhermes 7 years ago
- PubSub: Making `thread.Policy.on_exception` more robust. (#4444) - Adding special handling for API core exceptions - Retrying on both types of idempotent error Towards #4234. — committed to googleapis/google-cloud-python by dhermes 7 years ago
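One reading of the commit description above is that the exception handler decides whether a stream error is retryable from either a raw gRPC-style status code or a wrapped API-level exception type. The sketch below illustrates that idea only. `StatusCode`, `FakeRpcError`, `WrappedUnavailable`, `WrappedDeadlineExceeded`, and `should_retry` are hypothetical stand-ins defined here so the example runs without gRPC installed; the choice of DEADLINE_EXCEEDED and UNAVAILABLE as the retryable pair is an assumption, not something the thread states.

```python
import enum


class StatusCode(enum.Enum):
    """Hypothetical stand-in for a gRPC-style status enum."""
    DEADLINE_EXCEEDED = "deadline exceeded"
    UNAVAILABLE = "unavailable"
    PERMISSION_DENIED = "permission denied"


class FakeRpcError(Exception):
    """Stand-in for a low-level RPC error that exposes a code() method."""

    def __init__(self, code):
        super(FakeRpcError, self).__init__(code)
        self._code = code

    def code(self):
        return self._code


class WrappedUnavailable(Exception):
    """Stand-in for a higher-level 'service unavailable' exception."""


class WrappedDeadlineExceeded(Exception):
    """Stand-in for a higher-level 'deadline exceeded' exception."""


# Assumption: these are the two error kinds treated as safe to retry.
RETRYABLE_CODES = frozenset([StatusCode.DEADLINE_EXCEEDED,
                             StatusCode.UNAVAILABLE])
RETRYABLE_TYPES = (WrappedUnavailable, WrappedDeadlineExceeded)


def should_retry(exc):
    """Return True if exc looks like a transient stream failure."""
    if isinstance(exc, RETRYABLE_TYPES):
        return True
    code = getattr(exc, "code", None)
    if callable(code):
        return code() in RETRYABLE_CODES
    return False
```

A handler built this way would reopen the stream when `should_retry` returns True and re-raise otherwise, so that permanent failures such as a permissions error still surface.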
As far as I’m aware, until this issue is fixed, the python pub/sub library is fundamentally broken for consumers.
I’m not currently aware of any environment that can use this release of the code and have it not hit this essentially fatal error within minutes, completely stalling any subscriber workers. If anyone does, I’m all ears.
Having code this broken in master, and leaving it broken for a month after the report, is truly puzzling. I attempted to implement the hack listed earlier (which really should not be a recommended fix for a production service…), but it leaves tons of dangling threads which eat up CPU resources, and I’ve found that a number of my messages were not being properly ack’d, leading to lots of duplicated work and making my message queues very messy.
As @gcbirzan noted in their first message 28 days ago, having this lib in pip in this state is not really acceptable, considering that we are paying for a service with an unusable library implementation. Recommending in #4286 that @ilyanesterov use the most recent version is irresponsible and demonstrates what appears to be a lack of understanding of how severe this issue is.
I’ve gone back to using the gcloud pip library for now because it actually works. Fortunately I had old code lying around as an example otherwise I don’t know what I’d do besides choose a completely different language, as any documentation before the massive API change seems to be gone with the wind. Would have been nice to use v0.24.0 but the examples in google official python docs no longer line up with that, and I’m not too keen on reading the source code to figure out which parts of the API to use.
Everything said in the above comment is true, sadly. 😦
This is on me. It is a bug in the flow control code that I wrote. Essentially what has been going on is that this bug has spent weeks in the “this needs to be the very next thing that I tackle” spot, with one competing priority always winning. I did look at it some, but as you note, it is a somewhat thorny issue. What I should have done is effectively call in the cavalry (I am not the only person who commits code here, after all), and I did not do that.
I have since asked @dhermes to rescue me on this one, and he is looking into it (today) and hopefully a fix should emerge soon. (Do not throw any tomatoes at him on how long this bug has been open though; throw those at me, it is my fault.)
Fixed by #4444. I ran a reproducible test case against the current master for 659 seconds and it did not fail with UNAVAILABLE (typically had failed within 90-200s).
@lukesneeringer I’m guessing the update isn’t coming this week. Would’ve been nice to let people know. Or, pull this lib from pypi. Or handle this in any way like you’re a company we’re paying for a service.
@ericbbraga and I are also seeing this issue.
I just started using the new pubsub today. I had originally used the old pip gcloud API and everything was working fine. I realized that was an old package, so I switched to the new API and basically used the examples from the docs out of the box; I have a pretty textbook implementation. Almost immediately I started hitting this issue when I deployed to GKE. My apps that subscribe die within 5 minutes with StatusCode.UNAVAILABLE, The service was unable to fulfill your request. Please try again. [code=8a75]. Basically completely unusable/unreliable for me.
FYI for all those following this issue, I have pushed a release (0.29.1) and it is getting built right now (will end with a push to PyPI). I don’t think all issues have been resolved, but this at least will “gracefully” handle inactivity. (The UNAVAILABLE in the case of inactivity is almost certainly caused by the local gRPC client, not the backend.)
Thanks @jonparrott for references to the previous documentation. It’s unclear to me at which version this particular bug crept in, but useful to have in case others want to attempt to move forward with implementations.
Thanks @lukesneeringer for gracefully taking the heat, communicating, and calling in for help.
I apologize for the severity of my criticism, as I understand there are always plenty of bugs to go around and that anger doesn’t fix things faster. It is just disappointing to see such a crippling bug in public release for so long, rather than seeing a revert until the actual usability is addressed. It’s not even just about production- when your consumer threads die within minutes, how can you develop or test your system in the first place?
Fair enough that it’s a beta library, but as far as I’m aware in the GC docs, this is the library chosen for reference implementation of pub/sub in Python, and there is no indication that the Python library is beta unless you visit the GitHub page (which I had only done when I encountered this bug), or I guess extrapolate from semver when you get the pip package. It’s not clear whether ANY of the libraries in any language are in or out of beta for that matter.
If there is a library implementation that can be trusted to be stable, then it should perhaps be noted in the documentation for the product. Otherwise it would seem unfair to me that libraries referenced in official docs could be effectively categorically broken for 50% of the service’s usefulness.
Anywho, I’m going to try to stop being mad, thank you for your attention to the issue, and I look forward to the fixes 😅
Why is this issue closed? With the new version I still experience a lot of dangling threads and 100% consumption of one core on my machines. This behavior appears after the workers have been running for some time. I don’t really want to restart my workers periodically to reduce useless CPU consumption.
@kir-titievsky, the solution you mentioned seems to work at first, but the problem with this approach seems to be that the CPU usage goes up every time the UNAVAILABLE exception is handled this way. Thus, a fix seems to be necessary.
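One way to check whether handling the exception leaves extra threads behind is to count live threads by name before and after a recovery. The sketch below uses only the standard library; `count_threads_by_name` is a hypothetical helper, and the thread name to look for is taken from the stack trace earlier in this issue ("Consumer helper: consume bidirectional stream"), which may differ in other versions.

```python
import collections
import threading


def count_threads_by_name():
    """Return a Counter mapping thread name to the number of live threads."""
    return collections.Counter(t.name for t in threading.enumerate())
```

Logging this count periodically from the worker process would show whether the number of consumer helper threads keeps growing each time the stream is reopened.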
We’re able to reproduce this consistently within 1-2 minutes of starting a snippet virtually identical to the one on https://cloud.google.com/pubsub/docs/pull#pubsub-pull-messages-async-python. There are no messages published on the topic. Our stacktrace is a bit different, though: