google-cloud-go: pubsub: Receive sometimes deadlocks on codes.Unavailable error
Client
PubSub
Describe Your Environment
Alpine:3.7 on GCE
Expected Behavior
Receive retries and eventually returns an error
Actual Behavior
Receive hangs indefinitely, never returning an error or continuing to process messages. As a result, the application using the pubsub client with Receive is completely locked up, and I must restart the Docker container.
Further Description
I have noticed that messages periodically stop processing when using the pubsub client in my application. I am handling Receive errors properly; however, an error is never returned. I then added logging to the Receive callback in my app. Once messages stop processing, nothing is logged within the callback and Receive is completely locked up. I am not sure how to handle this.
I also added logging within the pubsub isRetryable(err error) bool function, and each time I experience this there seems to be a corresponding Unavailable log:
Unavailable The service was unable to fulfill your request. Please try again. [code=8a75]
Oddly, it seems that isRetryable should return true and retry in this case, so I am unsure where the lock is happening. An immediate restart of the Docker container resolves the issue, but only for a while.
This seems similar to issue: https://github.com/GoogleCloudPlatform/google-cloud-go/issues/1156
However, in that issue the reported error involves a Context error, which I am not seeing, so I think they may be separate bugs.
Example Code
Here is an example of a simple subscriber package I created that uses pubsub client:
https://gist.github.com/erickertz/2090853580dcead48682886b2cb7f0d4
About this issue
- State: closed
- Created 6 years ago
- Comments: 52 (28 by maintainers)
Thanks so much for those dumps; they are very useful. We suspect that it may be a gRPC issue, noting that the picker seems stuck in each of the RPC calls. It may be caused by https://github.com/grpc/grpc-go/issues/2341 or https://github.com/grpc/grpc-go/issues/2340: I’m working on a fix for those now and will follow up as soon as it’s out.
@erickertz Great to hear. I’ll leave this issue open over the weekend and close early next week if y’all are not seeing anything suspicious. Note that you should get the latest code as of now, since we just recently fixed a deadlock as well as removed Canceled from the retry logic as described in https://github.com/GoogleCloudPlatform/google-cloud-go/issues/1156.
I think I may have found the source of the problem. See https://code-review.googlesource.com/c/gocloud/+/33711 for a fix. Please try with this updated code and let me know if y’all still see issues. Sincere apologies for the inconvenience, it was my fault.