google-cloud-python: Cloud Functions & Storage: fails intermittently with ProtocolError + ConnectionResetError
This is a cross post originally detailed at https://issuetracker.google.com/issues/113672049
Essentially, the problem is that in a Google Cloud Functions Python endpoint, the google-cloud-storage API intermittently throws a ProtocolError and ConnectionResetError when getting a blob.
About this issue
- State: closed
- Created 6 years ago
- Reactions: 3
- Comments: 37 (13 by maintainers)
Two weeks ago I started getting ConnectionResetError on roughly 10% of calls to blob.download_to_filename() from a GCE instance. I've tried wrapping download_to_filename with Retry but I am still getting the same error.
Is there something wrong with my code? How can I verify that the ConnectionResetError was caught and that the download was actually retried several times before failing?
@brianmhunt OK, good to know. Here is a workaround:
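A minimal sketch of that kind of workaround, assuming google-api-core's Retry helper: treat ConnectionResetError, requests' ConnectionError, and urllib3's ProtocolError as transient, and log every caught exception so the function logs show whether retries actually happened. The helper names and backoff constants below are illustrative, not the exact snippet that was posted.

```python
import logging

import requests.exceptions
import urllib3.exceptions
from google.api_core import retry
from google.cloud import storage

logging.basicConfig(level=logging.INFO)

# Treat connection-level failures as transient.
_is_transient = retry.if_exception_type(
    ConnectionResetError,
    requests.exceptions.ConnectionError,
    urllib3.exceptions.ProtocolError,
)


def _log_retry(exc):
    # Called for every caught exception before backing off, so each failed
    # attempt is visible in the function's logs.
    logging.warning("Transient error, will retry: %r", exc)


# Exponential backoff: 1s, 2s, 4s, ... capped at 30s, giving up after 120s.
_retry = retry.Retry(
    predicate=_is_transient,
    initial=1.0,
    maximum=30.0,
    multiplier=2.0,
    deadline=120.0,
    on_error=_log_retry,
)


def download(bucket_name, blob_name, filename):
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    # Wrapping the bound method re-issues the whole request on failure.
    _retry(blob.download_to_filename)(filename)
```

Because the wrapper covers the whole download call, a reset connection triggers a fresh request instead of surfacing the exception to the caller.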
@frankyn This is a request for the same feature as "[Python] Storage: automatic retry behavior for transient server failures (exponential backoff + jitter)" in our feature backlog (we would just need to ensure that ProtocolError and ConnectionError are tracked as transient errors for that feature).
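For reference, a tiny sketch of the "exponential backoff + jitter" behavior that feature request describes (full jitter; the constants here are illustrative, not what the library ships):

```python
import random


def backoff_with_jitter(attempt, base=1.0, cap=32.0):
    """Seconds to sleep before retry number `attempt` (0-based).

    Full jitter: pick uniformly between 0 and the exponentially growing cap,
    so many clients retrying at once do not hammer the server in lockstep.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


# attempt 0 -> up to 1s, attempt 3 -> up to 8s, attempt 6 and later -> up to 32s
```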
I get the same error as OP, @brianmhunt and @arvindnrbt: Connection reset by peer. It happens during storage.bucket(bucket).get_blob(path) and bigquery_client.insert_rows(table, rows_to_insert). This is running on Google Cloud Functions with Python 3.7 and google-cloud-storage==1.11.0. It does not happen all the time, about a 10% failure rate. The function is deployed to us-east1 (I also tried us-central1, about the same).

We are working to correct the issues with the Python libraries' retry strategy and will continue to post updates on the ongoing work in https://github.com/googleapis/google-cloud-python/issues/9298.
As stated by @tseaver, the workaround is the following:
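A sketch of how such a workaround looks at the call site, assuming a plain bounded retry loop inside the Cloud Function (the entry-point name, bucket/path values, and MAX_ATTEMPTS are illustrative, not necessarily the snippet that was originally posted):

```python
import time

import requests.exceptions
import urllib3.exceptions
from google.cloud import storage

MAX_ATTEMPTS = 5
TRANSIENT = (
    ConnectionResetError,
    requests.exceptions.ConnectionError,
    urllib3.exceptions.ProtocolError,
)


def fetch_blob(request):  # hypothetical HTTP Cloud Function entry point
    client = storage.Client()
    for attempt in range(MAX_ATTEMPTS):
        try:
            blob = client.bucket("my-bucket").get_blob("path/to/object.pdf")
            blob.download_to_filename("/tmp/object.pdf")
            return "ok"
        except TRANSIENT as exc:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            # Back off briefly, then re-issue the whole request.
            time.sleep(2 ** attempt)
```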
Thank you for your patience.
After a lot of research and interactions with G support, it turns out that, indeed, connections may either time out (our case, a resumable upload) or just disconnect sporadically - this happens at other cloud storage providers too. No big deal, as ConnectionResetError (104) is a retriable error, but this must be handled on the client side.
Unfortunately google-client-python - and many others - delegates exponential backoff to lower-level dependencies - signaled by the deprecated num_retries - which means less control over when retries should trigger. And those dependencies do not treat 104 as retriable.

If you search out there, you will see there is a lot of debate (resistance?) over where and how best to address this: in the googleapi libs, in requests, or in urllib3 (I even read this may be Py3 specific). For me, this is as simple as adding 104 to the list of transient errors (alongside the 500-505 range or so), but I may be oversimplifying. Despite the numerous posts - many of them recent - there is no real resolution.

If you can't wait for a resolution or patch your own fork, you can look at gcsfs, which we use for streaming from/to GCS (from GCF); a small usage sketch follows below. The ConnectionResetError is retriable there - it does log an exception (it shows up as an Error in Stackdriver logging) but the retries do happen and the function does not end abruptly.

@anyone-watching, is this still occurring? We were considering migrating from AWS Lambda but this may hold us up.
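To illustrate the gcsfs approach mentioned above, a minimal streaming read; bucket and object names are placeholders and credentials are resolved from the runtime environment:

```python
import gcsfs

# gcsfs retries transient connection errors internally, logging the exception
# but keeping the function alive (as described in the comment above).
fs = gcsfs.GCSFileSystem()

with fs.open("my-bucket/path/to/object.pdf", "rb") as f:
    data = f.read()
```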
@tseaver Thanks. The problem was not the time limit: it can fail within the first ~5 seconds.
@brianmhunt ConnectionResetError seems like the kind of error one might see when the VM is being torn down. Can you tell whether your function is failing due to a time limit? If so, there isn't really much we can do in google-cloud-storage to mitigate the issue.

@crwilcox @frankyn @jkwlui @tseaver (Is there a storage team alias?)
Would you mind looking into the internal issue? It’s been closed as ‘Won’t fix - not reproducible’, but folks have commented here and over on the issue since then.
Thanks, I’d like to keep this discussion as much as possible through Github. So if someone else hits a similar issue they can find it later.
Is PdfReader wrapping around the google-cloud-storage package? Could you share a portion of that code as well as PdfWriter?