gcsfs: Retry after HttpError code 400
Google Cloud Storage occasionally throws an HTTP Error 400 (which is nominally a ‘bad request’; see the Google Cloud docs on HTTP response 400). But this happens on requests that have worked previously and that work again after retrying. I’ve seen these spurious HTTP Error 400s when calling gcs.du and when using Dask to read data from Google Cloud Storage.
The error message from GCP is: “Error 400 (Bad Request)! That’s an error. Your client has issued a malformed or illegal request. That’s all we know.”
Monkey-patching gcsfs.utils.is_retriable fixes the issue for me:
import gcsfs

# Override is_retriable. Google Cloud sometimes throws an HttpError with
# code 400, which gcsfs considers to not be retriable. But it is retriable!
def is_retriable(exception):
    """Returns True if this exception is retriable."""
    errs = list(range(500, 505)) + [
        # Jack's addition. Google Cloud occasionally throws Bad Requests
        # for no apparent reason.
        400,
        # Request Timeout
        408,
        # Too Many Requests
        429,
    ]
    errs += [str(e) for e in errs]
    if isinstance(exception, gcsfs.utils.HttpError):
        return exception.code in errs
    return isinstance(exception, gcsfs.utils.RETRIABLE_EXCEPTIONS)

gcsfs.utils.is_retriable = is_retriable
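For context, here’s a short usage sketch of how the patch slots in: apply it once at import time (as above), then construct the filesystem as usual and subsequent calls go through the relaxed retry check. The project and bucket names below are placeholders, not from my actual setup.

fs = gcsfs.GCSFileSystem(project="my-project")  # placeholder project
# du (and other gcsfs calls) should now also be retried on the spurious 400s
print(fs.du("gs://my-bucket/some/prefix"))  # placeholder bucket/prefix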
In a perfect world, I guess the best solution would be to ask Google Cloud to not throw spurious HTTP Error 400s. But perhaps a pragmatic approach is to modify gcsfs to retry after HTTP Error 400s 😃
Environment:
- Dask version: 2.28.0
- Python version: 3.8.5
- Operating System: Ubuntu 20.04 on a Google Cloud VM
- Install method: conda
About this issue
- State: closed
- Created 4 years ago
- Comments: 17 (8 by maintainers)
Commits related to this issue
- Use multithreaded zarr metadata consolidation Consolidating metadata of zarr's is slow and sometimes unreliable (see https://github.com/dask/gcsfs/issues/290). This uses @nbren12's custom multi-threa... — committed to ai2cm/fv3net by deleted user 4 years ago
- fix 400 errors on retries in _call The kwargs was being overwritten by ._get_args. If any retries are needed, then on the second iteration datain will always be 'None'. This will cause uploading ste... — committed to ai2cm/gcsfs by nbren12 3 years ago
- fix 400 errors on retries in _call (#380) * fix 400 errors on retries in _call The kwargs was being overwritten by ._get_args. If any retries are needed, then on the second iteration datain will ... — committed to fsspec/gcsfs by nbren12 3 years ago
- fix 400 errors on retries in _call (#380) * fix 400 errors on retries in _call The kwargs was being overwritten by ._get_args. If any retries are needed, then on the second iteration datain will ... — committed to hanseaston/gcsfs by nbren12 3 years ago
I think I fixed it! It was a tricky bug to find.
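For anyone reading along, here is a minimal sketch of the failure mode described in the commit messages above. It is not the actual gcsfs code; call_with_retries, send_request and TransientError are hypothetical stand-ins that only illustrate the pattern: overwriting the keyword arguments inside the retry loop means the upload payload (datain) is gone on the second attempt, so the retry sends an empty body and the server answers 400. The fix is to build per-attempt arguments without clobbering the originals.

class TransientError(Exception):
    """Stand-in for a retriable error (e.g. a 5xx response)."""

def send_request(path, datain=None):
    """Hypothetical transport call that needs the payload on every attempt."""
    if datain is None:
        raise RuntimeError("400 Bad Request: missing body")
    return "ok"

def call_with_retries(path, retries=3, **kwargs):
    for attempt in range(retries):
        # Keep `kwargs` untouched; build the per-attempt arguments into a
        # separate variable so a retry still sees the original payload.
        # (The bug was the equivalent of `kwargs = ...` here, which dropped
        # `datain` after the first attempt.)
        args = dict(kwargs)
        try:
            return send_request(path, **args)
        except TransientError:
            if attempt == retries - 1:
                raise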
Thanks loads for the replies! I don’t have logs to hand right now, but if I come across this problem again I’ll be sure to follow up here with more details (including logs and details of the project).
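In the meantime, here is the generic logging setup I’d use to capture request-level detail next time it happens. This is just a standard-library sketch and assumes gcsfs emits its diagnostics through Python’s logging module under the "gcsfs" logger name; adjust the name if your version differs.

import logging

logging.basicConfig(
    filename="gcsfs_debug.log",
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
# Assumption: gcsfs logs under the "gcsfs" logger name.
logging.getLogger("gcsfs").setLevel(logging.DEBUG)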
Hi folks,
HTTP 400 is not considered retriable, but I read through the background internal issue (I’m ignoring it because it’s about a 410/404*; please correct me if it’s still related).
I’m not convinced that the 400 is a service issue and would like to validate in the backend (if possible).
Do y’all have request logs with body payload for the transient 400 errors?
If not, and this occurred in the last two weeks, could you send me an email at coderfrank@google.com with the following:
Thank you for your patience.