gcsfs: Retry after HttpError code 400

Google Cloud Storage occasionally throws an HTTP Error 400, which is nominally a ‘bad request’ (see the Google Cloud docs on HTTP response 400). But it happens on requests that have worked previously and that work again after retrying. I’ve seen these spurious HTTP Error 400s when calling gcs.du and when using dask to read data from Google Cloud.

The error message from GCP is: Error 400 (Bad Request)! That's an error. Your client has issued a malformed or illegal request. That's all we know.

Monkey-patching gcsfs.utils.is_retriable fixes the issue for me:

import gcsfs

# Override is_retriable.  Google Cloud sometimes throws
# an HttpError with code 400.  gcsfs does not consider this retriable.
# But in practice it is retriable!

def is_retriable(exception):
    """Returns True if this exception is retriable."""
    errs = list(range(500, 505)) + [
        # Jack's addition.  Google Cloud occasionally throws Bad Requests for no apparent reason.
        400,
        # Request Timeout
        408,
        # Too Many Requests
        429,
    ]
    errs += [str(e) for e in errs]
    if isinstance(exception, gcsfs.utils.HttpError):
        return exception.code in errs

    return isinstance(exception, gcsfs.utils.RETRIABLE_EXCEPTIONS)

gcsfs.utils.is_retriable = is_retriable
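
For illustration, here is a minimal, self-contained sketch of how a predicate like is_retriable can drive a retry loop with exponential backoff. It does not use gcsfs at all: FakeHttpError, RETRIABLE_CODES, call_with_retries and flaky_request are hypothetical names made up for this example, and the real retry logic inside gcsfs may well differ.

```python
import time


class FakeHttpError(Exception):
    """Hypothetical stand-in for an HTTP error; carries only a status code."""

    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code


# Same codes as the monkey-patch above: 500-504, plus 400, 408 and 429.
RETRIABLE_CODES = {400, 408, 429, 500, 501, 502, 503, 504}


def is_retriable(exception):
    """Return True for status codes we are willing to retry (400 included)."""
    return (
        isinstance(exception, FakeHttpError)
        and int(exception.code) in RETRIABLE_CODES
    )


def call_with_retries(func, retries=5, base_delay=0.1, sleep=time.sleep):
    """Call func(), backing off exponentially while errors are retriable."""
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:
            # Give up on non-retriable errors or once attempts are exhausted.
            if not is_retriable(exc) or attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


calls = []


def flaky_request():
    """Simulated request: fails twice with a spurious 400, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise FakeHttpError(400)
    return "ok"


result = call_with_retries(flaky_request, base_delay=0.01)
```

With 400 in the retriable set, the two simulated failures are absorbed and the third attempt returns normally; a 404 would be re-raised on the first attempt.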

In a perfect world, I guess the best solution would be to ask Google Cloud to not throw spurious HTTP Error 400s. But perhaps a pragmatic approach is to modify gcsfs to retry after HTTP Error 400s 😃

Environment:

Dask version: 2.28.0
Python version: 3.8.5
Operating System: Ubuntu 20.04 on a Google Cloud VM
Install method: conda

About this issue

State: closed
Created 4 years ago
Comments: 17 (8 by maintainers)

Commits related to this issue

Use multithreaded zarr metadata consolidation Consolidating metadata of zarr's is slow and sometimes unreliable (see https://github.com/dask/gcsfs/issues/290). This uses @nbren12's custom multi-threa... — committed to ai2cm/fv3net by deleted user 4 years ago
fix 400 errors on retries in _call The kwargs was being overwritten by ._get_args. If any retries are needed, then on the second iteration datain will always be 'None'. This will cause uploading ste... — committed to ai2cm/gcsfs by nbren12 3 years ago
fix 400 errors on retries in _call (#380) * fix 400 errors on retries in _call The kwargs was being overwritten by ._get_args. If any retries are needed, then on the second iteration datain will ... — committed to fsspec/gcsfs by nbren12 3 years ago
fix 400 errors on retries in _call (#380) * fix 400 errors on retries in _call The kwargs was being overwritten by ._get_args. If any retries are needed, then on the second iteration datain will ... — committed to hanseaston/gcsfs by nbren12 3 years ago
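
The commit messages above describe kwargs being overwritten inside the retry loop, so that the request body is missing on the second attempt. The following is only a rough sketch of that class of bug, not the actual gcsfs code: get_args, send, call_buggy and call_fixed are hypothetical names, and the real _call and _get_args are more involved.

```python
def get_args(**kwargs):
    """Hypothetical helper: split the request body out of kwargs."""
    data = kwargs.pop("data", None)
    return data, kwargs


def send(data, attempt_log):
    """Hypothetical transport: fails once, recording the body it was given."""
    attempt_log.append(data)
    if len(attempt_log) == 1:
        raise ConnectionError("transient failure")
    return data


def call_buggy(attempt_log, retries=2, **kwargs):
    for _ in range(retries):
        # Bug: kwargs is rebound to the stripped dict, so "data" is gone
        # by the time the loop comes around for the retry.
        data, kwargs = get_args(**kwargs)
        try:
            return send(data, attempt_log)
        except ConnectionError:
            continue


def call_fixed(attempt_log, retries=2, **kwargs):
    for _ in range(retries):
        # Fix: leave the caller's kwargs intact and bind the rest elsewhere.
        data, _rest = get_args(**kwargs)
        try:
            return send(data, attempt_log)
        except ConnectionError:
            continue


buggy_log, fixed_log = [], []
buggy_result = call_buggy(buggy_log, data=b"payload")
fixed_result = call_fixed(fixed_log, data=b"payload")
```

In the buggy variant the retry sends an empty body, which a server could plausibly reject as a malformed request; in the fixed variant both attempts carry the same payload.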

Most upvoted comments

I think I fixed it! Was a tricky bug to find.

nbren12 on Apr 28, 2021

Thanks loads for the replies! I don’t have logs to hand right now but if I come across this problem again then I’ll be sure to follow-up here with more details (including logs and details of the project).

JackKelly on Oct 27, 2020

Hi folks,

HTTP 400 is not considered retriable. I read through the background internal issue, but I’m ignoring it here because it concerns 410/404* responses; please correct me if it’s still related.

I’m not convinced that the 400 is a service issue and would like to validate in the backend (if possible).

Do y’all have request logs with body payload for the transient 400 errors?

If not: if this occurred in the last two weeks, could you send me an email at coderfrank@google.com with the following:

Project ID
Bucket name
Time frame of the 400 failures

Thank you for your patience.

frankyn on Oct 27, 2020