gcsfs: Retry after HttpError code 400

Google Cloud Storage occasionally returns an HTTP Error 400, which is nominally a ‘bad request’ (see the Google Cloud docs on HTTP response 400). However, this happens on requests that have worked previously and that work again after retrying. I’ve seen these spurious HTTP Error 400s when calling gcs.du and when using dask to read data from Google Cloud.

The error message from GCP is: Error 400 (Bad Request)! That's an error. Your client has issued a malformed or illegal request. That's all we know.

Monkey-patching gcsfs.utils.is_retriable fixes the issue for me:

import gcsfs

# Override is_retriable.  Google Cloud sometimes throws
# an HttpError with code 400.  gcsfs does not consider this retriable,
# but in practice the same request succeeds when retried.

def is_retriable(exception):
    """Returns True if this exception is retriable."""
    errs = list(range(500, 505)) + [
        # Jack's addition.  Google Cloud occasionally throws Bad Requests for no apparent reason.
        400,
        # Request Timeout
        408,
        # Too Many Requests
        429,
    ]
    errs += [str(e) for e in errs]
    if isinstance(exception, gcsfs.utils.HttpError):
        return exception.code in errs

    return isinstance(exception, gcsfs.utils.RETRIABLE_EXCEPTIONS)

gcsfs.utils.is_retriable = is_retriable

In a perfect world, I guess the best solution would be to ask Google Cloud to not throw spurious HTTP Error 400s. But perhaps a pragmatic approach is to modify gcsfs to retry after HTTP Error 400s 😃
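For anyone who would rather not patch gcsfs internals, the same idea can be sketched as a retry wrapper around the failing call. This is only an illustration under stated assumptions: `FakeHttpError`, `RETRIABLE_CODES` and `call_with_retries` are hypothetical names that are not part of gcsfs, and the only thing assumed about the real exception is that it exposes a `code` attribute, as the snippet above does.

```python
import random
import time


class FakeHttpError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


# Status codes treated as transient in this sketch.  400 is included only
# because of the spurious Bad Requests described in this issue.
RETRIABLE_CODES = {400, 408, 429, 500, 501, 502, 503, 504}


def call_with_retries(func, *args, retries=5, base_delay=0.5,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff on retriable codes."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            code = getattr(exc, "code", None)
            # The code may be an int or a string, so normalise before comparing.
            retriable = str(code).isdigit() and int(code) in RETRIABLE_CODES
            # Give up on non-retriable errors or once attempts are exhausted.
            if not retriable or attempt == retries - 1:
                raise
            # Exponential backoff plus a little jitter before the next attempt.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Something like `call_with_retries(gcs.du, path)` would then retry the whole call. Unlike the monkey-patch, this retries outside gcsfs rather than inside its own retry loop, so it may repeat more work than necessary.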

Environment:

  • Dask version: 2.28.0
  • Python version: 3.8.5
  • Operating System: Ubuntu 20.04 on a Google Cloud VM
  • Install method: conda

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

I think I fixed it! Was a tricky bug to find.

Thanks loads for the replies! I don’t have logs to hand right now, but if I come across this problem again I’ll be sure to follow up here with more details (including logs and details of the project).

Hi folks,

HTTP 400 is not considered retriable, but I have read through the background internal issue (which I’m ignoring because it concerns a 410/404*; please correct me if it’s still related).

I’m not convinced that the 400 is a service issue and would like to validate in the backend (if possible).

Do y’all have request logs with body payload for the transient 400 errors?

If not: if this occurred in the last two weeks, could you send me an email at coderfrank@google.com with the following:

  1. Project id
  2. bucket name
  3. Time frame of 400 failure

Thank you for your patience.