google-cloud-python: BigQuery: DEFAULT_RETRY doesn't catch 'rateLimitExceeded'.
- OS: macOS 10.14.3
- Python: 3.7.3
- google-cloud-bigquery: 1.15.0
I’m trying to reproduce the condition where BigQuery is executing 100 concurrent queries and refuses to execute other jobs, raising `Forbidden: 403 Exceeded rate limits: too many concurrent queries for this project_and_region`.
Let’s say that my BigQuery instance is already executing 100 queries and I request a new job execution using `client.query()`. It seems to me that `DEFAULT_RETRY` has been designed to retry the operation when the `rateLimitExceeded` error is received, but it doesn’t seem to work. What happens is that `client.query()` returns a job with the error already set, and `job.result()` finally throws the error.

What I’d like to happen is that `client.query()` understands that there are concurrent queries and retries the query later, according to the retry object.
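To make the failure mode concrete, here is a minimal sketch of the call pattern (the query text is a placeholder and default credentials are assumed):

```python
# With ~100 queries already running in the project, client.query() comes back
# in a second or two with the job's error already set, instead of retrying
# per DEFAULT_RETRY; job.result() then raises the 403.
from google.cloud import bigquery
from google.cloud.bigquery.retry import DEFAULT_RETRY

client = bigquery.Client()
job = client.query("SELECT 1", retry=DEFAULT_RETRY)  # placeholder query

result = job.result()  # raises Forbidden: 403 rateLimitExceeded
```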
I see the problem now: the `retry` logic gets triggered only if an exception gets raised; this exception might get raised in the `api_request` method defined in the core library (here). Unfortunately, even when the rate limit is exceeded, the API response returns `status_code = 200`! Therefore no exception is raised and the retry logic is not triggered.

An API that returns `status_code = 200` with a 403 error as a string in the JSON response seems like a badly designed API to me… do you have any control over the BigQuery APIs?
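To illustrate the mechanism with a standalone sketch (this is not the library's code; `api_request` below is a stand-in for the HTTP call):

```python
# google.api_core.retry.Retry consults its predicate only when the wrapped
# callable raises. A 200 OK response with the error buried in the JSON body
# returns normally, so the predicate is never even consulted.
from google.api_core import exceptions
from google.api_core.retry import Retry, if_exception_type

retry = Retry(predicate=if_exception_type(exceptions.Forbidden))

def api_request():
    # Stand-in for the real HTTP call: the server answers 200 OK with the
    # 403 'rateLimitExceeded' as a string in the payload, so nothing raises.
    return {"status": {"errorResult": {"reason": "rateLimitExceeded"}}}

response = retry(api_request)()  # returns immediately; no retry happens
```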
@ralbertazzi Thank you for the code to reproduce.
Unfortunately, this is expected behavior. The `rateLimitExceeded` error you are encountering is due to a limit on the number of concurrent `INTERACTIVE` priority queries. In the backend, the job is successfully created, but then fails. This means we can’t retry the job in the same way we retry HTTP error codes.

To work around this restriction, it’s recommended that you run your queries with `BATCH` priority instead, as in the sketch below.
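A hedged sketch of that workaround (the query text is a placeholder):

```python
# Submit the query with BATCH priority so BigQuery queues it when resources
# are busy, instead of counting it against the concurrent INTERACTIVE limit.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH

job = client.query("SELECT 1", job_config=job_config)  # placeholder query
rows = job.result()  # blocks until the batch job is scheduled and finishes
```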
Just to be sure, I tried with a 1-hour deadline:
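(The original snippet isn't preserved here; what follows is a plausible reconstruction, assuming `DEFAULT_RETRY` is extended via `with_deadline`:)

```python
# Extend DEFAULT_RETRY's total deadline to one hour and pass it explicitly.
from google.cloud import bigquery
from google.cloud.bigquery.retry import DEFAULT_RETRY

client = bigquery.Client()
one_hour_retry = DEFAULT_RETRY.with_deadline(3600)  # deadline in seconds

job = client.query("SELECT 1", retry=one_hour_retry)  # placeholder query
result = job.result()  # still raises Forbidden almost immediately
```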
As I already said in the previous post, the `client.query()` call returns in 1-2 seconds (it doesn’t wait for 1 hour) and the returned job contains the error. Calling `job.result()` finally raises the Forbidden error.

I also tried putting a breakpoint in the `_should_retry` function defined here and it never gets triggered. So I think the problem is not wrong retry logic, but the retry logic never being called at all.
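For reference, a paraphrase (not verbatim; see the linked source for the exact code) of that predicate's shape: it only inspects a raised exception's error reason, so it can never fire if no exception reaches the `Retry` wrapper.

```python
# Approximate shape of _should_retry in google/cloud/bigquery/retry.py.
_RETRYABLE_REASONS = frozenset(
    ["rateLimitExceeded", "backendError", "internalError", "badGateway"]
)

def _should_retry(exc):
    # Only called for raised exceptions; a job that is created successfully
    # and fails later never raises here, so this predicate never runs.
    if not hasattr(exc, "errors") or len(exc.errors) == 0:
        return False
    return exc.errors[0]["reason"] in _RETRYABLE_REASONS
```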