azure-cli: az commands can trigger 429 / "too many requests" failures and provide no recourse for recovery.

Describe the bug

Running az commands can generate 429 “Too many requests” errors from Azure (possibly specific to az aks, or possibly all commands – I’ve definitely seen this at random from Azure before). It seems to happen with long-running commands after they have already executed, while az is polling Azure for a result.

Ideally, when this happens, az should exponentially back off (i.e. increase the wait and try again). (Sometimes the 429 response even includes a Retry-After header that tells you exactly how long to wait!)
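For illustration, a minimal sketch of what such a policy could look like. `TooManyRequests`, `backoff_delay`, and `poll_with_retry` are invented names for this sketch, not azure-cli or msrestazure APIs:

```python
import random
import time

class TooManyRequests(Exception):
    """Stand-in for an HTTP 429; retry_after mirrors the Retry-After header."""
    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Honor the server's Retry-After value when present; otherwise use
    capped exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter.
    """
    if retry_after is not None:
        return float(retry_after)
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

def poll_with_retry(poll, max_attempts=5):
    """Keep polling on 429 instead of aborting the whole command."""
    for attempt in range(max_attempts):
        try:
            return poll()
        except TooManyRequests as exc:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the throttling error
            time.sleep(backoff_delay(attempt, exc.retry_after))
```

The key point is that a 429 during polling is treated as a transient condition of the *poll*, not a failure of the underlying operation.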

IMO, the REAL issue is that you get back a failure message and the command aborts with no results – even if the operation actually succeeded (you can’t even just rerun the command at that point). Basically, the command shouldn’t throw a permanent error unless it has actually failed. If the operation is still running and might yet succeed, but you merely failed to poll for a result, you should back off and retry.

Command Name

az aks nodepool add --resource-group MyResourceGroup --cluster-name MyClusterName --os-type Windows --node-vm-size "Standard_B2s" --name window --node-count 2 --kubernetes-version 1.13.12 --min-count 2 --max-count 6 --enable-cluster-autoscaler

Errors:

WARNING: The behavior of this command has been altered by the following extension: aks-preview
ERROR: Deployment failed. Correlation ID: de22582b-9a0c-462b-b15a-7fd3d85d07e2. VMSSAgentPoolReconciler retry failed: autorest/azure: Service returned an error. Status=429 
Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription." Details=[{"code":"TooManyRequests","message":"{\"operationGroup\":\
"HighCostGetVMScaleSet30Min\",\"startTime\":\"2020-01-17T17:29:36.1768987+00:00\",\"endTime\":\"2020-01-17T17:44:36.1768987+00:00\",\"allowedRequestCount\":1329,\"measuredRequestCount\":1419}","target":"H
ighCostGetVMScaleSet30Min"}] InnerError={"internalErrorCode":"TooManyRequestsReceived"}

To Reproduce:

Steps to reproduce the behavior.

  • Run a long-running command that continually polls Azure for a result while your subscription is under heavy load (possibly from other such commands running in parallel?), until the Azure API being called returns an HTTP response with status 429 (“Too many requests”).

Expected Behavior

  • Az.exe shouldn’t fail when the initial command turns out to be successful, because that leaves the user in an unrecoverable state: the command appears to have failed, there are no output results, and rerunning the command also fails because the resource now exists. So az not only fails to handle the 429 itself, it prevents the user from handling it too!

  • Specifically, calls to Azure made by Az.exe that return a 429 status should have transient fault handling baked in – as specified by MSDN’s best practices for cloud applications: all applications that communicate with remote services and resources must be sensitive to transient faults.
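One concrete piece of such transient-fault handling is interpreting the Retry-After header, which per RFC 7231 may be either a delay in seconds or an HTTP-date. A stdlib-only sketch (the function name `parse_retry_after` is illustrative, not an azure-cli API):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, now=None):
    """Parse an HTTP Retry-After header into seconds to wait.

    RFC 7231 allows either delay-seconds ("17") or an HTTP-date.
    Returns None if the value is missing or unparseable.
    """
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (target - now).total_seconds())
```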

Environment Summary

Windows-10-10.0.18362-SP0
Python 3.6.6
Shell: cmd.exe

azure-cli 2.0.80

Extensions:
aks-preview 0.4.27
application-insights 0.1.1

Additional Context

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 20 (7 by maintainers)

Most upvoted comments

FYI – for anyone experiencing this issue: https://github.com/Azure/AKS/issues/1413

Since cli core is already tagging an owner to investigate, will reopen and defer to them to handle issue closure.

The VMSS throttling is still under investigation, we’ve made multiple improvements in the cloud provider for latest 1.15 patch versions and latest CA addon versions. There is an additional set of work we are engaging for ETAs, please look to azure/aks for an issue summary in the coming days.

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/aks-pm

Hi, we’re sending this friendly reminder because we haven’t heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don’t hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

I have a similar problem happening with az storage list. The command will fail with a 429 message and does not appear to respect the throttling:

ValidationError: ("{\"error\":{\"code\":\"ResourceCollectionRequestsThrottled\",\"message\":\"Operation 'Microsoft.Storage/storageAccounts/read' failed as server encountered too many requests. Please try after '17' seconds. Tracking Id is '78db6dda-b7da-45cf-b763-202cdedf94f4'.\"}}")

This cascades down and breaks items that depend on the Azure CLI such as Terraform.

While I believe the source of the problem is the actual backend (the list request is only 4 actual requests), the Azure CLI should have a way to handle this gracefully.

Edit: Removed the Terraform note. They’re using the REST / Go SDK so it appears to be the same issue with simply listing the accounts but not directly related.
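Note that in this case the advertised wait (“Please try after '17' seconds”) is embedded in the error body rather than a header, so honoring it means parsing the message. A sketch, keyed to the exact phrasing quoted above (the function name is hypothetical):

```python
import re

def seconds_from_error_message(message):
    """Pull the advertised wait from a throttling error body.

    Matches the "Please try after 'N' seconds" phrasing seen in the
    ResourceCollectionRequestsThrottled message; returns None if absent.
    """
    m = re.search(r"try after '(\d+)' seconds", message)
    return int(m.group(1)) if m else None
```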

As for item 2, I think it is hard-coded in msrestazure to retry only 4 times. @lmazuel, do you think it is possible to expose this config so that users can control the behavior? Maybe via environment variables.
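A sketch of what an environment-variable override could look like. `AZURE_LRO_MAX_RETRIES` is a hypothetical variable name, not something msrestazure currently reads; the default of 4 matches the hard-coded value mentioned above:

```python
import os

def max_lro_retries(default=4):
    """Read the long-running-operation retry limit from the environment.

    Falls back to the default when the variable is unset, non-numeric,
    or not a positive integer.
    """
    raw = os.environ.get("AZURE_LRO_MAX_RETRIES")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```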

As for item 3, indeed it is good practice to make az commands idempotent. Commands like az storage account create do follow this pattern. @jluk, is it possible for your team to make this command idempotent as requested?
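The idempotent shape is simply "return the existing resource instead of failing when it already exists", so a rerun after a failed poll succeeds. A minimal sketch; `get_resource` and `create_resource` are hypothetical stand-ins for a service's GET and PUT operations, not real SDK calls:

```python
def idempotent_create(name, get_resource, create_resource):
    """Create a resource only if it does not already exist.

    If the resource is already there (e.g. a previous run succeeded but
    the poll failed with a 429), return it and treat the rerun as success.
    """
    existing = get_resource(name)
    if existing is not None:
        return existing  # rerun is a no-op, not an error
    return create_resource(name)
```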

Similar here:

VMSSAgentPoolReconciler retry failed: autorest/azure: Service returned an error. Status=429 Code=\"OperationNotAllowed\" 
Message=\"The server rejected the request because too many requests have been received for this subscription.\" 
Details=[{\"code\":\"TooManyRequests\",\"message\":\"{\\\"operationGroup\\\":\\\"HighCostGetVMScaleSet3Min\\\",\\\"startTime\\\":\\\"2020-01-27T17:20:31.4086198+00:00\\\",\\\"endTime\\\":\\\"2020-01-27T17:22:31.4086198+00:00\\\",
\\\"allowedRequestCount\\\":200,\\\"measuredRequestCount\\\":472}\",\"target\":\"HighCostGetVMScaleSet3Min\"},
{\"code\":\"TooManyRequests\",\"message\":\\{\\\"operationGroup\\\":\\\"HighCostGetVMScaleSet30Min\\\",\\\"startTime\\\":\\\"2020-01-27T17:20:31.4086198+00:00\\\",\\\"endTime\\\":\\\"2020-01-27T17:45:00+00:00\\\",\\\"allowedRequestCount\\\":1002,\\\"measuredRequestCount\\\":3030}\",\"target\":\"HighCostGetVMScaleSet30Min\"}] 
InnerError={\"internalErrorCode\":\"TooManyRequestsReceived\"}

The only difference is that two errors are thrown, for HighCostGetVMScaleSet30Min and for HighCostGetVMScaleSet3Min. I was running Terraform to add a node pool.

Found the issue in the Azure health center; will try disabling the autoscaler for now.

Azure CLI internally uses msrestazure and msrest for long running operation polling.

Judging from the error, the throttling is triggered by HighCostGetVMScaleSet30Min. I checked the code of msrestazure, but it doesn’t seem to handle the Retry-After header of a throttling 429 from the Compute Resource Provider.

https://github.com/Azure/msrestazure-for-python/blob/0a07586816351e3b7abd60a55a8b736e007fb7a8/msrestazure/polling/arm_polling.py#L128-L140

    def _raise_if_bad_http_status_and_method(self, response):
        """Check response status code is valid for a Put or Patch
        request. Must be 200, 201, 202, or 204.

        :raises: BadStatus if invalid status.
        """
        code = response.status_code
        if code in {200, 202} or \
           (code == 201 and self.method in {'PUT', 'PATCH'}) or \
           (code == 204 and self.method in {'DELETE', 'POST'}):
            return
        raise BadStatus(
            "Invalid return status for {!r} operation".format(self.method))

@lmazuel, could you please help confirm?

At the same time, I am reaching out to the ARM and AKS teams to check the root cause of the throttling.