azure-cli: az commands can trigger 429 / "too many requests" failures and provide no recourse for recovery.
Describe the bug
Running az commands can generate 429 "too many requests" exceptions from Azure (possibly related to az aks, or possibly all commands; I've definitely seen this at random from Azure before). It seems this happens with long-running commands after they have already executed, while az is polling Azure for a result.
Ideally, when this happens, az should just back off exponentially (i.e. increase the timeout and try again). Sometimes the 429 response even includes a Retry-After header that tells you exactly how long to wait!
IMO, the REAL issue is that you get back a failure message and the command aborts with no results, even if the underlying operation was successful (you can't even just rerun the command at that point). Basically, the command shouldn't report a permanent error unless it has actually failed. If the operation is still running and might still succeed, and only the poll for its result failed, az should back off and retry.
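For illustration, here is a minimal sketch of the retry behavior being requested: honor Retry-After when present, otherwise back off exponentially. This is not azure-cli's actual code; the URL, initial delay, and retry limit are assumptions.

```python
# Minimal sketch of 429 handling with Retry-After plus exponential backoff.
# Illustrative only -- not azure-cli/msrestazure code; values are assumed.
import time
import requests

def get_with_backoff(url, max_retries=5, initial_delay=2):
    delay = initial_delay
    for _ in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp  # success, or a non-throttling error to handle normally
        # Prefer the server-provided hint (assumed to be delta-seconds here).
        retry_after = resp.headers.get("Retry-After")
        time.sleep(int(retry_after) if retry_after else delay)
        delay *= 2  # exponential backoff when no Retry-After is given
    raise RuntimeError("still throttled after %d attempts" % max_retries)
```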
Command Name
az aks nodepool add --resource-group MyResourceGroup --cluster-name MyClusterName --os-type Windows --node-vm-size "Standard_B2s" --name window --node-count 2 --kubernetes-version 1.13.12 --min-count 2 --max-count 6 --enable-cluster-autoscaler
Errors:
WARNING: The behavior of this command has been altered by the following extension: aks-preview
ERROR: Deployment failed. Correlation ID: de22582b-9a0c-462b-b15a-7fd3d85d07e2. VMSSAgentPoolReconciler retry failed: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription." Details=[{"code":"TooManyRequests","message":"{\"operationGroup\":\"HighCostGetVMScaleSet30Min\",\"startTime\":\"2020-01-17T17:29:36.1768987+00:00\",\"endTime\":\"2020-01-17T17:44:36.1768987+00:00\",\"allowedRequestCount\":1329,\"measuredRequestCount\":1419}","target":"HighCostGetVMScaleSet30Min"}] InnerError={"internalErrorCode":"TooManyRequestsReceived"}
To Reproduce:
Steps to reproduce the behavior.
- Run a long-running command that continually polls Azure for a result while your subscription is under heavy load (possibly from other such commands running in parallel?), until the Azure API being called returns an HTTP response with status 429 ("Too many requests").
Expected Behavior
- Az.exe shouldn't fail when the underlying operation turns out to be successful, because that leaves the user in an unrecoverable state (the operation appears to have failed, there are no output results, and re-running the command also fails because the resource now exists!). So az not only doesn't handle the 429 itself, it prevents the user from handling it too.
- Specifically, calls to Azure made by Az.exe that return a 429 status should have transient fault handling baked in, as specified by the MSDN best practices for cloud applications (a user-side workaround sketch follows below):
All applications that communicate with remote services and resources must be sensitive to transient faults.
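Until az has such handling built in, a caller can approximate transient-fault handling externally. The following is a hedged, user-side sketch only; the retry count, delays, and the strings matched in stderr are assumptions, not a supported pattern, and it only helps for idempotent commands.

```python
# User-side workaround sketch: retry an az invocation that fails due to
# throttling. Retry count, delays, and error matching are assumptions.
import subprocess
import time

def run_az_with_retry(args, max_retries=4, base_delay=10):
    for attempt in range(max_retries):
        result = subprocess.run(["az"] + args, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # Crude throttling detection; real error output may differ.
        if "429" not in result.stderr and "TooManyRequests" not in result.stderr:
            raise RuntimeError(result.stderr)
        time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("az still throttled after %d attempts" % max_retries)

# Hypothetical usage: list storage accounts, retrying on throttling.
# print(run_az_with_retry(["storage", "account", "list"]))
```

As the report points out, simply re-running a non-idempotent create after a spurious failure can itself fail because the resource already exists, which is why retries really belong inside az's own polling.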
Environment Summary
Windows-10-10.0.18362-SP0
Python 3.6.6
Shell: cmd.exe
azure-cli 2.0.80
Extensions:
aks-preview 0.4.27
application-insights 0.1.1
Additional Context
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 20 (7 by maintainers)
FYI – for anyone experiencing this issue: https://github.com/Azure/AKS/issues/1413
Since cli core is already tagging an owner to investigate, will reopen and defer to them to handle issue closure.
The VMSS throttling is still under investigation; we've made multiple improvements in the cloud provider for the latest 1.15 patch versions and the latest CA addon versions. There is an additional set of work we are establishing ETAs for; please look to azure/aks for an issue summary in the coming days.
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/aks-pm
Hi, we’re sending this friendly reminder because we haven’t heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don’t hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
I have a similar problem happening with az storage list. The command will fail with a 429 message and does not appear to respect the throttling. This cascades down and breaks things that depend on the Azure CLI, such as Terraform. While I believe the source of the problem is the actual backend (the list request is only 4 actual requests), the Azure CLI should have a way to handle this gracefully.
Edit: Removed the Terraform note. They're using the REST / Go SDK, so it appears to be the same issue with simply listing the accounts, but it is not directly related.
As for item 2, I think it is hard-coded in msrestazure to only retry 4 times. @lmazuel, do you think it is possible to expose this config so that users can control the behavior? Maybe via environment variables (see the sketch after these comments).
As for item 3, indeed it is good practice to make az commands idempotent. Commands like az storage account create do follow this pattern. @jluk, is it possible for your team to make this command idempotent as requested?
Similar here: the only difference is that two errors are thrown, for HighCostGetVMScaleSet30Min and for HighCostGetVMScaleSet3Min. I was running terraform to add a node pool. Found the issue in Azure health center; will try disabling the autoscaler for now.
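A minimal sketch of the item-2 suggestion above: let an environment variable override the retry count instead of hard-coding it. The variable name AZURE_CLI_LRO_RETRY_TOTAL and the default of 4 are assumptions, not an existing azure-cli or msrestazure setting.

```python
# Hypothetical: read the polling retry count from an environment variable.
# AZURE_CLI_LRO_RETRY_TOTAL is an assumed name, not a real setting.
import os

def get_lro_retry_total(default=4):
    value = os.environ.get("AZURE_CLI_LRO_RETRY_TOTAL")
    try:
        return int(value) if value is not None else default
    except ValueError:
        return default  # fall back to the default on a malformed value
```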
Azure CLI internally uses msrestazure and msrest for long-running-operation polling.
Judging from the error, the throttling is triggered by HighCostGetVMScaleSet30Min. I checked the code of msrestazure, but it doesn't seem to handle the throttling error 429's Retry-After from the Compute Resource Provider.
https://github.com/Azure/msrestazure-for-python/blob/0a07586816351e3b7abd60a55a8b736e007fb7a8/msrestazure/polling/arm_polling.py#L128-L140
@lmazuel, could you please help confirm?
At the same time, I am reaching out to the ARM and AKS teams to check the root cause of the throttling.
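To make the gap concrete, here is a generic sketch of a long-running-operation poll loop that treats a 429 as a transient condition and honors Retry-After instead of failing the whole operation. This is not msrestazure's actual implementation; the status URL, terminal states, and default interval are assumptions.

```python
# Generic Retry-After-aware LRO polling loop; illustrative only.
import time
import requests

TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}

def poll_operation(status_url, default_interval=30):
    while True:
        resp = requests.get(status_url)
        if resp.status_code == 429:
            # Throttled: the operation itself is still running, so wait and
            # poll again rather than surfacing a failure to the user.
            wait = int(resp.headers.get("Retry-After", default_interval))
            time.sleep(wait)
            continue
        resp.raise_for_status()
        status = resp.json().get("status")
        if status in TERMINAL_STATES:
            return status
        time.sleep(default_interval)
```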