cluster-api-provider-azure: Intermittent failures in capz with StatusCode=0 while creating azure resources
/kind bug
While trying to create a workload cluster, there have been intermittent failures when CAPZ tries to list, get, or create an Azure resource as part of reconciliation.
There are two different categories of errors that can be seen in CAPZ:
- StatusCode=0 -- Original Error: context deadline exceeded
- StatusCode=0 -- context canceled
For example:
E1222 12:15:30.961325 1 controller.go:326] "msg"="Reconciler error" "error"="failed to reconcile AzureMachine: failed to reconcile AzureMachine service inboundnatrules: failed to create resource sh-2646368-221222-023620-rg/tkg-upgrade-9864azure-antrea-control-plane-v1-24-6-vmware-qbkbx (service: inboundnatrules): network.InboundNatRulesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: context deadline exceeded" "azureMachine"={"name":"tkg-upgrade-9864azure-antrea-control-plane-v1-24-6-vmware-qbkbx","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="tkg-upgrade-9864azure-antrea-control-plane-v1-24-6-vmware-qbkbx" "namespace"="default" "reconcileID"="62fcdc95-5180-4d22-a464-5fbed9437f87"
E1226 21:18:10.928108 44 controller.go:326] "msg"="Reconciler error" "error"="client rate limiter Wait returned an error: context canceled" "azureCluster"={"name":"default-20004","namespace":"default"} "controller"="azurecluster" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureCluster" "name"="default-20004" "namespace"="default" "reconcileID"="60095f02-adac-4885-b1b0-e8977db79c60"
E1226 21:18:10.931279 44 controller.go:326] "msg"="Reconciler error" "error"="failed to create scope: failed to configure azure settings and credentials for Identity: failed to create copied AzureIdentity default-20004-default-cluster-identity in capz-system: client rate limiter Wait returned an error: context canceled" "azureCluster"={"name":"default-20004","namespace":"default"} "controller"="azurecluster" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureCluster" "name"="default-20004" "namespace"="default" "reconcileID"="63711da9-357a-43bb-9354-53df525837d5"
As a side effect, a condition with severity gets written to the AzureCluster, which bubbles all the way up to the Cluster object, though it recovers on the next reconcile.
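For context (this is my own illustration, not code from CAPZ), this is roughly how a reconciler records such a condition with Cluster API's conditions helpers; the reason and message values below are made up:

```go
// Minimal sketch using sigs.k8s.io/cluster-api's conditions helpers to show
// how a "false" condition with a severity is recorded on an object; Cluster API
// then mirrors/summarizes such conditions up to the owning Cluster.
package main

import (
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

func main() {
	// Stand-in object; CAPZ would do this on the AzureCluster/AzureMachine.
	cluster := &clusterv1.Cluster{}

	// Record a failed condition with Warning severity (illustrative values).
	conditions.MarkFalse(cluster, clusterv1.ReadyCondition, "TransientAzureAPIError",
		clusterv1.ConditionSeverityWarning, "StatusCode=0: context deadline exceeded")

	for _, c := range cluster.GetConditions() {
		fmt.Printf("type=%s status=%s severity=%s reason=%s\n", c.Type, c.Status, c.Severity, c.Reason)
	}
}
```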
What steps did you take and what happened:
Tried creating a workload cluster using CAPZ v1.5.3, v1.6.0, and a build from the main branch. All of them hit the above issues intermittently.
What did you expect to happen: Workload cluster creation should not hit this intermittent issue, which causes failures especially during upgrades. The issue is highly reproducible, which makes network slowness an unlikely cause.
Anything else you would like to add:
Environment:
- cluster-api-provider-azure version:
- Kubernetes version: (use kubectl version):
- OS (e.g. from /etc/os-release):
About this issue
- State: closed
- Created a year ago
- Comments: 15 (15 by maintainers)
StatusCode=0 from the Azure Go SDK indicates a zero value for the StatusCode field of the Response structure, which means it never received a response from the Azure API, most likely due to cancelation. Since each reconcile loop in CAPZ has a deadline set on the topmost context, it is expected that from time to time a reconciliation loop will fail with context deadline exceeded and return a StatusCode=0. However, since this case is expected, it seems like it should not always be considered a failure, but rather an expected transient state of the reconciler loop that should be requeued and will resolve on subsequent reconciles. It seems like we should not mutate the ready condition when the reconciler encounters a context deadline exceeded.
In my case I tracked this down to a DNS problem (yeah, it's not a meme) when running in kind. Using the embedded DNS of Docker with a search domain in resolv.conf coming from my system, I can see the DNS queries for my https://gs-fciocchetti-d4ccac8c.westeurope.cloudapp.azure.com:6443/api?timeout=10s taking a very long time to resolve, due to lookups for the .lan domain, probably exceeding the 10s timeout. So, at least in my case, this was a localized problem with the Docker embedded DNS and my local DNS setup. It might not be the issue you are seeing at all, just the same symptoms.