cluster-api-provider-azure: Intermittent failures in capz with StatusCode=0 while creating azure resources
/kind bug
While trying to create a workload cluster, there have been intermittent failures when CAPZ tries to list, get, or create an Azure resource as part of reconciliation.
There are two different categories of errors that can be seen in CAPZ:
- StatusCode=0 -- Original Error: context deadline exceeded
- StatusCode=0 -- context canceled
For example:
E1222 12:15:30.961325 1 controller.go:326] "msg"="Reconciler error" "error"="failed to reconcile AzureMachine: failed to reconcile AzureMachine service inboundnatrules: failed to create resource sh-2646368-221222-023620-rg/tkg-upgrade-9864azure-antrea-control-plane-v1-24-6-vmware-qbkbx (service: inboundnatrules): network.InboundNatRulesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: context deadline exceeded" "azureMachine"={"name":"tkg-upgrade-9864azure-antrea-control-plane-v1-24-6-vmware-qbkbx","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="tkg-upgrade-9864azure-antrea-control-plane-v1-24-6-vmware-qbkbx" "namespace"="default" "reconcileID"="62fcdc95-5180-4d22-a464-5fbed9437f87"
E1226 21:18:10.928108 44 controller.go:326] "msg"="Reconciler error" "error"="client rate limiter Wait returned an error: context canceled" "azureCluster"={"name":"default-20004","namespace":"default"} "controller"="azurecluster" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureCluster" "name"="default-20004" "namespace"="default" "reconcileID"="60095f02-adac-4885-b1b0-e8977db79c60"
E1226 21:18:10.931279 44 controller.go:326] "msg"="Reconciler error" "error"="failed to create scope: failed to configure azure settings and credentials for Identity: failed to create copied AzureIdentity default-20004-default-cluster-identity in capz-system: client rate limiter Wait returned an error: context canceled" "azureCluster"={"name":"default-20004","namespace":"default"} "controller"="azurecluster" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureCluster" "name"="default-20004" "namespace"="default" "reconcileID"="63711da9-357a-43bb-9354-53df525837d5"
As a side effect, a condition with severity gets written to the AzureCluster, which bubbles all the way up to the Cluster object, though it recovers on the next reconcile.
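For context (this is my own illustration, not code from CAPZ), this is roughly how a reconciler records such a condition with Cluster API's conditions helpers; the reason and message values below are made up:

```go
// Minimal sketch using sigs.k8s.io/cluster-api's conditions helpers to show
// how a "false" condition with a severity is recorded on an object; Cluster API
// then mirrors/summarizes such conditions up to the owning Cluster.
package main

import (
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

func main() {
	// Stand-in object; CAPZ would do this on the AzureCluster/AzureMachine.
	cluster := &clusterv1.Cluster{}

	// Record a failed condition with Warning severity (illustrative values).
	conditions.MarkFalse(cluster, clusterv1.ReadyCondition, "TransientAzureAPIError",
		clusterv1.ConditionSeverityWarning, "StatusCode=0: context deadline exceeded")

	for _, c := range cluster.GetConditions() {
		fmt.Printf("type=%s status=%s severity=%s reason=%s\n", c.Type, c.Status, c.Severity, c.Reason)
	}
}
```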
What steps did you take and what happened:
Tried creating a workload cluster using CAPZ v1.5.3, v1.6.0, and a build from the main branch. All of them hit the above issues intermittently.
What did you expect to happen: Workload cluster creation should not hit this intermittent issue, which causes failures especially during upgrades. The issue is highly reproducible, which makes network slowness an unlikely cause.
Anything else you would like to add:
Environment:
- cluster-api-provider-azure version:
- Kubernetes version: (use kubectl version):
- OS (e.g. from /etc/os-release):
About this issue
- State: closed
- Created a year ago
- Comments: 15 (15 by maintainers)
StatusCode=0 from the Azure Go SDK indicates a zero value for the StatusCode field of the Response structure, which means it never received a response from the Azure API, most likely due to cancelation. Since each reconcile loop in CAPZ has a deadline set on the topmost context, it is expected that from time to time a reconciliation loop will fail with context deadline exceeded and return a StatusCode=0. However, since this case is expected, it seems like it should not always be considered a failure, but rather an expected transient state of the reconciler loop that should be requeued and will resolve on subsequent reconciles. It seems like we should not mutate the ready condition when the reconciler encounters a context deadline exceeded.
In my case I tracked this down to a DNS problem (yeah, it's not a meme) when running in kind. Using the embedded DNS of Docker with a search domain in resolv.conf coming from my system, I can see the DNS queries for my https://gs-fciocchetti-d4ccac8c.westeurope.cloudapp.azure.com:6443/api?timeout=10s taking a very long time to resolve, due to lookups for the .lan domain, probably exceeding the 10s timeout. So, at least in my case, this was a localized problem with the Docker embedded DNS and my local DNS setup. It might not be the issue you are seeing at all, just the same symptoms.