kubernetes: skip "instance not found" error when reconciling LB backend address pools
What happened?
After upgrading a cluster to Kubernetes 1.20.15 on Azure, I still hit the “instance not found” error in EnsureHostInPool. As a result, one of the backend pool members goes missing, and users have to add the member back manually. Below is the error:
...
I0624 04:33:37.521214 9 azure_backoff.go:90] GetVirtualMachineWithRetry(e05821c2-8cc5-4bfc-8ef9-6580b3130fa4): backoff success
I0624 04:33:37.531306 9 azure_wrap.go:194] Virtual machine "fc2f29a3-4d96-404b-8106-3c5212fcc811" not found
I0624 04:33:37.531341 9 azure_standard.go:754] GetPrimaryInterface(fc2f29a3-4d96-404b-8106-3c5212fcc811, k8s-prod-availability-set) abort backoff
E0624 04:33:37.531361 9 azure_standard.go:827] error: az.EnsureHostInPool(fc2f29a3-4d96-404b-8106-3c5212fcc811), az.VMSet.GetPrimaryInterface.Get(fc2f29a3-4d96-404b-8106-3c5212fcc811, k8s-prod-availability-set), err=instance not found
...
...
E0624 04:33:37.688119 9 azure_loadbalancer.go:194] reconcileLoadBalancer(pegaaddons-nbi-prod/nginx-nbi-prod-nginx-ingress-controller) failed: ensure(pegaaddons-nbi-prod/nginx-nbi-prod-nginx-ingress-controller): backendPoolID(/subscriptions/92dd83a1-b5f4-4718-b7f4-956b91378841/resourceGroups/nfcu-spoke-dlabs-prod-eus-pcf-rg/providers/Microsoft.Network/loadBalancers/pega-internal/backendAddressPools/pega) - failed to ensure host in pool: "instance not found"
I0624 04:33:37.746392 9 azure_backoff.go:304] PublicIPAddressesClient.List(nfcu-spoke-dlabs-prod-eus-pcf-rg) success
I0624 04:33:37.935691 9 azure_backoff.go:285] LoadBalancerClient.List(nfcu-spoke-dlabs-prod-eus-pcf-rg) success
I0624 04:33:37.935740 9 azure_loadbalancer.go:521] get(pegaaddons-nbi-prod/nginx-nbi-prod-nginx-ingress-controller): lb(pega-internal) - found frontend IP config, primary service: true
I0624 04:33:37.935749 9 azure_loadbalancer.go:547] getServiceLoadBalancerStatus gets ingress IP "10.200.217.70" from frontendIPConfiguration "a0ed69935dafb4d5e8698dd90fc6a385-DLABS-PROD-EUS-PCF-LINK-SUBNET" for service "pegaaddons-nbi-prod/nginx-nbi-prod-nginx-ingress-controller"
E0624 04:33:37.935938 9 controller.go:732] failed to update load balancer hosts for service pegaaddons-nbi-prod/nginx-nbi-prod-nginx-ingress-controller: ensure(pegaaddons-nbi-prod/nginx-nbi-prod-nginx-ingress-controller): backendPoolID(/subscriptions/92dd83a1-b5f4-4718-b7f4-956b91378841/resourceGroups/nfcu-spoke-dlabs-prod-eus-pcf-rg/providers/Microsoft.Network/loadBalancers/pega-internal/backendAddressPools/pega) - failed to ensure host in pool: "instance not found"
I0624 04:33:37.936226 9 controller.go:708] Detected change in list of current cluster nodes. New node set: map[057931a8-de07-4dad-88f1-81071b78d4cf:{} 07436e9a-7ade-4ffc-b17b-a735492712ce:{} 08d36d25-dd15-4010-983c-6ab24c0dcc0c:{} 0b0fa97c-a385-4f9d-9794-1a847a695bdf:{} 0b445fd5-a9f7-44b1-a86d-63ab50f869fa:{} 18f2498a-da81-448d-b20e-c647fe11a5fa:{} 1ca104f8-4643-4f6c-acea-71af0e79243c:{} 219db030-f5b6-4c21-b295-6fbea8a8239e:{} 2433f24e-379e-4bd6-a82c-c1248980e27d:{} 25242a0c-5d5b-4fd3-9bc7-737e92ec1751:{} 2620f7b5-df35-43f0-b403-e0b6319815b0:{} 2654a21c-8f08-4414-8515-7dc0146bbeae:{} 2f5a76ae-5670-4783-a6ad-d9ccd57b37c8:{} 2f8ef589-1f13-4fe9-9cd7-bf3346c2d199:{} 2fbdbfdc-fbbd-409b-a7ff-7cd248661a64:{} 34b79c67-781f-4d48-9715-2a172340ebed:{} 3571ccff-f22c-4329-90d4-3eca5914cc34:{} 3af7676f-ec9f-4204-9e9d-2be8b89ded16:{} 3b1e09d8-6dec-459e-9737-10ea315d4d7e:{} 3b668772-5d02-4a93-9743-a5d4a31f9afc:{} 3ee04ab9-9464-420f-bbe1-27f36d96d72c:{} 441cdaa5-de45-4150-96d0-9118941a8bf1:{} 4735674b-2e69-421e-81b3-52a83a1a7ed0:{} 492b7093-92ac-4fc5-86d3-fae88242aa25:{} 4c13f7ea-56fd-417b-b7fd-06a1c365d534:{} 5051ff65-29a6-47f1-bb6b-a08c3e5b677b:{} 5338d828-104f-4542-9366-0c6488c18fd8:{} 54dc8680-3243-4450-a233-d79dffa0aaf6:{} 5746fbd5-a36b-46d7-b83e-1cb7a03ce511:{} 575b6b35-d934-45ff-b9fa-a94fe423c649:{} 5d65f90a-84a6-4e38-b2f5-04dcf7c9b274:{} 5de82f7d-5f35-4338-9c12-d642f751388c:{} 63325ce3-af59-4c65-a1df-9f6ab766fc4a:{} 64e09472-74e8-48e4-97a0-cc5e1fb71fbc:{} 66c221b1-575a-4864-8aff-2688d992b502:{} 695824b6-04a5-4094-86e6-cf5646530644:{} 6ab61ff8-7796-46a4-90d3-c354c937939f:{} 6c00efa4-44e1-4a1f-9763-046bc2085ff8:{} 747ac53e-7b3a-462c-9f5d-74e2cdf548cf:{} 74cba3b8-9105-452f-890f-35725dd90d70:{} 78b5a273-a76c-442d-b3b5-6944bc610140:{} 7ac1223f-0147-45f8-aab4-baa0dfc9d3f8:{} 7d86e9ae-262a-4ad6-aaed-d7e6d345376d:{} 7ed52571-584b-494c-8951-38755851a7ae:{} 806b493a-4e6c-474c-b379-2a0e0fa40f4d:{} 80e7af7f-b6de-488b-8b0c-9a00fec17f18:{} 814691e1-39f9-4719-b60e-4d22dc67cb92:{} 
831d5f90-b212-4ea9-8658-c90e75adbea5:{} 853b1704-03ea-4d20-85f4-7a29f2eac1ac:{} 85c08fc3-c62f-4d08-956d-195cffa21047:{} 86055497-e3a3-486f-b440-cbccf1ac9897:{} 86b52af0-dc98-4794-b4c7-5e161d924f6c:{} 880229a9-45ca-40a2-a9e9-fb255dd9e9aa:{} 8a334730-f894-4d75-8165-d2c901451fea:{} 8c33a3ea-00fa-4419-9e6e-e2549ac3a06a:{} 92d2aa7c-a4e7-4601-aab0-2f27f0b4f080:{} 9465ecb3-85be-45bb-870c-05a4187d5db4:{} 97342c30-1e3e-4bd0-a969-076b724cbc06:{} a3472b7a-a634-4c34-a1e6-40a9223b36f9:{} a443a7ff-5838-4551-8280-c986a6ae70b8:{} a4a1cb30-352b-4f49-8391-4d620dde9e3d:{} a8275578-8942-483b-92f9-658952b8a1c1:{} a85df358-7d8d-4b81-b4d2-6c9194b8c605:{} b2723ef0-840c-46fd-990e-89d537727635:{} bb783360-918a-4b11-9d86-314654b4e4c8:{} bc99656f-ff23-4f48-8986-51deb80cfa58:{} bd782e51-24a1-4324-90fc-b08e585859f9:{} bf2964bd-d1e4-4ee8-92eb-6fca8cc58a12:{} c0869aab-1fa6-4683-8b3d-fe73ef5ea727:{} c88c2230-fd91-4ccd-822c-ba04240a1e92:{} c90abd68-0855-40a3-b656-976fc4f21632:{} c94ed7d0-3bff-4365-94e6-c73d13f90d82:{} cc0ec6ed-7ee3-49b4-990f-3261fe2a0890:{} d37b7e11-95a4-4b63-b535-4aafe7410521:{} d4706d2b-100c-4b56-b636-cd5443a29f7a:{} d7ad4833-e414-42e3-964f-0295ea6b8e02:{} d92363fb-5f8b-4c90-8e12-2937d8a9c2d7:{} e05821c2-8cc5-4bfc-8ef9-6580b3130fa4:{} e147306e-7064-4324-a834-605705b69ba5:{} e20622be-52ad-4ab5-8089-fa0105e16ab1:{} ea68e9b9-1696-441f-8700-81aa5f8eb6fd:{} ec82a525-1868-48ad-bc06-90fb4e50ee2e:{} ef916d93-1720-4324-a283-5552f36b4e54:{} f2498a4f-75e5-4d77-bc28-f127c56b889b:{} f250e133-6a52-4c4f-abd8-f2404ef7f441:{} f46af66a-f552-4bfb-8616-960bcc12f525:{} f8933921-5d08-4e57-bd2e-ac9318e9379f:{} fc9e700c-37e2-4084-8be7-b89757455cd3:{} fcc7e4b4-acb2-41ec-82e8-1976943e7ba3:{}]
The node set in the “Detected change in list of current cluster nodes” message should contain 90 nodes, but here it contains only 89. The symptom matches issue https://github.com/kubernetes-sigs/cloud-provider-azure/issues/789 and its fix #105188, but the fix may be incomplete.
What did you expect to happen?
I expected the “instance not found” error not to affect backend pool member management, and to be ignored somewhere in either https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure_standard.go#L823 or (1.20.15) https://github.com/kubernetes/kubernetes/blob/v1.20.15/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L194
How can we reproduce it (as minimally and precisely as possible)?
Upgrade the cluster and trigger VM recreation. This may not be easy to reproduce, but it has happened several times in our environment.
Kubernetes version
Cloud provider
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (6 by maintainers)
primaryAvailabilitySetName means the cluster is using VMAS. Per the code here, the issue probably still exists.
@lzhecheng could you help with the fix?
/assign @lzhecheng