cloud-provider-azure: IPv6 + Calico + out-of-tree cloud-provider is broken

While moving all CAPZ templates to the out-of-tree CCM, I was not able to get a reliably passing test for IPv6. The test is very flaky (it fails most of the time). Looking into it, I found that kubelet can’t patch the status of one of the kube-system pods (sometimes kube-apiserver, sometimes kube-scheduler, sometimes kube-controller-manager). One example of the error can be seen here. The error is:

	Pod "kube-controller-manager-capz-e2e-4laer3-ipv6-control-plane-6v7dn" is invalid: [status.podIPs: Invalid value: []core.PodIP{core.PodIP{IP:"2001:1234:5678:9abc::5"}, core.PodIP{IP:"2001:1234:5678:9abc::5"}}: may specify no more than one IP for each IP family, status.podIPs[1]: Duplicate value: core.PodIP{IP:"2001:1234:5678:9abc::5"}]
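For context, the apiserver’s pod validation allows at most one IP per family in status.podIPs. A minimal Go sketch of that rule (simplified, not the actual upstream validation code) shows why the duplicate IPv6 address above is rejected:

```go
package main

import (
	"fmt"
	"net"
)

// validatePodIPs mirrors the "may specify no more than one IP for each IP
// family" rule: at most one IPv4 and one IPv6 address, with no duplicates.
// Simplified sketch only; not the upstream apiserver validation code.
func validatePodIPs(podIPs []string) error {
	seen := map[string]string{} // family ("IPv4"/"IPv6") -> first IP seen
	for i, ip := range podIPs {
		parsed := net.ParseIP(ip)
		if parsed == nil {
			return fmt.Errorf("status.podIPs[%d]: invalid IP %q", i, ip)
		}
		family := "IPv6"
		if parsed.To4() != nil {
			family = "IPv4"
		}
		if first, dup := seen[family]; dup {
			return fmt.Errorf("status.podIPs[%d]: duplicate value %q (already have %s address %q)", i, ip, family, first)
		}
		seen[family] = ip
	}
	return nil
}

func main() {
	// The failing patch from the issue: the same IPv6 address reported twice.
	fmt.Println(validatePodIPs([]string{
		"2001:1234:5678:9abc::5",
		"2001:1234:5678:9abc::5",
	}))
}
```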

The same error does not repro with the in-tree cloud provider (see a passing IPv6 test with CAPZ and the in-tree cloud provider here).

The cloud-provider-azure test for IPv6 is also failing, although the failure is different: https://testgrid.k8s.io/provider-azure-cloud-provider-azure#cloud-provider-azure-master-ipv6-capz

About this issue

  • State: closed
  • Created a year ago
  • Comments: 38 (36 by maintainers)

Most upvoted comments

@CecileRobertMichon @lzhecheng please check https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/3705-cloud-node-ips. This is WIP and a known, complicated problem; we also have some amendments because of the complexity: https://github.com/kubernetes/enhancements/pull/3898

Please check the KEP to see if your case is covered, and if not, please let us know.

Let me try without it.

Update: the e2e test passed several times in a row when node-ip is not set: kubernetes-sigs/cluster-api-provider-azure#3221

Glad to know that. I think CNM doesn’t need to change because, as you said, IPv4 first and then IPv6 is the expected behaviour for CAPZ dual-stack and IPv6-only.

Update: I think I found the root cause: kubelet is not properly processing host IPs. The OOT cloud provider behaves a little differently from before, which directly leads to the failure, but I think it is kubelet that should handle the situation. I will do further verification tomorrow, and then make a fix and a detailed root-cause analysis.
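If kubelet derives a host-network pod’s status.podIPs directly from the node’s addresses, a node address list that repeats the same IPv6 address yields exactly the invalid patch above. Here is a hedged sketch of the kind of normalization that would avoid it — dedupe to one address per family, IPv4 before IPv6 (the expected CAPZ ordering mentioned above). The function name and shape are illustrative, not kubelet’s actual implementation:

```go
package main

import (
	"fmt"
	"net"
)

// normalizeHostIPs keeps at most one address per IP family, IPv4 before IPv6,
// so a host-network pod's status.podIPs always passes apiserver validation.
// Illustrative only; not kubelet's actual code.
func normalizeHostIPs(nodeIPs []string) []string {
	var v4, v6 string
	for _, ip := range nodeIPs {
		parsed := net.ParseIP(ip)
		switch {
		case parsed == nil:
			continue // skip unparseable entries
		case parsed.To4() != nil && v4 == "":
			v4 = ip // first IPv4 wins
		case parsed.To4() == nil && v6 == "":
			v6 = ip // first IPv6 wins
		}
	}
	out := []string{}
	if v4 != "" {
		out = append(out, v4)
	}
	if v6 != "" {
		out = append(out, v6)
	}
	return out
}

func main() {
	// The duplicate IPv6 from the failing test collapses to a single entry.
	fmt.Println(normalizeHostIPs([]string{"2001:1234:5678:9abc::5", "2001:1234:5678:9abc::5"}))
}
```

Taking the first address seen per family keeps the IPv4-then-IPv6 ordering stable no matter how many times the cloud provider reports each address.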

@lzhecheng yes, but you’ll need to create a template based on the custom-builds.yaml template that uses out-of-tree + IPv6 (that’s not a combination of features we have in the templates in the repo, unfortunately): https://capz.sigs.k8s.io/developers/kubernetes-developers.html#kubernetes-117

@lzhecheng the 3rd control plane node is not coming up because of the “invalid pod IPs” issue on the 2nd control plane node: Cluster API checks the health of all control plane components before scaling up, and in this case the check is failing because the scheduler pod is not healthy.