cluster-api: CI failure: capi-e2e-release-1-1-1-23-1-24 is failing consistently
The release-1.1 branch job https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.1#capi-e2e-release-1-1-1-23-1-24 has been failing consistently since December 9th.
Logs from the corresponding prow job:
capi-e2e: When upgrading a workload cluster using ClusterClass and testing K8S conformance [Conformance] [K8s-Upgrade] Should create and upgrade a workload cluster and run kubetest | 20m54s
{ Failure /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade.go:115 Failed to run Kubernetes conformance Unexpected error: <*errors.withStack | 0xc001dd0a80>: { error: <*errors.withMessage | 0xc0010a6240>{ cause: <*errors.errorString | 0xc00086ce40>{ s: "error container run failed with exit code 1", }, msg: "Unable to run conformance tests", }, stack: [0x1ad532a, 0x1b1a468, 0x73c37a, 0x73bd45, 0x73b43b, 0x7411c9, 0x740ba7, 0x7621c5, 0x761ee5, 0x761725, 0x7639d2, 0x76fb65, 0x76f97e, 0x1b34e11, 0x515662, 0x46b321], } Unable to run conformance tests: error container run failed with exit code 1 occurred /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade.go:232}
Dec 15 12:41:10.938: INFO: At 2022-12-15 12:31:23 +0000 UTC - event for coredns-74f7f66b6f-s6m5s: {kubelet k8s-upgrade-and-conformance-0v458z-md-0-lcjxn-69fd749f7f-5dlnz} Failed: Failed to pull image "k8s.gcr.io/coredns:v1.8.6": rpc error: code = NotFound desc = failed to pull and unpack image "k8s.gcr.io/coredns:v1.8.6": failed to resolve reference "k8s.gcr.io/coredns:v1.8.6": k8s.gcr.io/coredns:v1.8.6: not found
Anything else you would like to add:
/kind bug
To sum up:
This is fixed in https://github.com/kubernetes-sigs/cluster-api/pull/7787, but I will keep this open for a few days to watch the CI signal and close it afterwards. In the CAPI release-1.1 branch e2e tests the registry was pinned to k8s.gcr.io by default; removing the registry pinning and leaving it empty solved the CI failure reported in this issue.
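For illustration only (the actual change is in the PR above), the kind of pinning being removed looks roughly like this in a kubeadm-based cluster template; the metadata name is made up and only the clusterConfiguration.imageRepository field matters:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: example-control-plane   # illustrative name
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      # Before: pinned, which forces every kubeadm version to pull from this registry.
      # imageRepository: k8s.gcr.io
      # After: left empty so each kubeadm binary uses the default registry it was built for.
      imageRepository: ""
```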
Also, please check the comments/suggestions from @sbueringer here and here to further understand the root cause of the issue and the possible ways to fix it forward in case you are seeing it.
The following patch versions should have the change:
- v1.22.17
- v1.23.15
- v1.24.9
- v1.25.0
kubeadm init in those versions should handle the coredns => coredns/coredns image path for registry.k8s.io correctly, while previous versions of kubeadm handle it correctly for k8s.gcr.io. (That's why pinning the registry to either k8s.gcr.io or registry.k8s.io is not recommended.) I can look into the different cases, but it takes a bit until I have time for that.
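To make that concrete, here is a sketch of the CoreDNS image references involved; the two "not found" entries are taken from the failure logs in this issue, and the two working ones assume the usual coredns/coredns sub-path on both registries:

```yaml
# Illustrative summary, not an exhaustive list of valid tags.
coredns_image_references:
  resolve:
    - k8s.gcr.io/coredns/coredns:v1.8.6        # what an older kubeadm builds for its default registry
    - registry.k8s.io/coredns/coredns:v1.8.6   # what a fixed kubeadm builds for its default registry
  not_found:
    - k8s.gcr.io/coredns:v1.8.6                # fixed kubeadm with the registry pinned to k8s.gcr.io (treated as a custom repo)
    - registry.k8s.io/coredns:v1.8.6           # older kubeadm with the registry pinned to registry.k8s.io (treated as a custom repo)
```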
EDIT: The patch versions I listed for v1.22, v1.23 and v1.24 were one too low; they are updated now.
To confirm my theory:
@lentzi90 Would it be possible to check if the upgrade test works when upgrading to v1.24.9?
@Ankitasw Would it be possible to check if the upgrade test works when upgrading to v1.23.15?
@furkatgofurov7 I don’t know which minor versions you are using in the test, but can you please also retry with the latest Kubernetes patch releases?
We still have an issue with kubeadm v1.22.x, v1.23.x, and v1.24.x binaries that use the old registry. I've opened an issue to follow up on those cases: https://github.com/kubernetes-sigs/cluster-api/issues/7833
/triage accepted /close
It seems that we nailed down this problem across the board. Amazing teamwork!
@CecileRobertMichon This would make sense if the test sets the imageRepository to registry.k8s.io; if it is not set, I would have assumed both test runs should succeed. Is the imageRepo (in KCP) set in this test?
EDIT: Checked the resources in https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/periodic-cluster-api-provider-azure-e2e-workload-upgrade-1-23-1-24-main/1603466731035037696/artifacts/clusters/bootstrap/resources/ and imageRepository is not set in KCP. I wonder where this test run got registry.k8s.io from, as kubeadm should have still used the old registry.
EDIT 2: Okay, I have a working theory what happened: Cluster API now sets the imageRepository to registry.k8s.io for Kubernetes >= 1.22.0, while the kubeadm binary in the node image still expects the old registry, so the CoreDNS image pull fails:
[ERROR ImagePull]: failed to pull image registry.k8s.io/coredns:v1.8.6: output: time="2022-12-15T19:31:16Z" level=fatal msg="pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \"registry.k8s.io/coredns:v1.8.6\": failed to resolve reference \"registry.k8s.io/coredns:v1.8.6\": registry.k8s.io/coredns:v1.8.6: not found"
If I'm correct, this essentially means that Cluster API v1.2.8 and v1.3.0 are only compatible with Kubernetes >= v1.22.17, >= v1.23.15, >= v1.24.9, >= v1.25.0 (if the kubeadm providers are used).
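For anyone doing the same check on their own test artifacts, the field in question sits at this path in the dumped KubeadmControlPlane; a trimmed sketch, everything except the imageRepository path is omitted:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      imageRepository: ""   # absent/empty here means kubeadm (or the provider's defaulting) picks the registry
```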
P.S. The patch versions in https://github.com/kubernetes-sigs/cluster-api/issues/7768#issuecomment-1359379155 were initially one too low, updated now.
We don't pin the imageRepository in CAPA, and even after using the above patch versions we are still getting this error in the CAPA CCM migration tests.
My best guess after trying to replicate this is that there is a clash between the CAPI code that fixes up the CoreDNS image name for older versions and the kubeadm update that uses the new registry for CoreDNS.
The root cause is that the Kubernetes versions used in the e2e tests have been updated to patch releases whose kubeadm includes the registry fix, but there are no kindest/node images available for those versions yet.
I think this should be fixable by specifying the imageRepository in the clusterConfiguration field of the KubeadmControlPlane. I've got a version which does this for the failing 1.1 branch, but I'd like to find out whether it can also fix the issues in CAPZ or CAPA. Starting next week I'll build a kindest/node image with the new versions of 1.23 and 1.24 so I can properly test the fix with CAPD.
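A rough sketch of that workaround, assuming the node image ships a kubeadm that already defaults to registry.k8s.io; the value would be k8s.gcr.io instead for node images with an older kubeadm:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: example-control-plane   # illustrative name
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      # Pin the repository to whatever registry the kubeadm binary in the
      # kindest/node image expects, so its CoreDNS sub-path logic matches.
      imageRepository: registry.k8s.io
```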