cluster-api: CI failure: capi-e2e-release-1-1-1-23-1-24 is failing consistently
The release-1.1 branch job https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.1#capi-e2e-release-1-1-1-23-1-24 has been failing consistently since December 9th.
Logs from the corresponding prow job:
capi-e2e: When upgrading a workload cluster using ClusterClass and testing K8S conformance [Conformance] [K8s-Upgrade] Should create and upgrade a workload cluster and run kubetest | 20m54s
{ Failure /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade.go:115 Failed to run Kubernetes conformance Unexpected error: <*errors.withStack | 0xc001dd0a80>: { error: <*errors.withMessage | 0xc0010a6240>{ cause: <*errors.errorString | 0xc00086ce40>{ s: "error container run failed with exit code 1", }, msg: "Unable to run conformance tests", }, stack: [0x1ad532a, 0x1b1a468, 0x73c37a, 0x73bd45, 0x73b43b, 0x7411c9, 0x740ba7, 0x7621c5, 0x761ee5, 0x761725, 0x7639d2, 0x76fb65, 0x76f97e, 0x1b34e11, 0x515662, 0x46b321], } Unable to run conformance tests: error container run failed with exit code 1 occurred /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade.go:232}
Dec 15 12:41:10.938: INFO: At 2022-12-15 12:31:23 +0000 UTC - event for coredns-74f7f66b6f-s6m5s: {kubelet k8s-upgrade-and-conformance-0v458z-md-0-lcjxn-69fd749f7f-5dlnz} Failed: Failed to pull image "k8s.gcr.io/coredns:v1.8.6": rpc error: code = NotFound desc = failed to pull and unpack image "k8s.gcr.io/coredns:v1.8.6": failed to resolve reference "k8s.gcr.io/coredns:v1.8.6": k8s.gcr.io/coredns:v1.8.6: not found
Anything else you would like to add:
/kind bug
To sum up:
This is fixed in https://github.com/kubernetes-sigs/cluster-api/pull/7787, but I will keep this open for a few days to watch the CI signal and close it afterwards. In the CAPI release-1.1 branch e2e tests the registry was pinned to k8s.gcr.io by default; removing the registry pinning and leaving it empty solved the CI failure reported in this issue.
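For illustration only (the actual change is in the PR above), the kind of pinning being removed looks roughly like this in a kubeadm-based cluster template; the metadata name is made up and only the clusterConfiguration.imageRepository field matters:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: example-control-plane   # illustrative name
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      # Before: pinned, which forces every kubeadm version to pull from this registry.
      # imageRepository: k8s.gcr.io
      # After: left empty so each kubeadm binary uses the default registry it was built for.
      imageRepository: ""
```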
Also, please check the comments/suggestions from @sbueringer here and here to further understand the root cause of the issue and the possible ways to fix it forward in case you are seeing it.
The following patch versions should have the change:
- v1.22.17
- v1.23.15
- v1.24.9
- v1.25.0
kubeadm init in those versions should handle the coredns => coredns/coredns image path for registry.k8s.io correctly, while previous versions of kubeadm handle it correctly for k8s.gcr.io. (That's why pinning the registry to either k8s.gcr.io or registry.k8s.io is not recommended.) I can look into the different cases, but it takes a bit until I have time for that.
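To make that concrete, here is a sketch of the CoreDNS image references involved; the two "not found" entries are taken from the failure logs in this issue, and the two working ones assume the usual coredns/coredns sub-path on both registries:

```yaml
# Illustrative summary, not an exhaustive list of valid tags.
coredns_image_references:
  resolve:
    - k8s.gcr.io/coredns/coredns:v1.8.6        # what an older kubeadm builds for its default registry
    - registry.k8s.io/coredns/coredns:v1.8.6   # what a fixed kubeadm builds for its default registry
  not_found:
    - k8s.gcr.io/coredns:v1.8.6                # fixed kubeadm with the registry pinned to k8s.gcr.io (treated as a custom repo)
    - registry.k8s.io/coredns:v1.8.6           # older kubeadm with the registry pinned to registry.k8s.io (treated as a custom repo)
```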
EDIT: The patch versions I listed for v1.22, v1.23 and v1.24 were one too low; they are updated now.
To confirm my theory:
@lentzi90 Would it be possible to check if the upgrade test works when upgrading to v1.24.9?
@Ankitasw Would it be possible to check if the upgrade test works when upgrading to v1.23.15?
@furkatgofurov7 I don’t know which minor versions you are using in the test, but can you please also retry with the latest Kubernetes patch releases?
We still have an issue with kubeadm v1.22.x, v1.23.x, and v1.24.x binaries that use the old registry. I've opened an issue to follow up on those cases: https://github.com/kubernetes-sigs/cluster-api/issues/7833
/triage accepted /close
It seems that we nailed down this problem across the board. Amazing teamwork!
@CecileRobertMichon This would make sense if the test sets the imageRepository to registry.k8s.io; if it is not set, I would have assumed both test runs should succeed. Is the imageRepo (in KCP) set in this test?
EDIT: Checked the resources in https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/periodic-cluster-api-provider-azure-e2e-workload-upgrade-1-23-1-24-main/1603466731035037696/artifacts/clusters/bootstrap/resources/ and imageRepository is not set in KCP. I wonder where this test run got registry.k8s.io from, as kubeadm should have still used the old registry.
EDIT 2: Okay, I have a working theory what happened: Cluster API now sets the imageRepository to registry.k8s.io for Kubernetes >= 1.22.0, while the kubeadm binary in the node image still expects the old registry, so the CoreDNS image pull fails:
[ERROR ImagePull]: failed to pull image registry.k8s.io/coredns:v1.8.6: output: time="2022-12-15T19:31:16Z" level=fatal msg="pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \"registry.k8s.io/coredns:v1.8.6\": failed to resolve reference \"registry.k8s.io/coredns:v1.8.6\": registry.k8s.io/coredns:v1.8.6: not found"
If I'm correct, this essentially means that Cluster API v1.2.8 and v1.3.0 are only compatible with Kubernetes >= v1.22.17, >= v1.23.15, >= v1.24.9, >= v1.25.0 (if the kubeadm providers are used).
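For anyone doing the same check on their own test artifacts, the field in question sits at this path in the dumped KubeadmControlPlane; a trimmed sketch, everything except the imageRepository path is omitted:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      imageRepository: ""   # absent/empty here means kubeadm (or the provider's defaulting) picks the registry
```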
P.S. The patch versions in https://github.com/kubernetes-sigs/cluster-api/issues/7768#issuecomment-1359379155 were initially one too low, updated now.
We don't pin the imageRepository in CAPA, and even after using the above patch versions we are still getting this error in the CAPA CCM migration tests.
My best guess after trying to replicate this is that there is a clash between the CAPI code that fixes up the CoreDNS image name for older versions and the kubeadm update that uses the new registry for CoreDNS.
The root cause is that the Kubernetes versions used in the e2e tests have been updated to patch releases whose kubeadm includes the registry fix, but there are no kindest/node images available for those versions yet.
I think this should be fixable by specifying the imageRepository in the clusterConfiguration field of the KubeadmControlPlane. I've got a version which does this for the failing 1.1 branch, but I'd like to find out whether it can also fix the issues in CAPZ or CAPA. Starting next week I'll build a kindest/node image with the new versions of 1.23 and 1.24 so I can properly test the fix with CAPD.
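A rough sketch of that workaround, assuming the node image ships a kubeadm that already defaults to registry.k8s.io; the value would be k8s.gcr.io instead for node images with an older kubeadm:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: example-control-plane   # illustrative name
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      # Pin the repository to whatever registry the kubeadm binary in the
      # kindest/node image expects, so its CoreDNS sub-path logic matches.
      imageRepository: registry.k8s.io
```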