cluster-api: Multiple e2e tests are flaky because of the error "container is not running"
What steps did you take and what happened:
According to testgrid, capi-e2e: When following the Cluster API quick-start [PR-Blocking] Should create a workload cluster and a few others are failing from time to time.
I looked at the last two occurrences in the capi-quickstart test. In both cases a machine did not come up because mkdir -p /etc/kubernetes/pki failed, because the respective container was not running. The command was retried for a while, but the container never came up. I tried to find other logs but couldn’t find anything. Logs from the controllers, aggregated and sorted for the affected node of this test: https://gist.github.com/sbueringer/e007c989c158d66dd6d3078f8c904f30 (ProwJob)
I think right now we don’t have the necessary data/logs to find out why this happens. I would propose gathering the logs of the Docker service used in those tests (the dind used in the ProwJob). Maybe there’s something interesting there. Are there any other Docker / kind / … logs we could retrieve?
What I found in the kubekins image we’re using:
- /var/log/docker.log
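As a hedged sketch of what gathering that log could look like in the job setup (the helper name and the fallback handling are assumptions; Prow jobs conventionally expose `ARTIFACTS` as the directory that gets uploaded with the job results, and `/var/log/docker.log` is the path found in the kubekins image):

```shell
# Hypothetical helper: copy the dind Docker daemon log into the Prow
# artifacts directory so it is uploaded alongside the test results.
collect_docker_log() {
  local src="${1:-/var/log/docker.log}"          # daemon log path in the kubekins image
  local dst_dir="${ARTIFACTS:-/tmp/artifacts}"   # Prow sets ARTIFACTS; fall back for local runs
  mkdir -p "${dst_dir}"
  if [ -f "${src}" ]; then
    cp "${src}" "${dst_dir}/docker.log"
    echo "copied ${src} -> ${dst_dir}/docker.log"
  else
    echo "no docker log found at ${src}" >&2
    return 1
  fi
}
```

Something like this could be called from the job’s cleanup step, so the daemon log is available even when the test fails.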
What did you expect to happen:
Anything else you would like to add:
I assume the following test failures are related:
- same "container is not running" logs in the CAPD manager (I didn’t look at all logs, but they all fail in basically the same lines of code in the tests):
- capi-e2e: When testing K8S conformance [Conformance] Should create a workload cluster and run kubetest
- capi-e2e: When testing KCP upgrade Should successfully upgrade Kubernetes, DNS, kube-proxy, and etcd in a HA cluster
So tl;dr: apart from the MachineRemediation test, most of our other flaky tests are probably caused by this issue. They usually fail in the following lines of code:
- controlplane_helpers.go:109 https://github.com/kubernetes-sigs/cluster-api/blob/7478817225e0a75acb6e14fc7b438231578073d2/test/framework/controlplane_helpers.go#L109
- controlplane_helpers.go:146 https://github.com/kubernetes-sigs/cluster-api/blob/7478817225e0a75acb6e14fc7b438231578073d2/test/framework/controlplane_helpers.go#L146
- machinedeployment_helper:120: https://github.com/kubernetes-sigs/cluster-api/blob/7478817225e0a75acb6e14fc7b438231578073d2/test/framework/machinedeployment_helpers.go#L120
- machinepool_helpers.go:85: https://github.com/kubernetes-sigs/cluster-api/blob/7478817225e0a75acb6e14fc7b438231578073d2/test/framework/machinepool_helpers.go#L85
Even if they don’t all have the same root cause, fixing this error should fix most of them.
Environment:
- Cluster-api version:
- Minikube/KIND version:
- Kubernetes version: (use kubectl version):
- OS (e.g. from /etc/os-release):
/kind bug
About this issue
- State: closed
- Created 3 years ago
- Comments: 18 (18 by maintainers)
@sbueringer we are definitely in a better shape than before. Thanks for this work!
/close
Status update: #4469 has been merged, so I’ll take a look over the next few days at whether the “container is not running” issue is gone as expected
@fabriziopandini from the log I suspect we’re first failing here (and then at the location I linked above during the retries):
https://github.com/kubernetes-sigs/cluster-api/blob/7478817225e0a75acb6e14fc7b438231578073d2/test/infrastructure/docker/docker/machine.go#L222-L228
I think a docker inspect makes sense at both locations.
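To illustrate what that could surface: `docker inspect --format` with a Go template is the standard way to pull out the container state, and a sketch of how the reported state could be classified before retrying is below (the `classify_state` helper and its categories are hypothetical, not the CAPD implementation):

```shell
# Invocation sketch for the two locations above:
#   docker inspect --format '{{.State.Status}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' <container>
#
# Hypothetical helper: decide what to do based on the reported state,
# instead of blindly retrying the exec until the timeout.
classify_state() {
  case "$1" in
    running)             echo "proceed" ;;   # safe to exec into the container
    created|restarting)  echo "retry" ;;     # may still come up, keep waiting
    exited|dead)         echo "fail-fast" ;; # retrying the exec is pointless
    *)                   echo "unknown" ;;
  esac
}
```

The point would be that an `exited` container fails the test immediately with its exit code and OOM flag in the logs, rather than burning the whole retry budget on "container is not running".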