cluster-api: Multiple e2e tests are flaky because of the error "container is not running"

What steps did you take and what happened: [A clear and concise description on how to REPRODUCE the bug.]

According to testgrid, capi-e2e.When following the Cluster API quick-start [PR-Blocking] Should create a workload cluster and a few other tests are failing from time to time.

I looked at the last two occurrences in the capi-quickstart test. In both cases a machine did not come up because `mkdir -p /etc/kubernetes/pki` was failing with the error that the respective container is not running. The command was retried for a while, but the container never came up. I tried to find other logs but couldn’t find anything useful. Logs from the controllers, aggregated and sorted for the affected node of this test: https://gist.github.com/sbueringer/e007c989c158d66dd6d3078f8c904f30 (ProwJob)
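To make the failure mode concrete, here is a minimal sketch of the kind of retry loop described above (illustrative only, not the actual CAPD code; the container name, interval, timeout, and the use of `docker exec` via `os/exec` are assumptions):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// ensurePKIDir keeps retrying "mkdir -p /etc/kubernetes/pki" inside the node
// container until it succeeds or the timeout expires. If the container never
// reaches a running state, every attempt fails with
// "Error response from daemon: Container ... is not running" and the machine
// never comes up, which matches what we see in the failed runs.
func ensurePKIDir(containerName string) error {
	return wait.PollImmediate(2*time.Second, time.Minute, func() (bool, error) {
		cmd := exec.Command("docker", "exec", containerName, "mkdir", "-p", "/etc/kubernetes/pki")
		if out, err := cmd.CombinedOutput(); err != nil {
			fmt.Printf("retrying: %v: %s\n", err, out)
			return false, nil
		}
		return true, nil
	})
}
```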

I think right now we don’t have the necessary data/logs to find out why this happens. I would propose gathering the logs of the Docker service used in those tests (the dind in the ProwJob). Maybe there’s something interesting there. Are there any other Docker / kind / … logs we could retrieve?

What I found in the kubekins image we’re using:

  • /var/log/docker.log
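A minimal sketch of how that file could be copied into the ProwJob artifacts directory at the end of a run, assuming the standard `ARTIFACTS` environment variable that Prow sets (the helper name and destination layout are made up for illustration):

```go
package main

import (
	"io"
	"os"
	"path/filepath"
)

// copyDockerLog copies /var/log/docker.log from the dind environment running
// the job into the ProwJob artifacts directory so it is retained with the
// other test artifacts. ARTIFACTS is set by Prow; the destination file name
// is arbitrary.
func copyDockerLog() error {
	src, err := os.Open("/var/log/docker.log")
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.Create(filepath.Join(os.Getenv("ARTIFACTS"), "docker.log"))
	if err != nil {
		return err
	}
	defer dst.Close()

	_, err = io.Copy(dst, src)
	return err
}
```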

What did you expect to happen:

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

I assume the following test failures are related:

So tl;dr: apart from the MachineRemediation test, most of our other flaky tests are probably caused by this issue. They usually fail in the following lines of code:

Even if they don’t all have the same root cause, fixing this error should fix most of them.

Environment:

  • Cluster-api version:
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug [One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

@sbueringer we are definitely in better shape than before. Thanks for this work!

/close

Status update: #4469 has been merged, so over the next few days I’ll check whether the “container is not running” issue is gone as expected.

@fabriziopandini from the logs I suspect we’re first failing here (and then, during the retries, at the location I linked above):

https://github.com/kubernetes-sigs/cluster-api/blob/7478817225e0a75acb6e14fc7b438231578073d2/test/infrastructure/docker/docker/machine.go#L222-L228

I0331 10:05:37.077010       1 machine.go:190] controller-runtime/manager/controller/dockermachine "msg"="Creating control plane machine container" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.019084       1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.043525       1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.064611       1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.078423       1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.844159       1 dockermachine_controller.go:200] controller-runtime/manager/controller/dockermachine "msg"="Waiting for the Bootstrap provider controller to set bootstrap data" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:48.003814       1 dockermachine_controller.go:200] controller-runtime/manager/controller/dockermachine "msg"="Waiting for the Bootstrap provider controller to set bootstrap data" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
E0331 10:05:48.736814       1 controller.go:302] controller-runtime/manager/controller/dockermachine "msg"="Reconciler error" "error"="failed to create worker DockerMachine: timed out waiting for the condition" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:48.988211       1 loadbalancer.go:126] controller-runtime/manager/controller/dockermachine "msg"="Updating load balancer configuration" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:49.817654       1 machine.go:312] controller-runtime/manager/controller/dockermachine "msg"="Failed running command" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" "bootstrap 

I think a `docker inspect` makes sense at both locations.
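As a rough illustration of what that could look like, assuming the container state is fetched via the Docker CLI whenever an exec fails (the function name is made up; this is not the actual CAPD helper):

```go
package main

import (
	"fmt"
	"os/exec"
)

// logContainerState runs "docker inspect" on the node container and prints
// its state block, so a failure like "container is not running" comes with
// the container's actual status, exit code, and OOM / restart information.
func logContainerState(containerName string) {
	out, err := exec.Command(
		"docker", "inspect",
		"--format", "{{json .State}}",
		containerName,
	).CombinedOutput()
	if err != nil {
		fmt.Printf("docker inspect %s failed: %v: %s\n", containerName, err, out)
		return
	}
	fmt.Printf("state of container %s: %s\n", containerName, out)
}
```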