kubernetes: node-kubelet-conformance fails with "pod ... was not deleted"

Which jobs are failing:

ci-kubernetes-node-kubelet-conformance

Which test(s) are failing:

  • Variable Expansion should succeed in writing subpaths in container
  • Variable Expansion should verify that a failing subpath expansion can be modified during the lifecycle of a container

Since when has it been failing:

Since Fri Jul 10 15:41:46 2020 -0700, when this PR was merged

Testgrid link:

https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-conformance

Reason for failure:

The test pod is now deleted with the DeletePropagationForeground option, which was introduced by this commit.

Anything else we need to know:

The test pod is stuck in the Terminating state until the 5m timeout of DeletePodWithWait expires. It happens with an empty pod as well. Here is a simplified ginkgo test case that triggers this failure:

framework.ConformanceIt("trigger stuck pod deletion", func() {
	pod := newPod([]string{"sh", "-c", "sleep 600"}, nil, nil, nil)
	ginkgo.By("creating the pod")
	podClient := f.PodClient()
	pod = podClient.Create(pod)

	ginkgo.By("waiting for pod running")
	err := e2epod.WaitTimeoutForPodRunningInNamespace(f.ClientSet, pod.Name, pod.Namespace, framework.PodStartShortTimeout)
	framework.ExpectNoError(err, "while waiting for pod to be running")

	ginkgo.By("deleting the pod gracefully")
	err = e2epod.DeletePodWithWait(f.ClientSet, pod)
	framework.ExpectNoError(err, "failed to delete pod")
})

The pod phase is “Failed” and the container reason is “Error” (exit code 137, i.e. killed by SIGKILL), for reasons that are not yet clear:

# kubectl get pod -n var-expansion-1646 var-expansion-f5a6d071-e1c0-40a6-9292-49807b1f862f -o yaml |grep -i -B15 fail
    image: busybox:1.29
    imageID: docker-pullable://busybox@sha256:8ccbac733d19c0dd4d70b4f0c1e12245b5fa3ad24758a11035ee505c629c0796
    lastState: {}
    name: dapi-container
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: docker://7b60220b917510a6819d04774ebf7be660a67cc85ceea5e99174322b26190910
        exitCode: 137
        finishedAt: "2020-07-21T16:35:39Z"
        reason: Error
        startedAt: "2020-07-21T16:35:07Z"
  hostIP: 10.237.72.179
  phase: Failed

Reverting the above commit or changing the delete option to DeletePropagationBackground should fix this.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

Also do we have plans to make this job a blocking presubmit test so we can find these problems earlier?

putting my @BenTheElder hat on, probably not? more hour+ tests that run on every PR are a non-goal