kubernetes: [failing test] "Pod Container Status should never report success for a pending container" hangs and times out KIND jobs

I’ve checked some of the jobs that timed out, and something is odd with this test:

[k8s.io] [sig-node] Pods Extended
test/e2e/framework/framework.go:592
  [k8s.io] Pod Container Status
  test/e2e/framework/framework.go:592
    should never report success for a pending container

Check the durations in

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/88196/pull-kubernetes-e2e-kind/1266095722265055235/build-log.txt

• [SLOW TEST:3484.128 seconds]
[k8s.io] [sig-node] Pods Extended
test/e2e/framework/framework.go:592
  [k8s.io] Pod Container Status
  test/e2e/framework/framework.go:592
    should never report success for a pending container
    test/e2e/node/pods.go:219
May 28 21:05:05.238: INFO: Pod pod-submit-status-0-7 on node kind-worker2 timings total=56m0.34898024s t=1.832s run=0s execute=0s

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/91535/pull-kubernetes-e2e-kind/1266050429549023233/build-log.txt

• [SLOW TEST:3376.269 seconds]
[k8s.io] [sig-node] Pods Extended
test/e2e/framework/framework.go:592
  [k8s.io] Pod Container Status
  test/e2e/framework/framework.go:592
    should never report success for a pending container
    test/e2e/node/pods.go:219
May 28 18:08:32.338: INFO: Pod pod-submit-status-0-13 on node kind-worker2 timings total=54m38.072272958s t=1.751s run=0s execute=0s

_Originally posted by @aojea in https://github.com/kubernetes/kubernetes/pull/91491#issuecomment-635651958_
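For context, the invariant in the test name is simple: a container that has not yet run must never surface a successful termination in the pod status. The sketch below is only an illustration of that kind of watch-based check, not the actual test in test/e2e/node/pods.go (which drives a number of pods and reports the per-pod timings shown above); the package, function name, and error messages here are made up for the example.

```go
package podstatus

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkNoPrematureSuccess watches a single pod and returns an error if any
// container reports a terminated state with exit code 0 while the pod is
// still Pending, i.e. before the container could have run. It gives up with
// a timeout error if no further watch events arrive, mirroring the
// "timed out waiting for watch events" failures in the build logs.
func checkNoPrematureSuccess(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	w, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for watch events for %s", name)
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch channel closed for %s", name)
			}
			pod, ok := ev.Object.(*v1.Pod)
			if !ok {
				continue
			}
			if pod.Status.Phase != v1.PodPending {
				return nil // the pod progressed normally; nothing to flag
			}
			for _, st := range pod.Status.ContainerStatuses {
				if t := st.State.Terminated; t != nil && t.ExitCode == 0 {
					return fmt.Errorf("pending container %q reported success: %+v", st.Name, t)
				}
			}
		}
	}
}
```

If the watch never delivers a terminal event for a pod, a check like this burns its whole timeout, which is consistent with the 54-56 minute per-pod totals in the build logs above.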

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 31 (31 by maintainers)

Most upvoted comments

This can probably be closed out, but it should spawn an issue for containerd that we’ll need to drive.

https://github.com/opencontainers/runc/pull/2338 might fix the issue, though the PR is originally intended for cgroup2+systemd.

https://prow.k8s.io/?job=pull-kubernetes-e2e-kind looks greener already:

Success rate over time: 3h: 75%, 12h: 53%, 48h: 36%

(screenshot: pull-kubernetes-e2e-kind success-rate graph)

It was rolled back a few days later in kubernetes-sigs/kind#1635, but to v1.3.4-12-g1e902b2d, not to the original version (v1.3.3-14-g449e9269).

Looking at that delta now.

https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&job=kind#b05025d9cb729f714b09 is interesting… the containerd logs show the same error:

May 27 03:28:05 kind-worker2 containerd[125]: time="2020-05-27T03:28:05.678487798Z" level=error msg="Failed to handle exit event &TaskExit{ContainerID:95794ab7e562c322e36a86ad7105469be95514ffde22b303b24214c5743fb13d,ID:95794ab7e562c322e36a86ad7105469be95514ffde22b303b24214c5743fb13d,Pid:20149,ExitStatus:137,ExitedAt:2020-05-27 03:28:05.270358036 +0000 UTC,XXX_unrecognized:[],} for 95794ab7e562c322e36a86ad7105469be95514ffde22b303b24214c5743fb13d" error="failed to handle container TaskExit event: failed to stop container: unknown error after kill: runc did not terminate successfully: lstat /sys/fs/cgroup/devices/kubepods/besteffort/podf16a275a-abf1-4533-8c2e-8cfc94f4780d/95794ab7e562c322e36a86ad7105469be95514ffde22b303b24214c5743fb13d: no such file or directory\n: unknown"

This appears to have started right around 5/26-5/27.
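The failing step in that log line is runc's lstat of the container's devices cgroup returning ENOENT after the cgroup directory has already been removed, so the kill is reported as an unknown error and containerd never finishes handling the TaskExit event. A minimal sketch of that failure mode follows; the path layout and IDs are copied from the log line above, and this is illustrative code, not containerd or runc internals.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// cgroupGone reports whether a container's devices-cgroup directory has
// already been removed. runc stats/kills the container by walking this
// cgroup, and an lstat that fails with ENOENT is what surfaces as
// "lstat ...: no such file or directory: unknown" in the containerd log.
func cgroupGone(podUID, containerID string) (bool, error) {
	path := fmt.Sprintf(
		"/sys/fs/cgroup/devices/kubepods/besteffort/pod%s/%s",
		podUID, containerID,
	)
	_, err := os.Lstat(path)
	switch {
	case err == nil:
		return false, nil // cgroup still present, the kill can proceed normally
	case errors.Is(err, os.ErrNotExist):
		return true, nil // cgroup already cleaned up: the race seen in the log
	default:
		return false, err
	}
}

func main() {
	// Pod UID and container ID taken from the log line above, purely for demonstration.
	gone, err := cgroupGone(
		"f16a275a-abf1-4533-8c2e-8cfc94f4780d",
		"95794ab7e562c322e36a86ad7105469be95514ffde22b303b24214c5743fb13d",
	)
	fmt.Println(gone, err)
}
```

If the exit event handler keeps failing this way, the container is never cleaned up on the containerd side, which would line up with test pods that never finish deleting.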

Digging into https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-kind/1267615032988930051

The pod pod-submit-status-0-4 was never deleted:

01:00:34.847: timed out waiting for watch events for pod-submit-status-0-4

Events:

  • Jun 2 01:00:54.440: INFO: At 2020-06-02 00:55:33 +0000 UTC - event for pod-submit-status-0-4: {default-scheduler } Scheduled: Successfully assigned pods-7342/pod-submit-status-0-4 to kind-worker2
  • Jun 2 01:00:54.440: INFO: At 2020-06-02 00:55:35 +0000 UTC - event for pod-submit-status-0-4: {kubelet kind-worker2} Created: Created container busybox
  • Jun 2 01:00:54.440: INFO: At 2020-06-02 00:55:35 +0000 UTC - event for pod-submit-status-0-4: {kubelet kind-worker2} Pulled: Container image "docker.io/library/busybox:1.29" already present on machine
  • Jun 2 01:00:54.440: INFO: At 2020-06-02 00:55:36 +0000 UTC - event for pod-submit-status-0-4: {kubelet kind-worker2} Started: Started container busybox

Searching the kubelet log for "pod-submit-status-0-4" shows repeated errors:

  • Jun 02 00:56:13 kind-worker2 kubelet[677]: E0602 00:56:13.036013 677 remote_runtime.go:563] ReopenContainerLog "583c17764ed521bc6e30482e10faff6f0cda6d5351cfdb9b942b905df0eec592" from runtime service failed: rpc error: code = Unknown desc = failed to create and open log file: open /var/log/pods/pods-7342_pod-submit-status-0-4_553b0f48-8dbd-4ff6-8e4c-289a0a9b7f5c/busybox/0.log: no such file or directory
  • Jun 02 00:56:13 kind-worker2 kubelet[677]: E0602 00:56:13.036055 677 container_log_manager.go:210] Container "583c17764ed521bc6e30482e10faff6f0cda6d5351cfdb9b942b905df0eec592" log "/var/log/pods/pods-7342_pod-submit-status-0-4_553b0f48-8dbd-4ff6-8e4c-289a0a9b7f5c/busybox/0.log" doesn't exist, reopen container log failed: rpc error: code = Unknown desc = failed to create and open log file: open /var/log/pods/pods-7342_pod-submit-status-0-4_553b0f48-8dbd-4ff6-8e4c-289a0a9b7f5c/busybox/0.log: no such file or directory
  • Jun 02 00:56:15 kind-worker2 kubelet[677]: E0602 00:56:15.717463 677 cri_stats_provider.go:591] Unable to fetch container log stats for path /var/log/pods/pods-7342_pod-submit-status-0-4_553b0f48-8dbd-4ff6-8e4c-289a0a9b7f5c/busybox: failed command 'du' ($ nice -n 19 du -x -s -B 1) on path /var/log/pods/pods-7342_pod-submit-status-0-4_553b0f48-8dbd-4ff6-8e4c-289a0a9b7f5c/busybox with error exit status 1

These errors repeat until the end of the log. None of the other pods in this test log these errors.
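The repeated kubelet error is ENOENT propagating out of the log-reopen path: the "failed to create and open log file" wording implies an open with O_CREATE, which fails with "no such file or directory" only when a parent directory is missing, suggesting the pod's directory under /var/log/pods/ was removed while the kubelet still tracks the container (the failing `du` in the stats provider points the same way). A rough illustration, with a hypothetical function and arguments rather than the real CRI implementation:

```go
package logs

import (
	"fmt"
	"os"
	"path/filepath"
)

// reopenContainerLog mimics the step that keeps failing in the kubelet log
// above: recreating a container's log file during log rotation. If the pod's
// log directory has already been removed, O_CREATE cannot help because the
// parent directory is missing, and the open fails with ENOENT, the
// "no such file or directory" error that repeats in the log.
func reopenContainerLog(podLogDir, container string) error {
	path := filepath.Join(podLogDir, container, "0.log")
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o640)
	if err != nil {
		return fmt.Errorf("failed to create and open log file: %w", err)
	}
	return f.Close()
}
```

Because the kubelet keeps retrying log rotation and stats collection for a container it still believes exists, the same errors recur until the end of the log.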