kubernetes: HostPathType Socket e2e tests hang on containerd

What happened: After we switched the default runtime to containerd, the slow CI and pull jobs started timing out. Tests that used to take 13 minutes now take 30 minutes. https://github.com/kubernetes/kubernetes/issues/92045

This increase in test time is only observed in the slow job, so I suspect the slow job exercises some functionality that the other jobs do not hit as often.

What you expected to happen: No performance regression

How to reproduce it (as minimally and precisely as possible): Run the tests in the slow job with containerd

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

The test logs don’t seem to show up in the artifacts. Running the test manually, I see it hang on the exec during BeforeEach():

STEP: Create a socket for further testing
Jun 12 14:58:53.432: INFO: ExecWithOptions {Command:[/bin/sh -c nc -lU /mnt/test/asocket &] Namespace:host-path-type-socket-9510 PodName:test-hostpath-type-7rh6v ContainerName:host-path-testing Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}

However, when I run the same command manually with “kubectl exec”, it completes fine.

We have a similar issue with CRI-O. I agree with @Random-Liu that we should define this clearly and meanwhile setting a timeout makes sense.

https://github.com/kubernetes/kubernetes/issues/92057#issuecomment-644485839

This seems to be a known issue. If an exec:

  1. Spawns a long-running background process, and
  2. Leaves the IO of that background process open after the exec itself exits,

then containerd will wait for that IO to be closed before finishing the exec. https://github.com/containerd/containerd/blob/master/process.go#L224

It is arguably a bug in containerd, but the behavior of background processes spawned by exec was never clearly defined. We’ve seen several production issues caused by this.
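The mechanism can be reproduced outside Kubernetes entirely. The sketch below is an illustration, not the containerd code path: a plain pipe plays the role of the exec's IO stream, and `cat` stands in for containerd's IO copier, which reads until EOF. EOF only arrives once every holder of the write end has closed it, so a backgrounded child that inherits stdout keeps the reader blocked even after the shell itself exits. Redirecting the background process's stdio (the likely fix for the test) releases the reader immediately.

```shell
#!/bin/sh
# Hanging variant: the backgrounded sleep inherits the pipe's write end,
# so `cat` does not see EOF until sleep exits, even though sh already has.
start=$(date +%s)
sh -c 'sleep 3 & echo parent-done' | cat >/dev/null
echo "inherited stdio: blocked for $(( $(date +%s) - start ))s"

# Fixed variant: the background process's stdio is redirected away from
# the pipe, so EOF arrives as soon as sh exits.
start=$(date +%s)
sh -c 'sleep 3 >/dev/null 2>&1 & echo parent-done' | cat >/dev/null
echo "redirected stdio: blocked for $(( $(date +%s) - start ))s"
```

The same pattern would apply to the test's `nc -lU /mnt/test/asocket &`: appending `>/dev/null 2>&1` before the `&` should let the exec finish promptly.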

Maybe we should:

  1. Clearly define and document the behavior, and fix the test;
  2. Pass a timeout for probes (not related to this issue, but it would protect the system). An ExecSync timeout is supported by many runtimes, but the kubelet doesn’t pass a proper value right now. https://github.com/kubernetes/kubernetes/blob/v1.19.0-beta.2/pkg/kubelet/lifecycle/handlers.go#L63
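The timeout idea in point 2 can be sketched in shell (assumption: coreutils `timeout` is available; `cat` again stands in for the runtime's IO copier). Instead of waiting indefinitely for EOF on a pipe that a background process keeps open, the caller is released when a deadline expires:

```shell
#!/bin/sh
# The backgrounded sleep holds the pipe's write end open for a long time,
# so `cat` would block indefinitely; `timeout` plays the role of an
# ExecSync deadline. (sleep's stderr is redirected only so this demo
# script itself can exit cleanly.)
start=$(date +%s)
sh -c 'sleep 30 2>/dev/null & echo started' | timeout 2 cat
echo "released after $(( $(date +%s) - start ))s despite the open pipe"
```

This bounds the damage (the exec returns), but unlike redirecting the background process's stdio it does not address why the IO stays open in the first place.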

I can easily recreate the problem locally using kubetest + local-up-cluster.sh running against containerd:

kubetest --build=none --ginkgo-parallel=1 --deployment=local --provider=local  --test --test_args='--ginkgo.focus=HostPathType\sSocket'