containerd: failed to start or create containerd task

Description

Running Kubernetes conformance tests against a cluster that uses the containerd runtime sometimes fails because a pod does not start during one of the test cases. The general error is "failed to start containerd task" or "failed to create containerd task". More detailed errors include the following:

ttrpc: closed: unknown
read: connection reset by peer: unknown
failed to start io pipe copy: unable to copy pipes: containerd-shim: opening w/o fifo ... failed: context deadline exceeded
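
When scanning test logs or pod events for this failure, it can help to match on the error substrings listed above. The following is a minimal sketch, assuming the messages arrive as plain strings; the names `GENERAL_ERRORS`, `DETAIL_ERRORS`, and `classify` are illustrative and not part of containerd or Kubernetes.

```python
# Substrings copied from the error messages quoted in this issue.
GENERAL_ERRORS = (
    "failed to start containerd task",
    "failed to create containerd task",
)
DETAIL_ERRORS = (
    "ttrpc: closed",
    "connection reset by peer",
    "context deadline exceeded",
)


def classify(message: str) -> dict:
    """Report which of the known error substrings appear in a message."""
    return {
        "general": [s for s in GENERAL_ERRORS if s in message],
        "detail": [s for s in DETAIL_ERRORS if s in message],
    }


if __name__ == "__main__":
    sample = "failed to start containerd task: ttrpc: closed: unknown"
    print(classify(sample))
```

Plain substring matching is used because the surrounding text of these messages varies between occurrences.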

Steps to reproduce the issue:

Option 1: Follow https://github.com/cncf/k8s-conformance/blob/master/instructions.md#running to run Kubernetes conformance testing via sonobuoy.

Option 2: Follow https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md#running-conformance-tests to run Kubernetes conformance testing via kubetest.

The more load there is on the cluster (e.g. when running conformance tests in parallel), the easier the problem is to reproduce. In general, however, the problem is difficult to reproduce because the failure rate is low. For example, re-running the conformance tests after a failure usually succeeds.
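
To get a rough feel for how many runs a reproduction attempt might take, the sketch below computes the number of runs needed to see at least one failure with a chosen confidence. It assumes each run fails independently with the same probability, which is a simplification; the failure rate passed in is a hypothetical input, not a measured value from this issue.

```python
import math


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - failure_rate)**n >= confidence.

    Assumes runs are independent and share the same failure rate.
    """
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


if __name__ == "__main__":
    # Hypothetical 5% per-run failure rate, 95% confidence.
    print(runs_needed(0.05, 0.95))
```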

Describe the results you received:

See description.

Describe the results you expected:

The Kubernetes conformance tests pass because containerd retries the failed task.
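
The kind of retry we have in mind can be sketched as follows. This is not containerd code: `start_task` is a hypothetical callable standing in for whatever starts the task, `RuntimeError` stands in for the transient failure, and the attempt count and delays are arbitrary.

```python
import time


def start_with_retry(start_task, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call start_task, retrying on RuntimeError with exponential backoff.

    start_task is a hypothetical stand-in, not a containerd API.
    The last failure is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return start_task()
        except RuntimeError:
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.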

Output of containerd --version:

We’ve seen this on various containerd 1.2.x and 1.3.x versions.

Any other relevant information:

We’ve noticed and have been monitoring these failures since October 2019, although they could have started long before that.

About this issue

State: closed
Created 4 years ago
Comments: 22 (14 by maintainers)

Commits related to this issue

Int tests: Warn (instead of erroring) upon pod restarts, part two In #4595 we stopped failing integration tests whenever a pod restarted just once, which is being caused by containerd/containerd#4068... — committed to linkerd/linkerd2 by alpeb 4 years ago
Int tests: Warn (instead of erroring) upon pod restarts, part two (#4637) In #4595 we stopped failing integration tests whenever a pod restarted just once, which is being caused by containerd/contai... — committed to linkerd/linkerd2 by alpeb 4 years ago
test: non-fatal containerd task issue https://github.com/containerd/containerd/issues/4068 caused a container start to fail and get retried, which then broke tests because of our "no container restar... — committed to pohly/pmem-CSI by pohly 4 years ago

Most upvoted comments

Hello. I also noticed this issue on an EKS node:

  Kernel Version:             5.10.178-162.673.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.19
  Kubelet Version:            v1.24.11-eks-a59e1f0
  Kube-Proxy Version:         v1.24.11-eks-a59e1f0

Pod’s status:

  Last State:     Terminated
    Reason:       StartError
    Message:      failed to create containerd task: failed to create shim task: context canceled: unknown
    Exit Code:    128
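
To find pods in this state programmatically, one option is to inspect the pod JSON (for example the output of `kubectl get pod <name> -o json`) and look at each container's `lastState.terminated` entry. A minimal sketch, assuming the pod has already been parsed into a Python dict; the function name `start_errors` and the container name in the example are made up.

```python
def start_errors(pod: dict) -> list:
    """Return (container name, exit code, message) for each container
    whose last terminated state has reason "StartError"."""
    found = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "StartError":
            found.append(
                (cs.get("name"), terminated.get("exitCode"), terminated.get("message", ""))
            )
    return found


if __name__ == "__main__":
    # Hand-written example mirroring the status shown above.
    pod = {
        "status": {
            "containerStatuses": [
                {
                    "name": "app",  # hypothetical container name
                    "lastState": {
                        "terminated": {
                            "reason": "StartError",
                            "exitCode": 128,
                            "message": "failed to create containerd task: "
                            "failed to create shim task: context canceled: unknown",
                        }
                    },
                }
            ]
        }
    }
    print(start_errors(pod))
```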

shreben on May 16, 2023

Thanks for the feedback! Closing.

estesp on Feb 24, 2022

Tracking on the Kubernetes side in https://github.com/kubernetes/kubernetes/issues/89064

liggitt on Mar 11, 2020