kind: Pod deletion hangs, containerd issue: "/opt/cni/bin/loopback: argument list too long"

What happened: Running kubectl delete ns <some-namespace> randomly fails. Pods get stuck in the Terminating state because of a containerd error. From the containerd service logs inside the kind node container:

Mar 29 08:02:43 test-cluster-control-plane containerd[478745]: time="2021-03-29T08:02:43.442708643Z" level=info msg="StopPodSandbox for \"0a36509dcfc500b5090ee28e6c53faf90308ad13bcbadf8dfa0b1f9eee9c12a4\""
Mar 29 08:02:43 test-cluster-control-plane containerd[478745]: time="2021-03-29T08:02:43.443615250Z" level=error msg="StopPodSandbox for \"0a36509dcfc500b5090ee28e6c53faf90308ad13bcbadf8dfa0b1f9eee9c12a4\" failed" error="failed to destroy network for sandbox \"0a36509dcfc500b5090ee28e6c53faf90308ad13bcbadf8dfa0b1f9eee9c12a4\": netplugin failed with no error message: fork/exec /opt/cni/bin/loopback: argument list too long"
Mar 29 08:02:43 test-cluster-control-plane containerd[478745]: time="2021-03-29T08:02:43.444195181Z" level=info msg="StopPodSandbox for \"5c91ce58b5ea70a17b45c1293e28c5f62f20bb0819e997c914393e514e459c9c\""
Mar 29 08:02:43 test-cluster-control-plane containerd[478745]: time="2021-03-29T08:02:43.445062363Z" level=error msg="StopPodSandbox for \"5c91ce58b5ea70a17b45c1293e28c5f62f20bb0819e997c914393e514e459c9c\" failed" error="failed to destroy network for sandbox \"5c91ce58b5ea70a17b45c1293e28c5f62f20bb0819e997c914393e514e459c9c\": netplugin failed with no error message: fork/exec /opt/cni/bin/loopback: argument list too long"

This leads to kubelet errors and prevents the pods and the namespace from being deleted:

4bd3fd6e722623fc9d7f39eab46" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for sandbox "38289292c07bef6ba9429dac9f51da382eda64bd3fd6e722623fc9d7f39eab46": netplugin failed with no error message: fork/exec /opt/cni/bin/loopback: argument list too long

Also, crictl ps shows that the affected containers were in fact deleted, and there were no related processes running inside the kind node container. The issue seems to occur only during network (sandbox) teardown.

What you expected to happen: Expected kubectl delete namespace <some-namespace> to finish successfully, along with the deletion of all pods in the namespace.

How to reproduce it (as minimally and precisely as possible): Unfortunately, this happens randomly on all of our CI nodes (5 Ubuntu VMs) while running kubectl delete namespace <some-namespace>.

Environment:

  • kind version:
kind v0.10.0 go1.15.7 linux/amd64
  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-20T02:22:41Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-21T01:11:42Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

  • Docker version:
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 347
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-53-generic
 Operating System: Ubuntu 20.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 7.788GiB
 Name: stenant02
 ID: 3LWL:KDWV:S7PP:X72I:66DE:T7QS:7J5W:LZPK:XQRA:NXO3:TJEU:TQW3
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: resoptimahlo
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 35 (20 by maintainers)

Most upvoted comments

Hi, I found the root cause. It's a bug in containerd/nri, at https://github.com/containerd/nri/blame/0afc7f031eaf9c7d9c1a381b7ab5462e89c998fc/client.go#L32:

// DefaultBinaryPath = /opt/nri/bin
if err := os.Setenv("PATH", fmt.Sprintf("%s:%s", os.Getenv("PATH"), DefaultBinaryPath)); err != nil {
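
Since this Setenv evidently runs on every invocation rather than once per process, PATH gains another ":/opt/nri/bin" suffix each time, and the CNI plugin is spawned with containerd's environment, so the oversized PATH eventually makes execve fail with E2BIG ("argument list too long"). A minimal standalone Go sketch of that mechanism (illustrative only, not containerd code; /bin/true stands in for the CNI binary):

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	for i := 0; ; i++ {
		// Mimic the nri client: append the same directory to PATH on every call.
		os.Setenv("PATH", fmt.Sprintf("%s:%s", os.Getenv("PATH"), "/opt/nri/bin"))

		// Spawn a child the way libcni spawns a CNI plugin: it inherits this
		// process's ever-growing environment. Once the PATH string crosses the
		// kernel's per-string limit, fork/exec fails with E2BIG.
		if err := exec.Command("/bin/true").Run(); err != nil {
			fmt.Printf("failed after %d iterations: %v\n", i, err)
			return
		}
	}
}

On a typical Linux kernel the single PATH string hits the 128 KiB per-string execve limit after roughly ten thousand iterations, which would explain why the failure only shows up after a node has been running for a while.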

It was fixed by https://github.com/containerd/nri/pull/1; a sketch of that kind of guard is shown after the timeline below. In containerd, the bug was:

  1. introduced by https://github.com/containerd/containerd/pull/4497 (25 Aug 2020)
  2. fixed by https://github.com/containerd/containerd/pull/4605 (8 Oct 2020)
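
For reference, the shape of the fix is simply to stop appending to PATH more than once. A hedged sketch of that kind of guard (illustrative only, not the exact upstream patch; ensureBinaryPath is a made-up name):

package nri

import (
	"fmt"
	"os"
	"sync"
)

// DefaultBinaryPath mirrors the constant quoted above.
const DefaultBinaryPath = "/opt/nri/bin"

var setupOnce sync.Once

// ensureBinaryPath is a hypothetical helper: it appends DefaultBinaryPath to
// PATH at most once per process, so repeated client construction can no
// longer grow the environment without bound.
func ensureBinaryPath() (err error) {
	setupOnce.Do(func() {
		err = os.Setenv("PATH", fmt.Sprintf("%s:%s", os.Getenv("PATH"), DefaultBinaryPath))
	})
	return err
}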

-- debug --

strace -ff -p $(pidof /usr/local/bin/containerd) -e trace=execve -v -s200000 &> E2BIG.log

[screenshot: strace output captured in E2BIG.log]

@aojea Now I'm wondering if I installed istio with CNI support and perhaps that is leading to this issue. I'm going to create a fresh cluster and make sure I install without any CNI. Will let you know…

@aojea sorry for the delay, regarding additional info:

  • We are not using any additional CNI plugins; only the default one is used;
  • A sample deployment manifest is attached to this comment;
  • The log tarball will be attached later today; I need to sanitize some info first.

deployment.zip

None at all. And libcni, which containerd uses, doesn't pass any arguments when invoking the plugin (see https://github.com/containernetworking/cni/blob/master/pkg/invoke/raw_exec.go#L37).

This is properly weird 😃