kind: Pod deletion hangs, containerd issue: "/opt/cni/bin/loopback: argument list too long"
What happened:
Running kubectl delete ns <some-namespace> randomly fails. Pods get stuck in the “Terminating” state due to a containerd issue. From the containerd service logs inside the kind container:
Mar 29 08:02:43 test-cluster-control-plane containerd[478745]: time="2021-03-29T08:02:43.442708643Z" level=info msg="StopPodSandbox for \"0a36509dcfc500b5090ee28e6c53faf90308ad13bcbadf8dfa0b1f9eee9c12a4\""
Mar 29 08:02:43 test-cluster-control-plane containerd[478745]: time="2021-03-29T08:02:43.443615250Z" level=error msg="StopPodSandbox for \"0a36509dcfc500b5090ee28e6c53faf90308ad13bcbadf8dfa0b1f9eee9c12a4\" failed" error="failed to destroy network for sandbox \"0a36509dcfc500b5090ee28e6c53faf90308ad13bcbadf8dfa0b1f9eee9c12a4\": netplugin failed with no error message: fork/exec /opt/cni/bin/loopback: argument list too long"
Mar 29 08:02:43 test-cluster-control-plane containerd[478745]: time="2021-03-29T08:02:43.444195181Z" level=info msg="StopPodSandbox for \"5c91ce58b5ea70a17b45c1293e28c5f62f20bb0819e997c914393e514e459c9c\""
Mar 29 08:02:43 test-cluster-control-plane containerd[478745]: time="2021-03-29T08:02:43.445062363Z" level=error msg="StopPodSandbox for \"5c91ce58b5ea70a17b45c1293e28c5f62f20bb0819e997c914393e514e459c9c\" failed" error="failed to destroy network for sandbox \"5c91ce58b5ea70a17b45c1293e28c5f62f20bb0819e997c914393e514e459c9c\": netplugin failed with no error message: fork/exec /opt/cni/bin/loopback: argument list too long"
This leads to kubelet errors and prevents pod and namespace deletion:
StopPodSandbox "38289292c07bef6ba9429dac9f51da382eda64bd3fd6e722623fc9d7f39eab46" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for sandbox "38289292c07bef6ba9429dac9f51da382eda64bd3fd6e722623fc9d7f39eab46": netplugin failed with no error message: fork/exec /opt/cni/bin/loopback: argument list too long
Also, crictl ps shows that the containers in question were actually deleted, and there were no related processes running inside the kind container. The issue seems to occur only during the network deletion (CNI teardown) step.
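For what it's worth, "argument list too long" from fork/exec is the kernel's E2BIG, returned when the argv and environment passed to execve exceed the kernel limits (on Linux, a single argv or envp string is additionally capped at MAX_ARG_STRLEN, 128 KiB). Since CNI plugins are invoked with essentially no arguments, an oversized environment in the containerd process is the likely culprit. A minimal sketch for checking that (an illustrative helper, not part of the original report; run it against the containerd PID inside the node):

```go
// envsize reports the total size and the largest single entry of a
// process's environment, read from /proc/<pid>/environ.
// Usage: go run envsize.go <pid>
package main

import (
	"bytes"
	"fmt"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: envsize <pid>")
		os.Exit(1)
	}
	data, err := os.ReadFile("/proc/" + os.Args[1] + "/environ")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Entries are NUL-separated "KEY=value" strings.
	entries := bytes.Split(bytes.TrimRight(data, "\x00"), []byte{0})
	var largest []byte
	for _, e := range entries {
		if len(e) > len(largest) {
			largest = e
		}
	}
	key, _, _ := bytes.Cut(largest, []byte{'='})
	fmt.Printf("entries: %d, total: %d bytes, largest: %s (%d bytes)\n",
		len(entries), len(data), key, len(largest))
}
```

If the largest entry is near 131072 bytes, any execve from that process will fail exactly like the logs above.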
What you expected to happen:
Expected kubectl delete namespace <some-namespace> to finish successfully, deleting all pods along with the namespace.
How to reproduce it (as minimally and precisely as possible):
Unfortunately, this happens randomly on all our CI nodes (5 Ubuntu VMs) while running kubectl delete namespace <some-namespace>.
Environment:
- kind version:
kind v0.10.0 go1.15.7 linux/amd64
- Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-20T02:22:41Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-21T01:11:42Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
- Docker version:
Client:
Debug Mode: false
Server:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 347
Server Version: 19.03.13
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-53-generic
Operating System: Ubuntu 20.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.788GiB
Name: stenant02
ID: 3LWL:KDWV:S7PP:X72I:66DE:T7QS:7J5W:LZPK:XQRA:NXO3:TJEU:TQW3
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: resoptimahlo
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
- OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
About this issue
- State: closed
- Created 3 years ago
- Comments: 35 (20 by maintainers)
Hi, I found the reason. It’s a bug in containerd/nri, at https://github.com/containerd/nri/blame/0afc7f031eaf9c7d9c1a381b7ab5462e89c998fc/client.go#L32. It was fixed by https://github.com/containerd/nri/pull/1. In containerd, the bug was:
[collapsed “debug” output]
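A sketch of the failure mode as I understand it (illustrative names, not the exact nri code): the client appended its plugin directory to the process-wide PATH every time it was constructed, and it gets constructed per sandbox operation, so PATH grew a little on every pod start/stop. Once that single environment string passed the kernel's per-string limit (MAX_ARG_STRLEN, 128 KiB on Linux), every subsequent fork/exec in the containerd process, including libcni exec'ing /opt/cni/bin/loopback, failed with E2BIG:

```go
// Illustration of the bug pattern and the once-only fix; names are
// hypothetical, not copied from containerd/nri.
package main

import (
	"fmt"
	"os"
	"sync"
)

const defaultBinaryPath = "/opt/nri/bin" // hypothetical plugin dir

// Buggy: called once per CRI sandbox operation, mutates the global
// environment every time, so PATH grows without bound.
func newClientBuggy() {
	os.Setenv("PATH", fmt.Sprintf("%s:%s", os.Getenv("PATH"), defaultBinaryPath))
}

// Fixed: perform the append exactly once for the process lifetime.
var pathOnce sync.Once

func newClientFixed() {
	pathOnce.Do(func() {
		os.Setenv("PATH", fmt.Sprintf("%s:%s", os.Getenv("PATH"), defaultBinaryPath))
	})
}

func main() {
	for i := 0; i < 100000; i++ {
		newClientBuggy() // PATH gains ":/opt/nri/bin" on every call
	}
	// By now PATH alone is roughly 1.3 MB, far past MAX_ARG_STRLEN, so
	// any exec from this process would fail with E2BIG.
	fmt.Println("PATH length:", len(os.Getenv("PATH")))
}
```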
@aojea Now I’m wondering if I installed Istio with CNI support and perhaps that is leading to this issue. I’m going to create a fresh cluster and make sure I install without any CNI. Will let you know…
@aojea sorry for the delay, regarding additional info:
deployment.zip
None at all. And libcni, which containerd uses, doesn’t set any arguments – https://github.com/containernetworking/cni/blob/master/pkg/invoke/raw_exec.go#L37
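In other words, CNI plugins take no argv at all: the operation travels in CNI_* environment variables and the network config arrives on stdin, so an E2BIG on exec'ing /opt/cni/bin/loopback can only come from the environment. Roughly (an approximation of libcni's exec path, not its exact code):

```go
// Approximate shape of a CNI plugin invocation: no command-line
// arguments; everything is passed via environment variables and stdin.
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

func invokePlugin(pluginPath string, netconf []byte, env []string) ([]byte, error) {
	var stdout bytes.Buffer
	cmd := exec.Cmd{
		Path:   pluginPath,
		Args:   []string{pluginPath}, // argv[0] only, no real arguments
		Env:    env,                  // CNI_* variables plus the inherited environment
		Stdin:  bytes.NewReader(netconf),
		Stdout: &stdout,
		Stderr: os.Stderr,
	}
	if err := cmd.Run(); err != nil {
		return nil, err // a bloated inherited env surfaces here as E2BIG
	}
	return stdout.Bytes(), nil
}

func main() {
	env := append(os.Environ(),
		"CNI_COMMAND=DEL",
		"CNI_CONTAINERID=0a36509dcfc5",
		"CNI_NETNS=/var/run/netns/example",
		"CNI_IFNAME=lo",
		"CNI_PATH=/opt/cni/bin",
	)
	out, err := invokePlugin("/opt/cni/bin/loopback",
		[]byte(`{"cniVersion":"0.3.1","name":"lo","type":"loopback"}`), env)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("%s\n", out)
}
```

So if the inherited environment has been poisoned by something like the PATH growth above, the plugin never even starts, which matches "netplugin failed with no error message".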
This is properly weird 😃