containerd: pod deletion failing on network namespace
Description
Deleted a pod; it stays in Terminating:
Sep 19 23:26:23 xx.yyyy containerd[245702]: time="2019-09-19T23:26:23.112627249Z" level=error msg="PodSandboxStatus for "004c4f886765769305f7e65a42c66ec95fda8fba9ddbf0d62fedae62e8873299" failed" error="failed to get sandbox ip: check network namespace closed: remove netns: unlinkat /var/run/netns/cni-45ff10e9-dcc1-b779-f1a1-3515a5d56e61: device or resource busy"
Steps to reproduce the issue: unsure.
Output of containerd --version:
containerd github.com/containerd/containerd v1.2.9 d50db0a42053864a270f648048f9a8b4f24eced3
Any other relevant information:
About this issue
- State: closed
- Created 5 years ago
- Reactions: 6
- Comments: 33 (23 by maintainers)
Commits related to this issue
- packages: Handle fs.may_detach_mounts sysctl for containerd On EL7.4, a new sysctl, `fs.may_detach_mounts`, was added which should be enabled on hosts where container runtimes are being used (it's of... — committed to scality/metalk8s by NicolasT 3 years ago
- fix: https://github.com/containerd/containerd/issues/3667; support disabel logrotate; set dnsPolicy for scheduler. Signed-off-by: huaiyou <huaiyou.cyz@alibaba-inc.com> — committed to AliyunContainerService/ackdistro by VinceCui a year ago
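The fs.may_detach_mounts sysctl referenced in the metalk8s commit above is, as I understand it, an EL7.4+ specific knob that defaults to 0; with it disabled, a file that is still a mount point in some mount namespace cannot be unlinked, which is exactly the "device or resource busy" symptom here. A minimal sketch of enabling it, assuming an EL7.4+ kernel where this sysctl exists (the sysctl.d file name is illustrative):

```sh
# Check the current value, then enable it persistently and immediately.
sysctl fs.may_detach_mounts
echo 'fs.may_detach_mounts = 1' | sudo tee /etc/sysctl.d/99-may-detach-mounts.conf
sudo sysctl -w fs.may_detach_mounts=1
```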
The solution for me was indicated here:
https://github.com/cri-o/cri-o/pull/4210
Good luck.
Thinking more about it, maybe this /run/netns/cni-X bind mount could be made "private" so it never appears in the container mount namespace in the first place (but I'm not sure it's possible). But if we start to hide some mount points, it's hard to say where we stop. Wait for @fuweid / containerd maintainers' opinions to close this issue.
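Purely as an illustration of that idea (not a containerd/CNI change, and it may interfere with tooling such as ip netns that expects the default propagation): /run/netns could be turned into its own mount point on the host and marked private, so netns bind mounts created afterwards do not propagate into existing container mount namespaces.

```sh
# Sketch of the "make it private" idea; run on the host, not in a container.
mount --bind /run/netns /run/netns      # ensure it is a mount point
mount --make-private /run/netns         # stop propagating new submounts
grep ' /run/netns ' /proc/self/mountinfo   # optional fields show "shared:N" unless private
```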
Check your iproute rpm package. Maybe related to https://patchwork.ozlabs.org/patch/796300/
@fuweid I still have the same issue with patched iproute… What information should I provide for debugging?
Adding mountPropagation: HostToContainer is enough to fix the issue.
@kfox1111 you might want to use mountPropagation: Bidirectional
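A minimal sketch of where that field goes, assuming a hostPath mount of /var/run/netns (the pod name and image are illustrative; Bidirectional propagation requires a privileged container):

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: netns-propagation-demo
spec:
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true                    # required for Bidirectional
    volumeMounts:
    - name: host-netns
      mountPath: /var/run/netns
      mountPropagation: Bidirectional     # or HostToContainer, as suggested above
  volumes:
  - name: host-netns
    hostPath:
      path: /var/run/netns
EOF
```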
I'm not sure how much we can trust lsof with namespaces. Will have a look at your link on Monday (it's 11pm Sunday for me).
I think I have the same issue; I will try to update my containerd, but here is my investigation on 1.2.6.
Description: Sometimes pods are stuck in Terminating; in the logs I see, in a loop:
We cannot shut down "rook-ceph-mon-a-59cbf85446-jwprv" because "/var/run/netns/cni-3276e31b-c2af-4840-abce-a9d3e8d061b4: device or resource busy"
Looking at the strace output, it fails when we try to unlink "/var/run/netns/cni-3276e31b-c2af-4840-abce-a9d3e8d061b4".
This is a simple empty file
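For reference, a trace like that could be captured by attaching to containerd and filtering on the unlink calls (the PID below is the one from the log at the top of this issue, used purely as an example):

```sh
strace -f -e trace=unlink,unlinkat -p 245702 2>&1 | grep netns
```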
So the likely explanation for the "device or resource busy" error is that there is something mounted on top of it.
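One way to confirm that, and to find who still holds the mount, is to look for the stale path in every process's mountinfo (path copied from the log above):

```sh
NS=/var/run/netns/cni-3276e31b-c2af-4840-abce-a9d3e8d061b4
grep -l "$NS" /proc/[0-9]*/mountinfo
```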
Process 3091 seems to be the one using this file,
so the pod that is blocking us is cc17615e-1c3b-4155-a2f2-658090f682ca,
i.e. rook-ceph/csi-rbdplugin-b5z8n.
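The PID-to-pod mapping could be done by reading the kubelet cgroup path of that process, which embeds the pod UID (with the systemd cgroup driver the UID uses underscores instead of dashes), then matching that UID against the pods:

```sh
grep -o 'pod[0-9a-f][0-9a-f_-]*' /proc/3091/cgroup | sort -u
kubectl get pods -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,UID:.metadata.uid | grep cc17615e
```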
Now if we go back a bit, the pod that fails to shut down is rook-ceph/rook-ceph-mon-a-59cbf85446-jwprv; if we search with the sandbox id,
we see that the process is not running anymore.
kubectl get -n rook-ceph pod/csi-rbdplugin-b5z8n -o yaml > rook-ceph_csi-rbdplugin-b5z8n.txt
kubectl get -n rook-ceph pod/rook-ceph-mon-a-59cbf85446-jwprv -o yaml > rook-ceph_rook-ceph-mon-a-59cbf85446-jwprv.txt
csi-rbdplugin-b5z8n has "hostPID: true", but rook-ceph-mon-a-59cbf85446-jwprv does not.
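A quick way to double-check that difference directly:

```sh
kubectl -n rook-ceph get pod csi-rbdplugin-b5z8n rook-ceph-mon-a-59cbf85446-jwprv \
  -o custom-columns=NAME:.metadata.name,HOSTPID:.spec.hostPID
```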
If I look at another server with a stuck container
I’ll try to reproduce with more minimal containers, but that means destroying everything
Steps to reproduce the issue: no idea; using latest CentOS (7.7 / 3.10.0-1062.1.1.el7.x86_64), cluster deployed using kubespray 2.11 configured to use containerd instead of docker, all the rest default.
Describe the results you expected: It works 😉
Output of containerd --version: