containerd: Containerd service not restarting (and many pods stuck in Terminating status)

Discussed in https://github.com/containerd/containerd/discussions/6949

Originally posted by mimmus on May 17, 2022:

Hi, we are tied to CentOS 7.9 and containerd.io-1.4.7-3.1.el7.x86_64, as these are the releases currently supported by our k8s distro. We are experiencing many issues with the containerd service not restarting (and, as a consequence, the node going into NotReady status). In addition, although we don't know whether it's related, pods get stuck in the Terminating state during deployment restarts.
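For anyone triaging the same symptoms, two quick checks (generic kubectl usage, not part of the original report) surface both the NotReady nodes and the stuck pods:

```
# Nodes that have fallen out of Ready:
kubectl get nodes | grep NotReady

# Pods stuck in Terminating (shown in the STATUS column):
kubectl get pods --all-namespaces | grep Terminating
```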

This is the containerd journal log during a restart of the service:

```
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal systemd[1]: Stopping containerd container runtime...
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[2040]: time="2022-05-17T13:35:38.065983468Z" level=info msg="Stop CRI service"
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[2040]: time="2022-05-17T13:35:38.066526675Z" level=info msg="Stop CRI service"
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[2040]: time="2022-05-17T13:35:38.066548441Z" level=info msg="Event monitor stopped"
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[2040]: time="2022-05-17T13:35:38.066565000Z" level=info msg="Stream server stopped"
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal systemd[1]: Stopped containerd container runtime.
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal systemd[1]: Starting containerd container runtime...
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.363681481Z" level=info msg="starting containerd" revision=3194fb46e8311ae0eeae5a7a5843573adfebb16d version=1.4.7
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.387897453Z" level=info msg="loading plugin \"io.containerd.content.v1.content\"..." type=io.containerd.content.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.387961876Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.391702121Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found.\\n\"): skip plugin" type=io.containerd.snapshotter.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.391731548Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.391776662Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.391787006Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.391811077Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.391977493Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.393773966Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.393800413Z" level=info msg="loading plugin \"io.containerd.metadata.v1.bolt\"..." type=io.containerd.metadata.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.393831336Z" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.393840160Z" level=info msg="metadata content store policy set" policy=shared
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.393998757Z" level=info msg="loading plugin \"io.containerd.differ.v1.walking\"..." type=io.containerd.differ.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394014737Z" level=info msg="loading plugin \"io.containerd.gc.v1.scheduler\"..." type=io.containerd.gc.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394080847Z" level=info msg="loading plugin \"io.containerd.service.v1.introspection-service\"..." type=io.containerd.service.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394121100Z" level=info msg="loading plugin \"io.containerd.service.v1.containers-service\"..." type=io.containerd.service.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394146536Z" level=info msg="loading plugin \"io.containerd.service.v1.content-service\"..." type=io.containerd.service.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394399030Z" level=info msg="loading plugin \"io.containerd.service.v1.diff-service\"..." type=io.containerd.service.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394448934Z" level=info msg="loading plugin \"io.containerd.service.v1.images-service\"..." type=io.containerd.service.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394480597Z" level=info msg="loading plugin \"io.containerd.service.v1.leases-service\"..." type=io.containerd.service.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394508698Z" level=info msg="loading plugin \"io.containerd.service.v1.namespaces-service\"..." type=io.containerd.service.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394528256Z" level=info msg="loading plugin \"io.containerd.service.v1.snapshots-service\"..." type=io.containerd.service.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394560189Z" level=info msg="loading plugin \"io.containerd.runtime.v1.linux\"..." type=io.containerd.runtime.v1
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.394919264Z" level=info msg="loading plugin \"io.containerd.runtime.v2.task\"..." type=io.containerd.runtime.v2
May 17 13:35:38 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:35:38.412085303Z" level=error msg="loading container 4a1c59aeefdf6a5ce65b98a43482966aae791a68b6e1b99d85de8a37036006d3" error="container \"4a1c59aeefdf6a5ce65b98a43482966aae791a68b6e1b99d85de8a37036006d3\" in namespace \"k8s.io\": not found"
May 17 13:37:08 ip-10-161-14-214.eu-central-1.compute.internal systemd[1]: containerd.service start operation timed out. Terminating.
May 17 13:37:18 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:37:18.436911668Z" level=warning msg="cleaning up after shim disconnected" id=cab8b98388ef1de379daf572bb64f758a0f5a51803e2e37f5a34fad7afb023fa namespace=k8s.io
May 17 13:37:18 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:37:18.437478392Z" level=info msg="cleaning up dead shim"
May 17 13:37:18 ip-10-161-14-214.eu-central-1.compute.internal containerd[12708]: time="2022-05-17T13:37:18.446814884Z" level=warning msg="failed to clean up after shim disconnected" error="io.containerd.runc.v1: remove /run/containerd/s/9a8cb101c42564a04bef6227e55fa19539a54775fd441787d5e548a1e2156477: no such file or directory\n: exit status 1" id=cab8b98388ef1de379daf572bb64f758a0f5a51803e2e37f5a34fad7afb023fa namespace=k8s.io
```
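For reference, a log like the one above can be captured with standard systemd tooling (generic commands, not from the thread). Note that on the affected nodes the restart blocks until systemd's start timeout fires, matching the "start operation timed out" line at 13:37:08:

```
systemctl restart containerd
journalctl -u containerd --since "10 minutes ago" --no-pager
```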

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (12 by maintainers)

Most upvoted comments

1. figure out the initial problem, fix that...

It seems that our nodes are missing this kernel parameter: fs.may_detach_mounts. Setting it to 1 with `echo 1 > /proc/sys/fs/may_detach_mounts` released all containers stuck in Terminating status, and I'm confident this will also solve the containerd restart issue.
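The echo only lasts until the next reboot. A minimal sketch of making the setting persistent on CentOS 7 (the drop-in file name below is an arbitrary choice, not from the thread):

```
# Apply immediately:
sysctl -w fs.may_detach_mounts=1

# Persist across reboots:
echo 'fs.may_detach_mounts = 1' > /etc/sysctl.d/99-may-detach-mounts.conf
sysctl --system   # reload every sysctl configuration file
```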

Thanks @mimmus for the solution. It looks like this setting was handled automatically by the Docker engine, but with containerd it is missing.

I had some pods stuck in the Terminating state and found these logs in containerd:

time="2022-06-01T10:40:17.354538461-07:00" level=info msg="TearDown network for sandbox \"7c2aec0bfe0c056e7c2c5dda440d31124da53eaba3fe31741b931530c84e216c\" successfully"
time="2022-06-01T10:40:17.354552175-07:00" level=info msg="StopPodSandbox for \"7c2aec0bfe0c056e7c2c5dda440d31124da53eaba3fe31741b931530c84e216c\" returns successfully"
time="2022-06-01T10:40:17.354715998-07:00" level=info msg="RemovePodSandbox for \"7c2aec0bfe0c056e7c2c5dda440d31124da53eaba3fe31741b931530c84e216c\""
time="2022-06-01T10:40:22.394186510-07:00" level=error msg="RemovePodSandbox for \"7c2aec0bfe0c056e7c2c5dda440d31124da53eaba3fe31741b931530c84e216c\" failed" error="failed to remove volatile sandbox root directory \"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/7c2aec0bfe0c056e7c2c5dda440d31124da53eaba3fe31741b931530c84e216c\": unlinkat /run/containerd/io.containerd.grpc.v1.cri/sandboxes/7c2aec0bfe0c056e7c2c5dda440d31124da53eaba3fe31741b931530c84e216c/shm: device or resource busy"
time="2022-06-01T10:41:27.442916350-07:00" level=info msg="StopPodSandbox for \"70b9b95e25205b25a152639808fd31379d8dbc3cc1b563967a77fe4f694776c9\""
time="2022-06-01T10:41:27.443355472-07:00" level=error msg="StopPodSandbox for \"70b9b95e25205b25a152639808fd31379d8dbc3cc1b563967a77fe4f694776c9\" failed" error="failed to check network namespace closed: remove netns: unlinkat /var/run/netns/cni-8c107a57-807b-6478-21fd-296a68492e97: device or resource busy"
time="2022-06-01T10:42:27.449716945-07:00" level=info msg="StopPodSandbox for \"70b9b95e25205b25a152639808fd31379d8dbc3cc1b563967a77fe4f694776c9\""
time="2022-06-01T10:42:27.450890376-07:00" level=error msg="StopPodSandbox for \"70b9b95e25205b25a152639808fd31379d8dbc3cc1b563967a77fe4f694776c9\" failed" error="failed to check network namespace closed: remove netns: unlinkat /var/run/netns/cni-8c107a57-807b-6478-21fd-296a68492e97: device or resource busy"

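The "unlinkat ... device or resource busy" failures are consistent with fs.may_detach_mounts=0: the sandbox's shm and netns mounts stay pinned in other mount namespaces and cannot be unmounted. Two generic checks (standard tools, not from the original comment) to see what is still held:

```
# Sandbox mounts still visible in the host mount namespace:
findmnt -R /run/containerd/io.containerd.grpc.v1.cri/sandboxes

# Leftover CNI network namespaces:
ls -l /var/run/netns/
```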
time="2022-06-01T10:41:27.442916350-07:00" level=info msg="StopPodSandbox for \"70b9b95e25205b25a152639808fd31379d8dbc3cc1b563967a77fe4f694776c9\""
time="2022-06-01T10:41:27.443355472-07:00" level=error msg="StopPodSandbox for \"70b9b95e25205b25a152639808fd31379d8dbc3cc1b563967a77fe4f694776c9\" failed" error="failed to check network namespace closed: remove netns: unlinkat /var/run/netns/cni-8c107a57-807b-6478-21fd-296a68492e97: device or resource busy"
time="2022-06-01T10:42:27.449716945-07:00" level=info msg="StopPodSandbox for \"70b9b95e25205b25a152639808fd31379d8dbc3cc1b563967a77fe4f694776c9\""
time="2022-06-01T10:42:27.450890376-07:00" level=error msg="StopPodSandbox for \"70b9b95e25205b25a152639808fd31379d8dbc3cc1b563967a77fe4f694776c9\" failed" error="failed to check network namespace closed: remove netns: unlinkat /var/run/netns/cni-8c107a57-807b-6478-21fd-296a68492e97: device or resource busy"

I couldn't map the containerd task IDs to Kubernetes pods, but after running `echo 1 > /proc/sys/fs/may_detach_mounts`, the Terminating pods were gone.
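A quick way to verify that the fix took effect (generic commands, not from the comment):

```
sysctl fs.may_detach_mounts    # expected output: fs.may_detach_mounts = 1
kubectl get pods --all-namespaces | grep Terminating || echo 'no pods stuck in Terminating'
```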