kubernetes: Pod stuck in ContainerCreating: Unit ...slice already exists

What happened:

Errors like this one:

May 27 06:38:19.960408 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:19.960361 1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16

(when using cgroupDriver: systemd)
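For reference, the affected setup is a kubelet running with the systemd cgroup driver. A minimal sketch of the relevant KubeletConfiguration field follows; the file path and surrounding fields are illustrative, not taken from the affected cluster:

  # /var/lib/kubelet/config.yaml (illustrative path)
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  # Only this setting matters for this bug; per the report, clusters on the
  # default cgroupfs driver do not hit this code path.
  cgroupDriver: systemd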

What you expected to happen:

No such errors

How to reproduce it (as minimally and precisely as possible):

I don’t know for sure.

Anything else we need to know?:

This was introduced in k8s in #102147 and backported to 1.21 in #102196, so it needs to be fixed in both master and release-1.21.

RH BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1965545

The cause is a regression in runc/libcontainer: https://github.com/opencontainers/runc/issues/2996

The fix is in https://github.com/opencontainers/runc/pull/2997, which should make its way into runc 1.0.0 GA.

Currently there is a DNM (do-not-merge) PR to bump runc to the version with the fix: https://github.com/kubernetes/kubernetes/pull/102508, but we have decided (https://github.com/kubernetes/kubernetes/pull/102250#issuecomment-855922711) to wait until the release.
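For context, the bump itself is a dependency update in kubernetes/kubernetes' go.mod; roughly, it amounts to something like the following sketch (the final tag is an assumption until the GA release exists, and the real update also goes through the vendoring scripts):

  require (
      github.com/opencontainers/runc v1.0.0 // hypothetical: bumped from v1.0.0-rc95 to pick up opencontainers/runc#2997
  )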

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 45 (38 by maintainers)

Most upvoted comments

Not sure, but the problem still exists in both 1.21.2 and 1.21.3. Stuck with the *.slice problem: the container is not created and the kubelet complains that the slice already exists. Also found a fatal error that causes the kubelet itself to fail:

kubelet.go:1384] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: failed to create top level Burstable QOS cgroup : error while starting unit \"kubepods-burstable.slice\" with properties [{Name:Description Value:\"libcontainer container kubepods-burstable.slice\"} {Name:Wants Value:[\"kubepods.slice\"]} {Name:MemoryAccounting Value:true} {Name:CPUAccounting Value:true} {Name:IOAccounting Value:true} {Name:TasksAccounting Value:true} {Name:DefaultDependencies Value:false}]: Unit kubepods-burstable.slice already exists."

kubelet.service: Main process exited, code=exited, status=1/FAILURE

runc is 1.0.1 under Arch Linux.
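If a node is stuck in this state, the leftover transient units can be inspected directly with systemd; a quick triage sketch, using the unit name from the log above:

  # list any kubepods slices systemd still tracks on the node
  systemctl list-units --all 'kubepods*'
  # show the state of the slice the kubelet says already exists
  systemctl status kubepods-burstable.slice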

@sfxworks thanks for the report; we believe that this currently affects 1.21.2 (but not prior versions of 1.21) and the 1.22 release candidate, based on our RCA. It’s a release blocker; we’re working on it.

/milestone v1.22

+1 to backport to 1.21 as well to fix the bug in 1.21.x

Can confirm I just experienced this as well, after upgrading to v1.21.2. It’s intermittent.

Thanks for opening #104280.

If I’m understanding correctly, this issue (#102676) describes a problem that is still present in release-1.21 (runc rc95+), but we can’t pick up the runc version that fixes it (1.0.1) because of #104280?

@odinuge thanks! We’re backporting both https://github.com/opencontainers/runc/pull/3082 and https://github.com/opencontainers/runc/pull/3067 to 1.0.1, and still hope to have a release soon.

@odinuge Got https://github.com/kubernetes/test-infra/pull/22553 approved to help out with the failing serial tests on the cri-o/systemd driver; https://github.com/kubernetes/kubernetes/pull/102169 is also pending and should help.

The current runc diff to master: https://github.com/opencontainers/runc/compare/v1.0.0-rc95...master

This is not the right diff to look at. We have already upgraded to 1.0.0, and it caused an issue (https://github.com/kubernetes/kubernetes/pull/103483) that forced a revert. The issue has been fixed and backported to the release-1.0 branch; we’re going to have a release this week.

So, the diff to look at is the difference between 1.0.0 and to-be-released 1.0.1: https://github.com/opencontainers/runc/compare/v1.0.0...release-1.0
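The same comparison can be made locally with git in a runc checkout (a sketch; the tag and branch names are the ones referenced above):

  git clone https://github.com/opencontainers/runc.git
  cd runc
  # commits that will be in 1.0.1 but are not in the v1.0.0 tag
  git log --oneline v1.0.0..origin/release-1.0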

The list of PRs for 1.0.1 can be seen at https://github.com/opencontainers/runc/issues/3076; from the kubernetes perspective, this is limited to the fix that caused a revert of runc 1.0.0 bump.

The runc 1.0.0 (or 1.0.1) bump would also fix the "dbus: connection closed by user" issue caused by a dbus restart (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1941456), but I guess it’s not super critical (the workaround is to restart Kubernetes).

I am fine either way, but bumping runc to 1.0.1 and backporting may be beneficial and should not introduce any new regressions as we have already tested 1.0.0.

I agree with @odinuge; the churn and bugs in runc are very concerning. If #102250 works, then we should just take that and hold off on the huge dependency bump.

cc @derekwaynecarr @SergeyKanzhelev as well.

You should probably mention that this only affects people who use cgroupDriver: systemd.