kubernetes: Pod stuck in ContainerCreating: Unit ...slice already exists

What happened:

Errors like this one:

May 27 06:38:19.960408 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:19.960361 1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16

(when using cgroupDriver: systemd)
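For reference, the affected setup is a kubelet running with the systemd cgroup driver. A minimal sketch of the relevant KubeletConfiguration field follows; the file path and surrounding fields are illustrative, not taken from the affected cluster:

  # /var/lib/kubelet/config.yaml (illustrative path)
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  # Only this setting matters for this bug; per the report, clusters on the
  # default cgroupfs driver do not hit this code path.
  cgroupDriver: systemd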

What you expected to happen:

No such errors

How to reproduce it (as minimally and precisely as possible):

I don’t know for sure.

Anything else we need to know?:

This was introduced in k8s in #102147 and backported to 1.21 in #102196, so it needs to be fixed in both master and release-1.21.

RH BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1965545

The cause is a regression in runc/libcontainer: https://github.com/opencontainers/runc/issues/2996

The fix is in https://github.com/opencontainers/runc/pull/2997, which should make its way into runc 1.0.0 GA.

Currently there is a DNM (do-not-merge) PR to bump runc to the version with the fix: https://github.com/kubernetes/kubernetes/pull/102508, but we have decided (https://github.com/kubernetes/kubernetes/pull/102250#issuecomment-855922711) to wait until the release.
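For context, the bump itself is a dependency update in kubernetes/kubernetes' go.mod; roughly, it amounts to something like the following sketch (the final tag is an assumption until the GA release exists, and the real update also goes through the vendoring scripts):

  require (
      github.com/opencontainers/runc v1.0.0 // hypothetical: bumped from v1.0.0-rc95 to pick up opencontainers/runc#2997
  )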

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 45 (38 by maintainers)

Most upvoted comments

Not sure, but the problem still exists in both 1.21.2 and 1.21.3. Stuck with the *.slice problem: the container is not created and the kubelet complains that the slice already exists. Also found a fatal error that causes the kubelet itself to fail:

kubelet.go:1384] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: failed to create top level Burstable QOS cgroup : error while starting unit \"kubepods-burstable.slice\" with properties [{Name:Description Value:\"libcontainer container kubepods-burstable.slice\"} {Name:Wants Value:[\"kubepods.slice\"]} {Name:MemoryAccounting Value:true} {Name:CPUAccounting Value:true} {Name:IOAccounting Value:true} {Name:TasksAccounting Value:true} {Name:DefaultDependencies Value:false}]: Unit kubepods-burstable.slice already exists."

kubelet.service: Main process exited, code=exited, status=1/FAILURE

runc is 1.0.1 under Arch Linux.
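If a node is stuck in this state, the leftover transient units can be inspected directly with systemd; a quick triage sketch, using the unit name from the log above:

  # list any kubepods slices systemd still tracks on the node
  systemctl list-units --all 'kubepods*'
  # show the state of the slice the kubelet says already exists
  systemctl status kubepods-burstable.slice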

@sfxworks thanks for the report; we believe that this currently affects 1.21.2 (but not prior versions of 1.21) and the 1.22 release candidate, based on our RCA. It’s a release blocker; we’re working on it.

/milestone v1.22

+1 to backport to 1.21 as well to fix the bug in 1.21.x

Can confirm I just experienced this as well, after upgrading to v1.21.2. It’s intermittent.

Thanks for opening #104280.

If I’m understanding correctly, this issue (#102676) describes a problem that is still present in release-1.21 (runc rc95+), but we can’t pick up the runc version that fixes it (1.0.1) because of #104280?

@odinuge thanks! We’re backporting both https://github.com/opencontainers/runc/pull/3082 and https://github.com/opencontainers/runc/pull/3067 to 1.0.1, and still hope to have a release soon.

@odinuge Got https://github.com/kubernetes/test-infra/pull/22553 approved to help out with the failing serial tests on the cri-o/systemd driver; https://github.com/kubernetes/kubernetes/pull/102169 is also pending and should help.

The current runc diff to master: https://github.com/opencontainers/runc/compare/v1.0.0-rc95...master

This is not the right diff to look at. We have already upgraded to 1.0.0, and it caused an issue (https://github.com/kubernetes/kubernetes/pull/103483) that forced a revert. The issue has been fixed and backported to the release-1.0 branch; we’re going to have a release this week.

So, the diff to look at is the difference between 1.0.0 and to-be-released 1.0.1: https://github.com/opencontainers/runc/compare/v1.0.0...release-1.0
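The same comparison can be made locally with git in a runc checkout (a sketch; the tag and branch names are the ones referenced above):

  git clone https://github.com/opencontainers/runc.git
  cd runc
  # commits that will be in 1.0.1 but are not in the v1.0.0 tag
  git log --oneline v1.0.0..origin/release-1.0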

The list of PRs for 1.0.1 can be seen at https://github.com/opencontainers/runc/issues/3076; from the kubernetes perspective, this is limited to the fix that caused a revert of runc 1.0.0 bump.

The runc 1.0.0 (or 1.0.1) bump would also fix the "dbus: connection closed by user" issue caused by a dbus restart (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1941456), but I guess it’s not super critical (the workaround is to restart Kubernetes).

I am fine either way, but bumping runc to 1.0.1 and backporting may be beneficial and should not introduce any new regressions as we have already tested 1.0.0.

I agree with @odinuge; the churn and bugs in runc are very concerning. If #102250 works, then we should just take that and hold off on the huge dependency bump.

cc @derekwaynecarr @SergeyKanzhelev as well.

You should probably mention that this only affects people who use cgroupDriver: systemd.