kubernetes: Pod stuck in ContainerCreating: Unit ...slice already exists
What happened:
Errors like this one appear and the pod never starts:

    May 27 06:38:19.960408 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:19.960361 1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16

(This happens when using cgroupDriver: systemd.)
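To confirm whether a node is in the affected configuration, a minimal check (a sketch; it assumes the common kubelet config path /var/lib/kubelet/config.yaml, which varies by distribution and is not specified in this report) is:

    # Check which cgroup driver kubelet is configured with; the config path is an
    # assumption and differs on some distributions (e.g. OpenShift nodes).
    grep -i cgroupDriver /var/lib/kubelet/config.yaml
    # Expected output on an affected node:
    #   cgroupDriver: systemd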
What you expected to happen:
No such errors
How to reproduce it (as minimally and precisely as possible):
I don’t know for sure.
Anything else we need to know?:
This was introduced in k8s in #102147 and backported to 1.21 in #102196, so it needs to be fixed in both master and release-1.21.
RH BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1965545
The cause is a regression in runc/libcontainer: https://github.com/opencontainers/runc/issues/2996
The fix is in https://github.com/opencontainers/runc/pull/2997, which should make its way into runc 1.0.0 GA.
Currently there is a DNM (do-not-merge) PR to bump runc to the version with the fix: https://github.com/kubernetes/kubernetes/pull/102508, but we have decided (https://github.com/kubernetes/kubernetes/pull/102250#issuecomment-855922711) to wait until the release.
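Note that the regression lives in the libcontainer code that kubelet vendors from runc, so the version pinned in the kubernetes go.mod is what matters, independent of the runc binary installed on the node. A quick way to see both (a sketch; the grep assumes you are at the root of a kubernetes/kubernetes checkout):

    # Vendored runc/libcontainer module used by kubelet's pod cgroup manager
    # (run from the root of a kubernetes/kubernetes checkout).
    grep 'github.com/opencontainers/runc' go.mod

    # runc binary on the node; versioned independently of the vendored module.
    runc --version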
Environment:
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration:
- OS (e.g. cat /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
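A minimal sketch for collecting the details this template asks for, plus the runc version, which is relevant to this particular issue:

    kubectl version       # Kubernetes client and server versions
    cat /etc/os-release   # OS of the affected node
    uname -a              # kernel of the affected node
    runc --version        # runc on the node (relevant here)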
About this issue
- State: closed
- Created 3 years ago
- Reactions: 5
- Comments: 45 (38 by maintainers)
Not sure, but the problem still exists in both 1.21.2 and 1.21.3. Pods are stuck with the *.slice problem: the container is not created and kubelet keeps complaining that the slice already exists. Also found a fatal error that causes kubelet to fail entirely.
runc is 1.0.1 under Arch Linux.
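One way to see whether a node is hitting this (a sketch, assuming a systemd-based node where kubelet runs as a systemd unit named kubelet) is to look for the leftover pod slice units and the matching kubelet errors:

    # Pod cgroup slice units that systemd still knows about; the error above means
    # kubelet tried to recreate one of these and systemd refused.
    systemctl list-units --all 'kubepods-burstable-pod*.slice'

    # Matching errors in the kubelet logs.
    journalctl -u kubelet | grep 'already exists'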
@sfxworks thanks for the report. Based on our RCA, we believe this currently affects 1.21.2 (but not prior versions of 1.21) and the 1.22 release candidate. It's a release blocker; we're working on it.
/milestone v1.22
+1 to backport to 1.21 as well to fix the bug in 1.21.x
Can confirm I just experienced this as well after upgrading to v1.21.2. It's intermittent as well.
Thanks for opening #104280. If I'm understanding correctly, this issue (#102676) describes a problem that is still present in release-1.21 (runc rc95+), but we can't pick up the runc version that fixes it (1.0.1) because of #104280?
@odinuge thanks! We’re backporting both https://github.com/opencontainers/runc/pull/3082 and https://github.com/opencontainers/runc/pull/3067 to 1.0.1, and still hope to have a release soon.
@odinuge Got https://github.com/kubernetes/test-infra/pull/22553 approved to help out with the failing serial tests on the cri-o/systemd driver; https://github.com/kubernetes/kubernetes/pull/102169 is also pending and should help.
This is not the right diff to look at. We have already upgraded to 1.0.0, and it caused an issue (https://github.com/kubernetes/kubernetes/pull/103483) that forced a revert. The issue has been fixed and backported to the release-1.0 branch; we're going to have a release this week.
So, the diff to look at is the difference between 1.0.0 and to-be-released 1.0.1: https://github.com/opencontainers/runc/compare/v1.0.0...release-1.0
The list of PRs for 1.0.1 can be seen at https://github.com/opencontainers/runc/issues/3076; from the kubernetes perspective, this is limited to the fix that caused a revert of runc 1.0.0 bump.
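A local equivalent of that compare link (a sketch, assuming a fresh clone of the runc repository) is:

    # Commits that will land in runc 1.0.1 on top of v1.0.0.
    git clone https://github.com/opencontainers/runc && cd runc
    git log --oneline v1.0.0..origin/release-1.0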
The runc 1.0.0 (or 1.0.1) bump would also fix the "dbus: connection closed by user" issue caused by a dbus restart (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1941456), but I guess it's not super critical (the workaround is to restart kubernetes).
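On a typical node where kubelet runs as a systemd service, that workaround (getting kubelet to reconnect to dbus after dbus/systemd has restarted) amounts to a unit restart; a sketch, assuming the service is named kubelet:

    # Re-establish the kubelet's dbus connection after a dbus/systemd restart.
    sudo systemctl restart kubelet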
I am fine either way, but bumping runc to 1.0.1 and backporting may be beneficial and should not introduce any new regressions as we have already tested 1.0.0.
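For reference, this kind of dependency bump in kubernetes/kubernetes is done with the repository's vendoring scripts; a minimal sketch (assuming hack/pin-dependency.sh and hack/update-vendor.sh as used in the project's vendoring workflow):

    # From the root of a kubernetes/kubernetes checkout: pin the runc module to the
    # release carrying the fix, then regenerate vendor/.
    ./hack/pin-dependency.sh github.com/opencontainers/runc v1.0.1
    ./hack/update-vendor.sh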
I agree with @odinuge; the churn and bugs in runc are very concerning. If #102250 works, then we should just take that and hold off on the huge dependency bump.
cc @derekwaynecarr @SergeyKanzhelev as well.
You should probably mention that this only affects people who use cgroupDriver: systemd.
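To check whether a given node is in that affected configuration without shelling into it, one option (a sketch; NODE_NAME is a placeholder, and the API server must be able to proxy to the kubelet) is to read the kubelet's live configuration through the configz endpoint:

    # Dump the running kubelet configuration via the API server proxy and pull out
    # the cgroup driver setting (NODE_NAME is a placeholder).
    kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/configz" | grep -o '"cgroupDriver":[^,]*'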