kubernetes: Running pods with devices are terminated if kubelet is restarted
What happened?
In the KubeVirt project, we now see a regression when running on Kubernetes 1.25.10 | 1.26.5 | 1.27.2. If kubelet is restarted on a node, all existing, running workloads that use devices are terminated with UnexpectedAdmissionError:
Warning UnexpectedAdmissionError 45s kubelet Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/kvm, which is unexpected
Normal Killing 42s kubelet Stopping container compute
KubeVirt runs virtual machines inside pods and uses a device plugin to advertise e.g. /dev/kvm on the nodes.
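For background on where the error above comes from: a device plugin registers a resource (here devices.kubevirt.io/kvm) with kubelet, and kubelet's device manager tracks which of those devices are currently healthy. Below is a minimal, hedged sketch of such a plugin using the standard v1beta1 device plugin API; it is only an illustration (gRPC wiring, kubelet registration, and the remaining DevicePluginServer methods are omitted, and the device list is hard-coded), not KubeVirt's actual implementation.

```go
// Minimal sketch of a device plugin advertising devices.kubevirt.io/kvm over
// the v1beta1 device plugin API. Illustration only, not KubeVirt's plugin.
package main

import (
	"context"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type kvmPlugin struct{}

// ListAndWatch streams the available "kvm" devices and their health to
// kubelet. After a kubelet restart, the device manager has no healthy
// devices for this resource until the plugin re-registers and this stream
// is re-established.
func (p *kvmPlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	devs := []*pluginapi.Device{
		{ID: "kvm-0", Health: pluginapi.Healthy},
		{ID: "kvm-1", Health: pluginapi.Healthy},
	}
	if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: devs}); err != nil {
		return err
	}
	select {} // block; a real plugin would push health updates here
}

// Allocate returns the device nodes to expose to an admitted container.
// Note that in the failure above kubelet never gets this far: its device
// manager rejects the pod at admission because its own view of healthy
// devices is still empty right after the restart.
func (p *kvmPlugin) Allocate(_ context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for range req.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Devices: []*pluginapi.DeviceSpec{{
				ContainerPath: "/dev/kvm",
				HostPath:      "/dev/kvm",
				Permissions:   "rw",
			}},
		})
	}
	return resp, nil
}

func main() {
	// A real plugin would serve kvmPlugin over gRPC on a socket under
	// /var/lib/kubelet/device-plugins/ and register with kubelet.
	_ = &kvmPlugin{}
}
```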
Presumably, this PR changed the behavior: https://github.com/kubernetes/kubernetes/pull/116376
Original issue: https://github.com/kubernetes/kubernetes/issues/109595
What did you expect to happen?
A restart of kubelet should not interrupt running workloads.
How can we reproduce it (as minimally and precisely as possible)?
With KubeVirt:
- run a KubeVirt VM
- pkill kubelet
- observe that the workload pod gets terminated
Or with https://github.com/k8stopologyawareschedwg/sample-device-plugin:
- make deploy
- make test-both
- pkill kubelet
- observe that the pod gets restarted (see the event-check sketch below)
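In both reproductions the failure shows up as pod events like the ones quoted above. As a convenience, here is a minimal, hypothetical check using client-go that lists the workload pod's events and prints any UnexpectedAdmissionError after kubelet is restarted; the namespace and pod name are placeholders, not part of the original report.

```go
// Hypothetical helper: list the events of the affected pod and print any
// UnexpectedAdmissionError. Namespace and pod name are placeholders.
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns, pod := "default", "virt-launcher-example" // placeholders for the workload pod
	events, err := client.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.name=" + pod,
	})
	if err != nil {
		panic(err)
	}
	for _, ev := range events.Items {
		if ev.Reason == "UnexpectedAdmissionError" || strings.Contains(ev.Message, "no healthy devices") {
			fmt.Printf("%s: %s\n", ev.Reason, ev.Message)
		}
	}
}
```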
Anything else we need to know?
No response
Kubernetes version
This affects the 1.25.x, 1.26.x and 1.27.x branches.
1.25.10 | 1.26.5 | 1.27.2
Cloud provider
N/A
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 28 (21 by maintainers)
Commits related to this issue
- kubelet: devices: skip allocation for running pods When kubelet initializes, runs admission for pods and possibly allocated requested resources. We need to distinguish between node reboot (no contain... — committed to ffromani/kubernetes by ffromani a year ago
- node: devicemgr: topomgr: add logs One of the contributing factors of issues #118559 and #109595 hard to debug and fix is that the devicemanager has very few logs in important flow, so it's unnecessa... — committed to ffromani/kubernetes by ffromani a year ago
@vasiliy-ul We are brainstorming possible options and are working towards a fix for this issue. Reverting the PR would mean we would have to reopen https://github.com/kubernetes/kubernetes/issues/109595 which is a valid issue and is impacting users as well.
In addition to that, we are just past the patch release cherry-pick deadline, so reverts are not going to make their way into the impacted versions until the next set of patch releases, at least another month from now.
Given that we HAVE to wait, it is probably best to aim for a fix. Also, given the critical nature of this bug and the timing, I think the first option recommended by @smarterclayton:
Making device manager intelligent and allowing admission of pods that were previously admitted (or in running phase)
is the way to go, as it solves the problem in a less invasive way compared to option 2, which (while overall nicer in the long run) might need additional rounds of design discussion and time we can't afford right now.
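For illustration, the first option described above (letting the device manager admit pods it has already admitted and that are still running, instead of failing because no healthy devices are registered yet) could look roughly like the sketch below. This is a simplified, hypothetical sketch of the idea, not the actual kubelet code; all names (deviceManager, podDevices, isContainerRunning) are invented for the example.

```go
// Hypothetical sketch of "allow admission of pods that were previously
// admitted and are still running". Names are invented for illustration;
// the real kubelet device manager differs.
package devicemanager

import "fmt"

type deviceManager struct {
	// Device IDs already assigned to containers, restored from the
	// checkpoint file that survives a kubelet restart:
	// podUID -> container name -> device IDs.
	podDevices map[string]map[string][]string
	// Healthy devices currently registered by plugins, keyed by resource
	// name. Right after a kubelet restart this is empty until plugins
	// re-register.
	healthyDevices map[string][]string
	// Reports whether a container is already running on the node.
	isContainerRunning func(podUID, containerName string) bool
}

// allocate decides whether a container's request for `needed` devices of
// `resource` can be admitted.
func (m *deviceManager) allocate(podUID, containerName, resource string, needed int) error {
	// Previously admitted and still running: keep the recorded assignment
	// instead of failing with "no healthy devices present".
	if devs := m.podDevices[podUID][containerName]; len(devs) >= needed &&
		m.isContainerRunning(podUID, containerName) {
		return nil
	}
	if len(m.healthyDevices[resource]) < needed {
		return fmt.Errorf("no healthy devices present; cannot allocate unhealthy devices %s", resource)
	}
	// ...normal allocation path (pick devices, update the checkpoint)...
	return nil
}
```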
It sounds like this is because the device admission plugin is still not able to authoritatively state whether a device is available at the time the restarted pod is run?
We need to make some changes to admission generally to solve this case completely, but until we do, is it possible to have the admission plugin safely accept a pod that is a) never before seen by the device plugin and b) in the running phase? Or are there other reasons why that has been tried and found not to work?
I agree this is a desirable and expected behavior, if nothing else out of habit. This is what the kubelet implementation did.
However, the deeper I look, the less I’m sure it is a guaranteed behavior.
There are well-known circumstances in which kubelet may reserve the option to kill running pods when it restarts, e.g. if the machine config changes. Granted, this is NOT the case reported here (nothing changed across the restart, hence we want a follow-up fix of some kind), but I'm convinced that clarifying the guarantees on kubelet restart should be part of the ongoing conversation.
From the user point of view, I guess, kubelet should keep the running pods. Kubelet can be restarted for various reasons, but IMHO that should not affect critical workloads. Hm… I thought it was actually the intended behavior to always try to keep the pods.
I’m looking into this issue and I’ll be updating shortly. At this point in time I can say that yes, https://github.com/kubernetes/kubernetes/pull/116376 made the devicemanager stricter and leads to this behavior. We may need another way to fix the inconsistency reported in https://github.com/kubernetes/kubernetes/issues/109595, and we will likely need to tighten up/fix the e2e tests for device plugins.
But there could be a (partially?) mismatched expectation as well, because AFAIK the kubelet will run admission on restart (and in general on initialization) and thus may kill running pods.
In other words, in addition to the follow-up fix, it would be beneficial to clarify whether, in general, running containers are guaranteed to survive a kubelet restart.