kubernetes: Running pods with devices are terminated if kubelet is restarted

What happened?

In the KubeVirt project, we now see a regression when running on Kubernetes 1.25.10 | 1.26.5 | 1.27.2. If kubelet is restarted on a node, all existing, running workloads that use devices are terminated with an UnexpectedAdmissionError:

Warning  UnexpectedAdmissionError  45s   kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/kvm, which is unexpected
Normal   Killing                   42s   kubelet            Stopping container compute

KubeVirt runs virtual machines inside pods and uses a device plugin to advertise devices (e.g. /dev/kvm) on the nodes.
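For context, a device-plugin resource is consumed like any other extended resource. Below is a minimal sketch of such a pod, written with the client-go types; the real virt-launcher pod generated by KubeVirt is far more involved, and the image name here is only a placeholder:

// Sketch of a pod that consumes the KubeVirt device-plugin resource.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "kvm-consumer"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "compute",
				Image: "registry.example/guest:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Extended resource advertised by the KubeVirt device plugin.
						"devices.kubevirt.io/kvm": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	fmt.Println(pod.Name)
}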

Presumably, this PR changed the behavior: https://github.com/kubernetes/kubernetes/pull/116376 Original issue: https://github.com/kubernetes/kubernetes/issues/109595

What did you expect to happen?

A restart of kubelet should not interrupt running workloads.

How can we reproduce it (as minimally and precisely as possible)?

with KubeVirt:

  • run a KubeVirt VM
  • pkill kubelet
  • observe that the workload pod gets terminated

or with https://github.com/k8stopologyawareschedwg/sample-device-plugin

  • make deploy
  • make test-both
  • pkill kubelet
  • the pod gets restarted

Anything else we need to know?

No response

Kubernetes version

This affects the 1.25.x, 1.26.x and 1.27.x branches.

1.25.10 | 1.26.5 | 1.27.2

Cloud provider

N/A

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 28 (21 by maintainers)

Most upvoted comments

As I read the comments, a proper solution here may be non-trivial to implement and require more time for additional discussions. Would it make sense then to revert the breaking change meanwhile and work on a fix independently?

@vasiliy-ul We are brainstorming possible options and are working towards a fix for this issue. Reverting the PR would mean we would have to reopen https://github.com/kubernetes/kubernetes/issues/109595 which is a valid issue and is impacting users as well.

In addition to that, we are just past the patch-release cherry-pick deadline, so reverts would not make their way into the impacted versions until the next set of patch releases, at least another month out.

Given that we HAVE to wait, it is probably best to aim for a fix. Also, given the critical nature of this bug and the timing, I think the first option recommended by @smarterclayton (making the device manager intelligent and allowing admission of pods that were previously admitted or are in the Running phase) is the way to go, as it solves the problem in a less invasive way than option 2, which is nicer overall in the long run but might need additional rounds of design discussion and time we can't afford right now.

  1. Running pods should always survive kubelet restart.
  2. Admission is re-run on every kubelet restart (it must, because the kubelet is stateless and we have coupled admission with allocation)
  3. It is the responsibility of every admission plugin to handle the scenario of kubelet restart correctly (by identifying when it can start making admission decisions)
  4. We probably lack all the tools to correctly handle admission + allocation, and we need to identify which ones to add.
  5. Admission is processed in a serial (and mostly random) order and therefore admission plugins cannot safely “block” until initialization is complete
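To make points 2 and 3 above concrete, here is a minimal, self-contained sketch of why a restart can kill a device-consuming pod. The type and field names are hypothetical, not the actual kubelet code: admission is re-run for the pods found on the node, and the device manager rejects them because no device plugin has re-registered yet.

// Illustration only (hypothetical names, not the real kubelet API) of
// admission being re-run after a kubelet restart, before the device
// plugin has re-registered its devices.
package main

import "fmt"

type Pod struct {
	Name    string
	Phase   string // "Running", "Pending", ...
	Devices []string
}

// AdmitHandler mirrors the idea of a kubelet admission plugin.
type AdmitHandler interface {
	Admit(p *Pod) error
}

type deviceManager struct {
	healthy map[string]bool // device -> healthy; empty right after restart
}

func (dm *deviceManager) Admit(p *Pod) error {
	for _, d := range p.Devices {
		if !dm.healthy[d] {
			// The strict check that now also hits already-running pods.
			return fmt.Errorf("no healthy devices present; cannot allocate unhealthy device %s", d)
		}
	}
	return nil
}

func main() {
	dm := &deviceManager{healthy: map[string]bool{}} // plugin not re-registered yet
	handlers := []AdmitHandler{dm}

	// On startup the kubelet syncs the pods it finds on the node and runs
	// them through admission again; a rejection terminates the pod.
	running := []*Pod{{Name: "virt-launcher", Phase: "Running", Devices: []string{"devices.kubevirt.io/kvm"}}}
	for _, p := range running {
		for _, h := range handlers {
			if err := h.Admit(p); err != nil {
				fmt.Printf("killing %s: UnexpectedAdmissionError: %v\n", p.Name, err)
			}
		}
	}
}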

It sounds like this is because the device admission plugin is still not able to authoritatively state whether a device is available at the time the restarted pod is run?

We need to make some changes to admission generally to solve this case completely, but until we do, is it possible to have the admission plugin safely accept a pod that is a) never before seen by the device plugin and b) in the running phase? Or are there other reasons why that has been tried and found not to work?
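A minimal sketch of that idea (accepting pods that were previously admitted and are already running), again with hypothetical types rather than the actual devicemanager code, could look like this:

// Sketch of the discussed mitigation: let admission pass for pods that
// are already Running and have a recorded allocation, instead of failing
// them because the device plugin has not re-registered after a restart.
package main

import "fmt"

type Pod struct {
	Name    string
	Phase   string
	Devices []string
}

type deviceManager struct {
	healthy   map[string]bool     // populated once the plugin re-registers
	allocated map[string][]string // pod -> devices recovered from a checkpoint
}

func (dm *deviceManager) Admit(p *Pod) error {
	// A pod that is already running with a recorded allocation was admitted
	// by a previous kubelet instance; keep it rather than re-evaluating
	// device health before the plugin has re-registered.
	if p.Phase == "Running" {
		if _, ok := dm.allocated[p.Name]; ok {
			return nil
		}
	}
	for _, d := range p.Devices {
		if !dm.healthy[d] {
			return fmt.Errorf("no healthy devices present for %s", d)
		}
	}
	return nil
}

func main() {
	dm := &deviceManager{
		healthy:   map[string]bool{},
		allocated: map[string][]string{"virt-launcher": {"devices.kubevirt.io/kvm"}},
	}
	pod := &Pod{Name: "virt-launcher", Phase: "Running", Devices: []string{"devices.kubevirt.io/kvm"}}
	fmt.Println(dm.Admit(pod)) // <nil>: the running pod survives the restart
}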

> In other words, in addition to the followup fix, it would be beneficial to clarify that in general running containers are or are not guaranteed to survive a kubelet restart.

> From the user point of view, I guess, kubelet should keep the running pods. Kubelet can be restarted for various reasons, but IMHO it should not affect critical workloads. Hm… I thought that it was actually the supposed behavior to always try to keep the pods.

I agree this is a desirable and expected behavior, if nothing else out of habit. This is what the kubelet implementation did.

However, the deeper I look, the less I’m sure it is a guaranteed behavior.

There are well-known circumstances under which the kubelet may reserve the option to kill running pods when it restarts, e.g. if the machine config changes. Granted, this is NOT the case reported here (nothing changed across the restart, hence we want a followup fix of some kind), but I'm convinced that clarifying the guarantees on kubelet restart should be part of the ongoing conversation.

> In other words, in addition to the followup fix, it would be beneficial to clarify that in general running containers are or are not guaranteed to survive a kubelet restart.

From the user point of view, I guess, kubelet should keep the running pods. Kubelet can be restarted for various reasons, but IMHO it should not affect critical workloads. Hm… I thought that it was actually the supposed behavior to always try to keep the pods.

I’m looking into this issue and I’ll be updating shortly. At this point in time I can say that yes, https://github.com/kubernetes/kubernetes/pull/116376 made the devicemanager stricter and leads to this behavior. We may need another way to fix the inconsistency reported in https://github.com/kubernetes/kubernetes/issues/109595, and we will likely need to tighten up/fix the e2e tests for device plugins.

But there could be a (partially?) mismatched expectation as well, because AFAIK the kubelet will run admission on restart (and in general on initialization) and thus may kill running pods.

In other words, in addition to the followup fix, it would be beneficial to clarify that in general running containers are or are not guaranteed to survive a kubelet restart.