kubernetes: Since 1.11, an extended resource reported by a device plugin can remain on a node after a node upgrade even though its device plugin never re-registers
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
In 1.11, https://github.com/kubernetes/kubernetes/pull/61877 removed externalID from Node.Spec, which kubelet previously used to detect node recreation in certain environments during node upgrade. This changed node upgrade behavior in those environments: in tryRegisterWithAPIServer(node *v1.Node) in pkg/kubelet/kubelet_node_status.go, without the externalID check, a node recreated during upgrade is now treated as a previously registered node, so kubelet takes the old node state from the API server instead of regenerating it from node-local state.
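For illustration, here is a simplified Go sketch of the registration decision described above. It is not the verbatim kubelet code: the standalone function name and client parameter are assumptions, and the client-go calls use the 1.11-era (pre-context) signatures.

```go
package sketch

import (
	"k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// tryRegisterSketch mirrors the shape of tryRegisterWithAPIServer after #61877.
func tryRegisterSketch(client clientset.Interface, node *v1.Node) bool {
	_, err := client.CoreV1().Nodes().Create(node)
	if err == nil {
		return true // brand-new Node object; status is built from node-local state
	}
	if !apierrors.IsAlreadyExists(err) {
		return false
	}
	existing, err := client.CoreV1().Nodes().Get(node.Name, metav1.GetOptions{})
	if err != nil {
		return false
	}
	// Before #61877, kubelet compared existing.Spec.ExternalID against the locally
	// computed value and deleted/recreated the Node object on mismatch. Without that
	// check, a node recreated during upgrade is treated as a previously registered
	// node, and its stale status (including extended resource capacity/allocatable)
	// is inherited from the API server.
	_ = existing
	return true
}
```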
For an extended resource reported by a device plugin, this causes a problem: after a node upgrade/recreation, the device plugin DaemonSet pod has to be restarted by kubelet, finish certain setup, and then report to kubelet that its resource is available on the node. However, the node status capacity and allocatable from the API server still contain the old state from before the upgrade. After kubelet syncs with the API server, the node status field is overwritten, and the previously reported extended resources appear in node status capacity/allocatable even though the node is not ready for pods to consume them. As a result, kubelet may start a pod requesting such a resource without the container runtime setup it needs to learn from the device plugin.
Note that to cope with the externalID removal, cluster/gce/upgrade.sh has been updated (https://github.com/kubernetes/kubernetes/issues/63506) to explicitly uncordon the node after checking that it has restarted and become ready. This, however, doesn’t help the device plugin case because node readiness doesn’t depend on the readiness of individual extended resources.
This issue is to explore how we can fix this problem on head and 1.11.
In particular, I wonder whether we could switch to a model where kubelet is the only source of truth for extended resource capacity/allocatable in node status, i.e., stop supporting manual updates to the node status capacity/allocatable fields as documented in https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/. This would allow us to regenerate node status capacity/allocatable on every node status update, which I think is much simpler and more robust. As I’ve heard, some folks have been using this mechanism for simple resource exporting/accounting through a central controller. I wonder whether people are open to switching to the device plugin model in those cases. Even if those extended resources don’t require special container setup, it seems a more secure model to have kubelet fully own its node status.
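For context, the manual advertisement flow in that doc is a JSON-Patch against the node’s /status subresource (shown there with curl through `kubectl proxy`). Below is a rough client-go equivalent, using the example.com/dongle resource name from the doc and assuming the 1.11-era Patch signature; it is exactly the kind of update that would be lost if kubelet regenerated capacity/allocatable on every status update.

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/types"
	clientset "k8s.io/client-go/kubernetes"
)

// advertiseDongle manually adds an extended resource to a node's status capacity,
// mirroring the curl-based JSON-Patch in the extended-resource-node task.
// "~1" escapes "/" in the resource name per JSON Pointer rules.
func advertiseDongle(client clientset.Interface, nodeName string) error {
	patch := []byte(`[{"op": "add", "path": "/status/capacity/example.com~1dongle", "value": "4"}]`)
	_, err := client.CoreV1().Nodes().Patch(nodeName, types.JSONPatchType, patch, "status")
	return err
}
```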
FYI, here is the related OSS issue where we first introduced the extended resource concept: https://github.com/kubernetes/kubernetes/issues/50473.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible): Running the GPU upgrade test from https://github.com/kubernetes/kubernetes/pull/63631 with --upgrade-target set to any 1.11+ version hits the issue.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
About this issue
- State: closed
- Created 6 years ago
- Comments: 33 (29 by maintainers)
Having an external controller watch node status updates to repatch the nodes sounds like a bad “user” experience.
Maybe looking at this from a user’s perspective and then figuring out what we’ll do is a better approach?
The behavior I’d expect to see is the following:
As for the case where the checkpoint data is gone, should we expect the pod’s controller to recreate a new pod? That seems like surprising behavior; are we doing this somewhere else?
Does the general behavior outlined here seem sane to the people in this thread?
It would delay removal of the self-deletion permission for yet another release (already pushed to 1.12 because of version skew), meaning node taints would not be usable for isolating compromised nodes until at least 1.14 (because the associated kubelet could simply delete and re-register its Node object).
I also don’t think it would actually solve this issue. According to https://github.com/kubernetes/kubernetes/issues/63505#issuecomment-388217003, even if we reverted https://github.com/kubernetes/kubernetes/pull/61877 and restored kubelet self-deletion, the changes made in https://github.com/kubernetes/kubernetes/pull/60692 stabilized the computation of the external ID in GCE so that it no longer changes as part of node upgrades in 1.11+. Undoing that, destabilizing the GCE-computed external ID, and reintroducing self-deletion just to solve the stale resource issue in one particular cloud environment doesn’t seem like a good approach.
For extended resources exported through a device plugin, the kubelet device manager will handle their restarts, currently based on checkpoint data; in the future, we will explore whether this information can be stored in some form of API. If the checkpoint data is gone, pods will fail kubelet admission and the pod’s controller should recreate a new pod.
For extended resources exported by an external controller through direct patching of node status capacity, the controller should watch for kubelet readiness changes and repatch the field after the kubelet becomes ready again. A minimal sketch of that repatch loop follows below.
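A minimal sketch of such a controller, assuming it watches a single node and re-applies a JSON-Patch whenever the Ready condition flips back to True; the function name is illustrative and the 1.11-era client-go signatures are assumed.

```go
package sketch

import (
	"k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	clientset "k8s.io/client-go/kubernetes"
)

// repatchOnReady watches one node and re-applies the capacity patch each time the
// node transitions back to Ready, so a manually exported extended resource survives
// kubelet overwriting node status after an upgrade.
func repatchOnReady(client clientset.Interface, nodeName string, patch []byte) error {
	w, err := client.CoreV1().Nodes().Watch(metav1.ListOptions{
		FieldSelector: "metadata.name=" + nodeName,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	wasReady := false
	for event := range w.ResultChan() {
		node, ok := event.Object.(*v1.Node)
		if !ok {
			continue
		}
		ready := false
		for _, cond := range node.Status.Conditions {
			if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
				ready = true
			}
		}
		if ready && !wasReady {
			// Kubelet just (re)became ready; its status sync may have dropped or
			// reverted the manually patched capacity, so re-apply it.
			if _, err := client.CoreV1().Nodes().Patch(nodeName, types.JSONPatchType, patch, "status"); err != nil {
				return err
			}
		}
		wasReady = ready
	}
	return nil
}
```

This is exactly the kind of external bookkeeping called a bad “user” experience earlier in the thread, which is part of the argument for moving such resources to the device plugin model instead.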