kubernetes: device manager can incorrectly recover devices on node restart
What happened?
The following behaviour was reproduced on OpenShift 4.10 (kubeletVersion="v1.23.3+54654d2"), but I believe the root cause is in Kubernetes.
Consider a pod requesting several resources, among them a device (case in point: a DPDK network device):
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: dpdk
  name: testpmd
  namespace: default
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep INF
    image: $DPDK_IMAGE
    imagePullPolicy: Always
    name: dpdk
    resources:
      limits:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
        openshift.io/dpdk_nic_1: 1
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
        openshift.io/dpdk_nic_1: 1
    securityContext:
      capabilities:
        add:
        - IPC_LOCK
        - SYS_RESOURCE
        - NET_RAW
        - NET_ADMIN
      runAsUser: 0
The pod starts and runs just fine until the node is restarted. On restart, the device manager correctly recovers its state from the checkpoint data:
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.461850 2011 manager.go:247] "Starting Device Plugin manager"
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.463809 2011 container_manager_linux.go:930] "Attempting to apply oom_score_adj to process" oomScoreAdj=-999 pid=2011
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.464997 2011 oom_linux.go:66] attempting to set "/proc/2011/oom_score_adj" to "-999"
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.464269 2011 iptables.go:462] running iptables: ip6tables [-w 5 -W 100000 -N KUBE-POSTROUTING -t nat]
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.465254 2011 event.go:294] "Event occurred" object="TESTNODE" kind="Node" apiVersion="" type="Normal" reason="NodeAllocatableEnforced" message="Updated Node Allocatable limit across pods"
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.466379 2011 pod_devices.go:215] "Get checkpoint entry" podUID="c7dcb25a-6b71-4f80-9482-053ea597230c" containerName="dpdk" resourceName="openshift.io/dpdk_nic_1" deviceIDs=map[-1:[0000:00:05.0]] allocated="\n1\n!PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_1\x12\f0000:00:05.0\x1a%\n\x0e/dev/vfio/vfio\x12\x0e/dev/vfio/vfio\x1a\x03mrw\x1a'\n\v/dev/vfio/0\x12\x13/dev/vfio/noiommu-0\x1a\x03mrw"
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.466823 2011 manager.go:289] "Serving device plugin registration server on socket" path="/var/lib/kubelet/device-plugins/kubelet.sock"
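For reference, the checkpoint entry printed in the pod_devices.go log line above carries roughly the following information. This is a hand-written sketch for illustration only; the real types live under pkg/kubelet/cm/devicemanager and the exact names and shapes may differ between versions:

```go
// Sketch of what the device manager persists per container in its checkpoint,
// matching the "Get checkpoint entry" log line above. Field names are
// illustrative, not the exact upstream definitions.
package checkpointsketch

type PodDevicesEntry struct {
	PodUID        string             // "c7dcb25a-6b71-4f80-9482-053ea597230c"
	ContainerName string             // "dpdk"
	ResourceName  string             // "openshift.io/dpdk_nic_1"
	DeviceIDs     map[int64][]string // NUMA node -> device IDs, e.g. {-1: ["0000:00:05.0"]}
	AllocResp     []byte             // serialized allocate response: env vars, /dev/vfio/* mounts, permissions
}
```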
As expected, the device plugin capacity is set to zero until the device plugin re-registers itself:
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.607362 2011 setters.go:344] "Updated capacity for device plugin" plugin="openshift.io/dpdk_nic_1" capacity=0
A bit later, the pod recovery process from the API server begins:
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193478 2011 config.go:278] "Setting pods for source" source="api"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193526 2011 config.go:383] "Receiving a new pod" pod="openshift-cluster-node-tuning-operator/tuned-tql4x"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193541 2011 config.go:383] "Receiving a new pod" pod="openshift-sdn/sdn-fnls4"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193549 2011 config.go:383] "Receiving a new pod" pod="openshift-monitoring/node-exporter-jzprg"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193557 2011 config.go:383] "Receiving a new pod" pod="openshift-ingress-canary/ingress-canary-62kq2"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193565 2011 config.go:383] "Receiving a new pod" pod="openshift-multus/multus-87sbt"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193573 2011 config.go:383] "Receiving a new pod" pod="openshift-multus/network-metrics-daemon-5qb5q"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193580 2011 config.go:383] "Receiving a new pod" pod="openshift-image-registry/node-ca-frkhr"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193597 2011 config.go:383] "Receiving a new pod" pod="openshift-monitoring/prometheus-adapter-689c895f8f-zpsx2"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193605 2011 config.go:383] "Receiving a new pod" pod="openshift-operator-lifecycle-manager/collect-profiles-27497370-6vd5b"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193615 2011 config.go:383] "Receiving a new pod" pod="openshift-operator-lifecycle-manager/collect-profiles-27497385-mwtfp"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193624 2011 config.go:383] "Receiving a new pod" pod="openshift-machine-config-operator/machine-config-daemon-nkqsq"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193632 2011 config.go:383] "Receiving a new pod" pod="openshift-image-registry/image-pruner-27496800-n4qzq"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193640 2011 config.go:383] "Receiving a new pod" pod="openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-qklm6"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193649 2011 config.go:383] "Receiving a new pod" pod="openshift-marketplace/246a7cd23747989b8f475c6ffc04f7e523236656eaac2249a6d4ddf03bvpl6x"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193659 2011 config.go:383] "Receiving a new pod" pod="openshift-multus/multus-additional-cni-plugins-xwgkj"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193667 2011 config.go:383] "Receiving a new pod" pod="openshift-openstack-infra/coredns-TESTNODE"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193674 2011 config.go:383] "Receiving a new pod" pod="openshift-dns/node-resolver-6n2vp"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193682 2011 config.go:383] "Receiving a new pod" pod="openshift-openstack-infra/keepalived-TESTNODE"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193689 2011 config.go:383] "Receiving a new pod" pod="openshift-dns/dns-default-cnlrl"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193695 2011 config.go:383] "Receiving a new pod" pod="default/testpmd"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193703 2011 config.go:383] "Receiving a new pod" pod="openshift-operator-lifecycle-manager/collect-profiles-27497355-cdncm"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193711 2011 config.go:383] "Receiving a new pod" pod="openshift-sriov-network-operator/sriov-network-config-daemon-p94vm"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193721 2011 config.go:383] "Receiving a new pod" pod="openshift-sriov-network-operator/sriov-device-plugin-bw7rq"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193728 2011 config.go:383] "Receiving a new pod" pod="openshift-network-diagnostics/network-check-target-2tbjs"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.195257 2011 kubelet.go:2106] "SyncLoop ADD" source="api" pods=[openshift-cluster-node-tuning-operator/tuned-tql4x openshift-sdn/sdn-fnls4 openshift-monitoring/node-exporter-jzprg openshift-ingress-canary/ingress-canary-62kq2 openshift-multus/multus-87sbt openshift-multus/network-metrics-daemon-5qb5q openshift-image-registry/node-ca-frkhr openshift-monitoring/prometheus-adapter-689c895f8f-zpsx2 openshift-operator-lifecycle-manager/collect-profiles-27497370-6vd5b openshift-operator-lifecycle-manager/collect-profiles-27497385-mwtfp openshift-machine-config-operator/machine-config-daemon-nkqsq openshift-image-registry/image-pruner-27496800-n4qzq openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-qklm6 openshift-marketplace/246a7cd23747989b8f475c6ffc04f7e523236656eaac2249a6d4ddf03bvpl6x openshift-multus/multus-additional-cni-plugins-xwgkj openshift-openstack-infra/coredns-TESTNODE openshift-dns/node-resolver-6n2vp openshift-openstack-infra/keepalived-TESTNODE openshift-dns/dns-default-cnlrl default/testpmd openshift-operator-lifecycle-manager/collect-profiles-27497355-cdncm openshift-sriov-network-operator/sriov-network-config-daemon-p94vm openshift-sriov-network-operator/sriov-device-plugin-bw7rq openshift-network-diagnostics/network-check-target-2tbjs]
We can already see that the workload pod (testpmd) will be recovered BEFORE the device plugin pod (sriov-device-plugin-bw7rq). When the workload pod enters the admission flow, as all recovered pods do, the device manager has already restored its state and has data about this pod, so we see:
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204837 2011 topology_manager.go:200] "Topology Admit Handler"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204848 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204859 2011 topology_hints.go:49] "Resource does not have a topology preference" resource="openshift.io/dpdk_nic_1"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204873 2011 scope_container.go:77] "TopologyHints" hints=map[openshift.io/dpdk_nic_1:[]] pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204880 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204928 2011 policy_static.go:437] "TopologyHints generated" pod="default/testpmd" containerName="dpdk" cpuHints=[{NUMANodeAffinity:01 Preferred:true}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204941 2011 scope_container.go:77] "TopologyHints" hints=map[cpu:[{NUMANodeAffinity:01 Preferred:true}]] pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204949 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204957 2011 scope_container.go:77] "TopologyHints" hints=map[] pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204964 2011 policy.go:78] "Hint Provider has no preference for NUMA affinity with resource" resource="openshift.io/dpdk_nic_1"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204971 2011 policy.go:70] "Hint Provider has no preference for NUMA affinity with any resource"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204979 2011 scope_container.go:85] "ContainerTopologyHint" bestHint={NUMANodeAffinity:01 Preferred:true}
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204987 2011 scope_container.go:54] "Best TopologyHint" bestHint={NUMANodeAffinity:01 Preferred:true} pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204995 2011 scope_container.go:59] "Topology Affinity" bestHint={NUMANodeAffinity:01 Preferred:true} pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205009 2011 manager.go:914] "Looking for needed resources" needed=1048576000 resourceName="memory"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205016 2011 manager.go:914] "Looking for needed resources" needed=1 resourceName="openshift.io/dpdk_nic_1"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205024 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205037 2011 manager.go:694] "Found pre-allocated devices for resource on pod" resourceName="openshift.io/dpdk_nic_1" containerName="dpdk" podUID="c7dcb25a-6b71-4f80-9482-053ea597230c" devices=[0000:00:05.0]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205044 2011 manager.go:914] "Looking for needed resources" needed=2 resourceName="cpu"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205050 2011 manager.go:914] "Looking for needed resources" needed=1073741824 resourceName="hugepages-1Gi"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205060 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
When the topology manager requests device allocation, we end up in the device manager's devicesToAllocate function. This function checks whether there is already a pre-allocated device for the container (and there is, thanks to the recovered checkpoint data), and if so exits early, skipping the health check, which would have failed because the device plugin has not re-registered itself yet.
The end result is that containers can be admitted and start running without the requested device actually being available, causing runtime errors in the workload.
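To make the control flow concrete, here is a self-contained toy model of the problematic path. All names here (deviceManagerSketch, the flattened map key, the simplified signature) are invented for illustration and are not the actual kubelet code:

```go
package main

import (
	"errors"
	"fmt"
)

// A deliberately minimal model of the device manager state right after a node
// restart: the checkpoint has been restored, but no device plugin has
// re-registered yet, so there are no healthy devices.
type deviceManagerSketch struct {
	checkpointed   map[string][]string // "podUID/container/resource" -> device IDs from checkpoint
	healthyDevices map[string][]string // resource -> currently registered healthy device IDs
}

// devicesToAllocate mirrors the problematic control flow: if the checkpoint
// already has an allocation for this container, return it immediately and
// never check whether the resource is registered or its devices are healthy.
func (m *deviceManagerSketch) devicesToAllocate(podUID, container, resource string, required int) ([]string, error) {
	key := podUID + "/" + container + "/" + resource
	if devs := m.checkpointed[key]; len(devs) >= required {
		// Early exit on the recovery path: the health check below is skipped.
		return devs, nil
	}
	healthy, ok := m.healthyDevices[resource]
	if !ok || len(healthy) < required {
		return nil, errors.New("cannot allocate: resource not registered or not enough healthy devices")
	}
	return healthy[:required], nil
}

func main() {
	m := &deviceManagerSketch{
		checkpointed:   map[string][]string{"c7dcb25a/dpdk/openshift.io/dpdk_nic_1": {"0000:00:05.0"}},
		healthyDevices: map[string][]string{}, // plugin not re-registered yet
	}
	devs, err := m.devicesToAllocate("c7dcb25a", "dpdk", "openshift.io/dpdk_nic_1", 1)
	fmt.Println(devs, err) // [0000:00:05.0] <nil> -- admission succeeds with no healthy devices
}
```

Running this prints the checkpointed device and a nil error, i.e. admission succeeds even though no plugin has registered any healthy device yet, which is exactly what happens to testpmd above.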
What did you expect to happen?
The kubelet should reject admission of the pod until the device plugins have re-registered themselves; then the pods requiring devices should be processed again. This seems to be the only meaningful way to recover pods while taking into account the actual health of the devices.
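Sticking with the toy model above, one possible shape of such a guard (only a sketch of the idea, not the fix that eventually landed upstream) is to reuse a checkpointed allocation only if the resource is registered and the checkpointed devices are currently healthy, and to fail admission otherwise so the pod can be retried after the plugin re-registers:

```go
// devicesToAllocateWithGuard extends the sketch above: a checkpointed
// allocation is only reused once the resource is registered and every
// checkpointed device is healthy again; otherwise admission fails and can be
// retried after the device plugin re-registers.
func (m *deviceManagerSketch) devicesToAllocateWithGuard(podUID, container, resource string, required int) ([]string, error) {
	key := podUID + "/" + container + "/" + resource
	if devs := m.checkpointed[key]; len(devs) >= required {
		healthy, ok := m.healthyDevices[resource]
		if !ok || !containsAll(healthy, devs) {
			return nil, errors.New("previously allocated devices are not (yet) healthy; rejecting admission")
		}
		return devs, nil
	}
	healthy, ok := m.healthyDevices[resource]
	if !ok || len(healthy) < required {
		return nil, errors.New("cannot allocate: resource not registered or not enough healthy devices")
	}
	return healthy[:required], nil
}

// containsAll reports whether every element of want is present in have.
func containsAll(have, want []string) bool {
	set := make(map[string]bool, len(have))
	for _, d := range have {
		set[d] = true
	}
	for _, d := range want {
		if !set[d] {
			return false
		}
	}
	return true
}
```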
How can we reproduce it (as minimally and precisely as possible)?
The behaviour here depends on the order in which pods are recovered, so it's essentially a race:
- Run a test pod that requires devices. Make sure this test pod clearly fails if the requested devices are not available.
- Restart the node (it can take a few attempts).
- Check the health of the test pod.
Anything else we need to know?
No response
Kubernetes version
kubelet 1.23.3
likely happening on current master
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 26 (20 by maintainers)
Commits related to this issue
- e2e: node: testcase to exercise device manager recovery When kubelet recovers pods, there's no guarantee about order of recovering. This means a pod requesting devices can be recovered, and can start... — committed to ffromani/origin by ffromani a year ago
- node: devicemgr: topomgr: add logs One of the contributing factors of issues #118559 and #109595 hard to debug and fix is that the devicemanager has very few logs in important flow, so it's unnecessa... — committed to ffromani/kubernetes by ffromani a year ago
/remove-lifecycle stale
Yes, something is not fine here and this deserves discussion; please file a separate issue, backlinking this one, for visibility. Thanks!
I agree that this kind of behavior would be useful, but it is worth mentioning that currently we rely on the fact that we don't use naked pods and typically a pod is part of a deployment. New instances would be spun up and at some stage a pod would succeed at admission, and that's when the deployment would succeed.
EDIT: The selling point of the "I can't admit this now but probably can later" semantics is that we don't end up with runaway pod creation due to repeated attempts to get a successful deployment.
Ok, that's useful. I'll add that to the tracking list for improving admission - the problem of "I can't admit this now but probably can later" comes up in other places too (AppArmor, waiting for pods to be gracefully deleted). Thanks
Thanks for pointing to this PR, it is definitely related. I managed to reproduce this issue on a kind-based Kubernetes cluster (v1.24.7), so the root cause is certainly in Kubernetes. I will provide additional updates as I have them!