kubernetes: device manager can incorrectly recover devices on node restart
What happened?
The following behaviour was reproduced on OpenShift 4.10 (kubeletVersion="v1.23.3+54654d2"), but I believe the root cause is in Kubernetes.
Consider a pod requesting several resources, among them a device (case in point: a DPDK network device):
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: dpdk
  name: testpmd
  namespace: default
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep INF
    image: $DPDK_IMAGE
    imagePullPolicy: Always
    name: dpdk
    resources:
      limits:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
        openshift.io/dpdk_nic_1: 1
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
        openshift.io/dpdk_nic_1: 1
    securityContext:
      capabilities:
        add:
        - IPC_LOCK
        - SYS_RESOURCE
        - NET_RAW
        - NET_ADMIN
      runAsUser: 0
The pod starts and runs just fine until the node is restarted. On restart, the device manager correctly recovers its state from the checkpoint data:
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.461850 2011 manager.go:247] "Starting Device Plugin manager"
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.463809 2011 container_manager_linux.go:930] "Attempting to apply oom_score_adj to process" oomScoreAdj=-999 pid=2011
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.464997 2011 oom_linux.go:66] attempting to set "/proc/2011/oom_score_adj" to "-999"
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.464269 2011 iptables.go:462] running iptables: ip6tables [-w 5 -W 100000 -N KUBE-POSTROUTING -t nat]
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.465254 2011 event.go:294] "Event occurred" object="TESTNODE" kind="Node" apiVersion="" type="Normal" reason="NodeAllocatableEnforced" message="Updated Node Allocatable limit across pods"
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.466379 2011 pod_devices.go:215] "Get checkpoint entry" podUID="c7dcb25a-6b71-4f80-9482-053ea597230c" containerName="dpdk" resourceName="openshift.io/dpdk_nic_1" deviceIDs=map[-1:[0000:00:05.0]] allocated="\n1\n!PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_1\x12\f0000:00:05.0\x1a%\n\x0e/dev/vfio/vfio\x12\x0e/dev/vfio/vfio\x1a\x03mrw\x1a'\n\v/dev/vfio/0\x12\x13/dev/vfio/noiommu-0\x1a\x03mrw"
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.466823 2011 manager.go:289] "Serving device plugin registration server on socket" path="/var/lib/kubelet/device-plugins/kubelet.sock"
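For reference, the checkpoint entry printed in the pod_devices.go log line above carries roughly the following information. This is a hand-written sketch for illustration only; the real types live under pkg/kubelet/cm/devicemanager and the exact names and shapes may differ between versions:

```go
// Sketch of what the device manager persists per container in its checkpoint,
// matching the "Get checkpoint entry" log line above. Field names are
// illustrative, not the exact upstream definitions.
package checkpointsketch

type PodDevicesEntry struct {
	PodUID        string             // "c7dcb25a-6b71-4f80-9482-053ea597230c"
	ContainerName string             // "dpdk"
	ResourceName  string             // "openshift.io/dpdk_nic_1"
	DeviceIDs     map[int64][]string // NUMA node -> device IDs, e.g. {-1: ["0000:00:05.0"]}
	AllocResp     []byte             // serialized allocate response: env vars, /dev/vfio/* mounts, permissions
}
```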
As expected, the device plugin capacity is set to zero until the device plugin re-registers itself:
Apr 13 09:53:14 TESTNODE hyperkube[2011]: I0413 09:53:14.607362 2011 setters.go:344] "Updated capacity for device plugin" plugin="openshift.io/dpdk_nic_1" capacity=0
A bit later, the pod recovery process from the API server begins:
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193478 2011 config.go:278] "Setting pods for source" source="api"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193526 2011 config.go:383] "Receiving a new pod" pod="openshift-cluster-node-tuning-operator/tuned-tql4x"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193541 2011 config.go:383] "Receiving a new pod" pod="openshift-sdn/sdn-fnls4"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193549 2011 config.go:383] "Receiving a new pod" pod="openshift-monitoring/node-exporter-jzprg"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193557 2011 config.go:383] "Receiving a new pod" pod="openshift-ingress-canary/ingress-canary-62kq2"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193565 2011 config.go:383] "Receiving a new pod" pod="openshift-multus/multus-87sbt"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193573 2011 config.go:383] "Receiving a new pod" pod="openshift-multus/network-metrics-daemon-5qb5q"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193580 2011 config.go:383] "Receiving a new pod" pod="openshift-image-registry/node-ca-frkhr"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193597 2011 config.go:383] "Receiving a new pod" pod="openshift-monitoring/prometheus-adapter-689c895f8f-zpsx2"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193605 2011 config.go:383] "Receiving a new pod" pod="openshift-operator-lifecycle-manager/collect-profiles-27497370-6vd5b"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193615 2011 config.go:383] "Receiving a new pod" pod="openshift-operator-lifecycle-manager/collect-profiles-27497385-mwtfp"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193624 2011 config.go:383] "Receiving a new pod" pod="openshift-machine-config-operator/machine-config-daemon-nkqsq"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193632 2011 config.go:383] "Receiving a new pod" pod="openshift-image-registry/image-pruner-27496800-n4qzq"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193640 2011 config.go:383] "Receiving a new pod" pod="openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-qklm6"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193649 2011 config.go:383] "Receiving a new pod" pod="openshift-marketplace/246a7cd23747989b8f475c6ffc04f7e523236656eaac2249a6d4ddf03bvpl6x"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193659 2011 config.go:383] "Receiving a new pod" pod="openshift-multus/multus-additional-cni-plugins-xwgkj"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193667 2011 config.go:383] "Receiving a new pod" pod="openshift-openstack-infra/coredns-TESTNODE"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193674 2011 config.go:383] "Receiving a new pod" pod="openshift-dns/node-resolver-6n2vp"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193682 2011 config.go:383] "Receiving a new pod" pod="openshift-openstack-infra/keepalived-TESTNODE"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193689 2011 config.go:383] "Receiving a new pod" pod="openshift-dns/dns-default-cnlrl"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193695 2011 config.go:383] "Receiving a new pod" pod="default/testpmd"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193703 2011 config.go:383] "Receiving a new pod" pod="openshift-operator-lifecycle-manager/collect-profiles-27497355-cdncm"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193711 2011 config.go:383] "Receiving a new pod" pod="openshift-sriov-network-operator/sriov-network-config-daemon-p94vm"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193721 2011 config.go:383] "Receiving a new pod" pod="openshift-sriov-network-operator/sriov-device-plugin-bw7rq"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.193728 2011 config.go:383] "Receiving a new pod" pod="openshift-network-diagnostics/network-check-target-2tbjs"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.195257 2011 kubelet.go:2106] "SyncLoop ADD" source="api" pods=[openshift-cluster-node-tuning-operator/tuned-tql4x openshift-sdn/sdn-fnls4 openshift-monitoring/node-exporter-jzprg openshift-ingress-canary/ingress-canary-62kq2 openshift-multus/multus-87sbt openshift-multus/network-metrics-daemon-5qb5q openshift-image-registry/node-ca-frkhr openshift-monitoring/prometheus-adapter-689c895f8f-zpsx2 openshift-operator-lifecycle-manager/collect-profiles-27497370-6vd5b openshift-operator-lifecycle-manager/collect-profiles-27497385-mwtfp openshift-machine-config-operator/machine-config-daemon-nkqsq openshift-image-registry/image-pruner-27496800-n4qzq openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-qklm6 openshift-marketplace/246a7cd23747989b8f475c6ffc04f7e523236656eaac2249a6d4ddf03bvpl6x openshift-multus/multus-additional-cni-plugins-xwgkj openshift-openstack-infra/coredns-TESTNODE openshift-dns/node-resolver-6n2vp openshift-openstack-infra/keepalived-TESTNODE openshift-dns/dns-default-cnlrl default/testpmd openshift-operator-lifecycle-manager/collect-profiles-27497355-cdncm openshift-sriov-network-operator/sriov-network-config-daemon-p94vm openshift-sriov-network-operator/sriov-device-plugin-bw7rq openshift-network-diagnostics/network-check-target-2tbjs]
We can already see that the workload pod (testpmd) will be recovered BEFORE the device plugin pod (sriov-device-plugin-bw7rq). When the workload pod enters the admission flow, as all recovered pods do, the device manager has already restored its state and has data about this pod, so we see:
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204837 2011 topology_manager.go:200] "Topology Admit Handler"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204848 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204859 2011 topology_hints.go:49] "Resource does not have a topology preference" resource="openshift.io/dpdk_nic_1"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204873 2011 scope_container.go:77] "TopologyHints" hints=map[openshift.io/dpdk_nic_1:[]] pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204880 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204928 2011 policy_static.go:437] "TopologyHints generated" pod="default/testpmd" containerName="dpdk" cpuHints=[{NUMANodeAffinity:01 Preferred:true}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204941 2011 scope_container.go:77] "TopologyHints" hints=map[cpu:[{NUMANodeAffinity:01 Preferred:true}]] pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204949 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204957 2011 scope_container.go:77] "TopologyHints" hints=map[] pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204964 2011 policy.go:78] "Hint Provider has no preference for NUMA affinity with resource" resource="openshift.io/dpdk_nic_1"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204971 2011 policy.go:70] "Hint Provider has no preference for NUMA affinity with any resource"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204979 2011 scope_container.go:85] "ContainerTopologyHint" bestHint={NUMANodeAffinity:01 Preferred:true}
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204987 2011 scope_container.go:54] "Best TopologyHint" bestHint={NUMANodeAffinity:01 Preferred:true} pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.204995 2011 scope_container.go:59] "Topology Affinity" bestHint={NUMANodeAffinity:01 Preferred:true} pod="default/testpmd" containerName="dpdk"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205009 2011 manager.go:914] "Looking for needed resources" needed=1048576000 resourceName="memory"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205016 2011 manager.go:914] "Looking for needed resources" needed=1 resourceName="openshift.io/dpdk_nic_1"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205024 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205037 2011 manager.go:694] "Found pre-allocated devices for resource on pod" resourceName="openshift.io/dpdk_nic_1" containerName="dpdk" podUID="c7dcb25a-6b71-4f80-9482-053ea597230c" devices=[0000:00:05.0]
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205044 2011 manager.go:914] "Looking for needed resources" needed=2 resourceName="cpu"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205050 2011 manager.go:914] "Looking for needed resources" needed=1073741824 resourceName="hugepages-1Gi"
Apr 13 09:53:17 TESTNODE hyperkube[2011]: I0413 09:53:17.205060 2011 config.go:97] "Looking for sources, have seen" sources=[api file] seenSources=map[file:{}]
When the topology manager requests device allocation, we end up in the device manager's devicesToAllocate function. This function checks whether there is already a pre-allocated device for the container (and there is, thanks to the recovered checkpoint data), and if so exits early, skipping the health check, which would have failed because the device plugin has not re-registered itself yet.
The end result is that containers can be admitted and start running without the requested device actually being available, causing runtime errors in the workload.
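To make the control flow concrete, here is a self-contained toy model of the problematic path. All names here (deviceManagerSketch, the flattened map key, the simplified signature) are invented for illustration and are not the actual kubelet code:

```go
package main

import (
	"errors"
	"fmt"
)

// A deliberately minimal model of the device manager state right after a node
// restart: the checkpoint has been restored, but no device plugin has
// re-registered yet, so there are no healthy devices.
type deviceManagerSketch struct {
	checkpointed   map[string][]string // "podUID/container/resource" -> device IDs from checkpoint
	healthyDevices map[string][]string // resource -> currently registered healthy device IDs
}

// devicesToAllocate mirrors the problematic control flow: if the checkpoint
// already has an allocation for this container, return it immediately and
// never check whether the resource is registered or its devices are healthy.
func (m *deviceManagerSketch) devicesToAllocate(podUID, container, resource string, required int) ([]string, error) {
	key := podUID + "/" + container + "/" + resource
	if devs := m.checkpointed[key]; len(devs) >= required {
		// Early exit on the recovery path: the health check below is skipped.
		return devs, nil
	}
	healthy, ok := m.healthyDevices[resource]
	if !ok || len(healthy) < required {
		return nil, errors.New("cannot allocate: resource not registered or not enough healthy devices")
	}
	return healthy[:required], nil
}

func main() {
	m := &deviceManagerSketch{
		checkpointed:   map[string][]string{"c7dcb25a/dpdk/openshift.io/dpdk_nic_1": {"0000:00:05.0"}},
		healthyDevices: map[string][]string{}, // plugin not re-registered yet
	}
	devs, err := m.devicesToAllocate("c7dcb25a", "dpdk", "openshift.io/dpdk_nic_1", 1)
	fmt.Println(devs, err) // [0000:00:05.0] <nil> -- admission succeeds with no healthy devices
}
```

Running this prints the checkpointed device and a nil error, i.e. admission succeeds even though no plugin has registered any healthy device yet, which is exactly what happens to testpmd above.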
What did you expect to happen?
The kubelet should reject admission of the pod until the device plugins have re-registered themselves; then the pods requiring devices should be processed again. This seems to be the only meaningful way to recover pods while taking into account the actual health of the devices.
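Sticking with the toy model above, one possible shape of such a guard (only a sketch of the idea, not the fix that eventually landed upstream) is to reuse a checkpointed allocation only if the resource is registered and the checkpointed devices are currently healthy, and to fail admission otherwise so the pod can be retried after the plugin re-registers:

```go
// devicesToAllocateWithGuard extends the sketch above: a checkpointed
// allocation is only reused once the resource is registered and every
// checkpointed device is healthy again; otherwise admission fails and can be
// retried after the device plugin re-registers.
func (m *deviceManagerSketch) devicesToAllocateWithGuard(podUID, container, resource string, required int) ([]string, error) {
	key := podUID + "/" + container + "/" + resource
	if devs := m.checkpointed[key]; len(devs) >= required {
		healthy, ok := m.healthyDevices[resource]
		if !ok || !containsAll(healthy, devs) {
			return nil, errors.New("previously allocated devices are not (yet) healthy; rejecting admission")
		}
		return devs, nil
	}
	healthy, ok := m.healthyDevices[resource]
	if !ok || len(healthy) < required {
		return nil, errors.New("cannot allocate: resource not registered or not enough healthy devices")
	}
	return healthy[:required], nil
}

// containsAll reports whether every element of want is present in have.
func containsAll(have, want []string) bool {
	set := make(map[string]bool, len(have))
	for _, d := range have {
		set[d] = true
	}
	for _, d := range want {
		if !set[d] {
			return false
		}
	}
	return true
}
```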
How can we reproduce it (as minimally and precisely as possible)?
The behaviour here depends on the order in which pods are recovered, so it's essentially a race:
- Run a test pod that requires devices. Make sure this test pod clearly fails if the requested devices are not available.
- Restart the node (it can take a few attempts).
- Check the health of the test pod.
Anything else we need to know?
No response
Kubernetes version
kubelet 1.23.3
likely happening on current master
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 26 (20 by maintainers)
Commits related to this issue
- e2e: node: testcase to exercise device manager recovery When kubelet recovers pods, there's no guarantee about order of recovering. This means a pod requesting devices can be recovered, and can start... — committed to ffromani/origin by ffromani a year ago
- node: devicemgr: topomgr: add logs One of the contributing factors of issues #118559 and #109595 hard to debug and fix is that the devicemanager has very few logs in important flow, so it's unnecessa... — committed to ffromani/kubernetes by ffromani a year ago
/remove-lifecycle stale
Yes, something is not fine here and this deserves discussion; please file a separate issue, backlinking this one, for visibility. Thanks!
I agree that this kind of behavior would be useful, but it is worth mentioning that currently we rely on the fact that we don't use naked pods and typically a pod is part of a deployment. New instances would be spun up and at some stage a pod would succeed at admission, and that's when the deployment would succeed.
EDIT: The selling point of the "I can't admit this now but probably can later" semantics is that we don't end up with runaway pod creation due to repeated attempts to get a successful deployment.
Ok, that's useful. I'll add that to the tracking list for improving admission - the problem of "I can't admit this now but probably can later" comes up in other places too (AppArmor, waiting for pods to be gracefully deleted). Thanks
Thanks for pointing to this PR, it is definitely related. I managed to reproduce this issue on a kind-based Kubernetes cluster (v1.24.7), so the root cause is certainly in Kubernetes. I will provide additional updates as I have them!