k8s-device-plugin: OpenShift 3.9/Docker-CE, Could not register device plugin: context deadline exceeded

Following blog posting “How to use GPUs with Device Plugin in OpenShift 3.9 (Now Tech Preview!)” in blog.openshift.com

In my case, nvidia-device-plugin shows errors like below:

# oc logs -f nvidia-device-plugin-daemonset-nj9p8
2018/06/06 12:40:11 Loading NVML
2018/06/06 12:40:11 Fetching devices.
2018/06/06 12:40:11 Starting FS watcher.
2018/06/06 12:40:11 Starting OS watcher.
2018/06/06 12:40:11 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/06 12:40:16 Could not register device plugin: context deadline exceeded
2018/06/06 12:40:16 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/06/06 12:40:16 You can check the prerequisites at: https://github.com/NVIDIA/k...
2018/06/06 12:40:16 You can learn how to set the runtime at: https://github.com/NVIDIA/k...
2018/06/06 12:40:16 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
...

One of the device-plugin-daemonset pod description is

# oc describe pod nvidia-device-plugin-daemonset-2
Name:           nvidia-device-plugin-daemonset-2jqgk
Namespace:      nvidia
Node:           node02/192.168.5.102
Start Time:     Wed, 06 Jun 2018 22:59:32 +0900
Labels:         controller-revision-hash=4102904998
                name=nvidia-device-plugin-ds
                pod-template-generation=1
Annotations:    openshift.io/scc=nvidia-deviceplugin
Status:         Running
IP:             192.168.5.102
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   docker://b92280bd124df9fd46fe08ab4bbda76e2458cf5572f5ffc651661580bcd9126d
    Image:          nvidia/k8s-device-plugin:1.9
    Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:7ba244bce75da00edd907209fe4cf7ea8edd0def5d4de71939899534134aea31
    Port:           <none>
    State:          Running
      Started:      Wed, 06 Jun 2018 22:59:34 +0900
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-deviceplugin-token-cv7p5 (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  nvidia-deviceplugin-token-cv7p5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-deviceplugin-token-cv7p5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type    Reason                 Age   From             Message
  ----    ------                 ----  ----             -------
  Normal  SuccessfulMountVolume  1h    kubelet, node02  MountVolume.SetUp succeeded for volume "device-plugin"
  Normal  SuccessfulMountVolume  1h    kubelet, node02  MountVolume.SetUp succeeded for volume "nvidia-deviceplugin-token-cv7p5"
  Normal  Pulled                 1h    kubelet, node02  Container image "nvidia/k8s-device-plugin:1.9" already present on machine
  Normal  Created                1h    kubelet, node02  Created container
  Normal  Started                1h    kubelet, node02  Started container

And running “docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9” shows the log messages just like above.
On each origin-nodes, docker run test shows like this(its normal, right?),

# docker run --rm nvidia/cuda nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
Tesla-P40

# docker run -it --rm docker.io/mirrorgoogleconta...
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

[Test Env.]

1 Master with OpenShift v3.9(Origin)
2 GPU nodes with Tesla-P40*2
Docker-CE, nvidia-docker2 on GPU nodes

[Master]

# oc version
oc v3.9.0+46ff3a0-18
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://MYDOMAIN.local:8443
openshift v3.9.0+46ff3a0-18
kubernetes v1.9.1+a0ce1bc657

# uname -r
3.10.0-862.3.2.el7.x86_64

# cat /etc/redhat-release 
CentOS Linux release 7.5.1804 (Core)

[GPU nodes]

# docker version
Client:
Version: 18.03.1-ce
API version: 1.37
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:20:16 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm

Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:23:58 2018
OS/Arch: linux/amd64
Experimental: false

# uname -r
3.10.0-862.3.2.el7.x86_64

# cat /etc/redhat-release 
CentOS Linux release 7.5.1804 (Core)

# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4b1a37d31cb9 openshift/node:v3.9.0 "/usr/local/bin/orig…" 22 minutes ago Up 21 minutes origin-node
efbedeeb88f0 fe3e6b0d95b5 "nvidia-device-plugin" About an hour ago Up About an hour k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
36aa988447b8 openshift/origin-pod:v3.9.0 "/usr/bin/pod" About an hour ago Up About an hour k8s_POD_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
6e6b598fa144 openshift/openvswitch:v3.9.0 "/usr/local/bin/ovs-…" 2 hours ago Up 2 hours openvswitch

# cat /etc/docker/daemon.json 
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Please help me with this problem. TIA!

About this issue

Original URL
State: closed
Created 6 years ago
Comments: 24 (7 by maintainers)

Most upvoted comments

So much thanks, @RenaudWasTaken - with your help, it works now!

Here’s what I’ve done. May these be helpful with further documentation of the project.

In my case, using OpenShift v3.9 and Docker-ce with Nvidia-p40 GPUs, not like kubernetes-only env(editing kubernetes manifest files in the worker node…)., I had to edit the file ‘/etc/systemd/system/origin-node.service’

### edit "ExecStart=/usr/bin/docker run ..." line
### add "-v /var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/" line as below
# REDACTED
ExecStart=/usr/bin/docker run --name origin-node \
...
  \
  -v /dev:/dev $DOCKER_ADDTL_BIND_MOUNTS -v /etc/pki:/etc/pki:ro \
  \
  -v /var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/ \
  \
  openshift/node:${IMAGE_VERSION}
...

Next, restart origin-node service(on each GPU worker node)

# setenforce 0
# systemctl daemon-reload
# systemctl restart origin-node

Delete existing k8s-device-plugin daemonset with abnormal status(on master node).

# oc delete -f nvidia-device-plugin-daemonset.yml
daemonset "nvidia-device-plugin-daemonset" deleted

Check the yaml for running k8s-device-plugin and re-create the daemonset(on master node).

# vi 02-nvidia-device-plugin-daemonset.yml

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: nvidia 
spec:
 template:
   metadata:
     labels:
       name: nvidia-device-plugin-ds
   spec:
     priorityClassName: system-node-critical
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: openshift.com/gpu-accelerator 
               operator: Exists
     securityContext:
        privileged: true
     serviceAccount: nvidia-deviceplugin
     serivceAccountName: nvidia-deviceplugin
     hostNetwork: true
     hostPID: true
     containers:
     - image: nvidia/k8s-device-plugin:1.9
       name: nvidia-device-plugin-ctr
       securityContext:
         allowPrivilegeEscalation: false
         capabilities:
           drop: ["ALL"]
       volumeMounts:
         - name: device-plugin
           mountPath: /var/lib/kubelet/device-plugins
     volumes:
       - name: device-plugin
         hostPath:
           path: /var/lib/kubelet/device-plugins

# oc create -f 02-nvidia-device-plugin-daemonset.yml
daemonset "nvidia-device-plugin-daemonset" created
# oc logs -f nvidia-device-plugin-daemonset-89nrf 
2018/06/18 09:54:12 Loading NVML
2018/06/18 09:54:12 Fetching devices.
2018/06/18 09:54:12 Starting FS watcher.
2018/06/18 09:54:12 Starting OS watcher.
2018/06/18 09:54:12 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/18 09:54:12 Registered device plugin with Kubelet
^C

Now, check the node’s GPU capacity.

# oc describe node node01 | egrep 'Capacity|Allocatable|gpu'
Labels:             apptier=gpu
                    openshift.com/gpu-accelerator=true
Capacity:
 nvidia.com/gpu:  2
Allocatable:
 nvidia.com/gpu:  2
  Normal  NodeAllocatableEnforced  17m                kubelet, node01  Updated Node Allocatable limit across pods
  Normal  NodeAllocatableEnforced  14m                kubelet, node01  Updated Node Allocatable limit across pods

Running the test(cuda-vector-add) pod …

# vi cuda-vector-add.yaml

apiVersion: v1
kind: Pod
metadata:
 name: cuda-vector-add
 namespace: nvidia
spec:
 restartPolicy: OnFailure
 containers:
   - name: cuda-vector-add
     image: "docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1"
     env:
       - name: NVIDIA_VISIBLE_DEVICES
         value: all
       - name: NVIDIA_DRIVER_CAPABILITIES
         value: "compute,utility"
       - name: NVIDIA_REQUIRE_CUDA
         value: "cuda>=8.0"
     resources:
       limits:
         nvidia.com/gpu: 1 # requesting 1 GPU

# oc create -f cuda-vector-add.yaml
pod "cuda-vector-add" created
# oc get pods
NAME                                   READY     STATUS      RESTARTS   AGE
cuda-vector-add                        0/1       Completed   0          5s
nvidia-device-plugin-daemonset-7rl44   1/1       Running     0          4m
nvidia-device-plugin-daemonset-nghph   1/1       Running     0          4m
# oc logs -f cuda-vector-add 
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Thank you for all your interest and help.

DragOnMe on Jun 19, 2018

I think I figured it out. Your kubelet is running in a container!

You need to mount the host directory there too. Add -v /var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/ to that docker command (of your origin-node systemd unit file)

RenaudWasTaken on Jun 15, 2018