k8s-device-plugin: OpenShift 3.9/Docker-CE, Could not register device plugin: context deadline exceeded
I am following the blog post “How to use GPUs with Device Plugin in OpenShift 3.9 (Now Tech Preview!)” on blog.openshift.com.
In my case, the nvidia-device-plugin pod logs errors like the ones below:
# oc logs -f nvidia-device-plugin-daemonset-nj9p8
2018/06/06 12:40:11 Loading NVML
2018/06/06 12:40:11 Fetching devices.
2018/06/06 12:40:11 Starting FS watcher.
2018/06/06 12:40:11 Starting OS watcher.
2018/06/06 12:40:11 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/06 12:40:16 Could not register device plugin: context deadline exceeded
2018/06/06 12:40:16 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/06/06 12:40:16 You can check the prerequisites at: https://github.com/NVIDIA/k...
2018/06/06 12:40:16 You can learn how to set the runtime at: https://github.com/NVIDIA/k...
2018/06/06 12:40:16 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
...
- The description of one of the device-plugin daemonset pods is:
# oc describe pod nvidia-device-plugin-daemonset-2
Name: nvidia-device-plugin-daemonset-2jqgk
Namespace: nvidia
Node: node02/192.168.5.102
Start Time: Wed, 06 Jun 2018 22:59:32 +0900
Labels: controller-revision-hash=4102904998
name=nvidia-device-plugin-ds
pod-template-generation=1
Annotations: openshift.io/scc=nvidia-deviceplugin
Status: Running
IP: 192.168.5.102
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Containers:
nvidia-device-plugin-ctr:
Container ID: docker://b92280bd124df9fd46fe08ab4bbda76e2458cf5572f5ffc651661580bcd9126d
Image: nvidia/k8s-device-plugin:1.9
Image ID: docker-pullable://nvidia/k8s-device-plugin@sha256:7ba244bce75da00edd907209fe4cf7ea8edd0def5d4de71939899534134aea31
Port: <none>
State: Running
Started: Wed, 06 Jun 2018 22:59:34 +0900
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from nvidia-deviceplugin-token-cv7p5 (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
nvidia-deviceplugin-token-cv7p5:
Type: Secret (a volume populated by a Secret)
SecretName: nvidia-deviceplugin-token-cv7p5
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 1h kubelet, node02 MountVolume.SetUp succeeded for volume "device-plugin"
Normal SuccessfulMountVolume 1h kubelet, node02 MountVolume.SetUp succeeded for volume "nvidia-deviceplugin-token-cv7p5"
Normal Pulled 1h kubelet, node02 Container image "nvidia/k8s-device-plugin:1.9" already present on machine
Normal Created 1h kubelet, node02 Created container
Normal Started 1h kubelet, node02 Started container
- Running “docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9” manually shows the same log messages as above.
- On each origin node, a docker run test shows the following (this is normal, right?):
# docker run --rm nvidia/cuda nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
Tesla-P40
# docker run -it --rm docker.io/mirrorgoogleconta...
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
[Test Env.]
- 1 master with OpenShift v3.9 (Origin)
- 2 GPU nodes with Tesla P40 x 2
- Docker-CE, nvidia-docker2 on GPU nodes
[Master]
# oc version
oc v3.9.0+46ff3a0-18
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://MYDOMAIN.local:8443
openshift v3.9.0+46ff3a0-18
kubernetes v1.9.1+a0ce1bc657
# uname -r
3.10.0-862.3.2.el7.x86_64
# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[GPU nodes]
# docker version
Client:
Version: 18.03.1-ce
API version: 1.37
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:20:16 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:23:58 2018
OS/Arch: linux/amd64
Experimental: false
# uname -r
3.10.0-862.3.2.el7.x86_64
# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4b1a37d31cb9 openshift/node:v3.9.0 "/usr/local/bin/orig…" 22 minutes ago Up 21 minutes origin-node
efbedeeb88f0 fe3e6b0d95b5 "nvidia-device-plugin" About an hour ago Up About an hour k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
36aa988447b8 openshift/origin-pod:v3.9.0 "/usr/bin/pod" About an hour ago Up About an hour k8s_POD_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
6e6b598fa144 openshift/openvswitch:v3.9.0 "/usr/local/bin/ovs-…" 2 hours ago Up 2 hours openvswitch
# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
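As a sanity check that Docker actually picked up this default runtime, docker info can be inspected on the GPU nodes (the exact wording may vary slightly by Docker version, but the output should include lines like these):
# docker info | grep -i runtime
Runtimes: nvidia runc
Default Runtime: nvidia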
Please help me with this problem. TIA!
Many thanks, @RenaudWasTaken - with your help, it works now!
Here is what I did; I hope it is helpful for further documentation of the project.
In my case, using OpenShift v3.9 and Docker CE with NVIDIA P40 GPUs (unlike a Kubernetes-only environment, where you would edit the Kubernetes manifest files on the worker node…), I had to edit the file ‘/etc/systemd/system/origin-node.service’ and add the device-plugins mount, as sketched below.
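For reference, a minimal sketch of that edit, assuming the unit starts the node with a docker run command as in my containerized setup; everything except the added -v line is a placeholder, and only the extra device-plugins mount matters:
# excerpt from /etc/systemd/system/origin-node.service (layout illustrative)
ExecStart=/usr/bin/docker run --name origin-node \
    ... existing options and volume mounts ... \
    -v /var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/ \
    openshift/node:v3.9.0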
Next, restart the origin-node service (on each GPU worker node).
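Standard systemd commands are enough here; daemon-reload is needed because the unit file was changed:
# systemctl daemon-reload
# systemctl restart origin-node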
Then delete the existing k8s-device-plugin daemonset that is in the abnormal state (on the master node).
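In this setup the daemonset runs in the nvidia namespace (see the pod description above), so:
# oc delete daemonset nvidia-device-plugin-daemonset -n nvidia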
Check the YAML for the k8s-device-plugin and re-create the daemonset (on the master node).
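Assuming the daemonset spec is saved locally as nvidia-device-plugin.yml (the filename is only an example):
# oc create -f nvidia-device-plugin.yml -n nvidia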
Now, check the node’s GPU capacity.
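The plugin advertises GPUs as the nvidia.com/gpu resource, so the node capacity and allocatable should now list them; on one of my GPU nodes (node02) the output looks roughly like:
# oc describe node node02 | grep -i 'nvidia.com/gpu'
 nvidia.com/gpu:  2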
Finally, running the test (cuda-vector-add) pod…
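A sketch of the test pod; the image name below is my assumption (it matches the truncated docker.io/mirrorgoogleconta… reference above), and the pod simply requests one GPU:
# cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1  # assumed image name
    resources:
      limits:
        nvidia.com/gpu: 1
EOF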
Thank you for all your interest and help.
I think I figured it out. Your kubelet is running in a container!
You need to mount the host directory there too. Add
-v /var/lib/kubelet/device-plugins/:/var/lib/kubelet/device-plugins/
to that docker command (of your origin-node systemd unit file)