kubernetes: DevicePlugin does not work correctly when the limit is set to 0
/kind bug
/sig node
/area hw-accelerators
/cc @vishh @jiayingz @vikaschoudhary16 @RenaudWasTaken
What happened:
With the limit of `nvidia.com/gpu` set to 0, all GPUs are still visible inside the container (via `nvidia-smi`). Likewise, `ls /dev/nvidia*` inside the container lists all GPU device files.
What you expected to happen:
`ls /dev/nvidia*` inside the container should show nothing when the limit of `nvidia.com/gpu` is set to 0.
How to reproduce it (as minimally and precisely as possible):
YAML file:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fff
spec:
  containers:
  - name: fff
    image: nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
    resources:
      limits:
        nvidia.com/gpu: "0"
      requests:
        nvidia.com/gpu: "0"
    command: ["/bin/sh", "-c"]
    args: ["sleep 10d;"]
    imagePullPolicy: IfNotPresent
  dnsPolicy: ClusterFirst
  restartPolicy: Never
  terminationGracePeriodSeconds: 1
```
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Install tools:
- Others:
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 41 (28 by maintainers)
@RenaudWasTaken thanks for your reply; I look forward to the new release. According to the doc https://github.com/nvidia/nvidia-container-runtime#environment-variables-oci-spec, I added these lines to the YAML file as a workaround when `nvidia.com/gpu == 0`.
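For reference, a minimal sketch of what such an env-var workaround could look like, assuming the node runs nvidia-container-runtime and that an empty `NVIDIA_VISIBLE_DEVICES` hides the GPUs as described in the linked doc (pod and container names here are illustrative, not the commenter's actual file):

```yaml
# Hypothetical pod spec: explicitly clear NVIDIA_VISIBLE_DEVICES so the
# nvidia container runtime does not inject any GPU devices into this container.
apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-pod                     # illustrative name
spec:
  containers:
  - name: app
    image: nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
    env:
    - name: NVIDIA_VISIBLE_DEVICES     # overrides the image default ("all")
      value: ""                        # empty => behave like plain runc, no GPUs
    resources:
      limits:
        nvidia.com/gpu: "0"
    command: ["/bin/sh", "-c", "sleep 10d"]
  restartPolicy: Never
```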
@jiayingz I finally got around to testing the GKE nvidia device plugin and it works nicely. I now have a k8s setup with GPU quotas working as expected (see the sketch after this comment). With the GKE plugin I also don't have the env var issue, since the plugin code does its job well and simply.

I would suggest, though, making the driver installation instructions much simpler for OSes that don't have the non-writable rootfs issue, e.g. Ubuntu/CentOS; the current driver installation docs are more complex than needed.

For reference, I used an Ansible script to install the drivers on my Ubuntu boxes; it can be reused fairly easily on other OSes (the package installation part needs adapting). After that I deployed the daemonset with the parameters the device plugin needed to work correctly. I also tested the NVIDIA index image `nvcr.io/nvidia-hpcvis/index:1.0`, which exercises the visualisation part, and it seems to work on this setup, as well as on my deep learning setup. These are the only two things I needed to get GPUs working on k8s. Thanks @mindprince @jiayingz for your help. @pineking you might want to give this setup a try to resolve your env var issue as well.
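A minimal sketch of a namespace-level GPU quota of the kind mentioned above, using the extended-resource quota form `requests.nvidia.com/gpu` (the name, namespace, and limit are illustrative, not the commenter's actual config):

```yaml
# Hypothetical ResourceQuota: cap the total number of NVIDIA GPUs that
# pods in this namespace may request via the device plugin resource.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota          # illustrative name
  namespace: team-a        # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # at most 4 GPUs requested across the namespace
```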
So I'm wondering whether there is any plan to add hook functionality to the device plugin API, because the current way of using NVIDIA GPUs with the device plugin is not suitable for a multi-tenant environment.

For now I think `PodPreset` might solve my problem: it can inject the env var `NVIDIA_VISIBLE_DEVICES: ""` into every pod except those carrying a certain label, e.g. `gpu: nvidia`. That way I can be sure GPU devices are never exposed to a pod unless it asks for them. I would combine this with an admission controller that checks that, when the `gpu: nvidia` label is present, a resource limit such as `nvidia.com/gpu: 1` is also set, and otherwise strips the label. A sketch of such a PodPreset follows.
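A minimal sketch of that kind of PodPreset, assuming the PodPreset feature (`settings.k8s.io/v1alpha1`) and its admission controller are enabled on the cluster; the name, namespace, and label key are illustrative:

```yaml
# Hypothetical PodPreset: inject an empty NVIDIA_VISIBLE_DEVICES into every pod
# in the namespace that is NOT labelled gpu=nvidia, so GPUs stay hidden by default.
apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: hide-nvidia-gpus   # illustrative name
  namespace: default       # PodPresets are namespaced
spec:
  selector:
    matchExpressions:
    - key: gpu             # pods missing this label, or with another value,
      operator: NotIn      # match the selector and receive the injected env var
      values: ["nvidia"]
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: ""              # empty => the nvidia runtime injects no GPU devices
```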