kubernetes: DevicePlugin does not work correctly when the limit is set to 0
/kind bug
/sig node
/area hw-accelerators
/cc @vishh @jiayingz @vikaschoudhary16 @RenaudWasTaken
What happened:
With the limit of `nvidia.com/gpu` set to 0, all GPUs are still visible inside the container (via `nvidia-smi`). Likewise, `ls /dev/nvidia*` inside the container lists all GPU device files.
What you expected to happen:
`ls /dev/nvidia*` inside the container should show nothing when the limit of `nvidia.com/gpu` is set to 0.
How to reproduce it (as minimally and precisely as possible):
YAML file:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fff
spec:
  containers:
  - name: fff
    image: nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
    resources:
      limits:
        nvidia.com/gpu: "0"
      requests:
        nvidia.com/gpu: "0"
    command: ["/bin/sh", "-c"]
    args: ["sleep 10d;"]
    imagePullPolicy: IfNotPresent
  dnsPolicy: ClusterFirst
  restartPolicy: Never
  terminationGracePeriodSeconds: 1
```
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Install tools:
- Others:
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 41 (28 by maintainers)
@RenaudWasTaken thanks for your reply; I look forward to the new release. According to the doc https://github.com/nvidia/nvidia-container-runtime#environment-variables-oci-spec, I added these lines to the YAML file as a workaround when `nvidia.com/gpu == 0`.
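For reference, a minimal sketch of what such an env-var workaround could look like, assuming the node runs nvidia-container-runtime and that an empty `NVIDIA_VISIBLE_DEVICES` hides the GPUs as described in the linked doc (pod and container names here are illustrative, not the commenter's actual file):

```yaml
# Hypothetical pod spec: explicitly clear NVIDIA_VISIBLE_DEVICES so the
# nvidia container runtime does not inject any GPU devices into this container.
apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-pod                     # illustrative name
spec:
  containers:
  - name: app
    image: nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
    env:
    - name: NVIDIA_VISIBLE_DEVICES     # overrides the image default ("all")
      value: ""                        # empty => behave like plain runc, no GPUs
    resources:
      limits:
        nvidia.com/gpu: "0"
    command: ["/bin/sh", "-c", "sleep 10d"]
  restartPolicy: Never
```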
@jiayingz I finally got around to testing the GKE nvidia device plugin and it works nicely. I now have a k8s setup with GPU quotas working as expected (see the sketch after this comment). With the GKE plugin I also don't have the env var issue, since the plugin code does its job well and simply.

I would suggest, though, making the driver installation instructions much simpler for OSes that don't have the non-writable rootfs issue, e.g. Ubuntu/CentOS; the current driver installation docs are more complex than needed.

For reference, I used an Ansible script to install the drivers on my Ubuntu boxes; it can be reused fairly easily on other OSes (the package installation part needs adapting). After that I deployed the daemonset with the parameters the device plugin needed to work correctly. I also tested the NVIDIA index image `nvcr.io/nvidia-hpcvis/index:1.0`, which exercises the visualisation part, and it seems to work on this setup, as well as on my deep learning setup. These are the only two things I needed to get GPUs working on k8s. Thanks @mindprince @jiayingz for your help. @pineking you might want to give this setup a try to resolve your env var issue as well.
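A minimal sketch of a namespace-level GPU quota of the kind mentioned above, using the extended-resource quota form `requests.nvidia.com/gpu` (the name, namespace, and limit are illustrative, not the commenter's actual config):

```yaml
# Hypothetical ResourceQuota: cap the total number of NVIDIA GPUs that
# pods in this namespace may request via the device plugin resource.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota          # illustrative name
  namespace: team-a        # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # at most 4 GPUs requested across the namespace
```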
So I'm wondering whether there is any plan to add hook functionality to the device plugin API, because the current way of using NVIDIA GPUs with the device plugin is not suitable for a multi-tenant environment.

For now I think `PodPreset` might solve my problem: it can inject the env var `NVIDIA_VISIBLE_DEVICES: ""` into every pod except those carrying a certain label, e.g. `gpu: nvidia`. That way I can be sure GPU devices are never exposed to a pod unless it asks for them. I would combine this with an admission controller that checks that, when the `gpu: nvidia` label is present, a resource limit such as `nvidia.com/gpu: 1` is also set, and otherwise strips the label. A sketch of such a PodPreset follows.
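A minimal sketch of that kind of PodPreset, assuming the PodPreset feature (`settings.k8s.io/v1alpha1`) and its admission controller are enabled on the cluster; the name, namespace, and label key are illustrative:

```yaml
# Hypothetical PodPreset: inject an empty NVIDIA_VISIBLE_DEVICES into every pod
# in the namespace that is NOT labelled gpu=nvidia, so GPUs stay hidden by default.
apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: hide-nvidia-gpus   # illustrative name
  namespace: default       # PodPresets are namespaced
spec:
  selector:
    matchExpressions:
    - key: gpu             # pods missing this label, or with another value,
      operator: NotIn      # match the selector and receive the injected env var
      values: ["nvidia"]
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: ""              # empty => the nvidia runtime injects no GPU devices
```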