kubernetes: CUDA_VISIBLE_DEVICES not set, no way to tell which devices were reserved

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): yes

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): CUDA_VISIBLE_DEVICES


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T22:51:55Z", GoVersion:"go1.8.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: Baremetal
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"
  • Kernel (e.g. uname -a): Node:
4.4.0-75-generic #96~14.04.1-Ubuntu SMP Thu Apr 20 11:06:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Pod:

Linux busybox2 4.4.0-75-generic #96~14.04.1-Ubuntu SMP Thu Apr 20 11:06:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

What happened: On my pod I set a resource limit of 2 GPUs (new 1.6 feature). When I SSH into the node (which has 8 GPUs total) and inspect the Docker container, I see that only two devices were chosen and attached (correct behavior), but I can't tell which devices those were. From inside the pod, all 8 devices appear as /dev/nvidia{0..7}. My pod and node YAML are included below.

I also tried dumping the container's environment and didn't see anything relevant:

kubectl exec busybox2 -- bash -c "xargs --null --max-args=1 < /proc/1/environ"
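
One way to see which devices were actually attached is to inspect the container's HostConfig on the node (the container ID below is a placeholder):

# run on the node; prints PathOnHost/PathInContainer for each attached device
docker inspect --format '{{json .HostConfig.Devices}}' <container-id>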

Tried to schedule the following pod:

apiVersion: v1
kind: Pod
metadata:
  name: busybox2
spec:
  containers:
  - image: [ubuntu 14.04]
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /usr/local/nvidia
      name: nvidia
    - mountPath: /usr/local/cuda
      name: cuda
    - mountPath: /usr/lib/x86_64-linux-gnu/mesa
      name: lib
    command:
      - sleep
      - "3600"
    name: busybox
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
  volumes:
  - hostPath:
      path: /usr/local/nvidia
    name: nvidia
  - hostPath:
      path: /usr/local/cuda
    name: cuda
  - hostPath:
      path: /usr/lib/x86_64-linux-gnu
    name: lib
  restartPolicy: Never

on this node:

apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2017-05-16T18:58:04Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: [host]
    openai.org/gpu-count: "8"
    openai.org/gpu-type: 1080ti
  name: [host]
  resourceVersion: "162166"
  selfLink: /api/v1/nodes[host]
  uid: 975a9d20-3a69-11e7-a320-fcaa14ea6d82
status:
 ...
  allocatable:
    alpha.kubernetes.io/nvidia-gpu: "8"
    cpu: "40"
    memory: 263939800Ki
    pods: "28"
  capacity:
    alpha.kubernetes.io/nvidia-gpu: "8"
    cpu: "40"
    memory: 264042200Ki
    pods: "28"
  conditions:
  - lastHeartbeatTime: 2017-05-17T20:41:16Z
    lastTransitionTime: 2017-05-16T18:58:04Z
    message: kubelet has sufficient disk space available
    reason: KubeletHasSufficientDisk
    status: "False"
    type: OutOfDisk
  - lastHeartbeatTime: 2017-05-17T20:41:16Z
    lastTransitionTime: 2017-05-16T18:58:04Z
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: 2017-05-17T20:41:16Z
    lastTransitionTime: 2017-05-16T18:58:04Z
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: 2017-05-17T20:41:16Z
    lastTransitionTime: 2017-05-16T18:58:04Z
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    ...
  nodeInfo:
    architecture: amd64
    bootID: fe667c90-b764-4a56-a817-518555e893bf
    containerRuntimeVersion: docker://1.12.6
    kernelVersion: 4.4.0-75-generic
    kubeProxyVersion: v1.6.2
    kubeletVersion: v1.6.2
    machineID: 9cae6c159be1b1271f87c4075910b9a5
    operatingSystem: linux
    osImage: Ubuntu 14.04.5 LTS
    systemUUID: 00000000-0000-0000-0000-AC1F6B206ED8

What you expected to happen: I expected either CUDA_VISIBLE_DEVICES to be set to the allocated devices, or only the allocated subset of devices to appear under /dev/nvidia*.
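
If only the allocated subset showed up, CUDA_VISIBLE_DEVICES could in principle be derived inside the container from whichever /dev/nvidiaN nodes are present; a rough sketch (it assumes CUDA device indices line up with the /dev/nvidiaN numbers, which is not guaranteed):

# sketch: collect the indices of the nvidiaN device nodes that are visible
# and join them into a comma-separated list
export CUDA_VISIBLE_DEVICES=$(ls /dev/nvidia[0-9]* 2>/dev/null | sed 's|/dev/nvidia||' | paste -sd, -)
echo "$CUDA_VISIBLE_DEVICES"   # e.g. 3,5 if only /dev/nvidia3 and /dev/nvidia5 are mapped in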

How to reproduce it (as minimally and precisely as possible): See above.

Anything else we need to know:

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 22 (16 by maintainers)


Most upvoted comments

Summary from SIG Node chat:

  • Privileged containers are needed for those libraries, as you mentioned
  • Workarounds include creating a PV containing the libraries and then mounting it into the pods (a minimal sketch follows below)
  • A more advanced solution would be a volume-injection system (similar to the PV approach) that mounts the libraries into pods requesting GPUs automatically
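
A minimal sketch of that PV workaround, assuming the driver libraries live under /usr/local/nvidia on the node; the names here are illustrative, not from the thread:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nvidia-libs              # hypothetical name
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /usr/local/nvidia      # node directory holding the driver libraries
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-libs
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

A pod requesting GPUs would then mount the claim via a persistentVolumeClaim volume (claimName: nvidia-libs) instead of the raw hostPath volumes in the pod spec above.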

You should be able to mount /dev/shm; are you not able to do that?
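
For context, the usual way to get a writable /dev/shm in a pod is a memory-backed emptyDir; a minimal sketch (container name and image are placeholders):

spec:
  containers:
  - name: app                    # placeholder
    image: [your image]
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory             # tmpfs-backed volume mounted over /dev/shm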