kubernetes: CUDA_VISIBLE_DEVICES not set, no way to tell which devices were reserved
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): yes
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): CUDA_VISIBLE_DEVICES
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Kubernetes version (use `kubectl version`):
```
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T22:51:55Z", GoVersion:"go1.8.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
```
Environment:
- Cloud provider or hardware configuration: Baremetal
- OS (e.g. from /etc/os-release):
```
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"
```
- Kernel (e.g. `uname -a`):
Node:
```
4.4.0-75-generic #96~14.04.1-Ubuntu SMP Thu Apr 20 11:06:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
```
Pod:
```
Linux busybox2 4.4.0-75-generic #96~14.04.1-Ubuntu SMP Thu Apr 20 11:06:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
```
- Install tools:
- Others:
What happened: On my pod I set a resource limit of 2 GPUs (a new 1.6 feature). When I ssh into the node (which has 8 GPUs total) and inspect the Docker container, I see that only two devices were chosen and attached (correct behavior), but I don't see any way to determine which devices these were. From inside the pod, all eight devices appear as /dev/nvidia{0..7}. My pod and node YAML are included below.

I also tried dumping the environment and don't see anything relevant:
```
kubectl exec busybox2 -- bash -c "xargs --null --max-args=1 < /proc/1/environ"
```
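For reference, this is roughly how the attached devices can be checked from the node (the container ID is a placeholder; `.HostConfig.Devices` lists the device nodes the runtime mapped in):

```sh
# On the node: locate the pod's container, then dump the devices Docker attached.
docker ps | grep busybox2
docker inspect --format '{{json .HostConfig.Devices}}' <container-id>
```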
I tried to schedule the following pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox2
spec:
  containers:
  - image: [ubuntu 14.04]
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /usr/local/nvidia
      name: nvidia
    - mountPath: /usr/local/cuda
      name: cuda
    - mountPath: /usr/lib/x86_64-linux-gnu/mesa
      name: lib
    command:
    - sleep
    - "3600"
    name: busybox
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
  volumes:
  - hostPath:
      path: /usr/local/nvidia
    name: nvidia
  - hostPath:
      path: /usr/local/cuda
    name: cuda
  - hostPath:
      path: /usr/lib/x86_64-linux-gnu
    name: lib
  restartPolicy: Never
```
on this node:
```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2017-05-16T18:58:04Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: [host]
    openai.org/gpu-count: "8"
    openai.org/gpu-type: 1080ti
  name: [host]
  resourceVersion: "162166"
  selfLink: /api/v1/nodes/[host]
  uid: 975a9d20-3a69-11e7-a320-fcaa14ea6d82
status:
  ...
  allocatable:
    alpha.kubernetes.io/nvidia-gpu: "8"
    cpu: "40"
    memory: 263939800Ki
    pods: "28"
  capacity:
    alpha.kubernetes.io/nvidia-gpu: "8"
    cpu: "40"
    memory: 264042200Ki
    pods: "28"
  conditions:
  - lastHeartbeatTime: 2017-05-17T20:41:16Z
    lastTransitionTime: 2017-05-16T18:58:04Z
    message: kubelet has sufficient disk space available
    reason: KubeletHasSufficientDisk
    status: "False"
    type: OutOfDisk
  - lastHeartbeatTime: 2017-05-17T20:41:16Z
    lastTransitionTime: 2017-05-16T18:58:04Z
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: 2017-05-17T20:41:16Z
    lastTransitionTime: 2017-05-16T18:58:04Z
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: 2017-05-17T20:41:16Z
    lastTransitionTime: 2017-05-16T18:58:04Z
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    ....
  nodeInfo:
    architecture: amd64
    bootID: fe667c90-b764-4a56-a817-518555e893bf
    containerRuntimeVersion: docker://1.12.6
    kernelVersion: 4.4.0-75-generic
    kubeProxyVersion: v1.6.2
    kubeletVersion: v1.6.2
    machineID: 9cae6c159be1b1271f87c4075910b9a5
    operatingSystem: linux
    osImage: Ubuntu 14.04.5 LTS
    systemUUID: 00000000-0000-0000-0000-AC1F6B206ED8
```
What you expected to happen: I expected either `CUDA_VISIBLE_DEVICES` to be set or only a subset of devices to appear in `/dev/nvidia*`.
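For what it's worth, here is the kind of workaround I would expect to be possible if the container were not privileged (a sketch under that assumption; it also assumes device minor numbers line up with CUDA device ordinals, which may not hold on every machine):

```sh
# Workaround sketch, assuming a NON-privileged container where only the
# reserved /dev/nvidiaN nodes are visible, and assuming minor numbers
# match CUDA device ordinals: derive CUDA_VISIBLE_DEVICES at startup.
devs=$(ls /dev/nvidia[0-9]* 2>/dev/null | sed 's|/dev/nvidia||' | sort -n | paste -sd, -)
export CUDA_VISIBLE_DEVICES="$devs"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```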
How to reproduce it (as minimally and precisely as possible): See above.
Anything else we need to know:
Commits related to this issue:
- "Add GPU mountpath warning": The GPU path is not permissive by default and requires a bit of additional setup if the operator does not allow for privileged containers. Related kubernetes/kubernetes#460... (committed to both cmluciano/kubernetes.github.io and kubernetes/website)
Summary from SIG Node chat: "you should be able to mount `/dev/shm`, are you not able to do it?"
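If privileged mode was only needed for shared memory, my reading of that suggestion looks something like the following (a sketch only, not a fix for the device-visibility question itself):

```yaml
# Sketch: mount an in-memory emptyDir at /dev/shm instead of running privileged.
apiVersion: v1
kind: Pod
metadata:
  name: busybox2
spec:
  containers:
  - name: busybox
    image: [ubuntu 14.04]
    command: ["sleep", "3600"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
    volumeMounts:
    - mountPath: /dev/shm
      name: shm
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
```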