volcano: "volcano.sh/vgpu-number" is not included in the allocatable resources.
What happened:
I followed the user guide to set up vGPU, but “volcano.sh/vgpu-number” is not included in the node's allocatable resources.
user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md
What you expected to happen:
“volcano.sh/vgpu-number: XX” should appear in the output of the following command. The actual output does not include it:
root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -ojson | jq .status.allocatable
{
  "cpu": "2",
  "ephemeral-storage": "93492209510",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8050764Ki",
  "pods": "110"
}
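The check above can be scripted so that it gives a yes/no answer. The following is a minimal sketch, not part of the user guide: the function name `has_vgpu_resource` is hypothetical, and it only looks for the resource key in whatever JSON it receives on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not from the Volcano docs): reads a node's
# .status.allocatable JSON on stdin and succeeds only if the vGPU
# resource key is present. -F matches the key as a fixed string.
has_vgpu_resource() {
    grep -qF '"volcano.sh/vgpu-number"'
}

# Example usage against a live cluster (node name taken from this report):
#   kubectl get node k8s-tryvolcano-w004 -o json \
#     | jq .status.allocatable \
#     | has_vgpu_resource && echo "vgpu-number present" || echo "vgpu-number missing"
```

With the output shown above, the helper would report the key as missing.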
How to reproduce it (as minimally and precisely as possible):
Prerequisites:
- kubernetes cluster v1.24.3 is running
- Volcano is installed
Reproduce:
- Install the NVIDIA drivers on a new GPU worker node.
- Install nvidia-docker2 on the new GPU worker node.
- Install Kubernetes on the new GPU worker node.
- Join the new GPU worker node to the Kubernetes cluster.
- Install volcano-vgpu-plugin.
Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.
Anything else we need to know?:
Environment:
- Volcano Version: v1.8.0
- Kubernetes version (use kubectl version):
root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -owide
NAME                  STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-tryvolcano-w004   Ready    <none>   18h   v1.24.3   192.168.100.168   <none>        Ubuntu 20.04.6 LTS   5.4.0-72-generic   containerd://1.7.2
- Cloud provider or hardware configuration:
Cloud provider: OpenStack
- OS (e.g. from /etc/os-release):
root@k8s-tryvolcano-w004:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
- Kernel (e.g. uname -a):
root@k8s-tryvolcano-w004:~# uname -a
Linux k8s-tryvolcano-w004 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubeadm
- Others:
Nvidia driver
root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-driver
ii nvidia-driver-535-server-open 535.104.12-0ubuntu0.20.04.1 amd64 NVIDIA driver (open kernel) metapackage
nvidia-docker2
root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-docker
ii nvidia-docker2 2.13.0-1 all nvidia-docker CLI wrapper
GPU
root@k8s-tryvolcano-w004:~# nvidia-smi
Thu Oct 19 02:24:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:00:05.0 Off | 0 |
| N/A 43C P0 63W / 300W | 4MiB / 81920MiB | 20% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
“volcano-device-plugin” pod log
I1018 08:42:42.247448 1 main.go:77] Loading NVML
I1018 08:42:42.317422 1 main.go:91] Starting FS watcher.
I1018 08:42:42.317465 1 main.go:98] Starting OS watcher.
I1018 08:42:42.317759 1 main.go:116] Retreiving plugins.
I1018 08:42:42.317770 1 main.go:155] No devices found. Waiting indefinitely.
I1018 08:42:42.317783 1 register.go:101] into WatchAndRegister
I1018 08:42:42.360498 1 register.go:89] Reporting devices in 2023-10-18 08:42:42.360494312 +0000 UTC m=+0.116513880
I1018 08:43:12.468827 1 register.go:89] Reporting devices in 2023-10-18 08:43:12.468819399 +0000 UTC m=+30.224838968
I1018 08:43:42.485190 1 register.go:89] Reporting devices in 2023-10-18 08:43:42.485182962 +0000 UTC m=+60.241202532
I1018 08:44:12.505930 1 register.go:89] Reporting devices in 2023-10-18 08:44:12.505920612 +0000 UTC m=+90.261940182
I1018 08:44:42.523805 1 register.go:89] Reporting devices in 2023-10-18 08:44:42.523797163 +0000 UTC m=+120.279816722
I1018 08:45:12.542654 1 register.go:89] Reporting devices in 2023-10-18 08:45:12.542646375 +0000 UTC m=+150.298665943
I1018 08:45:42.564609 1 register.go:89] Reporting devices in 2023-10-18 08:45:42.564600701 +0000 UTC m=+180.320620270
I1018 08:46:12.584788 1 register.go:89] Reporting devices in 2023-10-18 08:46:12.584777812 +0000 UTC m=+210.340797381
I1018 08:46:42.653138 1 register.go:89] Reporting devices in 2023-10-18 08:46:42.653129051 +0000 UTC m=+240.409148620
I1018 08:47:12.674599 1 register.go:89] Reporting devices in 2023-10-18 08:47:12.674591614 +0000 UTC m=+270.430611183
I1018 08:47:42.690977 1 register.go:89] Reporting devices in 2023-10-18 08:47:42.69097107 +0000 UTC m=+300.446990640
I1018 08:48:12.707222 1 register.go:89] Reporting devices in 2023-10-18 08:48:12.707213231 +0000 UTC m=+330.463232800
I1018 08:48:42.781451 1 register.go:89] Reporting devices in 2023-10-18 08:48:42.781437965 +0000 UTC m=+360.537457544
I1018 08:49:12.816300 1 register.go:89] Reporting devices in 2023-10-18 08:49:12.816292362 +0000 UTC m=+390.572311921
I1018 08:49:42.834850 1 register.go:89] Reporting devices in 2023-10-18 08:49:42.834844163 +0000 UTC m=+420.590863732
I1018 08:50:12.855810 1 register.go:89] Reporting devices in 2023-10-18 08:50:12.855797817 +0000 UTC m=+450.611817406
I1018 08:50:42.875763 1 register.go:89] Reporting devices in 2023-10-18 08:50:42.875755678 +0000 UTC m=+480.631775247
I1018 08:51:12.892908 1 register.go:89] Reporting devices in 2023-10-18 08:51:12.89289625 +0000 UTC m=+510.648915829
I1018 08:51:42.913563 1 register.go:89] Reporting devices in 2023-10-18 08:51:42.913556355 +0000 UTC m=+540.669575924
I1018 08:52:12.938239 1 register.go:89] Reporting devices in 2023-10-18 08:52:12.93823072 +0000 UTC m=+570.694250290
I1018 08:52:42.968125 1 register.go:89] Reporting devices in 2023-10-18 08:52:42.968118172 +0000 UTC m=+600.724137731
I1018 08:53:12.988476 1 register.go:89] Reporting devices in 2023-10-18 08:53:12.988467434 +0000 UTC m=+630.744487003
...
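The line `No devices found. Waiting indefinitely.` in the log above looks like the relevant symptom: the plugin appears to start without seeing any GPU. A minimal sketch for spotting that line automatically follows. The function name is hypothetical, and the namespace and pod name in the usage comment are placeholders to be replaced with the real ones.

```shell
#!/bin/sh
# Hypothetical helper: reads device-plugin log text on stdin and succeeds
# if it contains the "No devices found" message seen in this report.
plugin_found_no_devices() {
    grep -qF 'No devices found'
}

# Example usage (namespace and pod name are placeholders, not taken
# from this report):
#   kubectl logs -n <namespace> <device-plugin-pod> \
#     | plugin_found_no_devices && echo "plugin sees no GPU devices"
```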
volcano-scheduler-configmap
root@k8s-tryvolcano-m001:~# kubectl get cm -n volcano-system volcano-scheduler-configmap -oyaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: true
        enableReclaimable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
        arguments:
          predicate.VGPUEnable: true
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n - name: priority\n - name: gang\n enablePreemptable: false\n - name: conformance\n- plugins:\n - name: overcommit\n - name: drf\n enablePreemptable: false\n - name: predicates\n - name: proportion\n - name: nodeorder\n - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2023-09-21T04:44:44Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "4267609"
  uid: 086455c9-7a0e-42b0-a938-4e56a6371207
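To confirm that the live scheduler configuration has the vGPU predicate switched on, the ConfigMap output can be checked for the `predicate.VGPUEnable: true` argument. This is only a sketch under the assumption that a plain text match on the YAML is sufficient; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads the ConfigMap YAML on stdin and succeeds if
# the predicates plugin argument "predicate.VGPUEnable: true" appears.
vgpu_predicate_enabled() {
    grep -qE 'predicate\.VGPUEnable:[[:space:]]*true'
}

# Example usage:
#   kubectl get cm -n volcano-system volcano-scheduler-configmap -o yaml \
#     | vgpu_predicate_enabled && echo "VGPUEnable is set"
```

In this report the argument is present in the live configuration, so the scheduler side appears to be configured as the user guide describes.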
About this issue
- State: open
- Created 8 months ago
- Comments: 19 (8 by maintainers)
Can this issue be reproduced without installing the GPU Operator?