volcano: "volcano.sh/vgpu-number" is not included in the allocatable resources.

What happened:

I followed the user guide to set up vGPU, but “volcano.sh/vgpu-number” is not included in the node's allocatable resources.

user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md

What you expected to happen:

“volcano.sh/vgpu-number: XX” should appear in the output of the following command, but it is missing:

root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -ojson | jq .status.allocatable
{
  "cpu": "2",
  "ephemeral-storage": "93492209510",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8050764Ki",
  "pods": "110"
}
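
A minimal sketch of how this check could be scripted. The helper name `check_vgpu_allocatable` is hypothetical (it is not part of Volcano or kubectl); it only tests whether the resource key appears in the allocatable JSON passed on stdin.

```shell
# Hypothetical helper, not part of Volcano or kubectl: reads a node's
# .status.allocatable JSON on stdin and reports whether the vGPU
# resource key is present.
check_vgpu_allocatable() {
  # -F matches the key as a fixed string, so "." is not a regex wildcard.
  if grep -qF '"volcano.sh/vgpu-number"'; then
    echo "present"
  else
    echo "missing"
  fi
}

# Example against a live cluster (assumes kubectl and jq are installed):
#   kubectl get node k8s-tryvolcano-w004 -o json \
#     | jq .status.allocatable | check_vgpu_allocatable
```

With the allocatable output shown above, this helper would print "missing".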

How to reproduce it (as minimally and precisely as possible):

Prerequisites:

  • A Kubernetes cluster (v1.24.3) is running
  • Volcano is installed

Reproduce:

  1. Install the NVIDIA drivers on a new GPU worker node.
  2. Install nvidia-docker2 on the new GPU worker node.
  3. Install Kubernetes on the new GPU worker node.
  4. Join the new GPU worker node to the Kubernetes cluster.
  5. Install volcano-vgpu-plugin.

Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.

Anything else we need to know?:

Environment:

  • Volcano Version:

v1.8.0

  • Kubernetes version (use kubectl version):
root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -owide
NAME                  STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-tryvolcano-w004   Ready    <none>   18h   v1.24.3   192.168.100.168   <none>        Ubuntu 20.04.6 LTS   5.4.0-72-generic   containerd://1.7.2
  • Cloud provider or hardware configuration:

Cloud provider: OpenStack

  • OS (e.g. from /etc/os-release):
root@k8s-tryvolcano-w004:~# cat /etc/os-release 
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
  • Kernel (e.g. uname -a):
root@k8s-tryvolcano-w004:~# uname -a
Linux k8s-tryvolcano-w004 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:

kubeadm

  • Others:

Nvidia driver

root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-driver
ii  nvidia-driver-535-server-open         535.104.12-0ubuntu0.20.04.1       amd64        NVIDIA driver (open kernel) metapackage

nvidia-docker2

root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-docker
ii  nvidia-docker2                        2.13.0-1                          all          nvidia-docker CLI wrapper

GPU

root@k8s-tryvolcano-w004:~# nvidia-smi 
Thu Oct 19 02:24:55 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:00:05.0 Off |                    0 |
| N/A   43C    P0              63W / 300W |      4MiB / 81920MiB |     20%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

“volcano-device-plugin” pod log

I1018 08:42:42.247448       1 main.go:77] Loading NVML
I1018 08:42:42.317422       1 main.go:91] Starting FS watcher.
I1018 08:42:42.317465       1 main.go:98] Starting OS watcher.
I1018 08:42:42.317759       1 main.go:116] Retreiving plugins.
I1018 08:42:42.317770       1 main.go:155] No devices found. Waiting indefinitely.
I1018 08:42:42.317783       1 register.go:101] into WatchAndRegister
I1018 08:42:42.360498       1 register.go:89] Reporting devices  in 2023-10-18 08:42:42.360494312 +0000 UTC m=+0.116513880
I1018 08:43:12.468827       1 register.go:89] Reporting devices  in 2023-10-18 08:43:12.468819399 +0000 UTC m=+30.224838968
I1018 08:43:42.485190       1 register.go:89] Reporting devices  in 2023-10-18 08:43:42.485182962 +0000 UTC m=+60.241202532
I1018 08:44:12.505930       1 register.go:89] Reporting devices  in 2023-10-18 08:44:12.505920612 +0000 UTC m=+90.261940182
I1018 08:44:42.523805       1 register.go:89] Reporting devices  in 2023-10-18 08:44:42.523797163 +0000 UTC m=+120.279816722
I1018 08:45:12.542654       1 register.go:89] Reporting devices  in 2023-10-18 08:45:12.542646375 +0000 UTC m=+150.298665943
I1018 08:45:42.564609       1 register.go:89] Reporting devices  in 2023-10-18 08:45:42.564600701 +0000 UTC m=+180.320620270
I1018 08:46:12.584788       1 register.go:89] Reporting devices  in 2023-10-18 08:46:12.584777812 +0000 UTC m=+210.340797381
I1018 08:46:42.653138       1 register.go:89] Reporting devices  in 2023-10-18 08:46:42.653129051 +0000 UTC m=+240.409148620
I1018 08:47:12.674599       1 register.go:89] Reporting devices  in 2023-10-18 08:47:12.674591614 +0000 UTC m=+270.430611183
I1018 08:47:42.690977       1 register.go:89] Reporting devices  in 2023-10-18 08:47:42.69097107 +0000 UTC m=+300.446990640
I1018 08:48:12.707222       1 register.go:89] Reporting devices  in 2023-10-18 08:48:12.707213231 +0000 UTC m=+330.463232800
I1018 08:48:42.781451       1 register.go:89] Reporting devices  in 2023-10-18 08:48:42.781437965 +0000 UTC m=+360.537457544
I1018 08:49:12.816300       1 register.go:89] Reporting devices  in 2023-10-18 08:49:12.816292362 +0000 UTC m=+390.572311921
I1018 08:49:42.834850       1 register.go:89] Reporting devices  in 2023-10-18 08:49:42.834844163 +0000 UTC m=+420.590863732
I1018 08:50:12.855810       1 register.go:89] Reporting devices  in 2023-10-18 08:50:12.855797817 +0000 UTC m=+450.611817406
I1018 08:50:42.875763       1 register.go:89] Reporting devices  in 2023-10-18 08:50:42.875755678 +0000 UTC m=+480.631775247
I1018 08:51:12.892908       1 register.go:89] Reporting devices  in 2023-10-18 08:51:12.89289625 +0000 UTC m=+510.648915829
I1018 08:51:42.913563       1 register.go:89] Reporting devices  in 2023-10-18 08:51:42.913556355 +0000 UTC m=+540.669575924
I1018 08:52:12.938239       1 register.go:89] Reporting devices  in 2023-10-18 08:52:12.93823072 +0000 UTC m=+570.694250290
I1018 08:52:42.968125       1 register.go:89] Reporting devices  in 2023-10-18 08:52:42.968118172 +0000 UTC m=+600.724137731
I1018 08:53:12.988476       1 register.go:89] Reporting devices  in 2023-10-18 08:53:12.988467434 +0000 UTC m=+630.744487003

... 

volcano-scheduler-configmap

root@k8s-tryvolcano-m001:~# kubectl get cm -n volcano-system volcano-scheduler-configmap -oyaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: true
        enableReclaimable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
        arguments:
          predicate.VGPUEnable: true
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n    enablePreemptable: false\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n    enablePreemptable: false\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2023-09-21T04:44:44Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "4267609"
  uid: 086455c9-7a0e-42b0-a938-4e56a6371207

About this issue

  • State: open
  • Created 8 months ago
  • Comments: 19 (8 by maintainers)

Most upvoted comments

Can this issue be reproduced without installing the GPU Operator?