gpu-operator: Unable to set the MIG profile of "4g.20gb" via mig manager.


1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? → No, Ubuntu 20.04
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)

2. Issue or feature description

Could not set the “4g.20gb” profile on an A100 40GB PCIe card. The mig-manager reported the following error when this profile was applied:

Unable to validate the selected MIG configuration
Restarting all GPU clients previously shutdown by reenabling their component-specific nodeSelector labels
node/a100-server-b not labeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
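For more detail on why validation failed, the full mig-manager logs can be pulled from its daemonset (namespace and object name as shown in the cluster listings further below):

```shell
# Inspect the mig-manager logs for the underlying validation error
kubectl logs -n gpu-operator-resources ds/nvidia-mig-manager
```

These commands require access to the affected cluster, so no sample output is shown here.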

3. Steps to reproduce the issue

  1. Deployed the pre-release v1.8.2 operator.
  2. Added the following custom profile to the MIG config:
a100-server-b-balanced:
  - devices: all
    mig-devices:
      "4g.20gb": 1
    mig-enabled: true
  3. Changed the node's nvidia.com/mig.config label to select this profile.
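For reference, selecting the custom profile and checking the result look roughly like this (node and profile names taken from this report):

```shell
# Request the custom MIG profile on the node
kubectl label nodes a100-server-b nvidia.com/mig.config=a100-server-b-balanced --overwrite

# The mig-manager records the outcome in the mig.config.state label
# ('success', or 'failed' as seen in the error above)
kubectl get node a100-server-b \
  -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
```

Both commands require access to the affected cluster.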

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
NAMESPACE                NAME                                                          READY   STATUS      RESTARTS   AGE
cattle-system            cattle-cluster-agent-6d969d75b8-x6s8w                         1/1     Running     7          78d
cattle-system            cattle-node-agent-zbxjz                                       1/1     Running     7          78d
default                  gpu-operator-68c95b5679-6298n                                 1/1     Running     1          14d
default                  gpu-operator-node-feature-discovery-master-58d884d5cc-2r8dm   1/1     Running     0          36d
default                  gpu-operator-node-feature-discovery-worker-6zmd8              1/1     Running     0          36d
gpu-operator-resources   gpu-feature-discovery-vd6th                                   1/1     Running     0          12m
gpu-operator-resources   nvidia-container-toolkit-daemonset-xdfd9                      1/1     Running     0          14d
gpu-operator-resources   nvidia-cuda-validator-svvxf                                   0/1     Completed   0          12m
gpu-operator-resources   nvidia-dcgm-exporter-t9hqd                                    1/1     Running     0          12m
gpu-operator-resources   nvidia-dcgm-tvq4k                                             1/1     Running     0          12m
gpu-operator-resources   nvidia-device-plugin-daemonset-nw8mk                          1/1     Running     0          12m
gpu-operator-resources   nvidia-device-plugin-validator-pv676                          0/1     Completed   0          11m
gpu-operator-resources   nvidia-driver-daemonset-8swtd                                 1/1     Running     0          14d
gpu-operator-resources   nvidia-mig-manager-vtbtr                                      1/1     Running     0          14d
gpu-operator-resources   nvidia-operator-validator-rqxq4                               1/1     Running     0          12m
kube-system              calico-kube-controllers-6949477b58-jsprr                      1/1     Running     7          78d
kube-system              calico-node-p8b2p                                             1/1     Running     7          78d
kube-system              coredns-74ff55c5b-288sr                                       1/1     Running     7          78d
kube-system              coredns-74ff55c5b-nm7vm                                       1/1     Running     7          78d
kube-system              etcd-a100-server-b                                            1/1     Running     7          78d
kube-system              kube-apiserver-a100-server-b                                  1/1     Running     8          78d
kube-system              kube-controller-manager-a100-server-b                         1/1     Running     7          78d
kube-system              kube-proxy-jkdm4                                              1/1     Running     7          78d
kube-system              kube-scheduler-a100-server-b                                  1/1     Running     7          78d
monitoring               kube-prometheus-stack-grafana-5c5c84b568-4r96r                2/2     Running     12         77d
monitoring               kube-prometheus-stack-kube-state-metrics-6f85498dd8-xsdzs     1/1     Running     6          77d
monitoring               kube-prometheus-stack-operator-64c65d9fdd-9z7mn               1/1     Running     6          77d
monitoring               kube-prometheus-stack-prometheus-node-exporter-lvlkw          1/1     Running     6          77d
monitoring               prometheus-kube-prometheus-stack-prometheus-0                 2/2     Running     13         77d
  • kubernetes daemonset status: kubectl get ds --all-namespaces
NAMESPACE                NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
cattle-system            cattle-node-agent                                1         1         1       1            1           <none>                                             78d
default                  gpu-operator-node-feature-discovery-worker       1         1         1       1            1           <none>                                             36d
gpu-operator-resources   gpu-feature-discovery                            1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   36d
gpu-operator-resources   nvidia-container-toolkit-daemonset               1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       36d
gpu-operator-resources   nvidia-dcgm                                      1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                    36d
gpu-operator-resources   nvidia-dcgm-exporter                             1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           36d
gpu-operator-resources   nvidia-device-plugin-daemonset                   1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           36d
gpu-operator-resources   nvidia-driver-daemonset                          1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                  36d
gpu-operator-resources   nvidia-mig-manager                               1         1         1       1            1           nvidia.com/gpu.deploy.mig-manager=true             36d
gpu-operator-resources   nvidia-operator-validator                        1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      36d
kube-system              calico-node                                      1         1         1       1            1           kubernetes.io/os=linux                             78d
kube-system              kube-proxy                                       1         1         1       1            1           kubernetes.io/os=linux                             78d
monitoring               kube-prometheus-stack-prometheus-node-exporter   1         1         1       1            1           <none>                                             77d
  • Docker configuration file: cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "exec-opts": [
        "native.cgroupdriver=systemd"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
        },
        "nvidia-experimental": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
        }
    },
    "storage-driver": "overlay2",
    "storage-opts": [
        "overlay2.override_kernel_check=true"
    ]
}
  • Docker runtime configuration: docker info | grep Runtime
 Runtimes: nvidia nvidia-experimental runc
WARNING: No swap limit support
 Default Runtime: nvidia
  • NVIDIA shared directory: ls -la /run/nvidia
total 12
drwxr-xr-x  4 root root  120 Sep  7 05:00 .
drwxr-xr-x 40 root root 1340 Sep 21 11:59 ..
drwxr-xr-x  1 root root 4096 Sep  7 05:00 driver
-rw-r--r--  1 root root    8 Sep  7 05:00 nvidia-driver.pid
-rw-r--r--  1 root root    8 Sep  7 04:59 toolkit.pid
drwxr-xr-x  2 root root  120 Sep 21 11:48 validations
  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
 total 8552
drwxr-xr-x 3 root root    4096 Sep  7 04:59 .
drwxr-xr-x 3 root root    4096 Sep  7 04:59 ..
drwxr-xr-x 3 root root    4096 Sep  7 04:59 .config
lrwxrwxrwx 1 root root      28 Sep  7 04:59 libnvidia-container.so.1 -> libnvidia-container.so.1.4.0
-rwxr-xr-x 1 root root  175120 Sep  7 04:59 libnvidia-container.so.1.4.0
-rwxr-xr-x 1 root root     154 Sep  7 04:59 nvidia-container-cli
-rwxr-xr-x 1 root root   43024 Sep  7 04:59 nvidia-container-cli.real
-rwxr-xr-x 1 root root     342 Sep  7 04:59 nvidia-container-runtime
-rwxr-xr-x 1 root root     429 Sep  7 04:59 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root 3991000 Sep  7 04:59 nvidia-container-runtime.experimental
lrwxrwxrwx 1 root root      24 Sep  7 04:59 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x 1 root root 2359384 Sep  7 04:59 nvidia-container-runtime.real
-rwxr-xr-x 1 root root     198 Sep  7 04:59 nvidia-container-toolkit
-rwxr-xr-x 1 root root 2147896 Sep  7 04:59 nvidia-container-toolkit.real
  • NVIDIA driver directory: ls -la /run/nvidia/driver
total 88
drwxr-xr-x    1 root root  4096 Sep  7 05:00 .
drwxr-xr-x    4 root root   120 Sep  7 05:00 ..
lrwxrwxrwx    1 root root     7 Jul 23 17:35 bin -> usr/bin
drwxr-xr-x    2 root root  4096 Apr 15  2020 boot
drwxr-xr-x   17 root root  4300 Sep  7 05:01 dev
-rwxr-xr-x    1 root root     0 Sep  7 05:00 .dockerenv
drwxr-xr-x    1 root root  4096 Sep  7 05:00 drivers
drwxr-xr-x    1 root root  4096 Sep  7 05:01 etc
drwxr-xr-x    2 root root  4096 Apr 15  2020 home
drwxr-xr-x    2 root root  4096 Sep  7 05:00 host-etc
lrwxrwxrwx    1 root root     7 Jul 23 17:35 lib -> usr/lib
lrwxrwxrwx    1 root root     9 Jul 23 17:35 lib32 -> usr/lib32
lrwxrwxrwx    1 root root     9 Jul 23 17:35 lib64 -> usr/lib64
lrwxrwxrwx    1 root root    10 Jul 23 17:35 libx32 -> usr/libx32
drwxr-xr-x    2 root root  4096 Jul 23 17:35 media
drwxr-xr-x    2 root root  4096 Jul 23 17:35 mnt
-rw-r--r--    1 root root 16047 Aug  3 20:33 NGC-DL-CONTAINER-LICENSE
drwxr-xr-x    2 root root  4096 Jul 23 17:35 opt
dr-xr-xr-x 1279 root root     0 Aug 16 01:57 proc
drwx------    2 root root  4096 Jul 23 17:38 root
drwxr-xr-x    1 root root  4096 Sep  7 05:01 run
lrwxrwxrwx    1 root root     8 Jul 23 17:35 sbin -> usr/sbin
drwxr-xr-x    2 root root  4096 Jul 23 17:35 srv
dr-xr-xr-x   13 root root     0 Sep  7 05:00 sys
drwxrwxrwt    1 root root  4096 Sep  7 05:01 tmp
drwxr-xr-x    1 root root  4096 Jul 23 17:35 usr
drwxr-xr-x    1 root root  4096 Jul 23 17:38 var
  • kubelet logs (journalctl -u kubelet > kubelet.logs): available on request.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

In general, the k8s-mig-manager does not automate the process of ensuring that it is “OK” to change the MIG configuration on the GPUs that it manages. It assumes that some higher level entity is coordinating the setting of the mig.config label on the node when it is OK to do so.

At present, best practice before applying a MIG config change to any GPU is to:

  1. Cordon the node
  2. Wait for all GPU jobs to complete
  3. Apply the mig.config label to change the MIG config
  4. Wait for the MIG config change to complete
  5. Uncordon the node

We didn’t want to be prescriptive and automate all of these steps in the k8s-mig-manager itself because SRE teams typically already have their own process for doing 1, 2 and 5, and now they can just integrate the setting of the mig.config label into that process when desired.
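The steps above can be sketched as a script. This is only an illustration, not an official workflow: the node and profile names are examples from this issue, and step 2 (waiting for GPU jobs) is site-specific, so it is left as a comment.

```shell
#!/usr/bin/env bash
set -euo pipefail

NODE="a100-server-b"             # node to reconfigure (example from this issue)
PROFILE="a100-server-b-balanced" # MIG profile to apply

# 1. Cordon the node so no new GPU jobs are scheduled on it
kubectl cordon "$NODE"

# 2. Wait for all GPU jobs to complete (site-specific; draining is one option)
#    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# 3. Apply the mig.config label to request the new MIG config
kubectl label node "$NODE" nvidia.com/mig.config="$PROFILE" --overwrite

# 4. Wait for the mig-manager to report the outcome via mig.config.state
until state=$(kubectl get node "$NODE" \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}') \
    && [ "$state" = "success" ]; do
  if [ "$state" = "failed" ]; then
    echo "MIG reconfiguration failed" >&2
    exit 1
  fi
  sleep 5
done

# 5. Uncordon the node
kubectl uncordon "$NODE"
```

The script requires a live cluster and cluster-admin access, so it is shown here only as a sketch of how the label setting could be integrated into an existing SRE process.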