k8s-device-plugin: Cannot pass RTX 3090 through to pod; Failed to initialize NVML: could not load NVML library.

1. Issue or feature description

Cannot pass the RTX 3090 GPU through to a pod with k8s-device-plugin (both the plain Kubernetes manifest and the Helm installation fail).

2. Steps to reproduce the issue

  • kubeadm version: 1.21.1
  • kubectl version: 1.21.1
  • kubelet version: 1.21.1
  • CRI-O version: 1.21:1.21.1

I am trying to create a cluster using CRI-O as the container runtime and flannel as the CNI.

Command used to initialize the cluster: sudo kubeadm init --cri-socket /var/run/crio/crio.sock --pod-network-cidr 10.244.0.0/16

Adding flannel: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

Adding the NVIDIA k8s-device-plugin: kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The nvidia-device-plugin-daemonset-llthp pod then reports the following logs:

2021/08/30 06:04:38 Loading NVML
2021/08/30 06:04:38 Failed to initialize NVML: could not load NVML library.
2021/08/30 06:04:38 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/30 06:04:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 06:04:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 06:04:38 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

I then try to create a pod using the following YAML:

apiVersion: v1
kind: Pod
metadata:
  name: torch
  labels:
    app: torch
spec:
  containers:
  - name: torch
    image: nvcr.io/nvidia/pytorch:21.03-py3
    #command: [ "/bin/bash", "-c", "--" ]
    #args: [ "while true; do sleep 30; done;" ]
    ports:
      - containerPort: 8888
        protocol: TCP
    resources:
      requests:
        nvidia.com/gpu: 1
        memory: "64Mi"
        cpu: "250m"
      limits:
        nvidia.com/gpu: 1
        memory: "128Mi"
        cpu: "500m"

Kubernetes fails to schedule the pod because no GPU is available:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  15s (x3 over 92s)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
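
The "Insufficient nvidia.com/gpu" message suggests the node is not advertising any GPU resource, which would be consistent with the device plugin failing to start. A minimal sketch for confirming that, assuming the node object has been dumped with `kubectl get node <node-name> -o json` (the helper name `allocatable_gpus` and the sample JSON are illustrative, not taken from this cluster):

```python
import json


def allocatable_gpus(node_json: str, resource: str = "nvidia.com/gpu") -> int:
    """Return how many units of `resource` the node advertises as allocatable.

    `node_json` is the text printed by `kubectl get node <node-name> -o json`.
    A missing entry is treated as zero, i.e. the resource was never registered.
    """
    node = json.loads(node_json)
    allocatable = node.get("status", {}).get("allocatable", {})
    # Extended resources such as GPUs are reported as integer strings.
    return int(allocatable.get(resource, "0"))


if __name__ == "__main__":
    # Illustrative sample only: a node with no registered GPU resource.
    sample = json.dumps({"status": {"allocatable": {"cpu": "16", "memory": "32Gi"}}})
    print(allocatable_gpus(sample))
```

A result of 0 on the GPU node would point at the device plugin registration rather than at the pod spec.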

However, the device plugin runs without error under Docker when I run:

 docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

Output:

2021/08/30 10:38:09 Loading NVML
2021/08/30 10:38:09 Starting FS watcher.
2021/08/30 10:38:09 Starting OS watcher.
2021/08/30 10:38:09 Retreiving plugins.
2021/08/30 10:38:09 Starting GRPC server for 'nvidia.com/gpu'
2021/08/30 10:38:09 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/08/30 10:38:09 Registered device plugin for 'nvidia.com/gpu' with Kubelet

It seems Docker can pass the GPU through successfully, but Kubernetes cannot.
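
Because the cluster was initialized against the CRI-O socket while the working test above ran under Docker, it may be worth confirming which runtime the kubelet actually reports; settings in /etc/docker/daemon.json apply only to containers started by the Docker daemon. A rough sketch, assuming a dump from `kubectl get node <node-name> -o json` (the helper name `node_runtime` is hypothetical):

```python
import json


def node_runtime(node_json: str) -> str:
    """Return the container runtime name the kubelet reports for a node.

    Kubernetes exposes it in status.nodeInfo.containerRuntimeVersion as
    "<runtime>://<version>", for example "cri-o://1.21.1".
    """
    node = json.loads(node_json)
    version = (
        node.get("status", {})
        .get("nodeInfo", {})
        .get("containerRuntimeVersion", "")
    )
    # Keep only the part before "://"; empty string if the field is absent.
    return version.split("://", 1)[0] if "://" in version else ""
```

If this returns a runtime other than docker for the GPU node, the nvidia default runtime configured for Docker is probably not what the device plugin pod runs under.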

Can anybody help me figure out the problem?

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
==============NVSMI LOG==============

Timestamp                                 : Mon Aug 30 18:22:17 2021
Driver Version                            : 460.73.01
CUDA Version                              : 11.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : GeForce RTX 3090
    Product Brand                         : GeForce
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-948211b6-df7a-5768-ca7b-a84e23d9404d
    Minor Number                          : 0
    VBIOS Version                         : 94.02.26.08.1C
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220410DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x403B1458
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 1000 KB/s
        Rx Throughput                     : 1000 KB/s
    Fan Speed                             : 41 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24265 MiB
        Used                              : 1256 MiB
        Free                              : 23009 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 14 MiB
        Free                              : 242 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 1 %
        Memory                            : 10 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 48 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 34.64 W
        Power Limit                       : 350.00 W
        Default Power Limit               : 350.00 W
        Enforced Power Limit              : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    Clocks
        Graphics                          : 270 MHz
        SM                                : 270 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2692
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 73 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 3028
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 160 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 5521
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 624 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 5654
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 84 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 8351
            Type                          : G
            Name                          : /usr/share/skypeforlinux/skypeforlinux --type=gpu-process --field-trial-handle=2437345894369599647,6238031376657225521,131072 --enable-features=WebComponentsV0Enabled --disable-features=CookiesWithoutSameSiteMustBeSecure,SameSiteByDefaultCookies,SpareRendererForSitePerProcess --enable-crash-reporter=97d5b09d-f9b0-4336-bc9a-fe11870fe1b3,no_channel --global-crash-keys=97d5b09d-f9b0-4336-bc9a-fe11870fe1b3,no_channel,_companyName=Skype,_productName=skypeforlinux,_version=8.73.0.92 --gpu-preferences=OAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAA== --shared-files
            Used GPU Memory               : 14 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 8560
            Type                          : G
            Name                          : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=10043073040938675921,16429150098372267894,131072 --enable-crashpad --crashpad-handler-pid=8526 --enable-crash-reporter=a844a16f-8f0f-4770-87e1-a8389ca3c415, --gpu-preferences=UAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAAABgAAAAAAAAAGAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files
            Used GPU Memory               : 91 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 8582
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 178 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 9139
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 9931
            Type                          : G
            Name                          : gnome-control-center
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 11503
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 64276
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 78463
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 4 MiB
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "storage-driver": "overlay2",
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • The k8s-device-plugin container logs
2021/08/30 06:04:38 Loading NVML
2021/08/30 06:04:38 Failed to initialize NVML: could not load NVML library.
2021/08/30 06:04:38 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/30 06:04:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 06:04:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 06:04:38 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
 Aug 30 14:12:23 srv1 kubelet[108111]: I0830 14:12:23.643580  108111 eviction_manager.go:346] "Eviction manager: able to reduce resource pressure without evicting pods." resourceName="ephemeral-storage"
 Aug 30 14:12:23 srv1 kubelet[108111]: I0830 14:12:23.457677  108111 eviction_manager.go:425] "Eviction manager: unexpected error when attempting to reduce resource pressure" resourceName="ephemeral-storage" err="wanted to free 9223372036854775807 bytes, but freed 14575560277 bytes space with errors in image deletion: [rpc error: code = U
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404808  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ad8c213c76c5990969673d7a22ed6bce9d13e6cdd613fefd2db967a03e1cd816" size=14575560277
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404791  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 864db3a48c0a2753840a7f994873c2c5af696d6765aeb229b49e455ea5e98c4c: image is in use by a container" image="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e605eac967b89
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404762  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 864db3a48c0a2753840a7f994873c2c5af696d6765aeb229b49e455ea5e98c4c: image is in use by a container" image="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e60
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404494  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e605eac967b899" size=42585056
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404479  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by aa72c61c4181efcc0f55c70f42078481cc0af69654343aa98edd6bfac63290ba: image is in use by a container" image="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9399d17c69d
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404467  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by aa72c61c4181efcc0f55c70f42078481cc0af69654343aa98edd6bfac63290ba: image is in use by a container" image="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404230  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9399d17c69de" size=68899837
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404212  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by b53c6818067e6b95f5e4689d991f86524bb4e47baec455a0211168b321e1af1b: image is in use by a container" image="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd466bd8b4f7ef
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404187  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by b53c6818067e6b95f5e4689d991f86524bb4e47baec455a0211168b321e1af1b: image is in use by a container" image="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd46
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403939  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd466bd8b4f7ef8" size=195847465
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403932  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 83c0ea8f464dc205726d29d407f564b5115e9b80bd65bac2f087463d80ff95ed: image is in use by a container" image="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c05db18776ab
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403920  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 83c0ea8f464dc205726d29d407f564b5115e9b80bd65bac2f087463d80ff95ed: image is in use by a container" image="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c0
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403680  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c05db18776ab3" size=121095258
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403673  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by ca4d555dce70b78abd85986745371d98c2028590ae058e2320ce457f5fec0b30: image is in use by a container" image="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55bd6c0adb93
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403663  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by ca4d555dce70b78abd85986745371d98c2028590ae058e2320ce457f5fec0b30: image is in use by a container" image="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403428  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55bd6c0adb934" size=254662613
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403422  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by d4a7103f1e4829474bab231668d0377b97fc222e2a4b4332a669e912b863175a: image is in use by a container" image="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2790004b2fe
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403412  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by d4a7103f1e4829474bab231668d0377b97fc222e2a4b4332a669e912b863175a: image is in use by a container" image="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403187  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2790004b2fe3" size=51893338
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403180  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 40de580961ae274afef6eb2737f313bc8637ac21fc42fa53863a97523c07c831: image is in use by a container" image="cef7457710b1ace64357066aea33117083dfec9a023cade594cc16c7a81d936
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403171  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 40de580961ae274afef6eb2737f313bc8637ac21fc42fa53863a97523c07c831: image is in use by a container" image="cef7457710b1ace64357066aea33117083dfec9a023cade594cc1
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402907  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="cef7457710b1ace64357066aea33117083dfec9a023cade594cc16c7a81d936b" size=126883060
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402897  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 63d4a6aaa8f530cb3e33f02af9262d2ffd20f076b5803bc1ea1f03fc29f9ebf3: image is in use by a container" image="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1b7a55424d6
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402886  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 63d4a6aaa8f530cb3e33f02af9262d2ffd20f076b5803bc1ea1f03fc29f9ebf3: image is in use by a container" image="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402498  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1b7a55424d68" size=105130216
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402486  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 9270341c09e80de42955681f04bb0baaac9f931e7e4eb6aa400a7419337e107b: image is in use by a container" image="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c45
 Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402467  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 9270341c09e80de42955681f04bb0baaac9f931e7e4eb6aa400a7419337e107b: image is in use by a container" image="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402130  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c459" size=689969
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.400313  108111 image_gc_manager.go:321] "Attempting to delete unused images"
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.398657  108111 container_gc.go:85] "Attempting to delete unused containers"
 Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.398622  108111 eviction_manager.go:339] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
 Aug 30 14:12:10 srv1 kubelet[108111]: I0830 14:12:10.205926  108111 eviction_manager.go:391] "Eviction manager: unable to evict any pods from the node"

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
 Client: Docker Engine - Community
 Version:           20.10.0
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        7287ab3
 Built:             Tue Dec  8 18:59:53 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.0
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       eeddea2
  Built:            Tue Dec  8 18:57:44 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 nvidia:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • Docker command, image and tag used
  • Kernel version from uname -a
Linux srv1 5.4.0-56-generic #62~18.04.1-Ubuntu SMP Tue Nov 24 10:07:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
||/ Name                                                                            Version                                      Architecture                                 Description
+++-===============================================================================-============================================-============================================-===================================================================================================================================================================
un  libgldispatch0-nvidia                                                           <none>                                       <none>                                       (no description available)
ii  libnvidia-cfg1-460:amd64                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                                                              <none>                                       <none>                                       (no description available)
un  libnvidia-common                                                                <none>                                       <none>                                       (no description available)
ii  libnvidia-common-460                                                            460.73.01-0ubuntu1                           all                                          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-460:amd64                                                     460.73.01-0ubuntu1                           amd64                                        NVIDIA libcompute package
ii  libnvidia-container-tools                                                       1.4.0-1                                      amd64                                        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                                                      1.4.0-1                                      amd64                                        NVIDIA container runtime library
un  libnvidia-decode                                                                <none>                                       <none>                                       (no description available)
ii  libnvidia-decode-460:amd64                                                      460.73.01-0ubuntu1                           amd64                                        NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                                                                <none>                                       <none>                                       (no description available)
ii  libnvidia-encode-460:amd64                                                      460.73.01-0ubuntu1                           amd64                                        NVENC Video Encoding runtime library
un  libnvidia-extra                                                                 <none>                                       <none>                                       (no description available)
ii  libnvidia-extra-460:amd64                                                       460.73.01-0ubuntu1                           amd64                                        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                                                                  <none>                                       <none>                                       (no description available)
ii  libnvidia-fbc1-460:amd64                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                                                                    <none>                                       <none>                                       (no description available)
ii  libnvidia-gl-460:amd64                                                          460.73.01-0ubuntu1                           amd64                                        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ifr1                                                                  <none>                                       <none>                                       (no description available)
ii  libnvidia-ifr1-460:amd64                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA OpenGL-based Inband Frame Readback runtime library
un  libnvidia-ml1                                                                   <none>                                       <none>                                       (no description available)
un  nvidia-304                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-340                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-384                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-390                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-common                                                                   <none>                                       <none>                                       (no description available)
ii  nvidia-compute-utils-460                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA compute utilities
ii  nvidia-container-runtime                                                        3.5.0-1                                      amd64                                        NVIDIA container runtime
un  nvidia-container-runtime-hook                                                   <none>                                       <none>                                       (no description available)
ii  nvidia-container-toolkit                                                        1.5.1-1                                      amd64                                        NVIDIA container runtime hook
ii  nvidia-cuda-dev                                                                 9.1.85-3ubuntu1                              amd64                                        NVIDIA CUDA development files
ii  nvidia-cuda-doc                                                                 9.1.85-3ubuntu1                              all                                          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                                                                 9.1.85-3ubuntu1                              amd64                                        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                                                             9.1.85-3ubuntu1                              amd64                                        NVIDIA CUDA development toolkit
ii  nvidia-dkms-460                                                                 460.73.01-0ubuntu1                           amd64                                        NVIDIA DKMS package
un  nvidia-dkms-kernel                                                              <none>                                       <none>                                       (no description available)
un  nvidia-driver                                                                   <none>                                       <none>                                       (no description available)
ii  nvidia-driver-460                                                               460.73.01-0ubuntu1                           amd64                                        NVIDIA driver metapackage
un  nvidia-driver-binary                                                            <none>                                       <none>                                       (no description available)
un  nvidia-kernel-common                                                            <none>                                       <none>                                       (no description available)
ii  nvidia-kernel-common-460                                                        460.73.01-0ubuntu1                           amd64                                        Shared files used with the kernel module
un  nvidia-kernel-source                                                            <none>                                       <none>                                       (no description available)
ii  nvidia-kernel-source-460                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA kernel source package
un  nvidia-legacy-304xx-vdpau-driver                                                <none>                                       <none>                                       (no description available)
un  nvidia-legacy-340xx-vdpau-driver                                                <none>                                       <none>                                       (no description available)
un  nvidia-libopencl1                                                               <none>                                       <none>                                       (no description available)
un  nvidia-libopencl1-dev                                                           <none>                                       <none>                                       (no description available)
ii  nvidia-modprobe                                                                 465.19.01-0ubuntu1                           amd64                                        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-dev:amd64                                                         9.1.85-3ubuntu1                              amd64                                        NVIDIA OpenCL development files
un  nvidia-opencl-icd                                                               <none>                                       <none>                                       (no description available)
un  nvidia-persistenced                                                             <none>                                       <none>                                       (no description available)
ii  nvidia-prime                                                                    0.8.16~0.18.04.1                             all                                          Tools to enable NVIDIA's Prime
ii  nvidia-profiler                                                                 9.1.85-3ubuntu1                              amd64                                        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                                                                 465.19.01-0ubuntu1                           amd64                                        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary                                                          <none>                                       <none>                                       (no description available)
un  nvidia-smi                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-utils                                                                    <none>                                       <none>                                       (no description available)
ii  nvidia-utils-460                                                                460.73.01-0ubuntu1                           amd64                                        NVIDIA driver support binaries
un  nvidia-vdpau-driver                                                             <none>                                       <none>                                       (no description available)
ii  nvidia-visual-profiler                                                          9.1.85-3ubuntu1                              amd64                                        NVIDIA Visual Profiler for CUDA and OpenCL
ii  xserver-xorg-video-nvidia-460                                                   460.73.01-0ubuntu1                           amd64                                        NVIDIA binary Xorg driver
  • NVIDIA container library version from nvidia-container-cli -V
version: 1.4.0
build date: 2021-04-24T14:25+00:00
build revision: 704a698b7a0ceec07a48e56c37365c741718c2df
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
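
The package listing above mixes installed (`ii`) and not-installed (`un`) entries, which makes version mismatches hard to spot by eye. The following is a minimal sketch, not part of the original report, that groups installed packages by version string. The function name `installed_versions` and the shortened `sample` text are illustrative assumptions; the rows are copied from the listing above.

```python
from collections import defaultdict


def installed_versions(dpkg_output: str) -> dict:
    """Group installed ('ii') package names by their version string."""
    by_version = defaultdict(list)
    for line in dpkg_output.splitlines():
        fields = line.split()
        # dpkg -l rows: status, name, version, architecture, description...
        if len(fields) >= 3 and fields[0] == "ii":
            by_version[fields[2]].append(fields[1])
    return dict(by_version)


# A few rows taken from the listing above (hypothetical shortened sample).
sample = """\
un  libnvidia-ml1        <none>              <none>  (no description available)
ii  nvidia-driver-460    460.73.01-0ubuntu1  amd64   NVIDIA driver metapackage
ii  nvidia-modprobe      465.19.01-0ubuntu1  amd64   Load the NVIDIA kernel driver and create device files
ii  nvidia-cuda-toolkit  9.1.85-3ubuntu1     amd64   NVIDIA CUDA development toolkit
ii  nvidia-utils-460     460.73.01-0ubuntu1  amd64   NVIDIA driver support binaries
"""

for version, names in sorted(installed_versions(sample).items()):
    print(version, ", ".join(names))
```

On a real node the input could come from `dpkg -l '*nvidia*'`. Whether the differing versions shown here matter for this issue is not established in the thread; the sketch only makes them visible.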

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

@Mr-Linus note that 1.0.0-beta4 is not supported and v0.10.0 is the latest release. If you are experiencing problems with this release we should try to determine why this is.

👌🏻 Switched to v0.10.0 and it works fine.

Is there a way to run nvidia-container-runtime on io.containerd.runc.v2 rather than v1? I am getting the same error as the OP and have tried different versions of the k8s-nvidia-plugin. The GPU works fine on the host node, and nvidia-smi outputs info.

@luckyycode this seems like an unrelated issue to this thread.

Note that from the config in https://github.com/NVIDIA/k8s-device-plugin/issues/263#issuecomment-908247909 we see:

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        SystemdCgroup = true
        BinaryName = "/usr/bin/nvidia-container-runtime"

indicating the use of the v2 shim.

It may be more useful to create a new ticket describing the behaviour that you see and including any relevant k8s or containerd information and logs.

@davidho27941 I see from your description that you are installing version 1.0.0-beta4 of the device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The versioning of the NVIDIA device plugin is inconsistent in that v0.9.0, not 1.0.0-beta4, is the latest release: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.9.0

Could you see whether using this (or one of the more recent releases) addresses your issue?