k8s-device-plugin: Cannot pass through RTX 3090 into pod; Failed to initialize NVML: could not load NVML library.
1. Issue or feature description
Cannot pass through an RTX 3090 GPU via the k8s-device-plugin (both the plain-manifest and Helm installs fail).
2. Steps to reproduce the issue
My kubeadm version: 1.21.1
My kubectl version: 1.21.1
My kubelet version: 1.21.1
My CRI-O version: 1.21.1
I was trying to create a cluster using the CRI-O container runtime and the flannel CNI.
My command to initialize the cluster:
sudo kubeadm init --cri-socket /var/run/crio/crio.sock --pod-network-cidr 10.244.0.0/16
Adding flannel:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Adding NVIDIA's k8s-device-plugin:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
Then the nvidia-device-plugin-daemonset-llthp pod reports the following logs:
2021/08/30 06:04:38 Loading NVML
2021/08/30 06:04:38 Failed to initialize NVML: could not load NVML library.
2021/08/30 06:04:38 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/30 06:04:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 06:04:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 06:04:38 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
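For context (a hedged sketch, not the plugin's actual code): the "could not load NVML" failure means a dlopen() of the NVML shared library inside the plugin's container failed, which typically happens when the container runtime does not inject the host driver libraries. The probe can be mimicked with ctypes:

```python
import ctypes


def nvml_available(libname: str = "libnvidia-ml.so.1") -> bool:
    """Mimic the plugin's startup check: try to dlopen the NVML library."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        # Inside a container started without the NVIDIA runtime hook, the
        # driver library is simply absent -- the same condition the plugin
        # reports as "Failed to initialize NVML".
        return False


print(nvml_available())
```

Running this inside the plugin pod (e.g. via `kubectl exec`) would show whether the runtime injected the driver libraries at all.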
When I try to create a pod using the following YAML:
apiVersion: v1
kind: Pod
metadata:
  name: torch
  labels:
    app: torch
spec:
  containers:
  - name: torch
    image: nvcr.io/nvidia/pytorch:21.03-py3
    #command: [ "/bin/bash", "-c", "--" ]
    #args: [ "while true; do sleep 30; done;" ]
    ports:
    - containerPort: 8888
      protocol: TCP
    resources:
      requests:
        nvidia.com/gpu: 1
        memory: "64Mi"
        cpu: "250m"
      limits:
        nvidia.com/gpu: 1
        memory: "128Mi"
        cpu: "500m"
Kubernetes fails to schedule the pod because no GPU resource is available:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 15s (x3 over 92s) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
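A minimal sketch (illustrative only, not the real scheduler code) of why the pod stays Pending: a node only advertises nvidia.com/gpu in its allocatable resources after the device plugin registers successfully, so when the plugin crashes the request can never be satisfied:

```python
def can_schedule(node_allocatable: dict, pod_requests: dict) -> bool:
    """Return True if every requested resource fits in the node's allocatable set."""
    return all(node_allocatable.get(res, 0) >= qty
               for res, qty in pod_requests.items())


# The plugin failed to load NVML and never registered nvidia.com/gpu,
# so the node omits the resource entirely:
node = {"cpu": 8, "memory": 32 * 1024**3}  # no nvidia.com/gpu key
pod = {"cpu": 1, "memory": 128 * 1024**2, "nvidia.com/gpu": 1}
print(can_schedule(node, pod))  # matches "0/1 nodes are available: 1 Insufficient nvidia.com/gpu"
```

`kubectl describe node` would confirm this: the Capacity/Allocatable section would be missing the nvidia.com/gpu entry.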
But Docker runs the plugin without error when I try:
docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0
Output:
2021/08/30 10:38:09 Loading NVML
2021/08/30 10:38:09 Starting FS watcher.
2021/08/30 10:38:09 Starting OS watcher.
2021/08/30 10:38:09 Retreiving plugins.
2021/08/30 10:38:09 Starting GRPC server for 'nvidia.com/gpu'
2021/08/30 10:38:09 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/08/30 10:38:09 Registered device plugin for 'nvidia.com/gpu' with Kubelet
It seems Docker can pass the GPU through successfully, but Kubernetes cannot.
Can anybody help me figure out the problem?
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- The output of nvidia-smi -a on your host
==============NVSMI LOG==============
Timestamp : Mon Aug 30 18:22:17 2021
Driver Version : 460.73.01
CUDA Version : 11.2
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce RTX 3090
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-948211b6-df7a-5768-ca7b-a84e23d9404d
Minor Number : 0
VBIOS Version : 94.02.26.08.1C
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x220410DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x403B1458
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 1000 KB/s
Rx Throughput : 1000 KB/s
Fan Speed : 41 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 24265 MiB
Used : 1256 MiB
Free : 23009 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 14 MiB
Free : 242 MiB
Compute Mode : Default
Utilization
Gpu : 1 %
Memory : 10 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 48 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 34.64 W
Power Limit : 350.00 W
Default Power Limit : 350.00 W
Enforced Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 350.00 W
Clocks
Graphics : 270 MHz
SM : 270 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 9751 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2692
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 73 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3028
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 160 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 5521
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 624 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 5654
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 84 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 8351
Type : G
Name : /usr/share/skypeforlinux/skypeforlinux --type=gpu-process --field-trial-handle=2437345894369599647,6238031376657225521,131072 --enable-features=WebComponentsV0Enabled --disable-features=CookiesWithoutSameSiteMustBeSecure,SameSiteByDefaultCookies,SpareRendererForSitePerProcess --enable-crash-reporter=97d5b09d-f9b0-4336-bc9a-fe11870fe1b3,no_channel --global-crash-keys=97d5b09d-f9b0-4336-bc9a-fe11870fe1b3,no_channel,_companyName=Skype,_productName=skypeforlinux,_version=8.73.0.92 --gpu-preferences=OAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAA== --shared-files
Used GPU Memory : 14 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 8560
Type : G
Name : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=10043073040938675921,16429150098372267894,131072 --enable-crashpad --crashpad-handler-pid=8526 --enable-crash-reporter=a844a16f-8f0f-4770-87e1-a8389ca3c415, --gpu-preferences=UAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAAABgAAAAAAAAAGAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files
Used GPU Memory : 91 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 8582
Type : G
Name : /usr/lib/firefox/firefox
Used GPU Memory : 178 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 9139
Type : G
Name : /usr/lib/firefox/firefox
Used GPU Memory : 4 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 9931
Type : G
Name : gnome-control-center
Used GPU Memory : 4 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 11503
Type : G
Name : /usr/lib/firefox/firefox
Used GPU Memory : 4 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 64276
Type : G
Name : /usr/lib/firefox/firefox
Used GPU Memory : 4 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 78463
Type : G
Name : /usr/lib/firefox/firefox
Used GPU Memory : 4 MiB
- [x] Your docker configuration file (e.g. /etc/docker/daemon.json)
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "storage-driver": "overlay2",
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
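Note that daemon.json only configures Docker. Since the cluster was initialized with --cri-socket /var/run/crio/crio.sock, the kubelet launches pods through CRI-O, which never sees Docker's default-runtime setting. CRI-O needs its own runtime entry; a hedged sketch of a drop-in (the file name is illustrative, the section names follow crio.conf conventions) might look like:

```toml
# /etc/crio/crio.conf.d/99-nvidia.conf (hypothetical file name)
[crio.runtime]
default_runtime = "nvidia"

[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_type = "oci"
```

Restarting CRI-O afterwards (e.g. sudo systemctl restart crio) would be required for the change to take effect.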
- The k8s-device-plugin container logs
2021/08/30 06:04:38 Loading NVML
2021/08/30 06:04:38 Failed to initialize NVML: could not load NVML library.
2021/08/30 06:04:38 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/30 06:04:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 06:04:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 06:04:38 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
- The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Aug 30 14:12:23 srv1 kubelet[108111]: I0830 14:12:23.643580 108111 eviction_manager.go:346] "Eviction manager: able to reduce resource pressure without evicting pods." resourceName="ephemeral-storage"
Aug 30 14:12:23 srv1 kubelet[108111]: I0830 14:12:23.457677 108111 eviction_manager.go:425] "Eviction manager: unexpected error when attempting to reduce resource pressure" resourceName="ephemeral-storage" err="wanted to free 9223372036854775807 bytes, but freed 14575560277 bytes space with errors in image deletion: [rpc error: code = U
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404808 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ad8c213c76c5990969673d7a22ed6bce9d13e6cdd613fefd2db967a03e1cd816" size=14575560277
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404791 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 864db3a48c0a2753840a7f994873c2c5af696d6765aeb229b49e455ea5e98c4c: image is in use by a container" image="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e605eac967b89
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404762 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 864db3a48c0a2753840a7f994873c2c5af696d6765aeb229b49e455ea5e98c4c: image is in use by a container" image="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e60
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404494 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e605eac967b899" size=42585056
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404479 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by aa72c61c4181efcc0f55c70f42078481cc0af69654343aa98edd6bfac63290ba: image is in use by a container" image="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9399d17c69d
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404467 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by aa72c61c4181efcc0f55c70f42078481cc0af69654343aa98edd6bfac63290ba: image is in use by a container" image="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404230 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9399d17c69de" size=68899837
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404212 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by b53c6818067e6b95f5e4689d991f86524bb4e47baec455a0211168b321e1af1b: image is in use by a container" image="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd466bd8b4f7ef
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404187 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by b53c6818067e6b95f5e4689d991f86524bb4e47baec455a0211168b321e1af1b: image is in use by a container" image="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd46
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403939 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd466bd8b4f7ef8" size=195847465
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403932 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 83c0ea8f464dc205726d29d407f564b5115e9b80bd65bac2f087463d80ff95ed: image is in use by a container" image="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c05db18776ab
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403920 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 83c0ea8f464dc205726d29d407f564b5115e9b80bd65bac2f087463d80ff95ed: image is in use by a container" image="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c0
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403680 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c05db18776ab3" size=121095258
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403673 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by ca4d555dce70b78abd85986745371d98c2028590ae058e2320ce457f5fec0b30: image is in use by a container" image="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55bd6c0adb93
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403663 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by ca4d555dce70b78abd85986745371d98c2028590ae058e2320ce457f5fec0b30: image is in use by a container" image="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403428 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55bd6c0adb934" size=254662613
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403422 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by d4a7103f1e4829474bab231668d0377b97fc222e2a4b4332a669e912b863175a: image is in use by a container" image="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2790004b2fe
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403412 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by d4a7103f1e4829474bab231668d0377b97fc222e2a4b4332a669e912b863175a: image is in use by a container" image="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403187 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2790004b2fe3" size=51893338
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403180 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 40de580961ae274afef6eb2737f313bc8637ac21fc42fa53863a97523c07c831: image is in use by a container" image="cef7457710b1ace64357066aea33117083dfec9a023cade594cc16c7a81d936
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403171 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 40de580961ae274afef6eb2737f313bc8637ac21fc42fa53863a97523c07c831: image is in use by a container" image="cef7457710b1ace64357066aea33117083dfec9a023cade594cc1
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402907 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="cef7457710b1ace64357066aea33117083dfec9a023cade594cc16c7a81d936b" size=126883060
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402897 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 63d4a6aaa8f530cb3e33f02af9262d2ffd20f076b5803bc1ea1f03fc29f9ebf3: image is in use by a container" image="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1b7a55424d6
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402886 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 63d4a6aaa8f530cb3e33f02af9262d2ffd20f076b5803bc1ea1f03fc29f9ebf3: image is in use by a container" image="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402498 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1b7a55424d68" size=105130216
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402486 108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 9270341c09e80de42955681f04bb0baaac9f931e7e4eb6aa400a7419337e107b: image is in use by a container" image="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c45
Aug 30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402467 108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 9270341c09e80de42955681f04bb0baaac9f931e7e4eb6aa400a7419337e107b: image is in use by a container" image="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402130 108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c459" size=689969
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.400313 108111 image_gc_manager.go:321] "Attempting to delete unused images"
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.398657 108111 container_gc.go:85] "Attempting to delete unused containers"
Aug 30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.398622 108111 eviction_manager.go:339] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
Aug 30 14:12:10 srv1 kubelet[108111]: I0830 14:12:10.205926 108111 eviction_manager.go:391] "Eviction manager: unable to evict any pods from the node"
Additional information that might help better understand your environment and reproduce the bug:
- Docker version from docker version
Client: Docker Engine - Community
Version: 20.10.0
API version: 1.41
Go version: go1.13.15
Git commit: 7287ab3
Built: Tue Dec 8 18:59:53 2020
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.0
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: eeddea2
Built: Tue Dec 8 18:57:44 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.9
GitCommit: e25210fe30a0a703442421b0f60afac609f950a3
nvidia:
Version: 1.0.1
GitCommit: v1.0.1-0-g4144b63
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- Docker command, image and tag used
- Kernel version from uname -a
Linux srv1 5.4.0-56-generic #62~18.04.1-Ubuntu SMP Tue Nov 24 10:07:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Any relevant kernel output lines from dmesg
- NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
||/ Name Version Architecture Description
+++-===============================================================================-============================================-============================================-===================================================================================================================================================================
un libgldispatch0-nvidia <none> <none> (no description available)
ii libnvidia-cfg1-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any <none> <none> (no description available)
un libnvidia-common <none> <none> (no description available)
ii libnvidia-common-460 460.73.01-0ubuntu1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.4.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.4.0-1 amd64 NVIDIA container runtime library
un libnvidia-decode <none> <none> (no description available)
ii libnvidia-decode-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA Video Decoding runtime libraries
un libnvidia-encode <none> <none> (no description available)
ii libnvidia-encode-460:amd64 460.73.01-0ubuntu1 amd64 NVENC Video Encoding runtime library
un libnvidia-extra <none> <none> (no description available)
ii libnvidia-extra-460:amd64 460.73.01-0ubuntu1 amd64 Extra libraries for the NVIDIA driver
un libnvidia-fbc1 <none> <none> (no description available)
ii libnvidia-fbc1-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
un libnvidia-gl <none> <none> (no description available)
ii libnvidia-gl-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un libnvidia-ifr1 <none> <none> (no description available)
ii libnvidia-ifr1-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
un libnvidia-ml1 <none> <none> (no description available)
un nvidia-304 <none> <none> (no description available)
un nvidia-340 <none> <none> (no description available)
un nvidia-384 <none> <none> (no description available)
un nvidia-390 <none> <none> (no description available)
un nvidia-common <none> <none> (no description available)
ii nvidia-compute-utils-460 460.73.01-0ubuntu1 amd64 NVIDIA compute utilities
ii nvidia-container-runtime 3.5.0-1 amd64 NVIDIA container runtime
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.5.1-1 amd64 NVIDIA container runtime hook
ii nvidia-cuda-dev 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 9.1.85-3ubuntu1 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 9.1.85-3ubuntu1 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development toolkit
ii nvidia-dkms-460 460.73.01-0ubuntu1 amd64 NVIDIA DKMS package
un nvidia-dkms-kernel <none> <none> (no description available)
un nvidia-driver <none> <none> (no description available)
ii nvidia-driver-460 460.73.01-0ubuntu1 amd64 NVIDIA driver metapackage
un nvidia-driver-binary <none> <none> (no description available)
un nvidia-kernel-common <none> <none> (no description available)
ii nvidia-kernel-common-460 460.73.01-0ubuntu1 amd64 Shared files used with the kernel module
un nvidia-kernel-source <none> <none> (no description available)
ii nvidia-kernel-source-460 460.73.01-0ubuntu1 amd64 NVIDIA kernel source package
un nvidia-legacy-304xx-vdpau-driver <none> <none> (no description available)
un nvidia-legacy-340xx-vdpau-driver <none> <none> (no description available)
un nvidia-libopencl1 <none> <none> (no description available)
un nvidia-libopencl1-dev <none> <none> (no description available)
ii nvidia-modprobe 465.19.01-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-opencl-dev:amd64 9.1.85-3ubuntu1 amd64 NVIDIA OpenCL development files
un nvidia-opencl-icd <none> <none> (no description available)
un nvidia-persistenced <none> <none> (no description available)
ii nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 465.19.01-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary <none> <none> (no description available)
un nvidia-smi <none> <none> (no description available)
un nvidia-utils <none> <none> (no description available)
ii nvidia-utils-460 460.73.01-0ubuntu1 amd64 NVIDIA driver support binaries
un nvidia-vdpau-driver <none> <none> (no description available)
ii nvidia-visual-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
ii xserver-xorg-video-nvidia-460 460.73.01-0ubuntu1 amd64 NVIDIA binary Xorg driver
- NVIDIA container library version from nvidia-container-cli -V
version: 1.4.0
build date: 2021-04-24T14:25+00:00
build revision: 704a698b7a0ceec07a48e56c37365c741718c2df
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- NVIDIA container library logs (see troubleshooting)
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 16 (7 by maintainers)
👌🏻 Switched to v0.10.0 and it works fine.
@luckyycode this seems like an unrelated issue to this thread.
Note that from the config in https://github.com/NVIDIA/k8s-device-plugin/issues/263#issuecomment-908247909 we see:
indicating the use of the v2 shim.
It may be more useful to create a new ticket describing the behaviour that you see and including any relevant k8s or containerd information and logs.
@davidho27941 I see from your description that you are installing version 1.0.0-beta4 of the device plugin. The versioning of the NVIDIA device plugin is inconsistent in that v0.9.0 is the latest release: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.9.0. Could you see whether using this (or one of the more recent releases) addresses your issue?