podman: rootless nvidia runtime not working as expected
(Disclaimer: I might be barking up the wrong tree here, I have no idea if this is even supposed to work yet.)
Problem
The nvidia runtime does not work in rootless mode; it fails with a permission error under /sys/fs (see the debug log below).
Expected result
Rainbows.
$ sudo ^Wpodman run --rm nvidia/cuda:10.1-base nvidia-smi
Error: container_linux.go:345: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": OCI runtime error
$ sudo ^Wpodman run --rm --runtime=nvidia nvidia/cuda:10.1-base nvidia-smi
Sun Jul 28 21:26:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34 Driver Version: 430.34 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro GP100 Off | 00000000:01:00.0 Off | Off |
| 36% 48C P0 28W / 235W | 0MiB / 16277MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Debug log
$ podman --log-level=debug run --rm --runtime=nvidia nvidia/cuda:10.1-base nvidia-smi
INFO[0000] running as rootless
DEBU[0000] Initializing boltdb state at /home/hholst/.local/share/containers/storage/libpod/bolt_state.db
DEBU[0000] Using graph driver vfs
DEBU[0000] Using graph root /home/hholst/.local/share/containers/storage
DEBU[0000] Using run root /run/user/1001
DEBU[0000] Using static dir /home/hholst/.local/share/containers/storage/libpod
DEBU[0000] Using tmp dir /run/user/1001/libpod/tmp
DEBU[0000] Using volume path /home/hholst/.local/share/containers/storage/volumes
DEBU[0000] Set libpod namespace to ""
DEBU[0000] [graphdriver] trying provided driver "vfs"
DEBU[0000] Initializing event backend file
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]docker.io/nvidia/cuda:10.1-base"
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]@d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd"
DEBU[0000] exporting opaque data as blob "sha256:d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd"
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]@d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd"
DEBU[0000] exporting opaque data as blob "sha256:d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd"
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]@d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd"
DEBU[0000] Got mounts: []
DEBU[0000] Got volumes: []
DEBU[0000] Using slirp4netns netmode
DEBU[0000] created OCI spec and options for new container
DEBU[0000] Allocated lock 1 for container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]@d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd"
DEBU[0000] exporting opaque data as blob "sha256:d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd"
DEBU[0000] created container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde"
DEBU[0000] container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" has work directory "/home/hholst/.local/share/containers/storage/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata"
DEBU[0000] container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" has run directory "/run/user/1001/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata"
DEBU[0000] New container created "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde"
DEBU[0000] container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" has CgroupParent "/libpod_parent/libpod-f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde"
DEBU[0000] Not attaching to stdin
DEBU[0000] mounted container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" at "/home/hholst/.local/share/containers/storage/vfs/dir/640c8ef93cb8b3e03ede3f34bb8c346869f4ffda98b234f9d63ae22411944d9d"
DEBU[0000] Created root filesystem for container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde at /home/hholst/.local/share/containers/storage/vfs/dir/640c8ef93cb8b3e03ede3f34bb8c346869f4ffda98b234f9d63ae22411944d9d
DEBU[0000] /etc/system-fips does not exist on host, not mounting FIPS mode secret
DEBU[0000] Created OCI spec for container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde at /home/hholst/.local/share/containers/storage/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata/config.json
DEBU[0000] /usr/bin/conmon messages will be logged to syslog
DEBU[0000] running conmon: /usr/bin/conmon args="[-c f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde -u f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde -n quizzical_lewin -r /usr/bin/nvidia-container-runtime -b /home/hholst/.local/share/containers/storage/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata -p /run/user/1001/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata/pidfile --exit-dir /run/user/1001/libpod/tmp/exits --conmon-pidfile /run/user/1001/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/hholst/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /run/user/1001 --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /run/user/1001/libpod/tmp --exit-command-arg --runtime --exit-command-arg nvidia --exit-command-arg --storage-driver --exit-command-arg vfs --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde --socket-dir-path /run/user/1001/libpod/tmp/socket -l k8s-file:/home/hholst/.local/share/containers/storage/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata/ctr.log --log-level debug --syslog]"
WARN[0000] Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup for cpu: mkdir /sys/fs/cgroup/cpu/libpod_parent: permission denied
[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied
DEBU[0000] Received container pid: -1
DEBU[0000] Cleaning up container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde
DEBU[0000] Network is already cleaned up, skipping...
DEBU[0000] unmounted container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde"
DEBU[0000] Cleaning up container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde
DEBU[0000] Network is already cleaned up, skipping...
DEBU[0000] Container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde storage is already unmounted, skipping...
DEBU[0000] Container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde storage is already unmounted, skipping...
ERRO[0001] container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/sbin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 --pid=9588 /home/hholst/.local/share/containers/storage/vfs/dir/640c8ef93cb8b3e03ede3f34bb8c346869f4ffda98b234f9d63ae22411944d9d]\\\\nnvidia-container-cli: mount error: open failed: /sys/fs/cgroup/devices/user.slice/devices.allow: permission denied\\\\n\\\"\""
: OCI runtime error
Config
$ podman --version
podman version 1.4.4
$ uname -a
Linux puff.lan 5.2.3-arch1-1-ARCH #1 SMP PREEMPT Fri Jul 26 08:13:47 UTC 2019 x86_64 GNU/Linux
$ cat ~/.config/containers/libpod.conf
volume_path = "/home/hholst/.local/share/containers/storage/volumes"
image_default_transport = "docker://"
runtime = "runc"
conmon_path = ["/usr/libexec/podman/conmon", "/usr/local/lib/podman/conmon", "/usr/bin/conmon", "/usr/sbin/conmon", "/usr/local/bin/conmon", "/usr/local/sbin/conmon"]
conmon_env_vars = ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"]
cgroup_manager = "cgroupfs"
init_path = "/usr/libexec/podman/catatonit"
static_dir = "/home/hholst/.local/share/containers/storage/libpod"
tmp_dir = "/run/user/1001/libpod/tmp"
max_log_size = -1
no_pivot_root = false
cni_config_dir = "/etc/cni/net.d/"
cni_plugin_dir = ["/usr/libexec/cni", "/usr/lib/cni", "/usr/local/lib/cni", "/opt/cni/bin"]
infra_image = "k8s.gcr.io/pause:3.1"
infra_command = "/pause"
enable_port_reservation = true
label = true
network_cmd_path = ""
num_locks = 2048
events_logger = "file"
EventsLogFilePath = ""
detach_keys = "ctrl-p,ctrl-q"
[runtimes]
runc = ["/usr/bin/runc", "/usr/sbin/runc", "/usr/local/bin/runc", "/usr/local/sbin/runc", "/sbin/runc", "/bin/runc", "/usr/lib/cri-o-runc/sbin/runc"]
nvidia = ["/usr/bin/nvidia-container-runtime"]
$ pacman -Q nvidia-container-runtime
nvidia-container-runtime 2.0.0+3.docker18.09.6-1
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 22 (11 by maintainers)
I’m able to run with a few changes to a config and a custom hook. Of course, since I’m non-root the system hooks won’t be used by default, so I have to add the --hooks-dir option as well.
Add/update these two sections of /etc/nvidia-container-runtime/config.toml
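(The TOML snippet from the original comment isn’t reproduced here; the sketch below shows the kind of changes usually needed for rootless operation, and the paths are illustrative. The key settings are no-cgroups, so the prestart hook stops trying to write /sys/fs/cgroup/devices/.../devices.allow — the exact permission error in the debug log above — and debug log paths moved out of /var/log/, which is not writable without root.)
[nvidia-container-cli]
# don't touch the devices cgroup as an unprivileged user
no-cgroups = true
# keep debug logging somewhere a non-root user can write (path is an example)
debug = "/tmp/nvidia-container-cli.log"
ldconfig = "@/sbin/ldconfig"
[nvidia-container-runtime]
debug = "/tmp/nvidia-container-runtime.log"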
My quick-and-dirty hook, which I put in /usr/share/containers/oci/hooks.d/01-nvhook.json.
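(The hook JSON itself is also not preserved in this copy of the comment. A minimal definition in podman’s 1.0.0 hooks schema would look roughly like the following; the hook binary path is an assumption — this package ships /usr/bin/nvidia-container-runtime-hook, later packages call it nvidia-container-toolkit — so adjust it to whatever is actually installed.)
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-runtime-hook",
    "args": ["nvidia-container-runtime-hook", "prestart"]
  },
  "when": {
    "always": true
  },
  "stages": ["prestart"]
}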
Once that is in place, and I don’t have any mysterious bits in /run/user/1000/vfs-layers from previously using sudo podman …
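(For reference, a rootless invocation with the hook directory passed explicitly would look something like this — shown for illustration only; --hooks-dir is podman’s global option, and with the hook doing the injection the default runc runtime is used rather than --runtime=nvidia:)
$ podman --hooks-dir /usr/share/containers/oci/hooks.d run --rm nvidia/cuda:10.1-base nvidia-smi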
The usual failures I see are the non-root user trying to open files in /var/log/ for writing, and the cgroups issue mentioned in the report at https://github.com/NVIDIA/nvidia-container-runtime/issues/85
The above is only a workaround. My goals for fully resolving this issue would be: