pdm: Cannot install PyTorch 1.13.x with PDM

  • I have searched the issue tracker and believe that this is not a duplicate.


Steps to reproduce

  1. Install PyTorch 1.13.x by running pdm add torch (1.13.1 is currently the latest version).
  2. Try to import PyTorch in the interpreter: python -c 'import torch'.

Expected behavior

PyTorch should be imported without any errors.

Actual behavior

❯ python -c 'import torch'
Traceback (most recent call last):
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 172, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: .../.venv/lib/python3.10/site-packages/nvidia/cublas/lib/libcublas.so.11: cannot open shared object file: No such file or directory

Environment Information

PDM version:
  2.4.6
Python Interpreter:
  .../.venv/bin/python (3.10)
Project Root:
  ...
Project Packages:
  None
{
  "implementation_name": "cpython",
  "implementation_version": "3.10.10",
  "os_name": "posix",
  "platform_machine": "x86_64",
  "platform_release": "5.4.0-121-generic",
  "platform_system": "Linux",
  "platform_version": "#137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022",
  "python_full_version": "3.10.10",
  "platform_python_implementation": "CPython",
  "python_version": "3.10",
  "sys_platform": "linux"
}

I think this is related to the fact that PyTorch 1.13.x introduced a new set of CUDA dependencies (https://github.com/pytorch/pytorch/pull/85097). Poetry had issues because of this (https://github.com/pytorch/pytorch/issues/88049), which have since been resolved, but the problem remains for pdm. My guess is that pdm installs the CUDA dependencies separately from PyTorch, so the PyTorch installation doesn’t know about them. It’s a bummer, because I wanted to give pdm a spin for a new project; for now I’m going to have to stick to poetry. 😕
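To sanity-check this guess, a small, purely diagnostic snippet can walk sys.path and report whether the library named in the traceback is actually reachable; the path pieces below are taken from the error message above:

import os
import sys

# Look for the library from the OSError under every sys.path entry.
for entry in sys.path:
    candidate = os.path.join(entry, "nvidia", "cublas", "lib", "libcublas.so.11")
    print(candidate, "->", "found" if os.path.exists(candidate) else "missing")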

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 19

Most upvoted comments

For anyone coming here off search engines… I wiped my lock file and .venv, and the following worked for me (thanks to #2425!):

pdm config --local install.cache_method symlink_individual
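After removing pdm.lock and .venv and reinstalling, repeating the import from the reproduction steps is a quick way to confirm the fix (the torch.cuda.is_available() check is optional and only meaningful on a machine with a working CUDA setup):

python -c 'import torch; print(torch.__version__); print(torch.cuda.is_available())'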

It doesn’t work with the latest pdm or pytorch. Even if the underlying problem is with nvidia’s packaging, pytorch users would be happier if there were some way to work around it.

I ended up writing a script along these lines that copies the libraries directly into the cache:

cp -r /home/user/.cache/pdm/packages/nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64/lib/nvidia/nccl /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64/lib/nvidia/nvtx /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64/lib/nvidia/cufft /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/

...

It works, but users shouldn’t have to do this themselves.

For example, would it be possible to add a workaround where the nvidia libraries (explicitly listed somewhere, e.g. in pdm.toml) are installed directly instead of via the symlink cache_method?

By the way, in my environment, pdm config install.cache_method pth did not work.

I’m having a similar (probably even the same) problem, and I suspect the install.cache setting is the culprit here (I assume @yukw777 also has this set to true).

I discovered the following issue with the nvidia libraries (nvidia_cublas_cu11, nvidia_cuda_nvrtc_cu11, etc.):

With install.cache turned off, the directory structure is as follows:

nvidia
├── __init__.py
├── cublas
├── cuda_nvrtc
├── cuda_runtime
└── cudnn
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info

As soon as you activate install.cache, the directory structure changes:

nvidia -> /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info

The content of /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia is obviously only

__init__.py
cudnn

I hope this issue can be fixed somehow (I don’t know how standards-compliant it is for several packages to install into a common package folder), because the nvidia packages are the primary reason I activated install.cache in the first place.

The main cause is that nvidia is a regular package with a blank __init__.py, in which case PDM creates a single symlink for the whole directory. Maybe we can implement a different link strategy that forces PDM to create a symlink for each individual file.
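A minimal sketch of what such a per-file strategy could look like (the function and parameter names here are hypothetical, not PDM’s actual internals):

import os

def link_package_files(cached_pkg_root, site_packages):
    # Recreate the cached package's directory tree inside site-packages and
    # symlink each file individually, so several cached packages
    # (nvidia_cublas_cu11, nvidia_cudnn_cu11, ...) can contribute files to
    # the same top-level "nvidia" folder instead of fighting over one symlink.
    for dirpath, _dirnames, filenames in os.walk(cached_pkg_root):
        rel = os.path.relpath(dirpath, cached_pkg_root)
        target_dir = os.path.join(site_packages, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if not os.path.exists(dst):
                os.symlink(src, dst)

If I understand it correctly, this is essentially the idea behind the symlink_individual cache method mentioned in the earlier comment.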

I tested your suggestion but the problem still persists.

Looking at the PyTorch source code (https://github.com/pytorch/pytorch/blob/v1.13.1/torch/__init__.py#L144) reveals the underlying problem:

  • the code actually searches for the nvidia folder in all elements of sys.path
  • if it is found, it constructs paths to the required libraries (libcublas and libcudnn) and checks for their existence
  • that check is never true for both of the libraries at once, only ever for one
  • in the end the code tries to load the two libraries from the last constructed paths to libcublas and libcudnn, which obviously fails (see the simplified sketch below)
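A simplified paraphrase of that logic (not the actual PyTorch source, just the behavior described above):

import ctypes
import os
import sys

def preload_cuda_deps_sketch():
    cublas_path = cudnn_path = None
    for entry in sys.path:
        nvidia_dir = os.path.join(entry, "nvidia")
        if not os.path.isdir(nvidia_dir):
            continue
        # Build candidate paths relative to whichever nvidia folder was found.
        cublas_path = os.path.join(nvidia_dir, "cublas", "lib", "libcublas.so.11")
        cudnn_path = os.path.join(nvidia_dir, "cudnn", "lib", "libcudnn.so.8")
        if os.path.exists(cublas_path) and os.path.exists(cudnn_path):
            break
    # With the whole-directory symlink, "nvidia" only ever contains one of the
    # two subfolders, so at least one path is dangling and ctypes.CDLL raises
    # the OSError shown in the traceback above.
    ctypes.CDLL(cublas_path)
    ctypes.CDLL(cudnn_path)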

So the problem here is, I think,

  1. Nvidia distributing several packages that install into the same subfolder. This makes a single top-level symlink impossible; symlinking would have to treat the nvidia packages in a special way (create the parent folder, then create symlinks for the subfolders).
  2. PyTorch using knowledge about that directory structure directly to load the libraries. This also rules out install.cache_method pth. If PyTorch resolved the path to the libraries for each package while iterating sys.path, it could work…

From looking at the code, PyTorch 2.0.0 might actually work with PDM and install.cache_method pth, because the code that loads the CUDA libraries iterates over all elements of sys.path and looks for the nvidia subfolder and the library in each element individually.
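Roughly, that 2.0-style lookup checks each library against each sys.path entry on its own, which is why a .pth-based install could satisfy it (again a hedged sketch, not the actual source):

import ctypes
import glob
import os
import sys

def preload_one_lib_sketch(lib_folder, lib_pattern):
    # e.g. preload_one_lib_sketch("cublas", "libcublas.so.*")
    for entry in sys.path:
        matches = glob.glob(os.path.join(entry, "nvidia", lib_folder, "lib", lib_pattern))
        if matches:
            ctypes.CDLL(matches[0])
            return
    raise OSError(f"could not find {lib_pattern} on sys.path")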