alpa: [Installation] Troubleshooting `llvmlite` and `NCCL_ERROR_UNHANDLED_CUDA_ERROR`
Thanks @zhisbug and @merrymercy for helping and guiding me through the installation. Here are troubleshooting notes for each error message I ran into; I hope they are helpful to others.
Issue: cupy reports cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
Trigger:
(tensorflow2_p38) ubuntu@ip-10-0-0-171:~/alpa/benchmark/cupy$ python profile_communication.py
Error Message:
Traceback (most recent call last):
File "profile_communication.py", line 261, in <module>
ray.get([w.profile.remote() for w in workers])
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/worker.py", line 1843, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::GpuHost.profile() (pid=5622, ip=10.0.0.171, repr=<profile_communication.GpuHost object at 0x7f3c56d05ac0>)
File "profile_communication.py", line 199, in profile
self.profile_allreduce(1 << i, cp.float32, [list(range(self.world_size))])
File "profile_communication.py", line 80, in profile_allreduce
comm = self.init_communicator(groups)
File "profile_communication.py", line 73, in init_communicator
comm = cp.cuda.nccl.NcclCommunicator(
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
Solution:
python -m cupyx.tools.install_library --cuda 11.3 --library nccl
I may have broken the cupy installation when switching CUDA versions; the command above installs an NCCL build that matches CUDA 11.3.
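To confirm which CUDA runtime and NCCL build cupy actually sees, a small check like the one below may help. It is a minimal sketch, not part of alpa: the helper names `decode_cuda_version` and `report_cupy_versions` are made up for illustration, and it assumes the usual CUDA version encoding of `1000 * major + 10 * minor`.

```python
def decode_cuda_version(code: int) -> tuple:
    """Decode a CUDA version code, e.g. 11030 -> (11, 3)."""
    return code // 1000, (code % 1000) // 10


def report_cupy_versions() -> None:
    """Print the cupy, CUDA runtime and NCCL versions, if they can be loaded."""
    try:
        import cupy
        from cupy.cuda import nccl

        major, minor = decode_cuda_version(cupy.cuda.runtime.runtimeGetVersion())
        print(f"cupy {cupy.__version__}, CUDA runtime {major}.{minor}")
        print(f"NCCL version code: {nccl.get_version()}")
    except Exception as exc:  # missing package, missing NCCL, or no usable GPU
        print(f"could not query cupy/NCCL: {exc}")


if __name__ == "__main__":
    report_cupy_versions()
```

If the CUDA runtime printed here does not match the version passed to `--cuda` in the install command, that mismatch is a likely cause of the error.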
Issue: numba reports OSError: Could not load shared object file: libllvmlite.so
Error Message:
Traceback (most recent call last):
File "tests/test_install.py", line 13, in <module>
from alpa import (init, parallelize, grad, ShardParallel,
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/__init__.py", line 1, in <module>
from alpa.api import (init, shutdown, parallelize, grad, value_and_grad,
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 15, in <module>
from alpa.parallel_method import ParallelMethod, ShardParallel
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/parallel_method.py", line 26, in <module>
from alpa.pipeline_parallel.compile_executable import compile_pipeshard_executable
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 28, in <module>
from alpa.pipeline_parallel.stage_construction import (
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/pipeline_parallel/stage_construction.py", line 10, in <module>
import numba
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/numba/__init__.py", line 19, in <module>
from numba.core import config
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/numba/core/config.py", line 16, in <module>
import llvmlite.binding as ll
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/llvmlite/binding/__init__.py", line 4, in <module>
from .dylib import *
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/llvmlite/binding/dylib.py", line 3, in <module>
from llvmlite.binding import ffi
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/llvmlite/binding/ffi.py", line 191, in <module>
raise OSError("Could not load shared object file: {}".format(_lib_name))
Solution:
Fixed by installing the latest version of alpa
pip uninstall alpa
pip install pip --upgrade
pip install alpa --upgrade
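After upgrading, it can be useful to verify that the import chain from the traceback now loads. The following is a rough sketch; `check_import` is a hypothetical helper, and the module list simply mirrors the packages named in the traceback.

```python
import importlib


def check_import(name: str) -> tuple:
    """Try to import `name`; return (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(name)
    except (ImportError, OSError) as exc:
        # llvmlite raises OSError when libllvmlite.so cannot be loaded.
        return False, f"{type(exc).__name__}: {exc}"
    return True, getattr(module, "__version__", "unknown version")


if __name__ == "__main__":
    for mod in ("llvmlite", "numba", "alpa"):
        ok, detail = check_import(mod)
        print(f"{mod}: {'ok' if ok else 'FAILED'} ({detail})")
```

Checking `llvmlite` first narrows the problem down: if it fails on its own, the issue is in the llvmlite install rather than in numba or alpa.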
Trigger:
sudo apt install coinor-cbc
Error Message:
[E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)](https://askubuntu.com/questions/1109982/e-could-not-get-lock-var-lib-dpkg-lock-frontend-open-11-resource-temporari)
Solution: Fixed by removing the apt and dpkg lock files on Ubuntu (see the Ask Ubuntu question linked above):
sudo rm /var/lib/apt/lists/lock
sudo rm /var/cache/apt/archives/lock
sudo rm /var/lib/dpkg/lock*
sudo dpkg --configure -a
sudo apt update
And then reinstall the package.
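Removing lock files is only safe when no other apt or dpkg process is still running. As a hedged sketch, the check below tests whether another process currently holds a lock on a file. It assumes the lock is an fcntl-style advisory lock, and `lock_is_held` is a hypothetical helper; opening the real lock files for writing needs root.

```python
import fcntl
import os


def lock_is_held(path: str) -> bool:
    """Return True if another process holds an fcntl lock on `path`."""
    if not os.path.exists(path):
        return False
    with open(path, "r+") as fh:  # write access is required for an exclusive lock
        try:
            fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return True  # someone else holds the lock
        fcntl.lockf(fh, fcntl.LOCK_UN)  # release the lock we just took
        return False


if __name__ == "__main__":
    path = "/var/lib/dpkg/lock-frontend"
    try:
        print(path, "is held" if lock_is_held(path) else "is free")
    except PermissionError:
        print("run as root to check", path)
```

If the lock is reported as held, waiting for the other process to finish (for example an unattended upgrade) is usually preferable to deleting the file.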
- still wip
About this issue
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 15 (15 by maintainers)
In my case, I first set NCCL_DEBUG=WARN and found that some GPUs could not communicate with each other over P2P. Then I ran nvidia-smi topo -m to show the connection matrix between the GPUs and the CPUs. It showed that only SYS and PIX connections were available. I then set NCCL_P2P_LEVEL=PIX and NCCL_SHM_DISABLE=1, and that fixed it.
Sure, happy to work on that!
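The environment variables from that comment can also be set from Python, as long as this happens before the NCCL communicator is created. This is a minimal sketch; `nccl_workaround_env` is a hypothetical helper, and the defaults simply repeat the values reported above, which may not suit other GPU topologies.

```python
import os


def nccl_workaround_env(p2p_level: str = "PIX", disable_shm: bool = True,
                        debug: str = "WARN") -> dict:
    """Build the NCCL environment variables described in the comment above."""
    env = {"NCCL_DEBUG": debug, "NCCL_P2P_LEVEL": p2p_level}
    if disable_shm:
        env["NCCL_SHM_DISABLE"] = "1"
    return env


# Apply to the current process; child processes started afterwards inherit it.
# With Ray, the variables must also reach the worker processes, for example by
# exporting them in the shell before starting Ray (an assumption about the setup).
os.environ.update(nccl_workaround_env())
```

Keeping the values in one function makes it easy to try a different `NCCL_P2P_LEVEL` after inspecting the output of `nvidia-smi topo -m`.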