alpa: [Installation] Troubleshooting `llvmlite` and `NCCL_ERROR_UNHANDLED_CUDA_ERROR`

Thanks @zhisbug and @merrymercy for helping and guiding me through the installation. Here are troubleshooting notes for each error message; I hope they are helpful to others.

Issue: cupy reports `cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error`

Trigger:

(tensorflow2_p38) ubuntu@ip-10-0-0-171:~/alpa/benchmark/cupy$ python profile_communication.py

Error Message:

Traceback (most recent call last):
  File "profile_communication.py", line 261, in <module>
    ray.get([w.profile.remote() for w in workers])
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/worker.py", line 1843, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::GpuHost.profile() (pid=5622, ip=10.0.0.171, repr=<profile_communication.GpuHost object at 0x7f3c56d05ac0>)
  File "profile_communication.py", line 199, in profile
    self.profile_allreduce(1 << i, cp.float32, [list(range(self.world_size))])
  File "profile_communication.py", line 80, in profile_allreduce
    comm = self.init_communicator(groups)
  File "profile_communication.py", line 73, in init_communicator
    comm = cp.cuda.nccl.NcclCommunicator(
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error

Solution:

python -m cupyx.tools.install_library --cuda 11.3 --library nccl

I might have messed up the cupy version when switching CUDA versions.
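Since the likely root cause here is a cupy wheel built against a different CUDA version than the one installed, a quick sanity check is to compare the installed wheel name against the CUDA toolkit in use. This is a hypothetical helper (`expected_cupy_package` is my own name, not a cupy API), based on cupy's per-CUDA wheel naming convention (`cupy-cudaXYZ`):

```python
def expected_cupy_package(cuda_version: str) -> str:
    """Map a CUDA toolkit version such as '11.3' to the matching
    cupy wheel name (cupy publishes per-CUDA wheels like cupy-cuda113)."""
    major, minor = cuda_version.split(".")[:2]
    return f"cupy-cuda{major}{minor}"

# With CUDA 11.3 installed, the wheel to have is:
print(expected_cupy_package("11.3"))  # → cupy-cuda113
```

If `pip list` shows a different `cupy-cuda*` package than this returns, reinstall the matching wheel before rerunning `python -m cupyx.tools.install_library`.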

Error Message:

numba OSError: Could not load shared object file: libllvmlite.so

or the related numba import failure:

Traceback (most recent call last):
  File "tests/test_install.py", line 13, in <module>
    from alpa import (init, parallelize, grad, ShardParallel,
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/__init__.py", line 1, in <module>
    from alpa.api import (init, shutdown, parallelize, grad, value_and_grad,
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 15, in <module>
    from alpa.parallel_method import ParallelMethod, ShardParallel
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/parallel_method.py", line 26, in <module>
    from alpa.pipeline_parallel.compile_executable import compile_pipeshard_executable
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 28, in <module>
    from alpa.pipeline_parallel.stage_construction import (
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/pipeline_parallel/stage_construction.py", line 10, in <module>
    import numba
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/numba/__init__.py", line 19, in <module>
    from numba.core import config
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/numba/core/config.py", line 16, in <module>
    import llvmlite.binding as ll
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/llvmlite/binding/__init__.py", line 4, in <module>
    from .dylib import *
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/llvmlite/binding/dylib.py", line 3, in <module>
    from llvmlite.binding import ffi
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/llvmlite/binding/ffi.py", line 191, in <module>
    raise OSError("Could not load shared object file: {}".format(_lib_name))

Solution: fixed by installing the latest version of alpa:

pip uninstall alpa
pip install pip --upgrade
pip install alpa --upgrade
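Before reinstalling, you can check directly whether llvmlite's native library is present and loadable. A minimal diagnostic sketch (an assumption of mine, not an official llvmlite tool; assumes the usual Linux/macOS layout where the shared object lives under `llvmlite/binding/`):

```python
import ctypes
import importlib.util
from pathlib import Path

def check_llvmlite() -> str:
    """Report whether llvmlite's shared library can be found and loaded."""
    spec = importlib.util.find_spec("llvmlite")
    if spec is None or spec.origin is None:
        return "llvmlite is not installed"
    binding_dir = Path(spec.origin).parent / "binding"
    # .so on Linux, .dylib on macOS
    libs = sorted(binding_dir.glob("libllvmlite*"))
    if not libs:
        return "llvmlite is installed but libllvmlite is missing"
    try:
        ctypes.CDLL(str(libs[0]))
    except OSError as exc:
        return f"libllvmlite exists but failed to load: {exc}"
    return f"ok: {libs[0]}"

print(check_llvmlite())
```

If this reports a missing or unloadable library, reinstalling llvmlite/numba (or upgrading alpa as above) should replace the broken wheel.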

Trigger:

sudo apt install coinor-cbc

Error Message:

[E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)](https://askubuntu.com/questions/1109982/e-could-not-get-lock-var-lib-dpkg-lock-frontend-open-11-resource-temporari)

Solution: fixed by removing the apt/dpkg lock files on Ubuntu (see the AskUbuntu link above):

sudo rm /var/lib/apt/lists/lock
sudo rm /var/cache/apt/archives/lock
sudo rm /var/lib/dpkg/lock*

sudo dpkg --configure -a
sudo apt update

And then reinstall the package.
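For context, the "Resource temporarily unavailable" in apt's message is `EAGAIN` from a failed non-blocking file lock: another apt/dpkg process holds an exclusive lock on the same file. The mechanism can be reproduced with a throwaway file:

```python
import fcntl
import os
import tempfile

# Simulate apt holding the lock while a second invocation tries to take it.
fd, lock_path = tempfile.mkstemp()
os.close(fd)

holder = open(lock_path, "w")
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)  # "apt" acquires the lock

contender = open(lock_path, "w")
lock_errno = None
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)  # second "apt" fails
except BlockingIOError as exc:
    lock_errno = exc.errno  # EAGAIN, i.e. "Resource temporarily unavailable"
    print(f"Could not get lock: errno {lock_errno}")

holder.close()
contender.close()
os.unlink(lock_path)
```

Deleting the stale lock files as above is only safe when no apt/dpkg process is actually running; check first (e.g. `ps aux | grep -E 'apt|dpkg'`) before removing them.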

  • still wip

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 15 (15 by maintainers)

Most upvoted comments


In my case, I first set NCCL_DEBUG=WARN and found that some GPUs could not communicate with each other over P2P. Then I used nvidia-smi topo -m to show the connection matrix between the GPUs and CPUs; it showed that only SYS and PIX were supported. Setting NCCL_P2P_LEVEL=PIX and NCCL_SHM_DISABLE=1 fixed it.
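These environment variables have to be in place before the first NCCL communicator is created (e.g. before the Ray workers launch). A minimal sketch of setting them from Python, mirroring the fix above:

```python
import os

# Set these before any NCCL communicator is initialized.
os.environ["NCCL_DEBUG"] = "WARN"       # surface NCCL warnings in the log
os.environ["NCCL_P2P_LEVEL"] = "PIX"    # allow P2P only across a single PCIe bridge
os.environ["NCCL_SHM_DISABLE"] = "1"    # disable the shared-memory transport

print(os.environ["NCCL_P2P_LEVEL"])
```

Exporting the same variables in the shell before launching the script works equally well.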

Sure, happy to work on that!