alpa: [BUG] Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices.
Hi, I'm not sure whether it's appropriate to file this as a bug, but it has been bothering me for a long time and I have no way to fix it.
I'm running on a cluster. Ray can see my GPU, but Alpa cannot. I followed the installation documentation (Install Alpa) and confirmed that I used --enable_cuda when compiling jax-alpa. When I run tests/test_install.py, errors are reported; see the error log attached below for details.
System information and environment
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker):
- Python version: 3.7
- CUDA version: 11.2
- NCCL version: 2.9.8
- cupy version: 112
- GPU model and memory: A100 80G
- Alpa version: 0.0.0 (both `conda list` and `pip list` show the Alpa version as 0.0.0)
- TensorFlow version:
- JAX version: 0.3.5
To Reproduce
I ran:
RAY_ADDRESS="10.140.1.112:6379" XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/cuda-11.2" srun -p caif_dev --gres=gpu:1 -w SH-IDC1-10-140-1-112 -n1 bash test_install.sh
and my test_install.sh is:
echo "ray being starting"
ray start --head --node-ip-address 10.140.0.112 --address='10.140.0.112:6379'
echo "succeed==========="
ray status
echo "now running python script"
XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/cuda-11.2" python /mnt/cache/zhangyuchang/750/alpa/tests/test_install.py
ray stop
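For reference, a minimal sanity-check sketch (not part of test_install.py) that confirms whether plain JAX can see the GPU inside the same srun allocation, independent of Alpa and Ray:

```python
# Minimal sanity-check sketch (not part of test_install.py): confirm that plain
# JAX can see the GPU inside the srun allocation, before involving Alpa or Ray.
import os

import jax

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("default backend:", jax.default_backend())  # expect "gpu" on a working setup
print("devices:", jax.devices())                  # expect one GPU device for the A100
```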
Log
phoenix-srun: Job 126515 scheduled successfully!
Current QUOTA_TYPE is [reserved], which means the job has occupied quota in RESERVED_TOTAL under your partition.
Current PHX_PRIORITY is normal
ray being starting
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
2022-05-10 19:50:46,487 ERROR services.py:1474 -- Failed to start the dashboard: Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
2022-05-10 19:50:46,487 ERROR services.py:1475 -- Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
Traceback (most recent call last):
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/_private/services.py", line 1451, in start_dashboard
raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
2022-05-10 19:50:20,155 WARN scripts.py:652 -- Specifying --address for external Redis address is deprecated. Please specify environment variable RAY_REDIS_ADDRESS=10.140.1.112:6379 instead.
2022-05-10 19:50:20,155 INFO scripts.py:659 -- Will use `10.140.1.112:6379` as external Redis server address(es). If the primary one is not reachable, we starts new one(s) with `--port` in local.
2022-05-10 19:50:20,155 INFO scripts.py:681 -- The primary external redis server `10.140.1.112:6379` is not reachable. Will starts new one(s) with `--port` in local.
2022-05-10 19:50:20,190 INFO scripts.py:697 -- Local node IP: 10.140.1.112
2022-05-10 19:50:47,423 SUCC scripts.py:739 -- --------------------
2022-05-10 19:50:47,423 SUCC scripts.py:740 -- Ray runtime started.
2022-05-10 19:50:47,423 SUCC scripts.py:741 -- --------------------
2022-05-10 19:50:47,423 INFO scripts.py:743 -- Next steps
2022-05-10 19:50:47,423 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2022-05-10 19:50:47,423 INFO scripts.py:749 -- ray start --address='10.140.1.112:6379'
2022-05-10 19:50:47,423 INFO scripts.py:752 -- Alternatively, use the following Python code:
2022-05-10 19:50:47,423 INFO scripts.py:754 -- import ray
2022-05-10 19:50:47,424 INFO scripts.py:767 -- ray.init(address='auto', _node_ip_address='10.140.1.112')
2022-05-10 19:50:47,424 INFO scripts.py:771 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-05-10 19:50:47,424 INFO scripts.py:775 -- connect to a remote cluster from your laptop directly, use the following
2022-05-10 19:50:47,424 INFO scripts.py:778 -- Python code:
2022-05-10 19:50:47,424 INFO scripts.py:780 -- import ray
2022-05-10 19:50:47,424 INFO scripts.py:786 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-05-10 19:50:47,424 INFO scripts.py:792 -- If connection fails, check your firewall settings and network configuration.
2022-05-10 19:50:47,424 INFO scripts.py:798 -- To terminate the Ray runtime, run
2022-05-10 19:50:47,424 INFO scripts.py:799 -- ray stop
succeed===========
Node status
---------------------------------------------------------------
Healthy:
1 node_42a384b6d502cd18b6d052e98420df74116e83b73ef8146e44596910
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/128.0 CPU
0.0/1.0 GPU
0.00/804.774 GiB memory
0.00/186.265 GiB object_store_memory
Demands:
(no resource demands)
now running python script
2022-05-10 19:53:29.234043: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: UNKNOWN ERROR (34)
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
EE
======================================================================
ERROR: test_1_shard_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 128, in <module>
runner.run(suite())
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/runner.py", line 176, in run
test(result)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/suite.py", line 122, in run
test(result)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/case.py", line 676, in __call__
return self.run(*args, **kwds)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/case.py", line 628, in run
testMethod()
File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 80, in test_1_shard_parallel
actual_state = parallel_train_step(state, batch)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 108, in ret_func
global_config.memory_budget_per_device, *abstract_args)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/linear_util.py", line 272, in memoized_fun
ans = call(fun, *args)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 176, in parallelize_callable
memory_budget_per_device, *avals)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 127, in shard_parallel_callable
*avals)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 235, in shard_parallel_internal_gradient_accumulation
backend = xb.get_backend("gpu")
File "/mnt/cache/zhangyuchang/750/alpa/alpa/monkey_patch.py", line 41, in override_get_backend
return default_get_backend(*args, **kwargs)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/lib/xla_bridge.py", line 314, in get_backend
return _get_backend_uncached(platform)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/lib/xla_bridge.py", line 304, in _get_backend_uncached
raise RuntimeError(f"Backend '{platform}' failed to initialize: "
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices.
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 80, in test_1_shard_parallel
actual_state = parallel_train_step(state, batch)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 108, in ret_func
global_config.memory_budget_per_device, *abstract_args)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 176, in parallelize_callable
memory_budget_per_device, *avals)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 127, in shard_parallel_callable
*avals)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 235, in shard_parallel_internal_gradient_accumulation
backend = xb.get_backend("gpu")
File "/mnt/cache/zhangyuchang/750/alpa/alpa/monkey_patch.py", line 41, in override_get_backend
return default_get_backend(*args, **kwargs)
RuntimeError: Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices.
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 86, in test_2_pipeline_parallel
ray.init(address="auto")
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/worker.py", line 1072, in init
connect_only=True,
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/node.py", line 177, in __init__
self.validate_ip_port(self.address)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/node.py", line 332, in validate_ip_port
_ = int(port)
ValueError: invalid literal for int() with base 10: '10.140.1.112'
----------------------------------------------------------------------
Ran 2 tests in 3.069s
FAILED (errors=2)
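For the second failure, `ray.init` ends up validating an `ip:port` pair, and the `int(port)` call above fails because the resolved address apparently carries no usable port. A hedged workaround sketch (untested) is to skip the `address="auto"` resolution and pass the head address from the log explicitly:

```python
# Hypothetical workaround sketch (untested): connect to the already-running head
# node explicitly instead of relying on address="auto" resolution.
import ray

ray.init(
    address="10.140.1.112:6379",      # head node ip:port taken from the log above
    _node_ip_address="10.140.1.112",  # as suggested in Ray's own startup hint
)
```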
My environment variables are:
CC=/mnt/cache/share/gcc/gcc-7.5.0/bin/gcc-7.5.0/bin/gcc
CONDA_DEFAULT_ENV=alpa_ray_7.5.0
CONDA_EXE=/mnt/cache/share/platform/env/miniconda3.7/bin/conda
CONDA_PREFIX=/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0
CONDA_PROMPT_MODIFIER='(alpa_ray_7.5.0) '
CONDA_PYTHON_EXE=/mnt/cache/share/platform/env/miniconda3.7/bin/python
CONDA_SHLVL=1
CPATH=/mnt/cache/share/cuda-11.2/targets/x86_64-linux/include/:
CUDACXX=/mnt/cache/share/cuda-11.2/bin/nvcc
CUDA_HOME=/mnt/cache/share/cuda-11.2
CUDA_PATH=/mnt/cache/share/cuda-11.2
CUDA_TOOLKIT_ROOT_DIR=/mnt/cache/share/cuda-11.2
CXX=/mnt/cache/share/gcc/gcc-7.5.0/bin/g++
HISTCONTROL=ignoredups
HISTSIZE=50000
HISTTIMEFORMAT='%F %T zhangyuchang '
HOME=/mnt/lustre/zhangyuchang
HOSTNAME=SH-IDC1-10-140-0-32
LANG=en_US.UTF-8
LD_LIBRARY_PATH=/mnt/cache/share/cuda-11.2/lib64:/mnt/cache/share/cuda-11.2/extras/CUPTI/lib64:/mnt/cache/share/gcc/gcc-7.5.0/lib:/mnt/cache/share/gcc/gcc-7.5.0/lib64:/mnt/cache/share/gcc/gcc-7.5.0/include:/mnt/cache/share/gcc/gcc-7.5.0/bin:/mnt/cache/share/gcc/gmp-4.3.2/lib/:/mnt/cache/share/gcc/mpfr-2.4.2/lib/:/mnt/cache/share/gcc/mpc-0.8.1/lib/:/mnt/cache/share/cuda-11.2/targets/x86_64-linux/lib:/mnt/lustre/zhangyuchang/bin:/mnt/cache/share/platform/dep/nccl-2.9.8-cuda11.0/lib/:/mnt/cache/share/platform/dep/binutils-2.27/lib:/mnt/cache/share/platform/dep/openmpi-4.0.5-cuda11.0/lib:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/lib64:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/extras/CUPTI/lib64/:/mnt/cache/share/platform/env/miniconda3.6/lib:/mnt/cache/share/platform/dep/nccl-2.9.8-cuda11.0/lib/:/mnt/cache/share/platform/dep/binutils-2.27/lib:/mnt/cache/share/platform/dep/openmpi-4.0.5-cuda11.0/lib:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/lib64:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/extras/CUPTI/lib64/:/mnt/cache/share/platform/env/miniconda3.6/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64/
LESS=-R
LESSOPEN='||/usr/bin/lesspipe.sh %s'
LIBRARY_PATH=/mnt/cache/share/cuda-11.2/lib64:
LOADEDMODULES=''
LOGNAME=zhangyuchang
LSCOLORS=Gxfxcxdxbxegedabagacad
LS_COLORS='rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:'
MAIL=/var/spool/mail/zhangyuchang
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
MODULESHOME=/usr/share/Modules
NCCL_INSTALL_PATH=/mnt/cache/share/platform/dep/nccl-2.9.8-cuda11.0
NCCL_SOCKET_IFNAME=eth0
PAGER=less
PATH=/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/bin:/mnt/cache/share/cuda-11.2/bin:/mnt/cache/share/gcc/gcc-7.5.0/bin/:/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray/bin:/mnt/cache/share/platform/env/miniconda3.7/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/mnt/lustre/zhangyuchang/bin:/mnt/lustre/zhangyuchang/bin
PWD=/mnt/cache/zhangyuchang/750/alpa
QT_GRAPHICSSYSTEM_CHECKED=1
SHELL=/usr/local/bash/bin/bash
SHLVL=3
SSH_CLIENT='10.201.32.68 45569 22'
SSH_CONNECTION='10.201.36.3 52001 10.140.0.32 22'
SSH_TTY=/dev/pts/267
TERM=screen
TF_PATH=/mnt/cache/zhangyuchang/750/tensorflow-alpa
TMUX=/tmp/tmux-200000422/default,194688,3
TMUX_PANE=%3
USER=zhangyuchang
XDG_RUNTIME_DIR=/run/user/200000422
XDG_SESSION_ID=680243
ZSH=/mnt/lustre/zhangyuchang/.oh-my-zsh
_=export
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (6 by maintainers)
I found the solution. The error was caused by the proxy of the cluster, so Slurm users should check their own proxy settings before using Alpa on a Slurm cluster (a rough sketch of that check is below). Maybe you can remind Slurm cluster users about this in your docs. I'm finally going to start using Alpa; other problems may come up, and I'll open another issue if they do. So this issue can be closed. Thank you all. @zhisbug @zhuohan123 @TarzanZhao
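For anyone who hits the same symptom, the check I mean is roughly the following; it is only a sketch and assumes the conventional proxy environment variables, nothing Alpa-specific:

```python
# Rough pre-flight sketch, assuming the conventional proxy environment variables
# (nothing Alpa-specific): surface any proxy settings before launching Ray/Alpa
# inside the Slurm job, since a cluster-wide proxy was the culprit here.
import os

PROXY_VARS = ("http_proxy", "https_proxy", "all_proxy",
              "HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY")

for var in PROXY_VARS:
    value = os.environ.get(var)
    if value:
        print(f"{var}={value!r}  -> consider unsetting it for local Ray/Alpa traffic")
        # os.environ.pop(var)  # or `unset http_proxy https_proxy ...` in the job script
```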
I searched my personal notes and have not met this error before.