alpa: timeout error (Deadline Exceeded) when running python -m alpa.test_install

Please describe the bug

I followed the installation instructions in the tutorial, and everything seemed to work. However, when I followed the Check Installation section and ran python -m alpa.test_install, I hit a timeout error (Deadline Exceeded). Can anyone suggest what might be wrong?

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Ubuntu 18.04
  • Python version: 3.7.13
  • CUDA version: 11.1
  • NCCL version: 2.10.3
  • cupy version: 11.1.0
  • GPU model and memory: 8×Tesla P100-PCIE-16GB
  • Alpa version: 0.2.0
  • TensorFlow version: 2.10.0
  • JAX version: 0.3.15
  • JAXlib version: 0.3.15+cuda111.cudnn805

To Reproduce

Steps to reproduce the behavior:

  1. python -m alpa.test_install

Error Info

(alpa) root@28c67ac89ed8:/home/gehao/Alpa/alpa# python -m alpa.test_install.py
.2022-09-13 09:33:09,527        INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2022-09-13 09:33:09,539 INFO worker.py:1518 -- Connected to Ray cluster.
E
======================================================================
ERROR: test_2_pipeline_parallel (alpa.test_install.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/runtime/test_install.py", line 7, in <module>
    runner.run(suite())
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/case.py", line 676, in __call__
    return self.run(*args, **kwds)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
    actual_output = p_train_step(state, batch)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/api.py", line 118, in __call__
    self._decode_args_and_get_executable(*args))
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/api.py", line 191, in _decode_args_and_get_executable
    self.method, *abstract_args)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/jax/linear_util.py", line 295, in memoized_fun
    ans = call(fun, *args)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/api.py", line 217, in _compile_parallel_executable
    batch_invars, *avals)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/parallel_method.py", line 236, in compile_executable
    self.stage_input_shardings, *avals)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/compile_executable.py", line 96, in compile_pipeshard_executable
    None, stage_input_shardings)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/compile_executable.py", line 268, in compile_pipeshard_executable_internal
    allreduce_groups=allreduce_groups).compile()
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 400, in compile
    self._compile_resharding_tasks()
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 340, in _compile_resharding_tasks
    dst_mesh)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 195, in __init__
    self._compile()
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 220, in _compile
    self.put_all_tasks()
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 240, in put_all_tasks
    ray.get(task_dones)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/ray/_private/worker.py", line 2277, in get
    raise value
jax._src.traceback_util.UnfilteredStackTrace: ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=136957, ip=172.17.0.3, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f69ee88ee10>)
  File "/home/gehao/Alpa/alpa/alpa/device_mesh.py", line 126, in __init__
    self.distributed_client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: DEADLINE_EXCEEDED: Deadline Exceeded

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
    actual_output = p_train_step(state, batch)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/compile_executable.py", line 96, in compile_pipeshard_executable
    None, stage_input_shardings)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/compile_executable.py", line 268, in compile_pipeshard_executable_internal
    allreduce_groups=allreduce_groups).compile()
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 400, in compile
    self._compile_resharding_tasks()
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 340, in _compile_resharding_tasks
    dst_mesh)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 195, in __init__
    self._compile()
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 220, in _compile
    self.put_all_tasks()
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 240, in put_all_tasks
    ray.get(task_dones)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=136957, ip=172.17.0.3, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f69ee88ee10>)
  File "/home/gehao/Alpa/alpa/alpa/device_mesh.py", line 126, in __init__
    self.distributed_client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: DEADLINE_EXCEEDED: Deadline Exceeded

----------------------------------------------------------------------
Ran 2 tests in 141.608s

FAILED (errors=1)
(MeshHostWorker pid=136957) 2022-09-13 09:35:18.893979: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:215] Connect() failed after 1 retries in 0; most recent failure status: DEADLINE_EXCEEDED: Deadline Exceeded
(MeshHostWorker pid=136957) 2022-09-13 09:35:18,896     ERROR worker.py:756 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=136957, ip=172.17.0.3, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f69ee88ee10>)
(MeshHostWorker pid=136957)   File "/home/gehao/Alpa/alpa/alpa/device_mesh.py", line 126, in __init__
(MeshHostWorker pid=136957)     self.distributed_client.connect()
(MeshHostWorker pid=136957) jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: DEADLINE_EXCEEDED: Deadline Exceeded
(MeshHostWorker pid=136956) 2022-09-13 09:35:18.950654: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:215] Connect() failed after 1 retries in 0; most recent failure status: DEADLINE_EXCEEDED: Deadline Exceeded
(MeshHostWorker pid=136956) 2022-09-13 09:35:18,952     ERROR worker.py:756 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=136956, ip=172.17.0.3, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f27f9795b50>)
(MeshHostWorker pid=136956)   File "/home/gehao/Alpa/alpa/alpa/device_mesh.py", line 126, in __init__
(MeshHostWorker pid=136956)     self.distributed_client.connect()
(MeshHostWorker pid=136956) jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: DEADLINE_EXCEEDED: Deadline Exceeded
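The failing call in the log above is self.distributed_client.connect(), which gives up with DEADLINE_EXCEEDED. One way to narrow this down is to check whether a plain TCP connection to the target address succeeds from inside the same container. The following is only a minimal sketch under stated assumptions: can_connect is a hypothetical helper, not part of Alpa, Ray, or JAX, and the log does not show which port the distributed client tried to reach, so the example probes the Ray head address printed earlier (172.17.0.3:6379) as a stand-in. Substitute the real host and port if you know them.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical diagnostic helper; it only tests TCP reachability and says
    nothing about whether the service behind the port is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network, DNS failure, and timeout.
        return False


if __name__ == "__main__":
    # Assumption: 172.17.0.3:6379 is the Ray head address from the log above,
    # not necessarily the address the XLA distributed client connects to.
    print("reachable:", can_connect("172.17.0.3", 6379))
```

If this prints False for an address that should be listening, the problem is more likely in the container's network setup (for example a proxy or firewall rule) than in Alpa itself. A True result does not rule out other causes.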

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

@Sakura-gh what’s your ray version?
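To answer the version question, the installed versions can be printed from inside the alpa conda environment. This is a small sketch, assuming each package exposes a __version__ attribute; module_version is a hypothetical helper, and any package that fails to import is reported as "not installed" instead of stopping the script.

```python
import importlib


def module_version(name: str) -> str:
    """Return the module's __version__, or a placeholder if it is missing."""
    try:
        module = importlib.import_module(name)
    except Exception:
        # Broad on purpose: a broken install can raise errors other than ImportError.
        return "not installed"
    return str(getattr(module, "__version__", "unknown"))


# Package names taken from the system information listed in the report, plus ray.
for name in ("ray", "alpa", "jax", "jaxlib", "cupy"):
    print(f"{name}: {module_version(name)}")
```

Pasting the output into the thread gives maintainers the Ray version alongside the versions already listed in the report.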