alpa: timeout error (Deadline Exceeded) when running python -m alpa.test_install
Please describe the bug
I followed the install instructions in the tutorial and everything seemed fine. But when I followed the Check Installation section and ran python -m alpa.test_install,
I hit a timeout error (Deadline Exceeded). Can anyone give some suggestions?
System information and environment
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Ubuntu 18.04
- Python version: 3.7.13
- CUDA version: 11.1
- NCCL version: 2.10.3
- cupy version: 11.1.0
- GPU model and memory: 8×Tesla P100-PCIE-16GB
- Alpa version: 0.2.0
- TensorFlow version: 2.10.0
- JAX version: 0.3.15
- JAXlib version: 0.3.15+cuda111.cudnn805
To Reproduce
Steps to reproduce the behavior:
- python -m alpa.test_install
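For reference, roughly the same code path can be exercised with a small manual script. This is a minimal sketch, assuming Alpa 0.2.0's documented alpa.init / parallelize API and a Ray head node already running on this machine (e.g. started with ray start --head); the function and shapes are made up for illustration only.

import alpa
import jax.numpy as jnp
from alpa import parallelize

# Attach Alpa to the existing Ray cluster, the same path test_install takes.
alpa.init(cluster="ray")

# Any parallelized computation forces compilation on the remote
# MeshHostWorker actors, which is where the timeout reported below occurs.
@parallelize
def add_one(x):
    return x + 1

print(add_one(jnp.ones((1024, 1024))))

alpa.shutdown()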
Error Info
(alpa) root@28c67ac89ed8:/home/gehao/Alpa/alpa# python -m alpa.test_install.py
.2022-09-13 09:33:09,527 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2022-09-13 09:33:09,539 INFO worker.py:1518 -- Connected to Ray cluster.
E
======================================================================
ERROR: test_2_pipeline_parallel (alpa.test_install.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/runtime/test_install.py", line 7, in <module>
runner.run(suite())
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/runner.py", line 176, in run
test(result)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/suite.py", line 122, in run
test(result)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/case.py", line 676, in __call__
return self.run(*args, **kwds)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/unittest/case.py", line 628, in run
testMethod()
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
actual_output = p_train_step(state, batch)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/api.py", line 118, in __call__
self._decode_args_and_get_executable(*args))
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/api.py", line 191, in _decode_args_and_get_executable
self.method, *abstract_args)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/jax/linear_util.py", line 295, in memoized_fun
ans = call(fun, *args)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/api.py", line 217, in _compile_parallel_executable
batch_invars, *avals)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/parallel_method.py", line 236, in compile_executable
self.stage_input_shardings, *avals)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/compile_executable.py", line 96, in compile_pipeshard_executable
None, stage_input_shardings)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/compile_executable.py", line 268, in compile_pipeshard_executable_internal
allreduce_groups=allreduce_groups).compile()
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 400, in compile
self._compile_resharding_tasks()
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 340, in _compile_resharding_tasks
dst_mesh)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 195, in __init__
self._compile()
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 220, in _compile
self.put_all_tasks()
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 240, in put_all_tasks
ray.get(task_dones)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/ray/_private/worker.py", line 2277, in get
raise value
jax._src.traceback_util.UnfilteredStackTrace: ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=136957, ip=172.17.0.3, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f69ee88ee10>)
File "/home/gehao/Alpa/alpa/alpa/device_mesh.py", line 126, in __init__
self.distributed_client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: DEADLINE_EXCEEDED: Deadline Exceeded
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
actual_output = p_train_step(state, batch)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/compile_executable.py", line 96, in compile_pipeshard_executable
None, stage_input_shardings)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/compile_executable.py", line 268, in compile_pipeshard_executable_internal
allreduce_groups=allreduce_groups).compile()
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 400, in compile
self._compile_resharding_tasks()
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 340, in _compile_resharding_tasks
dst_mesh)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 195, in __init__
self._compile()
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 220, in _compile
self.put_all_tasks()
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 240, in put_all_tasks
ray.get(task_dones)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/gehao/anaconda3/envs/alpa/lib/python3.7/site-packages/ray/_private/worker.py", line 2277, in get
raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=136957, ip=172.17.0.3, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f69ee88ee10>)
File "/home/gehao/Alpa/alpa/alpa/device_mesh.py", line 126, in __init__
self.distributed_client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: DEADLINE_EXCEEDED: Deadline Exceeded
----------------------------------------------------------------------
Ran 2 tests in 141.608s
FAILED (errors=1)
(MeshHostWorker pid=136957) 2022-09-13 09:35:18.893979: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:215] Connect() failed after 1 retries in 0; most recent failure status: DEADLINE_EXCEEDED: Deadline Exceeded
(MeshHostWorker pid=136957) 2022-09-13 09:35:18,896 ERROR worker.py:756 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=136957, ip=172.17.0.3, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f69ee88ee10>)
(MeshHostWorker pid=136957) File "/home/gehao/Alpa/alpa/alpa/device_mesh.py", line 126, in __init__
(MeshHostWorker pid=136957) self.distributed_client.connect()
(MeshHostWorker pid=136957) jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: DEADLINE_EXCEEDED: Deadline Exceeded
(MeshHostWorker pid=136956) 2022-09-13 09:35:18.950654: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:215] Connect() failed after 1 retries in 0; most recent failure status: DEADLINE_EXCEEDED: Deadline Exceeded
(MeshHostWorker pid=136956) 2022-09-13 09:35:18,952 ERROR worker.py:756 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=136956, ip=172.17.0.3, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f27f9795b50>)
(MeshHostWorker pid=136956) File "/home/gehao/Alpa/alpa/alpa/device_mesh.py", line 126, in __init__
(MeshHostWorker pid=136956) self.distributed_client.connect()
(MeshHostWorker pid=136956) jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: DEADLINE_EXCEEDED: Deadline Exceeded
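Before digging further, it may be worth confirming that the Ray cluster the test attaches to is healthy and that all GPUs are visible from the driver. A minimal check using standard Ray calls (the cluster address is the one reported in the log above):

import ray

ray.init(address="auto")           # attach to the running cluster at 172.17.0.3:6379
print(ray.cluster_resources())     # should report 8 GPUs for this machine
print(ray.nodes())                 # each node entry should show Alive: True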
About this issue
- State: closed
- Created 2 years ago
- Comments: 19 (6 by maintainers)
@Sakura-gh what’s your ray version?
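For reference, the installed versions can be printed from the same environment using the standard version attributes these packages expose:

import ray, jax, jaxlib
print("ray:", ray.__version__)
print("jax:", jax.__version__, "jaxlib:", jaxlib.__version__)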