ray: [Dask on Ray] Errors running Dask matmul
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): ray: 2.0.0.dev0 dask==2021.1.1
When object spilling is enabled on a Ray cluster (by passing `--system-config='{"automatic_object_spilling_enabled":true,"max_io_workers":2,"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}'` to the head node configuration), the program aborts with a bus error.
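For reference, the same settings can be applied programmatically when starting a single-node cluster; a minimal sketch, assuming the `_system_config` keyword that `ray.init()` accepts in this Ray version (values copied from the flag above, not from the original report):

import json
import ray

# Programmatic equivalent of the --system-config flag above (single-node only;
# on a multi-node cluster these settings go to `ray start --head`).
ray.init(
    _system_config={
        "automatic_object_spilling_enabled": True,
        "max_io_workers": 2,
        # object_spilling_config is itself a JSON-encoded string.
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}
        ),
    }
)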
Error output:
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 3.238.219.129
Shared connection to 3.238.219.129 closed.
Shared connection to 3.238.219.129 closed.
Fetched IP: 3.238.219.129
Shared connection to 3.238.219.129 closed.
2021-02-15 14:26:48,908 INFO worker.py:655 -- Connecting to existing Ray cluster at address: 172.31.7.112:6379
(autoscaler +26s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +26s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +32s) Adding 3 nodes of type ray-legacy-worker-node-type.
(autoscaler +57s) Resized to 12 CPUs.
(autoscaler +57s) Adding 1 nodes of type ray-legacy-worker-node-type.
(autoscaler +1m4s) Resized to 24 CPUs.
(autoscaler +1m4s) Adding 3 nodes of type ray-legacy-worker-node-type.
(autoscaler +1m10s) Resized to 28 CPUs.
(autoscaler +1m29s) Resized to 32 CPUs.
(autoscaler +1m35s) Resized to 36 CPUs.
(autoscaler +1m41s) Resized to 44 CPUs.
Bus error (core dumped)
Shared connection to 3.238.219.129 closed.
Error: Command failed:
ssh -tt -i /home/zhitingz/.ssh/ray-autoscaler_us-east-1.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4dcb4daf68/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.238.219.129 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/gemm.py --download 25000 25000 uint16)'"'"'"'"'"'"'"'"''"'"' )'
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
from timeit import default_timer as dtimer

import ray
from ray.util.dask import ray_dask_get
import dask
import dask.array as da
import numpy as np


def main():
    # Connect to the existing cluster and route Dask tasks through Ray.
    _ = ray.init(address="auto")
    dask.config.set(scheduler=ray_dask_get)

    chunks = (1000, 1000)
    x = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    y = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)

    start = dtimer()
    z = da.matmul(x, y)
    # print(z)
    # z.visualize(filename=f'gemm_{rows}x{cols}_{tp}_{chunks}.svg')
    z_val = z.compute()
    end = dtimer()

    print(z_val)
    print(f"time: {end - start} s")


if __name__ == "__main__":
    main()
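For context on why this workload stresses the object store: each input is a 25000 × 25000 uint16 array (≈ 1.25 GB), and Dask's blocked matmul materializes one partial product per (i, j, k) chunk triple before reducing over k. A rough back-of-envelope sketch (my own arithmetic, not from the report):

import numpy as np

n = 25_000
chunk = 1_000
itemsize = np.dtype(np.uint16).itemsize  # 2 bytes

input_gb = n * n * itemsize / 1e9              # ~1.25 GB per input array
blocks = n // chunk                            # 25 chunks per axis
partials = blocks ** 3                         # 15,625 partial-product chunks
per_chunk_mb = chunk * chunk * itemsize / 1e6  # 2 MB per chunk

# Upper bound if every partial product were resident at once (~31 GB);
# the scheduler reduces over k as it goes, but with 44 CPUs many partials
# are live concurrently, which is what pushes objects into spilling.
print(f"inputs: {input_gb:.2f} GB each, partial products: "
      f"{partials} x {per_chunk_mb:.0f} MB = {partials * per_chunk_mb / 1e3:.1f} GB")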
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 25 (13 by maintainers)
I changed the chunk size to 5000 to make the issue easier to reproduce:
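(Presumably the only change to the repro script above; noted here for clarity. With 5000-square chunks each block is 5000 × 5000 × 2 bytes = 50 MB instead of 2 MB, so individual objects reach the spilling path much sooner.)

chunks = (5000, 5000)  # 50 MB per chunk instead of 2 MB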
The raylet.err outputs this:
For shm, I ran `cat` on the head node:
Is there a command to execute a command on all nodes?