ray: [Dask on Ray] bus error running Dask matmul with object spilling enabled

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): ray 2.0.0.dev0, dask==2021.1.1

When object spilling is enabled on a Ray cluster (by passing --system-config='{"automatic_object_spilling_enabled":true,"max_io_workers":2,"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}' to the head node configuration), the program aborts with a bus error.
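For reference, a minimal sketch of the same spilling settings for a local single-node run (my addition, not from the original report), passed through ray.init's private _system_config parameter instead of the head node CLI flag:

import json
import ray

# Same keys as the --system-config flag above; treat the exact values
# as illustrative. object_spilling_config is itself a JSON-encoded string.
ray.init(
    _system_config={
        "automatic_object_spilling_enabled": True,
        "max_io_workers": 2,
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}
        ),
    }
)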

Error output:

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 3.238.219.129
Shared connection to 3.238.219.129 closed.
Shared connection to 3.238.219.129 closed.
Fetched IP: 3.238.219.129
Shared connection to 3.238.219.129 closed.
2021-02-15 14:26:48,908 INFO worker.py:655 -- Connecting to existing Ray cluster at address: 172.31.7.112:6379
(autoscaler +26s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +26s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +32s) Adding 3 nodes of type ray-legacy-worker-node-type.
(autoscaler +57s) Resized to 12 CPUs.
(autoscaler +57s) Adding 1 nodes of type ray-legacy-worker-node-type.
(autoscaler +1m4s) Resized to 24 CPUs.
(autoscaler +1m4s) Adding 3 nodes of type ray-legacy-worker-node-type.
(autoscaler +1m10s) Resized to 28 CPUs.
(autoscaler +1m29s) Resized to 32 CPUs.
(autoscaler +1m35s) Resized to 36 CPUs.
(autoscaler +1m41s) Resized to 44 CPUs.
Bus error (core dumped)
Shared connection to 3.238.219.129 closed.
Error: Command failed:

  ssh -tt -i /home/zhitingz/.ssh/ray-autoscaler_us-east-1.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4dcb4daf68/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.238.219.129 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/gemm.py --download 25000 25000 uint16)'"'"'"'"'"'"'"'"''"'"' )'


Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

from timeit import default_timer as dtimer
import ray
from ray.util.dask import ray_dask_get
import dask
import dask.array as da
import numpy as np


def main():
    _ = ray.init(address="auto")
    dask.config.set(scheduler=ray_dask_get)
    chunks = (1000, 1000)
    x = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    y = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    start = dtimer()
    z = da.matmul(x, y)
    # print(z)
    # z.visualize(filename=f'gemm_{rows}x{cols}_{tp}_{chunks}.svg')

    z_val = z.compute()
    end = dtimer()
    print(z_val)
    print(f"time: {end-start} s")


if __name__ == "__main__":
    main()

If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (13 by maintainers)

Most upvoted comments

I changed the chunk size to 5000 to make the issue easier to reproduce (a rough per-chunk footprint sketch follows the snippet):

from timeit import default_timer as dtimer
import ray
from ray.util.dask import ray_dask_get
import dask
import dask.array as da
import numpy as np


def main():
    _ = ray.init(address="auto")
    dask.config.set(scheduler=ray_dask_get)
    chunks = (5000, 5000)
    x = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    y = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    start = dtimer()
    z = da.matmul(x, y)
    # print(z)
    # z.visualize(filename=f'gemm_{rows}x{cols}_{tp}_{chunks}.svg')

    z_val = z.compute()
    end = dtimer()
    print(z_val)
    print(f"time: {end-start} s")


if __name__ == "__main__":
    main()
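A rough per-chunk footprint check (plain NumPy arithmetic, my own back-of-the-envelope rather than part of the original comment), showing why the larger chunks hit the object store limit and trigger spilling sooner:

import numpy as np

# Bytes per chunk = rows * cols * itemsize; uint16 has a 2-byte itemsize.
small = 1000 * 1000 * np.dtype(np.uint16).itemsize  # ~1.9 MiB per chunk
large = 5000 * 5000 * np.dtype(np.uint16).itemsize  # ~47.7 MiB per chunk
print(small / 2**20, large / 2**20)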

raylet.err shows the following:

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 3.238.219.129
Warning: Permanently added '3.238.219.129' (ECDSA) to the list of known hosts.
[2021-02-15 15:19:50,617 C 39361 39361] local_object_manager.cc:340:  Check failed: spilled_objects_url_.count(object_id) > 0
[2021-02-15 15:19:50,617 E 39361 39361] logging.cc:435: *** Aborted at 1613431190 (unix time) try "date -d @1613431190" if you are using GNU date ***
[2021-02-15 15:19:50,618 E 39361 39361] logging.cc:435: PC: @                0x0 (unknown)
[2021-02-15 15:19:50,623 E 39361 39361] logging.cc:435: *** SIGABRT (@0x3e8000099c1) received by PID 39361 (TID 0x7f837aeab800) from PID 39361; stack trace: ***
[2021-02-15 15:19:50,624 E 39361 39361] logging.cc:435:     @     0x561d996eacef google::(anonymous namespace)::FailureSignalHandler()
[2021-02-15 15:19:50,624 E 39361 39361] logging.cc:435:     @     0x7f837b40d3c0 (unknown)
[2021-02-15 15:19:50,625 E 39361 39361] logging.cc:435:     @     0x7f837aef618b gsignal
[2021-02-15 15:19:50,625 E 39361 39361] logging.cc:435:     @     0x7f837aed5859 abort
[2021-02-15 15:19:50,627 E 39361 39361] logging.cc:435:     @     0x561d996dc0c5 ray::SpdLogMessage::Flush()
[2021-02-15 15:19:50,628 E 39361 39361] logging.cc:435:     @     0x561d996dc0fd ray::RayLog::~RayLog()
[2021-02-15 15:19:50,629 E 39361 39361] logging.cc:435:     @     0x561d99329e07 ray::raylet::LocalObjectManager::AsyncRestoreSpilledObject()
[2021-02-15 15:19:50,629 E 39361 39361] logging.cc:435:     @     0x561d992ca0cc _ZNSt17_Function_handlerIFvRKN3ray8ObjectIDERKSsRKNS0_6NodeIDESt8functionIFvRKNS0_6StatusEEEEZNS0_6raylet6RayletC4ERN5boost4asio10io_contextES5_S5_S5_iS5_RKNSG_17NodeManagerConfigERKNS0_19ObjectManagerConfigESt10shared_ptrINS0_3gcs9GcsClientEEiEUlS3_S5_S8_SE_E_E9_M_invokeERKSt9_Any_dataS3_S5_S8_OSE_
[2021-02-15 15:19:50,631 E 39361 39361] logging.cc:435:     @     0x561d993e053a ray::PullManager::TryToMakeObjectLocal()
[2021-02-15 15:19:50,633 E 39361 39361] logging.cc:435:     @     0x561d993e0bcb ray::PullManager::UpdatePullsBasedOnAvailableMemory()
[2021-02-15 15:19:50,635 E 39361 39361] logging.cc:435:     @     0x561d993e1c20 ray::PullManager::OnLocationChange()
[2021-02-15 15:19:50,636 E 39361 39361] logging.cc:435:     @     0x561d993b9b48 _ZNSt17_Function_handlerIFvRKN3ray8ObjectIDERKSt6vectorINS0_3rpc20ObjectLocationChangeESaIS6_EEEZNS0_15ObjectDirectory24SubscribeObjectLocationsERKNS0_8UniqueIDES3_RKNS5_7AddressERKSt8functionIFvS3_RKSt13unordered_setINS0_6NodeIDESt4hashISL_ESt8equal_toISL_ESaISL_EERKSsRKSL_mEEEUlS3_SA_E_E9_M_invokeERKSt9_Any_dataS3_SA_
[2021-02-15 15:19:50,637 E 39361 39361] logging.cc:435:     @     0x561d9947e713 _ZNSt17_Function_handlerIFvN3ray6StatusERKN5boost8optionalINS0_3rpc18ObjectLocationInfoEEEEZZNS0_3gcs30ServiceBasedObjectInfoAccessor25AsyncSubscribeToLocationsERKNS0_8ObjectIDERKSt8functionIFvSE_RKSt6vectorINS4_20ObjectLocationChangeESaISH_EEEERKSF_IFvS1_EEENKUlST_E_clEST_EUlRKS1_S8_E_E9_M_invokeERKSt9_Any_dataOS1_S8_
[2021-02-15 15:19:50,637 E 39361 39361] logging.cc:435:     @     0x561d99485a30 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc23GetObjectLocationsReplyEEZNS0_3gcs30ServiceBasedObjectInfoAccessor17AsyncGetLocationsERKNS0_8ObjectIDERKSt8functionIFvS1_RKN5boost8optionalINS4_18ObjectLocationInfoEEEEEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
[2021-02-15 15:19:50,639 E 39361 39361] logging.cc:435:     @     0x561d994542d1 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc23GetObjectLocationsReplyEEZNS4_12GcsRpcClient18GetObjectLocationsERKNS4_25GetObjectLocationsRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
[2021-02-15 15:19:50,640 E 39361 39361] logging.cc:435:     @     0x561d9945702f ray::rpc::ClientCallImpl<>::OnReplyReceived()
[2021-02-15 15:19:50,641 E 39361 39361] logging.cc:435:     @     0x561d99345472 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
[2021-02-15 15:19:50,643 E 39361 39361] logging.cc:435:     @     0x561d99a59901 boost::asio::detail::scheduler::do_run_one()
[2021-02-15 15:19:50,645 E 39361 39361] logging.cc:435:     @     0x561d99a5afa9 boost::asio::detail::scheduler::run()
[2021-02-15 15:19:50,645 E 39361 39361] logging.cc:435:     @     0x561d99a5d497 boost::asio::io_context::run()
[2021-02-15 15:19:50,646 E 39361 39361] logging.cc:435:     @     0x561d992a2072 main
[2021-02-15 15:19:50,647 E 39361 39361] logging.cc:435:     @     0x7f837aed70b3 __libc_start_main
[2021-02-15 15:19:50,649 E 39361 39361] logging.cc:435:     @     0x561d992b7165 (unknown)

For shm, here is the df output on the head node:

Filesystem     1K-blocks    Used Available Use% Mounted on
shm              5101024 4589952    511072  90% /dev/shm

Is there any command to execute commands on all nodes?