ray: [release][CI] dataset_shuffle_random_shuffle_1tb failed
What happened + What you expected to happen
Traceback (most recent call last):
File "dataset/sort.py", line 165, in <module>
raise exc
File "dataset/sort.py", line 117, in <module>
ds = ds.random_shuffle()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 857, in random_shuffle
return Dataset(plan, self._epoch, self._lazy)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 219, in __init__
self._plan.execute(allow_clear_input_blocks=False)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 310, in execute
blocks, clear_input_blocks, self._run_by_consumer
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 765, in __call__
blocks, clear_input_blocks, self.block_udf, self.ray_remote_args
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/stage_impl.py", line 118, in do_shuffle
reduce_ray_remote_args=remote_args,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/shuffle.py", line 117, in execute
new_metadata = reduce_bar.fetch_until_complete(list(new_metadata))
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/progress_bar.py", line 75, in fetch_until_complete
for ref, result in zip(done, ray.get(done)):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2289, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::reduce() (pid=812, ip=172.31.116.182)
At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object a634bb3b3ca530f0ffffffffffffffffffffffff0200000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
Fetch for object a634bb3b3ca530f0ffffffffffffffffffffffff0200000002000000 timed out because no locations were found for the object. This may indicate a system-level bug.
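The error message above suggests setting `RAY_record_ref_creation_sites=1` to find out where the lost ObjectRef was created. A minimal sketch of doing that from the driver script follows. The helper name `enable_ref_creation_sites` is hypothetical and not part of Ray. Per the message, the variable also needs to be present in the environment of `ray start` on the cluster nodes, which this sketch does not cover.

```python
import os


def enable_ref_creation_sites(env=None):
    """Set RAY_record_ref_creation_sites=1 in the given mapping.

    Hypothetical helper; defaults to the current process environment.
    """
    target = os.environ if env is None else env
    target["RAY_record_ref_creation_sites"] = "1"
    return target


# Assumed usage in the driver, before Ray is initialized:
#   enable_ref_creation_sites()
#   import ray
#   ray.init()
```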
However, the following looks like the likely root cause:
worker.py:1839 -- The node with node id: cfa8ba8a1da67a2ca0d324cfa1f25379459d4e4595534bc9339d572f and address: 172.31.127.55 and node name: 172.31.127.55 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
Shuffle Map: 58%|█████▊ | 583/1000 [13:12<16:19, 2.35s/it]
2022-10-12 13:13:42,685 WARNING worker.py:1839 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 80725d1404687f199f85f783b5726797df932a6b02000000 Worker ID: f5ad619ac9eb377ba1940e4aeee99e325b05a9974c0e9294bac3cb2c Node ID: 333d58f0360b87d85caac78f08855930c3209c04eb5f819d2ee63610 Worker IP address: 172.31.119.129 Worker port: 10005 Worker PID: 1049 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(raylet, ip=172.31.127.55) [2022-10-12 13:13:42,754 C 79 132] (raylet) node_manager.cc:173: This node has beem marked as dead.
(raylet, ip=172.31.127.55) *** StackTrace Information ***
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49bd1a) [0x563db3153d1a] ray::operator<<()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49d7f2) [0x563db31557f2] ray::SpdLogMessage::Flush()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49db07) [0x563db3155b07] ray::RayLog::~RayLog()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x242464) [0x563db2efa464] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x375bc4) [0x563db302dbc4] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3cabd0) [0x563db3082bd0] ray::rpc::GcsRpcClient::ReportHeartbeat()::{lambda()#2}::operator()()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x373a32) [0x563db302ba32] ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x2290b5) [0x563db2ee10b5] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47fb46) [0x563db3137b46] EventTracker::RecordExecution()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x42030e) [0x563db30d830e] std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x420786) [0x563db30d8786] boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9ada0b) [0x563db3665a0b] boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9af1d1) [0x563db36671d1] boost::asio::detail::scheduler::run()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9af400) [0x563db3667400] boost::asio::io_context::run()
(raylet, ip=172.31.127.55) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9fe110) [0x563db36b6110] execute_native_thread_routine
(raylet, ip=172.31.127.55) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7ff78c649609] start_thread
(raylet, ip=172.31.127.55) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff78c218133] __clone
(raylet, ip=172.31.127.55)
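To check that the node reported dead by the driver is the same node whose raylet aborted, the two log lines can be matched by IP. The sketch below uses only the standard library. The function names and regular expressions are illustrative assumptions based on the log lines quoted in this issue, not a Ray API, and the message formats may differ in other Ray versions.

```python
import re

# Driver-side warning: "The node with node id: <hex> and address: <ip> ... has been marked dead"
DEAD_NODE_RE = re.compile(
    r"node id: (?P<node_id>[0-9a-f]+) and address: (?P<ip>\d+\.\d+\.\d+\.\d+)"
    r".*has been marked dead"
)
# Raylet-side fatal log; "beem" is how the message is spelled in the log above.
RAYLET_FATAL_RE = re.compile(
    r"\(raylet, ip=(?P<ip>\d+\.\d+\.\d+\.\d+)\).*This node has bee[mn] marked as dead"
)


def dead_nodes(log_lines):
    """Return {ip: node_id} for nodes the driver reported as dead."""
    found = {}
    for line in log_lines:
        match = DEAD_NODE_RE.search(line)
        if match:
            found[match.group("ip")] = match.group("node_id")
    return found


def raylets_that_exited(log_lines):
    """Return the set of IPs whose raylet logged that it was marked dead."""
    ips = set()
    for line in log_lines:
        match = RAYLET_FATAL_RE.search(line)
        if match:
            ips.add(match.group("ip"))
    return ips


# Two lines copied from the logs in this issue.
SAMPLE = [
    "worker.py:1839 -- The node with node id: "
    "cfa8ba8a1da67a2ca0d324cfa1f25379459d4e4595534bc9339d572f and address: "
    "172.31.127.55 and node name: 172.31.127.55 has been marked dead because "
    "the detector has missed too many heartbeats from it.",
    "(raylet, ip=172.31.127.55) [2022-10-12 13:13:42,754 C 79 132] (raylet) "
    "node_manager.cc:173: This node has beem marked as dead.",
]

if __name__ == "__main__":
    reported = dead_nodes(SAMPLE)
    exited = raylets_that_exited(SAMPLE)
    for ip in sorted(exited & reported.keys()):
        print(f"{ip}: raylet exited, node id {reported[ip]}")
```

On the sample lines, both sides point at 172.31.127.55, while the worker that died with SYSTEM_ERROR was on a different address (172.31.119.129).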
Versions / Dependencies
Release 2.1 master
Reproduction script
NA
Issue Severity
No response
About this issue
- State: closed
- Created 2 years ago
- Comments: 17 (17 by maintainers)
Should we try the OOM killer if the node is timing out due to OOM / freezing?
@clarkzinzow I'm a bit overloaded right now (5 release blockers in total). Mind taking this one?