ray: [Ray 2.3 Release] Failure in `dataset_shuffle_sort_1tb` "no locations were found for the object"

Marking this as a release blocker until we determine otherwise. BuildKite: https://buildkite.com/ray-project/release-tests-branch/builds/1333#0186130c-207a-4e3e-a563-a8f2f329b498

ray.exceptions.RayTaskError: ray::_sample_block() (pid=143721, ip=172.31.205.93)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object f57510499cea6ea7ffffffffffffffffffffffff0300000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

Fetch for object f57510499cea6ea7ffffffffffffffffffffffff0300000002000000 timed out because no locations were found for the object. This may indicate a system-level bug.
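
For reference, a minimal sketch of turning on the ref-creation-site tracking that the error message suggests. This assumes a local cluster started by ray.init() (so the raylet inherits the driver's environment); on a multi-node cluster the variable would also need to be exported before `ray start` on each node.

    import os

    # Must be set before Ray starts. On a multi-node cluster, also export it in the
    # environment of `ray start` on every node (assumption here: local cluster).
    os.environ["RAY_record_ref_creation_sites"] = "1"

    import ray
    ray.init()

    # Subsequent "no locations were found" / fetch-timeout errors should then include
    # the Python call site where the missing ObjectRef was created.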

About this issue

  • State: closed
  • Created a year ago
  • Comments: 27 (24 by maintainers)

Most upvoted comments

OK, I got a log from a failure now: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_jxk4uxrewk4lvkstygni9q6641

It’s failing on OOM:

Traceback (most recent call last):
  File "dataset/sort.py", line 174, in <module>
    raise exc
  File "dataset/sort.py", line 122, in <module>
    ds.fully_executed()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 3945, in fully_executed
    self._plan.execute(force_read=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 536, in execute
    dataset_uuid=self._dataset_uuid,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/execution/legacy_compat.py", line 84, in execute_to_legacy_block_list
    bundles = executor.execute(dag, initial_stats=stats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/execution/bulk_executor.py", line 82, in execute
    return execute_recursive(dag)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/execution/bulk_executor.py", line 62, in execute_recursive
    op.inputs_done()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/execution/operators/all_to_all_operator.py", line 57, in inputs_done
    self._output_buffer, self._stats = self._bulk_fn(self._input_buffer, ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/execution/legacy_compat.py", line 245, in bulk_fn
    block_list, ctx, input_owned, block_udf, remote_args
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/stage_impl.py", line 214, in do_sort
    return sort_impl(blocks, clear_input_blocks, key, descending)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/sort.py", line 153, in sort_impl
    clear_input_blocks,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/shuffle.py", line 117, in execute
    new_metadata = reduce_bar.fetch_until_complete(list(new_metadata))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/progress_bar.py", line 76, in fetch_until_complete
    for ref, result in zip(done, ray.get(done)):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2391, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.31.168.7, ID: b562bad5e2d1df7ff24635eeb87fd02f736594b61d49941aea197143) where the task (task ID: 049cde2ae8f42f3cca424a2eb2f4f363c23d944803000000, name=reduce, pid=100516, memory used=3.02GB) was running was 54.74GB / 57.60GB (0.950292), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 7c69a053bf2ba6079e914f111a4a051e128f73949cd67bfd0a219d47) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.168.7`. To see the logs of the worker, use `ray logs worker-7c69a053bf2ba6079e914f111a4a051e128f73949cd67bfd0a219d47*out -ip 172.31.168.7. Top 10 memory users:
PID	MEM(GB)	COMMAND
1242	5.30	ray::IDLE_SpillWorker
124	3.94	/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=...
100516	3.02	ray::reduce
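
For context, the 0.95 in the log above is Ray's memory-monitor kill threshold (RAY_memory_usage_threshold, default 0.95 as of 2.x). A minimal sketch of tuning it for a local repro, assuming the variables are set before the raylet starts; the values below are illustrative, not a recommendation:

    import os

    # Memory-monitor knobs (values here are illustrative only).
    os.environ["RAY_memory_usage_threshold"] = "0.98"    # kill threshold (default 0.95)
    os.environ["RAY_memory_monitor_refresh_ms"] = "250"  # polling interval; "0" disables the monitor

    import ray
    ray.init()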

I think @clarng is working on a change to avoid throwing this error, which should be able to fix this test.

One remaining follow-up for us is to figure out whether there is a regression in memory efficiency.
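
On "avoid throwing this error" above: this is not the referenced change (I haven't seen it), just a sketch of the existing per-task knob in this area. As I understand it on 2.x, a worker killed by the memory monitor can have its task retried rather than surfacing OutOfMemoryError to the driver, depending on the retry budget; exact OOM-retry semantics vary by Ray version. Placeholder task names below are illustrative.

    import ray

    ray.init()

    # Placeholder standing in for the reduce task in the log above.
    # max_retries gives Ray a budget to reschedule the task if its worker dies.
    @ray.remote(max_retries=3)
    def reduce_like_task(x):
        return x

    print(ray.get(reduce_like_task.remote(1)))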

The failure/flakiness in master’s recent nightly runs has been consistent, so it’s likely reproducible (it just needs a few more tries). (Screenshot of the recent nightly results attached, taken 2023-02-09.)

On master; sounds good to also start a few runs on the 2.3 branch.

I have yet to reproduce a failure so I can see the log. I’m running two attempts concurrently now and should have results in about half an hour. Previously it didn’t log anything about the failure before it re-ran the execution via ds.stats(), which then hit the timeout.
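
For anyone reading along, a minimal sketch of the general shape being exercised here (sort a Dataset, force execution, then print stats); the scale and pipeline are illustrative, not the actual dataset/sort.py release-test script:

    import ray

    ray.init()

    # Illustrative scale only; the real release test shuffles/sorts ~1 TB across a cluster.
    ds = ray.data.range(10_000_000)
    ds = ds.sort()            # no key: sorts the integer records themselves
    ds = ds.fully_executed()  # force execution now (the call seen in the traceback above)
    print(ds.stats())         # per-stage timing/memory summary used when debugging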

That PR is a bug fix, so we’ll need to keep it (unless we revert even more). We’ll work on debugging/fixing it today.