ray: [Core][Nightly] datasets_ingest_train_infer failing
What happened + What you expected to happen
Step 1: Dropping nulls, creating new_col, updating feature_1
Map_Batches: 99%|█████████████████████████████████████████████████████████████████████████████████████▏| 396/400 [00:19<00:00, 78.36it/s](pid=gcs_server) E0725 08:34:14.326870515 253 chttp2_transport.cc:2846] keepalive_ping_end state error: 0 (expect: 1)
(raylet, ip=10.0.0.128) [2022-07-25 08:35:14,805 C 157 212] (raylet) gcs_rpc_client.h:537: Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
(raylet, ip=10.0.0.128) *** StackTrace Information ***
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47af5a) [0x55e3772d0f5a] ray::operator<<()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47ca52) [0x55e3772d2a52] ray::SpdLogMessage::Flush()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47cd67) [0x55e3772d2d67] ray::RayLog::~RayLog()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x353253) [0x55e3771a9253] ray::rpc::GcsRpcClient::CheckChannelStatus()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3536c3) [0x55e3771a96c3] boost::asio::detail::wait_handler<>::do_complete()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9896eb) [0x55e3777df6eb] boost::asio::detail::scheduler::do_run_one()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98aeb1) [0x55e3777e0eb1] boost::asio::detail::scheduler::run()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98b0e0) [0x55e3777e10e0] boost::asio::io_context::run()
(raylet, ip=10.0.0.128) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9d9fa0) [0x55e37782ffa0] execute_native_thread_routine
(raylet, ip=10.0.0.128) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f108a3366db] start_thread
(raylet, ip=10.0.0.128) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f108951861f] __clone
(raylet, ip=10.0.0.128)
Raylet is terminated: ip=10.0.0.128, id=ab515a6f7fd0794fab61dc4ac46742175cdff63997306a6de835bfb6. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[state-dump] AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 291.934 us, total = 291.934 us
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 489.663 us, total = 489.663 us
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 1.014 s, total = 1.014 s
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
[2022-07-25 08:35:14,805 C 157 212] (raylet) gcs_rpc_client.h:537: Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47af5a) [0x55e3772d0f5a] ray::operator<<()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47ca52) [0x55e3772d2a52] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47cd67) [0x55e3772d2d67] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x353253) [0x55e3771a9253] ray::rpc::GcsRpcClient::CheckChannelStatus()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3536c3) [0x55e3771a96c3] boost::asio::detail::wait_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9896eb) [0x55e3777df6eb] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98aeb1) [0x55e3777e0eb1] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98b0e0) [0x55e3777e10e0] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9d9fa0) [0x55e37782ffa0] execute_native_thread_routine
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f108a3366db] start_thread
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f108951861f] __clone
(pid=gcs_server) [2022-07-25 08:35:34,426 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
Raylet is terminated: ip=10.0.0.150, id=717b4ae7fe19f51c9d16947065f5a1144e034fa386e3da99a3fbda50. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 431.974 us, total = 431.974 us
[state-dump] JobInfoGcsService.grpc_client.AddJob - 1 total (0 active), CPU time: mean = 41.452 us, total = 41.452 us
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
[2022-07-25 08:35:36,203 W 273 337] (raylet) node_manager.cc:158: Last heartbeat was sent 8987 ms ago. There might be resource pressure on this node. If heartbeat keeps lagging, this node can be marked as dead mistakenly.
[2022-07-25 08:35:40,492 C 273 339] (raylet) gcs_rpc_client.h:537: Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47af5a) [0x55da47c05f5a] ray::operator<<()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47ca52) [0x55da47c07a52] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47cd67) [0x55da47c07d67] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x353253) [0x55da47ade253] ray::rpc::GcsRpcClient::CheckChannelStatus()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3536c3) [0x55da47ade6c3] boost::asio::detail::wait_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9896eb) [0x55da481146eb] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98aeb1) [0x55da48115eb1] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98b0e0) [0x55da481160e0] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9d9fa0) [0x55da48164fa0] execute_native_thread_routine
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fbe6cbd56db] start_thread
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fbe6bdb761f] __clone
(pid=gcs_server) [2022-07-25 08:35:41,696 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:41,709 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(raylet) [2022-07-25 08:35:40,492 C 273 339] (raylet) gcs_rpc_client.h:537: Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
(raylet) *** StackTrace Information ***
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47af5a) [0x55da47c05f5a] ray::operator<<()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47ca52) [0x55da47c07a52] ray::SpdLogMessage::Flush()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x47cd67) [0x55da47c07d67] ray::RayLog::~RayLog()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x353253) [0x55da47ade253] ray::rpc::GcsRpcClient::CheckChannelStatus()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3536c3) [0x55da47ade6c3] boost::asio::detail::wait_handler<>::do_complete()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9896eb) [0x55da481146eb] boost::asio::detail::scheduler::do_run_one()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98aeb1) [0x55da48115eb1] boost::asio::detail::scheduler::run()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98b0e0) [0x55da481160e0] boost::asio::io_context::run()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x9d9fa0) [0x55da48164fa0] execute_native_thread_routine
(raylet) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fbe6cbd56db] start_thread
(raylet) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fbe6bdb761f] __clone
(raylet)
(pid=gcs_server) [2022-07-25 08:35:42,584 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:42,584 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:43,587 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:43,588 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:44,590 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:44,590 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:45,593 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:45,593 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:46,595 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:46,595 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:47,598 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:47,599 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:48,601 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2022-07-25 08:35:48,601 E 217 217] (gcs_server) gcs_server.cc:295: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
Traceback (most recent call last):
File "ray_sgd_training.py", line 524, in <module>
read_dataset(data_path)
File "ray_sgd_training.py", line 47, in preprocess_train_data
return self._preprocess(ds, False)
File "ray_sgd_training.py", line 75, in _preprocess
ds = ds.map_batches(batch_transformer, batch_format="pandas")
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 499, in map_batches
return Dataset(plan, self._epoch, self._lazy)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 207, in __init__
self._plan.execute(allow_clear_input_blocks=False)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 309, in execute
blocks, clear_input_blocks, self._run_by_consumer
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 672, in __call__
fn_constructor_kwargs=self.fn_constructor_kwargs,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/compute.py", line 113, in _apply
results = map_bar.fetch_until_complete(refs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/progress_bar.py", line 75, in fetch_until_complete
for ref, result in zip(done, ray.get(done)):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2233, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 668, in get_objects
object_refs, self.current_task_id, timeout_ms
File "python/ray/_raylet.pyx", line 1219, in ray._raylet.CoreWorker.get_objects
File "python/ray/_raylet.pyx", line 179, in ray._raylet.check_status
ray.exceptions.RaySystemError: System error: Connection reset by peer
Versions / Dependencies
latest ray
Reproduction script
run nightly test
Issue Severity
High: It blocks me from completing my task.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (15 by maintainers)
Reopen for the further investigation on GCS disconnection. TL;DR Raylets should never be disconnected like that although there are some application loads https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83 caused. We should investigate the root cause and make that part more robust.
I think after https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83 is addressed, we should still work on improving the robustness of GCS in the same situation? (as a P1 issue?)