ray: [Core] Raylet breaks when many actor tasks are submitted
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): tested on a 16-core MacBook Pro.
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
This is caused by a low ulimit, but we should have a better error message.
Note that ActorPool ensures there is at most one in-flight task per actor.
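For reference, the open-file limit that triggers this can be inspected (and, within the hard limit, raised) from Python with the standard resource module. This is a minimal sketch, separate from the repro script below; the 8192 target is just an illustrative value:

import resource

# Inspect the soft/hard limits on open file descriptors (RLIMIT_NOFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# The soft limit can be raised up to the hard limit without extra privileges.
target = 8192 if hard == resource.RLIM_INFINITY else min(8192, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))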
import ray
from ray.util import ActorPool

@ray.remote(num_cpus=0)
class DummyActor:
    def __init__(self):
        pass

    def do_stuff(self):
        pass

ray.init()

things = [x for x in range(10000)]
nworkers = int(ray.cluster_resources()['CPU']) * 4
actors = [DummyActor.remote() for _ in range(int(nworkers))]
pool = ActorPool(actors)
res = pool.map(lambda a, v: a.do_stuff.remote(), things)
for i, x in enumerate(res):
    if i % 100 == 0:
        print(x)
Output:
(pid=49596) F0904 12:07:30.034916 49596 377699776 raylet_client.cc:108] Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
(pid=49596) *** Check failure stack trace: ***
(pid=raylet) F0904 12:07:30.037915 49403 281886144 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet) @ 0x1083e0112 google::LogMessage::~LogMessage()
(pid=raylet) @ 0x10837cdc5 ray::RayLog::~RayLog()
(pid=raylet) @ 0x107f6f96e ray::raylet::WorkerPool::StartProcess()
(pid=raylet) @ 0x107f6d04f ray::raylet::WorkerPool::StartWorkerProcess()
(pid=raylet) @ 0x107f73707 ray::raylet::WorkerPool::PopWorker()
(pid=raylet) @ 0x107ec6923 ray::raylet::NodeManager::DispatchTasks()
(pid=raylet) @ 0x107ed8b09 ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet) @ 0x107ed15cf ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet) @ 0x107ecfa86 ray::raylet::NodeManager::ProcessClientMessage()
(pid=raylet) @ 0x107f3817a std::__1::__function::__func<>::operator()()
(pid=raylet) @ 0x1083558ee ray::ClientConnection::ProcessMessage()
(pid=raylet) @ 0x10835cdb0 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(pid=raylet) @ 0x1087e830e boost::asio::detail::scheduler::do_run_one()
(pid=raylet) @ 0x1087dbca1 boost::asio::detail::scheduler::run()
(pid=raylet) @ 0x1087dbb2c boost::asio::io_context::run()
(pid=raylet) @ 0x107ea7d8a main
(pid=raylet) @ 0x7fff6a420cc9 start
(pid=49566) F0904 12:07:30.043242 49566 287137216 raylet_client.cc:108] Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
(pid=49566) *** Check failure stack trace: ***
(pid=49566) @ 0x10c6f44e2 google::LogMessage::~LogMessage()
(pid=49566) @ 0x10c691745 ray::RayLog::~RayLog()
(pid=49566) @ 0x10c2a1b99 ray::raylet::RayletClient::RayletClient()
(pid=49566) @ 0x10c1d1e6a ray::CoreWorker::CoreWorker()
(pid=49566) @ 0x10c1cfbdf ray::CoreWorkerProcess::CreateWorker()
(pid=49566) @ 0x10c1ce913 ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=49566) @ 0x10c1cdab7 ray::CoreWorkerProcess::Initialize()
(pid=49566) @ 0x10c13c275 __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=49566) @ 0x10b85ca8f type_call
(pid=49566) @ 0x10b7d14f3 _PyObject_FastCallKeywords
(pid=49566) @ 0x10b90ee75 call_function
(pid=49566) @ 0x10b90bb92 _PyEval_EvalFrameDefault
(pid=49566) @ 0x10b90046e _PyEval_EvalCodeWithName
(pid=49566) @ 0x10b7d1a03 _PyFunction_FastCallKeywords
(pid=49566) @ 0x10b90ed67 call_function
(pid=49566) @ 0x10b90cb8d _PyEval_EvalFrameDefault
(pid=49566) @ 0x10b90046e _PyEval_EvalCodeWithName
(pid=49566) @ 0x10b963ce0 PyRun_FileExFlags
(pid=49566) @ 0x10b963157 PyRun_SimpleFileExFlags
(pid=49566) @ 0x10b990dc3 pymain_main
(pid=49566) @ 0x10b7a3f2d main
(pid=49566) @ 0x7fff6a420cc9 start
(pid=49566) @ 0xb (unknown)
(pid=49563) F0904 12:07:30.037027 49563 245689792 raylet_client.cc:108] Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
(pid=49563) *** Check failure stack trace: ***
(pid=49563) @ 0x10f06e4e2 google::LogMessage::~LogMessage()
(pid=49563) @ 0x10f00b745 ray::RayLog::~RayLog()
(pid=49563) @ 0x10ec1bb99 ray::raylet::RayletClient::RayletClient()
(pid=49563) @ 0x10eb4be6a ray::CoreWorker::CoreWorker()
(pid=49563) @ 0x10eb49bdf ray::CoreWorkerProcess::CreateWorker()
(pid=49563) @ 0x10eb48913 ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=49563) @ 0x10eb47ab7 ray::CoreWorkerProcess::Initialize()
(pid=49563) @ 0x10eab6275 __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=49563) @ 0x10ded3a8f type_call
(pid=49563) @ 0x10de484f3 _PyObject_FastCallKeywords
(pid=49563) @ 0x10df85e75 call_function
(pid=49563) @ 0x10df82b92 _PyEval_EvalFrameDefault
(pid=49563) @ 0x10df7746e _PyEval_EvalCodeWithName
(pid=49563) @ 0x10de48a03 _PyFunction_FastCallKeywords
(pid=49563) @ 0x10df85d67 call_function
(pid=49563) @ 0x10df83b8d _PyEval_EvalFrameDefault
(pid=49563) @ 0x10df7746e _PyEval_EvalCodeWithName
(pid=49563) @ 0x10dfdace0 PyRun_FileExFlags
(pid=49563) @ 0x10dfda157 PyRun_SimpleFileExFlags
(pid=49563) @ 0x10e007dc3 pymain_main
(pid=49563) @ 0x10de1af2d main
(pid=49563) @ 0x7fff6a420cc9 start
(pid=49512) E0904 12:07:30.084020 49512 218574848 core_worker.cc:694] Raylet failed. Shutting down.
(pid=49519) E0904 12:07:30.083940 49519 101613568 core_worker.cc:694] Raylet failed. Shutting down.
(pid=49521) E0904 12:07:30.083511 49521 5890048 core_worker.cc:694] Raylet failed. Shutting down.
(pid=49562) F0904 12:07:30.097999 49562 180841920 core_worker.cc:330] Check failed: _s.ok() Bad status: IOError: Broken pipe
(pid=49562) *** Check failure stack trace: ***
(pid=49562) @ 0x101a884e2 google::LogMessage::~LogMessage()
(pid=49562) @ 0x101a25745 ray::RayLog::~RayLog()
(pid=49562) @ 0x1015661df ray::CoreWorker::CoreWorker()
(pid=49562) @ 0x101563bdf ray::CoreWorkerProcess::CreateWorker()
(pid=49562) @ 0x101562913 ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=49562) @ 0x101561ab7 ray::CoreWorkerProcess::Initialize()
(pid=49562) @ 0x1014d0275 __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=49562) @ 0x100bf0a8f type_call
(pid=49562) @ 0x100b654f3 _PyObject_FastCallKeywords
(pid=49562) @ 0x100ca2e75 call_function
(pid=49562) @ 0x100c9fb92 _PyEval_EvalFrameDefault
(pid=49562) @ 0x100c9446e _PyEval_EvalCodeWithName
(pid=49562) @ 0x100b65a03 _PyFunction_FastCallKeywords
(pid=49562) @ 0x100ca2d67 call_function
(pid=49562) @ 0x100ca0b8d _PyEval_EvalFrameDefault
(pid=49562) @ 0x100c9446e _PyEval_EvalCodeWithName
(pid=49562) @ 0x100cf7ce0 PyRun_FileExFlags
(pid=49562) @ 0x100cf7157 PyRun_SimpleFileExFlags
(pid=49562) @ 0x100d24dc3 pymain_main
(pid=49562) @ 0x100b37f2d main
(pid=49562) @ 0x7fff6a420cc9 start
(pid=49562) @ 0xb (unknown)
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 16 (15 by maintainers)
Hmm, it seems a bit scary to me to be overriding that limit as part of Ray’s source code… It’s also pretty awkward, since then we’d have to make the value configurable. We do it in the example cluster launcher scripts, but that’s less sketchy since it’s very clear and configurable, plus it’s not going to be running on the user’s laptop.
I would be in favor of just making the error message less scary (we probably don’t need a stacktrace) and adding a tip to let the user know that they should manually increase the limit.
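One possible shape for that tip (a hypothetical sketch, not Ray’s actual startup code; the function name and threshold are made up) is a soft check before workers are launched:

import resource
import warnings

def warn_if_file_limit_low(min_recommended=8192):
    # Hypothetical helper: warn instead of crashing when the soft limit on
    # open file descriptors looks too low for a many-worker workload.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < min_recommended:
        warnings.warn(
            f"Open-file limit is {soft} (hard limit {hard}); this may be too "
            f"low when Ray starts many workers. Consider raising it, e.g. "
            f"`ulimit -n {min_recommended}`."
        )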