ray: [Multitenancy] Job ID not found in Raylet for Detached Worker
What is the problem?
@stephanie-wang and I discovered this problem while debugging a tutorial on Serve.
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
RAY_BACKEND_LOG_LEVEL=debug ray start --head
- Run this file as a Python script and make sure it exits:
import ray

ray.init(address="auto")

@ray.remote
class Controller:
    def __init__(self):
        self.workers = []

    def list(self):
        return self.workers

    def start(self, actor_class):
        # Create another detached actor from inside this detached actor.
        worker = actor_class.options(lifetime="detached").remote()
        ray.get(worker.ping.remote())
        self.workers.append(worker)

# The controller itself is a named, detached actor, so it outlives this driver.
ctl = Controller.options(lifetime="detached", name="controller").remote()
print(f"Worker handles {ray.get(ctl.list.remote())}")
- Run this file as a separate script; it will block forever:
import ray

ray.init(address="auto")

# Look up the detached controller created by the first script.
ctl = ray.get_actor("controller")

@ray.remote
class Worker:
    def __init__(self):
        print("Worker started!")

    def ping(self):
        return "pong"

# This call never returns: the controller cannot start the detached Worker.
ray.get(ctl.start.remote(Worker))
print("worker", ray.get(ctl.list.remote()))
- Now it hangs. Ctrl+C out of it, go to the Ray log directory, and check raylet.out:
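With Ray's default temp directory, the logs typically live under the session directory (the path may differ if --temp-dir was set when starting Ray):
cd /tmp/ray/session_latest/logs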
grep --ignore-case job raylet.out
I1016 08:54:55.621362 70924 193117632 service_based_accessor.cc:92] Reestablishing subscription for job info.
I1016 08:54:55.638834 70924 193117632 service_based_accessor.cc:110] Getting all job info.
I1016 08:54:55.639565 70924 193117632 service_based_accessor.cc:117] Finished getting all job info.
I1016 08:55:01.539261 70924 193117632 worker.cc:123] Assigned worker 01000000ffffffffffffffffffffffffffffffff to job 01000000
I1016 08:55:01.539319 70924 193117632 service_based_accessor.cc:29] Adding job, job id = 01000000, driver pid = 71004
I1016 08:55:01.539764 70924 193117632 service_based_accessor.cc:39] Finished adding job, status = OK, job id = 01000000, driver pid = 71004
I1016 08:55:01.539840 70924 193117632 node_manager.cc:336] HandleJobStarted 01000000
I1016 08:55:01.546902 70924 193117632 service_based_accessor.cc:1150] Adding object location, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.547286 70924 193117632 service_based_accessor.cc:1205] Removing object location, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.547652 70924 193117632 service_based_accessor.cc:1166] Finished adding object location, status = OK, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.548303 70924 193117632 service_based_accessor.cc:1220] Finished removing object location, status = OK, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.556632 70924 193117632 service_based_accessor.cc:850] Adding task, task id = ffffffffffffffffdf5a1a8201000000, job id = 01000000
I1016 08:55:01.556875 70924 193117632 node_manager.cc:2256] Submitting task: task_spec={Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Controller, function_name=__init__, function_hash=8ba8d4fb-0588-4001-a7fe-2772faf99d8c}, task_id=ffffffffffffffffdf5a1a8201000000, task_name=controller:Controller.__init__, job_id=01000000, num_args=0, num_returns=1, actor_creation_task_spec={actor_id=df5a1a8201000000, max_restarts=0, max_concurrency=1, is_asyncio_actor=0, is_detached=1}}, task_execution_spec={num_forwards=0}
I1016 08:55:01.558889 70924 193117632 worker_pool.cc:325] Worker process 71007 is bound to job 01000000
I1016 08:55:01.559113 70924 193117632 service_based_accessor.cc:859] Finished adding task, status = OK, task id = ffffffffffffffffdf5a1a8201000000, job id = 01000000
I1016 08:55:01.974562 70924 193117632 worker.cc:123] Assigned worker e32a2b267032d7c9485c534d7f32e1dce8eacb4d to job 01000000
I1016 08:55:01.983006 70924 193117632 service_based_accessor.cc:1150] Adding object location, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:01.983310 70924 193117632 service_based_accessor.cc:1205] Removing object location, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:01.983757 70924 193117632 service_based_accessor.cc:1166] Finished adding object location, status = OK, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:01.984107 70924 193117632 service_based_accessor.cc:1220] Finished removing object location, status = OK, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:02.493301 70924 193117632 service_based_accessor.cc:48] Marking job state, job id = 01000000
I1016 08:55:02.493429 70924 193117632 node_manager.cc:1474] Driver (pid=71004) is disconnected. job_id: 01000000
I1016 08:55:02.493961 70924 193117632 service_based_accessor.cc:57] Finished marking job state, status = OK, job id = 01000000
I1016 08:55:02.494050 70924 193117632 node_manager.cc:353] HandleJobFinished 01000000
I1016 08:55:07.481276 70924 193117632 worker.cc:123] Assigned worker 02000000ffffffffffffffffffffffffffffffff to job 02000000
I1016 08:55:07.481319 70924 193117632 service_based_accessor.cc:29] Adding job, job id = 02000000, driver pid = 71081
I1016 08:55:07.481721 70924 193117632 service_based_accessor.cc:39] Finished adding job, status = OK, job id = 02000000, driver pid = 71081
I1016 08:55:07.481789 70924 193117632 node_manager.cc:336] HandleJobStarted 02000000
I1016 08:55:07.489056 70924 193117632 service_based_accessor.cc:1150] Adding object location, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.489295 70924 193117632 service_based_accessor.cc:1205] Removing object location, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.489627 70924 193117632 service_based_accessor.cc:1166] Finished adding object location, status = OK, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.490015 70924 193117632 service_based_accessor.cc:1220] Finished removing object location, status = OK, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.500048 70924 193117632 service_based_accessor.cc:850] Adding task, task id = ffffffffffffffff7fffa98101000000, job id = 01000000
I1016 08:55:07.500170 70924 193117632 node_manager.cc:2256] Submitting task: task_spec={Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Worker, function_name=__init__, function_hash=c66af86d-070e-4c79-a178-29e8ac074eb6}, task_id=ffffffffffffffff7fffa98101000000, task_name=Worker.__init__(), job_id=01000000, num_args=0, num_returns=1, actor_creation_task_spec={actor_id=7fffa98101000000, max_restarts=0, max_concurrency=1, is_asyncio_actor=0, is_detached=1}}, task_execution_spec={num_forwards=0}
I1016 08:55:07.500252 70924 193117632 worker_pool.cc:180] Job config of job 01000000 are not local yet.
I1016 08:55:07.500522 70924 193117632 service_based_accessor.cc:859] Finished adding task, status = OK, task id = ffffffffffffffff7fffa98101000000, job id = 01000000
I1016 08:55:15.822947 70924 193117632 service_based_accessor.cc:1378] Publishing job error, job id = 01000000
I1016 08:55:15.822988 70924 193117632 service_based_accessor.cc:1381] Finished publishing job error, job id = 01000000
I1016 08:59:19.770591 70924 193117632 service_based_accessor.cc:48] Marking job state, job id = 02000000
I1016 08:59:19.770685 70924 193117632 node_manager.cc:1474] Driver (pid=71081) is disconnected. job_id: 02000000
I1016 08:59:19.771344 70924 193117632 service_based_accessor.cc:57] Finished marking job state, status = OK, job id = 02000000
I1016 08:59:19.771391 70924 193117632 node_manager.cc:353] HandleJobFinished 02000000
Notice the following line:
I1016 08:55:07.500252 70924 193117632 worker_pool.cc:180] Job config of job 01000000 are not local yet.
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- [Serve] Add regression test for #11437 — committed to simon-mo/ray by simon-mo 4 years ago
- [Serve] Add regression test for #11437 (#11539) — committed to ray-project/ray by simon-mo 4 years ago
In Serve’s use case, one ideal workflow would be that when the user runs serve.start(detached=True) in their first driver script, the job turns into a long-running job. In that case we can make sure the job doesn't need to get cleaned up. BTW, should we throw a warning if the job ID is not local yet?
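A rough sketch of that workflow, using the serve.start(detached=True) call mentioned above (the surrounding script is illustrative):
# Hypothetical first driver script: start Serve detached, then exit.
import ray
from ray import serve

ray.init(address="auto")
serve.start(detached=True)  # Serve's controller becomes a long-running, detached job
# The driver exits here; the detached Serve actors should outlive it, and a later
# driver can reconnect via ray.init(address="auto") and keep using them.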
On those questions, @stephanie-wang can probably add more, but my mental model is that:
Note that this works if RAY_ENABLE_MULTI_TENANCY=0 is set for ray start.
@kfstorm, this seems to be caused by a multitenancy bug.
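For reference, the workaround mentioned above as a concrete command (for the same single-node repro setup):
RAY_ENABLE_MULTI_TENANCY=0 ray start --head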