ray: [Multitenancy] Job ID not found in Raylet for Detached Worker

What is the problem?

@stephanie-wang and I discovered this problem while debugging a tutorial on Serve.

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

  1. RAY_BACKEND_LOG_LEVEL=debug ray start --head
  2. Run this file as a Python script and make sure it exits:
import ray

ray.init(address="auto")

@ray.remote
class Controller:
    def __init__(self):
        self.workers = []

    def list(self):
        return self.workers

    def start(self, actor_class):
        # Start a detached worker actor from inside this detached actor.
        worker = actor_class.options(lifetime="detached").remote()
        ray.get(worker.ping.remote())
        self.workers.append(worker)

# A named, detached controller that outlives this driver.
ctl = Controller.options(lifetime="detached", name="controller").remote()
print(f"Worker handles {ray.get(ctl.list.remote())}")
  3. Run this file as a separate script; it will block forever.
import ray

ray.init(address="auto")

# Look up the detached controller created by the first script.
ctl = ray.get_actor("controller")

@ray.remote
class Worker:
    def __init__(self):
        print("Worker started!")

    def ping(self):
        return "pong"

# This call hangs: the controller never manages to start the detached Worker.
ray.get(ctl.start.remote(Worker))
print("worker", ray.get(ctl.list.remote()))
  4. Now it hangs. Ctrl+C out of it, go to the Ray log directory, and check raylet.out: grep --ignore-case job raylet.out
I1016 08:54:55.621362 70924 193117632 service_based_accessor.cc:92] Reestablishing subscription for job info.
I1016 08:54:55.638834 70924 193117632 service_based_accessor.cc:110] Getting all job info.
I1016 08:54:55.639565 70924 193117632 service_based_accessor.cc:117] Finished getting all job info.
I1016 08:55:01.539261 70924 193117632 worker.cc:123] Assigned worker 01000000ffffffffffffffffffffffffffffffff to job 01000000
I1016 08:55:01.539319 70924 193117632 service_based_accessor.cc:29] Adding job, job id = 01000000, driver pid = 71004
I1016 08:55:01.539764 70924 193117632 service_based_accessor.cc:39] Finished adding job, status = OK, job id = 01000000, driver pid = 71004
I1016 08:55:01.539840 70924 193117632 node_manager.cc:336] HandleJobStarted 01000000
I1016 08:55:01.546902 70924 193117632 service_based_accessor.cc:1150] Adding object location, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.547286 70924 193117632 service_based_accessor.cc:1205] Removing object location, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.547652 70924 193117632 service_based_accessor.cc:1166] Finished adding object location, status = OK, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.548303 70924 193117632 service_based_accessor.cc:1220] Finished removing object location, status = OK, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.556632 70924 193117632 service_based_accessor.cc:850] Adding task, task id = ffffffffffffffffdf5a1a8201000000, job id = 01000000
I1016 08:55:01.556875 70924 193117632 node_manager.cc:2256] Submitting task: task_spec={Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Controller, function_name=__init__, function_hash=8ba8d4fb-0588-4001-a7fe-2772faf99d8c}, task_id=ffffffffffffffffdf5a1a8201000000, task_name=controller:Controller.__init__, job_id=01000000, num_args=0, num_returns=1, actor_creation_task_spec={actor_id=df5a1a8201000000, max_restarts=0, max_concurrency=1, is_asyncio_actor=0, is_detached=1}}, task_execution_spec={num_forwards=0}
I1016 08:55:01.558889 70924 193117632 worker_pool.cc:325] Worker process 71007 is bound to job 01000000
I1016 08:55:01.559113 70924 193117632 service_based_accessor.cc:859] Finished adding task, status = OK, task id = ffffffffffffffffdf5a1a8201000000, job id = 01000000
I1016 08:55:01.974562 70924 193117632 worker.cc:123] Assigned worker e32a2b267032d7c9485c534d7f32e1dce8eacb4d to job 01000000
I1016 08:55:01.983006 70924 193117632 service_based_accessor.cc:1150] Adding object location, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:01.983310 70924 193117632 service_based_accessor.cc:1205] Removing object location, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:01.983757 70924 193117632 service_based_accessor.cc:1166] Finished adding object location, status = OK, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:01.984107 70924 193117632 service_based_accessor.cc:1220] Finished removing object location, status = OK, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:02.493301 70924 193117632 service_based_accessor.cc:48] Marking job state, job id = 01000000
I1016 08:55:02.493429 70924 193117632 node_manager.cc:1474] Driver (pid=71004) is disconnected. job_id: 01000000
I1016 08:55:02.493961 70924 193117632 service_based_accessor.cc:57] Finished marking job state, status = OK, job id = 01000000
I1016 08:55:02.494050 70924 193117632 node_manager.cc:353] HandleJobFinished 01000000
I1016 08:55:07.481276 70924 193117632 worker.cc:123] Assigned worker 02000000ffffffffffffffffffffffffffffffff to job 02000000
I1016 08:55:07.481319 70924 193117632 service_based_accessor.cc:29] Adding job, job id = 02000000, driver pid = 71081
I1016 08:55:07.481721 70924 193117632 service_based_accessor.cc:39] Finished adding job, status = OK, job id = 02000000, driver pid = 71081
I1016 08:55:07.481789 70924 193117632 node_manager.cc:336] HandleJobStarted 02000000
I1016 08:55:07.489056 70924 193117632 service_based_accessor.cc:1150] Adding object location, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.489295 70924 193117632 service_based_accessor.cc:1205] Removing object location, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.489627 70924 193117632 service_based_accessor.cc:1166] Finished adding object location, status = OK, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.490015 70924 193117632 service_based_accessor.cc:1220] Finished removing object location, status = OK, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.500048 70924 193117632 service_based_accessor.cc:850] Adding task, task id = ffffffffffffffff7fffa98101000000, job id = 01000000
I1016 08:55:07.500170 70924 193117632 node_manager.cc:2256] Submitting task: task_spec={Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Worker, function_name=__init__, function_hash=c66af86d-070e-4c79-a178-29e8ac074eb6}, task_id=ffffffffffffffff7fffa98101000000, task_name=Worker.__init__(), job_id=01000000, num_args=0, num_returns=1, actor_creation_task_spec={actor_id=7fffa98101000000, max_restarts=0, max_concurrency=1, is_asyncio_actor=0, is_detached=1}}, task_execution_spec={num_forwards=0}
I1016 08:55:07.500252 70924 193117632 worker_pool.cc:180] Job config of job 01000000 are not local yet.
I1016 08:55:07.500522 70924 193117632 service_based_accessor.cc:859] Finished adding task, status = OK, task id = ffffffffffffffff7fffa98101000000, job id = 01000000
I1016 08:55:15.822947 70924 193117632 service_based_accessor.cc:1378] Publishing job error, job id = 01000000
I1016 08:55:15.822988 70924 193117632 service_based_accessor.cc:1381] Finished publishing job error, job id = 01000000
I1016 08:59:19.770591 70924 193117632 service_based_accessor.cc:48] Marking job state, job id = 02000000
I1016 08:59:19.770685 70924 193117632 node_manager.cc:1474] Driver (pid=71081) is disconnected. job_id: 02000000
I1016 08:59:19.771344 70924 193117632 service_based_accessor.cc:57] Finished marking job state, status = OK, job id = 02000000
I1016 08:59:19.771391 70924 193117632 node_manager.cc:353] HandleJobFinished 02000000

Notice this line:

I1016 08:55:07.500252 70924 193117632 worker_pool.cc:180] Job config of job 01000000 are not local yet.
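
This line is the key. The raylet already processed HandleJobFinished for job 01000000 at 08:55:02, when the first driver exited, but the detached controller still belongs to job 01000000. When it tries to create the detached Worker under that same job id five seconds later, the worker pool blocks waiting for a job config that has already been cleaned up and will never arrive. A minimal sketch of that bookkeeping (plain Python, not the actual raylet code; all names are hypothetical):

class WorkerPoolSketch:
    def __init__(self):
        # job_id -> job config; populated on HandleJobStarted,
        # erased on HandleJobFinished.
        self.job_configs = {}

    def handle_job_started(self, job_id, config):
        self.job_configs[job_id] = config

    def handle_job_finished(self, job_id):
        # The first driver disconnecting erases job 01000000's entry here...
        self.job_configs.pop(job_id, None)

    def try_start_worker(self, job_id):
        if job_id not in self.job_configs:
            # ...so every later attempt to start a worker for the detached
            # actor's job takes this branch and blocks.
            print(f"Job config of job {job_id} are not local yet.")
            return False
        return True

pool = WorkerPoolSketch()
pool.handle_job_started("01000000", {})
pool.handle_job_finished("01000000")  # first driver exits
pool.try_start_worker("01000000")     # detached Worker creation: stuck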

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

In Serve’s use case, one ideal workflow could be: when the user runs serve.start(detached=True) in their first driver script, the job turns into a long-running job. In that case we can make sure the job doesn’t need to get cleaned up.
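
Roughly like this (a sketch of the workflow described above; serve.start(detached=True) matches the Serve API at the time this issue was filed):

import ray
from ray import serve

ray.init(address="auto")

# The idea: this call would flip the current job into a long-running job,
# so the raylet keeps its job config around after this driver exits and
# detached actors created under it can still start workers later.
serve.start(detached=True)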

By the way, should we throw a warning if the job config is not local yet?
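
In terms of the sketch above, the suggestion could look like this (a hypothetical Python rendering; the real check lives in the raylet's C++ worker pool, so it would have to be pushed back to the driver rather than raised as a Python warning):

import warnings

def try_start_worker(job_configs, job_id):
    if job_id not in job_configs:
        # Surface the condition to the user instead of only logging it
        # at debug level in raylet.out.
        warnings.warn(
            f"Job config of job {job_id} is not local yet; "
            "worker startup for this job will block until it arrives."
        )
        return False
    return True

try_start_worker({}, "01000000")  # warns: config missing for this job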

On those questions, @stephanie-wang can probably add more, but my mental model is (see the sketch after this list):

  1. If a detached actor starts a non-detached task/actor, the task/actor belongs to that parent. If the detached actor is killed, the task/actor should be cleaned up by the raylet.
  2. If a detached actor starts another detached actor, the new child actor belongs to Ray, not to the parent actor. If the parent actor exits, the child actor remains alive.
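
Both rules in one sketch (illustrative only; Parent, Child, and the actor names are hypothetical):

import ray

ray.init(address="auto")

@ray.remote
class Child:
    def ping(self):
        return "pong"

@ray.remote
class Parent:
    def start_children(self):
        # Rule 1: a non-detached child is owned by this detached parent;
        # if the parent is killed, the raylet should clean the child up.
        self.owned = Child.remote()
        # Rule 2: a detached child belongs to Ray, not to this parent;
        # it should stay alive after the parent exits.
        self.free = Child.options(lifetime="detached", name="free_child").remote()

parent = Parent.options(lifetime="detached", name="parent").remote()
ray.get(parent.start_children.remote())
ray.kill(parent)
# Expectation: ray.get_actor("free_child") still succeeds, while the
# owned (non-detached) child gets cleaned up by the raylet.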

Note that the repro works (the second script no longer hangs) if RAY_ENABLE_MULTI_TENANCY=0 is set for ray start.
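
For example, mirroring step 1 of the repro (assuming the variable is read at raylet startup):

RAY_ENABLE_MULTI_TENANCY=0 ray start --head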

@kfstorm, this seems to be caused by a multitenancy bug.