ray: [core] ray.kill pending actor doesn't cancel the actor creation task
What is the problem?
Currently, ray.kill will silently fail if the actor has not already been started. This appears to be because we try to kill actors directly (via direct actor transport), but now GCS is responsible for scheduling/creating actors, so the actor’s owner can’t easily cancel the pending lease request.
Here’s a simple reproduction which shows the lease request is still infeasible in a raylet.
import ray
from ray._raylet import GlobalStateAccessor
import time

cluster = ray.init()
global_state_accessor = GlobalStateAccessor(
    cluster["redis_address"], ray.ray_constants.REDIS_DEFAULT_PASSWORD)
global_state_accessor.connect()

@ray.remote(resources={"WORKER": 1.0})
class ActorA:
    pass

a = ActorA.remote()
ray.kill(a)  # do not wait until it starts

while True:
    message = global_state_accessor.get_all_resource_usage()
    if message is not None:
        resource_usage = ray.gcs_utils.ResourceUsageBatchData.FromString(
            message)
        print(resource_usage)
    else:
        print(message)
    time.sleep(1)
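The loop above polls forever, which is fine for eyeballing the output but awkward in an automated check. A minimal sketch of a bounded polling helper follows. The name `poll_until` and its parameters are hypothetical and not part of Ray's API.

```python
import time


def poll_until(predicate, timeout_s=10.0, interval_s=0.1):
    """Call predicate() repeatedly until it returns True or timeout_s elapses.

    Hypothetical helper for illustration only; not a Ray function.
    Returns True if the predicate succeeded, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    # One last check so a condition that became true during the final
    # sleep is not reported as a timeout.
    return bool(predicate())
```

In a regression test, the predicate could wrap the `get_all_resource_usage()` call from the reproduction and return True once the demand for the killed actor no longer shows up. What exactly to inspect in the parsed message depends on the Ray version, so that part is left out here.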
cc @ericl
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 18 (18 by maintainers)
Yes, I think the version with lease_client->CancelWorkerLease looks right.
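To make the idea concrete, here is a toy model in Python of the behavior described in this issue. It is not Ray's implementation: `ToyRaylet`, `ToyActorManager`, and all their methods are hypothetical names invented for illustration. The sketch assumes that the creator of an actor keeps track of its pending lease request, and that killing a not-yet-started actor should cancel that request.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRaylet:
    """Hypothetical stand-in for a raylet holding ungranted lease requests."""
    pending_leases: set = field(default_factory=set)

    def request_worker_lease(self, actor_id):
        self.pending_leases.add(actor_id)

    def cancel_worker_lease(self, actor_id):
        # Returns True if a pending request was found and removed.
        if actor_id in self.pending_leases:
            self.pending_leases.remove(actor_id)
            return True
        return False


@dataclass
class ToyActorManager:
    """Hypothetical stand-in for whichever component schedules actors."""
    raylet: ToyRaylet
    started: set = field(default_factory=set)

    def create_actor(self, actor_id):
        # The actor stays pending until the raylet grants a worker.
        self.raylet.request_worker_lease(actor_id)

    def kill_actor(self, actor_id, cancel_pending=True):
        if actor_id in self.started:
            self.started.remove(actor_id)
        elif cancel_pending:
            # Proposed behavior: also cancel the outstanding lease request.
            self.raylet.cancel_worker_lease(actor_id)
        # With cancel_pending=False this is a silent no-op for a pending
        # actor, which models the behavior reported in this issue.


raylet = ToyRaylet()
manager = ToyActorManager(raylet)
manager.create_actor("actor_a")

manager.kill_actor("actor_a", cancel_pending=False)
print(raylet.pending_leases)  # lease request is still queued

manager.kill_actor("actor_a")
print(raylet.pending_leases)  # lease request has been removed
```

The point of the model is only that the kill path needs a handle on the pending lease so that it can be cancelled. How that maps onto the GCS and raylet code is up to the actual fix.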