ray: ray.wait() doesn't return methods completed by dead actors as ready

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
Ray installed from (source or binary): source
Ray version: 0.6.1
Python version: 3.6.6

Describe the problem

1. Launch an actor on another node.
2. Call x = actor.ping.remote().
3. Kill the node containing the actor.
4. Call ray.wait([x], timeout=0). x never becomes ready, even if ray.wait is called much later.

Expected behavior: x becomes ready and stores an exception. This is an issue when using ray.wait() to implement heartbeats for actors on multiple nodes, such as for distributed SGD.
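To make the heartbeat use case concrete, here is a minimal sketch of how such a check might look if ray.wait() behaved as expected. The helper name check_heartbeats and its parameters are hypothetical and not part of Ray. The wait and get functions are passed in (ray.wait and ray.get in real use) so the logic can be exercised without a cluster. The sketch assumes that calling get on a ping from a dead actor raises some exception, and it catches Exception generically rather than naming a specific Ray error class.

```python
def check_heartbeats(pending, wait_fn, get_fn, timeout=0):
    """Split outstanding pings into alive, dead, and still-pending actors.

    pending: dict mapping an actor name to the ObjectID of its ping call.
    wait_fn, get_fn: pass ray.wait and ray.get in real use (assumption:
    get_fn raises when the actor that owned the call has died).
    """
    alive, dead, still_pending = [], [], []
    if not pending:
        return alive, dead, still_pending

    ids = list(pending.values())
    # Non-blocking poll: ask for every result, but return immediately.
    ready, _ = wait_fn(ids, num_returns=len(ids), timeout=timeout)
    ready = list(ready)

    for name, object_id in pending.items():
        if object_id not in ready:
            still_pending.append(name)
            continue
        try:
            get_fn(object_id)
            alive.append(name)
        except Exception:
            # Expected behavior: a ping from a dead actor is ready
            # and stores an exception.
            dead.append(name)
    return alive, dead, still_pending
```

With the bug described here, an actor on a killed node would stay in still_pending indefinitely instead of moving to dead, so a heartbeat loop built this way could not detect the failure.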

Source code / logs

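import time

import ray
from ray.test.cluster_utils import Cluster

# Start a test cluster whose head node has no CPUs, so the actor
# (which requires 1 CPU) must be placed on the added node.
cluster = Cluster(True, True, head_node_args={"num_cpus": 0})
node = cluster.add_node()

@ray.remote(num_cpus=1)
class Foo:
    def ping(self):
        pass

f = Foo.remote()

# Confirm the actor is up before killing its node.
print("pinging")
ray.get(f.ping.remote())

# Submit the call whose result is polled below.
x = f.ping.remote()

print("removing node")
cluster.remove_node(node)
print("done removing node")

# x never shows up in the ready list, even though the actor is dead.
for i in range(100):
    print(i, ray.wait([x], timeout=1))
    time.sleep(1)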
CC @stephanie-wang

About this issue

State: closed
Created 5 years ago
Reactions: 1
Comments: 15 (9 by maintainers)

Most upvoted comments

Yes, this only happens when a timeout is passed, which makes the call non-blocking. If ray.wait blocks on the ObjectID, then the behavior is as expected.

pschafhalter on Jan 8, 2019