ray: ray.wait() doesn't return methods completed by dead actors as ready

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Ray installed from (source or binary): source
  • Ray version: 0.6.1
  • Python version: 3.6.6

Describe the problem

  1. Launch an actor on another node.
  2. Call x = actor.ping.remote().
  3. Kill the node containing the actor.
  4. Call ray.wait([x], timeout=0). x never becomes ready, no matter how much later ray.wait is called.

The expected behavior is that x becomes ready and stores an exception. This is a problem when using ray.wait() to implement heartbeats for actors spread across multiple nodes, for example in distributed SGD.
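Until a dead actor's result is reported as ready, a heartbeat has to treat "never ready" as "dead" by enforcing its own deadline. Below is a minimal sketch of that idea. The helper name actor_responded and its parameters are hypothetical, not part of Ray. The wait function is passed in so that ray.wait can be supplied by the caller, and the sketch assumes only that it accepts a list of IDs plus a timeout keyword and returns a (ready, remaining) pair. The unit of the timeout depends on the Ray version, so check it before choosing a nonzero poll_timeout.

```python
import time


def actor_responded(wait_fn, ping_id, deadline_s, poll_timeout=0, poll_interval_s=0.05):
    """Return True if ping_id becomes ready before deadline_s seconds elapse.

    Hypothetical helper: wait_fn is expected to behave like ray.wait,
    i.e. wait_fn([id], timeout=...) -> (ready_ids, remaining_ids).
    """
    start = time.monotonic()
    while True:
        # Non-blocking (or short) poll; an empty ready list means "not yet".
        ready, _ = wait_fn([ping_id], timeout=poll_timeout)
        if ready:
            return True
        # Give up once the deadline has passed and treat the actor as dead.
        if time.monotonic() - start >= deadline_s:
            return False
        time.sleep(poll_interval_s)
```

With Ray this might be called as actor_responded(ray.wait, f.ping.remote(), deadline_s=10). A True result only says the object is ready; fetching it with ray.get would still be needed to surface any stored exception.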

Source code / logs

import time

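import ray
from ray.test.cluster_utils import Cluster

# Start a head node with no CPUs and connect this driver to it, so the
# actor below (which needs 1 CPU) must be placed on the added node.
cluster = Cluster(True, True, head_node_args={"num_cpus": 0})
node = cluster.add_node()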

@ray.remote(num_cpus=1)
class Foo:
    def ping(self):
        pass

f = Foo.remote()

print("pinging")
ray.get(f.ping.remote())

x = f.ping.remote()

print("removing node")
cluster.remove_node(node)
print("done removing node")

# x is never reported as ready, even though the actor's node is gone.
for i in range(100):
    print(i, ray.wait([x], timeout=1))
    time.sleep(1)

CC @stephanie-wang

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 15 (9 by maintainers)

Most upvoted comments

Yes, this only happens when ray.wait is called with a timeout, which makes the call non-blocking. If ray.wait blocks on the ObjectID instead, the behavior is as expected.