ray: ray.wait() doesn't return methods completed by dead actors as ready
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Ray installed from (source or binary): source
- Ray version: 0.6.1
- Python version: 3.6.6
Describe the problem
- launch an actor on another node
- x = actor.ping.remote()
- kill the node containing the actor
- ray.wait([x], timeout=0)
x will never become ready, even if called much later.
Expected behavior is that x will become ready and store an exception.
This is an issue when adding heartbeats for actors on multiple nodes using ray.wait(), such as for distributed SGD.
Source code / logs
import time

import ray
from ray.test.cluster_utils import Cluster

cluster = Cluster(True, True, head_node_args={"num_cpus": 0})
node = cluster.add_node()


@ray.remote(num_cpus=1)
class Foo:
    def ping(self):
        pass


f = Foo.remote()
print("pinging")
ray.get(f.ping.remote())
x = f.ping.remote()
print("removing node")
cluster.remove_node(node)
print("done removing node")
for i in range(100):
    print(i, ray.wait([x], timeout=1))
    time.sleep(1)
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 15 (9 by maintainers)
Yes, this only happens when a timeout is passed, which makes the call non-blocking. If
ray.wait blocks on the ObjectID, then the behavior is as expected.
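For the heartbeat use case mentioned in the report, one possible workaround is to stop relying on the ping ever becoming ready and instead treat a ping that stays not-ready past a deadline as a lost actor. The sketch below is only an illustration under stated assumptions, not something from Ray itself: `split_by_heartbeat`, `deadline_s`, and `fake_wait` are hypothetical names, the ping handles are assumed to be hashable, and `wait_fn` is assumed to follow the calling convention of `ray.wait` (returning a `(ready, not_ready)` pair). A stand-in wait function is used so the snippet runs without a cluster.

```python
import time


def split_by_heartbeat(pending, wait_fn, deadline_s, now=None):
    """Classify outstanding pings as answered, still waiting, or overdue.

    pending:    dict mapping a ping handle (e.g. the ObjectID returned by
                actor.ping.remote()) to the time.monotonic() value at send time.
    wait_fn:    callable used like ray.wait:
                wait_fn(handles, num_returns=n, timeout=0) -> (ready, not_ready).
    deadline_s: hypothetical tuning knob; a ping older than this many seconds
                that is still not ready is presumed lost.
    """
    if not pending:
        return [], [], []
    now = time.monotonic() if now is None else now
    handles = list(pending)
    # Non-blocking poll; num_returns is set so every ready handle is reported.
    ready, not_ready = wait_fn(handles, num_returns=len(handles), timeout=0)
    waiting = [h for h in not_ready if now - pending[h] <= deadline_s]
    overdue = [h for h in not_ready if now - pending[h] > deadline_s]
    return list(ready), waiting, overdue


def fake_wait(handles, num_returns=1, timeout=None):
    """Stand-in for ray.wait in this demo: only the ping named "a" completes."""
    ready = [h for h in handles if h == "a"]
    not_ready = [h for h in handles if h != "a"]
    return ready, not_ready


# Pings sent at t=0, t=9 and t=1, checked at t=10 with a 5 second deadline.
sent = {"a": 0.0, "b": 9.0, "c": 1.0}
alive, waiting, dead = split_by_heartbeat(sent, fake_wait, deadline_s=5.0, now=10.0)
print(alive, waiting, dead)
```

With a real cluster, `ray.wait` would be passed as `wait_fn` and the dictionary keys would be the ObjectIDs returned by `actor.ping.remote()`. The deadline has to be chosen by the caller, since a slow actor and a dead one look the same to this check.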