ray: [core] Fail submitted tasks immediately when an actor is temporarily unavailable

Currently, when an actor has max_restarts > 0 and crashes, it enters the RESTARTING state and then eventually returns to ALIVE. Imagine this scenario: an online service exposes an HTTP endpoint, and a proxy actor receives requests, forwards them to worker actors, and replies to clients with the execution results from the worker actors.

                                                        -> Worker A (actor)
                                                       /
                                                      /
HTTP requests -------> Proxy (actor with HTTP server) ---> Worker B (actor)
                                                      \
                                                       \
                                                        -> ...

For each HTTP request, the proxy picks one worker (e.g. worker A) based on some algorithm, sends the request to it, and calls ray.get() to wait for the result. If for some reason the picked worker crashes, Ray will restart the actor, and ray.get() will throw an error. The proxy can then pick another worker (e.g. worker B) and re-send the request to it. This is OK.
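For illustration, here is a minimal sketch of that retry flow (the Worker class, the forward helper, and the simple fallback loop are illustrative, not from the issue):

```python
import ray
from ray.exceptions import RayActorError

@ray.remote(max_restarts=1)
class Worker:
    def handle(self, request):
        return f"result for {request}"

def forward(request, workers):
    for worker in workers:  # stand-in for "pick one worker by some algorithm"
        try:
            return ray.get(worker.handle.remote(request))
        except RayActorError:
            # The picked worker crashed mid-call; Ray restarts it in the
            # background, and the proxy falls back to the next worker.
            continue
    raise RuntimeError("no worker could serve the request")
```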

But new requests keep coming, and the proxy may pick worker A again. Because worker A is still in the RESTARTING state, it's not ready to serve requests. This time, ray.get() will hang until worker A is back online (in the ALIVE state). The proxy can't reschedule the request to another worker because it doesn't know whether worker A is alive or not: from the proxy's point of view, a hanging ray.get() could simply mean the execution on the worker is taking a long time.

To solve this issue, in Ant's internal repo we added a new flag when creating an actor that controls how newly submitted tasks are handled while the actor is in the RESTARTING state. By default, when the callee actor is RESTARTING, newly submitted tasks are queued in the direct actor transport and only sent once the ALIVE state notification is received. With the flag turned on, newly submitted tasks instead fail immediately with a RayActorError. This makes the fallback very quick when a worker is temporarily unavailable, and it improved the P99 latency of the online service.
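A hedged sketch of what the proposed behavior could look like from the caller's side ("fail_if_restarting" is a made-up option name; the issue does not say what the internal flag is called):

```python
import ray
from ray.exceptions import RayActorError

# Hypothetical API: fail_if_restarting is an assumed name for the new flag.
@ray.remote(max_restarts=1, fail_if_restarting=True)
class Worker:
    def handle(self, request):
        return f"result for {request}"

worker = Worker.remote()
try:
    result = ray.get(worker.handle.remote("req-1"))
except RayActorError:
    # With the flag on, this is raised immediately while the actor is
    # RESTARTING, instead of the default behavior of queueing the task
    # until the actor is ALIVE again.
    pass  # fall back to another worker right away
```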

We hope we can contribute this back to the community. @rkooo567 @stephanie-wang what do you think?

cc @raulchen @ffbin @jovany-wang @MissiontoMars

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 16 (13 by maintainers)

Most upvoted comments

@edoakes Oh, I see. max_retries is for tasks and max_task_retries is for actors, right?
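For reference, the two options do apply at different levels in Ray's public API (a minimal sketch):

```python
import ray

# max_retries controls retries of normal (stateless) tasks:
@ray.remote(max_retries=3)
def flaky_task():
    ...

# max_task_retries controls retries of an actor's method calls after the
# actor restarts (it only takes effect when max_restarts allows restarts):
@ray.remote(max_restarts=1, max_task_retries=2)
class FlakyActor:
    def method(self):
        ...
```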

I think letting users maintain a map of alive actors is error-prone. Users may not have the knowledge to implement it correctly.

@kfstorm @raulchen the problem makes a ton of sense (this is actually an existing problem for Ray Serve cc @simon-mo ) and the solution sounds pretty reasonable to me.

One question I have: how does this interact with the max_task_retries mechanism that we have for actors? In the situation you described, we probably don’t want max_task_retries to be turned on at all, so maybe they should just be mutually exclusive.
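If the two were made mutually exclusive, the check could be as simple as the following sketch (again, "fail_if_restarting" is a hypothetical name for the proposed flag):

```python
# Hypothetical validation sketch. Automatically retrying a method call
# (max_task_retries) contradicts failing it immediately while the actor
# restarts, so the combination is rejected up front.
def validate_actor_options(max_task_retries=0, fail_if_restarting=False):
    if fail_if_restarting and max_task_retries != 0:
        raise ValueError(
            "fail_if_restarting and max_task_retries are mutually exclusive"
        )
```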