ray: Duplicated IDs are generated
What is the problem?
The string generated by GenerateUniqueBytes is not really unique. https://github.com/ray-project/ray/blob/0178d6318ead643fff6d9ed873c88f71fdac52e7/src/ray/common/id.cc#L40-L53
Ray version and other system information (Python version, TensorFlow version, OS): 1.0.1
Reproduction (REQUIRED)
Apply the patch below and run bazel run //:id_test:
diff --git a/src/ray/common/id_test.cc b/src/ray/common/id_test.cc
index 926e6fbd11..f1274d92a5 100644
--- a/src/ray/common/id_test.cc
+++ b/src/ray/common/id_test.cc
@@ -51,6 +51,15 @@ TEST(ActorIDTest, TestActorID) {
     const ActorID actor_id = ActorID::Of(kDefaultJobId, kDefaultDriverTaskId, 1);
     ASSERT_EQ(kDefaultJobId, actor_id.JobId());
   }
+
+  {
+    // test no duplicated ID
+    std::unordered_set<ActorID> ids;
+    for (size_t i = 0; i < 1000000; i++) {
+      auto id = ActorID::Of(kDefaultJobId, kDefaultDriverTaskId, i);
+      RAY_CHECK(ids.insert(id).second) << "Duplicated ID generated: " << id;
+    }
+  }
 }

 TEST(TaskIDTest, TestTaskID) {
You will see the error:
[2020-11-20 05:37:07,442 C 104416 104416] id_test.cc:60: Check failed: ids.insert(id).second Duplicated ID generated: 584a7e60c7000000
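The same effect can be reproduced outside Ray with a small stand-alone sketch. It assumes IDs are made by hashing an index and keeping only a few bits of the result. Mix64, TruncatedId and CountDuplicates are hypothetical names, and the mixer is a SplitMix64-style stand-in, not the hash used by GenerateUniqueBytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Stand-in 64-bit mixer (SplitMix64 finalizer). This is NOT Ray's hash;
// it is only here to give well-scattered bits for the illustration.
inline std::uint64_t Mix64(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Hypothetical helper: keep only the low `bits` bits of the hash as the ID.
inline std::uint64_t TruncatedId(std::uint64_t index, unsigned bits) {
  const std::uint64_t h = Mix64(index);
  return bits >= 64 ? h : (h & ((std::uint64_t{1} << bits) - 1));
}

// Counts how many IDs generated for indices [0, count) were already seen.
inline std::size_t CountDuplicates(std::size_t count, unsigned bits) {
  std::unordered_set<std::uint64_t> seen;
  seen.reserve(count);
  std::size_t duplicates = 0;
  for (std::size_t i = 0; i < count; ++i) {
    if (!seen.insert(TruncatedId(i, bits)).second) ++duplicates;
  }
  return duplicates;
}
```

Calling CountDuplicates(1000000, 32) mimics the test above under the assumption of a 32-bit unique part. With the full 64 bits this particular mixer is a bijection, so distinct indices never collide. The duplicates come from the truncation, not from the hashing step.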
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: open
- Created 4 years ago
- Reactions: 1
- Comments: 15 (12 by maintainers)
I see.
There are 2 comments here;
We have been suffering from this issue for months. It is costly, and we are working on an external patch that automates restarts to avoid it. Unfortunately, the user experience is that Ray is unstable.
We run an evolutionary algorithm that performs fitness calculations as Ray actors. These actors are distributed across 100+ 2-CPU EC2 instances, with 200+ actors launched multiple times per second. We run this workload for days at a time, so running 50,000,000+ actors on one cluster for a single workload is very common for us.
Is there a known workaround to not create a new actor ID for each new actor?
@rkooo567 I agree with your analysis. I think we need to either come up with a better ID generation algorithm (I'm not sure if that's possible) or add some kind of communication between nodes/workers to reduce the chance of conflict. Although 100000 seems like a pretty large number of actors, we still need to be cautious because a small group of users may have already been suffering from this for a while. And in Ant, we have some applications which create and destroy actors periodically, so it's only a matter of time.
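To put rough numbers on "only a matter of time", here is a minimal sketch of the standard birthday-bound estimate. It assumes IDs behave like uniform random draws and that the unique part of an ActorID is 4 bytes (32 bits). The 32-bit figure is an inference from the 16-hex-digit ID in the error message, where the trailing c7000000 looks like the job ID; the issue does not state it. CollisionProbability and ExpectedCollidingPairs are hypothetical helper names.

```cpp
#include <cmath>

// Approximate probability that at least two of n IDs collide when each ID is
// drawn uniformly from 2^bits values: 1 - exp(-n(n-1) / 2^(bits+1)).
inline double CollisionProbability(double n, int bits) {
  const double space = std::ldexp(1.0, bits);  // 2^bits
  return -std::expm1(-n * (n - 1.0) / (2.0 * space));
}

// Expected number of colliding pairs among n IDs: (n(n-1)/2) / 2^bits.
inline double ExpectedCollidingPairs(double n, int bits) {
  return n * (n - 1.0) / (2.0 * std::ldexp(1.0, bits));
}
```

Under these assumptions, CollisionProbability(100000, 32) comes out at roughly 0.69, and ExpectedCollidingPairs(1000000, 32) is about 116, which is consistent with the reproduction failing well before the loop ends. Evaluating the same functions with a larger bits value shows how quickly a wider unique part pushes the risk down.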