orleans: Redis grain directory key not being unregistered when target silo is known to be dead
Hi,
We’re using MS Orleans 3.4.1 to set up a multi-silo Orleans cluster within AWS with the following characteristics:
- 6 silos running under ECS fargate tasks
- DynamoDB used for grain, reminder and cluster info persistence
- Microsoft.Orleans.GrainDirectory.Redis v3.4.1-beta1 used for the grain directory
- Redis cluster with 4 nodes
We recently observed the following error ("Target silo is known to be dead") while the cluster was trying to activate a grain:
No target activation "S10.109.3.136:11111:355758997*cli/1db42ab3@965a24ba" for response message: "Transient Rejection (info: Target silo is known to be dead)
Response S10.109.1.110:11111:354843009*grn/Namespace.Ommited/0+grain_id_ommited@8d48169f->S10.109.3.136:11111:355758997*cli/1db42ab3@965a24ba #1349313"
After inspecting the Redis directory key for the affected grain, I verified the following:
- the Redis record was indeed pointing to a silo (S10.109.1.110) that was no longer running; its contents (escaping removed):
{"GrainId":"00000000000000000000000000000000060000005d645190+grain_id_ommited","ActivationId":"48e8f9a7c51701350cb946358d48169f0000000000000000","SiloAddress":"10.109.1.110:11111@354843009"}
- the silo was indeed marked as Dead (i.e. Status = 6) on the cluster info storage
We often perform rolling deployments of new versions of the cluster, and we expect Orleans to stop silos and mark them as dead while new instances of the service spin up.
From looking at the CachedGrainLocator implementation, I would expect the Redis key to be unregistered during lookup if the directory record still points to a silo that has since been stopped. However, the error above seems to indicate that this is not happening.
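To make the expectation concrete, here is a minimal sketch (a hypothetical Python model, not the actual Orleans C# source) of the lookup behavior described above: a directory entry pointing at a silo that cluster membership records as dead should be unregistered rather than returned, so the grain can be re-activated elsewhere.

```python
# Hypothetical model of the expected CachedGrainLocator lookup behavior.
# All names here (lookup, directory, membership) are illustrative,
# not Orleans APIs.

DEAD = "Dead"  # corresponds to Status = 6 in the cluster info storage

def lookup(grain_id, directory, membership):
    """Return the registered activation entry for grain_id, or None."""
    entry = directory.get(grain_id)
    if entry is None:
        return None
    silo = entry["SiloAddress"]
    # Expected behavior: if membership says the silo is dead, drop the
    # stale registration instead of handing out a dead address.
    if membership.get(silo) == DEAD:
        del directory[grain_id]  # unregister the stale directory entry
        return None
    return entry

# Stale Redis entry pointing at a silo marked Dead in membership:
directory = {"grain-1": {"SiloAddress": "10.109.1.110:11111@354843009"}}
membership = {"10.109.1.110:11111@354843009": DEAD}

assert lookup("grain-1", directory, membership) is None
assert "grain-1" not in directory  # stale key was removed
```

In the failure reported above, the equivalent of the `del` step apparently never happens, so every activation attempt keeps hitting the dead address.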
Restarting the cluster did not solve the issue. Trying to activate the grain multiple times still resulted in the same error.
To resolve the issue I had to manually delete the Redis key for this grain.
Would anyone please be able to shed some light on this? Unfortunately I haven’t been able to reproduce this locally.
Thanks very much, please let me know if you would like me to share some more info on the matter.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 3
- Comments: 17 (8 by maintainers)
The cause could be that the MembershipTableCleanupAgent is being run. Wouldn’t that clean dead silos from the membership table? If a silo is restarted afterwards, it wouldn’t know about the dead silos and would thus see the Redis grain entries as alive. To test this you could shut down the cluster, clean the membership table of whichever provider you use, and start it back up. If my theory holds, all Redis entries are stuck.
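The theory above can be illustrated with a small hedged model (hypothetical Python, not Orleans code): once the cleanup agent has purged a dead silo's row from the membership table, a freshly restarted cluster has no record that the silo ever existed, so a stale Redis directory entry pointing at it can no longer be identified as dead and unregistered.

```python
# Hypothetical model of the cleanup theory. Names are illustrative.

def silo_is_known_dead(silo, membership):
    """A stale directory entry is only purged if membership still
    records the target silo as Dead."""
    return membership.get(silo) == "Dead"

dead_silo = "10.109.1.110:11111@354843009"

# Before cleanup: the Dead row exists, so the stale entry is detectable.
membership = {dead_silo: "Dead"}
assert silo_is_known_dead(dead_silo, membership)

# MembershipTableCleanupAgent removes defunct rows; after a full
# cluster restart the new silos never saw the dead silo at all.
membership.pop(dead_silo)
assert not silo_is_known_dead(dead_silo, membership)
# The stale Redis entry now looks like it points at an unknown-but-
# possibly-alive silo, so it is never cleaned up: it is "stuck".
```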
Hopefully next week