orleans: local cluster shutdown stuck 30 second
setup:
- memory storage + local cluster
- API gate + direct client + silo host in one process
- orleans version: b0fa540
2018-10-25 14:47:37.4937|INFO|Orleans.Runtime.SiloLifecycleSubject|Lifecycle observer Orleans.Runtime.Silo stopped in stage 20000 which took 3 Milliseconds.
2018-10-25 14:47:37.4937|INFO|Orleans.Runtime.SiloLifecycleSubject|Stopping lifecycle stage 20000 took 6.2884 Milliseconds
2018-10-25 14:47:37.4937|INFO|Orleans.Runtime.Silo|Silo starting to Shutdown()
2018-10-25 14:47:37.4937|INFO|Orleans.Runtime.MembershipService.MembershipOracleData|-ShutDown
2018-10-25 14:47:37.5063|INFO|Orleans.Runtime.Catalog|DeactivateAllActivations.
2018-10-25 14:47:37.5063|INFO|Orleans.Threading.ThreadPoolThread|Stopping Thread Runtime.GrainDirectory.GlobalSingleInstanceActivationMaintainer0 on managed thread 53
2018-10-25 14:47:37.5063|INFO|Orleans.Runtime.Catalog|DeactivateActivations: total 5 to shutdown, out of them 5 promptly, 0 later when become idle and 0 are already being destroyed or invalid.
2018-10-25 14:47:37.5063|INFO|Orleans.Runtime.Catalog|Starting DestroyActivations #0 of 5 activations
2018-10-25 14:47:37.5277|INFO|Default|Room 132080 stopping, reason stop by room agent
2018-10-25 14:47:37.5277|INFO|Default|Cleanup Room 132080
2018-10-25 14:47:37.5377|WARN|Orleans.Runtime.Catalog|UnregisterManyAsync 1 failed. System.InvalidOperationException: Grain directory is stopping
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.CheckIfShouldForward(GrainId grainId, Int32 hopCount, String operationDescription) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 560
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterOrPutInForwardList(IEnumerable`1 addresses, UnregistrationCause cause, Int32 hopCount, Dictionary`2& forward, List`1 tasks, String context) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 721
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterManyAsync(List`1 addresses, UnregistrationCause cause, Int32 hopCount) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 768
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\Scheduler\ClosureWorkItem.cs:line 63
at Orleans.Runtime.Catalog.FinishDestroyActivations(List`1 list, Int32 number, MultiTaskCompletionSource tcs) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\Catalog\Catalog.cs:line 1013
2018-10-25 14:47:37.5604|INFO|Orleans.Runtime.Catalog|Done FinishDestroyActivations #0 - Destroyed 1 Activations.
2018-10-25 14:47:43.1263|ERROR|Orleans.Runtime.ClientObserverRegistrar|Directory.RegisterAsync S127.0.0.1:11111:278146050*cli/069a77af@d38dd7f2 failed. System.InvalidOperationException: Grain directory is stopping
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.CheckIfShouldForward(GrainId grainId, Int32 hopCount, String operationDescription) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 560
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.RegisterAsync(ActivationAddress address, Boolean singleActivation, Int32 hopCount) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 578
at Orleans.OrleansTaskExtentions.LogException(Task task, ILogger logger, ErrorCode errorCode, String message) in C:\Users\a\source\repos\orleans\src\Orleans.Core\Async\TaskExtensions.cs:line 138
2018-10-25 14:47:43.1325|ERROR|Orleans.Runtime.ClientObserverRegistrar|OnClientRefreshTimer has thrown an exceptions. System.InvalidOperationException: Grain directory is stopping
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.CheckIfShouldForward(GrainId grainId, Int32 hopCount, String operationDescription) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 560
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.RegisterAsync(ActivationAddress address, Boolean singleActivation, Int32 hopCount) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 578
at Orleans.OrleansTaskExtentions.LogException(Task task, ILogger logger, ErrorCode errorCode, String message) in C:\Users\a\source\repos\orleans\src\Orleans.Core\Async\TaskExtensions.cs:line 138
at Orleans.Runtime.ClientObserverRegistrar.OnClientRefreshTimer(Object data) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\ClientObserverRegistrar.cs:line 136
2018-10-25 14:48:07.8081|WARN|Orleans.Runtime.CallbackData|Response did not arrive on time in 00:00:30 for message: Request S127.0.0.1:11111:278146050*grn/736BA21C/00000001@b56bd5bc->S127.0.0.1:11111:278146050*grn/716E8E94/00000001+ClientMessageService_ClientMessageService@12fdf888 #45: . Target History is: <S127.0.0.1:11111:278146050:*grn/716E8E94/00000001+ClientMessageService_ClientMessageService:@12fdf888>. About to break its promise.
2018-10-25 14:48:07.8081|WARN|Orleans.Providers.Streams.SimpleMessageStream.SimpleMessageStreamProducer`1[[LobbyGrains.Core.Messages.ClientMessage, LobbyGrains, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]|Ignoring unhandled exception during PubSub.UnregisterProducer System.TimeoutException: Response did not arrive on time in 00:00:30 for message: Request S127.0.0.1:11111:278146050*grn/736BA21C/00000001@b56bd5bc->S127.0.0.1:11111:278146050*grn/716E8E94/00000001+ClientMessageService_ClientMessageService@12fdf888 #45: . Target History is: <S127.0.0.1:11111:278146050:*grn/716E8E94/00000001+ClientMessageService_ClientMessageService:@12fdf888>.
at Orleans.Providers.Streams.SimpleMessageStream.SimpleMessageStreamProducer`1.Cleanup() in C:\Users\a\source\repos\orleans\src\Orleans.Core\Streams\SimpleMessageStream\SimpleMessageStreamProducer.cs:line 157
2018-10-25 14:48:07.8081|WARN|Orleans.Runtime.Catalog|UnregisterManyAsync 4 failed. System.InvalidOperationException: Grain directory is stopping
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.CheckIfShouldForward(GrainId grainId, Int32 hopCount, String operationDescription) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 560
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterOrPutInForwardList(IEnumerable`1 addresses, UnregistrationCause cause, Int32 hopCount, Dictionary`2& forward, List`1 tasks, String context) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 721
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterManyAsync(List`1 addresses, UnregistrationCause cause, Int32 hopCount) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 768
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\Scheduler\ClosureWorkItem.cs:line 63
at Orleans.Runtime.Catalog.FinishDestroyActivations(List`1 list, Int32 number, MultiTaskCompletionSource tcs) in C:\Users\a\source\repos\orleans\src\Orleans.Runtime\Catalog\Catalog.cs:line 1013
2018-10-25 14:48:07.8081|INFO|Orleans.Runtime.Dispatcher|Forwarding 1 requests destined for address S127.0.0.1:11111:278146050*grn/716E8E94/00000001+ClientMessageService_ClientMessageService@12fdf888 to address after Finished Destroy Activation.
2018-10-25 14:48:07.8081|INFO|Orleans.Runtime.Catalog|Done FinishDestroyActivations #0 - Destroyed 4 Activations.
2018-10-25 14:48:07.8081|INFO|Orleans.Runtime.Dispatcher|Trying to forward after Finished Destroy Activation, ForwardCount = 0. Message Request S127.0.0.1:11111:278146050*grn/736BA21C/00000001@b56bd5bc->S127.0.0.1:11111:278146050*grn/716E8E94/00000001+ClientMessageService_ClientMessageService@12fdf888 #45: .
2018-10-25 14:48:07.8188|INFO|Orleans.Threading.ThreadPoolThread|Stopping Thread Runtime.Messaging.GatewayAcceptor0 on managed thread 55
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.SiloLifecycleSubject|Lifecycle observer Orleans.Runtime.Silo stopped in stage 19999 which took 30324 Milliseconds.
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.SiloLifecycleSubject|Stopping lifecycle stage 19999 took 30324.2329 Milliseconds
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.SiloLifecycleSubject|Lifecycle observer Orleans.Runtime.Versions.GrainVersionStore stopped in stage 10000 which took 0 Milliseconds.
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.SiloLifecycleSubject|Stopping lifecycle stage 10000 took 0.1231 Milliseconds
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.SiloLifecycleSubject|Lifecycle observer Orleans.Runtime.Silo stopped in stage 8000 which took 0 Milliseconds.
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.SiloLifecycleSubject|Stopping lifecycle stage 8000 took 0.0793 Milliseconds
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.SiloLifecycleSubject|Lifecycle observer Orleans.Runtime.Silo stopped in stage 4000 which took 1 Milliseconds.
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.SiloLifecycleSubject|Stopping lifecycle stage 4000 took 1.8309 Milliseconds
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.SiloLifecycleSubject|Lifecycle observer Orleans.Runtime.InsideRuntimeClient stopped in stage 2000 which took 0 Milliseconds.
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.Silo|Silo is Stopped()
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.MembershipService.MembershipOracleData|-KillMyself
2018-10-25 14:48:07.8188|INFO|Orleans.Threading.ThreadPoolThread|Stopping Thread Runtime.Messaging.IncomingMessageAgent/System0 on managed thread 50
2018-10-25 14:48:07.8188|INFO|Orleans.Threading.ThreadPoolThread|Stopping Thread Runtime.Messaging.IncomingMessageAgent/Application0 on managed thread 51
2018-10-25 14:48:07.8188|INFO|Orleans.Threading.ThreadPoolThread|Stopping Thread Runtime.Messaging.IncomingMessageAgent/Ping0 on managed thread 49
2018-10-25 14:48:07.8188|INFO|Orleans.Threading.ThreadPoolThread|Stopping Thread Runtime.Watchdog0 on managed thread 54
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.ActivationDirectory|ActivationDirectory.PrintActivationDirectory(): Size = 2, Directory:
[++ [Activation: S127.0.0.1:11111:278146050*grn/30D642F0/000000000000000000000000000000010300000030d642f0-0x53883087@43f10f9d #GrainType=Orleans.Storage.MemoryStorageGrain Placement=RandomPlacement State=Valid NonReentrancyQueueSize=0 EnqueuedOnDispatcher=0 InFlightCount=0 NumRunning=0 IdlenessTimeSpan=00:00:32.6087081 CollectionAgeLimit=02:00:00]
++ [Activation: S127.0.0.1:11111:278146050*grn/30D642F0/000000000000000000000000000000080300000030d642f0-0x3C74C8EA@5e82ecad #GrainType=Orleans.Storage.MemoryStorageGrain Placement=RandomPlacement State=Valid NonReentrancyQueueSize=0 EnqueuedOnDispatcher=0 InFlightCount=0 NumRunning=0 IdlenessTimeSpan=00:00:30.3045059 CollectionAgeLimit=02:00:00]]
2018-10-25 14:48:07.8188|INFO|Orleans.Threading.ThreadPoolThread|Stopping Thread Runtime.Messaging.IncomingMessageAcceptor0 on managed thread 46
2018-10-25 14:48:07.8188|INFO|Orleans.Threading.ThreadPoolThread|Stopping Thread Runtime.Messaging.SiloMessageSender/PingSender0 on managed thread 47
2018-10-25 14:48:07.8188|INFO|Orleans.Threading.ThreadPoolThread|Stopping Thread Runtime.Messaging.SiloMessageSender/SystemSender0 on managed thread 48
2018-10-25 14:48:07.8188|INFO|Orleans.Runtime.CounterStatistic|Stopping Windows perf counter stats collection
2018-10-25 14:48:07.8432|INFO|Orleans.RuntimeSiloLogStatistics|Statistics: ^^^
App.Requests.Latency.Average.Millis=734.000
App.Requests.LatencyHistogram.Millis=[0.0128:0.0255]=2, [0.0256:0.0511]=3, [0.0512:0.1023]=3, [0.1024:0.2047]=2, [0.2048:0.4095]=2, [0.4096:0.8191]=2, [0.8192:1.6383]=8, [1.6384:3.2767]=6, [3.2768:6.5535]=2, [6.5536:13.1071]=4, [13.1072:26.2143]=2, [52.4288:104.8575]=2, [104.8576:209.7151]=1, [209.7152:419.4303]=3, [26843.5456:53687.0911]=1,
App.Requests.TimedOut.Current=1
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 16 (13 by maintainers)
I have been working on this issue for a few days and have made some progress. Here are the preconditions:
Now, if I try to stop the silo. First, user-define grain will be deactive and it’s state will be set to “Deactivating” (In fact, only the non-system-target activation will be deactive and be destroyed there. In my case, the user-define grain and the stream grain’s activation are all non-system-target).
https://github.com/dotnet/orleans/blob/411b581864b13703773fbf7e5015cd0da9466c9f/src/Orleans.Runtime/Catalog/Catalog.cs#L859
Then, the deactivation progress continue. The user-define grain find itself has used Stream so decide to deactivate Stream resources.
https://github.com/dotnet/orleans/blob/411b581864b13703773fbf7e5015cd0da9466c9f/src/Orleans.Runtime/Catalog/Catalog.cs#L1206
https://github.com/dotnet/orleans/blob/411b581864b13703773fbf7e5015cd0da9466c9f/src/Orleans.Core/Streams/SimpleMessageStream/SimpleMessageStreamProducer.cs#L156
Actually, the “UnregisterProducer()” function will send a request message to target silo(just the only silo) and wait for its response. When the message arrives at the silo’s IncomingMessageAgent, the agent will try to find the target activation.
https://github.com/dotnet/orleans/blob/411b581864b13703773fbf7e5015cd0da9466c9f/src/Orleans.Runtime/Messaging/IncomingMessageAgent.cs#L108
But we have set the silo’s activation state to “Deactivating”. The message target and context will be null. If we are not the last silo in the cluster, the GSI protocol will creates a new instance on another silo. Unfortunately, we are the last one. So the message’s response will never be sent and the UnregisterProducer() function will block until a message time out.
By the way, the default response timeout is 30s, so that’s why the silo stop() function will block 30s(if you use Visual Studio to Debug it, it will block even 30mins)!
https://github.com/dotnet/orleans/blob/411b581864b13703773fbf7e5015cd0da9466c9f/src/Orleans.Core/Configuration/Options/MessagingOptions.cs#L21
So, here are all the things happen during the silo stop. If I have something wrong, please correct me.
PS: I still have some points want to know further: