orleans: OrleansMessageRejection exception and Orleans stream messages stuck in azure storage queue
I’ve encountered this a couple of times in the last week and a half. I’ll deploy a new revision of my Orleans application, and within a couple of days silos become unavailable and messages become undeliverable on some instances. The problematic silos do not recover; I have to restart the cluster to resolve the issue.
When one or more of the 9 silos gets into this state where grain messages can’t be delivered, Orleans stream messages pushed to the queue also get stuck until I restart the cluster (Container Apps environment). The last few times this occurred, it seemed to follow shortly after a new release.
I’d appreciate some further guidance on tracking down the issue here.
Here are some further observations:
- Silo running in an Azure Container Apps environment
- Last revision was deployed 2023-07-07T20:24:08Z
- Very little load over the weekend; then Sunday night, for seemingly no reason, silos are terminating and messages can’t be delivered from the Orleans storage queue
- Running Orleans 7.1.2
- Silo exits with code 1
- Container logs:
  - 2023-07-10T05:32:23.4393447Z: Container silo failed liveness probe, will be restarted
  - 2023-07-10T05:34:30.094215Z: Container 'silo' was terminated with exit code '1'
- The most closely related exceptions:
  - 2023-07-10T05:34:20.2993575Z: Orleans.Runtime.OrleansMessageRejectionException
    Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S100.100.0.120:11111:47972033. See InnerException ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 100.100.0.120:11111. Error: ConnectionRefused
    at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 54
    at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61
    at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
    --- End of inner exception stack trace ---
    at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
    at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
    at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
  - 2023-07-10T05:34:23.1327681Z: System.ObjectDisposedException at Orleans.Serialization.Serializers.CodecProvider.GetServiceOrCreateInstance
    Cannot access a disposed object. Object name: 'IServiceProvider'.
  - 2023-07-10T05:37:08.2696465Z: System.InvalidOperationException at Orleans.Runtime.ActivationData.StartDeactivating
    Calling DeactivateOnIdle from within OnActivateAsync is not supported
As of 2023-07-10T20:48:15.9391397Z I am still seeing Orleans.Runtime.OrleansMessageRejectionException, and there are 34k Orleans messages stuck in queue-1.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 18 (7 by maintainers)
I was referring to the MessageRejectionException. The streaming infrastructure uses some internal grains, called PubSubRendezVousGrain. Here it seems the directory is in a bad state, and the cluster isn’t able to create a new activation of the PubSubRendezVousGrain for some streams. It would be interesting to see if you have more directory-related logs.
Also, when you scale your cluster up, do you see any silos dying in the meantime?
The v7.2.3 release which aims to fix this is now available, so I will close this but please open a new issue and reference this if you still encounter this issue: https://github.com/dotnet/orleans/releases/tag/v7.2.3
We’ve merged https://github.com/dotnet/orleans/pull/8696 & https://github.com/dotnet/orleans/pull/8704 to fix this issue. We will create a release shortly.
Just an aside, but this looks suspicious… @iamsamcoder: FromSeconds against collectionAgeMinutes? This could contribute to a very spammy DHT.
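The aside above is presumably pointing at a unit mismatch: feeding a value expressed in minutes to TimeSpan.FromSeconds produces an interval 60x shorter than intended, so idle grains would be collected, and their directory entries churned, far more often than configured. A minimal sketch of the suspected pattern (the surrounding code is hypothetical; only the FromSeconds/collectionAgeMinutes pairing comes from the comment):

```csharp
using System;

// Intended idle-collection age: 10 minutes.
double collectionAgeMinutes = 10;

// Suspicious pattern: the minutes value is passed to FromSeconds,
// yielding a 10-second window instead of a 10-minute one.
TimeSpan wrong = TimeSpan.FromSeconds(collectionAgeMinutes); // 00:00:10

// What was presumably intended:
TimeSpan right = TimeSpan.FromMinutes(collectionAgeMinutes); // 00:10:00
```

Deactivating activations 60x too often would force constant unregistration/re-registration in the distributed grain directory, which fits the “very spammy DHT” remark.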
I don’t want to prematurely celebrate, but I think this problem went away when I changed the grain directory of the IPubSubRendezvousGrain to use Redis instead of the default directory. We had been seeing this issue many times a day since we started pushing anything but test data through our streams, but after changing the grain directory it has not occurred a single time today. Due to #8632 this wasn’t super straightforward, but I managed to solve it by adding a named Redis grain directory and registering a custom Orleans.Runtime.GrainDirectory.IGrainDirectoryResolver in the service collection. I guess this points to it being a grain directory issue and not directly a streaming issue, but as already mentioned, the only grain in our cluster having this issue was the IPubSubRendezvousGrain.
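The commenter’s actual registration snippet did not survive here. As a rough sketch of what such a setup might look like: the directory name "pubsub-directory", the resolver class, the type-name matching heuristic, and the named-service lookup are all assumptions on my part, and the IGrainDirectoryResolver method shown is my recollection of the Orleans 7 API, so verify against the version in use:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Orleans.GrainDirectory;
using Orleans.Runtime;
using Orleans.Runtime.GrainDirectory;
using StackExchange.Redis;

// Hypothetical resolver: routes the internal pub/sub rendezvous grains to a
// named Redis grain directory and lets every other grain type fall through
// to the default directory.
internal sealed class PubSubRedisDirectoryResolver : IGrainDirectoryResolver
{
    private readonly IServiceProvider _services;

    public PubSubRedisDirectoryResolver(IServiceProvider services) => _services = services;

    public bool TryResolveGrainDirectory(GrainType grainType, GrainProperties properties, out IGrainDirectory grainDirectory)
    {
        // Heuristic (assumption): match the internal streaming grain by type name,
        // since #8632 prevents annotating it with a [GrainDirectory] attribute.
        if (grainType.ToString().Contains("pubsubrendezvous", StringComparison.OrdinalIgnoreCase))
        {
            grainDirectory = _services.GetRequiredServiceByName<IGrainDirectory>("pubsub-directory");
            return true;
        }

        grainDirectory = null;
        return false; // fall back to the default directory
    }
}

// Silo configuration fragment (Microsoft.Orleans.GrainDirectory.Redis package):
siloBuilder
    .AddRedisGrainDirectory("pubsub-directory", options =>
        options.ConfigurationOptions = ConfigurationOptions.Parse("localhost:6379"))
    .ConfigureServices(services =>
        services.AddSingleton<IGrainDirectoryResolver, PubSubRedisDirectoryResolver>());
```

The key design point, per the comment, is that only the directory used by the pub/sub grain changes; all application grains stay on the default in-memory distributed directory.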