azure-sdk-for-net: [BUG] ServiceBus MessageLock Exception on CompleteAsync
Describe the bug
I'm using Azure.Messaging.ServiceBus.dll version 7.4.0 with ServiceBusReceiver. It is used to get a message out of the queue in peek-lock mode and then process it (processing sometimes takes up to 1 hour). The lock duration is 5 minutes, and I have a timer that renews the lock every 4 minutes. The issue is that lock renewal works the whole time without any exceptions from the library side, but the call then fails on CompleteAsync: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue, or was received by a different receiver instance. (MessageLockLost). A minimal sketch of this receive/renew/complete pattern is shown after the points of interest below.
According to my logging, the application renewed the lock for another 5 minutes and then failed to complete roughly 1 minute later:
11/11/2021, 9:47:38.650 AM | Message lock on '18a3d6f3-0173-4e02-a457-9b118874473b' is renewed. Locked until: '11/11/2021 09:52:38 +00:00'
11/11/2021, 9:48:38.681 AM | Azure.Messaging.ServiceBus.ServiceBusException at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.ThrowLockLostException
Points of interest:
- it happens randomly
- after the exception, the message is still in the queue for the remaining 4 minutes and cannot be received by any other receiver
- I have one receiver per pod/application
- I have 8 parallel running pods
- each receiver consumes only 1 message at a time
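For clarity, here is a minimal sketch of the pattern in use (manual receive with a renewal loop). The connection string, queue name, and the simulated processing delay are placeholders, not taken from the real application:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

// Placeholders: replace with real connection string and queue name.
string connectionString = "<service-bus-connection-string>";

await using var client = new ServiceBusClient(connectionString);
ServiceBusReceiver receiver = client.CreateReceiver("<queue-name>");

ServiceBusReceivedMessage message = await receiver.ReceiveMessageAsync();

// Timer-style loop that renews the 5-minute lock every 4 minutes while processing runs.
using var renewCts = new CancellationTokenSource();
Task renewLoop = Task.Run(async () =>
{
    try
    {
        while (!renewCts.IsCancellationRequested)
        {
            await Task.Delay(TimeSpan.FromMinutes(4), renewCts.Token);
            await receiver.RenewMessageLockAsync(message, renewCts.Token);
        }
    }
    catch (OperationCanceledException)
    {
        // Processing finished; the renewal loop shuts down.
    }
});

// Stand-in for the long-running processing (up to ~1 hour in the real application).
await Task.Delay(TimeSpan.FromMinutes(30));

renewCts.Cancel();
await renewLoop;

// This is the call that intermittently throws MessageLockLost,
// even though the lock was renewed less than a minute earlier.
await receiver.CompleteMessageAsync(message);
```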
Stack trace (from App Insights)
{"method":"Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.ThrowLockLostException","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":0,"line":0}
{"method":"Azure.Messaging.ServiceBus.Amqp.AmqpReceiver+<DisposeMessageAsync>d__42.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":1,"line":0}
{"method":"Azure.Messaging.ServiceBus.Amqp.AmqpReceiver+<CompleteInternalAsync>d__41.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":5,"line":0}
{"method":"Azure.Messaging.ServiceBus.Amqp.AmqpReceiver+<>c+<<CompleteAsync>b__40_0>d.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":9,"line":0}
{"method":"Azure.Messaging.ServiceBus.ServiceBusRetryPolicy+<>c__201+<<RunOperation>b__20_0>d.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":13,"line":0}
{"method":"Azure.Messaging.ServiceBus.ServiceBusRetryPolicy+<RunOperation>d__212.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":19,"line":0}
{"method":"Azure.Messaging.ServiceBus.ServiceBusRetryPolicy+<RunOperation>d__212.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":21,"line":0}
{"method":"Azure.Messaging.ServiceBus.ServiceBusRetryPolicy+<RunOperation>d__201.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":27,"line":0}
{"method":"Azure.Messaging.ServiceBus.Amqp.AmqpReceiver+<CompleteAsync>d__40.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":31,"line":0}
{"method":"Azure.Messaging.ServiceBus.ServiceBusReceiver+<CompleteMessageAsync>d__47.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":35,"line":0}
{"method":"Azure.Messaging.ServiceBus.ServiceBusReceiver+<CompleteMessageAsync>d__46.MoveNext","assembly":"Azure.Messaging.ServiceBus, Version=7.4.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":39,"line":0}
[My Code]
Environment:
- Azure.Messaging.ServiceBus.dll version 7.4.0
- the application is running in an Azure AKS cluster
- I was not able to reproduce the issue on a local Windows machine
- Docker containers FROM mcr.microsoft.com/dotnet/aspnet:3.1
- Kubernetes version 1.19.11
- Message lock duration = 5 min
- Receiver settings: "TransportType": "AmqpWebSockets", "Retry": { "MaxRetries": 5, "TryTimeout": 1200000, "Delay": 30000 }
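For reference, a sketch of how these settings map onto ServiceBusClientOptions (interpreting the TryTimeout and Delay values as milliseconds is my assumption):

```csharp
using System;
using Azure.Messaging.ServiceBus;

// Sketch only: maps the configuration values listed above onto client options.
var options = new ServiceBusClientOptions
{
    TransportType = ServiceBusTransportType.AmqpWebSockets,
    RetryOptions = new ServiceBusRetryOptions
    {
        MaxRetries = 5,
        TryTimeout = TimeSpan.FromMilliseconds(1200000), // 20 minutes
        Delay = TimeSpan.FromMilliseconds(30000)         // 30 seconds
    }
};

var client = new ServiceBusClient("<service-bus-connection-string>", options);
```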
@JoshLove-msft @jsquire To add a bit more detail: we have already gone through most of the custom code that was in place and are fairly certain there is no other potential source of deadlock, because all of the custom code completes while the underlying AMQP link interaction and completion appears to hang, as indicated in the logs.
We have already verified a few things:
- They are running on 7.10, which uses version 2.5.10 of the AMQP library; that version already contains the fix that creates the singleton TaskCompletionSource with TaskCreationOptions.RunContinuationsAsynchronously (the absence of which would have been a potential source of deadlock).
- They have also tried switching from the processor to manual receives and completions and have seen the same pattern occur occasionally.
I have tried to reproduce it with a "silly" console application.
The idea was, once a message had been received, to cut the network connection, wait for the link timeout and for renewals to be attempted, and then reconnect to see whether the scenario could be reproduced, but so far we have failed. I was planning to take a more in-depth look into the underlying AMQP interaction at some point but haven't had time yet. To me, it looks like the receiver link is "broken" and doesn't recover, and then things come to a grinding halt.
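Regarding the processor-versus-manual-receive comparison above, here is a minimal sketch of the processor-based setup for anyone trying to reproduce this; the option values are illustrative, not the customer's actual configuration:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<service-bus-connection-string>");

var processor = client.CreateProcessor("<queue-name>", new ServiceBusProcessorOptions
{
    MaxConcurrentCalls = 1,                              // one message at a time, as in the report
    AutoCompleteMessages = false,                        // complete explicitly after processing
    MaxAutoLockRenewalDuration = TimeSpan.FromHours(1)   // cover the long processing window
});

processor.ProcessMessageAsync += async args =>
{
    await Task.Delay(TimeSpan.FromMinutes(30));          // stand-in for the long-running work
    await args.CompleteMessageAsync(args.Message);       // the call that intermittently fails
};

processor.ProcessErrorAsync += args =>
{
    Console.WriteLine(args.Exception);
    return Task.CompletedTask;
};

await processor.StartProcessingAsync();
await Task.Delay(Timeout.Infinite);                      // keep the console app alive
```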
@shankarsama @DorothySun216 can you provide any updates on this bug? I see it was last routed to both of you for "service follow-up". We are seeing exactly the issue described here when attempting to complete messages while the Service Bus instance is being throttled.
We fixed a bug related to this recently. The production rollout for this fix has been completed. Please reopen this issue if it reproduces in the future.
After further investigation of the issue (together with @danielmarbach - thanks a lot 😃), it seems likely that this is a deadlock within the Azure Service Bus SDK or its dependencies.
We observed that `CompleteAsync start` is written but the corresponding `CompleteAsync done` is missing. After this, the receiver link is somehow "broken" and the connection to the Service Bus backend is closed after 10 minutes. The logs show exactly that behavior. Unfortunately, we are unable to reproduce the behavior consistently on demand. Any help and hints to solve this issue would be highly appreciated.
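One way to confirm that the `CompleteAsync done` event is genuinely missing (rather than simply not being collected) is to attach the Azure SDK's diagnostic event listener while the workload runs; a minimal sketch, assuming console output is acceptable:

```csharp
using System.Diagnostics.Tracing;
using Azure.Core.Diagnostics;

// Forwards the SDK's EventSource output to the console so that the
// "CompleteAsync start" / "CompleteAsync done" pairs (and any gap between them)
// are visible alongside the application logs.
using var listener = AzureEventSourceListener.CreateConsoleLogger(EventLevel.Verbose);

// ... run the receive / complete workload here while the listener is active ...
```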
@EldertGrootenboer can you help follow up on this issue?
@JoshLove-msft
I might have found something. It appears that before every error there is a warning:
This fits perfectly with the 2-second delay I get for every complete that fails.
For some reason, when throttling occurs it loses the lock, which is strange.
I would expect it to just wait and commit; now we need some funky logic to spread out the requests?
Also strange is that the entire service (9 instances) only gets throttled on a single queue; the other queues work just fine and do not get throttled.
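In case it helps others hitting this under throttling, here is a sketch of a defensive wrapper around CompleteMessageAsync; the helper name and the 10-second back-off are my own choices, not SDK guidance:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

internal static class CompletionHelpers
{
    // If the lock is genuinely lost there is nothing to retry (abandon would fail too,
    // and the message reappears once the lock expires), so processing must be idempotent.
    // ServiceBusy (throttling), by contrast, can be retried after a back-off.
    public static async Task CompleteSafelyAsync(ServiceBusReceiver receiver, ServiceBusReceivedMessage message)
    {
        try
        {
            await receiver.CompleteMessageAsync(message);
        }
        catch (ServiceBusException ex) when (ex.Reason == ServiceBusFailureReason.MessageLockLost)
        {
            Console.WriteLine($"Lock lost for {message.MessageId}; the message will be redelivered.");
        }
        catch (ServiceBusException ex) when (ex.Reason == ServiceBusFailureReason.ServiceBusy)
        {
            await Task.Delay(TimeSpan.FromSeconds(10));   // crude back-off before a single retry
            await receiver.CompleteMessageAsync(message);
        }
    }
}
```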