azure-sdk-for-python: Async Eventhub Consumer with Pyamqp: Link Detached after long idle period Results in High "User Errors" volume
- package: azure-eventhub
- version: 5.11.1
- OS: Debian container (kubernetes pod): `python:3.11-slim-bullseye`
- Python version: 3.11.2
Describe the bug
Long idle periods (greater than 1 hour) sometimes result in a very high volume of “User Errors” reported in the metrics blade for our Eventhub. These high “User Error” volumes are correlated with our Consumer Clients not receiving events after the idle period has ended and Producers have started publishing events again.
After setting the log-level to `DEBUG` for our application, we have been able to correlate these “User Errors” with some errors in socket management in the `_pyamqp` module of the `azure-eventhub` package (see more details below). Our consumer’s log volume (when set at `DEBUG`) roughly correlates with the “User Errors” volume (see screenshots) and seems to indicate a problem with the socket connection being closed.
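For reference, this is roughly how we turn the SDK logging up (a simplified sketch; our real app uses a JSON formatter, and `logging_enable=True` is passed on the client as shown in the consumer sketch further down):

```python
import logging
import sys

# Simplified sketch of our DEBUG logging setup (the real app uses a JSON
# formatter). Setting the root logger to DEBUG is what surfaces the _pyamqp
# socket/link messages quoted below.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
```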
Lastly, our apps sometimes recover from this, but in other cases our Consumer Client pods must be restarted to begin consuming events again. (We had an outage this week as a result of this issue.)
To Reproduce
Steps to reproduce the behavior:
- Build an async Eventhub Consumer similar to this example; a simplified sketch follows this list. (Note: our app is very close to this example with some arbitrary processing in the event handler; the errors in question happen after long idle periods.)
- Run the consumer with log-level set to DEBUG and no events going into Eventhub for longer than 1 hour (our event volume is closely tied to US Business Hours).
- Set Eventhub to “Disabled”. (This seems to simulate the server socket disconnect.)
- Re-enable Eventhub and see if Consumer Client can reconnect (ours go into failure loops).
- Check Azure Eventhub until “User Errors” volume spikes (see screenshot) and correlate with logs.
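For reference, a simplified sketch of the kind of async consumer we run (connection strings, names, and the checkpoint container are placeholders; our real event handler does arbitrary processing):

```python
# Simplified sketch of our consumer. Environment variables and the container
# name are placeholders; logging_enable=True turns on the SDK's AMQP logging.
import asyncio
import os

from azure.eventhub.aio import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblobaio import BlobCheckpointStore


async def on_event(partition_context, event):
    # Our real handler does some arbitrary processing here.
    await partition_context.update_checkpoint(event)


async def main():
    checkpoint_store = BlobCheckpointStore.from_connection_string(
        os.environ["STORAGE_CONN_STR"],   # placeholder env var
        container_name="checkpoints",     # placeholder container
    )
    client = EventHubConsumerClient.from_connection_string(
        os.environ["EVENT_HUB_CONN_STR"],  # placeholder env var
        consumer_group="$Default",
        eventhub_name=os.environ["EVENT_HUB_NAME"],  # placeholder env var
        checkpoint_store=checkpoint_store,
        logging_enable=True,
    )
    async with client:
        await client.receive(on_event=on_event, starting_position="-1")


asyncio.run(main())
```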
Expected behavior
- No “User Errors” managing closed sockets for idle consumers.
- Idle consumers with closed connections are able to reconnect.
- Also, less important but surprising behavior in the `_pyamqp` module: errors in this library are reported via `log.info` calls (not `log.error`?), so they are not readily apparent with our typical logging setups (Example). A small workaround sketch follows this list.
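To illustrate the last point (a hypothetical workaround, not something we have adopted): because these errors come out of `log.info`, an app whose root logger sits at `WARN` never sees them; selectively lowering the SDK loggers is one way to surface them without going full `DEBUG`:

```python
import logging

# Hypothetical workaround: keep the application root logger at WARNING but let
# the azure.eventhub loggers (including the vendored _pyamqp module) emit INFO,
# so the link-detach / connection errors quoted below actually show up.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("azure.eventhub").setLevel(logging.INFO)
```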
Screenshots
A. User Errors reported by the Azure Eventhub metrics blade:
B. Log Volume Correlated with high “User Errors”:
Additional context
Here are the log messages we see, in order, when this happens (full debug log dump available):
- "An error occurred when detaching the link:
AMQPConnectionError('Error condition: ErrorCondition.SocketError\\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection')"
- “Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
- "An error occurred when detaching the link: AMQPConnectionError(‘Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection’)"
- “Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
- “Management link receiver state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
- “An error occurred when detaching the link: AMQPConnectionError(‘Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection’)”
- “Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
- “Management link sender state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
- “An error occurred when ending the session: AMQPConnectionError(‘Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection’)”
- “Session state changed: <SessionState.MAPPED: 3> -> <SessionState.UNMAPPED: 0>”
- “An error occurred when closing the connection: AMQPConnectionError(‘Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection’)”
- “Connection state changed: <ConnectionState.OPENED: 9> -> <ConnectionState.END: 13>”
- “Connection state changed: None -> <ConnectionState.START: 0>”
- GO THROUGH CONNECTION FLOW
- “CBS completed opening with status: <ManagementOpenResult.OK: 1>”
- “Link state changed: <LinkState.DETACHED: 0> -> <LinkState.ATTACH_SENT: 1>”
- “Cannot get source or target. Detaching link”
- “Link state changed: <LinkState.ATTACH_SENT: 1> -> <LinkState.DETACHED: 0>”
- “Unable to attach new link: ValueError(‘Invalid link’)”
- “Link state changed: <LinkState.DETACHED: 0> -> <LinkState.ERROR: 6>”
- “Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACH_SENT: 4>”
- GOTO 1
Notes:
- The above flow seems to loop endlessly until we restart our consumer pods.
- All of the above error messages are logged at level `INFO` (we would prefer `WARN` or `ERROR` instead).
- At step 14 it looks like something has succeeded, so it’s unclear why the consumer goes into a loop afterward and can’t recover.
- We set `idle_timeout` on a guess that it could help with long idle periods, but it has not (a sketch of where this keyword goes follows these notes).
- There are hundreds of “User Errors” per minute, but it’s not clear to us what the Eventhub server itself is registering as a “User Error”.
- Our Consumers run in kubernetes pods, and we also see this happen on Eventhubs with 1-24 partitions in different environments.
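Sketch of where the `idle_timeout` keyword goes (names are placeholders; the `60 * 4` value is the one mentioned in a comment further down):

```python
from azure.eventhub.aio import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    "<connection-string>",             # placeholder
    consumer_group="$Default",
    eventhub_name="<eventhub-name>",   # placeholder
    idle_timeout=60 * 4,               # seconds of inactivity before the client closes the connection
)
```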
Also, worth mentioning that we have a Microsoft Support ticket: TrackingID#2302210010000022.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 30 (14 by maintainers)
Yesterday, I added `idle_timeout=60 * 4` to our `EventHubConsumerClient.from_connection_string(...)` call, and it may have helped. I picked 4 minutes because it was lower than the allowed 300000 milliseconds reported by the error. I will keep an eye on it and see.

We had another event in a dev environment this morning, and it seems to have begun with this:
I sent more details via email.
We see this for Eventhubs with 1, 3, and 24 partitions. We see it in West US and West US 2. I still need to get you another set of logs. Apologies: will get it to you soon. Thanks again for looking into this.
Ah, I was afraid of that: that you wouldn’t be able to reproduce it so easily. I will run some queries and try to get another log with a larger window. At a glance, before the reconnect failures occur, the debug log looks like “azure-blob-storage checkpointing worked”.
If I can get the time, I think I will also try to build a simple repro that can be shared. I may need to wait a few days to see if the event happens organically: I am slightly suspicious that my explicit disable-and-re-enable triggers a similar-looking failure which may be different.
@erewok received the email 😃
Speaking of, I may have a reproduction:
This also gave me a way to reliably generate the erroring frames, I think. The log volume here spikes immediately after I disabled and then re-enabled this Eventhub:
@kashifkhan I have a new log dump to email: what address should I send it to?
One other update: after reviewing other logs for one of my other consumers with log level at `WARN` and `logging_enable=True`, I saw the telltale signs of this issue this morning (high “User errors” reported in Azure). Unfortunately, we only have two log messages because the log-level was at `WARN`.

First:
{"message": "END frame received on invalid channel. Closing connection.", "timestamp": "2023-02-24T19:37:24.764204+00:00", "application": "SF Event Hub Updater", "version": "pr7634-0.20.2-20230224.9"}
After that, an endless repeat of (until service was restarted):
{"message": "Unable to attach new link: ValueError('Invalid link')", "timestamp": "2023-02-24T19:37:25.417902+00:00", "application": "SF Event Hub Updater", "version": "pr7634-0.20.2-20230224.9"}
I have redeployed this service with the log level set higher and am waiting to see if we can catch the invalid frame in action, but it seems like this code may be close to the trigger. It occurs to me that if you have a way to terminate an Eventhub consumer connection on the server side, you may be able to reproduce this without waiting for hours. A hypothetical sketch of forcing that from the management plane follows.
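For example, the disable/re-enable I did through the portal could presumably be scripted via the management plane. This is a hypothetical, unverified sketch (resource names are placeholders, and I have not confirmed this triggers exactly the same failure):

```python
# Hypothetical sketch: toggle the Event Hub's status through azure-mgmt-eventhub
# to force a server-side disconnect of idle consumers. All names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
NAMESPACE = "<eventhub-namespace>"      # placeholder
HUB = "<eventhub-name>"                 # placeholder

mgmt = EventHubManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

hub = mgmt.event_hubs.get(RESOURCE_GROUP, NAMESPACE, HUB)
hub.status = "Disabled"                 # later set back to "Active" to re-enable
mgmt.event_hubs.create_or_update(RESOURCE_GROUP, NAMESPACE, HUB, hub)
```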
Ah, you’re right. I had that enabled on a different consumer. Unfortunately, this error-state is inconsistent and I haven’t been able to reliably reproduce it. I turn up logging in our dev environment and then wake up the next day to see what data has appeared. I’ll probably have to get back to you next week as a result (and keep my fingers crossed that it’s reproducible 🤞for others).
Ah, okay. Thanks so much for offering to look into this!