azure-sdk-for-python: Async Eventhub Consumer with Pyamqp: Link Detached after long idle period Results in High "User Errors" volume

package: azure-eventhub
version: 5.11.1
OS: Debian Container (kubernetes pod): python:3.11-slim-bullseye
Python version: 3.11.2

Describe the bug Long idle periods (greater than 1 hour) sometimes result in very high volume of “User Errors” reported in the metrics blade for our Eventhub. These high “User Error” volumes are correlated with our Consumer Clients not receiving events after idle period has ended and Producers have started publishing events again.

After setting log-level to DEBUG for our application, we have been able to correlate these “User Errors” with some errors in socket management in the _pyamqp module of the azure-eventhub package (see more details below).

Our consumer’s log volume (when set at DEBUG) roughly correlates with User Errors volume (see screenshots) and seems to indicate a problem with socket connection being closed.

Lastly, our apps sometimes recover from this, but in other cases our Consumer Client pods must be restarted to begin consuming events again. (We had an outage this week as a result of this issue.)

To Reproduce Steps to reproduce the behavior:

Build an async Eventhub Consumer similar to this example (Note: our app is very close to this example with some arbitrary processing in the event handler; the errors in question happen after long idle periods.)
Run the consumer with log-level set to DEBUG and no events going into Eventhub for longer than 1 hour (our event volume is closely tied to US Business Hours).
Set Eventhub to “Disabled”. (This seems to simulate the server socket disconnect.)
Re-enable Eventhub and see if Consumer Client can reconnect (ours go into failure loops).
Check Azure Eventhub until “User Errors” volume spikes (see screenshot) and correlate with logs.

Expected behavior

No “User Errors” managing closed sockets for idle consumers.
Idle consumers with closed connections are able to reconnect.
Also, less important but surprising behavior in _pyamqp module: errors in this library are called out in log.info calls (not log.error?), so not readily apparent with our typical logging setups (Example).

Screenshots A. User Errors Reported by Azure Eventhub metrics blade: user-errors-metrics-sfevents-DEV

B. Log Volume Correlated with high “User Errors”:

user-errors-logs-correlations-sfevents-DEV

Additional context Here are the log messages we see in order when this happens (full debug log dump available):

"An error occurred when detaching the link: AMQPConnectionError('Error condition: ErrorCondition.SocketError\\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection')"
“Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
"An error occurred when detaching the link: AMQPConnectionError(‘Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection’)"
“Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
“Management link receiver state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
“An error occurred when detaching the link: AMQPConnectionError(‘Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection’)”
“Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
“Management link sender state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>”
“An error occurred when ending the session: AMQPConnectionError(‘Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection’)”
“Session state changed: <SessionState.MAPPED: 3> -> <SessionState.UNMAPPED: 0>”
“An error occurred when closing the connection: AMQPConnectionError(‘Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: Server unexpectedly closed connection’)”
“Connection state changed: <ConnectionState.OPENED: 9> -> <ConnectionState.END: 13>”
“Connection state changed: None -> <ConnectionState.START: 0>”
GO THROUGH CONNECTION FLOW
“CBS completed opening with status: <ManagementOpenResult.OK: 1>”
“Link state changed: <LinkState.DETACHED: 0> -> <LinkState.ATTACH_SENT: 1>”
“Cannot get source or target. Detaching link”
“Link state changed: <LinkState.ATTACH_SENT: 1> -> <LinkState.DETACHED: 0>”
“Unable to attach new link: ValueError(‘Invalid link’)”
“Link state changed: <LinkState.DETACHED: 0> -> <LinkState.ERROR: 6>”
“Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACH_SENT: 4>”
GOTO 1

Notes:

The above flow seems to loop endlessly until we restart our consumer pods
All of the above error messages are logged at level INFO (would prefer to have WARN or ERROR instead).
At step 14 it looks like something has succeeeded, so it’s unclear why the consumer goes into a loop afterward and can’t recover.
We set idle_timeout on a guess that it could help with long idle periods, but it has not.
There are hundreds of “User Errors” per minute but it’s not clear to us what the Eventhub server itself is registering as a “User Error”?
Our Consumers run in kubernetes pods and we also see this happen on Eventhubs with 1-24 partitions in different environments.

Also, worth mentioning that we have a Microsoft Support ticket: TrackingID#2302210010000022.

About this issue

Original URL
State: closed
Created a year ago
Comments: 30 (14 by maintainers)

Most upvoted comments

Yesterday, I added idle_timeout=60 * 4, to our EventHubConsumerClient.from_connection_string(... call, and it may have helped. I picked 4 minutes because it was lower than the allowed 300000 milliseconds reported by the error. I will keep an eye on it and see.

erewok on Mar 1, 2023

We had another event in a dev environment this morning and the event seems to have begun with this:

"<- DetachFrame(handle=0, closed=True, error=[b'amqp:connection:forced', b\"The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:2a8827fbd5ee48d9be0f2787f123da1b_G23, SystemTracker:gateway5, Timestamp:2023-02-28T15:27:18\", None])"

I sent more details via email.

erewok on Feb 28, 2023

We see this for Eventhubs with 1, 3, and 24 partitions. We see it in West US and West US 2. I still need to get you another set of logs. Apologies: will get it to you soon. Thanks again for looking into this.

erewok on Feb 28, 2023

Ah, I was afraid of that: that you wouldn’t be able to reproduce it so easily. I will run some queries and try to get another log with a larger window. At a glance, before the reconnect failures occur, the debug log looks like “azure-blob-storage checkpointing worked”.

If I can get the time, I think I will also try to build a simple repro that can be shared. I may need to wait a few days to see if the event happens organically: I have am slightly suspicious that my explicit disable-and-reenable triggers a similar-looking failure which may be different.

erewok on Feb 28, 2023

@erewok received the email 😃

kashifkhan on Feb 24, 2023

Speaking of, I may have a reproduction:

When I set the Eventhub’s “Status” to disabled (see attached) and then I set it to enabled, it triggers the same behavior (“User errors” chart spikes up and stays high until service restarted)

Screenshot 2023-02-24 at 1 52 40 PM

This also gave me a way to reliably generate the erroring frames, I think. The log volume here spikes immediately after I disabled and then re-enabled this Eventhub:

Screenshot 2023-02-24 at 1 53 47 PM

@kashifkhan I have a new log dump to email: what address should I send it to?

erewok on Feb 24, 2023

One other update: after reviewing other logs for one of my other consumers with log level at WARN and logging_enable=True, I saw the telltale signs of this issue this morning (high “User errors” reported in azure).

Unfortunately, we have only have two log messages because log-level was at WARN.

First:

{"message": "END frame received on invalid channel. Closing connection.", "timestamp": "2023-02-24T19:37:24.764204+00:00", "application": "SF Event Hub Updater", "version": "pr7634-0.20.2-20230224.9"}

After that, an endless repeat of (until service was restarted):

{"message": "Unable to attach new link: ValueError('Invalid link')", "timestamp": "2023-02-24T19:37:25.417902+00:00", "application": "SF Event Hub Updater", "version": "pr7634-0.20.2-20230224.9"}

I have redeployed this service with loglevel set higher and waiting to see if we can catch the invalid frame in action, but it seems like this code may be close to the trigger. It occurs to me that if you have a way to terminate an Eventhub consumer connection on the server side, you may be able to reproduce without waiting for hours?

erewok on Feb 24, 2023

Ah, you’re right. I had that enabled on a different consumer. Unfortunately, this error-state is inconsistent and I haven’t been able to reliably reproduce it. I turn up logging in our dev environment and then wake up the next day to see what data has appeared. I’ll probably have to get back to you next week as a result (and keep my fingers crossed that it’s reproducible 🤞for others).

erewok on Feb 24, 2023

Ah, okay. thanks so much for offering to look into this!

erewok on Feb 24, 2023