azure-sdk-for-java: [BUG] Event hub receiver stops consuming messages from partition after a few days

Describe the bug We are consuming messages from an EventHub with 32 partitions. Occasionally (after a few days), the service stops receiving messages from one of the partitions.

Exception or Stack Trace

2021-10-06 08:25:20.823  WARN 6 --- [tor-executor-25] c.a.c.a.implementation.ReactorExecutor   : connectionId[MF_184043_1633422722973], message[Unhandled exception while processing events in reactor, report this error.]
java.lang.IllegalStateException: no current delivery
2021-10-06 08:26:20.827 ERROR 6 --- [undedElastic-37] c.a.c.a.i.ReactorDispatcher              : ReactorDispatcher instance is closed.
2021-10-06 08:26:20.827 ERROR 6 --- [     parallel-1] c.a.c.a.i.ReactorDispatcher              : ReactorDispatcher instance is closed.
2021-10-06 08:26:20.827 ERROR 6 --- [     parallel-1] c.a.c.a.i.ReactorDispatcher              : ReactorDispatcher instance is closed.
2021-10-06 08:26:20.827  WARN 6 --- [     parallel-1] c.a.c.a.i.ReactorConnection              : connectionId[MF_184043_1633422722973] Error while scheduling closeConnection work. Manually disposing.
ReactorDispatcher instance is closed.
2021-10-06 08:27:20.827  WARN 6 --- [     parallel-2] c.a.c.a.i.ReactorConnection              : connectionId[MF_184043_1633422722973]: Timed out waiting for all sessions to close.

To Reproduce

  • Start consuming events
  • Wait a day or two
  • Observe the error message above

Expected behavior The Event Hub receiver doesn’t stop consuming events from a partition, or it recovers from the error on its own.

Setup (please complete the following information):

  • OS: Linux (Docker container on K8S)
  • IDE: IntelliJ
  • Library/Libraries: azure-messaging-eventhubs:5.10.1, azure-messaging-eventhubs-checkpointstore-blob:1.10.0
  • Java version: AdoptOpenJDK-11.0.11+9
  • App Server/Environment: Tomcat
  • Frameworks: Spring Boot

Additional context Not sure if this is relevant, but we are currently using a single application instance, so all 32 partitions of the event hub are consumed by one instance.

We are using a prefetchCount of 10 and a batch size of 500. We set maxWaitTime to null as a workaround for #23950.
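For reference, this is roughly how our processor is wired up. A minimal sketch; the connection strings, container name, and handler bodies are placeholders, not our actual code:

```java
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class IngestProcessorConfig {

    public static EventProcessorClient create(String eventHubConnectionString,
                                              String storageConnectionString) {
        // Blob container backing the checkpoint store (names are placeholders).
        BlobContainerAsyncClient checkpointContainer = new BlobContainerClientBuilder()
                .connectionString(storageConnectionString)
                .containerName("checkpoints")
                .buildAsyncClient();

        return new EventProcessorClientBuilder()
                .connectionString(eventHubConnectionString, "projectname-ingest")
                .consumerGroup("ingest")
                .prefetchCount(10) // prefetchCount = 10
                .checkpointStore(new BlobCheckpointStore(checkpointContainer))
                // Batch size 500; the maxWaitTime overload is not used (i.e. maxWaitTime
                // stays null) as the workaround for #23950.
                .processEventBatch(batch -> {
                    batch.getEvents().forEach(event -> {
                        // handle event
                    });
                    batch.updateCheckpoint();
                }, 500)
                .processError(errorContext -> System.err.printf(
                        "Error on partition %s: %s%n",
                        errorContext.getPartitionContext().getPartitionId(),
                        errorContext.getThrowable()))
                .buildEventProcessorClient();
    }
}
```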

Information Checklist Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise, we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 22 (12 by maintainers)

Most upvoted comments

@anuchandy and I looked through your logs and believe this is fixed in the October release.

c.a.m.e.i.EventHubReactorAmqpConnection : Get or create consumer for path: 'projectname-ingest/ConsumerGroups/ingest/Partitions/0

The log entry above shows it is trying to create a link for partition 0 (here), and in doing so, calls “tokenManager.authorize()”.

The authorize call uses the “Mono<> sendWithAck” from RRChannel; we can see the auth call happen for “msg 178” immediately below, but it is never successfully completed.

connectionId[MF_30ffff_1633678666625], linkName[cbs]: Scheduling on dispatcher. MessageId[178]

Since there is no operation timeout on sendWithAck, it waits forever for an ACK that will never come… thus stopping the receiver from recovering.
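To illustrate the failure mode with a Reactor sketch (this is not the SDK’s internal code, just the general pattern): without a downstream timeout, a Mono that never completes blocks its subscriber indefinitely; with one, the wait fails fast and recovery logic gets a chance to run.

```java
import java.time.Duration;
import reactor.core.publisher.Mono;

public class OperationTimeoutDemo {
    public static void main(String[] args) {
        // Stand-in for an acknowledgement that never arrives.
        Mono<Void> ackThatNeverComes = Mono.never();

        // Without .timeout(...) this block() would hang forever, like the stuck
        // auth call. With a timeout, the wait fails fast and recovery can proceed.
        ackThatNeverComes
                .timeout(Duration.ofSeconds(60))
                .doOnError(error -> System.err.println("Timed out waiting for ACK: " + error))
                .onErrorResume(error -> Mono.empty()) // recovery/retry logic would go here
                .block();
    }
}
```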

I don’t see an error from storage in the log you shared; yes, I agree with you. Given the load (~10 messages every 5 minutes), the chances of hitting this are low (it could also be that the storage retry is logged at debug level or filtered out by the log config). This page documents that frequent checkpointing is not good: https://docs.microsoft.com/en-us/azure/event-hubs/event-processor-balance-partition-load, but it feels like this should be called out in a separate section, e.g., best practices.
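One way to checkpoint less aggressively is to checkpoint only every Nth event per partition. A minimal sketch against the per-event processor API; the threshold of 100 is an illustrative choice, not a documented recommendation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import com.azure.messaging.eventhubs.models.EventContext;

public class ThrottledCheckpointing {
    private static final int CHECKPOINT_EVERY_N_EVENTS = 100; // illustrative threshold

    // One counter per partition, since each partition checkpoints independently.
    private final Map<String, AtomicInteger> countersByPartition = new ConcurrentHashMap<>();

    public void onEvent(EventContext context) {
        // ... process context.getEventData() ...

        String partitionId = context.getPartitionContext().getPartitionId();
        AtomicInteger counter = countersByPartition
                .computeIfAbsent(partitionId, id -> new AtomicInteger());
        if (counter.incrementAndGet() % CHECKPOINT_EVERY_N_EVENTS == 0) {
            context.updateCheckpoint(); // checkpoint only every Nth event
        }
    }
}
```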

We’ve also deployed and started a test with a setup close to your non-prod environment so that we can debug. We’ll see how it goes.

@conniey Thank you for the configuration. Since I couldn’t just redeploy the environment, I ended up turning DEBUG on for all com.azure loggers. In the end, it wasn’t too bad. 😅
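For anyone else who needs this: in Spring Boot the declarative route is `logging.level.com.azure=DEBUG` in application.properties; when redeploying isn’t an option, the level can also be raised programmatically along these lines (a sketch, assuming the default Logback backend):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class AzureDebugLogging {
    /** Raises all com.azure loggers to DEBUG at runtime (assumes Logback behind SLF4J). */
    public static void enable() {
        Logger azureLogger = (Logger) LoggerFactory.getLogger("com.azure");
        azureLogger.setLevel(Level.DEBUG); // child loggers inherit this level
    }
}
```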

The error just recurred today. Here is the log with all com.azure debug messages enabled: 2021-10-10-eh-problem-clean.log

About the log file:

  • Contains data 5 minutes before and 5 minutes after the error
  • Error occurs at timestamp 2021-10-10 10:10:14.855, line 72904
  • Project-relevant names have been censored (replaced by “projectname”)
  • Messages are consumed from projectname-eval.servicebus.windows.net, EventHub: projectname-ingest
  • Messages are published to projectname-eval.servicebus.windows.net, EventHub: asset-messages
  • There is also an IoT hub EventHub which is irrelevant: iothub-ns-projectname.servicebus.windows.net

Thank you for your time. I hope this logfile contains useful information 🙂