azure-sdk-for-java: [BUG] Event hub receiver stops consuming messages from partition after a few days

Describe the bug We are consuming messages from an EventHub with 32 partitions. Occasionally (after a few days), the service stops receiving messages from one of the partitions.

Exception or Stack Trace

2021-10-06 08:25:20.823  WARN 6 --- [tor-executor-25] c.a.c.a.implementation.ReactorExecutor   : connectionId[MF_184043_1633422722973], message[Unhandled exception while processing events in reactor, report this error.]
java.lang.IllegalStateException: no current delivery
2021-10-06 08:26:20.827 ERROR 6 --- [undedElastic-37] c.a.c.a.i.ReactorDispatcher              : ReactorDispatcher instance is closed.
2021-10-06 08:26:20.827 ERROR 6 --- [     parallel-1] c.a.c.a.i.ReactorDispatcher              : ReactorDispatcher instance is closed.
2021-10-06 08:26:20.827 ERROR 6 --- [     parallel-1] c.a.c.a.i.ReactorDispatcher              : ReactorDispatcher instance is closed.
2021-10-06 08:26:20.827  WARN 6 --- [     parallel-1] c.a.c.a.i.ReactorConnection              : connectionId[MF_184043_1633422722973] Error while scheduling closeConnection work. Manually disposing.
ReactorDispatcher instance is closed.
2021-10-06 08:27:20.827  WARN 6 --- [     parallel-2] c.a.c.a.i.ReactorConnection              : connectionId[MF_184043_1633422722973]: Timed out waiting for all sessions to close.

To Reproduce

  • Start consuming events
  • Wait a day or two
  • Observe the error message above

Expected behavior The Event Hub receiver doesn’t stop consuming events from a partition, or it recovers from the error on its own.

Setup (please complete the following information):

  • OS: Linux (Docker container on K8S)
  • IDE: IntelliJ
  • Library/Libraries: azure-messaging-eventhubs:5.10.1, azure-messaging-eventhubs-checkpointstore-blob:1.10.0
  • Java version: AdoptOpenJDK-11.0.11+9
  • App Server/Environment: Tomcat
  • Frameworks: Spring Boot

Additional context Not sure if this is relevant, but we are currently using a single application instance, so all 32 partitions of the event hub are consumed by one instance.

We are using a prefetchCount of 10 and a batch size of 500. We set maxWaitTime to null as a workaround for #23950.
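For reference, this is roughly how our processor is wired up. A minimal sketch; the connection strings, container name, and handler bodies are placeholders, not our actual code:

```java
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class IngestProcessorConfig {

    public static EventProcessorClient create(String eventHubConnectionString,
                                              String storageConnectionString) {
        // Blob container backing the checkpoint store (names are placeholders).
        BlobContainerAsyncClient checkpointContainer = new BlobContainerClientBuilder()
                .connectionString(storageConnectionString)
                .containerName("checkpoints")
                .buildAsyncClient();

        return new EventProcessorClientBuilder()
                .connectionString(eventHubConnectionString, "projectname-ingest")
                .consumerGroup("ingest")
                .prefetchCount(10) // prefetchCount = 10
                .checkpointStore(new BlobCheckpointStore(checkpointContainer))
                // Batch size 500; the maxWaitTime overload is not used (i.e. maxWaitTime
                // stays null) as the workaround for #23950.
                .processEventBatch(batch -> {
                    batch.getEvents().forEach(event -> {
                        // handle event
                    });
                    batch.updateCheckpoint();
                }, 500)
                .processError(errorContext -> System.err.printf(
                        "Error on partition %s: %s%n",
                        errorContext.getPartitionContext().getPartitionId(),
                        errorContext.getThrowable()))
                .buildEventProcessorClient();
    }
}
```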

Information Checklist Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise, we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 22 (12 by maintainers)

Most upvoted comments

@anuchandy and I looked through your logs and believe this is fixed in the October release.

c.a.m.e.i.EventHubReactorAmqpConnection : Get or create consumer for path: 'projectname-ingest/ConsumerGroups/ingest/Partitions/0

The log entry above shows it is trying to create a link for partition 0 (here), and in doing so, calls “tokenManager.authorize()”.

The authorize call uses the “Mono<> sendWithAck” from RRChannel; we can see the auth call happen for “msg 178” immediately below, but it is never successfully completed.

connectionId[MF_30ffff_1633678666625], linkName[cbs]: Scheduling on dispatcher. MessageId[178]

Since there is no operation timeout on sendWithAck, it waits forever for an ACK that will never come… thus stopping the receiver from recovering.
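To illustrate the failure mode with a Reactor sketch (this is not the SDK’s internal code, just the general pattern): without a downstream timeout, a Mono that never completes blocks its subscriber indefinitely; with one, the wait fails fast and recovery logic gets a chance to run.

```java
import java.time.Duration;
import reactor.core.publisher.Mono;

public class OperationTimeoutDemo {
    public static void main(String[] args) {
        // Stand-in for an acknowledgement that never arrives.
        Mono<Void> ackThatNeverComes = Mono.never();

        // Without .timeout(...) this block() would hang forever, like the stuck
        // auth call. With a timeout, the wait fails fast and recovery can proceed.
        ackThatNeverComes
                .timeout(Duration.ofSeconds(60))
                .doOnError(error -> System.err.println("Timed out waiting for ACK: " + error))
                .onErrorResume(error -> Mono.empty()) // recovery/retry logic would go here
                .block();
    }
}
```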

I don’t see an error from storage in the log you shared; yes, I agree with you. Given the load (~10 messages every 5 minutes), the chances of hitting this are low (it could also be that the storage retry is logged at debug level or filtered out by the log config). This page documents that frequent checkpointing is not good: https://docs.microsoft.com/en-us/azure/event-hubs/event-processor-balance-partition-load, but it feels like this should be called out in a separate section, e.g., best practices.
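One way to checkpoint less aggressively is to checkpoint only every Nth event per partition. A minimal sketch against the per-event processor API; the threshold of 100 is an illustrative choice, not a documented recommendation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import com.azure.messaging.eventhubs.models.EventContext;

public class ThrottledCheckpointing {
    private static final int CHECKPOINT_EVERY_N_EVENTS = 100; // illustrative threshold

    // One counter per partition, since each partition checkpoints independently.
    private final Map<String, AtomicInteger> countersByPartition = new ConcurrentHashMap<>();

    public void onEvent(EventContext context) {
        // ... process context.getEventData() ...

        String partitionId = context.getPartitionContext().getPartitionId();
        AtomicInteger counter = countersByPartition
                .computeIfAbsent(partitionId, id -> new AtomicInteger());
        if (counter.incrementAndGet() % CHECKPOINT_EVERY_N_EVENTS == 0) {
            context.updateCheckpoint(); // checkpoint only every Nth event
        }
    }
}
```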

We’ve also deployed and started a test with a setup close to your non-prod environment so that we can debug. We’ll see how it goes.

@conniey Thank you for the configuration. Since I couldn’t just redeploy the environment, I ended up turning DEBUG on for all com.azure loggers. In the end, it wasn’t too bad. 😅
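For anyone else who needs this: in Spring Boot the declarative route is `logging.level.com.azure=DEBUG` in application.properties; when redeploying isn’t an option, the level can also be raised programmatically along these lines (a sketch, assuming the default Logback backend):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class AzureDebugLogging {
    /** Raises all com.azure loggers to DEBUG at runtime (assumes Logback behind SLF4J). */
    public static void enable() {
        Logger azureLogger = (Logger) LoggerFactory.getLogger("com.azure");
        azureLogger.setLevel(Level.DEBUG); // child loggers inherit this level
    }
}
```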

The error just recurred today. Here is the log with all com.azure debug messages enabled: 2021-10-10-eh-problem-clean.log

About the log file:

  • Contains data 5 minutes before and 5 minutes after the error
  • Error occurs at timestamp 2021-10-10 10:10:14.855, line 72904
  • Project-relevant names have been censored (replaced by “projectname”)
  • Messages are consumed from projectname-eval.servicebus.windows.net, EventHub: projectname-ingest
  • Messages are published to projectname-eval.servicebus.windows.net, EventHub: asset-messages
  • There is also an IoT hub EventHub which is irrelevant: iothub-ns-projectname.servicebus.windows.net

Thank you for your time. I hope this logfile contains useful information 🙂