azure-sdk-for-java: [BUG] Event hub receiver stops consuming messages from partiton after a few days
Describe the bug We are consuming messages from a EventHub with 32 partitions. Occasionally (after a few days) the service stops receiving messages from one of the partitions.
Exception or Stack Trace
2021-10-06 08:25:20.823 WARN 6 --- [tor-executor-25] c.a.c.a.implementation.ReactorExecutor : connectionId[MF_184043_1633422722973], message[Unhandled exception while processing events in reactor, report this error.]
java.lang.IllegalStateException: no current delivery
2021-10-06 08:26:20.827 ERROR 6 --- [undedElastic-37] c.a.c.a.i.ReactorDispatcher : ReactorDispatcher instance is closed.
2021-10-06 08:26:20.827 ERROR 6 --- [ parallel-1] c.a.c.a.i.ReactorDispatcher : ReactorDispatcher instance is closed.
2021-10-06 08:26:20.827 ERROR 6 --- [ parallel-1] c.a.c.a.i.ReactorDispatcher : ReactorDispatcher instance is closed.
2021-10-06 08:26:20.827 WARN 6 --- [ parallel-1] c.a.c.a.i.ReactorConnection : connectionId[MF_184043_1633422722973] Error while scheduling closeConnection work. Manually disposing.
ReactorDispatcher instance is closed.
2021-10-06 08:27:20.827 WARN 6 --- [ parallel-2] c.a.c.a.i.ReactorConnection : connectionId[MF_184043_1633422722973]: Timed out waiting for all sessions to close.
To Reproduce
- Start consuming events
- Wait a day or two
- Observe the error message above
Expected behavior Event hub receiver doesn’t stop consuming events from a partition or recovers from the error on its own
Setup (please complete the following information):
- OS: Linux (Docker container on K8S)
- IDE: IntelliJ
- Library/Libraries: azure-messaging-eventhubs:5.10.1 azure-messaging-eventhubs-checkpointstore-blob:1.10.0
- Java version: AdoptOpenJDK-11.0.11+9
- App Server/Environment: Tomcat
- Frameworks: Spring Boot
Additional context Not sure if this is relevant, but we are currently using a single application instance. So all partitions of the event hub are consumed by one instance.
We are using a prefetchCount of 10 and a batch size of 500. We set maxWaitTime to null as a workaround for #23950
Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report
- Bug Description Added
- Repro Steps Added
- Setup information Added
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 22 (12 by maintainers)
@anuchandy and I looked through your logs and believe this is fixed in the October release.
The log entry above shows it is trying to create a link for partition 0 (here), and in doing so, calls “tokenManager.authorize()”.
The authorize calls will use the “Mono<> sendWithAck” from RRChannel, where we can see auth call happens for “msg 178” immediately below, but it is never successfully completed.
Since there is no operation timeout on sendWithAck, it waits forever for an ACK that will never come… thus, stopping the receiver from recovering.
I don’t see an err from storage in the log you shared; yes, agree with you; given the load (~10 msg in every 5 mins), chances are low to hit this (it could also be the case that storage retry is debug-level / it’s filtered out in log-config). This page document that frequent checkpointing is not good https://docs.microsoft.com/en-us/azure/event-hubs/event-processor-balance-partition-load, but feel like this should be called out in a separate section, e.g., best practices.
We’ve also deployed and started the test with a setup close to your non-prod env so that we can debug. Will see how it goes.
@conniey Thank you for the configuration. Since I couldn’t just redeploy the environment I ended up turning
DEBUGon for allcom.azureloggers. In the end, it wasn’t too bad. 😅The error just reoccurred today. Here is the log with all
com.azuredebug messages enabled: 2021-10-10-eh-problem-clean.logAbout the log file:
2021-10-10 10:10:14.855, line72904projectname-eval.servicebus.windows.net, EventHub:projectname-ingestprojectname-eval.servicebus.windows.net, EventHub:asset-messagesiothub-ns-projectname.servicebus.windows.netThank you for your time. I hope this logfile contains useful information 🙂