RawRabbit: Publishing suddenly stops and channel workload increases indefinitely

Good morning.

I’m developing an application that reads data from external sources (e.g. Modbus, OPC, etc.) and processes it through a sequence of calculation blocks. I’m using RabbitMQ and RawRabbit to implement a Data Flow Pattern that routes the information among the blocks. I use one instance of the bus client shared across the application, and all publishers and subscribers currently run in the same process. At the moment the data come from a simulated source that just spits out random data.
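For context, the client setup looks roughly like this, with one instance created at startup and shared everywhere (a minimal sketch, not the actual application code; the static holder class is an assumption):

    using RawRabbit;
    using RawRabbit.Instantiation;

    // A single bus client, created once at startup and shared by every
    // publisher and subscriber in the process.
    public static class Bus
    {
        public static readonly IBusClient Client = RawRabbitFactory.CreateSingleton();
    }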

Everything seems to work fine, but after a while (usually around 48 hours) the publishing suddenly stops and the application starts allocating massive amounts of memory.

The process just keeps running until it terminates with an OutOfMemoryException.

The only clue I found about what’s going on is the following line in RawRabbit logs:

    Channel pool currently has 1 channels open and a total workload of n

Until the publishing stops, n is zero. Then this number suddenly starts increasing by 20 with each publish attempt.

The server, likewise, seems unaware that anything is wrong.

I’ve been working with RawRabbit and RabbitMQ for only a very short time, so please pardon my ignorance; I really can’t figure out what’s going on under the hood at the moment.

This is how I publish data:

    await this.databus.PublishAsync(
        dataPackage,
        ctx => ctx
            .UsePublishConfiguration(cfg => cfg
                .OnDeclaredExchange(exchange => exchange.WithName("datahub")))
            .UsePublishAcknowledge(use: false)
    );

I removed the publish acknowledge early on because I had some exceptions being raised, which I found weird considering that the server is on localhost. By doing so I probably just swept something under the carpet … I’ll run another test with publish acknowledge enabled and see what happens.
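For that follow-up test the call is the same as above, just without the opt-out; a sketch, with the catch block purely illustrative:

    try
    {
        // Publish acknowledges left at their default (enabled): a missing
        // broker confirm now surfaces as an exception instead of being
        // silently ignored.
        await this.databus.PublishAsync(
            dataPackage,
            ctx => ctx.UsePublishConfiguration(cfg => cfg
                .OnDeclaredExchange(exchange => exchange.WithName("datahub"))));
    }
    catch (Exception e)
    {
        // Log and inspect: a failure here may point at the channel problem
        // rather than at the broker.
        Console.WriteLine($"Publish failed: {e}");
    }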

Thanks for any help you can provide. Best regards, Luca

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 1
  • Comments: 56 (18 by maintainers)

Most upvoted comments

We are seeing the same problems as described here by @jmgarrett and @RedOnePrime when under moderate load. After a few messages are sent, no more messages can be sent.

I made some quick local changes (very brute force) to the two locations below, and now publishing no longer stops after a few seconds in our load test (see the sketch after the list).

  • ConcurrentChannelQueue.Enqueue: always raise the Queued event
  • StaticChannelPool: remove the IsEmpty/Count == 0 test at the top and change the TryLock so that the rest of the method is wrapped in a lock(_workLock) block
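For illustration, the two changes look roughly like this. ConcurrentChannelQueue and StaticChannelPool are RawRabbit internals, and the member names below are reconstructed from the description above rather than from the actual source, so treat them as assumptions (IModel is RabbitMQ.Client's channel interface):

    // ConcurrentChannelQueue.Enqueue: the original only raised Queued when
    // the queue transitioned from empty, which can race with the dequeuing
    // side; the brute-force fix raises the event on every enqueue.
    public TaskCompletionSource<IModel> Enqueue()
    {
        var channelTcs = new TaskCompletionSource<IModel>();
        _queue.Enqueue(channelTcs);
        Queued?.Invoke(this, EventArgs.Empty); // always raise, not only when previously empty
        return channelTcs;
    }

    // StaticChannelPool: the IsEmpty/Count == 0 early-out and the
    // Monitor.TryEnter (which silently skips serving when contended) are
    // replaced by a full lock, so queued channel requests are never dropped.
    private void ServeChannelRequests()
    {
        lock (_workLock)
        {
            while (ChannelRequestQueue.TryDequeue(out var channelTcs))
            {
                // GetNextChannel() stands in for the pool's channel lookup.
                channelTcs.TrySetResult(GetNextChannel());
            }
        }
    }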

This ticket is pretty old. Is there any idea when a fix will be part of a release? If there isn’t going to be one, we may just have to keep our local repo with the changes.

Dear @pardahlman, I collected dotMemory snapshots of rc5, hoping they’ll contain some useful clues.

Download dotMemory export

Thanks, Luca

Hi! Great job: the system is more stable now, but unfortunately it still stopped working in the end.

I’m sending you the logs; the last message successfully processed was timestamped 20:58:32.

logs.zip

Let me know if there’s anything I can do to provide more information.

Regards, Luca

Thank you very much for the follow-up! I’ll start testing immediately!

Thank you for taking the time to look at the sample. In the real situation the publishing rate is far slower, but it’s still great food for thought! I’ll try this prefetch count in the real application and see what happens (a sketch of the setting follows).
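For anyone following along, setting the prefetch count on a RawRabbit subscription looks roughly like this (a sketch; DataPackage, ProcessAsync and the count of 50 are placeholders):

    await databus.SubscribeAsync<DataPackage>(
        async msg =>
        {
            await ProcessAsync(msg); // placeholder for the real calculation block
        },
        ctx => ctx.UseSubscribeConfiguration(cfg => cfg
            .Consume(consume => consume
                .WithPrefetchCount(50)))); // cap unacked messages per consumer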

Thanks again!

@pardahlman thank you for your time. In my case the broker is on the same machine as the publishers and subscribers, and the publishers and subscribers are in the same application, so it is not exactly the same configuration as your test, I think. Also, I forward messages by publishing from within the scope of a subscriber’s callback (see the sketch below). I don’t know if that makes any difference, but maybe it would be worth including in your test. You can find a project that reproduces exactly what I am doing a few comments back. On my machine that project usually stops working after a few minutes, although on one occasion it stayed alive for an hour or so.
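The forwarding pattern in question, sketched (DataPackage and the exchange name mirror the publish snippet above; Calculate stands in for the real block logic):

    await databus.SubscribeAsync<DataPackage>(async incoming =>
    {
        // Each calculation block transforms the incoming message...
        var outgoing = Calculate(incoming);

        // ...and forwards the result from inside the subscriber's callback,
        // on the same shared bus client (and therefore the same channel pool).
        await databus.PublishAsync(
            outgoing,
            ctx => ctx.UsePublishConfiguration(cfg => cfg
                .OnDeclaredExchange(exchange => exchange.WithName("datahub"))));
    });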

To answer your other questions: I believe that the subscribers keep working but the publishers stop writing to the queue. Since everything runs in the same application, I would not know how to restart the publishers alone. In the real situation I publish very few messages … about 2-4 per second, and it crashes on average after 48 hours. I do await the publish calls.

I hope I answered everything; let me know if I can give you any more information.

Thanks

I created a model of what goes on in my application in order to replicate the issue. It’s not deterministic, but at least the problem manifests within minutes rather than hours.

RawRabbitTest.zip: https://github.com/pardahlman/RawRabbit/files/1680583/RawRabbitTest.zip