nats-server: Round-robin load balance fairness with slow queue subscribers

I’ve been tracking down an issue in my application where my workers in a queue subscription are not being utilised properly. The structure of my application is:

- Queued subscription of workers that receive task requests, do some processing, reply with the result and then try to get the next available task again
- Clients send a request to process a task and wait for the reply

When I have 3 idle workers I can send 3 requests in parallel which instruct the workers to sleep for 5 seconds and then reply. I would expect to see behaviour where all 3 idle workers are being utilised for the 3 requests. But instead it seems like requests can pile up on a busy worker, making the work inefficient.

I’ve been given information about the nature of subscription queues having no concept of whether the receivers are actually able to handle the next message: https://natsio.slack.com/archives/C069GSYFP/p1547221881169300

But this information is concerning because I now have more than one project which relies on subscription queues with workers that do time-consuming work in their message handlers before accepting the next message. I now wonder whether NATS queues are efficient enough for these situations, or whether I have to create my own broker in the middle that manages the queue and knows which worker is waiting for a message.

Versions of `gnatsd` and affected client libraries used:

gnatsd / embedded server 1.3.0
Go client library 1.6.0

OS/Container environment:

Ubuntu 16.04

Steps or code to reproduce the issue:

Please reference the following complete reproduction of the issue: https://gist.github.com/justinfx/b0adb36e694ec03365da19b6bbf33c20

Expected result:

Each idle worker subscriber would receive one of the 3 available messages

Actual result:

2 workers receive a message and take time to process, while the 3rd worker remains idle. Once one of the other workers finishes its message handling it receives the 3rd message
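This outcome is what you would expect most of the time if the server picks a queue subscriber at random without regard to whether it is busy, as a maintainer reply further down describes. The sketch below is a small worked count, not server code: it assumes each message independently goes to a uniformly chosen worker (a simplifying assumption) and enumerates every possible assignment. `countAssignments` is a hypothetical helper name.

```go
package main

import "fmt"

// countAssignments enumerates every way to hand `messages` messages to
// `workers` workers when each message may go to any worker, and returns
// how many assignments give no worker more than one message (spread)
// out of all possible assignments (total).
func countAssignments(workers, messages int) (spread, total int) {
	load := make([]int, workers) // messages currently assigned to each worker
	var walk func(m int)
	walk = func(m int) {
		if m == messages {
			total++
			for _, n := range load {
				if n > 1 {
					return // some worker has a backlog
				}
			}
			spread++
			return
		}
		for w := 0; w < workers; w++ {
			load[w]++
			walk(m + 1)
			load[w]--
		}
	}
	walk(0)
	return spread, total
}

func main() {
	spread, total := countAssignments(3, 3)
	fmt.Printf("%d of %d assignments use every worker (%.0f%%)\n",
		spread, total, 100*float64(spread)/float64(total))
}
```

Under that assumption only 6 of the 27 possible assignments (about 22%) give each of the 3 workers exactly one message; in the rest at least one worker sits idle while another has a backlog.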

Output from gnatsd: gnatsd_out.txt

About this issue

State: closed
Created 5 years ago
Comments: 24 (12 by maintainers)

Most upvoted comments

Understood. JetStream will be pretty slick and handle multiple cases like streaming and MQs with one tech.

derekcollison on May 24, 2019

I didn’t have a consistent solution for this specific problem using nats streaming, but I am having great success with streaming on a more recent project. So it was probably just not suited to this queuing problem.

My solution of having a broker has been working great. Every idle consumer gets a message and I fully utilise my workers. It has some naive handling of workers that go away after they have asked for another task but before one is finally ready to hand out: I’m timing out the sending of the task to the worker after a second and retrying it on another worker. But I could update it to ping the workers after they ask for a task and preemptively bail on them when they go away.

justinfx on May 10, 2019

NATS core is at-most-once, so queue subscribers are randomly selected and the message is sent with that QoS. This means that if the selected subscriber can’t process the message for whatever reason, that request is lost and the requestor should time out and retry. If you want more intelligent queue subscriber selection that is aware of the number of outstanding messages, use NATS Streaming, which offers at-least-once semantics.

derekcollison on May 10, 2019