channels_redis: Multiple Daphne instances with Redis backend get stuck

I am currently debugging the following situation, for which I have not been able to create a reproducible test case, mostly due to time constraints. I am posting what I know so far; perhaps it rings a bell for someone and saves me from looking further, but I suspect a bug of some kind:

  • Chromium browser client, Ubuntu 16.04 server
  • channels 2.0.2, daphne 2.0.4, Django 2.0.2, channels_redis 2.1.0

I have an Nginx proxy that balances between two Daphne instances running the same codebase, with a Redis channel backend. I am using Celery for several backend tasks and want to report their status to the client that is waiting on them. For this, the tasks publish a PROGRESS state, and I have hooked some Channels code into the storage function so that a message is sent to a group waiting on that task, e.g.

async_to_sync(channel_layer.group_send)("celery-status-<task-uuid>", {"type": "celery.task_status", "text": task_status})
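
For context, here is a minimal, self-contained sketch of how such a publish could look from inside a Celery task itself (the original hooks the storage function instead; the task name and payload below are hypothetical):

from asgiref.sync import async_to_sync
from celery import shared_task
from channels.layers import get_channel_layer

@shared_task(bind=True)
def long_running_task(self):
    channel_layer = get_channel_layer()
    for step in range(100):
        # ... do one unit of work ...
        self.update_state(state="PROGRESS", meta={"current": step})
        # Publish the new status to every consumer subscribed to this task's group.
        async_to_sync(channel_layer.group_send)(
            "celery-status-%s" % self.request.id,
            {"type": "celery.task_status", "text": "PROGRESS %d of 100" % step},
        )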

I have a consumer that looks a bit like this:

from channels.generic.websocket import AsyncWebsocketConsumer


class MyConsumer(AsyncWebsocketConsumer):
    async def receive(self, text_data=None, bytes_data=None):
        # Subscribe this connection to status updates for the requested task.
        await self.channel_layer.group_add("celery-status-<task-uuid>", self.channel_name)

    async def celery_task_status(self, event):
        # Called for messages sent to the group with type "celery.task_status".
        await self.send(text_data=event["text"])
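
Not shown above, but for completeness, the symmetric call that removes the channel from the group again would look roughly like this (a sketch; the hard-coded group name is just illustrative, as in the snippet above):

    async def disconnect(self, close_code):
        # Unsubscribe so the group does not keep a reference to a closed channel.
        await self.channel_layer.group_discard("celery-status-<task-uuid>", self.channel_name)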

Now, when someone requests to receive updates for a given Celery task ID, status updates are sent to the client. (I also immediately send the current status to the consumer, which works fine, so it is left out of the code example.)

The bug I encounter is that after a while, progress updates are no longer sent to the clients. This only appears to happen when two or more Daphne servers are running; the Redis queues fill up. When I shut down one of the Daphne servers and clear the Redis cache, everything works fine again.

Perhaps this has to do with my Nginx proxy setup, which simply has two upstream servers specified, although I’m not sure that’s the culprit. I think running two Daphne instances is the root of the problem.

Is this a known issue? Is there something I can add to this bug to make it more reproducible?

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 2
  • Comments: 20 (8 by maintainers)

Most upvoted comments

@ericls I’m not quite sure what you’re talking about…

Is it “different event loop requires different connection?” channels_rabbitmq and channels_redis address this in the same way: a mapping from EventLoop to Connection (well, List[Connection], in channels_redis’ case). Indeed, each project has a battery of unit tests that spawns and destroys several event loops.
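
For anyone following along, a rough sketch of that pattern (purely illustrative, not the actual channels_redis internals):

import asyncio

class ConnectionsPerLoop:
    """Keeps one connection (or list of connections) per running event loop."""

    def __init__(self, create_connection):
        self._create_connection = create_connection  # async connection factory
        self._connections = {}  # maps an event loop to its connection

    async def get(self):
        loop = asyncio.get_event_loop()
        if loop not in self._connections:
            self._connections[loop] = await self._create_connection()
        return self._connections[loop]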

Is it “invoking async code from @async_to_sync’s executor threads is tricky?” Again, channels_rabbitmq and channels_redis address this with the same strategy: er, nothing. (Channel-layer methods can only be invoked on the event loop’s main thread, not its executor threads.)
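
To illustrate the constraint being described (my own example, not library code): a function running on an executor thread has no running event loop, so it cannot call the channel layer directly and must hand the coroutine back to the main loop, e.g.

import asyncio

def work_on_executor_thread(loop, channel_layer):
    # Runs via loop.run_in_executor(); schedule the send back onto the main loop.
    future = asyncio.run_coroutine_threadsafe(
        channel_layer.group_send(
            "celery-status-<task-uuid>",
            {"type": "celery.task_status", "text": "PROGRESS"},
        ),
        loop,
    )
    future.result()  # block this thread until the main loop has sent the message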

The one difference between channels_rabbitmq and channels_redis is that channels_rabbitmq immediately starts consuming messages when you connect, and channels_redis waits for you to ask for one. I think we’re straying pretty far from this particular bug report, though.

The point is: I think my workaround is viable. If there’s a bug in channels_rabbitmq, please report it! https://github.com/CJWorkbench/channels_rabbitmq/issues