zulip: Messages get stuck or disappear

I’m running the zulip/docker-zulip:latest image under Kubernetes, using a Helm chart based on the armooo/zulip-helm chart. Some time after the Zulip server pod starts, messages stop being delivered. Sometimes the service works for a few days before failing, sometimes only for a few hours. The sending client doesn’t get a server reply with a message timestamp, and server.log does not log any /json/message entries. Deleting the Zulip server pod (which restarts Zulip but none of the other services) usually causes the undelivered messages to be delivered.

zulip: 1.9.0-latest
memcached: 2.3.1
redis: 4.2.0
postgresql: 0.19.0
rabbitmq: 3.5.0

About this issue

State: closed
Created 6 years ago
Comments: 26 (13 by maintainers)

Commits related to this issue

rabbitmq: Set a short TCP keepalive idle time on BlockingConnection. The code comment explains this issue in some detail, but essentially in Kubernetes and Docker Swarm systems, the container overlay... — committed to YashRE42/zulip by jabagawee 5 years ago
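The commit summary above is truncated, so the following is only a rough illustration of what "a short TCP keepalive idle time" means at the socket level. It is a minimal sketch using the Python standard library, not the code from the commit: the function name `enable_short_keepalive` and the 60-second value are assumptions made for this example, and `TCP_KEEPIDLE` is only available on some platforms (notably Linux), so it is guarded.

```python
import socket


def enable_short_keepalive(sock, idle_seconds=60):
    """Turn on TCP keepalive and, where supported, shorten the idle time.

    Hypothetical helper for illustration; idle_seconds=60 is an assumed value.
    """
    # Ask the kernel to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is the idle time before the first probe is sent.
    # The constant is platform-specific, so only set it if it exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock


# Usage: configure a socket before connecting it.
s = enable_short_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
s.close()
```

The idea is that periodic probes keep an idle connection visibly active to any intermediate network layer that would otherwise silently forget about it.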

Most upvoted comments

We’re running into this issue as well. We’ve tracked it down to the message_sender queue workers not actually doing anything. When we shell into the Zulip pod, we can flush all queued messages by starting a new worker manually ($ python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=4), but when that process exits the issue returns.

Killing the existing message_sender processes ($ pkill -f message_sender) restores the service, because supervisor restarts them. We’re going to try abusing the liveness probe to kill these on a cadence until the cause can be determined. It would be beneficial to have separate images for the different components currently bundled in the zulip container, allowing for a distributed deployment.
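For anyone wanting to script the same workaround, here is a minimal Python sketch of what the pkill command above does, assuming a Linux container with /proc mounted. The function names `find_pids` and `restart_workers` are hypothetical, it does not detect whether a worker is actually stuck, and it relies on supervisor respawning the workers as described above.

```python
import os
import signal
from pathlib import Path
from typing import List


def find_pids(pattern: str, proc_root: str = "/proc") -> List[int]:
    """Return PIDs whose full command line contains `pattern` (like pgrep -f)."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited or is not readable
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        if pattern in cmdline and int(entry.name) != os.getpid():
            pids.append(int(entry.name))
    return sorted(pids)


def restart_workers(pattern: str = "message_sender") -> List[int]:
    """Send SIGTERM to matching processes and return the PIDs signalled."""
    signalled = []
    for pid in find_pids(pattern):
        try:
            os.kill(pid, signal.SIGTERM)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # the process went away between listing and signalling
    return signalled
```

Calling `restart_workers()` from a script run on a schedule inside the pod would approximate the "kill on a cadence" idea; it treats the symptom only and does not address the underlying cause.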

sgowie on Jan 2, 2019