zulip: Messages get stuck or disappear

I’m running the zulip/docker-zulip:latest image under Kubernetes, using a Helm chart based on the armooo/zulip-helm chart. Some time after the Zulip server pod starts, messages stop being delivered. Sometimes the service works for a few days, sometimes only for hours, before it fails. The sending client never gets a server reply with a message timestamp, and server.log contains no /json/message entries. Deleting the Zulip server pod, which restarts Zulip but none of the other services, usually causes the undelivered messages to be delivered.

  • zulip 1.9.0-latest
  • memcached: 2.3.1
  • redis: 4.2.0
  • postgresql: 0.19.0
  • rabbitmq: 3.5.0

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 26 (13 by maintainers)

Most upvoted comments

We’re running into this issue as well. We’ve tracked it down to the message_sender queue workers not actually processing anything. When we shell into the Zulip pod, we can flush all queued messages by starting a new worker ($ python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=4), but once that worker exits the issue returns.
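
For anyone who wants to confirm the same symptom, here is a rough sketch of how the backlog could be checked. It assumes the RabbitMQ queue carries the same name as the worker (message_sender) and that rabbitmqctl is reachable from wherever this runs, most likely the RabbitMQ pod rather than the Zulip pod. The function names are made up for illustration.

```python
import subprocess


def parse_list_queues(output):
    """Parse `rabbitmqctl list_queues name messages consumers` output.

    Data rows are tab-separated: name, message count, consumer count.
    Banner and header lines do not fit that shape and are skipped.
    """
    queues = {}
    for line in output.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue  # banner such as "Listing queues ..."
        name, messages, consumers = parts
        if not (messages.isdigit() and consumers.isdigit()):
            continue  # column header row
        queues[name] = (int(messages), int(consumers))
    return queues


def backlog(queues, name="message_sender"):
    """Return the number of messages waiting in the named queue (0 if absent)."""
    messages, _consumers = queues.get(name, (0, 0))
    return messages


def fetch_queues():
    """Run rabbitmqctl and return the parsed queue table."""
    result = subprocess.run(
        ["rabbitmqctl", "list_queues", "name", "messages", "consumers"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_list_queues(result.stdout)
```

A backlog that keeps growing while consumers are still attached would be consistent with what we see, although a single snapshot cannot tell a stalled worker from a briefly busy one.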

Killing the existing message_sender processes ($ pkill -f message_sender) restores service, because supervisor restarts them. We’re going to try abusing the liveness probe to kill these processes on a cadence until the cause can be determined. It would also be beneficial to have discrete images for the different components currently bundled in the zulip container, which would allow a distributed deployment.
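
A minimal sketch of the cadence idea, in case it helps others. It is not something we have validated: the stamp file path, the one-hour interval, and the function names are all assumptions. The only command it runs is the pkill from above.

```python
import os
import subprocess
import time

STAMP_PATH = "/tmp/worker_recycle.stamp"  # hypothetical location
INTERVAL_SECONDS = 3600                   # hypothetical cadence


def restart_due(last_restart, now, interval):
    """True once at least `interval` seconds have passed since the last restart."""
    return now - last_restart >= interval


def read_stamp(path):
    """Return the stamp file's modification time, or None if it does not exist."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def touch(path):
    """Create the stamp file if needed and set its mtime to now."""
    with open(path, "a"):
        pass
    os.utime(path, None)


def probe():
    """Entry point for an exec liveness probe; always reports success."""
    last = read_stamp(STAMP_PATH)
    if last is None:
        touch(STAMP_PATH)  # first probe after pod start: just start the clock
    elif restart_due(last, time.time(), INTERVAL_SECONDS):
        # pkill exits non-zero when nothing matches, so its status is ignored.
        # Keep "message_sender" out of this script's own file name and
        # arguments, or `pkill -f` would match the probe process too.
        subprocess.run(["pkill", "-f", "message_sender"])
        touch(STAMP_PATH)
    return 0
```

Returning 0 every time means Kubernetes never restarts the pod because of this probe. Only the workers are recycled, and supervisor brings them back.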