docker-zulip: Error: psycopg2.OperationalError: server closed the connection unexpectedly

On the latest 2.1.3 (but also on 2.1.1 and 2.1.2) I am frequently getting this error emailed to me, roughly 10-20 times per day:

---------- Forwarded message ---------
From: myemail
Date: Fri, Apr 3, 2020 at 9:04 PM
Subject: [Django] a111cc8a0340: server closed the connection unexpectedly
To: myemail

Logger root, from module zerver.worker.queue_processors line 151:
Error generated by Anonymous user (not logged in) on a111cc8a0340 deployment

Traceback (most recent call last):
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/db.py", line 31, in execute
    return wrapper_execute(self, super().execute, query, vars)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/db.py", line 18, in wrapper_execute
    return action(sql, params)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/worker/queue_processors.py", line 134, in consume_wrapper
    self.consume(data)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/worker/queue_processors.py", line 310, in consume
    user_profile = get_user_profile_by_id(event["user_profile_id"])
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/cache.py", line 186, in func_with_caching
    val = func(*args, **kwargs)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/models.py", line 2072, in get_user_profile_by_id
    return UserProfile.objects.select_related().get(id=uid)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/query.py", line 374, in get
    num = len(clone)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/query.py", line 232, in __len__
    self._fetch_all()
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/query.py", line 1121, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/query.py", line 53, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
    raise original_exception
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 889, in execute_sql
    cursor.execute(sql, params)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/db.py", line 31, in execute
    return wrapper_execute(self, super().execute, query, vars)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/db.py", line 18, in wrapper_execute
    return action(sql, params)
django.db.utils.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Deployed code:
- ZULIP_VERSION: 2.1.3
- version: docker

Request info: none

The installation otherwise appears to work correctly, though.

In other issues I have posted here, I have been advised that this is caused by restarted services, but that is not the case here: the Docker Swarm stack is not restarted, and the services show an uptime of more than a day while these error emails keep arriving.

I am running a single-node Docker Swarm with a docker-compose file identical to the one described in the docker-zulip repo.

For reference, here it is, with the secrets redacted:

version: '3.3'
services:
  database:
    image: zulip/zulip-postgresql:10
    environment:
      POSTGRES_DB: zulip
      POSTGRES_PASSWORD: xxx
      POSTGRES_USER: zulip
    volumes:
     - zulip_psql_data:/var/lib/postgresql/data
    networks:
     - default
    logging:
      driver: json-file
  memcached:
    image: memcached:alpine
    networks:
     - default
    logging:
      driver: json-file
    command:
      - 'sh'
      - '-euc'
      - |
        echo 'mech_list: plain' > "$$SASL_CONF_PATH"
        echo "zulip@$$HOSTNAME:$$MEMCACHED_PASSWORD" > "$$MEMCACHED_SASL_PWDB"
        exec memcached -S
    environment:
      SASL_CONF_PATH: '/home/memcache/memcached.conf'
      MEMCACHED_SASL_PWDB: '/home/memcache/memcached-sasl-db'
      MEMCACHED_PASSWORD: 'xxx'
  rabbitmq:
    image: rabbitmq:3.7.7
    environment:
      RABBITMQ_DEFAULT_PASS: xxx
      RABBITMQ_DEFAULT_USER: zulip
    volumes:
     - zulip_rabbitmq_data:/var/lib/rabbitmq
    networks:
     - default
    logging:
      driver: json-file
  redis:
    image: redis:alpine
    volumes:
     - zulip_redis_data:/data:rw
    networks:
     - default
    logging:
      driver: json-file
    command:
      - 'sh'
      - '-euc'
      - |
        echo "requirepass '$$REDIS_PASSWORD'" > /etc/redis.conf
        exec redis-server /etc/redis.conf
    environment:
      REDIS_PASSWORD: 'xxx'
  zulip:
    image: zulip/docker-zulip:2.1.3-0
    ports:
      - 80
    environment:
      DB_HOST: database
      DB_HOST_PORT: '5432'
      DB_USER: zulip
      DISABLE_HTTPS: 'True'
      SECRETS_email_password: xxx
      SECRETS_google_oauth2_client_secret: xxx
      SECRETS_postgres_password: xxx
      SECRETS_rabbitmq_password: xxx
      SECRETS_memcached_password: 'xxx'
      SECRETS_redis_password: 'xxx'
      SECRETS_secret_key: xxx
      SECRETS_social_auth_github_secret: xxx
      SETTING_EMAIL_HOST: smtp.gmail.com
      SETTING_EMAIL_HOST_USER: xxx
      SETTING_EMAIL_PORT: '587'
      SETTING_EMAIL_USE_SSL: 'False'
      SETTING_EMAIL_USE_TLS: 'True'
      SETTING_EXTERNAL_HOST: xxx.xxx.xxx
      SETTING_GOOGLE_OAUTH2_CLIENT_ID: xxxxx
      SETTING_MEMCACHED_LOCATION: memcached:11211
      SETTING_PUSH_NOTIFICATION_BOUNCER_URL: https://push.zulipchat.com
      SETTING_RABBITMQ_HOST: rabbitmq
      SETTING_REDIS_HOST: redis
      SETTING_SOCIAL_AUTH_GITHUB_KEY: xxx
      SETTING_ZULIP_ADMINISTRATOR: xxx
      SSL_CERTIFICATE_GENERATION: self-signed
      ZULIP_AUTH_BACKENDS: EmailAuthBackend,GoogleMobileOauth2Backend,GitHubAuthBackend
    volumes:
     - zulip_app_data:/data
    networks:
     - traefik-public
     - default
    logging:
      driver: json-file
    deploy:
      labels:
        traefik.docker.network: traefik-public
        traefik.enable: 'true'
        traefik.http.routers.zulip.entrypoints: websecure
        traefik.http.routers.zulip.rule: Host(`xxx.xxx.xxx`)
        traefik.http.routers.zulip.tls.certresolver: letsencryptresolver
        traefik.http.services.zulip.loadbalancer.server.port: '80'
networks:
  default:
    driver: overlay
  traefik-public:
    external: true
volumes:
  zulip_app_data:
    external: true
  zulip_psql_data:
    external: true
  zulip_rabbitmq_data:
    external: true
  zulip_redis_data:
    external: true

Some hopefully useful info:

docker version
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b7f0
 Built:             Wed Mar 11 01:25:46 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b7f0
  Built:            Wed Mar 11 01:24:19 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

uname -a
Linux xxx.com 4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 11:09:48 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

This is a “sister” issue to zulip/zulip#14456, which I also opened (with a different error message).

Anything else I can provide to help solve this?

Most upvoted comments

It looks like the remaining problem was caused by a Docker Swarm default configuration; a potential solution is suggested here:

https://github.com/vapor/postgres-kit/issues/164#issuecomment-738450518

I’ll transfer this to the docker-zulip repository.
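
For anyone hitting this in the meantime: the workaround category discussed in threads like the one linked above is to make TCP keepalives fire sooner than Swarm's 900-second IPVS idle timeout. A minimal compose-level sketch, assuming your Engine supports sysctls with docker stack deploy (19.03+) and that the exact values suit your environment (they are illustrative, not the confirmed fix from that comment):

  zulip:
    # ...existing service definition from the compose file above...
    sysctls:
      net.ipv4.tcp_keepalive_time: 600     # first keepalive probe after 10 min idle, below IPVS's 900 s timeout
      net.ipv4.tcp_keepalive_intvl: 30     # seconds between subsequent probes
      net.ipv4.tcp_keepalive_probes: 10    # failed probes before the kernel drops the connection

Note that this only helps connections whose sockets actually enable keepalives (psycopg2/libpq does by default; plain memcached clients generally do not).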

OK, my theory is that the Docker Swarm networking stack is killing the open TCP connections between the Zulip server and the postgres/memcached servers. We had a much more fatal, similar issue with RabbitMQ that was fixed last year (b312001fd92dc36233e5a9f57cd9fada890880c4). The symptom is the same as if the service had been restarted: the connections are killed, and each Zulip process re-establishes them when it discovers this (and sends an error email), resulting in this random distribution of error emails.

Googling suggests that other products have indeed had that sort of problem with Docker Swarm’s aggressive killing of TCP connections. https://success.docker.com/article/ipvs-connection-timeout-issue seems to be their knowledge base article on the topic.

@stratosgear, can you try the diagnostic steps described in that article to see whether they suggest this is what's happening? Based on that doc, it looks like Docker Swarm itself doesn't let you configure its behavior of killing idle TCP connections 😦.
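
If the Swarm/IPVS side can't be changed, another angle is to shorten the keepalive idle time at the libpq level, since psycopg2 passes keepalives* connection parameters straight through to libpq. Zulip doesn't expose these as a documented setting, so the snippet below is only a sketch of the mechanism; the DATABASES values shown are placeholders, not Zulip's actual configuration:

# Sketch only: libpq-level TCP keepalive options accepted by psycopg2.
# Where (or whether) these can be injected into a docker-zulip deployment's
# Django settings is an open question; names and values are illustrative.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "zulip",
        "HOST": "database",
        "USER": "zulip",
        "OPTIONS": {
            "keepalives": 1,            # enable client-side TCP keepalives (libpq's default is already on)
            "keepalives_idle": 600,     # first probe after 600 s idle, under IPVS's 900 s timeout
            "keepalives_interval": 30,  # seconds between probes
            "keepalives_count": 3,      # failed probes before the connection is considered dead
        },
    }
}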

For memcached, https://github.com/lericson/pylibmc/issues/199, https://sendapatch.se/projects/pylibmc/behaviors.html, and https://pypi.org/project/pylibmc/1.3.0/ suggest pylibmc has an undocumented option for setting TCP keepalive.
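
For completeness, a minimal pylibmc sketch of what those links appear to describe; the "tcp_keepalive" behavior name is taken from that discussion and is undocumented, so verify it against your pylibmc/libmemcached build before relying on it:

import pylibmc

# Sketch only: "tcp_keepalive" is the undocumented behavior referenced in
# lericson/pylibmc#199. The server address is a placeholder matching the
# compose file above; SASL credentials are omitted for brevity.
client = pylibmc.Client(["memcached:11211"], binary=True,
                        behaviors={"tcp_keepalive": True})
client.set("keepalive-probe", "ok")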