postgres-operator: Database server fails to start

I’m trying to deploy Postgres Operator on a DigitalOcean cluster, but it keeps failing because (as far as I can tell) Postgres never starts. The operator fails to create the cluster since a master is never elected (i.e. no pod gets the label spilo-role=master). Checking the logs of the pod in question (there are two pods, and the other one gets labeled as a replica), the reason appears to be constant failures to connect to Postgres:

2019-01-01 21:45:52,393 INFO: establishing a new patroni connection to the postgres cluster
/var/run/postgresql:5432 - accepting connections
2019-01-01 21:45:52,697 INFO: Could not take out TTL lock
2019-01-01 21:45:52,885 INFO: following new leader after trying and failing to obtain lock
2019-01-01 21:45:59,681 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/api.py", line 488, in query
    cursor.execute(sql, params)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/api.py", line 437, in get_postgresql_status
    self.server.patroni.postgresql.lsn_name), retry=retry)[0]
  File "/usr/local/lib/python3.6/dist-packages/patroni/api.py", line 412, in query
    return self.server.query(sql, *params)
  File "/usr/local/lib/python3.6/dist-packages/patroni/api.py", line 493, in query
    raise PostgresConnectionException('connection problems')
patroni.exceptions.PostgresConnectionException: 'connection problems'

Can someone please help me understand why my cluster won’t start? You can see my Kubernetes manifests here. Note that I use a personal fork of the operator, but it should behave mostly the same as the original, aside from some tweaks for added robustness.

Please let me know if you need more info.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Happy to help; we will prep a PR updating the example manifests.

FYI: We believe you ran into a problem that existed in Postgres for a few versions/commits but has since been resolved. The new Spilo image, with a newer Postgres version, therefore prevents Postgres from running into this error state and crashing (which is what all the segfaults and immediate shutdowns in your log are).

Steps to try: update your operator config to the new image and recreate the operator (see the sketch below). It will attempt a rolling upgrade, which might fix the problem. If not, delete the manifest and start from scratch, as I assume the cluster is still empty.
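
For reference, a minimal sketch of what that config change could look like, assuming the stock ConfigMap-based operator configuration from the example manifests; key names may differ in your fork, and the registry prefix here is a placeholder:

```yaml
# Sketch of the postgres-operator ConfigMap; only the image setting is shown.
# Merge this into your existing ConfigMap rather than replacing it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  # Point the operator at the newer Spilo image suggested below.
  # "registry.example.com" is a placeholder for your actual registry.
  docker_image: registry.example.com/acid/spilo-cdp-11:1.5-p42
```

After applying the change, recreate the operator pod so it picks up the new ConfigMap and starts the rolling upgrade of the Spilo pods.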

While this seems scary at first, it is normally recoverable: it does not usually affect the old master node, and it can be reverted by moving the Docker image forward or backward to a working version, or by disabling bg_mon temporarily. (bg_mon is not the culprit; it just triggered the error.)

I use the Spilo image spilo-cdp-10:1.4-p29.

Try the latest version (spilo-cdp-11:1.5-p42); if possible, delete the current cluster and recreate it with the new Spilo. To test without bg_mon, you can override shared_preload_libraries in the Postgres parameters section of the manifest (see the sketch below).
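
For reference, a minimal sketch of how that override could look in the cluster manifest, assuming the stock acid.zalan.do/v1 postgresql CRD from the example manifests; the cluster name, team, and sizes are placeholders, and the library list is only illustrative, so check what your Spilo image puts in shared_preload_libraries by default before trimming bg_mon out of it:

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  # Placeholder name; use your existing cluster name.
  name: acid-test-cluster
spec:
  teamId: "acid"
  numberOfInstances: 2
  volume:
    size: 5Gi
  postgresql:
    version: "11"
    parameters:
      # Override Spilo's default shared_preload_libraries without bg_mon,
      # to test whether the crashes go away when bg_mon is not loaded.
      # The value below is illustrative, not the full default list.
      shared_preload_libraries: "pg_stat_statements"
```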

@erthalion I copied logs from pgdata/pgroot/pg_logs/postgresql-3.csv in the master pod: gist. Hope it helps.