spilo: Postmaster not starting

Hello,

We’re trying to deploy Harbor on a Kubernetes cluster using the Zalando Postgres Operator. The operator is deployed, appears to be running fine, and detects when a new cluster is to be managed.

When we deploy an HA PostgreSQL cluster with Spilo (image: spilo-14:2.1-p3) using the following config:

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  creationTimestamp: "2022-04-11T08:39:32Z"
  generation: 1
  name: harbor-postgresql
  namespace: vanillastack-harbor
  resourceVersion: "197604124"
  uid: <redacted>
spec:
  databases:
    harbor: harbor
    notary_server: harbor
    notary_signer: harbor
    registry: harbor
  enableLogicalBackup: true
  logicalBackupSchedule: 30 */12 * * *
  numberOfInstances: 2
  postgresql:
    parameters:
      max_connections: "400"
    version: "14"
  resources:
    limits:
      cpu: 750m
      memory: 1.5Gi
    requests:
      cpu: 100m
      memory: 1Gi
  teamId: harbor
  users:
    harbor: []
    postgres:
    - superuser
    - createdb
  volume:
    size: 20Gi

both pods are shown as running, but the logs show that the cluster fails to bootstrap:

2022-04-11 08:42:52,811 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-04-11 08:42:52,822 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-04-11 08:42:52,824 INFO: Lock owner: None; I am harbor-postgresql-0
2022-04-11 08:42:52,842 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /home/postgres/pgdata/pgroot/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    /usr/lib/postgresql/14/bin/pg_ctl -D /home/postgres/pgdata/pgroot/data -l logfile start

2022-04-11 08:42:53,887 INFO: postmaster pid=222
2022-04-11 08:42:53 UTC [222]: [1-1] 6253ea0d.de 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2022-04-11 08:42:53 UTC [222]: [2-1] 6253ea0d.de 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
/var/run/postgresql:5432 - no response
/var/run/postgresql:5432 - no response
2022-04-11 08:42:55,973 ERROR: postmaster is not running
2022-04-11 08:42:55,978 INFO: removing initialize key after failed attempt to bootstrap the cluster
2022-04-11 08:42:55,985 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2022-04-11-08-42-55
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 171, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 139, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 109, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 112, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1471, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1345, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1238, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1231, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/run/service/patroni: finished with code=1 signal=0
/run/service/patroni: sleeping 120 seconds

The logs say that the postmaster is not running, and indeed no postmaster process shows up in the process list:

    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.s
      7 ?        S      0:00 /bin/sh /launch.sh
     31 ?        S      0:00 /usr/bin/runsvdir -P /etc/service
     32 ?        Ss     0:00 runsv pgqd
     33 ?        Ss     0:00 runsv patroni
     35 ?        S      0:00 /bin/bash /scripts/patroni_wait.sh --role master -- /usr
    288 pts/0    Ss     0:00 bash
    312 ?        S      0:00 sleep 60
    317 pts/0    R+     0:00 ps -ax

Do you have any idea why this might be happening? We’re running Kubernetes 1.23.5. I should also note that we have a very similar setup on a different cluster where everything works fine; we can’t figure out why it doesn’t deploy on this one.

If you need more information please let me know.

About this issue

State: closed
Created 2 years ago
Comments: 20 (5 by maintainers)

Most upvoted comments

It seems that people are slowly starting to switch to cgroup v2. I’ll prepare the fix.

CyberDem0n on Jun 9, 2022

@haslersn there is a fix: https://github.com/zalando/spilo/releases/tag/2.1-p6 (“Compatibility with cgroup v2 when figuring out memory limit and auto-calculating shared_buffers size.”)

FactorT on Jul 11, 2022
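
For readers unfamiliar with the cgroup v1/v2 difference mentioned in the fix, here is a minimal Python sketch of how a container memory limit could be detected under both layouts and turned into a shared_buffers value. It is not Spilo's actual code: the function names, the `root` parameter, the 20% fraction, the 128 MB fallback, and the "unlimited" threshold are all assumptions made for illustration. Only the two file names are the usual kernel interfaces (`memory.max` for v2, `memory/memory.limit_in_bytes` for v1).

```python
from pathlib import Path
from typing import Optional

V2_FILE = "memory.max"                    # cgroup v2 (unified hierarchy)
V1_FILE = "memory/memory.limit_in_bytes"  # cgroup v1 (memory controller)

# Assumption: cgroup v1 reports "no limit" as a huge number, so anything at or
# above this threshold is treated as unlimited. The threshold is a heuristic.
UNLIMITED_THRESHOLD = 1 << 60
MIB = 1024 * 1024


def container_memory_limit(root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Return the memory limit in bytes, or None if no usable limit is found.

    `root` is a hypothetical hook so the function can be pointed at a fake
    directory tree in tests.
    """
    base = Path(root)
    for name in (V2_FILE, V1_FILE):
        try:
            raw = (base / name).read_text().strip()
        except OSError:
            continue  # file missing under this layout, try the next one
        if raw == "max":  # cgroup v2 spelling of "unlimited"
            return None
        if raw.isdigit():
            value = int(raw)
            return None if value >= UNLIMITED_THRESHOLD else value
    return None


def suggested_shared_buffers_mb(limit_bytes: Optional[int],
                                fraction: float = 0.2,
                                fallback_mb: int = 128) -> int:
    """Pick a shared_buffers size in MB as a fraction of the memory limit.

    The fraction and fallback are illustrative assumptions, not Spilo's values.
    """
    if limit_bytes is None:
        return fallback_mb
    return int(limit_bytes * fraction) // MIB
```

A lookup that only knows the v1 path finds nothing on a cgroup v2 node, so any size derived from it may not reflect the pod's actual limit. That would be consistent with the symptom reported here, though the thread does not show the exact code path.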

I ran into this, and what helped me solve it was raising the log level to DEBUG. It turned out that the command to start postgres attempted to use too much shared_buffers memory: https://dba.stackexchange.com/questions/184951/memory-errors-on-startup-in-postgresql-9-6-log-map-hugetlb-failed

LiuShuaiyi on Apr 22, 2022
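
The shared_buffers problem described in the comment above can be checked roughly before digging through DEBUG logs. The Python sketch below compares a PostgreSQL memory setting against a Kubernetes memory limit such as the `1.5Gi` in the manifest. All helper names are hypothetical, only binary Kubernetes suffixes are handled, and a bare number is assumed to mean 8 kB blocks (the default block size). Staying under the limit is a necessary condition, not a guarantee that the server will start.

```python
import re

# PostgreSQL memory units are multiples of 1024; so are the Kubernetes
# binary suffixes (Ki, Mi, Gi, Ti).
PG_UNITS = {"B": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
K8S_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def pg_size_to_bytes(value: str, block_size: int = 8192) -> int:
    """Convert a setting such as '128MB' to bytes.

    A bare number is interpreted as a count of blocks (assumed 8 kB each).
    """
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", value)
    if not match:
        raise ValueError(f"unrecognised PostgreSQL size: {value!r}")
    number, unit = int(match.group(1)), match.group(2)
    if unit == "":
        return number * block_size
    if unit not in PG_UNITS:
        raise ValueError(f"unknown PostgreSQL unit: {unit!r}")
    return number * PG_UNITS[unit]


def k8s_memory_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '1.5Gi' to bytes (binary suffixes only)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)?", quantity.strip())
    if not match:
        raise ValueError(f"unsupported Kubernetes quantity: {quantity!r}")
    factor = K8S_UNITS.get(match.group(2), 1)  # no suffix means plain bytes
    return int(float(match.group(1)) * factor)


def shared_buffers_fits(shared_buffers: str, memory_limit: str) -> bool:
    """Return True if shared_buffers is strictly below the memory limit."""
    return pg_size_to_bytes(shared_buffers) < k8s_memory_to_bytes(memory_limit)
```

For example, `shared_buffers_fits("128MB", "1.5Gi")` is true, while `shared_buffers_fits("8GB", "1.5Gi")` is false and would point to a value that cannot be allocated inside the pod.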