spilo: Postmaster not starting
Hello,
We're trying to deploy Harbor on a Kubernetes cluster using the Zalando Postgres Operator. The operator is deployed, seems to be running fine, and detects when a new cluster is to be managed.
We deploy an HA PostgreSQL cluster with Spilo (image: spilo-14:2.1-p3) using the following config:
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  creationTimestamp: "2022-04-11T08:39:32Z"
  generation: 1
  name: harbor-postgresql
  namespace: vanillastack-harbor
  resourceVersion: "197604124"
  uid: <redacted>
spec:
  databases:
    harbor: harbor
    notary_server: harbor
    notary_signer: harbor
    registry: harbor
  enableLogicalBackup: true
  logicalBackupSchedule: 30 */12 * * *
  numberOfInstances: 2
  postgresql:
    parameters:
      max_connections: "400"
    version: "14"
  resources:
    limits:
      cpu: 750m
      memory: 1.5Gi
    requests:
      cpu: 100m
      memory: 1Gi
  teamId: harbor
  users:
    harbor: []
    postgres:
    - superuser
    - createdb
  volume:
    size: 20Gi
Kubernetes shows both pods as running, but the logs show that bootstrapping the cluster fails:
2022-04-11 08:42:52,811 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-04-11 08:42:52,822 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-04-11 08:42:52,824 INFO: Lock owner: None; I am harbor-postgresql-0
2022-04-11 08:42:52,842 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /home/postgres/pgdata/pgroot/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
/usr/lib/postgresql/14/bin/pg_ctl -D /home/postgres/pgdata/pgroot/data -l logfile start
2022-04-11 08:42:53,887 INFO: postmaster pid=222
2022-04-11 08:42:53 UTC [222]: [1-1] 6253ea0d.de 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2022-04-11 08:42:53 UTC [222]: [2-1] 6253ea0d.de 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
/var/run/postgresql:5432 - no response
/var/run/postgresql:5432 - no response
2022-04-11 08:42:55,973 ERROR: postmaster is not running
2022-04-11 08:42:55,978 INFO: removing initialize key after failed attempt to bootstrap the cluster
2022-04-11 08:42:55,985 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2022-04-11-08-42-55
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 171, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 139, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 109, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 112, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1471, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1345, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1238, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1231, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/run/service/patroni: finished with code=1 signal=0
/run/service/patroni: sleeping 120 seconds
The logs say that the postmaster is not running, and indeed no postmaster process shows up in the process list:
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.s
7 ? S 0:00 /bin/sh /launch.sh
31 ? S 0:00 /usr/bin/runsvdir -P /etc/service
32 ? Ss 0:00 runsv pgqd
33 ? Ss 0:00 runsv patroni
35 ? S 0:00 /bin/bash /scripts/patroni_wait.sh --role master -- /usr
288 pts/0 Ss 0:00 bash
312 ? S 0:00 sleep 60
317 pts/0 R+ 0:00 ps -ax
Do you have any idea why this might be happening? We're running on Kubernetes 1.23.5. Note that we also have a very similar cluster setup where everything works fine; we can't figure out why the deployment fails on this cluster.
If you need more information, please let me know.
About this issue
- State: closed
- Created 2 years ago
- Comments: 20 (5 by maintainers)
It seems that people are slowly starting to switch to cgroup v2. I’ll prepare the fix.
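For anyone wanting to check whether a given node or pod uses cgroup v2, here is a minimal Python sketch. It assumes the cgroup filesystem is mounted at /sys/fs/cgroup and treats the presence of a cgroup.controllers file at the mount root as the v2 indicator. The function name cgroup_version is made up for illustration and is not part of Spilo or Patroni.

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 for the unified (v2) hierarchy, 1 for legacy/hybrid, 0 if unknown.

    Hypothetical helper for illustration; not taken from Spilo or Patroni.
    """
    base = Path(root)
    # On cgroup v2 the mount root exposes a cgroup.controllers file.
    if (base / "cgroup.controllers").is_file():
        return 2
    # On cgroup v1 each controller has its own directory, e.g. memory/.
    if (base / "memory").is_dir():
        return 1
    return 0


if __name__ == "__main__":
    print(f"cgroup version: {cgroup_version()}")
```

Running this on both the working and the failing cluster (inside the pod or on the node) would show whether the two environments differ in this respect.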
@haslersn there is a fix - https://github.com/zalando/spilo/releases/tag/2.1-p6 “Compatibility with cgroup v2 when figuring out memory limit and auto-calculating shared_buffers size.”
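To illustrate what the quoted release note is about, here is a rough Python sketch of reading the container memory limit under both cgroup versions and deriving a shared_buffers value from it. This is not Spilo's actual code: the function names and the 20% fraction are assumptions for illustration. Only the two file locations (memory.max for v2, memory/memory.limit_in_bytes for v1) are standard kernel interfaces, and the sketch assumes they are visible under /sys/fs/cgroup inside the container.

```python
from pathlib import Path
from typing import Optional


def read_memory_limit(root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Return the memory limit in bytes, or None if unlimited or not found.

    Hypothetical helper; it tries the cgroup v2 file first, then the v1 file.
    """
    base = Path(root)
    candidates = (
        base / "memory.max",                        # cgroup v2
        base / "memory" / "memory.limit_in_bytes",  # cgroup v1
    )
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue  # file missing or unreadable, try the next layout
        if raw == "max":  # cgroup v2 spelling of "no limit"
            return None
        # Note: cgroup v1 reports "no limit" as a very large integer instead.
        return int(raw)
    return None


def shared_buffers_mb(limit_bytes: int, fraction: float = 0.2) -> int:
    """Derive shared_buffers in MB as a fraction of the limit (fraction is an assumption)."""
    return max(1, int(limit_bytes * fraction) // (1024 * 1024))


if __name__ == "__main__":
    limit = read_memory_limit()
    if limit is None:
        print("no memory limit detected")
    else:
        print(f"shared_buffers = {shared_buffers_mb(limit)}MB")
```

A reader that only knows the v1 path finds nothing on a cgroup v2 node, so whatever fallback it uses can yield a shared_buffers value that does not match the pod's real limit.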
I ran into this too. What helped me solve it was raising the log level to DEBUG. It turned out that the command starting Postgres was trying to allocate too much memory for shared_buffers: https://dba.stackexchange.com/questions/184951/memory-errors-on-startup-in-postgresql-9-6-log-map-hugetlb-failed
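A quick sanity check along these lines can be sketched in Python: compare the shared_buffers value found in the DEBUG logs against the pod's memory limit from the manifest. All function names here are hypothetical. The parsers cover only the common PostgreSQL memory units (B, kB, MB, GB, TB, or a bare number counted in 8kB blocks) and a subset of Kubernetes quantity suffixes, so treat it as a rough aid rather than a full implementation of either format.

```python
import re

# PostgreSQL memory units are multiples of 1024.
_PG_UNITS = {"B": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
# Subset of Kubernetes quantity suffixes: decimal (k, M, G, T) and binary (Ki, Mi, Gi, Ti).
_K8S_UNITS = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
              "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def pg_size_to_bytes(value: str, block_size: int = 8192) -> int:
    """Convert a shared_buffers-style setting such as '128MB' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(B|kB|MB|GB|TB)?\s*", value)
    if match is None:
        raise ValueError(f"unsupported PostgreSQL size: {value!r}")
    number, unit = int(match.group(1)), match.group(2)
    # A bare number for shared_buffers is counted in blocks (8kB by default).
    return number * (_PG_UNITS[unit] if unit else block_size)


def k8s_quantity_to_bytes(value: str) -> int:
    """Convert a Kubernetes memory quantity such as '1.5Gi' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?", value.strip())
    if match is None:
        raise ValueError(f"unsupported Kubernetes quantity: {value!r}")
    return int(float(match.group(1)) * _K8S_UNITS[match.group(2) or ""])


def shared_buffers_fits(shared_buffers: str, memory_limit: str) -> bool:
    """True if shared_buffers is strictly smaller than the container memory limit."""
    return pg_size_to_bytes(shared_buffers) < k8s_quantity_to_bytes(memory_limit)


if __name__ == "__main__":
    # 1.5Gi is the memory limit from the manifest in this issue.
    print(shared_buffers_fits("128MB", "1.5Gi"))
    print(shared_buffers_fits("8GB", "1.5Gi"))
```

Passing this check does not guarantee a clean start, since Postgres needs memory beyond shared_buffers, but a failing check points straight at the kind of misconfiguration described above.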