postgres-operator: Create Stanza Unable to Find Primary Cluster in Openshift 3.11

Describe the bug
When creating a postgres cluster from scratch using 4.6.0, the stanza-create job never completes successfully. Five pods error out before the job stops retrying.

To Reproduce
Create a cluster using the pgo client on OpenShift 3.11, with run-as-root disallowed and fsgroup disabled in pgo.yaml.
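For reference, the relevant operator configuration looks roughly like the fragment below. The `DisableFSGroup` key exists in the v4 pgo.yaml `Cluster` section; the surrounding layout here is a sketch, not a complete file:

```yaml
# Fragment of pgo.yaml (operator configuration); only the setting
# relevant to this report is shown. "true" tells the operator not to
# set securityContext.fsGroup, as required under OpenShift's
# restricted/anyuid SCCs.
Cluster:
  DisableFSGroup: "true"
```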

Expected behavior
The stanza-create job completes.

Screenshots

kubectl logs -f test-cluster5-stanza-create-brdtq 
Mon Feb 15 22:51:51 UTC 2021 INFO: Image mode found: pgbackrest
Mon Feb 15 22:51:51 UTC 2021 INFO: Starting in 'pgbackrest' mode
time="2021-02-15T22:51:51Z" level=info msg="crunchy-pgbackrest starts"
time="2021-02-15T22:51:51Z" level=info msg="debug flag set to %tfalse"
time="2021-02-15T22:51:51Z" level=info msg="backrest stanza-create command requested"
time="2021-02-15T22:51:51Z" level=info msg="command to execute is [pgbackrest stanza-create  --db-host=10.131.5.235 --db-path=/pgdata/test-cluster5]"
time="2021-02-15T22:51:52Z" level=info msg="output=[]"
time="2021-02-15T22:51:52Z" level=info msg="stderr=[WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.131.5.235' terminated unexpectedly [255]\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"
time="2021-02-15T22:51:52Z" level=fatal msg="command terminated with exit code 56"

On backrest-shared-repo /tmp:

bash-4.4$ cat db-stanza-create.log 
-------------------PROCESS START-------------------
2021-02-15 22:51:46.952 P00   INFO: stanza-create command begin 2.31: --exec-id=33-dd4cc144 --log-path=/tmp --pg1-host=10.131.5.235 --pg1-path=/pgdata/test-cluster5 --pg1-port=5432 --pg1-socket-path=/tmp --repo1-path=/backrestrepo/test-cluster5-backrest-shared-repo --stanza=db
2021-02-15 22:51:47.057 P00   WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.131.5.235' terminated unexpectedly [255]
2021-02-15 22:51:47.057 P00  ERROR: [056]: unable to find primary cluster - cannot proceed
2021-02-15 22:51:47.057 P00   INFO: stanza-create command end: aborted with exception [056]

In the postgres logs:

ERROR: [125]: remote-0 process on 'postgres-backrest-shared-repo' terminated unexpectedly [255]
[2021-02-15 22:31:18.132 UTC   272   00000  602ae6df.110 449  0] LOG:  archive command failed with exit code 125
[2021-02-15 22:31:18.132 UTC   272   00000  602ae6df.110 450  0] DETAIL:  The failed archive command was: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-push "pg_wal/000000010000000000000001"
ERROR: [125]: remote-0 process on 'postgres-backrest-shared-repo' terminated unexpectedly [255]
[2021-02-15 22:31:19.263 UTC   272   00000  602ae6df.110 451  0] LOG:  archive command failed with exit code 125
[2021-02-15 22:31:19.263 UTC   272   00000  602ae6df.110 452  0] DETAIL:  The failed archive command was: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-push "pg_wal/000000010000000000000001"
ERROR: [125]: remote-0 process on 'postgres-backrest-shared-repo' terminated unexpectedly [255]
[2021-02-15 22:31:20.400 UTC   272   00000  602ae6df.110 453  0] LOG:  archive command failed with exit code 125
[2021-02-15 22:31:20.400 UTC   272   00000  602ae6df.110 454  0] DETAIL:  The failed archive command was: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-push "pg_wal/000000010000000000000001"
[2021-02-15 22:31:20.400 UTC   272   01000  602ae6df.110 455  0] WARNING:  archiving write-ahead log file "000000010000000000000001" failed too many times, will try again later

Please tell us about your environment:

  • Operating System: centos8
  • Where is this running ( Local, Cloud Provider): local
  • Storage being used (NFS, Hostpath, Gluster, etc): NFS
  • Container Image Tag: centos8-4.6.0 (or centos8-12.5-4.6.0)
  • PostgreSQL Version: 12.5
  • Platform (Docker, Kubernetes, OpenShift): OpenShift
  • Platform Version: 4.6.0

Additional context
It sort of looks like a possible bug with anyuid in OpenShift, but I’m shooting in the dark a bit. I’ve disabled fsgroups in pgo.yaml, but I did see this in the logs of the backrest-shared-repo pod:

Starting the pgBackRest repo
creating  /backrestrepo/test-cluster5-backrest-shared-repo
/usr/local/bin/pgbackrest-repo.sh: line 50: /etc/pgbackrest/pgbackrest.conf: Permission denied
The pgBackRest repo has been started

I tried to follow this for a while and it does look like there might be a bug here. Here’s what I tried on the shared-repo:

bash-4.4$ id
uid=1001680000(pgbackrest) gid=0(root) groups=0(root),1001680000
bash-4.4$ ls -larth /etc/pgbackrest
total 0
drwxrwsrwt. 3 root 1001680000 80 Feb 15 22:37 conf.d
drwxr-xr-x. 1 root root       51 Feb 15 22:37 ..
drwxrwxr-x. 1 2000 postgres   20 Feb 15 22:37 .
bash-4.4$ cat /etc/passwd
...
pgbackrest:x:1001680000:0:pgbackrest user:/:/bin/bash
bash-4.4$ cat /etc/group
...
pgbackrest:x:0:pgbackrest

Then I took a look at what /usr/local/bin/pgbackrest-repo.sh was doing and tried executing it:

bash-4.4$ printf "[%s]\npg1-path=/tmp/pg1path\n" "$PGBACKREST_STANZA" >> /etc/pgbackrest/pgbackrest.conf
bash: /etc/pgbackrest/pgbackrest.conf: Permission denied

It looks like the folder is only writable by the postgres group, which the user isn’t a part of, or by uid 2000, but because of anyuid I’m not that user. I ended up trying a custom Dockerfile that just gave that folder 777, and writes did succeed, but I still hit the problem stated above. Hoping you have some other thoughts on logs to look at. I read a bit on other issues regarding authentication, TLS, or firewalls, but after hitting these issues I stripped this down to a bare-bones deploy from the client, so at first glance it doesn’t appear to be any of those.
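The denial is just the standard Unix permission check: the anyuid-assigned UID matches neither the directory owner (2000) nor the directory's group, so only the world bits apply, and `drwxrwxr-x` has no world write bit. A minimal sketch of that decision, using the identities from the transcript above (the postgres gid of 26 is an assumption):

```shell
# Recreate the kernel's write check for /etc/pgbackrest (drwxrwxr-x = 775,
# owner uid 2000, group postgres) against the anyuid-assigned identity
# uid=1001680000, groups 0 and 1001680000. The gid 26 for postgres is a guess.
mode=775; owner_uid=2000; owner_gid=26
proc_uid=1001680000; proc_groups="0 1001680000"

can_write() {
  # Owner match: use the user permission digit.
  if [ "$proc_uid" -eq "$owner_uid" ]; then
    [ $(( mode / 100 & 2 )) -ne 0 ]; return
  fi
  # Group match: use the group permission digit.
  for g in $proc_groups; do
    if [ "$g" -eq "$owner_gid" ]; then
      [ $(( mode / 10 % 10 & 2 )) -ne 0 ]; return
    fi
  done
  # Neither: only the world digit applies (5 = r-x, no write).
  [ $(( mode % 10 & 2 )) -ne 0 ]
}

can_write && echo "write allowed" || echo "Permission denied"
```

Flipping `owner_gid` to 0 (the root group every anyuid UID belongs to) makes the group digit apply and the write succeed, which is why group-0 ownership is the usual OpenShift convention.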

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (6 by maintainers)

Most upvoted comments

As it’s late here, I’ll provide a brief explanation for now: the change between 4.5.1 and 4.6.0 was to the ownership of the /etc/pgbackrest directory, specifically in the pgBackRest repository container, which was alluded to in the original report. That ownership change caused the reported failure when using anyuid mode in OCP 3.11.

This manifested because of a change in how the pgBackRest repo container is constructed, i.e. it now uses a new container as its base.

My initial suspicion was the storage layer because that tends to be the culprit when similar things are reported; I missed the /etc/pgbackrest detail, which was the clue for this one. Under anyuid mode, the incorrect ownership prevents the configuration file from being generated properly, which has the cascading failure effect you noticed.

The good news: this came in just under the wire for the 4.6.1 cutoff (I was literally about to stamp the release, and decided to take one more look at this), so the fix will be included in the upcoming patch release.

One of my colleagues may also chime in on a workaround until 4.6.1 is out.
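For anyone needing a stopgap before 4.6.1, a custom repo image along the lines of what the reporter tried (but scoped to group 0 rather than 777) might look like the sketch below. The base image name and tag are guesses assembled from the versions in this report, not an official recommendation:

```dockerfile
# Hypothetical interim image; base tag inferred from the reported
# 4.6.0 operator and pgBackRest 2.31 versions - verify before use.
FROM registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest-repo:centos8-2.31-4.6.0
USER root
# Arbitrary anyuid UIDs always carry gid 0, so making the config dir
# group-0 writable is sufficient (and tighter than the 777 experiment).
RUN chgrp 0 /etc/pgbackrest && chmod 0775 /etc/pgbackrest
USER 2000
```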

Anyway, thanks for reporting!