postgres-operator: cloning existing cluster - Failed to bootstrap cluster - Clone failed

Please, answer some short questions which should help us to understand your problem / question better?

  • Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.6.0
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? AWS K8s
  • Are you running Postgres Operator in production? yes
  • Type of issue? Bug report I think

I have WAL backups in S3 and they are working fine: I can see base backups created twice a day, along with all the related WAL files.

I tried to set up a cluster that clones an existing one from the backup by declaring a new Postgres CRD:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: mydb-postgresql-clone
spec:
  clone:
    uid: "633b6c86-a4d8-460d-8a54-fbf4128b74c8"
    cluster: "mydb-postgresql"
    timestamp: "2021-02-08T13:05:00+01:00"

  teamId: "mydb"
  volume:
    size: 2Gi
  numberOfInstances: 2
  postgresql:
    version: "13"

Once I create this in the cluster and the Postgres pods are up, I can see from the logs that it can't find the backups:

2021-02-10 13:19:34,231 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-simplrapi-postgresql" python3 /scripts/clone_with_wale.py --recovery-target-time="2021-02-08T13:05:00+01:00"
2021-02-10 13:19:34,407 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/13/ for clone
INFO: 2021/02/10 13:19:34.544441 No backups found
2021-02-10 13:19:34,547 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/12/ for clone
INFO: 2021/02/10 13:19:34.678884 No backups found
2021-02-10 13:19:34,683 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/11/ for clone
INFO: 2021/02/10 13:19:34.861163 No backups found
2021-02-10 13:19:34,870 INFO: Trying s3://some-s3-name/spilo/simplrapi-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/10/ for clone
INFO: 2021/02/10 13:19:34.970748 No backups found
2021-02-10 13:19:34,976 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/9.6/ for clone
INFO: 2021/02/10 13:19:35.060207 No backups found
2021-02-10 13:19:35,067 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/9.5/ for clone
INFO: 2021/02/10 13:19:35.170388 No backups found
2021-02-10 13:19:35,178 INFO: Trying s3://postgres-wal-develop/mydb/simplrapi-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/ for clone
INFO: 2021/02/10 13:19:35.285631 No backups found
2021-02-10 13:19:35,292 ERROR: Clone failed

And after 5 attempts:

patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/run/service/patroni: finished with code=1 signal=0
/run/service/patroni: exceeded maximum number of restarts 5
stopping /run/service/patroni
timeout: finish: .: (pid 524) 9s, want down

But it should have found backups under s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/13/, as it contains a “wal_005/” folder and a “basebackups_005/” folder with backups taken prior to the given timestamp.

What am I doing wrong?

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 5
  • Comments: 15 (3 by maintainers)

Most upvoted comments

We also encountered a similar problem when creating a cloned cluster.

The documentation says: https://postgres-operator.readthedocs.io/en/latest/user/#how-to-clone-an-existing-postgresql-cluster

If your source cluster uses a WAL location different from the global configuration you can specify the full path under s3_wal_path. For [Google Cloud Platform](https://postgres-operator.readthedocs.io/en/latest/administrator/#google-cloud-platform-setup) or [Azure](https://postgres-operator.readthedocs.io/en/latest/administrator/#azure-setup) it can only be set globally with [custom Pod environment variables](https://postgres-operator.readthedocs.io/en/latest/administrator/#custom-pod-environment-variables) or locally in the Postgres manifest's [env](https://postgres-operator.readthedocs.io/en/latest/administrator/#via-postgres-cluster-manifest) section.
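For the non-default WAL location case that the quoted paragraph refers to, the clone section would presumably just gain an s3_wal_path entry, for example (a sketch only; the bucket and path below are placeholders):

spec:
  clone:
    uid: "efd12e58-5786-11e8-b5a7-06148230260c"
    cluster: "acid-minimal-cluster"
    timestamp: "2017-12-19T12:40:33+01:00"
    # full path to the source cluster's WAL archive (placeholder)
    s3_wal_path: "s3://my-custom-bucket/spilo/acid-minimal-cluster/efd12e58-5786-11e8-b5a7-06148230260c/wal"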

In the default case, however, a section like the following should be quite enough, provided the backups themselves work:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster-clone
spec:
  clone:
    uid: "efd12e58-5786-11e8-b5a7-06148230260c"
    cluster: "acid-minimal-cluster"
    timestamp: "2017-12-19T12:40:33+01:00"

But practice shows that this is not the case: for cloning to work properly, it is also mandatory to pass the CLONE_* variables through custom Pod environment variables: https://postgres-operator.readthedocs.io/en/latest/administrator/#custom-pod-environment-variables
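For reference, a minimal sketch of what passing those variables through a pod environment ConfigMap might look like. This assumes the operator is configured with pod_environment_configmap pointing at this ConfigMap; the CLONE_* key names follow Spilo's conventions and all values (name, namespace, region, prefix) are placeholders, so check them against your own setup and Spilo version:

apiVersion: v1
kind: ConfigMap
metadata:
  # must match the pod_environment_configmap setting in the operator configuration
  name: postgres-pod-config
  namespace: default
data:
  # use the wal-g restore path if the source backups were taken with wal-g
  CLONE_USE_WALG_RESTORE: "true"
  # region of the bucket holding the source cluster's backups
  CLONE_AWS_REGION: "eu-central-1"
  # full prefix of the source cluster's WAL archive (placeholder path)
  CLONE_WALG_S3_PREFIX: "s3://<bucket>/spilo/<source-cluster>/<uid>/wal/<pg-major-version>"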

This raises the question: do I have something configured wrong in the operator or cluster configuration, or am I doing everything right and this requirement is simply not stated explicitly in the documentation?

P.S. With the CLONE_* variables specified, cloning works flawlessly.

P.P.S. If anything I described is unclear, please forgive my English.