postgres-operator: pg_wal filling up and archive files not being deleted and backups failing
pgo 4.5.1
cluster info as follows:
$ pgoc pgo show cluster retroelk-prod1-azure -n pgo
cluster : retroelk-prod1-azure (crunchy-postgres-gis-ha:centos7-11.9-2.5-4.5.0)
pod : retroelk-prod1-azure-9dd6fd444-phhw6 (Running) on prod1-db-1 (2/2) (primary)
pvc: retroelk-prod1-azure (100Gi)
pod : retroelk-prod1-azure-cmyc-55fd469bdc-p6z7g (Running) on prod1-db-0 (2/2) (replica)
pvc: retroelk-prod1-azure-cmyc (100G)
pod : retroelk-prod1-azure-tapu-67bc49c74b-v4n28 (Running) on prod1-db-2 (2/2) (replica)
pvc: retroelk-prod1-azure-tapu (100G)
resources : CPU: 8 Memory: 16Gi
limits : CPU: 15 Memory: 48Gi
deployment : retroelk-prod1-azure
deployment : retroelk-prod1-azure-backrest-shared-repo
deployment : retroelk-prod1-azure-cmyc
deployment : retroelk-prod1-azure-pgbouncer
deployment : retroelk-prod1-azure-tapu
service : retroelk-prod1-azure - ClusterIP (10.105.129.54)
service : retroelk-prod1-azure-pgbouncer - ClusterIP (10.101.227.65)
service : retroelk-prod1-azure-replica - ClusterIP (10.98.138.154)
pgreplica : retroelk-prod1-azure-cmyc
pgreplica : retroelk-prod1-azure-tapu
labels : pgouser=admin service-type=NodePort workflowid=4098ec34-1913-48e8-86a6-b46690c41030 name=retroelk-prod1-azure pg-pod-anti-affinity= pgo-backrest=true crunchy-postgres-exporter=true custom-config=retroelk-custom-config pgo-version=4.5.0 NodeLabelValue=postgres autofail=true crunchy-pgbadger=false crunchy-pgha-scope=retroelk-prod1-azure deployment-name=retroelk-prod1-azure pg-cluster=retroelk-prod1-azure
I believe the archives got backed up because the backrest repo disk was full. I have now cleared space on the backrest repo disk (see issue #2111), but archives are still not being cleared from the primary. When I try to run a backup, both to S3 and locally, I get failures. I am rerunning it with debug turned on and will post the logs here as soon as I have them; it takes a while to fail.
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (4 by maintainers)
@jkatz thanks much for that info. I guess I thought of the operator as including all of the processes that it packages, but I understand now that you think of this as a pgbackrest-specific issue. Regarding my issue, I think I have figured out what is going on and have some helpful recommendations for anyone else hitting it:
So I am assuming that the PostgreSQL WAL archiver looks in this directory (pg_wal/archive_status) for WALs that are ready to archive (again, understanding how things work really helps).
So the archiver on the primary is failing because it can't find WALs that I deleted with pg_archivecleanup. Now I can see how this chain reaction played out. The backrest repo disk filled up. The primary then started filling up because the archiver could not write WALs to the archive on the backrest repo. Because the primary was now also running out of disk space, I cleaned up WALs on it using a tool specifically designed for that, i.e. pg_archivecleanup. But I guess pg_archivecleanup does not clean up the archive_status files, so the archiver is now trying to archive files that are missing.
So next I am planning to delete the status files for the missing WALs in /pgdata/retroelk-prod1-azure/pg_wal/archive_status. I'll update here on whether that works or causes the next disaster 😃
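For anyone wanting to check before deleting anything, here is a minimal sketch of a dry run for the plan above. It assumes the archiver marks pending segments with `.ready` files in `archive_status`, and it only lists the markers whose WAL file is no longer present in `pg_wal`. The function name `list_orphaned_ready` is a hypothetical helper, not part of PostgreSQL, pgBackRest, or the operator.

```shell
#!/bin/sh
# Sketch only: list archive_status/*.ready markers whose WAL file is gone.
# Assumption: the argument is the pg_wal directory of the primary.
# list_orphaned_ready is a hypothetical helper name used for illustration.

list_orphaned_ready() {
    pg_wal_dir=$1
    for marker in "$pg_wal_dir"/archive_status/*.ready; do
        # If the glob matched nothing, the literal pattern comes back; skip it.
        [ -e "$marker" ] || continue
        # Strip the directory and the .ready suffix to get the WAL file name.
        segment=$(basename "$marker" .ready)
        # Print the marker only when the file it refers to is missing.
        [ -e "$pg_wal_dir/$segment" ] || printf '%s\n' "$marker"
    done
}

# Dry run (path taken from the comment above); review the output first:
# list_orphaned_ready /pgdata/retroelk-prod1-azure/pg_wal
```

The function deliberately prints rather than deletes, so the list can be reviewed (and the data directory backed up) before any status file is removed by hand.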