postgres-operator: pg_wal filling up, archive files not being deleted, and backups failing

pgo 4.5.1

cluster info as follows:

$ pgoc pgo show cluster retroelk-prod1-azure -n pgo

cluster : retroelk-prod1-azure (crunchy-postgres-gis-ha:centos7-11.9-2.5-4.5.0)
	pod : retroelk-prod1-azure-9dd6fd444-phhw6 (Running) on prod1-db-1 (2/2) (primary)
		pvc: retroelk-prod1-azure (100Gi)
	pod : retroelk-prod1-azure-cmyc-55fd469bdc-p6z7g (Running) on prod1-db-0 (2/2) (replica)
		pvc: retroelk-prod1-azure-cmyc (100G)
	pod : retroelk-prod1-azure-tapu-67bc49c74b-v4n28 (Running) on prod1-db-2 (2/2) (replica)
		pvc: retroelk-prod1-azure-tapu (100G)
	resources : CPU: 8 Memory: 16Gi
	limits : CPU: 15 Memory: 48Gi
	deployment : retroelk-prod1-azure
	deployment : retroelk-prod1-azure-backrest-shared-repo
	deployment : retroelk-prod1-azure-cmyc
	deployment : retroelk-prod1-azure-pgbouncer
	deployment : retroelk-prod1-azure-tapu
	service : retroelk-prod1-azure - ClusterIP (10.105.129.54)
	service : retroelk-prod1-azure-pgbouncer - ClusterIP (10.101.227.65)
	service : retroelk-prod1-azure-replica - ClusterIP (10.98.138.154)
	pgreplica : retroelk-prod1-azure-cmyc
	pgreplica : retroelk-prod1-azure-tapu
	labels : pgouser=admin service-type=NodePort workflowid=4098ec34-1913-48e8-86a6-b46690c41030 name=retroelk-prod1-azure pg-pod-anti-affinity= pgo-backrest=true crunchy-postgres-exporter=true custom-config=retroelk-custom-config pgo-version=4.5.0 NodeLabelValue=postgres autofail=true crunchy-pgbadger=false crunchy-pgha-scope=retroelk-prod1-azure deployment-name=retroelk-prod1-azure pg-cluster=retroelk-prod1-azure

I believe the archives got backed up because the backrest repo disk was full. I have now cleared space on the backrest repo disk (see issue #2111), but archives are still not being cleared from the primary. When I try to run a backup, both to S3 and locally, it fails. I am rerunning it with debug turned on and will post the logs here as soon as I have them; it takes a while to fail.

About this issue

State: closed
Created 4 years ago
Comments: 15 (4 by maintainers)

Most upvoted comments

@jkatz thanks very much for that info. I had thought of the operator as including all of the processes that it packages, but I understand now that you consider this a pgbackrest-specific issue. As for my issue, I think I have figured out what is going on, and I have some recommendations for anyone else hitting it:

The archives are actually pushed by the pg primary, not pulled by backrest as I mistakenly thought before (it really helps to have a global understanding of how everything works together, so bear with me and other non-DBAs as we figure this out 😃). I don’t understand why I am not seeing anything in the pg log (is this a bug?), but I figured out that the archive process is failing by exec’ing into the primary, listing the processes, and grepping for archive as follows:

bash-4.2$ ps aux | grep archive
postgres  4020  0.0  0.0 157804  5828 ?        Ss   Nov29   2:03 postgres: retroelk-prod1-azure: archiver   failed on 00000010000016A00000008B
postgres 29691  0.0  0.0  12548  2168 pts/13   S+   16:30   0:00 grep archive
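
The failing segment can also be pulled out of the process list mechanically. The following is a minimal sketch, assuming the archiver's process title has the same "archiver ... failed on <segment>" shape as the output above; the function name archiver_failed_segment is hypothetical and not part of PostgreSQL or the operator.

```shell
# Print the WAL segment named in an "archiver ... failed on <segment>" process title.
# Reads `ps` output on stdin; prints nothing if no archiver failure is shown.
# Assumption: the title format matches the `ps aux` output shown above.
archiver_failed_segment() {
    # A WAL segment file name is 24 uppercase hex characters.
    sed -n 's/.*archiver.*failed on \([0-9A-F]\{24\}\).*/\1/p'
}

# Example, run inside the primary container:
#   ps aux | archiver_failed_segment
```

PostgreSQL's pg_stat_archiver view reports similar information (for example, its last_failed_wal column), which may be easier to query than the process list.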

Looking through the pg_wal directory on my pg 11 server, I don’t have that WAL file, but the /pgdata/retroelk-prod1-azure/pg_wal/archive_status directory contains a reference to it and to many other files that are not in pg_wal. All of these files end with .ready, e.g.

0000000F.history.done
00000010000016A00000008B.ready
00000010000016A00000008C.ready
00000010000016A00000008D.ready
etc...

So I am assuming that the pg WAL archiver looks in this directory for WALs that are ready to archive (again, understanding how things work really helps).

So the archiver on the primary is failing because it can’t find WALs that I deleted with pg_archivecleanup. The chain reaction played out like this: the backrest disk filled up. The primary then started filling up because the archiver failed to write WALs to the archive on the backrest repo. Because the primary was also running out of disk space, I cleaned up WALs on it using a tool specifically designed for that, i.e. pg_archivecleanup. I guess pg_archivecleanup does not clean up the archive_status files, so the archiver is now trying to archive files that are missing.
So next I am planning to delete the archive_status entries for the missing WAL files in /pgdata/retroelk-prod1-azure/pg_wal/archive_status. I’ll update here on whether that works or causes the next disaster 😃
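
Before deleting anything, it may help to list exactly which .ready markers point at WAL files that no longer exist. The following is a minimal, read-only sketch; the function name list_orphaned_ready is hypothetical, and it assumes each marker is named after the file it refers to in pg_wal, as the listing above suggests.

```shell
# List .ready markers whose WAL file is missing from pg_wal.
# Read-only: nothing is deleted or modified.
# Assumption: archive_status/<name>.ready refers to pg_wal/<name>.
list_orphaned_ready() {
    wal_dir="$1"
    for marker in "$wal_dir"/archive_status/*.ready; do
        # If the glob matched nothing, the literal pattern comes back; skip it.
        [ -e "$marker" ] || continue
        name="$(basename "$marker" .ready)"
        if [ ! -e "$wal_dir/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Example, using the path from this comment:
#   list_orphaned_ready /pgdata/retroelk-prod1-azure/pg_wal
```

Reviewing this list first, and confirming that a fresh backup is possible afterwards, seems safer than removing everything in archive_status in one step.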

alrooney on Dec 11, 2020