patroni: Recovery error when using custom bootstrap and recovery_target_action: pause

Describe the bug In issue 1891 a change was made that breaks patroni behavior when restoring via custom bootstrap.

Expected behavior In postgres-Mon.log

recovery stopping before commit of transaction 1858, time 2023-02-22 12:51:26.046508+03

This is the point in time I need to recover. I go to the database and see if the necessary data has been restored, if not, stop patroni, change recovery_target_time and start patroni. And I can repeat this several times, moving the restore point forward without having to restore the entire cluster all over again. After reaching the desired timestamp, I simply execute the psql command SELECT pg_xlog_replay_resume() and postgres exits recovery mode and becomes the primary node.

If I rename data.failed -> data and start the patroni service, then postgres starts up, but immediately does promote, which I don’t need, because Otherwise, I can not do the recovery procedure described above.

To Reproduce Environment

  • Patroni version: 2.0.2+
  • PostgreSQL version: 9.6+
  • DCS (and its version): etcd 3.5.1
  • postgres backup util: wal-g 2.0.1

Patroni configuration file

bootstrap:
  method: wal-g
  wal-g:
    command: /usr/local/bin/wal_g_bootstrap.sh
    keep_existing_recovery_conf: False
    recovery_conf:
      recovery_target_time: '2023-02-22 12:51:26'
      recovery_target_action: pause
      recovery_target_timeline: 4
      restore_command: '/usr/local/bin/wal-g wal-fetch %f %p'
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 30
    synchronous_mode: false
    synchronous_mode_strict: false
    postgresql:
      use_pg_rewind: true
      parameters:
        archive_mode: 'on'
        archive_command: '/usr/local/bin/wal-g wal-push %p'

bootstrap script wal_g_bootstrap.sh

#!/usr/bin/env bash

while getopts ":-:" optchar; do
  [[ "${optchar}" == "-" ]] || continue
  case "${OPTARG}" in
    datadir=* )
        DATA_DIR=${OPTARG#*=}
        ;;
    scope=* )
        SCOPE=${OPTARG#*=}
        ;;
  esac
done
/usr/local/bin/wal-g backup-fetch $DATA_DIR base_000000040000000000000017

patroni.log

2023-02-27 09:59:34,412 INFO: Selected new etcd server http://127.0.0.1:2379
2023-02-27 09:59:34,430 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-02-27 09:59:34,484 INFO: Lock owner: None; I am cluster_name
2023-02-27 09:59:34,593 INFO: trying to bootstrap a new cluster
2023-02-27 09:59:34,593 INFO: Running custom bootstrap script: /usr/local/bin/wal_g_bootstrap.sh
2023-02-27 09:59:44,434 INFO: Lock owner: None; I am cluster_name
2023-02-27 09:59:44,434 INFO: not healthy enough for leader race
2023-02-27 09:59:44,735 INFO: bootstrap in progress
2023-02-27 09:59:51,433 INFO: postmaster pid=29043
2023-02-27 09:59:54,434 INFO: Lock owner: None; I am cluster_name
2023-02-27 09:59:54,434 INFO: not healthy enough for leader race
2023-02-27 09:59:54,714 INFO: bootstrap in progress
2023-02-27 09:59:56,569 INFO: removing initialize key after failed attempt to bootstrap the cluster
2023-02-27 09:59:56,581 INFO: renaming data directory to /var/lib/pgsql/9.6/data.failed

syslog

INFO: 2023/02/27 09:59:34.732530 Selecting the backup with name base_000000040000000000000017...
INFO: 2023/02/27 09:59:47.686658 Finished extraction of part_002.tar.br
INFO: 2023/02/27 09:59:47.829682 Finished extraction of part_004.tar.br
INFO: 2023/02/27 09:59:50.840466 Finished extraction of part_001.tar.br
INFO: 2023/02/27 09:59:50.985037 Finished extraction of pg_control.tar.br
INFO: 2023/02/27 09:59:50.985111 
Backup extraction complete.
2023-02-27 09:59:51 MSK [29043]: [1-1] 63fc54e7.7173 0     LOG:  redirecting log output to logging collector process
2023-02-27 09:59:51 MSK [29043]: [2-1] 63fc54e7.7173 0     HINT:  Future log output will appear in directory "/var/lib/pgsql/9.6/logs".
localhost:5432 - rejecting connections
localhost:5432 - rejecting connections
localhost:5432 - rejecting connections
localhost:5432 - rejecting connections
localhost:5432 - rejecting connections
localhost:5432 - rejecting connections
localhost:5432 - no response
Traceback (most recent call last):
  File "/usr/bin/patroni", line 11, in <module>
    load_entry_point('patroni==3.0.0', 'console_scripts', 'patroni')()
  File "/usr/lib/python3.6/site-packages/patroni/__main__.py", line 144, in main
    return patroni_main()
  File "/usr/lib/python3.6/site-packages/patroni/__main__.py", line 136, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/lib/python3.6/site-packages/patroni/daemon.py", line 108, in abstract_main
    controller.run()
  File "/usr/lib/python3.6/site-packages/patroni/__main__.py", line 106, in run
    super(Patroni, self).run()
  File "/usr/lib/python3.6/site-packages/patroni/daemon.py", line 65, in run
    self._run_cycle()
  File "/usr/lib/python3.6/site-packages/patroni/__main__.py", line 109, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/lib/python3.6/site-packages/patroni/ha.py", line 1771, in run_cycle
    info = self._run_cycle()
  File "/usr/lib/python3.6/site-packages/patroni/ha.py", line 1593, in _run_cycle
    return self.post_bootstrap()
  File "/usr/lib/python3.6/site-packages/patroni/ha.py", line 1484, in post_bootstrap
    self.cancel_initialization()
  File "/usr/lib/python3.6/site-packages/patroni/ha.py", line 1477, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'

postgresql-Mon.log

2023-02-27 09:59:51 MSK [29046]: [1-1] 63fc54e7.7176 0     LOG:  database system was interrupted; last known up at 2023-02-22 12:33:23 MSK
2023-02-27 09:59:51 MSK [29046]: [2-1] 63fc54e7.7176 0     LOG:  creating missing WAL directory "pg_xlog/archive_status"
2023-02-27 09:59:51 MSK [29047]: [1-1] 63fc54e7.7177 0 [unknown] [unknown] [unknown] 127.0.0.1 LOG:  connection received: host=127.0.0.1 port=45240
2023-02-27 09:59:51 MSK [29047]: [2-1] 63fc54e7.7177 0 postgres postgres [unknown] 127.0.0.1 FATAL:  the database system is starting up
2023-02-27 09:59:51 MSK [29049]: [1-1] 63fc54e7.7179 0 [unknown] [unknown] [unknown] 127.0.0.1 LOG:  connection received: host=127.0.0.1 port=45242
2023-02-27 09:59:51 MSK [29049]: [2-1] 63fc54e7.7179 0 postgres postgres [unknown] 127.0.0.1 FATAL:  the database system is starting up
2023-02-27 09:59:51 MSK [29046]: [3-1] 63fc54e7.7176 0     LOG:  restored log file "00000004.history" from archive
2023-02-27 09:59:51 MSK [29046]: [4-1] 63fc54e7.7176 0     LOG:  starting point-in-time recovery to 2023-02-22 12:51:26+03
2023-02-27 09:59:51 MSK [29046]: [5-1] 63fc54e7.7176 0     LOG:  restored log file "00000004.history" from archive
2023-02-27 09:59:52 MSK [29046]: [6-1] 63fc54e7.7176 0     LOG:  restored log file "000000040000000000000017" from archive
INFO: 2023/02/27 09:59:52.101640 WAL-prefetch file:  000000040000000000000018
INFO: 2023/02/27 09:59:52.112315 WAL-prefetch file:  000000040000000000000019
2023-02-27 09:59:52 MSK [29046]: [7-1] 63fc54e7.7176 0     LOG:  redo starts at 0/17000028
2023-02-27 09:59:52 MSK [29046]: [8-1] 63fc54e7.7176 0     LOG:  consistent recovery state reached at 0/170000F8
2023-02-27 09:59:52 MSK [29046]: [9-1] 63fc54e7.7176 0     LOG:  restored log file "000000040000000000000018" from archive
INFO: 2023/02/27 09:59:52.434774 WAL-prefetch file:  00000004000000000000001A
2023-02-27 09:59:52 MSK [29105]: [1-1] 63fc54e8.71b1 0 [unknown] [unknown] [unknown] 127.0.0.1 LOG:  connection received: host=127.0.0.1 port=45260
2023-02-27 09:59:52 MSK [29105]: [2-1] 63fc54e8.71b1 0 postgres postgres [unknown] 127.0.0.1 FATAL:  the database system is starting up
2023-02-27 09:59:52 MSK [29046]: [10-1] 63fc54e7.7176 0     LOG:  restored log file "000000040000000000000019" from archive
INFO: 2023/02/27 09:59:52.854834 WAL-prefetch file:  00000004000000000000001B
2023-02-27 09:59:53 MSK [29046]: [11-1] 63fc54e7.7176 0     LOG:  restored log file "00000004000000000000001A" from archive
INFO: 2023/02/27 09:59:53.169198 WAL-prefetch file:  00000004000000000000001C
2023-02-27 09:59:53 MSK [29142]: [1-1] 63fc54e9.71d6 0 [unknown] [unknown] [unknown] 127.0.0.1 LOG:  connection received: host=127.0.0.1 port=45268
2023-02-27 09:59:53 MSK [29142]: [2-1] 63fc54e9.71d6 0 postgres postgres [unknown] 127.0.0.1 FATAL:  the database system is starting up
2023-02-27 09:59:53 MSK [29046]: [12-1] 63fc54e7.7176 0     LOG:  restored log file "00000004000000000000001B" from archive
INFO: 2023/02/27 09:59:53.722749 WAL-prefetch file:  00000004000000000000001D
2023-02-27 09:59:54 MSK [29161]: [1-1] 63fc54ea.71e9 0 [unknown] [unknown] [unknown] 127.0.0.1 LOG:  connection received: host=127.0.0.1 port=45274
2023-02-27 09:59:54 MSK [29161]: [2-1] 63fc54ea.71e9 0 postgres postgres [unknown] 127.0.0.1 FATAL:  the database system is starting up
2023-02-27 09:59:54 MSK [29046]: [13-1] 63fc54e7.7176 0     LOG:  recovery stopping before commit of transaction 1858, time 2023-02-22 12:51:26.046508+03
2023-02-27 09:59:54 MSK [29043]: [3-1] 63fc54e7.7173 0     LOG:  shutdown at recovery target
2023-02-27 09:59:54 MSK [29085]: [1-1] 63fc54e8.719d 0     LOG:  shutting down
2023-02-27 09:59:54 MSK [29085]: [2-1] 63fc54e8.719d 0     LOG:  restartpoint starting: shutdown immediate
2023-02-27 09:59:55 MSK [29163]: [1-1] 63fc54eb.71eb 0 [unknown] [unknown] [unknown] 127.0.0.1 LOG:  connection received: host=127.0.0.1 port=45278
2023-02-27 09:59:55 MSK [29163]: [2-1] 63fc54eb.71eb 0 postgres postgres [unknown] 127.0.0.1 FATAL:  the database system is shutting down
2023-02-27 09:59:56 MSK [29085]: [31] 63fc54e8.719d 0     LOG:  restartpoint complete: wrote 9622 buffers (7.6%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.158 s, sync=0.972 s, total=1.165 s; sync files=37, longest=0.970 s, average=0.026 s; distance=74631 kB, estimate=74631 kB
2023-02-27 09:59:56 MSK [29085]: [4-1] 63fc54e8.719d 0     LOG:  recovery restart point at 0/1B8E1D18
2023-02-27 09:59:56 MSK [29085]: [5-1] 63fc54e8.719d 0     DETAIL:  last completed transaction was at log time 2023-02-22 12:33:41.56509+03
2023-02-27 09:59:56 MSK [29043]: [4-1] 63fc54e7.7173 0     LOG:  database system is shut down

After my changes everything works as expected.

diff --git a/patroni/postgresql/config.py b/patroni/postgresql/config.py
index 27fd4cf..a93ff8e 100644
--- a/patroni/postgresql/config.py
+++ b/patroni/postgresql/config.py
@@ -1098,7 +1098,7 @@ class ConfigHandler(object):
         # like max_connections are increased, therefore we disable hot_standby.
         if self._postgresql.bootstrap.running_custom_bootstrap and \
                 (self._postgresql.bootstrap.keep_existing_recovery_conf or self._recovery_conf):
-            effective_configuration['hot_standby'] = 'off'
+            effective_configuration['hot_standby'] = 'on'
             self._postgresql.set_pending_restart(True)
 
         return effective_configuration
**syslog**
Backup extraction complete.
localhost:5432 - no response
2023-02-27 10:31:47 MSK [29946]: [1-1] 63fc5c62.74fa 0     LOG:  redirecting log output to logging collector process
2023-02-27 10:31:47 MSK [29946]: [2-1] 63fc5c62.74fa 0     HINT:  Future log output will appear in directory "/var/lib/pgsql/9.6/logs".
localhost:5432 - rejecting connections
localhost:5432 - rejecting connections
localhost:5432 - accepting connections

postgresql-Mon.log

2023-02-27 10:31:50 MSK [29949]: [13-1] 63fc5c63.74fd 0     LOG:  recovery stopping before commit of transaction 1858, time 2023-02-22 12:51:26.046508+03
2023-02-27 10:31:50 MSK [29949]: [14-1] 63fc5c63.74fd 0     LOG:  recovery has paused
2023-02-27 10:31:50 MSK [29949]: [15-1] 63fc5c63.74fd 0     HINT:  Execute pg_xlog_replay_resume() to continue.

After bash psql -c 'SELECT pg_xlog_replay_resume()'

patroni.log

2023-02-27 10:34:08,983 INFO: waiting for end of recovery after bootstrap
2023-02-27 10:34:19,084 INFO: running post_bootstrap
2023-02-27 10:34:19,891 INFO: closed patroni connection to the postgresql cluster
2023-02-27 10:34:20,315 INFO: postmaster pid=30290
2023-02-27 10:34:21,415 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2023-02-27 10:34:21,415 INFO: establishing a new patroni connection to the postgres cluster
2023-02-27 10:34:21,665 INFO: initialized a new cluster
2023-02-27 10:34:21,824 INFO: no action. I am (cluster_name), the leader with the lock

syslog

2023-02-27 10:34:20 MSK [30290]: [1-1] 63fc5cfc.7652 0     LOG:  redirecting log output to logging collector process
2023-02-27 10:34:20 MSK [30290]: [2-1] 63fc5cfc.7652 0     HINT:  Future log output will appear in directory "/var/lib/pgsql/9.6/logs".
localhost:5432 - rejecting connections
localhost:5432 - rejecting connections
localhost:5432 - accepting connections

postgresql-Mon.log

2023-02-27 10:34:09 MSK [29949]: [16-1] 63fc5c63.74fd 0     LOG:  redo done at 0/1B8E3748
2023-02-27 10:34:09 MSK [29949]: [17-1] 63fc5c63.74fd 0     LOG:  last completed transaction was at log time 2023-02-22 12:33:41.56509+03
2023-02-27 10:34:10 MSK [29949]: [18-1] 63fc5c63.74fd 0     LOG:  restored log file "00000005.history" from archive
ERROR: 2023/02/27 10:34:10.334171 Archive '00000006.history' does not exist.
2023-02-27 10:34:10 MSK [29949]: [19-1] 63fc5c63.74fd 0     LOG:  selected new timeline ID: 6
2023-02-27 10:34:10 MSK [29949]: [20-1] 63fc5c63.74fd 0     LOG:  restored log file "00000004.history" from archive
2023-02-27 10:34:10 MSK [29949]: [21-1] 63fc5c63.74fd 0     LOG:  archive recovery complete
2023-02-27 10:34:10 MSK [29989]: [1-1] 63fc5c64.7525 0     LOG:  checkpoint starting: end-of-recovery immediate wait
2023-02-27 10:34:11 MSK [29989]: [2-1] 63fc5c64.7525 0     LOG:  checkpoint complete: wrote 9622 buffers (7.6%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.201 s, sync=0.419 s, total=0.676 s; sync files=37, longest=0
.416 s, average=0.011 s; distance=74637 kB, estimate=74637 kB
2023-02-27 10:34:11 MSK [29949]: [22-1] 63fc5c63.74fd 0     LOG:  MultiXact member wraparound protections are now enabled
2023-02-27 10:34:11 MSK [30280]: [1-1] 63fc5cf3.7648 0     LOG:  autovacuum launcher started
2023-02-27 10:34:11 MSK [29946]: [4-1] 63fc5c62.74fa 0     LOG:  database system is ready to accept connections

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 3
  • Comments: 16

Commits related to this issue

Most upvoted comments

What could we do - set hot_standby to off only if recovery_target_action = 'promote'

If hot_standby is not enabled, a setting of pause will act the same as shutdown.

And as I already explained, hot_standby is set to off for a good reason