patroni: Patroni counts incorrect lag during auto-failover

Describe the bug

I'm running a Patroni cluster with 3 nodes. Everything works well when the Postgres instance is in non-archive mode: when I shut down the leader machine, auto-failover happens as expected and a new leader is elected successfully.

When I set up Postgres in archive mode with the options below, things changed:

archive_mode = always
archive_command = 'test ! -f /home/service/var/postgresql/archived_log/%f && cp %p /home/service/var/postgresql/archived_log/%f'
archive_timeout = 0

In archive mode, when I shut down the leader machine (init 0 / reboot), auto-failover cannot elect a new leader.

The scenario is as follows.

---- Before shutting down the leader machine, everything works well, no lag:

[postgres@localhost ~]$ patronictl list

+ Cluster: pgsql (6933028654481369510) --+---------+----+-----------+
| Member | Host                | Role    | State   | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
| pg01   | 192.168.56.94:54322 | Leader  | running | 6  |           |
| pg02   | 192.168.56.95:54322 | Replica | running | 6  | 0         |
| pg03   | 192.168.56.96:54322 | Replica | running | 6  | 0         |
+--------+---------------------+---------+---------+----+-----------+
[postgres@localhost ~]$

-- Check Postgres LSN and replication lag on master pg01:

postgres=# select pg_current_wal_lsn();
-[ RECORD 1 ]------+-----------
pg_current_wal_lsn | 1/60000268

postgres=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid              | 1677
usesysid         | 16384
usename          | repl
application_name | pg02
client_addr      | 192.168.56.95
client_hostname  |
client_port      | 56686
backend_start    | 2021-02-25 11:50:52.927098+08
backend_xmin     |
state            | streaming
sent_lsn         | 1/60000268
write_lsn        | 1/60000268
flush_lsn        | 1/60000268
replay_lsn       | 1/60000268
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2021-02-25 11:52:11.98201+08
-[ RECORD 2 ]----+------------------------------
pid              | 1706
usesysid         | 16384
usename          | repl
application_name | pg03
client_addr      | 192.168.56.96
client_hostname  |
client_port      | 35944
backend_start    | 2021-02-25 11:50:53.446353+08
backend_xmin     |
state            | streaming
sent_lsn         | 1/60000268
write_lsn        | 1/60000268
flush_lsn        | 1/60000268
replay_lsn       | 1/60000268
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2021-02-25 11:52:12.446186+08

postgres=#

So before the shutdown, both patronictl and Postgres showed no lag. The database was static, with no data changes at all.

---------------------------- after shutting down the pg01 machine ----------------------------

[postgres@localhost postgresql]$ patronictl list

+ Cluster: pgsql (6933028654481369510) --+---------+----+-----------+
| Member | Host                | Role    | State   | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
| pg02   | 192.168.56.95:54322 | Replica | running | 6  | 512       |
| pg03   | 192.168.56.96:54322 | Replica | running | 6  | 512       |
+--------+---------------------+---------+---------+----+-----------+
[postgres@localhost postgresql]$

-- pg02 Patroni log

2021-02-25 11:52:28,429 INFO: no action. i am a secondary and i am following a leader
2021-02-25 11:52:33,414 INFO: Lock owner: pg01; I am pg02
2021-02-25 11:52:33,414 INFO: does not have lock
2021-02-25 11:52:33,420 INFO: no action. i am a secondary and i am following a leader
2021-02-25 11:52:38,420 INFO: Lock owner: pg01; I am pg02
2021-02-25 11:52:38,420 INFO: does not have lock
2021-02-25 11:52:38,425 INFO: no action. i am a secondary and i am following a leader
2021-02-25 11:52:43,681 INFO: My wal position exceeds maximum replication lag
2021-02-25 11:52:43,698 INFO: following a different leader because i am not the healthiest node
2021-02-25 11:52:48,680 INFO: Lock owner: None; I am pg02
2021-02-25 11:52:48,680 INFO: not healthy enough for leader race
2021-02-25 11:52:48,684 INFO: changing primary_conninfo and restarting in progress
2021-02-25 11:52:48,740 INFO: closed patroni connection to the postgresql cluster
2021-02-25 11:52:48,934 INFO: postmaster pid=14570
2021-02-25 11:52:48 CST::@:[14570]: LOG: starting PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit
2021-02-25 11:52:48 CST::@:[14570]: LOG: listening on IPv4 address "0.0.0.0", port 54322
2021-02-25 11:52:48 CST::@:[14570]: LOG: listening on Unix socket "/tmp/.s.PGSQL.54322"
2021-02-25 11:52:48 CST::@:[14570]: LOG: redirecting log output to logging collector process
2021-02-25 11:52:48 CST::@:[14570]: HINT: Future log output will appear in directory "log".
localhost:54322 - rejecting connections
localhost:54322 - rejecting connections
localhost:54322 - accepting connections
2021-02-25 11:52:50,025 INFO: establishing a new patroni connection to the postgres cluster
2021-02-25 11:52:50,031 INFO: My wal position exceeds maximum replication lag
2021-02-25 11:52:50,044 INFO: following a different leader because i am not the healthiest node
2021-02-25 11:52:55,022 INFO: My wal position exceeds maximum replication lag
2021-02-25 11:52:55,029 INFO: following a different leader because i am not the healthiest node
2021-02-25 11:53:00,022 INFO: My wal position exceeds maximum replication lag
2021-02-25 11:53:00,031 INFO: following a different leader because i am not the healthiest node
2021-02-25 11:53:05,023 INFO: My wal position exceeds maximum replication lag
2021-02-25 11:53:05,032 INFO: following a different leader because i am not the healthiest node
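The "My wal position exceeds maximum replication lag" messages line up with the configuration below: maximum_lag_on_failover is 1048576 bytes (1 MiB), while patronictl reports 512 MB of lag. As a rough, hand-run illustration of that comparison (not Patroni's actual code path; the LSN 1/80000000 is a placeholder for the leader's last position published in the DCS), the lag is essentially the byte difference between the leader's last known LSN and the replica's own replay position:

# Run on a replica; '1/80000000' is a placeholder for the leader's last LSN from the DCS.
psql -h 127.0.0.1 -p 54322 -U postgres -Atc \
  "SELECT pg_wal_lsn_diff('1/80000000', pg_last_wal_replay_lsn()) AS lag_bytes;"
# A node skips the leader race when lag_bytes exceeds maximum_lag_on_failover (1048576 here).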

-- pg02 Postgres log

2021-02-25 11:52:43.712 CST,14520,60371e9c.38b8,4,2021-02-25 11:50:52 CST,0,LOG,00000,"aborting any active transactions",""
2021-02-25 11:52:43.713 CST,"postgres","postgres",14537,"127.0.0.1:58466",60371e9d.38c9,3,"idle",2021-02-25 11:50:53 CST,3/0,0,FATAL,57P01,"terminating connection due to administrator command","Patroni"
2021-02-25 11:52:43.713 CST,"postgres","postgres",14537,"127.0.0.1:58466",60371e9d.38c9,4,"idle",2021-02-25 11:50:53 CST,0,LOG,00000,"disconnection: session time: 0:01:49.786 user=postgres database=postgres host=127.0.0.1 port=58466","Patroni"
2021-02-25 11:52:43.713 CST,14531,60371e9c.38c3,2,2021-02-25 11:50:52 CST,2/0,0,LOG,00000,"bgworker pgsentinel signal: processed SIGTERM",""
2021-02-25 11:52:48.719 CST,14527,60371e9c.38bf,4,2021-02-25 11:50:52 CST,0,LOG,00000,"shutting down",""
2021-02-25 11:52:48.735 CST,14520,60371e9c.38b8,5,2021-02-25 11:50:52 CST,0,LOG,00000,"database system is shut down",""
2021-02-25 11:52:48.982 CST,14570,60371f10.38ea,1,2021-02-25 11:52:48 CST,0,LOG,00000,"ending log output to stderr","Future log output will go to log destination ""csvlog"".",""
2021-02-25 11:52:48.984 CST,14573,60371f10.38ed,1,2021-02-25 11:52:48 CST,0,LOG,00000,"database system was shut down in recovery at 2021-02-25 11:52:48 CST",""
2021-02-25 11:52:48.985 CST,14574,"127.0.0.1:58478",60371f10.38ee,1,"",2021-02-25 11:52:48 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58478",""
2021-02-25 11:52:48.985 CST,"postgres","postgres",14574,"127.0.0.1:58478",60371f10.38ee,2,"",2021-02-25 11:52:48 CST,0,FATAL,57P03,"the database system is starting up",""
2021-02-25 11:52:48.987 CST,14573,60371f10.38ed,2,2021-02-25 11:52:48 CST,0,WARNING,01000,"specified neither primary_conninfo nor restore_command","The database server will regularly poll the pg_wal subdirectory to check for files placed there.",""
2021-02-25 11:52:48.987 CST,14573,60371f10.38ed,3,2021-02-25 11:52:48 CST,0,LOG,00000,"entering standby mode",""
2021-02-25 11:52:48.992 CST,14576,"127.0.0.1:58482",60371f10.38f0,1,"",2021-02-25 11:52:48 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58482",""
2021-02-25 11:52:48.992 CST,"postgres","postgres",14576,"127.0.0.1:58482",60371f10.38f0,2,"",2021-02-25 11:52:48 CST,0,FATAL,57P03,"the database system is starting up",""
2021-02-25 11:52:48.995 CST,14573,60371f10.38ed,4,2021-02-25 11:52:48 CST,1/0,0,LOG,00000,"redo starts at 1/60000180",""
2021-02-25 11:52:48.995 CST,14573,60371f10.38ed,5,2021-02-25 11:52:48 CST,1/0,0,LOG,00000,"consistent recovery state reached at 1/60000268",""
2021-02-25 11:52:48.995 CST,14573,60371f10.38ed,6,2021-02-25 11:52:48 CST,1/0,0,LOG,00000,"invalid record length at 1/60000268: wanted 24, got 0",""
2021-02-25 11:52:48.996 CST,14570,60371f10.38ea,2,2021-02-25 11:52:48 CST,0,LOG,00000,"database system is ready to accept read only connections",""
2021-02-25 11:52:48.998 CST,14581,60371f10.38f5,1,2021-02-25 11:52:48 CST,0,LOG,00000,"starting bgworker pgsentinel",""
2021-02-25 11:52:50.011 CST,14583,"127.0.0.1:58486",60371f12.38f7,1,"",2021-02-25 11:52:50 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58486",""
2021-02-25 11:52:50.013 CST,"postgres","postgres",14583,"127.0.0.1:58486",60371f12.38f7,2,"authentication",2021-02-25 11:52:50 CST,3/1,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=pg_isready",""
2021-02-25 11:52:50.021 CST,"postgres","postgres",14583,"127.0.0.1:58486",60371f12.38f7,3,"idle",2021-02-25 11:52:50 CST,0,LOG,00000,"disconnection: session time: 0:00:00.010 user=postgres database=postgres host=127.0.0.1 port=58486","pg_isready"
2021-02-25 11:52:50.026 CST,14584,"127.0.0.1:58490",60371f12.38f8,1,"",2021-02-25 11:52:50 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58490",""
2021-02-25 11:52:50.028 CST,"postgres","postgres",14584,"127.0.0.1:58490",60371f12.38f8,2,"authentication",2021-02-25 11:52:50 CST,3/2,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=Patroni",""
2021-02-25 11:52:50.038 CST,14585,"127.0.0.1:58494",60371f12.38f9,1,"",2021-02-25 11:52:50 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58494",""
2021-02-25 11:52:50.039 CST,"repl","",14585,"127.0.0.1:58494",60371f12.38f9,2,"authentication",2021-02-25 11:52:50 CST,4/1,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:52:50.040 CST,"repl","",14585,"127.0.0.1:58494",60371f12.38f9,3,"idle",2021-02-25 11:52:50 CST,0,LOG,00000,"disconnection: session time: 0:00:00.002 user=repl database= host=127.0.0.1 port=58494",""
2021-02-25 11:52:55.024 CST,14586,"127.0.0.1:58498",60371f17.38fa,1,"",2021-02-25 11:52:55 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58498",""
2021-02-25 11:52:55.025 CST,"repl","",14586,"127.0.0.1:58498",60371f17.38fa,2,"authentication",2021-02-25 11:52:55 CST,4/2,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:52:55.026 CST,"repl","",14586,"127.0.0.1:58498",60371f17.38fa,3,"idle",2021-02-25 11:52:55 CST,0,LOG,00000,"disconnection: session time: 0:00:00.001 user=repl database= host=127.0.0.1 port=58498",""
2021-02-25 11:53:00.026 CST,14587,"127.0.0.1:58502",60371f1c.38fb,1,"",2021-02-25 11:53:00 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58502",""
2021-02-25 11:53:00.026 CST,"repl","",14587,"127.0.0.1:58502",60371f1c.38fb,2,"authentication",2021-02-25 11:53:00 CST,4/3,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:53:00.028 CST,"repl","",14587,"127.0.0.1:58502",60371f1c.38fb,3,"idle",2021-02-25 11:53:00 CST,0,LOG,00000,"disconnection: session time: 0:00:00.002 user=repl database= host=127.0.0.1 port=58502",""
2021-02-25 11:53:05.025 CST,14588,"127.0.0.1:58506",60371f21.38fc,1,"",2021-02-25 11:53:05 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58506",""
2021-02-25 11:53:05.026 CST,"repl","",14588,"127.0.0.1:58506",60371f21.38fc,2,"authentication",2021-02-25 11:53:05 CST,4/4,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:53:05.027 CST,"repl","",14588,"127.0.0.1:58506",60371f21.38fc,3,"idle",2021-02-25 11:53:05 CST,0,LOG,00000,"disconnection: session time: 0:00:00.002 user=repl database= host=127.0.0.1 port=58506",""
2021-02-25 11:53:10.025 CST,14589,"127.0.0.1:58510",60371f26.38fd,1,"",2021-02-25 11:53:10 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58510",""
2021-02-25 11:53:10.026 CST,"repl","",14589,"127.0.0.1:58510",60371f26.38fd,2,"authentication",2021-02-25 11:53:10 CST,4/5,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:53:10.028 CST,"repl","",14589,"127.0.0.1:58510",60371f26.38fd,3,"idle",2021-02-25 11:53:10 CST,0,LOG,00000,"disconnection: session time: 0:00:00.003 user=repl database= host=127.0.0.1 port=58510",""
2021-02-25 11:53:15.026 CST,14590,"127.0.0.1:58514",60371f2b.38fe,1,"",2021-02-25 11:53:15 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58514",""
2021-02-25 11:53:15.026 CST,"repl","",14590,"127.0.0.1:58514",60371f2b.38fe,2,"authentication",2021-02-25 11:53:15 CST,4/6,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:53:15.028 CST,"repl","",14590,"127.0.0.1:58514",60371f2b.38fe,3,"idle",2021-02-25 11:53:15 CST,0,LOG,00000,"disconnection: session time: 0:00:00.002 user=repl database= host=127.0.0.1 port=58514",""
2021-02-25 11:53:20.025 CST,14591,"127.0.0.1:58518",60371f30.38ff,1,"",2021-02-25 11:53:20 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58518",""
2021-02-25 11:53:20.025 CST,"repl","",14591,"127.0.0.1:58518",60371f30.38ff,2,"authentication",2021-02-25 11:53:20 CST,4/7,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:53:20.027 CST,"repl","",14591,"127.0.0.1:58518",60371f30.38ff,3,"idle",2021-02-25 11:53:20 CST,0,LOG,00000,"disconnection: session time: 0:00:00.001 user=repl database= host=127.0.0.1 port=58518",""
2021-02-25 11:53:25.025 CST,14592,"127.0.0.1:58522",60371f35.3900,1,"",2021-02-25 11:53:25 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58522",""
2021-02-25 11:53:25.026 CST,"repl","",14592,"127.0.0.1:58522",60371f35.3900,2,"authentication",2021-02-25 11:53:25 CST,4/8,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:53:25.027 CST,"repl","",14592,"127.0.0.1:58522",60371f35.3900,3,"idle",2021-02-25 11:53:25 CST,0,LOG,00000,"disconnection: session time: 0:00:00.001 user=repl database= host=127.0.0.1 port=58522",""
2021-02-25 11:53:30.024 CST,14593,"127.0.0.1:58526",60371f3a.3901,1,"",2021-02-25 11:53:30 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58526",""
2021-02-25 11:53:30.025 CST,"repl","",14593,"127.0.0.1:58526",60371f3a.3901,2,"authentication",2021-02-25 11:53:30 CST,4/9,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:53:30.026 CST,"repl","",14593,"127.0.0.1:58526",60371f3a.3901,3,"idle",2021-02-25 11:53:30 CST,0,LOG,00000,"disconnection: session time: 0:00:00.001 user=repl database= host=127.0.0.1 port=58526",""
2021-02-25 11:53:35.024 CST,14597,"127.0.0.1:58530",60371f3f.3905,1,"",2021-02-25 11:53:35 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58530",""
2021-02-25 11:53:35.025 CST,"repl","",14597,"127.0.0.1:58530",60371f3f.3905,2,"authentication",2021-02-25 11:53:35 CST,4/10,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:53:35.026 CST,"repl","",14597,"127.0.0.1:58530",60371f3f.3905,3,"idle",2021-02-25 11:53:35 CST,0,LOG,00000,"disconnection: session time: 0:00:00.001 user=repl database= host=127.0.0.1 port=58530",""
2021-02-25 11:53:40.025 CST,14609,"127.0.0.1:58534",60371f44.3911,1,"",2021-02-25 11:53:40 CST,0,LOG,00000,"connection received: host=127.0.0.1 port=58534",""
2021-02-25 11:53:40.025 CST,"repl","",14609,"127.0.0.1:58534",60371f44.3911,2,"authentication",2021-02-25 11:53:40 CST,4/11,0,LOG,00000,"replication connection authorized: user=repl",""
2021-02-25 11:53:40.027 CST,"repl","",14609,"127.0.0.1:58534",60371f44.3911,3,"idle",2021-02-25 11:53:40 CST,0,LOG,00000,"disconnection: session time: 0:00:00.002 user=repl database= host=127.0.0.1 port=58534",""

To Reproduce

Steps to reproduce the behavior: configure Patroni and Postgres with the configuration files I provided.

Expected behavior

When the leader machine is shut down, auto-failover happens and a new leader is elected successfully.


Environment

  • Patroni version: 2.0.1
  • PostgreSQL version: 12.5
  • DCS (and its version): etcd 3.3.1

Patroni configuration file:

scope: pgsql
namespace: /service/
name: pg01

restapi:
  listen: 192.168.56.94:8008
  connect_address: 192.168.56.94:8008
etcd:
  hosts: 192.168.56.91:2379,192.168.56.92:2379,192.168.56.93:2379
bootstrap:
  dcs:
    ttl: 20
    loop_wait: 5
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      remove_data_directory_on_rewind_failure: true
      parameters:
  initdb:  
  - encoding: UTF8
  - data-checksums
  - wal-segsize: "512"
  pg_hba:  
  - host replication repl 0.0.0.0/0 md5
  - host all all 0.0.0.0/0 md5
postgresql:
  listen: 0.0.0.0:54322
  connect_address: 192.168.56.94:54322
  data_dir: /home/service/var/postgresql/pgdata
  bin_dir: /usr/local/postgresql_12.5/bin
  pgpass: /home/postgres/patroni/pgpassfile     
  authentication:
    replication:
      username: repl
      password: "xxxxxx"
    superuser:
      username: postgres
      password: "xxxxxx"
  parameters:
  callbacks:
    on_role_change: /bin/bash /home/postgres/patroni/scripts/pg_role_change.sh  
tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

patronictl show-config output:

[postgres@localhost postgresql]$ patronictl show-config
loop_wait: 5
maximum_lag_on_failover: 1048576
postgresql:
  parameters: null
  remove_data_directory_on_rewind_failure: true
  use_pg_rewind: true
retry_timeout: 10
ttl: 20

[postgres@localhost postgresql]$

Have you checked Patroni logs?

Sorry, I can't upload files; I have pasted the snippets above.

Have you checked PostgreSQL logs?

Yes, a snippet of the pg02 PostgreSQL log is pasted above.


Additional context

-- Postgres config file postgresql.base.conf

superuser_reserved_connections = 10
unix_socket_directories = '/tmp'
unix_socket_permissions = 0700
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10

temp_buffers = 128MB

shared_preload_libraries = 'pg_stat_statements,auto_explain,pgsentinel,pg_squeeze'
temp_file_limit = 15GB
track_functions = pl
track_activity_query_size = 10240
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = on
pg_stat_statements.save = on
auto_explain.log_min_duration = '1s'
auto_explain.log_nested_statements = on
bgwriter_delay = 20ms
wal_compression = on
fsync = on
synchronous_commit = local
wal_buffers = 16MB
archive_mode = always
archive_command = 'test ! -f /home/service/var/postgresql/archived_log/%f && cp %p /home/service/var/postgresql/archived_log/%f'
archive_timeout = 0
log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%d.log'
log_truncate_on_rotation = on
log_line_prefix = '%t:%r:%u@%d:[%p]: '
log_statement = 'ddl'
log_checkpoints = on
log_connections = on
log_disconnections = on
log_min_duration_statement = 1000
log_lock_waits = on
log_timezone = 'PRC'
log_autovacuum_min_duration = 1000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
random_page_cost = 1.1
search_path = 'public'
autovacuum_max_workers = 5
vacuum_freeze_table_age = 1500000000
vacuum_multixact_freeze_table_age = 1500000000
old_snapshot_threshold = 3h
autovacuum_vacuum_cost_limit = 1000
autovacuum_vacuum_cost_delay = 10ms
idle_in_transaction_session_timeout = 2h
vacuum_cost_page_miss = 5
vacuum_cost_page_dirty = 10
max_wal_size = 48GB
min_wal_size = 12GB
max_parallel_maintenance_workers = 4
work_mem = 128MB
shared_buffers = 1GB
maintenance_work_mem = 1GB

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16

Most upvoted comments

I have told you already a dozen times: Patroni reports replication lag absolutely correctly. You can check by isolating the former primary and starting it up in recovery (manually, without Patroni!).
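One way to run that check (a sketch, assuming the data directory and binary path from the configuration above, executed on the isolated former primary while Postgres is stopped) is to read the final checkpoint location straight from the control file and compare it with the replicas' position:

# On the isolated former primary pg01, with Postgres stopped and Patroni disabled:
/usr/local/postgresql_12.5/bin/pg_controldata -D /home/service/var/postgresql/pgdata | grep -i 'checkpoint location'
# Compare "Latest checkpoint location" with the replicas' replayed LSN (1/60000268 above).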

The fact that replicas were streaming from the primary at some moment in time doesn't guarantee that they were still streaming during the shutdown. When Postgres is shut down it performs a checkpoint and writes the checkpoint record to the new WAL file, which explains why the LSN jumps ahead by the size of a single WAL file.
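That jump lines up with this cluster's settings: initdb ran with wal-segsize "512", and patronictl reported exactly 512 MB of lag against maximum_lag_on_failover = 1048576 bytes (1 MiB). A rough sketch of the arithmetic (1/80000000 is assumed to be the start of the next 512 MB segment, where the final shutdown checkpoint record would land per the explanation above; 1/60000268 is the replicas' replay position from the earlier output):

# Byte distance between the next 512 MB segment boundary and the replicas' replay LSN
psql -h 127.0.0.1 -p 54322 -U postgres -Atc \
  "SELECT pg_wal_lsn_diff('1/80000000', '1/60000268') / 1024 / 1024 AS lag_mb;"
# => roughly 512, matching the "Lag in MB" column patronictl shows after the shutdown.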

During a "normal" shutdown, Postgres keeps the walsender processes around until the very last bits are replicated (including the final checkpoint). In your case the walsender processes are apparently terminated prematurely by something from the outside. And I am 100% sure there is evidence of that (that the walsenders were terminated) in the Postgres logs on the former primary. Instead of spending 10 minutes analyzing all the logs in the distributed system, you wasted a few days hunting witches and trying to prove that Patroni doesn't work correctly.
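For reference, a quick way to look for that evidence (assuming the csvlog files sit under the data directory's log/ subdirectory, per the logging settings above; the exact messages depend on what terminated the walsenders):

# On the former primary pg01, scan the CSV logs around the shutdown window
grep -iE 'walsender|replication|terminating|shutting down' /home/service/var/postgresql/pgdata/log/*.csv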