patroni: Can't auto failover after machine reboot

Describe the bug

I run a Patroni cluster with 3 nodes and maximum_lag_on_failover: 1048576. The dcs section of the configuration:

dcs:
  ttl: 10
  loop_wait: 10
  retry_timeout: 10
  maximum_lag_on_failover: 1048576
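As a side check on the timing values above: Patroni's documentation describes a relationship between these three settings, `loop_wait + 2 * retry_timeout <= ttl`. The snippet below is only a minimal sketch of that arithmetic applied to the values in this report. The function name `timing_rule_ok` is hypothetical and is not part of Patroni.

```python
def timing_rule_ok(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Return True when loop_wait + 2 * retry_timeout <= ttl.

    Hypothetical helper for illustration; not a Patroni API.
    """
    return loop_wait + 2 * retry_timeout <= ttl


# Values taken from the dcs section in this report.
reported = {"ttl": 10, "loop_wait": 10, "retry_timeout": 10}

# 10 + 2 * 10 = 30, which is greater than ttl = 10.
print(timing_rule_ok(**reported))
```

With the reported values the rule is not satisfied, so the timing settings may be worth reviewing alongside the lag threshold.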

While the cluster was running normally, I rebooted the pg03 machine, which was the leader. The cluster state beforehand:

[postgres@localhost ~]$ patronictl list
+ Cluster: pgsql (6921911383203823102) --+---------+----+-----------+
| Member | Host                | Role    | State   | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
| pg01   | 192.168.56.94:54322 | Replica | running | 13 |         0 |
| pg02   | 192.168.56.95:54322 | Replica | running | 13 |         0 |
| pg03   | 192.168.56.96:54322 | Leader  | running | 13 |           |
+--------+---------------------+---------+---------+----+-----------+
[postgres@localhost ~]$

I expected a failover to occur and a new leader to be elected, but the leader election failed.

Below is the Patroni log from the standby nodes:

2021-02-20 18:17:45,215 INFO: no action. i am a secondary and i am following a leader
2021-02-20 18:17:51,683 INFO: My wal position exceeds maximum replication lag
2021-02-20 18:17:51,696 INFO: following a different leader because i am not the healthiest node
2021-02-20 18:17:56,772 INFO: closed patroni connection to the postgresql cluster
2021-02-20 18:17:57 CST::@:[1944]: LOG: starting PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit
2021-02-20 18:17:57 CST::@:[1944]: LOG: listening on IPv4 address "0.0.0.0", port 54322
2021-02-20 18:17:57,050 INFO: postmaster pid=1944
2021-02-20 18:17:57 CST::@:[1944]: LOG: listening on Unix socket "/tmp/.s.PGSQL.54322"
2021-02-20 18:17:57 CST::@:[1944]: LOG: redirecting log output to logging collector process
2021-02-20 18:17:57 CST::@:[1944]: HINT: Future log output will appear in directory "log".
localhost:54322 - rejecting connections
localhost:54322 - rejecting connections
localhost:54322 - accepting connections
2021-02-20 18:17:58,260 INFO: establishing a new patroni connection to the postgres cluster
2021-02-20 18:17:58,268 INFO: My wal position exceeds maximum replication lag
2021-02-20 18:17:58,292 INFO: following a different leader because i am not the healthiest node
2021-02-20 18:18:08,259 INFO: My wal position exceeds maximum replication lag
2021-02-20 18:18:08,271 INFO: following a different leader because i am not the healthiest node

2021-02-20 18:21:08,255 INFO: My wal position exceeds maximum replication lag
2021-02-20 18:21:08,276 INFO: following a different leader because i am not the healthiest node

But before the machine was shut down, patronictl list showed no WAL lag. Why can't the cluster fail over?
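For anyone reading the "My wal position exceeds maximum replication lag" message, here is a rough sketch of the kind of comparison it suggests: the replica's WAL position is measured against the last known leader position, and the difference in bytes is compared with maximum_lag_on_failover. This is a simplified illustration under that assumption, not Patroni's actual code. The function names and the LSN values are hypothetical; only the 1048576 threshold comes from this report.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/5000000' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def exceeds_max_lag(leader_lsn: str, replica_lsn: str, maximum_lag_on_failover: int) -> bool:
    """Return True when the replica is behind the leader by more than the threshold.

    Hypothetical helper for illustration; not a Patroni API.
    """
    lag_bytes = parse_lsn(leader_lsn) - parse_lsn(replica_lsn)
    return lag_bytes > maximum_lag_on_failover


MAX_LAG = 1048576  # maximum_lag_on_failover from this report (1 MiB)

# Hypothetical positions: the replica is 0x200000 = 2097152 bytes behind.
print(exceeds_max_lag("0/5000000", "0/4E00000", MAX_LAG))

# Hypothetical positions: the replica is 0x80000 = 524288 bytes behind.
print(exceeds_max_lag("0/5000000", "0/4F80000", MAX_LAG))
```

Under this reading, a replica that keeps logging the message would be one whose own position stays more than 1 MiB behind the recorded leader position, so comparing the actual LSNs on pg01 and pg02 with the value stored in the DCS would be a reasonable next diagnostic step.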

About this issue

State: closed
Created 3 years ago
Comments: 22

Most upvoted comments

@hunterhuang8810 can you please post the solution for the issue you encountered? I am facing a similar situation and would like to see how you fixed it.

venkat-jaligama on Oct 4, 2021