patroni: Unable to set up replica nodes after converting leader from standalone

Describe the bug Was able to set up leader, but not the replica nodes when converting standalone installation to patroni.

To Reproduce Steps to reproduce the behavior: Follow https://patroni.readthedocs.io/en/latest/existing_data.html and add replica node. Even added the rewind user according to https://www.postgresql.org/docs/13/app-pgrewind.html

Expected behavior Replica to work.

Environment

  • Patroni version: 2.1.3
  • PostgreSQL version: 13.5.1PGDG.rhel7
  • DCS (and its version): etcd3

Patroni configuration file

scope: cluster
name: pg2

log:
  level: WARNING
  format: '%(levelname)s: %(message)s'
  loggers:
    patroni.postmaster: WARNING
    urllib3: INFO

restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.2.3:8008

etcd:
  hosts: 192.168.2.21:2379,192.168.2.22:2379,192.168.2.23:2379
  protocol: https
  cacert: /etc/patroni/etcd/ca.crt
  cert: /etc/patroni/etcd/client.crt
  key: /etc/patroni/etcd/client.key
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        log_destination: syslog
        log_line_prefix: ''
        max_connections: 100

  initdb:  
  - encoding: UTF8
  - data-checksums

  pg_hba:  
  - host replication replicator 192.168.2.2/32 scram-sha-256
  - host replication replicator 192.168.2.3/32 scram-sha-256
  - host replication replicator 192.168.2.4/32 scram-sha-256
  - host replication replicator 192.168.2.5/32 scram-sha-256
  - host all rewind_user 192.168.2.0/24 scram-sha-256
  - host all patroni 192.168.2.0/24 scram-sha-256

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.2.3:5432
  data_dir: /var/lib/pgsql/13/data
  bin_dir: /usr/pgsql-13/bin
  pgpass: /var/lib/pgsql/.pgpass
  authentication:
    superuser:
      username: patroni
      password: superuserpass
    replication:
      username: replicator
      password: replicatorpass
    rewind:
      username: rewind_user
      password: rewindpass

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: true
    nosync: false

patronictl show-config On that machine it is unable to show it, but might be due to not getting set up. On leader:

loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    log_destination: syslog
    log_line_prefix: ''
    max_connections: 100
  pg_hba:
  - local   all             all                                     peer
  - '# IPv4 local connections:'
  - host    all             all             127.0.0.1/32            scram-sha-256
  - '# IPv6 local connections:'
  - host    all             all             ::1/128                 scram-sha-256
  - '# Allow replication connections from localhost, by a user with the'
  - '# replication privilege.'
  - local   replication     all                                     peer
  - host    replication     all             127.0.0.1/32            scram-sha-256
  - host    replication     all             ::1/128                 scram-sha-256
  - ''
  - '# Patroni'
  - host    replication     replicator      192.168.2.2/32          scram-sha-256
  - host    replication     replicator      192.168.2.3/32          scram-sha-256
  - host    replication     replicator      192.168.2.4/32          scram-sha-256
  - host    replication     replicator      192.168.2.5/32          scram-sha-256
  - host    all             rewind_user     192.168.2.0/24          scram-sha-256
  - host    all             patroni         192.168.2.0/24          scram-sha-256
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30

Have you checked Patroni logs?

Mar 31 15:41:09 pg2.my.intra patroni[5596]: ERROR: Exception when working with leader
Mar 31 15:41:09 pg2.my.intra patroni[5596]: Traceback (most recent call last):
Mar 31 15:41:09 pg2.my.intra patroni[5596]: File "/usr/lib/python3.6/site-packages/patroni/postgresql/rewind.py", line 60, in check_leader_is_not_in_recovery
Mar 31 15:41:09 pg2.my.intra patroni[5596]: with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
Mar 31 15:41:09 pg2.my.intra patroni[5596]: File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
Mar 31 15:41:09 pg2.my.intra patroni[5596]: return next(self.gen)
Mar 31 15:41:09 pg2.my.intra patroni[5596]: File "/usr/lib/python3.6/site-packages/patroni/postgresql/connection.py", line 44, in get_connection_cursor
Mar 31 15:41:09 pg2.my.intra patroni[5596]: conn = psycopg.connect(**kwargs)
Mar 31 15:41:09 pg2.my.intra patroni[5596]: File "/usr/lib64/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect
Mar 31 15:41:09 pg2.my.intra patroni[5596]: conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
Mar 31 15:41:09 pg2.my.intra patroni[5596]: psycopg2.OperationalError: FATAL:  no pg_hba.conf entry for host "192.168.2.3", user "replicator", database "postgres", SSL off

Have you checked PostgreSQL logs? On Leader:

2022-03-31T15:41:09.067491+03:00 local0:err pg1.my.intra postgres[117034]:  [10-1] FATAL:  no pg_hba.conf entry for host "192.168.2.3", user "replicator", database "postgres", SSL off
2022-03-31T15:41:09.067874+03:00 local0:err pg1.my.intra postgres[117034]:  [10-2] DETAIL:  Client IP address resolved to "pg2.my.intra", forward lookup not checked.

Have you tried to use GitHub issue search?

Yes…

#2257 Issue seems to relate to the same issue, however the instructions for converting standalone do not have any reference to pg_hba config guidelines and no notice of having to give replica user access to postgres db. Nor does any sample config file include in bootstrap the access required. host all all 0.0.0.0/0 md5 is clearly way too much. Might be also related to change in #2162.

Additional context

Error itself is quite cryptic if you are trying to set up the cluster. The documentation doesn’t seem to state that you need to give replication user access to postgres db. Such access is not needed for replication itself, but should be duties of some sort of “control user”. Makes me wonder why superuser is even used, if it is not used for that purpose.

Checking the documentation, then it mentions only - host replication replicator 127.0.0.1/32 md5: A line like this is required for replication. in https://github.com/zalando/patroni/blob/master/docs/SETTINGS.rst

It seems a lot more work to fix all the documentation rather than make it work on some other user than replication user. Like a lot of people requested in #2257.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 16

Commits related to this issue

Most upvoted comments

It was the design choice. The rewind user can’t be used on v10 and older and the chances that the superuser connections are not allowed across nodes are much higher than the same for replication user. Introducing yet another user just because we need to figure out that the new leader is not in recovery anymore creates way more problems than solves.

The design is to use replication user first and only replication user? Smarter would be to try other options first and as last resort go for the replication user. Heck, even trying other users other than replication could avoid problem in my case.

Possibly the post of “I agree with x” is not much of value and they delete. The thumbs up kind of gives same result.

Before raising that whole issue, I gathered my information from the samples in github and documentation what is stated as required in pg_hba, so I find I did my part in trying to solve it. Vanilla replication works with having “replication” in database field and that is the reason they have not documented having to add postgres db in there.

In addition to comments in #2257, then there is difference in security side still. One thing is yes the data replication and if you can replicate, you have all the data, but at the same time if you can connect to database and possibly execute something on db machine, then it is another story. The connecting to postgres db potentially gives more access to the machine itself. Currently I am not aware if you can get to shell from psql console, but if I am correct, then using only replication protocol should minimize the risks of having such possibilities.

What is the case when all the connections are busy and the rewind script cannot connect using replication user to the leader?Superuser has possibility of ignoring the connection limit.

Generally that could apply yes, however my point is from issue side and not the change side. I do not feel that this kind of change on replication user usage should have been performed. So I am not the best person on documenting change that is causing problems and I do not understand the real “why such change was performed”. There is rewind user and superuser defined, possibly could rename rewind user to control user and put the non-superuser non-replication tasks on that user duties. Then in config there could be disclaimer that control user is required, however if rewind feature is not wanted, then you just do not give the rights for it and do not turn on the feature in config.

Currently I would keep the ticket open, at least for a while. Maybe someone else will find solution from it or it will stay as reminder to either change design or improve documentation.

The following line will most likely fix the issue:

Adding the following to pg_hba will help:

- host    postgres             replicator         192.168.2.0/24          scram-sha-256

and the rewind script cannot connect using replication user to the leader?

It will retry.

Superuser has possibility of ignoring the connection limit.

Please, make up your mind. Just a couple of sensntnces before you told that you don’t want the replication user to log in and execute commands, and now you want to allow the superuser access, which is clearly makes even worse security consequences.

I would prefer it not trying to connect using replication user at all. Superuser is again security issue. Rewind user is not supported in v10 that has around 6 months til EOL. A lot easier to require v10 and older ones to give this access to replication users, others try to use rewind? Is that a bad idea on how it should work? In that case the replication user would be fallback when rewind is not working.

Adding the following to pg_hba will help:

- host    postgres             replicator         192.168.2.0/24          scram-sha-256

From the link you posted, I am quite sure I have set up the replication.

  - host    replication     replicator      192.168.2.2/32          scram-sha-256
  - host    replication     replicator      192.168.2.3/32          scram-sha-256
  - host    replication     replicator      192.168.2.4/32          scram-sha-256
  - host    replication     replicator      192.168.2.5/32          scram-sha-256

The issue is not the replication, but more like the control commands it tries to do. I do not get error about unable to connect for replication. I have set up the .pgpass files and I can even connect normally with rewind or superuser account. Replica of course not using pgsql, because it is using the replication protocol.

Another node I set up worked fine, but in pg_hba it had addidion like

  - host    postgres        replicator      IP/32          scram-sha-256

Note that it is not documented in https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-AUTHENTICATION to add replication user access to postgres DB. In there it talks about adding replication user and filling database filed with special parameter “replication”

You have to properly configure pg_hba. Postgres documentation explains how to do that: https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-AUTHENTICATION