neon: Epic: replicas are slow to start
Motivation
We observed the following behavior on prod:
1. primary actively writes data, so there is some sk<>ps lag
2. secondary checks the last sk lsn (basebackup is asked with lsn=0)
3. secondary tries to get a basebackup with that lsn from ps
4. it takes more than 2m for the pageserver to catch up to that lsn
5. secondary start times out (timeout is 2m), and we retry from step 2, looping over
That led to the secondary start being stuck in that loop for a significant amount of time.
More details in https://neondb.slack.com/archives/C03F5SM1N02/p1701369162911829
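
The retry loop described above can be modeled roughly as follows. This is a toy sketch, not Neon code: `simulate_replica_start` and its parameters (write rate, ingest rate, lag in MB) are hypothetical names introduced for illustration, and only the 2m timeout comes from the report. It assumes each retry re-reads the sk lsn, so the target keeps moving while the primary writes.

```python
def simulate_replica_start(initial_lag_mb, write_rate_mb_s, ingest_rate_mb_s,
                           timeout_s=120.0, max_attempts=5):
    """Return the 1-based attempt on which the start succeeds, or None.

    Toy model (assumption): the pageserver ingests WAL at a constant rate
    and the primary writes WAL at a constant rate.
    """
    lag_mb = initial_lag_mb
    for attempt in range(1, max_attempts + 1):
        # The secondary picks the current sk lsn and asks the pageserver for
        # a basebackup at it; the pageserver must first ingest `lag_mb` of WAL.
        catch_up_s = lag_mb / ingest_rate_mb_s
        if catch_up_s <= timeout_s:
            return attempt
        # Timeout: the attempt is abandoned. During the wait the primary kept
        # writing, so the next attempt targets a newer lsn.
        lag_mb = max(0.0, lag_mb + (write_rate_mb_s - ingest_rate_mb_s) * timeout_s)
    return None
```

In this model, if the primary writes at least as fast as the pageserver ingests, the lag never shrinks between attempts and every retry times out, which matches the stuck loop seen on prod.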
DoD
Replica is quick to start
Implementation ideas
Most likely we don’t have to start the replica from `commit_lsn`; we can start it from the pageserver’s `last_record_lsn`. The replica will start lagged, but that will happen anyway in a few minutes.
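
A minimal sketch of the idea, assuming LSNs are passed around in the usual PostgreSQL `X/Y` hex text form. `choose_replica_start_lsn` is a hypothetical helper, not an existing Neon function, and taking the minimum of the two values is a defensive assumption added here so the chosen LSN is never ahead of what the pageserver already has.

```python
def parse_lsn(text):
    """Parse a PostgreSQL-style LSN such as '0/16B3748' into an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value):
    """Format an integer LSN back into the 'X/Y' hex form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def choose_replica_start_lsn(commit_lsn, last_record_lsn):
    """Pick the LSN to request the basebackup at (hypothetical helper).

    Using the pageserver's last_record_lsn means the basebackup can be
    served without waiting for WAL ingestion; min() is a safety assumption.
    """
    return format_lsn(min(parse_lsn(commit_lsn), parse_lsn(last_record_lsn)))
```

For example, with `commit_lsn="0/2000000"` and `last_record_lsn="0/1800000"` the replica would request its basebackup at `0/1800000` and then catch up by streaming WAL.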
### Tasks
- [ ] validate proposed solution
- [ ] implement proposed solution : )
Other related tasks and Epics
About this issue
- State: closed
- Created 7 months ago
- Comments: 17 (17 by maintainers)
I am not sure that we are trying to solve the real problem… Why can't PS catch up in more than 2 minutes? We have a quite small `max_replication_write_lag=15MB`, and it should not take more than a few seconds to replay this WAL. Also, it is not only the replica that can wait for the commit LSN: any `get_page` at this LSN can also be blocked for more than 2 minutes. But there is a 1 minute wait-for-LSN timeout. So IMHO the problem is not that we are trying to launch the replica at the wrong LSN, but that PS is not able to catch up for such a long time. This is a real problem which needs to be investigated. If it is solved, then no hack with spawning a lagged replica is needed.
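
To put numbers on this argument, here is a back-of-the-envelope calculation. The 15MB lag cap and the 2m timeout come from the discussion above; the 10 MB/s ingest rate is purely an illustrative assumption, not a measured pageserver figure, and the function names are hypothetical.

```python
MAX_REPLICATION_WRITE_LAG_MB = 15   # from max_replication_write_lag=15MB
REPLICA_START_TIMEOUT_S = 120       # replica start timeout of 2m


def implied_ingest_rate_mb_s(lag_mb, elapsed_s):
    """Upper bound on the ingest rate if `lag_mb` was not cleared in `elapsed_s`."""
    return lag_mb / elapsed_s


def expected_catch_up_s(lag_mb, ingest_rate_mb_s):
    """Time needed to ingest `lag_mb` of WAL at a constant rate."""
    return lag_mb / ingest_rate_mb_s


# If at most 15MB of lag cannot be cleared within 2m, the pageserver is
# effectively ingesting at no more than 0.125 MB/s.
stalled_rate = implied_ingest_rate_mb_s(MAX_REPLICATION_WRITE_LAG_MB,
                                        REPLICA_START_TIMEOUT_S)

# At an assumed (illustrative) 10 MB/s, the same lag would take 1.5 s.
healthy_catch_up = expected_catch_up_s(MAX_REPLICATION_WRITE_LAG_MB, 10.0)
```

If the lag really is capped at 15MB, a catch-up that takes more than 2m implies an effective ingest rate below 0.125 MB/s, which supports the point that the pageserver itself needs investigating.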