neon: Epic: replicas are slow to start

Motivation

We are currently observing the following behavior on prod:

  1. the primary actively writes data, so there is some sk<>ps lag
  2. the secondary checks the last sk LSN; basebackup is requested with lsn=0
  3. the secondary tries to get a basebackup at that LSN from the ps
  4. it takes more than 2m for the pageserver to catch up to that LSN
  5. the secondary start times out (the timeout is 2m), and we retry from step 2 in a loop

As a result, the secondary start stays stuck in that loop for a significant amount of time.
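The loop in steps 2 to 5 can be illustrated with a toy model. This is only a sketch under made-up assumptions: the function name, the constant write and ingest rates, and the numbers in the usage note are hypothetical and are not taken from the Neon code base or from prod measurements. It shows why re-sampling a fresh target LSN on every retry keeps the start failing until the lag has shrunk enough to fit into one timeout window.

```python
from typing import Optional


def attempts_until_start(
    initial_lag: float,
    write_rate: float,
    ingest_rate: float,
    timeout_s: float = 120.0,
    max_attempts: int = 100,
) -> Optional[int]:
    """Toy model of the retry loop (all parameters are hypothetical).

    initial_lag  -- sk<>ps lag in bytes when the first attempt begins
    write_rate   -- bytes/s the primary writes to the safekeepers
    ingest_rate  -- bytes/s the pageserver ingests
    Returns the 1-based attempt on which the start succeeds, or None.
    """
    now = 0.0
    for attempt in range(1, max_attempts + 1):
        # Step 2: sample the safekeeper LSN at the start of this attempt.
        target_lsn = initial_lag + write_rate * now
        # Pageserver position at the same moment (it started at 0).
        ps_lsn = ingest_rate * now
        remaining = max(target_lsn - ps_lsn, 0.0)
        # Steps 3-4: time the pageserver needs to reach the target LSN.
        wait_s = remaining / ingest_rate if ingest_rate > 0 else float("inf")
        if wait_s <= timeout_s:
            return attempt
        # Step 5: timed out; retry with a freshly sampled target.
        now += timeout_s
    return None
```

In this model, `attempts_until_start(1000, 1, 2)` only succeeds after several timed-out attempts, and it never succeeds when the pageserver ingests no faster than the primary writes.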

More details in https://neondb.slack.com/archives/C03F5SM1N02/p1701369162911829

DoD

Replica is quick to start

Implementation ideas

Most likely we don’t have to start the replica from commit_lsn; we can start it from the pageserver’s last_record_lsn. The replica will start lagged, but that will happen anyway in a few minutes.
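A minimal sketch of that idea, assuming LSNs are plain integers (byte positions in the WAL). The function names and the `allow_lagged` flag are hypothetical and only illustrate the choice; they are not the actual compute or pageserver API.

```python
def pick_start_lsn(commit_lsn: int, last_record_lsn: int, allow_lagged: bool) -> int:
    """Choose the LSN at which to request the basebackup (hypothetical helper)."""
    if allow_lagged:
        # Proposed behavior: never ask for more than the pageserver already has.
        return min(commit_lsn, last_record_lsn)
    # Current behavior as described in the issue: start from commit_lsn.
    return commit_lsn


def bytes_to_wait_for(start_lsn: int, last_record_lsn: int) -> int:
    """WAL the pageserver still has to ingest before it can serve start_lsn."""
    return max(start_lsn - last_record_lsn, 0)


# Example with made-up positions: the pageserver is 1500 bytes behind.
commit_lsn, last_record_lsn = 5000, 3500
lagged = pick_start_lsn(commit_lsn, last_record_lsn, allow_lagged=True)
strict = pick_start_lsn(commit_lsn, last_record_lsn, allow_lagged=False)
```

With the lagged choice there is nothing to wait for at start time; the replica then catches up by replaying WAL as usual.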

### Tasks
- [ ] validate proposed solution
- [ ] implement proposed solution : )

Other related tasks and Epics

About this issue

  • State: closed
  • Created 7 months ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

I am not sure that we are trying to solve the real problem… Why can't the PS catch up in more than 2 minutes? We have a quite small max_replication_write_lag=15MB, and it should not take more than a few seconds to replay this WAL. Also, the replica is not the only thing that can wait for the commit LSN: any get_page at this LSN can also be blocked for more than 2 minutes. But there is a 1 minute wait-for-LSN timeout.

So IMHO the problem is not that we are trying to launch the replica at the wrong LSN, but that the PS is not able to catch up for such a long time. This is the real problem, and it needs to be investigated. If it is solved, then no hack with spawning a lagged replica is needed.
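The arithmetic behind this argument can be made explicit. This is a back-of-the-envelope sketch, assuming "15MB" means 15 MiB and that the lag really is bounded by max_replication_write_lag; the 10 MiB/s replay rate is a made-up figure for illustration, not a measured pageserver number.

```python
MIB = 1024 * 1024


def min_replay_rate(lag_bytes: int, deadline_s: float) -> float:
    """Slowest replay rate (bytes/s) that still clears lag_bytes within deadline_s."""
    return lag_bytes / deadline_s


def replay_time_s(lag_bytes: int, rate_bytes_per_s: float) -> float:
    """Seconds needed to replay lag_bytes at a constant rate."""
    return lag_bytes / rate_bytes_per_s


max_lag = 15 * MIB  # assumed reading of max_replication_write_lag=15MB
# Rate below which 15 MiB would not be replayed within the 2m timeout.
threshold = min_replay_rate(max_lag, 120.0)
# Replay time at a hypothetical 10 MiB/s.
typical = replay_time_s(max_lag, 10 * MIB)
```

Under these assumptions the pageserver would have to replay at less than 128 KiB/s to miss the 2m timeout, while at 10 MiB/s the same WAL takes 1.5 seconds, which is why the slow catch-up itself looks like the thing to investigate.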