neon: initdb_lsn is not reproducible

To create an empty timeline, the pageserver runs the initdb command: https://github.com/neondatabase/neon/blob/3be3bb77302c371d1c58fda7d17dbf56dd9ad061/pageserver/src/tenant.rs#L994-L1000

After this call, we import the resulting pg data, extract initdb_lsn from it, and query safekeepers for any further WAL from compute, using an Lsn based on initdb_lsn as the start lsn for these queries. The safekeepers, on their side, record this start lsn as the offset from which WAL streaming should start, and store WAL in files based on that offset.

Later, if we want to restore a pageserver entirely from safekeeper WAL, it should be possible in a similar way: create an empty timeline on the pageserver with the needed IDs, then make it query the safekeeper for the WAL stored during the previous WAL streaming.

That would not work if the Lsn offset is different: safekeepers would start streaming WAL from a different offset, WAL segments' checksums would not match, and the data inside that WAL would not match what the pageserver expects. Example: https://neondb.slack.com/archives/C03H1K0PGKH/p1664971119442359
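To make the dependence on the start offset concrete, here is a minimal sketch of how an LSN maps to a WAL segment file and a byte offset inside it. It assumes stock PostgreSQL conventions (16 MiB segments, timeline 1, the usual 24-hex-digit segment file name); `parse_lsn` and `wal_location` are illustrative helper names, not Neon code, and the safekeeper's actual on-disk layout may differ.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # PostgreSQL default segment size (assumed here)


def parse_lsn(text: str) -> int:
    """Convert an LSN printed as 'hi/lo' (two hex numbers) into one 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_location(lsn: int, timeline: int = 1, seg_size: int = WAL_SEGMENT_SIZE):
    """Return (segment file name, byte offset within that segment) for an LSN."""
    segs_per_xlogid = 0x1_0000_0000 // seg_size  # segments per 4 GiB of WAL
    segno = lsn // seg_size
    name = "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )
    return name, lsn % seg_size


for text in ("0/1696070", "0/1696068"):
    name, offset = wal_location(parse_lsn(text))
    print(f"{text}: segment {name}, offset 0x{offset:X}")
```

Under these assumptions both LSNs from this report land in the same segment file but at different offsets, so a stream that starts at one of them is not byte-compatible with files laid out from the other.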


It turns out that initdb does not produce the same output even when run from the same binaries. Consider the recently built PR https://github.com/neondatabase/neon/pull/2589: it produced the neondatabase/neon:2178 image, which was tagged as neondatabase/neon:latest after the build.

If you run the commands inside the Docker image for it (docker run -it --rm neondatabase/neon:2178 bash), you get the following output:

neon@8478c4e9b86a:~$ env LD_LIBRARY_PATH="/usr/local/v14/lib/" env DYLD_LIBRARY_PATH="/usr/local/v14/lib/" /usr/local/v14/bin/initdb -D ./pg14-initdb/ -U test_user -E utf8 --no-instructions --no-sync
... snip, successful operation, creates `pg14-initdb` directory

neon@8478c4e9b86a:~$ /usr/local/bin/pageserver_binutils ./pg14-initdb/global/pg_control
... snip
pg_initdb_lsn: 0/1696070, aligned: 0/1696070

neon@8478c4e9b86a:~$ /usr/local/bin/pageserver_binutils --version
Neon Pageserver binutils git-env:13f0e7a5b4a2ea1187955926e036d4ac57ed094c

neon@8478c4e9b86a:~$ env LD_LIBRARY_PATH="/usr/local/v14/lib/" env DYLD_LIBRARY_PATH="/usr/local/v14/lib/" /usr/local/v14/bin/initdb --version
initdb (PostgreSQL) 14.5

root@8478c4e9b86a:/data# file /usr/local/v14/bin/initdb
/usr/local/v14/bin/initdb: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=7846631cab9d142450aea3e6e8a5330944a9d291, for GNU/Linux 3.2.0, with debug_info, not stripped

The same image, after being deployed on stage-ps-2, outputs the following:

admin@zenith-us-stage-ps-2:~$ env LD_LIBRARY_PATH="/usr/local/v14/lib/" env DYLD_LIBRARY_PATH="/usr/local/v14/lib/" /usr/local/v14/bin/initdb -D ./pg14-initdb/ -U test_user -E utf8 --no-instructions --no-sync
... snip, successful operation, creates `pg14-initdb` directory

admin@zenith-us-stage-ps-2:~$ /usr/local/bin/pageserver_binutils ./pg14-initdb/global/pg_control
... snip
pg_initdb_lsn: 0/1696068, aligned: 0/1696068

admin@zenith-us-stage-ps-2:~$ /usr/local/bin/pageserver_binutils --version
Neon Pageserver binutils git-env:13f0e7a5b4a2ea1187955926e036d4ac57ed094c 

admin@zenith-us-stage-ps-2:~$ env LD_LIBRARY_PATH="/usr/local/v14/lib/" env DYLD_LIBRARY_PATH="/usr/local/v14/lib/" /usr/local/v14/bin/initdb --version
initdb (PostgreSQL) 14.5

admin@zenith-us-stage-ps-2:~$ file /usr/local/v14/bin/initdb
/usr/local/v14/bin/initdb: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=7846631cab9d142450aea3e6e8a5330944a9d291, for GNU/Linux 3.2.0, with debug_info, not stripped

Binary files and build hashes match, but the pg_initdb_lsn values differ by 8 bytes (0/1696070 vs 0/1696068). With tenant relocation and the various other ways to dynamically switch environments for a timeline, we currently cannot guarantee that initdb_lsn is identical, and hence cannot rely on safekeeper WAL restoration ever working.
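The gap between the two reported values can be checked with a few lines of arithmetic. This is only a sketch: `parse_lsn` is an illustrative helper, and the 8-byte alignment check is an assumption about what the "aligned" column in the pageserver_binutils output refers to.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN printed as 'hi/lo' (two hex numbers) into one 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


docker_lsn = parse_lsn("0/1696070")   # value seen inside the Docker container
staging_lsn = parse_lsn("0/1696068")  # value seen on stage-ps-2

delta = docker_lsn - staging_lsn
print(f"difference: {delta} bytes")

# Assumption: "aligned" means rounded up to an 8-byte boundary.
for name, lsn in (("docker", docker_lsn), ("staging", staging_lsn)):
    print(f"{name}: 8-byte aligned = {lsn % 8 == 0}")
```

Both values are already multiples of 8, which is consistent with the "aligned" column repeating the raw LSN in each transcript.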

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (19 by maintainers)

Most upvoted comments

We can create a physical snapshot of existing databases, as of now, from the pageserver. That will allow us to recover from any issues that arise in the future, although it won’t allow you to recover to an earlier point.

The zstd --long option makes a big difference:

/tmp$ tar c pgdata | zstd  | wc -c 
3851746
/tmp$ tar c pgdata | zstd --long  | wc -c 
1521159

And --single-thread saves a little too:

/tmp$ tar c pgdata | zstd --long --single-thread  | wc -c 
1502937
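For a rough sense of scale, the byte counts above can be turned into relative savings. The sketch below only reuses the three sizes printed in this comment; the dictionary keys and the `saving_percent` helper are illustrative names.

```python
# Compressed sizes in bytes, copied from the `wc -c` outputs above.
sizes = {
    "zstd": 3851746,
    "zstd --long": 1521159,
    "zstd --long --single-thread": 1502937,
}


def saving_percent(size: int, reference: int) -> float:
    """Percentage by which `size` is smaller than `reference`."""
    return 100.0 * (reference - size) / reference


baseline = sizes["zstd"]
for flags, size in sizes.items():
    print(f"{flags}: {size} bytes, "
          f"{saving_percent(size, baseline):.1f}% smaller than plain zstd")

extra = saving_percent(sizes["zstd --long --single-thread"], sizes["zstd --long"])
print(f"--single-thread on top of --long: {extra:.1f}% smaller")
```

These figures describe this one pgdata directory only; other data sets may compress differently.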