neon: Bug in GC

Sometimes GC deletes layers that are still needed. I first ran into this problem while experimenting on EC2 with a large data set, and the pageserver crashed because of disk space exhaustion. My first idea was that it was caused by garbage collecting layers beyond the disk-consistent LSN: https://github.com/zenithdb/zenith/pull/1004 However, I failed to create a test for it, because deleted layers are restored by replaying WAL from the safekeeper: https://github.com/zenithdb/zenith/pull/1043 The configuration I used on EC2 had no safekeepers.

But recently I was able to reproduce this problem locally without any restarts. Just run read-only pgbench with scale 100 and 10 clients for a long time (1000 sec). I got these errors:

pgbench: error: client 4 script 0 aborted in command 1 query 0: ERROR:  could not read block 13685 in rel 1663/13010/16404.0 from page server at lsn 1/ACE40AA8
DETAIL:  page server returned error: tried to request a page version that was garbage collected. requested at 1/ACE40AA8 gc cutoff 1/ACE4C878
progress: 770.0 s, 788.3 tps, lat 12.500 ms stddev 52.382
progress: 780.0 s, 730.6 tps, lat 12.206 ms stddev 32.596
pgbench: error: client 6 script 0 aborted in command 1 query 0: ERROR:  could not read block 134612 in rel 1663/13010/16396.0 from page server at lsn 1/B1CC5D48
DETAIL:  page server returned error: tried to request a page version that was garbage collected. requested at 1/B1CC5D48 gc cutoff 1/B1CCFB40

So something else is wrong in the GC logic.
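To make the errors above easier to read, here is a minimal sketch that parses the two LSNs out of a DETAIL line and reports how far the requested LSN lies behind the GC cutoff. It relies only on the standard PostgreSQL LSN text format (two hexadecimal numbers, high and low 32 bits, separated by a slash). The helper names `parse_lsn` and `bytes_behind_cutoff` are hypothetical and are not part of the pageserver code.

```python
import re

# Matches the "requested at X/Y gc cutoff X/Y" part of the DETAIL line above.
DETAIL_RE = re.compile(
    r"requested at ([0-9A-Fa-f]+/[0-9A-Fa-f]+) gc cutoff ([0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def parse_lsn(text: str) -> int:
    """Convert an LSN in 'hi/lo' hex notation to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def bytes_behind_cutoff(detail: str) -> int:
    """Return gc_cutoff - requested_lsn in bytes (positive: request is too old)."""
    match = DETAIL_RE.search(detail)
    if match is None:
        raise ValueError("no 'requested at ... gc cutoff ...' pair found")
    requested, cutoff = (parse_lsn(group) for group in match.groups())
    return cutoff - requested


if __name__ == "__main__":
    details = [
        "requested at 1/ACE40AA8 gc cutoff 1/ACE4C878",
        "requested at 1/B1CC5D48 gc cutoff 1/B1CCFB40",
    ]
    for detail in details:
        print(detail, "->", bytes_behind_cutoff(detail), "bytes behind cutoff")
```

For the two errors quoted above this gives 48592 and 40440 bytes, so in both cases the request is only a few tens of kilobytes of WAL older than the cutoff.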

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 29 (29 by maintainers)

Most upvoted comments

Oh, now I have inspected the code of wait_or_get_last_lsn and understand why this check is correct.

So yes, the last evicted LSN can be very old. And that’s OK. That’s not what’s passed to get_page_at_lsn.