neon: Bug in GC
GC sometimes deletes layers that are still needed. I first ran into this problem while experimenting on EC2 with a large data set, where the pageserver crashed because it ran out of disk space. My first idea was that it was caused by garbage-collecting layers beyond the disk-consistent LSN: https://github.com/zenithdb/zenith/pull/1004 However, I failed to write a test for it, because the deleted layers are restored by replaying WAL from the safekeeper: https://github.com/zenithdb/zenith/pull/1043 The configuration I used on EC2 had no safekeepers.
But recently I was able to reproduce this problem locally without any restarts: just run a read-only pgbench with scale 100 and 10 clients for a long time (1000 sec). I got these errors:
pgbench: error: client 4 script 0 aborted in command 1 query 0: ERROR: could not read block 13685 in rel 1663/13010/16404.0 from page server at lsn 1/ACE40AA8
DETAIL: page server returned error: tried to request a page version that was garbage collected. requested at 1/ACE40AA8 gc cutoff 1/ACE4C878
progress: 770.0 s, 788.3 tps, lat 12.500 ms stddev 52.382
progress: 780.0 s, 730.6 tps, lat 12.206 ms stddev 32.596
pgbench: error: client 6 script 0 aborted in command 1 query 0: ERROR: could not read block 134612 in rel 1663/13010/16396.0 from page server at lsn 1/B1CC5D48
DETAIL: page server returned error: tried to request a page version that was garbage collected. requested at 1/B1CC5D48 gc cutoff 1/B1CCFB40
So something else is wrong in the GC logic.
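To make the two failures above easier to read, here is a minimal sketch of the comparison that the error text implies. It is not the pageserver's code: `parse_lsn`, `format_lsn` and `check_request` are hypothetical names, and the rule "reject a read whose LSN is below the GC cutoff" is an assumption inferred from the wording of the message.

```rust
/// Parse an LSN printed as "hi/lo" in hex (e.g. "1/ACE40AA8") into a u64.
/// The part before the slash is the high 32 bits, the part after is the low 32 bits.
fn parse_lsn(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some(((hi as u64) << 32) | lo as u64)
}

/// Print a u64 LSN back in the "hi/lo" hex form used in the log.
fn format_lsn(lsn: u64) -> String {
    format!("{:X}/{:X}", lsn >> 32, lsn & 0xFFFF_FFFF)
}

/// Hypothetical check (an assumption, not the real implementation):
/// a request is refused when its LSN is older than the GC cutoff.
fn check_request(request_lsn: u64, gc_cutoff: u64) -> Result<(), String> {
    if request_lsn < gc_cutoff {
        return Err(format!(
            "tried to request a page version that was garbage collected. requested at {} gc cutoff {}",
            format_lsn(request_lsn),
            format_lsn(gc_cutoff)
        ));
    }
    Ok(())
}
```

Fed with the values from the log, the first request (1/ACE40AA8) is 48592 bytes of WAL behind its cutoff (1/ACE4C878) and the second (1/B1CC5D48) is 40440 bytes behind its cutoff (1/B1CCFB40), so in both cases the cutoff is only slightly ahead of the LSN the client asked for.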
About this issue
- State: closed
- Created 3 years ago
- Comments: 29 (29 by maintainers)
Commits related to this issue
- Reproduce github issue #1047. — committed to neondatabase/neon by hlinnaka 2 years ago
- Gc cutoff rwlock (#1139) * Reproduce github issue #1047. * Use RwLock to protect gc_cuttof_lsn * Reduce number of updates in test_gc_aggressive * Change test_prohibit_get_page_at_lsn_for_ga... — committed to neondatabase/neon by knizhnik 2 years ago
Oh, now I have inspected the code of `wait_or_get_last_lsn` and understand why this check is correct. So yes, the last evicted LSN can be very old, and that’s OK: that’s not what’s passed to `get_page_at_lsn`.