neon: Gc failed, retrying: could not find data for key 010000000000000000000000000000000000
Error seen on production (ps-3):
2022-09-30T11:24:39.283877Z ERROR gc_loop{tenant_id=8f9f9c862e559fab4d8fcdea148e24a5}: Gc failed, retrying: could not find data for key 010000000000000000000000000000000000 at LSN 0/496A2B31, for request at LSN 0/496A2B30
About this issue
- State: closed
- Created 2 years ago
- Comments: 20 (20 by maintainers)
Commits related to this issue
- Ignore key not found error when mapping timestamp to LSN refer #2539 — committed to neondatabase/neon by knizhnik 2 years ago
- Peform some refactoring and code deduplication refer #2539 — committed to neondatabase/neon by knizhnik 2 years ago
- Persists latest_gc_cutoff_lsn before performing GC (#2558) * Persists latest_gc_cutoff_lsn before performing GC * Peform some refactoring and code deduplication refer #2539 * Add test for pe... — committed to neondatabase/neon by knizhnik 2 years ago
- Cleanup test_gc_cutoff.py test. - Remove the 'scale' parameter, this isn't a benchmark - Tweak pgbench and pageserver options to create garbage faster that the the GC can collect away. The test use... — committed to neondatabase/neon by hlinnaka 2 years ago
- test_gc_cutoff.py fixes (#2655) * Fix bogus early exit from GC. Commit 91411c415a added this failpoint, but the early exit was not intentional. * Cleanup test_gc_cutoff.py test. - Remove the 'scal... — committed to neondatabase/neon by hlinnaka 2 years ago
- Add an option to set "latest gc cutoff lsn" in pageserver binutils (#4290) ## Problem [#2539](https://github.com/neondatabase/neon/issues/2539) ## Summary of changes Add support for latest_gc_cut... — committed to neondatabase/neon by shanyp a year ago
https://neondb.slack.com/archives/C03H1K0PGKH/p1671039692717009
`latest_gc_cutoff_lsn` just determines the boundary of PITR (prohibiting requests with a smaller LSN). It is also used to check whether it is time to start a new GC iteration. The rule is the following: `min(last_record_lsn - gc_horizon, find_lsn_for_timestamp(current_time - pitr_interval))`
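
To make the rule concrete, here is a minimal Python sketch of that `min(...)` expression, read as the formula for the new GC cutoff. It is not the pageserver code: `parse_lsn` and `new_gc_cutoff` are hypothetical helper names, LSNs are modelled as plain integers, the clamp at zero is an assumption, and the result of `find_lsn_for_timestamp(current_time - pitr_interval)` is passed in as an already computed `pitr_cutoff_lsn`.

```python
def parse_lsn(text: str) -> int:
    """Parse a PostgreSQL-style LSN such as '0/496A2B30' into an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def new_gc_cutoff(last_record_lsn: int, gc_horizon: int, pitr_cutoff_lsn: int) -> int:
    """Return min(last_record_lsn - gc_horizon, pitr_cutoff_lsn).

    pitr_cutoff_lsn stands in for
    find_lsn_for_timestamp(current_time - pitr_interval).
    """
    # Assumption: clamp at 0 so that a horizon larger than the timeline
    # does not produce a negative LSN.
    horizon_cutoff = max(last_record_lsn - gc_horizon, 0)
    return min(horizon_cutoff, pitr_cutoff_lsn)


# Illustrative values only: the LSN is taken from the log line above,
# the 64 MiB horizon and the PITR cutoff are made up.
last = parse_lsn("0/496A2B30")
cutoff = new_gc_cutoff(last, 64 * 1024 * 1024, last - 1000)
```

Note that the previously stored `latest_gc_cutoff_lsn` does not appear among the inputs of this expression.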
So it doesn’t matter which value of `latest_gc_cutoff_lsn` is stored in metadata: it doesn’t affect the new GC cutoff.

The problem is that the GC code tries to get from storage a value that has to be reconstructed. While collecting data for reconstruction, we go too far into the past. We know the key: it is SlruDir. It is used in:
3 is used by compaction, 4-5 by the WAL receiver, and 1-2 by basebackup, the WAL receiver and … `is_latest_commit_timestamp_ge_than`! The last one is called by `find_lsn_for_timestamp`, which is used to determine the cutoff boundary based on `pitr_interval`.

So it looks like the source of the problem is now clear (thanks to @SomeoneToIgnore)!
`find_lsn_for_timestamp` uses binary search within the boundaries `latest_gc_cutoff..last_record_lsn`. If the value of `latest_gc_cutoff` is actually too small and precedes the GC that was actually performed, then we may try to access an already removed record.
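
The failure mode can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not the pageserver implementation: `ToyTimeline`, `KeyNotFound`, and the integer LSNs and timestamps are all hypothetical, and only the function names `find_lsn_for_timestamp` and `is_latest_commit_timestamp_ge_than` are borrowed from the discussion above. The timeline refuses reads below the LSN up to which GC has really removed data, and the binary search probes midpoints between the supplied lower bound and `last_record_lsn`.

```python
import bisect


class KeyNotFound(Exception):
    """Stands in for the 'could not find data for key' error."""


class ToyTimeline:
    def __init__(self, commits, removed_below):
        # commits: list of (commit_lsn, commit_timestamp), sorted by LSN.
        self.commit_lsns = [lsn for lsn, _ in commits]
        self.commit_ts = [ts for _, ts in commits]
        # Everything below this LSN has already been garbage collected.
        self.removed_below = removed_below

    def is_latest_commit_timestamp_ge_than(self, search_ts, lsn):
        if lsn < self.removed_below:
            raise KeyNotFound(f"could not find data at LSN {lsn}")
        # Latest commit at or before `lsn`, if any.
        idx = bisect.bisect_right(self.commit_lsns, lsn)
        return idx > 0 and self.commit_ts[idx - 1] >= search_ts


def find_lsn_for_timestamp(timeline, low, high, search_ts):
    """First LSN in [low, high] whose latest commit timestamp is >= search_ts
    (returns high + 1 if there is none)."""
    lo, hi = low, high + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if timeline.is_latest_commit_timestamp_ge_than(search_ts, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


timeline = ToyTimeline(
    commits=[(100, 10), (200, 20), (300, 30), (400, 40)],
    removed_below=250,  # GC really ran up to LSN 250
)

# Lower bound matches the GC that was performed: the search succeeds.
ok_lsn = find_lsn_for_timestamp(timeline, low=250, high=400, search_ts=35)

# Stale lower bound that precedes the performed GC: the first probe
# lands on removed data and the search fails.
try:
    find_lsn_for_timestamp(timeline, low=0, high=400, search_ts=35)
    stale_failed = False
except KeyNotFound:
    stale_failed = True
```

With the stale bound the very first midpoint (LSN 200) is below the removed range, which mirrors the case described above where the stored `latest_gc_cutoff` lags behind the GC that actually ran.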