neon: assertion failed: self.historic_layers.remove(&LayerRTreeObject::new(layer)).is_some()
Steps to reproduce
Investigating.
Expected result
No panics
Actual result
One task panicked. The panic also affected other tenants, which became unavailable.
Part of the log containing stacktrace:
2023-01-19T16:15:20.672271Z INFO download complete: /storage/pageserver/data/tenants/X/timelines/Y/000000067F00004002000004EB0000000007-030000000000000000000000000000000002__0000>
2023-01-19T16:15:20.576264Z INFO synthetic_size_worker:calculate_synthetic_size{tenant_id=X}:gather_size_inputs{tenant_id=X}: on-demand downloading remote layer remote 73e1f731e8>
thread 'background op worker' panicked at 'assertion failed: self.historic_layers.remove(&LayerRTreeObject::new(layer)).is_some()', /home/nonroot/pageserver/src/tenant/layer_map.rs:397:9
stack backtrace:
0: rust_begin_unwind
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panicking.rs:65:14
2: core::panicking::panic
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panicking.rs:115:5
3: pageserver::tenant::layer_map::LayerMap<L>::remove_historic
at /home/nonroot/pageserver/src/tenant/layer_map.rs:397:9
4: pageserver::tenant::timeline::Timeline::download_remote_layer::{{closure}}::{{closure}}::{{closure}}
at /home/nonroot/pageserver/src/tenant/timeline.rs:3086:25
5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/future/mod.rs:91:19
6: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:296:9
7: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.25/src/future/future/catch_unwind.rs:36:42
8: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:271:9
9: std::panicking::try::do_call
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:483:40
10: std::panicking::try
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:447:19
11: std::panic::catch_unwind
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panic.rs:137:14
12: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.25/src/future/future/catch_unwind.rs:36:9
13: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:348:35
14: tokio::task::task_local::LocalKey<T>::scope_inner
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:233:19
15: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:345:13
16: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:348:35
17: tokio::task::task_local::LocalKey<T>::scope_inner
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:233:19
18: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:345:13
19: pageserver::task_mgr::task_wrapper::{{closure}}
at /home/nonroot/pageserver/src/task_mgr.rs:326:9
20: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/future/mod.rs:91:19
21: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/core.rs:223:17
22: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/loom/std/unsafe_cell.rs:14:9
23: tokio::runtime::task::core::Core<T,S>::poll
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/core.rs:212:13
24: tokio::runtime::task::harness::poll_future::{{closure}}
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:476:19
25: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:271:9
26: std::panicking::try::do_call
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:483:40
27: std::panicking::try
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:447:19
28: std::panic::catch_unwind
at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panic.rs:137:14
29: tokio::runtime::task::harness::poll_future
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:464:18
30: tokio::runtime::task::harness::Harness<T,S>::poll_inner
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:198:27
31: tokio::runtime::task::harness::Harness<T,S>::poll
at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:152:15
32: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
33: tokio::runtime::scheduler::multi_thread::worker::Context::run
34: tokio::macros::scoped_tls::ScopedKey<T>::set
35: tokio::runtime::scheduler::multi_thread::worker::run
36: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
37: tokio::runtime::task::core::Core<T,S>::poll
38: tokio::runtime::task::harness::Harness<T,S>::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2023-01-19T16:15:20.873463Z ERROR Task 'download layer 000000067F00004002000004EB0000000007-030000000000000000000000000000000002__000000001E6DE619-000000001E7B0B41' tenant_id: Some(X), timeline_id: Some(Y) panicked: Any { .. }
2023-01-19T16:15:20.873542Z ERROR synthetic_size_worker: failed to calculate synthetic size for tenant 1acf99ebac0
Environment
prod
Logs, links
About this issue
- State: closed
- Created a year ago
- Comments: 17 (17 by maintainers)
As later found out in #3589, we always fail with “cannot iterate a remote layer” for RemoteLayers, so this assertion error could not be compaction related (had compaction touched a RemoteLayer, it would have failed).

Reproduced the layer download itself with the release-2722 binary (sha256: dbbde2386583ba8f138ab813a116c84e499d53e73d229233b313daf0f8657e22) using just refresh_gc_info, no panic. This required a highly hacky, time-specific tenant configuration, where the pitr interval is near:
select extract (epoch from ('7 days' + (('now'::timestamptz) - ('2023-01-25 06:34:32.436+00:00'::timestamp))));
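For anyone repeating the reproduction without a Postgres session at hand, the pitr interval from the query above can be approximated as below. This is only a sketch: the function name pitr_interval_seconds and the REFERENCE constant are made up for illustration, and it assumes the timestamp in the query is interpreted as UTC (in Postgres the `::timestamp` cast drops the offset, so the result depends on the session time zone).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Timestamp taken from the SQL query; treating it as UTC is an assumption.
REFERENCE = datetime(2023, 1, 25, 6, 34, 32, 436000, tzinfo=timezone.utc)


def pitr_interval_seconds(reference: datetime, now: Optional[datetime] = None) -> float:
    """Return 7 days plus the time elapsed since `reference`, in seconds.

    Mirrors: extract(epoch from ('7 days' + (now - reference))).
    `now` can be passed explicitly to make the result reproducible.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (timedelta(days=7) + (now - reference)).total_seconds()


if __name__ == "__main__":
    # The value grows with wall-clock time, which is why the tenant
    # configuration used for the reproduction is time-specific.
    print(pitr_interval_seconds(REFERENCE))
```

Because the value depends on the current time, it has to be recomputed shortly before each reproduction attempt.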