neon: assertion failed: self.historic_layers.remove(&LayerRTreeObject::new(layer)).is_some()

Steps to reproduce

Investigating.

Expected result

No panics

Actual result

One task panicked. The panic also affected other tenants, which became unavailable.

Part of the log containing stacktrace:

2023-01-19T16:15:20.672271Z  INFO download complete: /storage/pageserver/data/tenants/X/timelines/Y/000000067F00004002000004EB0000000007-030000000000000000000000000000000002__0000>
2023-01-19T16:15:20.576264Z  INFO synthetic_size_worker:calculate_synthetic_size{tenant_id=X}:gather_size_inputs{tenant_id=X}: on-demand downloading remote layer remote 73e1f731e8>
thread 'background op worker' panicked at 'assertion failed: self.historic_layers.remove(&LayerRTreeObject::new(layer)).is_some()', /home/nonroot/pageserver/src/tenant/layer_map.rs:397:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panicking.rs:65:14
   2: core::panicking::panic
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panicking.rs:115:5
   3: pageserver::tenant::layer_map::LayerMap<L>::remove_historic
             at /home/nonroot/pageserver/src/tenant/layer_map.rs:397:9
   4: pageserver::tenant::timeline::Timeline::download_remote_layer::{{closure}}::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/tenant/timeline.rs:3086:25
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/future/mod.rs:91:19
   6: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:296:9
   7: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.25/src/future/future/catch_unwind.rs:36:42
   8: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:271:9
   9: std::panicking::try::do_call
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:483:40
  10: std::panicking::try
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:447:19
  11: std::panic::catch_unwind
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panic.rs:137:14
  12: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.25/src/future/future/catch_unwind.rs:36:9
  13: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:348:35
  14: tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:233:19
  15: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:345:13
  16: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:348:35
  17: tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:233:19
  18: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:345:13
  19: pageserver::task_mgr::task_wrapper::{{closure}}
             at /home/nonroot/pageserver/src/task_mgr.rs:326:9
  20: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/future/mod.rs:91:19
  21: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/core.rs:223:17
  22: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/loom/std/unsafe_cell.rs:14:9
  23: tokio::runtime::task::core::Core<T,S>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/core.rs:212:13
  24: tokio::runtime::task::harness::poll_future::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:476:19
  25: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:271:9
  26: std::panicking::try::do_call
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:483:40
  27: std::panicking::try
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:447:19
  28: std::panic::catch_unwind
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panic.rs:137:14
  29: tokio::runtime::task::harness::poll_future
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:464:18
  30: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:198:27
  31: tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:152:15
  32: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  33: tokio::runtime::scheduler::multi_thread::worker::Context::run
  34: tokio::macros::scoped_tls::ScopedKey<T>::set
  35: tokio::runtime::scheduler::multi_thread::worker::run
  36: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
  37: tokio::runtime::task::core::Core<T,S>::poll
  38: tokio::runtime::task::harness::Harness<T,S>::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2023-01-19T16:15:20.873463Z ERROR Task 'download layer 000000067F00004002000004EB0000000007-030000000000000000000000000000000002__000000001E6DE619-000000001E7B0B41' tenant_id: Some(X),  timeline_id: Some(Y) panicked: Any { .. }
2023-01-19T16:15:20.873542Z ERROR synthetic_size_worker: failed to calculate synthetic size for tenant 1acf99ebac0
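
The failing check in the panic message has the form remove(...).is_some(), which panics whenever the element being removed is not present in the collection. A minimal sketch of that failure mode follows. It uses a std BTreeMap as a stand-in for the R-tree, and ToyLayerMap, insert_historic, and the string keys are hypothetical names for illustration, not pageserver code. It makes no claim about why the layer was missing in this incident.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a layer map. The real structure is an R-tree of
/// layer objects; a BTreeMap keyed by layer name is assumed here for brevity.
#[derive(Default)]
struct ToyLayerMap {
    /// Layer name -> size in bytes (illustrative payload only).
    historic_layers: BTreeMap<String, u64>,
}

impl ToyLayerMap {
    /// Registers a layer under the given name.
    fn insert_historic(&mut self, name: &str, size: u64) {
        self.historic_layers.insert(name.to_string(), size);
    }

    /// Removes a layer and asserts that it was present, mirroring the shape
    /// of the assertion in the panic message. Panics if the name is absent.
    fn remove_historic(&mut self, name: &str) {
        assert!(self.historic_layers.remove(name).is_some());
    }
}
```

Under these assumptions, removing the same name twice (or removing a name that was never inserted) trips the assertion, while a single removal of a present name succeeds.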

Environment

prod

Logs, links

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

As later found in #3589, we always fail with “cannot iterate a remote layer” for RemoteLayers, so this assertion error could not be compaction-related (had compaction touched a RemoteLayer, it would have failed).

Reproduced the layer download itself with the release-2722 binary (sha256: dbbde2386583ba8f138ab813a116c84e499d53e73d229233b313daf0f8657e22) using only refresh_gc_info; no panic occurred. This required a highly hacky, time-specific tenant configuration:

[tenant_config]
pitr_interval = "713007s"
gc_horizon = 87772040

Here pitr_interval is set close to the result of the following query: select extract (epoch from ('7 days' + (('now'::timestamptz) - ('2023-01-25 06:34:32.436+00:00'::timestamp))));
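
The same arithmetic can be sketched outside SQL: seven days plus the time elapsed since the timestamp in the query, in whole seconds. This is only an illustration. The function and constant names are hypothetical, the reference timestamp is assumed to be interpreted as UTC, and its fractional .436 seconds are dropped.

```rust
/// Seven days expressed in seconds (the '7 days' term of the query).
const SEVEN_DAYS_SECS: u64 = 7 * 24 * 60 * 60;

/// 2023-01-25 06:34:32 as Unix seconds, assuming UTC and dropping the
/// fractional .436 seconds. Hypothetical constant for this sketch.
const REFERENCE_EPOCH_SECS: u64 = 1_674_628_472;

/// Returns 7 days plus the time elapsed since the reference, in seconds.
/// If `now` is before the reference, the elapsed part is clamped to zero.
fn pitr_interval_secs(now_epoch_secs: u64, reference_epoch_secs: u64) -> u64 {
    SEVEN_DAYS_SECS + now_epoch_secs.saturating_sub(reference_epoch_secs)
}

/// Formats the interval the way it appears in the tenant config, e.g. "713007s".
fn pitr_interval_setting(now_epoch_secs: u64, reference_epoch_secs: u64) -> String {
    format!("{}s", pitr_interval_secs(now_epoch_secs, reference_epoch_secs))
}
```

For the 713007s value in the configuration, the elapsed part is 713007 - 604800 = 108207 seconds, so the setting corresponds to a run roughly 30 hours after the timestamp in the query. Because the value depends on the current time, it has to be recomputed for each reproduction attempt.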