risingwave: bug(debug profile): segfault/EXC_BAD_ACCESS during backtrace capture

Describe the bug

When running the playground on macOS on the latest main (first bad commit db6691ba142af74544d208afcd3e47b809b00e00), the following SQL commands crash the server with a segfault (EXC_BAD_ACCESS).

The same commands work as expected in cluster mode (./risedev d); only the playground crashes.

To Reproduce

CREATE TABLE t(a int, b int);
CREATE VIEW v AS SELECT * FROM t;
DROP TABLE t;

Expected behavior

Before that commit we were able to see the expected error:

ERROR:  QueryError: Permission denied: PermissionDenied: Fail to delete table `t` because 1 other relation(s) depend on it

Additional context

Console warnings before the segfault (also present on the last good commit, so possibly unrelated):

2022-11-04T15:37:18.08988+08:00  WARN risingwave_storage::hummock::state_store: sealing invalid epoch    
2022-11-04T15:37:18.090268+08:00  WARN risingwave_storage::hummock::state_store: syncing invalid epoch    

backtrace from lldb:

* thread #7, name = 'risingwave-main', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000019f39685c libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions(libunwind::LocalAddressSpace&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, int, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::PrologInfo*) + 204
    frame #1: 0x000000019f396710 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromFdeCie(libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, unsigned long) + 100
    frame #2: 0x000000019f3963e8 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) + 184
    frame #3: 0x000000019f3962a0 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::setInfoBasedOnIPRegister(bool) + 1012
    frame #4: 0x000000019f398788 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::step() + 696
    frame #5: 0x000000019f39b138 libunwind.dylib`_Unwind_Backtrace + 352
    frame #6: 0x0000000109081e40 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 [inlined] std::backtrace_rs::backtrace::libunwind::trace::h471a59e08ff9e5dc at mod.rs:66:5 [opt]
    frame #7: 0x0000000109081e30 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 [inlined] std::backtrace_rs::backtrace::trace_unsynchronized::h4e694232d85e2708 at mod.rs:66:5 [opt]
    frame #8: 0x0000000109081e24 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 at backtrace.rs:333:13 [opt]
    frame #9: 0x00000001060d8454 risingwave`_$LT$risingwave_meta..error..MetaError$u20$as$u20$core..convert..From$LT$risingwave_meta..error..MetaErrorInner$GT$$GT$::from::hb4b62fbc8685e728(inner=<unavailable>) at error.rs:66:33
    frame #10: 0x0000000105e7e1d8 risingwave`_$LT$T$u20$as$u20$core..convert..Into$LT$U$GT$$GT$::into::h7837bc8fb77e8181(self=<unavailable>) at mod.rs:726:9
    frame #11: 0x00000001060d8830 risingwave`risingwave_meta::error::MetaError::permission_denied::h6592a4f64415a283(s=<unavailable>) at error.rs:96:9
    frame #12: 0x00000001064bb43c risingwave`risingwave_meta::manager::catalog::CatalogManager$LT$S$GT$::drop_materialized_source::_$u7b$$u7b$closure$u7d$$u7d$::h36956825a3496ee1((null)=ResumeTy @ 0x0000000170c3e2b0) at mod.rs:1012:36
(... more callers omitted ...)
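
Frames #9 to #11 suggest the capture happens inside the error conversion itself: the `permission_denied` constructor calls `into()`, and the `From` impl creates the backtrace. A minimal sketch of that pattern follows. The names `InnerError` and `TracedError` are hypothetical stand-ins rather than the real RisingWave types, and the use of `Backtrace::capture()` (as opposed to `force_capture()`) is an assumption.

```rust
use std::backtrace::Backtrace;
use std::fmt;

// Hypothetical stand-in for the inner error enum seen in the trace.
#[derive(Debug)]
pub enum InnerError {
    PermissionDenied(String),
}

impl fmt::Display for InnerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InnerError::PermissionDenied(s) => write!(f, "PermissionDenied: {s}"),
        }
    }
}

// Hypothetical outer error that pairs the inner error with a backtrace.
#[derive(Debug)]
pub struct TracedError {
    inner: InnerError,
    backtrace: Backtrace,
}

impl From<InnerError> for TracedError {
    fn from(inner: InnerError) -> Self {
        // Assumption: `capture()` honors RUST_BACKTRACE / RUST_LIB_BACKTRACE;
        // when enabled, the stack is walked right here, on every conversion.
        Self { inner, backtrace: Backtrace::capture() }
    }
}

impl TracedError {
    // Mirrors the call chain in the trace: constructor -> into() -> From::from.
    pub fn permission_denied(s: impl Into<String>) -> Self {
        InnerError::PermissionDenied(s.into()).into()
    }

    pub fn backtrace(&self) -> &Backtrace {
        &self.backtrace
    }
}

impl fmt::Display for TracedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.inner)
    }
}

impl std::error::Error for TracedError {}
```

With this shape, any code path that merely constructs the error, even one the caller handles gracefully, walks the stack when backtraces are enabled.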

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 3
  • Comments: 47 (47 by maintainers)

Most upvoted comments

2342e8b8fc036c004e6935628e790ef007a6d6c7 SEGFAULTs immediately for me.

It’s caused by write_exclusive_cluster_id calling object_store.read, which returns Err when the data does not exist; object_store::ObjectError then captures a backtrace. That all seems quite reasonable. Really funny. 🥵
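
To illustrate why an expected miss still triggers a capture, here is a hedged sketch of the flow described above. `InMemObjectStore`, `write_exclusive_id`, and this `ObjectError` are simplified, hypothetical stand-ins and not the actual RisingWave implementation.

```rust
use std::backtrace::Backtrace;
use std::collections::HashMap;
use std::fmt;

// Hypothetical error type that captures a backtrace on construction.
#[derive(Debug)]
pub struct ObjectError {
    msg: String,
    backtrace: Backtrace,
}

impl ObjectError {
    fn not_found(path: &str) -> Self {
        // The capture runs even though the caller treats "not found" as normal.
        Self { msg: format!("object not found: {path}"), backtrace: Backtrace::capture() }
    }

    pub fn backtrace(&self) -> &Backtrace {
        &self.backtrace
    }
}

impl fmt::Display for ObjectError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.msg)
    }
}

impl std::error::Error for ObjectError {}

// Hypothetical in-memory object store.
#[derive(Default)]
pub struct InMemObjectStore {
    objects: HashMap<String, Vec<u8>>,
}

impl InMemObjectStore {
    pub fn read(&self, path: &str) -> Result<Vec<u8>, ObjectError> {
        self.objects
            .get(path)
            .cloned()
            .ok_or_else(|| ObjectError::not_found(path))
    }

    pub fn upload(&mut self, path: &str, data: Vec<u8>) {
        self.objects.insert(path.to_string(), data);
    }
}

// Writes `id` only if nothing is stored at `path` yet.
// Returns true if this call wrote the id, false if one already existed.
pub fn write_exclusive_id(store: &mut InMemObjectStore, path: &str, id: &str) -> bool {
    match store.read(path) {
        Ok(_) => false,
        // First start-up: the miss is expected, yet an error (and its
        // backtrace) has already been constructed by `read`.
        Err(_expected_miss) => {
            store.upload(path, id.as_bytes().to_vec());
            true
        }
    }
}
```

On a fresh store the first call always takes the `Err` branch, which would be consistent with the report of a segfault immediately at start-up.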

Really hope to investigate this problem … if I have enough time


What’s worse is that adding some random dummy code doesn’t fix it. 🥶

That’s interesting.

There’s an update in the upstream issue https://github.com/rust-lang/rust/issues/104388#issuecomment-1794027542. Looks promising

On main branch (f0f96a841cfcd02b1aab8ff6bba6ce7139efd49e, likely earlier):

./risedev p

-- taken from e2e_test/batch/./basic/query.slt.part
create table t3 (v1 int, v2 int, v3 int);
insert into t3 values(1, 2, NULL);
select v1/0 from t3;

same segfault in libunwind::CFI_Parser

reproducible again on 3f75c49…😢 I’d like to try looking into it.

I also hit this issue with risedev in my local macOS setup, but it can’t be reproduced if I restart the frontend manually. And the issue was gone after I rebased onto the main branch. 🥵

Let’s work around this issue by capturing only the stack without resolving the symbols? https://github.com/risingwavelabs/risingwave/issues/6357
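
A rough sketch of what separating capture from symbol resolution could look like, using only the standard library. `DeferredTrace` is a hypothetical wrapper. The assumption, based on my understanding of the current std implementation rather than a documented guarantee, is that `std::backtrace::Backtrace` records raw frames at capture time and only resolves symbols when it is formatted. Note that the stack walk itself still happens in `capture()`.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};

// Hypothetical wrapper: keep the raw capture, render it only on demand.
pub struct DeferredTrace {
    raw: Backtrace,
}

impl DeferredTrace {
    // Walks the stack now; `force_capture` ignores RUST_BACKTRACE.
    pub fn capture() -> Self {
        Self { raw: Backtrace::force_capture() }
    }

    // True when the platform supports backtraces and frames were recorded.
    pub fn is_captured(&self) -> bool {
        self.raw.status() == BacktraceStatus::Captured
    }

    // Formatting is where symbol lookup is assumed to happen, so the cost is
    // paid only when someone actually wants to read the trace.
    pub fn render(&self) -> String {
        self.raw.to_string()
    }
}
```

An error type could hold a `DeferredTrace` and call `render()` only when the error is logged, so errors that are handled and discarded never pay for symbolization.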