iceoryx: RouDi SEGFAULT when a subscriber dies
Required information
Operating system: Ubuntu 20.04 LTS
Compiler version: GCC 9.4.0
Observed result or behavior: All of this behavior occurred on the iceoryx version shipped with Cyclone DDS on ROS 2 Galactic. I see similar behavior with ROS 2 Humble as well.
RouDi segfaults when a subscriber dies. It happens especially when there is a publisher with a high publish rate. RouDi segfaults after it detects that the node is dead and tries to clean up its memory.
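For context, the scenario boils down to something like the following sketch at the plain iceoryx C++ API level (the service description, payload type, and runtime names are made up for illustration; the real setup runs through ROS 2 / Cyclone DDS). Killing the subscriber process hard while the publisher keeps running is enough for RouDi's monitoring to trigger the cleanup path shown in the backtrace below.

// Hypothetical reproduction sketch, not the actual application code.
#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"
#include <cstdint>
#include <cstdlib>

struct Counter
{
    std::uint64_t value{0U};
};

// Publisher process: publish as fast as possible (hypothetical service names).
void runFastPublisher()
{
    iox::runtime::PoshRuntime::initRuntime("fast-publisher");
    iox::popo::Publisher<Counter> publisher({"Demo", "HighRate", "Counter"});
    std::uint64_t i{0U};
    while (true)
    {
        publisher.loan().and_then([&](auto& sample) {
            sample->value = i++;
            sample.publish();
        });
    }
}

// Subscriber process: take and immediately release samples in a tight loop,
// then die abruptly while RouDi still considers it alive; RouDi's monitoring
// thread then cleans up the dead process.
void runShortLivedSubscriber()
{
    iox::runtime::PoshRuntime::initRuntime("short-lived-subscriber");
    iox::popo::Subscriber<Counter> subscriber({"Demo", "HighRate", "Counter"});
    for (int n = 0; n < 100000; ++n)
    {
        // the taken sample is released again when it goes out of scope
        subscriber.take().and_then([](auto&) {});
    }
    std::abort(); // simulate the process being killed
}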
Tracing with GDB, this is the backtrace:
(gdb) bt
#0 0x00007ffff7a13149 in std::__atomic_base<unsigned long>::fetch_sub (__m=std::memory_order_relaxed, __i=1, this=0x1e20010)
at /usr/include/c++/9/bits/atomic_base.h:551
#1 iox::mepoo::SharedChunk::decrementReferenceCounter (this=0x7fff70d827b8) at /er/src/external/iceoryx/iceoryx_posh/source/mepoo/shared_chunk.cpp:56
#2 0x00007ffff7a130b8 in iox::mepoo::SharedChunk::~SharedChunk (this=0x7fff70d827b8, __in_chrg=<optimized out>)
at /er/src/external/iceoryx/iceoryx_posh/source/mepoo/shared_chunk.cpp:42
#3 0x00007ffff7a237e2 in iox::popo::UsedChunkList<257u>::cleanup (this=0x7ffff677c6f8)
at /er/src/external/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl:115
#4 0x00007ffff7a23730 in iox::popo::ChunkReceiver<iox::popo::ChunkReceiverData<256u, iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > >::releaseAll (this=0x7fff70d82880) at /er/src/external/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver.inl:88
#5 0x00007ffff7a23506 in iox::popo::SubscriberPortRouDi::releaseAllChunks (this=0x7fff70d82870)
at /er/src/external/iceoryx/iceoryx_posh/source/popo/ports/subscriber_port_roudi.cpp:47
#6 0x00007ffff7b38a9d in iox::roudi::PortManager::destroySubscriberPort (this=0x7ffff7bae138 <iox::roudi::IceOryxRouDiApp::run()::m_rouDiComponents+51352>,
subscriberPortData=0x7ffff677a3e8) at /er/src/external/iceoryx/iceoryx_posh/source/roudi/port_manager.cpp:571
#7 0x00007ffff7b3808e in iox::roudi::PortManager::deletePortsOfProcess (this=0x7ffff7bae138 <iox::roudi::IceOryxRouDiApp::run()::m_rouDiComponents+51352>,
runtimeName=...) at /er/src/external/iceoryx/iceoryx_posh/source/roudi/port_manager.cpp:494
#8 0x00007ffff7b5d7c2 in iox::roudi::ProcessManager::monitorProcesses (this=0x7ffff7c9f470 <iox::roudi::IceOryxRouDiApp::run()::roudi+112>)
at /er/src/external/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:668
#9 0x00007ffff7b5d23a in iox::roudi::ProcessManager::run (this=0x7ffff7c9f470 <iox::roudi::IceOryxRouDiApp::run()::roudi+112>)
at /er/src/external/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:597
#10 0x00007ffff7b4f44a in iox::roudi::RouDi::monitorAndDiscoveryUpdate (this=0x7ffff7c9f400 <iox::roudi::IceOryxRouDiApp::run()::roudi>)
at /er/src/external/iceoryx/iceoryx_posh/source/roudi/roudi.cpp:151
#11 0x00007ffff7b58135 in std::__invoke_impl<void, void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> (
__f=@0x5555555761f0: (void (iox::roudi::RouDi::*)(iox::roudi::RouDi * const)) 0x7ffff7b4f3e8 <iox::roudi::RouDi::monitorAndDiscoveryUpdate()>,
__t=@0x5555555761e8: 0x7ffff7c9f400 <iox::roudi::IceOryxRouDiApp::run()::roudi>) at /usr/include/c++/9/bits/invoke.h:73
#12 0x00007ffff7b57dcb in std::__invoke<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> (
__fn=@0x5555555761f0: (void (iox::roudi::RouDi::*)(iox::roudi::RouDi * const)) 0x7ffff7b4f3e8 <iox::roudi::RouDi::monitorAndDiscoveryUpdate()>)
at /usr/include/c++/9/bits/invoke.h:95
#13 0x00007ffff7b57a51 in std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> >::_M_invoke<0ul, 1ul> (this=0x5555555761e8)
at /usr/include/c++/9/thread:244
#14 0x00007ffff7b57969 in std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> >::operator() (this=0x5555555761e8)
at /usr/include/c++/9/thread:251
#15 0x00007ffff7b5791a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> > >::_M_run (this=0x5555555761e0)
at /usr/include/c++/9/thread:195
#16 0x00007ffff7759de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x00007ffff745b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#18 0x00007ffff7595133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
I inserted a debug print into the function that segfaults:
void SharedChunk::decrementReferenceCounter() noexcept
{
    std::cout << "DECREMENT: " << m_chunkManagement << "\n";
    if ((m_chunkManagement != nullptr)
        && (m_chunkManagement->m_referenceCounter.fetch_sub(1U, std::memory_order_relaxed) == 1U))
    {
        freeChunk();
    }
}
It seems that m_chunkManagement is an invalid pointer compared with the previous ones:
[...]
DECREMENT: 0x7f499f075ff0
DECREMENT: 0
DECREMENT: 0
DECREMENT: 0
DECREMENT: 0x7f499f075f48
DECREMENT: 0x7f499f075f48
[...]
DECREMENT: 0x1df0000
Segmentation fault
It looks like m_chunkManagement is corrupted somewhere.
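If it helps to narrow this down, the same debug print can be extended to also show the reference count (a sketch based on the snippet above; note that with a corrupted non-null pointer the extra load faults at the same bad dereference as the fetch_sub does):

void SharedChunk::decrementReferenceCounter() noexcept
{
    std::cout << "DECREMENT: " << m_chunkManagement;
    if (m_chunkManagement != nullptr)
    {
        // With a garbage non-null pointer this load crashes in the same way
        // as the fetch_sub below, so it pinpoints the same bad dereference.
        std::cout << " refCount=" << m_chunkManagement->m_referenceCounter.load(std::memory_order_relaxed);
    }
    std::cout << "\n";
    if ((m_chunkManagement != nullptr)
        && (m_chunkManagement->m_referenceCounter.fetch_sub(1U, std::memory_order_relaxed) == 1U))
    {
        freeChunk();
    }
}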
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 28 (17 by maintainers)
Commits related to this issue
- WIP: creating ROS2-like publisher and subscriber to reproduce #1740 — committed to robotechvision/iceoryx by deleted user 2 years ago
Ok, the reason for the “stalling” was just my debugging clutter, so things work properly now, no crashes, no hangs. Since I assume there was no intention for iox_sub_take_chunk and iox_sub_release_chunk to be thread-safe, I think this issue can be closed here and I'll create a PR in cyclonedds. Btw, there's a missing iox_sub_release_chunk(subscriber, userPayload); in the ice_c_callbacks_subscriber.c example, that's one flaw I have found here; I'll create a PR for that.

@elfenpiff @afrixs my gut tells me this is a synchronization issue. After having a closer look, it seems there is a small window where the subscriber has removed the sample from the UsedChunkList but the memory is not yet synchronized. When the subscriber is killed within this window and RouDi does the cleanup, we have a double free. The problem can be solved; the question is just how expensive it will be.

@ceccocats I created an issue on the cyclone-dds side as well: https://github.com/eclipse-cyclonedds/cyclonedds/issues/1445 since I suspect there may be a couple of issues here. @MatthiasKillat maybe you have some insights.
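To make the window described above concrete, here is a minimal model of it (an illustration only, assuming a simplified single-slot used list shared between the subscriber process and RouDi; this is not the actual iceoryx code):

// Illustration of the suspected double-free window (hypothetical, simplified).
#include <atomic>
#include <cstdint>

struct ChunkManagement
{
    std::atomic<std::uint64_t> referenceCounter{1U};
};

struct UsedSlot
{
    ChunkManagement* chunk{nullptr}; // non-null means "subscriber still holds this chunk"
};

// Subscriber side: remove the entry from the used list, then drop the reference.
void subscriberRelease(UsedSlot& slot)
{
    ChunkManagement* c = slot.chunk;
    slot.chunk = nullptr; // removal from the used list ...
    // ... but if the subscriber is killed before this store becomes visible
    // to RouDi, RouDi's cleanup still observes the old pointer in the slot.
    if (c != nullptr && c->referenceCounter.fetch_sub(1U, std::memory_order_relaxed) == 1U)
    {
        /* freeChunk(c); */
    }
}

// RouDi side: cleanup after the subscriber died. If it sees the stale slot
// content, the counter is decremented a second time and the chunk is freed
// twice, matching the crash in decrementReferenceCounter above.
void roudiCleanup(UsedSlot& slot)
{
    ChunkManagement* c = slot.chunk;
    if (c != nullptr && c->referenceCounter.fetch_sub(1U, std::memory_order_relaxed) == 1U)
    {
        /* freeChunk(c); */
    }
}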
@MatthiasKillat @elfenpiff after creating #1771 I had an even closer look at the UsedChunkList and came to the conclusion that we do not have any issue there. It was just me who was a little bit confused 😄 cc @afrixs
@afrixs hahaha, I started my last comment with “my gut tells me the subscriber is accessed from multiple threads without synchronization” and then had another look at the UsedChunkList to be sure we do not have an issue there. You are right, the window is too small to reproduce it that reliably. Great that you found the root cause of your problem, but we also need to fix the synchronization issue in the UsedChunkList.