iceoryx: RouDi SEGFAULT when a subscriber dies

Required information

Operating system: Ubuntu 20.04 LTS

Compiler version: GCC 9.4.0

Observed result or behavior: All of this behavior occurred with the iceoryx version shipped with Cyclone DDS on ROS 2 Galactic. I see similar behavior with ROS 2 Humble as well.

RouDi segfaults when a subscriber dies. This happens especially when a publisher has a high publish rate. RouDi segfaults after it detects that the node is dead and tries to clean up its memory.
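A sketch of the kind of setup that triggers it (the node, topic, payload size, and rate below are illustrative, not my exact application): a publisher hammering a topic at a high rate over rmw_cyclonedds_cpp with iceoryx shared memory enabled, plus any matching subscriber that then gets killed with kill -9 while RouDi is running.

// Hypothetical high-rate publisher; names, topic, and rate are illustrative.
#include <chrono>
#include <string>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
    rclcpp::init(argc, argv);
    auto node = rclcpp::Node::make_shared("fast_publisher");
    auto pub = node->create_publisher<std_msgs::msg::String>("chatter", 10);

    // Publish at roughly 1 kHz with a fairly large payload to widen the window.
    auto timer = node->create_wall_timer(std::chrono::milliseconds(1), [pub]() {
        std_msgs::msg::String msg;
        msg.data = std::string(4096, 'x');
        pub->publish(msg);
    });

    rclcpp::spin(node);
    rclcpp::shutdown();
    return 0;
}

Run iox-roudi, start this publisher and a matching subscriber with RMW_IMPLEMENTATION=rmw_cyclonedds_cpp and shared-memory exchange enabled in the CycloneDDS configuration, then kill -9 the subscriber; once RouDi's monitoring declares the process dead and starts cleaning up its ports, the crash described above appears.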

Tracking it with GDB, this is the backtrace:

(gdb) bt
#0  0x00007ffff7a13149 in std::__atomic_base<unsigned long>::fetch_sub (__m=std::memory_order_relaxed, __i=1, this=0x1e20010)
    at /usr/include/c++/9/bits/atomic_base.h:551
#1  iox::mepoo::SharedChunk::decrementReferenceCounter (this=0x7fff70d827b8) at /er/src/external/iceoryx/iceoryx_posh/source/mepoo/shared_chunk.cpp:56
#2  0x00007ffff7a130b8 in iox::mepoo::SharedChunk::~SharedChunk (this=0x7fff70d827b8, __in_chrg=<optimized out>)
    at /er/src/external/iceoryx/iceoryx_posh/source/mepoo/shared_chunk.cpp:42
#3  0x00007ffff7a237e2 in iox::popo::UsedChunkList<257u>::cleanup (this=0x7ffff677c6f8)
    at /er/src/external/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl:115
#4  0x00007ffff7a23730 in iox::popo::ChunkReceiver<iox::popo::ChunkReceiverData<256u, iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > >::releaseAll (this=0x7fff70d82880) at /er/src/external/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver.inl:88
#5  0x00007ffff7a23506 in iox::popo::SubscriberPortRouDi::releaseAllChunks (this=0x7fff70d82870)
    at /er/src/external/iceoryx/iceoryx_posh/source/popo/ports/subscriber_port_roudi.cpp:47
#6  0x00007ffff7b38a9d in iox::roudi::PortManager::destroySubscriberPort (this=0x7ffff7bae138 <iox::roudi::IceOryxRouDiApp::run()::m_rouDiComponents+51352>, 
    subscriberPortData=0x7ffff677a3e8) at /er/src/external/iceoryx/iceoryx_posh/source/roudi/port_manager.cpp:571
#7  0x00007ffff7b3808e in iox::roudi::PortManager::deletePortsOfProcess (this=0x7ffff7bae138 <iox::roudi::IceOryxRouDiApp::run()::m_rouDiComponents+51352>, 
    runtimeName=...) at /er/src/external/iceoryx/iceoryx_posh/source/roudi/port_manager.cpp:494
#8  0x00007ffff7b5d7c2 in iox::roudi::ProcessManager::monitorProcesses (this=0x7ffff7c9f470 <iox::roudi::IceOryxRouDiApp::run()::roudi+112>)
    at /er/src/external/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:668
#9  0x00007ffff7b5d23a in iox::roudi::ProcessManager::run (this=0x7ffff7c9f470 <iox::roudi::IceOryxRouDiApp::run()::roudi+112>)
    at /er/src/external/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:597
#10 0x00007ffff7b4f44a in iox::roudi::RouDi::monitorAndDiscoveryUpdate (this=0x7ffff7c9f400 <iox::roudi::IceOryxRouDiApp::run()::roudi>)
    at /er/src/external/iceoryx/iceoryx_posh/source/roudi/roudi.cpp:151
#11 0x00007ffff7b58135 in std::__invoke_impl<void, void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> (
    __f=@0x5555555761f0: (void (iox::roudi::RouDi::*)(iox::roudi::RouDi * const)) 0x7ffff7b4f3e8 <iox::roudi::RouDi::monitorAndDiscoveryUpdate()>, 
    __t=@0x5555555761e8: 0x7ffff7c9f400 <iox::roudi::IceOryxRouDiApp::run()::roudi>) at /usr/include/c++/9/bits/invoke.h:73
#12 0x00007ffff7b57dcb in std::__invoke<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> (
    __fn=@0x5555555761f0: (void (iox::roudi::RouDi::*)(iox::roudi::RouDi * const)) 0x7ffff7b4f3e8 <iox::roudi::RouDi::monitorAndDiscoveryUpdate()>)
    at /usr/include/c++/9/bits/invoke.h:95
#13 0x00007ffff7b57a51 in std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> >::_M_invoke<0ul, 1ul> (this=0x5555555761e8)
    at /usr/include/c++/9/thread:244
#14 0x00007ffff7b57969 in std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> >::operator() (this=0x5555555761e8)
    at /usr/include/c++/9/thread:251
#15 0x00007ffff7b5791a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> > >::_M_run (this=0x5555555761e0)
   at /usr/include/c++/9/thread:195
#16 0x00007ffff7759de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x00007ffff745b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#18 0x00007ffff7595133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I inserted a debug print in the function that segfaulted:

void SharedChunk::decrementReferenceCounter() noexcept
{
    std::cout<<"DECREMENT: "<<m_chunkManagement<<"\n";
    if ((m_chunkManagement != nullptr)
        && (m_chunkManagement->m_referenceCounter.fetch_sub(1U, std::memory_order_relaxed) == 1U))
    {
        freeChunk();
    }
}

It seems that m_chunkManagement is an invalid pointer compared with the previous ones:

[...]
DECREMENT: 0x7f499f075ff0
DECREMENT: 0
DECREMENT: 0
DECREMENT: 0
DECREMENT: 0x7f499f075f48
DECREMENT: 0x7f499f075f48
[...]
DECREMENT: 0x1df0000
Segmentation fault

It looks like m_chunkManagement is corrupted somewhere.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 28 (17 by maintainers)


Most upvoted comments

Ok, the reason for the “stalling” was just my debugging clutter, so things work properly now: no crashes, no hangs. Since I assume there was no intention for iox_sub_take_chunk and iox_sub_release_chunk to be thread-safe, I think this issue can be closed here and I’ll create a PR in cyclonedds.
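One way such serialization could look on the cyclonedds side (a sketch only, not the actual PR; the C-binding signatures and the ChunkReceiveResult_SUCCESS constant are written from memory and should be checked against iceoryx_binding_c/subscriber.h):

// External serialization sketch, assuming the two C-binding calls are not
// thread-safe per subscriber.
#include <mutex>
#include "iceoryx_binding_c/subscriber.h"

static std::mutex subscriberMutex;  // in real code: one mutex per iox_sub_t

const void* guardedTake(iox_sub_t subscriber)
{
    std::lock_guard<std::mutex> lock(subscriberMutex);
    const void* userPayload = nullptr;
    if (iox_sub_take_chunk(subscriber, &userPayload) == ChunkReceiveResult_SUCCESS)
    {
        return userPayload;
    }
    return nullptr;
}

void guardedRelease(iox_sub_t subscriber, const void* userPayload)
{
    std::lock_guard<std::mutex> lock(subscriberMutex);
    iox_sub_release_chunk(subscriber, userPayload);
}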

Btw, the ice_c_callbacks_subscriber.c example is missing an iox_sub_release_chunk(subscriber, userPayload); call. That’s one flaw I found here; I’ll create a PR for that too.
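Roughly what the callback should do once the release is added (sketched from memory, not the exact example code; signatures follow the iceoryx C binding as I remember it):

#include <stddef.h>
#include "iceoryx_binding_c/subscriber.h"

void onSampleReceivedCallback(iox_sub_t subscriber)
{
    const void* userPayload = NULL;
    while (iox_sub_take_chunk(subscriber, &userPayload) == ChunkReceiveResult_SUCCESS)
    {
        /* ... consume the sample ... */
        iox_sub_release_chunk(subscriber, userPayload);  // the call the example is missing
    }
}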

@elfenpiff @afrixs my gut tells me this is a synchronization issue. After having a closer look, it seems there is a small window where the subscriber has removed the sample from the UsedChunkList but the memory is not yet synchronized. When the subscriber is killed within this window and RouDi does the cleanup, we get a double free. The problem can be solved; the question is just how expensive it will be.
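To make the suspected window concrete, a toy model (not iceoryx code, heavily simplified): the subscriber process releases a chunk and then clears its slot in a shared used-chunk list, but with relaxed atomics there is no guarantee that RouDi observes the cleared slot before it runs the cleanup for the dead process, so the same chunk gets released twice.

#include <atomic>

struct Slot
{
    std::atomic<bool> used{false};
    void* chunk{nullptr};
};

// Subscriber side, running in the user process:
void subscriberRelease(Slot& slot, void (*releaseChunk)(void*))
{
    releaseChunk(slot.chunk);                           // reference dropped once
    slot.used.store(false, std::memory_order_relaxed);  // removal from the list
    // If the process is killed right around here, the cleared flag may not be
    // visible to RouDi yet.
}

// RouDi side, running after the process was detected as dead:
void roudiCleanup(Slot& slot, void (*releaseChunk)(void*))
{
    if (slot.used.load(std::memory_order_relaxed))      // may still read 'true'
    {
        releaseChunk(slot.chunk);                        // second release: double free
    }
}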

@ceccocats I also created an issue on the Cyclone DDS side: https://github.com/eclipse-cyclonedds/cyclonedds/issues/1445, since I suspect there may be a couple of issues here. @MatthiasKillat maybe you have some insights.

@MatthiasKillat @elfenpiff after creating #1771 I had an even closer look at the UsedChunkList and came to the conclusion that we do not have any issue there. It was just me who was a little bit confused 😄

cc @afrixs

@afrixs hahaha, I started my last comment with “my gut tells me the subscriber is accessed from multiple threads without synchronization” and then had another look at the UsedChunkList to be sure we do not have an issue there. You are right, the window is too small to reproduce it that reliably. Great that you found the root cause of your problem, but we also need to fix the synchronization issue in the UsedChunkList.