roc-toolkit: roc_lib test hangs when removing a port

After fixing #223 (PR #275), a new issue appeared. It can be reproduced by running the roc_lib tests repeatedly under valgrind. Sometimes the test hangs when trying to remove a port and is then killed by timeout.py. I can’t reproduce this without valgrind.

while :; do python2 scripts/wrappers/timeout.py 300 \
  valgrind --max-stackframe=10475520 --error-exitcode=1 --exit-on-first-error=yes \
    bin/x86_64-pc-linux-gnu/roc-test-lib || break; done
==3280== 
==3280== HEAP SUMMARY:
==3280==     in use at exit: 0 bytes in 0 blocks
==3280==   total heap usage: 358 allocs, 358 frees, 665,735 bytes allocated
==3280== 
==3280== All heap blocks were freed -- no leaks are possible
==3280== 
==3280== For counts of detected and suppressed errors, rerun with: -v
==3280== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==3291== Memcheck, a memory error detector
==3291== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3291== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==3291== Command: bin/x86_64-pc-linux-gnu/roc-test-lib
==3291== 
timeout of 300s expired, aborting
==3291== 
==3291== Process terminating with default action of signal 15 (SIGTERM)
==3291==    at 0x499514C: futex_wait_cancelable (futex-internal.h:88)
==3291==    by 0x499514C: __pthread_cond_wait_common (pthread_cond_wait.c:502)
==3291==    by 0x499514C: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:655)
==3291==    by 0x496A498: uv_cond_wait (in /usr/lib64/libuv.so.1.0.0)
==3291==    by 0x487D268: roc::core::Cond::wait() const (cond.h:42)
==3291==    by 0x4888978: roc::netio::Transceiver::run_task_(roc::netio::Transceiver::Task&) (transceiver.cpp:242)
==3291==    by 0x4888E61: roc::netio::Transceiver::remove_port(roc::packet::Address) (transceiver.cpp:151)
==3291==    by 0x48790D9: (anonymous namespace)::receiver_close_port(void*, roc::pipeline::PortConfig const&) (receiver.cpp:22)
==3291==    by 0x487BD59: roc::pipeline::Receiver::iterate_ports(void (*)(void*, roc::pipeline::PortConfig const&), void*) const (receiver.cpp:84)
==3291==    by 0x4878FCF: roc_receiver_close (receiver.cpp:163)
==3291==    by 0x41A2A7: roc::(anonymous namespace)::Receiver::~Receiver() (test_sender_receiver.cpp:172)
==3291==    by 0x41A690: roc::TEST_sender_receiver_fec_with_losses_Test::testBody() (test_sender_receiver.cpp:467)
==3291==    by 0x42B863: helperDoTestBody(void*) (in /home/victor/dev/roc-project/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==3291==    by 0x42C3C5: PlatformSpecificSetJmp (in /home/victor/dev/roc-project/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==3291== 
==3291== HEAP SUMMARY:
==3291==     in use at exit: 246,108 bytes in 72 blocks
==3291==   total heap usage: 157 allocs, 85 frees, 344,500 bytes allocated
==3291== 
==3291== LEAK SUMMARY:
==3291==    definitely lost: 0 bytes in 0 blocks
==3291==    indirectly lost: 0 bytes in 0 blocks
==3291==      possibly lost: 680 bytes in 2 blocks
==3291==    still reachable: 245,428 bytes in 70 blocks
==3291==                       of which reachable via heuristic:
==3291==                         newarray           : 448 bytes in 2 blocks
==3291==         suppressed: 0 bytes in 0 blocks
==3291== Rerun with --leak-check=full to see details of leaked memory
==3291== 
==3291== For counts of detected and suppressed errors, rerun with: -v
==3291== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Alexandre was able to reproduce the problem reliably and found that it disappears if we read all pending packets before calling roc_receiver_close(): https://github.com/roc-project/roc/pull/283#issuecomment-552929159

Here are the backtraces. I captured them by running the test from Alexandre’s branch under valgrind and sending SIGABRT to individual threads while the test was stuck (the threads can be listed using ps axu -T | grep roc-test-lib).

Main thread:

ERROR: caught SIGABRT

#1: 0x48a64dd roc::core::print_emergency_backtrace()+0x1d
#2: 0x48a5e23 roc::core::(anonymous namespace)::signal_handler(int, siginfo_t*, void*)+0x53
#3: 0x499e910 funlockfile+0x80
#4: 0x499914c pthread_cond_wait+0x1fc
#5: 0x496cf29 uv_cond_wait+0x9
#6: 0x487e299 roc::core::Cond::wait() const+0x19
#7: 0x4889879 roc::netio::Transceiver::run_task_(roc::netio::Transceiver::Task&)+0xd9
#8: 0x4889d52 roc::netio::Transceiver::remove_port(roc::address::SocketAddr)+0x92
#9: 0x487a16f (anonymous namespace)::receiver_close_port(void*, roc::pipeline::PortConfig const&)+0x9f
#10: 0x487cdaa roc::pipeline::Receiver::iterate_ports(void (*)(void*, roc::pipeline::PortConfig const&), void*) const+0xaa
#11: 0x487a070 roc_receiver_close+0x70
#12: 0x41a4e8 roc::(anonymous namespace)::Receiver::~Receiver()+0x18
#13: 0x41a930 roc::TEST_sender_receiver_fec_with_losses_Test::testBody()+0x210
#14: 0x42bcc4 helperDoTestBody(void*)+0x24
#15: 0x42c826 PlatformSpecificSetJmp+0x56
#16: 0x42bb87 Utest::run()+0x47
#17: 0x429f99 UtestShell::runOneTestInCurrentProcess(TestPlugin*, TestResult&)+0x99
#18: 0x429e4c helperDoRunOneTestInCurrentProcess(void*)+0x4c
#19: 0x42c826 PlatformSpecificSetJmp+0x56
#20: 0x429d9c UtestShell::runOneTest(TestPlugin*, TestResult&)+0x8c
#21: 0x428b40 TestRegistry::runAllTests(TestResult&)+0x110
#22: 0x41fba0 CommandLineTestRunner::runAllTests()+0xd0
#23: 0x41f8e6 CommandLineTestRunner::runAllTestsMain()+0xb6
#24: 0x41f6bc CommandLineTestRunner::RunAllTests(int, char const**)+0xfc
#25: 0x41d50e main+0xce
#26: 0x50e3e5b __libc_start_main+0xeb
#27: 0x416dca _start+0x2a

libuv thread:

ERROR: caught SIGABRT

#1: 0x48a64dd roc::core::print_emergency_backtrace()+0x1d
#2: 0x48a5e23 roc::core::(anonymous namespace)::signal_handler(int, siginfo_t*, void*)+0x53
#3: 0x499e910 funlockfile+0x80
#4: 0x499cfcc longjmp+0x20c
#5: 0x49951c4 __pthread_mutex_lock+0x64
#6: 0x496ca69 uv_mutex_lock+0x9
#7: 0x487c3b5 roc::core::Mutex::lock() const+0x15
#8: 0x487c1a3 roc::core::ScopedLock<roc::core::Mutex>::ScopedLock(roc::core::Mutex const&)+0x33
#9: 0x487d029 roc::pipeline::Receiver::write(roc::core::SharedPtr<roc::packet::Packet, roc::core::RefCntOwnership> const&)+0x39
#10: 0x488d83f roc::netio::UDPReceiverPort::recv_cb_(uv_udp_s*, long, uv_buf_t const*, sockaddr const*, unsigned int)+0x75f
#11: 0x496dcc7 uv_tty_get_vterm_state+0x577
#12: 0x49701f8 uv__io_poll+0x3f8
#13: 0x4960675 uv_run+0x115
#14: 0x48892c8 roc::netio::Transceiver::run()+0x48
#15: 0x48a49f4 roc::core::Thread::thread_runner_(void*)+0x24
#16: 0x4992458 __libpthread_freeres+0x1788
#17: 0x51c680f clone+0x3f

So the problem is clear now. We have a deadlock (a standalone sketch of the pattern follows this list):

  • roc_receiver_close() calls pipeline::Receiver::iterate_ports(), which locks pipeline::Receiver::control_mutex_ and invokes the callback fn. fn sends a task to the libuv thread and waits until it finishes.

  • The libuv thread calls netio::UDPReceiverPort::recv_cb_(), which calls pipeline::Receiver::write(). It tries to lock pipeline::Receiver::control_mutex_ and hangs because the mutex is already held. As a result, the libuv thread is stuck in recv_cb_() and never gets to the task scheduled by roc_receiver_close().
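
To make the interaction easier to see, here is a standalone sketch of the pattern using std::thread and std::mutex instead of the actual roc classes. All names are stand-ins for the real code paths (iterate_ports(), run_task_(), write()); the program hangs by design.

#include <condition_variable>
#include <mutex>
#include <thread>

// Stands in for pipeline::Receiver::control_mutex_.
std::mutex control_mutex;

// task_mutex/task_cv/task_done stand in for the Cond that
// Transceiver::run_task_() waits on.
std::mutex task_mutex;
std::condition_variable task_cv;
bool task_done = false;

// Stands in for the libuv thread: it first delivers a pending packet
// (recv_cb_() -> Receiver::write(), which needs control_mutex) and only
// afterwards would get to the scheduled task.
void event_loop() {
    {
        // blocks forever, because the main thread already holds the mutex
        std::lock_guard<std::mutex> guard(control_mutex);
    }
    // never reached: the task is never executed
    std::lock_guard<std::mutex> guard(task_mutex);
    task_done = true;
    task_cv.notify_all();
}

// Stands in for Transceiver::remove_port(): schedule a task on the loop
// thread and wait until it completes.
void remove_port() {
    std::unique_lock<std::mutex> guard(task_mutex);
    task_cv.wait(guard, [] { return task_done; }); // waits forever
}

int main() {
    // roc_receiver_close() -> Receiver::iterate_ports(): takes the control
    // mutex and, while still holding it, waits for the loop thread.
    std::lock_guard<std::mutex> guard(control_mutex);

    std::thread loop(event_loop); // a packet is already "in flight"
    remove_port();                // deadlock: each thread waits for the other

    loop.join(); // never reached
    return 0;
}

Running it reproduces the two wait states from the backtraces above: one thread parked in a condition wait, the other blocked on the mutex.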

It appears that the iterate_ports() function is problematic and error-prone. It was a bad idea to call a user-provided callback while holding an internal mutex.

As a fix, I suggest removing this function altogether.

roc_receiver_close() (src/lib/src/receiver.cpp) uses it to remove all ports it registered in Transceiver. Instead of using iterate_ports() for that, roc_receiver can maintain an array (core::Array) of addresses (address::SocketAddr). roc_receiver_bind() will add an address to the array, and roc_receiver_close() will iterate the array and call trx.remove_port() for each one.
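
A rough, self-contained sketch of the shape I have in mind is below. The real roc types (core::Array, address::SocketAddr, netio::Transceiver) are replaced with std:: stand-ins, and the signatures and bodies are simplified, so treat it as an illustration of the idea rather than the actual src/lib code.

#include <cstddef>
#include <vector>

struct SocketAddr {}; // stands in for address::SocketAddr

struct Transceiver {  // stands in for netio::Transceiver
    bool add_port(const SocketAddr&) { return true; }
    void remove_port(const SocketAddr&) {}
};

struct roc_receiver {
    Transceiver trx;

    // Addresses of every port this receiver has registered in the
    // transceiver; stands in for a core::Array of address::SocketAddr.
    std::vector<SocketAddr> bound_ports;
};

int roc_receiver_bind(roc_receiver* receiver, const SocketAddr& addr) {
    if (!receiver->trx.add_port(addr)) {
        return -1;
    }
    // remember the address so that close() can remove the port later
    receiver->bound_ports.push_back(addr);
    return 0;
}

void roc_receiver_close(roc_receiver* receiver) {
    // iterate the receiver's own list instead of calling
    // pipeline::Receiver::iterate_ports(); no pipeline mutex is held here
    for (std::size_t n = 0; n < receiver->bound_ports.size(); n++) {
        receiver->trx.remove_port(receiver->bound_ports[n]);
    }
}

With this arrangement, roc_receiver_close() never holds the pipeline mutex while waiting for the libuv thread, so the libuv thread stays free to handle incoming packets and to execute each remove_port() task.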