roc-toolkit: roc_lib test hangs when removing a port
After fixing #223 (PR #275), a new issue appeared. It can be reproduced by running the roc_lib tests repeatedly under valgrind. Sometimes the test hangs when trying to remove a port and is then killed by timeout.py. I can't reproduce this without valgrind.
while :; do python2 scripts/wrappers/timeout.py 300 \
valgrind --max-stackframe=10475520 --error-exitcode=1 --exit-on-first-error=yes \
bin/x86_64-pc-linux-gnu/roc-test-lib || break; done
==3280==
==3280== HEAP SUMMARY:
==3280== in use at exit: 0 bytes in 0 blocks
==3280== total heap usage: 358 allocs, 358 frees, 665,735 bytes allocated
==3280==
==3280== All heap blocks were freed -- no leaks are possible
==3280==
==3280== For counts of detected and suppressed errors, rerun with: -v
==3280== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==3291== Memcheck, a memory error detector
==3291== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3291== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==3291== Command: bin/x86_64-pc-linux-gnu/roc-test-lib
==3291==
timeout of 300s expired, aborting
==3291==
==3291== Process terminating with default action of signal 15 (SIGTERM)
==3291== at 0x499514C: futex_wait_cancelable (futex-internal.h:88)
==3291== by 0x499514C: __pthread_cond_wait_common (pthread_cond_wait.c:502)
==3291== by 0x499514C: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:655)
==3291== by 0x496A498: uv_cond_wait (in /usr/lib64/libuv.so.1.0.0)
==3291== by 0x487D268: roc::core::Cond::wait() const (cond.h:42)
==3291== by 0x4888978: roc::netio::Transceiver::run_task_(roc::netio::Transceiver::Task&) (transceiver.cpp:242)
==3291== by 0x4888E61: roc::netio::Transceiver::remove_port(roc::packet::Address) (transceiver.cpp:151)
==3291== by 0x48790D9: (anonymous namespace)::receiver_close_port(void*, roc::pipeline::PortConfig const&) (receiver.cpp:22)
==3291== by 0x487BD59: roc::pipeline::Receiver::iterate_ports(void (*)(void*, roc::pipeline::PortConfig const&), void*) const (receiver.cpp:84)
==3291== by 0x4878FCF: roc_receiver_close (receiver.cpp:163)
==3291== by 0x41A2A7: roc::(anonymous namespace)::Receiver::~Receiver() (test_sender_receiver.cpp:172)
==3291== by 0x41A690: roc::TEST_sender_receiver_fec_with_losses_Test::testBody() (test_sender_receiver.cpp:467)
==3291== by 0x42B863: helperDoTestBody(void*) (in /home/victor/dev/roc-project/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==3291== by 0x42C3C5: PlatformSpecificSetJmp (in /home/victor/dev/roc-project/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==3291==
==3291== HEAP SUMMARY:
==3291== in use at exit: 246,108 bytes in 72 blocks
==3291== total heap usage: 157 allocs, 85 frees, 344,500 bytes allocated
==3291==
==3291== LEAK SUMMARY:
==3291== definitely lost: 0 bytes in 0 blocks
==3291== indirectly lost: 0 bytes in 0 blocks
==3291== possibly lost: 680 bytes in 2 blocks
==3291== still reachable: 245,428 bytes in 70 blocks
==3291== of which reachable via heuristic:
==3291== newarray : 448 bytes in 2 blocks
==3291== suppressed: 0 bytes in 0 blocks
==3291== Rerun with --leak-check=full to see details of leaked memory
==3291==
==3291== For counts of detected and suppressed errors, rerun with: -v
==3291== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Commits related to this issue
- roc_netio: fix log message Relates: #277 — committed to roc-streaming/roc-toolkit by dshil 5 years ago
- roc_netio: fix log message Relates: #277 — committed to hrishikeshSuresh/roc by dshil 5 years ago
- roc_netio: fix log message Relates: #277 — committed to gavv/roc-toolkit by dshil 5 years ago
Alexandre was able to reliably reproduce the problem and found that it disappears if we read all pending packets before calling roc_receiver_close(): https://github.com/roc-project/roc/pull/283#issuecomment-552929159
Here are the backtraces. I captured them by running the test from Alexandre's branch under valgrind and sending SIGABRT to individual threads while the test was stuck (threads can be listed using ps axu -T | grep roc-test-lib).

Main thread:

libuv thread:
So the problem is clear now. We have a deadlock:
roc_receiver_close() calls pipeline::Receiver::iterate_ports(), which locks pipeline::Receiver::control_mutex_ and invokes fn. fn sends a task to the libuv thread and waits until it finishes.
Meanwhile, the libuv thread calls netio::UDPReceiverPort::recv_cb_(), which calls pipeline::Receiver::write(). It tries to lock pipeline::Receiver::control_mutex_ and hangs because the mutex is already taken. As a result, the libuv thread is stuck in recv_cb_() and never gets to the task scheduled by roc_receiver_close().
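To make the ordering concrete, here is a minimal standalone sketch of the same deadlock pattern, using std::thread and std::mutex instead of libuv and the real roc classes (FakeReceiver, write(), close_all_ports() and task_done are simplified stand-ins, not the actual roc-toolkit code). When run, it hangs the same way the test does: the loop thread blocks in write() on the mutex that close_all_ports() holds while waiting for its task to complete.

#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

class FakeReceiver {
public:
    // Stands in for pipeline::Receiver::write(), called from the loop thread
    // for every pending packet.
    void write() {
        std::lock_guard<std::mutex> lock(control_mutex_); // blocks: mutex held by close_all_ports()
    }

    // Stands in for roc_receiver_close() -> iterate_ports() -> remove_port():
    // takes the control mutex and then waits for the loop thread's task.
    void close_all_ports(std::atomic<bool>& task_done) {
        std::lock_guard<std::mutex> lock(control_mutex_);
        while (!task_done) { // "wait until the scheduled task finishes"
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

private:
    std::mutex control_mutex_;
};

int main() {
    FakeReceiver recv;
    std::atomic<bool> task_done(false);

    // API thread: grabs control_mutex_ and waits for the "task" to finish.
    std::thread api_thread([&] { recv.close_all_ports(task_done); });

    // Make sure the API thread takes the mutex first.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    // Loop thread: delivers a pending packet first, and only then would run
    // the scheduled task -- but write() never returns.
    std::thread loop_thread([&] {
        recv.write();     // deadlocks here
        task_done = true; // never reached
    });

    api_thread.join();
    loop_thread.join();
    return 0;
}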
It appears that the iterate_ports() function is problematic and error-prone. It was a bad idea to call a user-provided callback while holding an internal mutex.
As a fix, I suggest removing this function entirely.
roc_receiver_close() (src/lib/src/receiver.cpp) uses it to remove all ports it registered in Transceiver. Instead of using iterate_ports() for that, roc_receiver can maintain an array (core::Array) of addresses (address::SocketAddr). roc_receiver_bind() will add an address to the array, and roc_receiver_close() will iterate the array and call trx.remove_port() for each one.
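A rough sketch of what this could look like in src/lib/src/receiver.cpp, assuming std::vector in place of core::Array and packet::Address as in the current remove_port() signature; the struct layout, field name bound_ports, and the simplified function signatures are illustrative, not the actual code:

#include <vector>

struct roc_receiver {
    roc::pipeline::Receiver receiver;
    roc::netio::Transceiver trx;

    // Addresses of all ports that were registered in the transceiver.
    std::vector<roc::packet::Address> bound_ports;
};

int roc_receiver_bind(roc_receiver* recv, roc::packet::Address& addr) {
    // ... existing logic that registers the port in recv->trx ...

    // New: remember the address so it can be removed on close.
    recv->bound_ports.push_back(addr);
    return 0;
}

void roc_receiver_close(roc_receiver* recv) {
    // No iterate_ports() and no pipeline mutex held here: just walk our own
    // list and ask the transceiver to remove each port.
    for (size_t i = 0; i < recv->bound_ports.size(); i++) {
        recv->trx.remove_port(recv->bound_ports[i]);
    }

    // ... stop the transceiver and destroy the receiver as before ...
}

This way the pipeline mutex is never held while waiting for the network thread, so the callback-under-mutex deadlock cannot occur.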