usrsctp: Possible deadlock/infinite wait
Hi all,
We are running https://github.com/sctplab/usrsctp/commit/bc6a2d19ebc57cbae6fc1f118b64c15d29c5a32a currently. At some point, while doing ingest, we get into a deadlock/infinite wait in this function:
sctp_os_timer_stop
Here is the stack trace (I have eliminated the frames originating into our side of the software and only left ConsumeSCTPRawData
which belongs to us):
#0 0x0000007f9482733c in __pthread_cond_wait (cond=0x5555b52478, mutex=0x5555b52e30 <sctp_os_timerwait_mtx>) at pthread_cond_wait.c:188
#1 0x00000055559bd8b4 in sctp_os_timer_stop ()
#2 0x00000055559a149c in sctp_timer_stop ()
#3 0x00000055559be350 in ?? ()
#4 0x00000055559d08a0 in sctp_handle_sack ()
#5 0x00000055559532ac in ?? ()
#6 0x0000005555954890 in sctp_common_input_processing ()
#7 0x00000055559443a0 in usrsctp_conninput ()
#8 0x00000055557e80dc in ConsumeSCTPRawData(unsigned char const*, unsigned long) ()
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 57 (53 by maintainers)
Commits related to this issue
- -- fix for https://github.com/sctplab/usrsctp/issues/325 — committed to ubiquiti/ubnt-usrsctp by deleted user 5 years ago
- Fix a deadlock in usrsctplib. See https://github.com/sctplab/usrsctp/issues/325 The fix is copied from https://github.com/ubiquiti/ubnt-usrsctp/pull/3 — committed to 4shome/webrtc by chao-camect 5 years ago
- Fix a deadlock in usrsctp_finish The deadlock occurs when it's running on 2 virtual cores ubuntu machines with some very specific sequences. References: * Discussions on usrsctp, https://github.com/... — committed to awslabs/amazon-kinesis-video-streams-webrtc-sdk-c by lherman-cs 4 years ago
- Fix a deadlock in usrsctp_finish The deadlock occurs when it's running on 2 virtual cores ubuntu machines with some very specific sequences. References: * Discussions on usrsctp, https://github.com/... — committed to awslabs/amazon-kinesis-video-streams-webrtc-sdk-c by lherman-cs 4 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Bug: None Change-Id: If3c1ff7... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy test. The reliability test is intended to detec an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Mark constructor as explicit. Fixed d... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy test. The reliability test is intended to detec an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Mark constructor as explicit. Fixed d... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5... — committed to dangxiwang/webrtc by mstyura 5 years ago
- Update patch set 4 Patch Set 4: Hello Seth Hampson, This CL adds usrsctp stress test which triggers deadlock known as https://github.com/sctplab/usrsctp/issues/325. As you've mentioned here (https:... — committed to dangxiwang/webrtc by deleted user 5 years ago
- Update patch set 4 Patch Set 4: > Patch Set 4: > > Hello Seth Hampson, > > This CL adds usrsctp stress test which triggers deadlock known as https://github.com/sctplab/usrsctp/issues/325. As you'v... — committed to dangxiwang/webrtc by deleted user 5 years ago
- Update patch set 4 Patch Set 4: > Patch Set 4: > > > Patch Set 4: > > > > Hello Seth Hampson, > > > > This CL adds usrsctp stress test which triggers deadlock known as https://github.com/sctplab/... — committed to dangxiwang/webrtc by deleted user 5 years ago
Is this being worked on? We would like to roll in a newer version of usrsctp into Chromium & WebRTC, but are blocked on this.
@peterlei said in an email exchange with some Google folks that he will be reviewing #397 during the Thanksgiving timeframe.
@tuexen, @peterlei I’m really sorry for bothering. Is there any updates on this? Maybe some offline communication happen between both of you? I’m asking because externally it looks like Peter is not active on github at all according to his profile page, but maybe communication happen in different place… Thanks a lot in advance for answering!
@shiretu Does https://github.com/sctplab/usrsctp/pull/397 fix the issue for you? @peterlei What do you think about https://github.com/sctplab/usrsctp/pull/397?
@taylor-b I’ve successfully ran to completion stress test (
18
times in total) againstusrsctp
version from #417. I haven’t observed deadlock (problematic conditional variable were removed). I haven’t observed other crashes. It seems that even #405 (maybe I was unlucky this time) was fixed by #417. Thank you very much for your work, Taylor!@mstyura Sure, here’s a possible sequence of events:
sctp_remove_net
called.I guess another solution would be to have separate reference counts to be used by the timers versus everything else, so that at step 4 releasing the last “normal” reference would reattempt stopping the timers. Seems potentially error prone though.
Sorry for the delay. Ran into a small wrinkle but I think I have something that works, just need to do some more testing.
@taylor-b Your considerations make sense to me. Please go ahead and provide a PR. This is very appreciated.
@mstyura @tuexen
I just finished getting myself mostly up to speed on this issue. (For those who don’t know me, I used to work on WebRTC/usrsctp integration)
I may be overlooking something, but I don’t quite see how #397 fixes the problem. I have no doubt that switching to the heap has benefits, but it doesn’t seem to make any difference with regards to the race condition/deadlock. As far as I can tell, the “use after free” race condition still exists because
SCTP_OS_TIMER_STOP_DRAIN
isn’t called everywhere a timer is about to be freed (specifically the association timers), and the deadlock still exists becausesctp_os_timer_stop
blocks while holding locks that the callback is in all likelihood trying to acquire.My first thought after understanding the problem was “this seems like a job for reference counting”. When deleting something with timers, we can postpone the deletion if a callback has a reference to it, relying on the callback to finish the deletion when it acquires the relevant lock and sees that the object is marked for deletion. I’ve used this approach and seen it used in very similar situations.
And to my pleasant surprise, I found that all of the objects that have timers are already reference counted! A separate callback is used for the postponed deletion, which is also a pretty common approach. The problem is that the reference count is incremented too late when it comes to to callbacks. The reference count should be incremented when the timer is enqueued, while still holding the relevant lock; by the time it’s taken off the queue it could already be too late.
So as far as I can tell, if the reference counting is fixed things should “just work”. Would you like me to take a stab at it? I think it would be a much smaller/easier to review change.
@nxrighthere: our installs are linux only. To make it happen, try to disconnect abruptly on high traffic connection, where lots of ACKs are in flight. It also helps if you reduce the sendv amounts to say… 1024 to make even more ACKs