usrsctp: Possible deadlock/infinite wait

Hi all,

We are running https://github.com/sctplab/usrsctp/commit/bc6a2d19ebc57cbae6fc1f118b64c15d29c5a32a currently. At some point, while doing ingest, we get into a deadlock/infinite wait in this function:

sctp_os_timer_stop

Here is the stack trace (I have eliminated the frames originating into our side of the software and only left ConsumeSCTPRawData which belongs to us):

#0  0x0000007f9482733c in __pthread_cond_wait (cond=0x5555b52478, mutex=0x5555b52e30 <sctp_os_timerwait_mtx>) at pthread_cond_wait.c:188
#1  0x00000055559bd8b4 in sctp_os_timer_stop ()
#2  0x00000055559a149c in sctp_timer_stop ()
#3  0x00000055559be350 in ?? ()
#4  0x00000055559d08a0 in sctp_handle_sack ()
#5  0x00000055559532ac in ?? ()
#6  0x0000005555954890 in sctp_common_input_processing ()
#7  0x00000055559443a0 in usrsctp_conninput ()
#8  0x00000055557e80dc in ConsumeSCTPRawData(unsigned char const*, unsigned long) ()

About this issue

Original URL
State: closed
Created 5 years ago
Reactions: 2
Comments: 57 (53 by maintainers)

Commits related to this issue

-- fix for https://github.com/sctplab/usrsctp/issues/325 — committed to ubiquiti/ubnt-usrsctp by deleted user 5 years ago
Fix a deadlock in usrsctplib. See https://github.com/sctplab/usrsctp/issues/325 The fix is copied from https://github.com/ubiquiti/ubnt-usrsctp/pull/3 — committed to 4shome/webrtc by chao-camect 5 years ago
Fix a deadlock in usrsctp_finish The deadlock occurs when it's running on 2 virtual cores ubuntu machines with some very specific sequences. References: * Discussions on usrsctp, https://github.com/... — committed to awslabs/amazon-kinesis-video-streams-webrtc-sdk-c by lherman-cs 4 years ago
Fix a deadlock in usrsctp_finish The deadlock occurs when it's running on 2 virtual cores ubuntu machines with some very specific sequences. References: * Discussions on usrsctp, https://github.com/... — committed to awslabs/amazon-kinesis-video-streams-webrtc-sdk-c by lherman-cs 4 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Bug: None Change-Id: If3c1ff7... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce a deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5e... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy test. The reliability test is intended to detec an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Mark constructor as explicit. Fixed d... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy test. The reliability test is intended to detec an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Mark constructor as explicit. Fixed d... — committed to dangxiwang/webrtc by mstyura 5 years ago
Added usrsctp reliablitiy stress test. The reliability test is intended to reproduce an deadlock issue in usrsctp known as https://github.com/sctplab/usrsctp/issues/325. Change-Id: If3c1ff70f41dcc5... — committed to dangxiwang/webrtc by mstyura 5 years ago
Update patch set 4 Patch Set 4: Hello Seth Hampson, This CL adds usrsctp stress test which triggers deadlock known as https://github.com/sctplab/usrsctp/issues/325. As you've mentioned here (https:... — committed to dangxiwang/webrtc by deleted user 5 years ago
Update patch set 4 Patch Set 4: > Patch Set 4: > > Hello Seth Hampson, > > This CL adds usrsctp stress test which triggers deadlock known as https://github.com/sctplab/usrsctp/issues/325. As you'v... — committed to dangxiwang/webrtc by deleted user 5 years ago
Update patch set 4 Patch Set 4: > Patch Set 4: > > > Patch Set 4: > > > > Hello Seth Hampson, > > > > This CL adds usrsctp stress test which triggers deadlock known as https://github.com/sctplab/... — committed to dangxiwang/webrtc by deleted user 5 years ago

Most upvoted comments

Is this being worked on? We would like to roll in a newer version of usrsctp into Chromium & WebRTC, but are blocked on this.

shampson on Aug 30, 2019

@peterlei said in an email exchange with some Google folks that he will be reviewing #397 during the Thanksgiving timeframe.

tuexen on Nov 27, 2019

@tuexen, @peterlei I’m really sorry for bothering. Is there any updates on this? Maybe some offline communication happen between both of you? I’m asking because externally it looks like Peter is not active on github at all according to his profile page, but maybe communication happen in different place… Thanks a lot in advance for answering!

mstyura on Nov 14, 2019

@shiretu Does https://github.com/sctplab/usrsctp/pull/397 fix the issue for you? @peterlei What do you think about https://github.com/sctplab/usrsctp/pull/397?

tuexen on Nov 4, 2019

@taylor-b I’ve successfully ran to completion stress test (18 times in total) against usrsctp version from #417. I haven’t observed deadlock (problematic conditional variable were removed). I haven’t observed other crashes. It seems that even #405 (maybe I was unlucky this time) was fixed by #417. Thank you very much for your work, Taylor!

mstyura on Jan 14, 2020

@mstyura Sure, here’s a possible sequence of events:

sctp_remove_net called.
Even after stopping all timers for that net, reference count is nonzero because a currently executing callout (or something else) holds a reference.
Before that reference is released, a new timer is started.
Reference released. But because the timer holds a reference, the reference count is still non-zero and nothing happens until the timer fires. Presumably this could take a long time in some cases, depending on the timer.

I guess another solution would be to have separate reference counts to be used by the timers versus everything else, so that at step 4 releasing the last “normal” reference would reattempt stopping the timers. Seems potentially error prone though.

taylor-b on Jan 10, 2020

Sorry for the delay. Ran into a small wrinkle but I think I have something that works, just need to do some more testing.

taylor-b on Jan 2, 2020

@taylor-b Your considerations make sense to me. Please go ahead and provide a PR. This is very appreciated.

tuexen on Dec 18, 2019

@mstyura @tuexen

I just finished getting myself mostly up to speed on this issue. (For those who don’t know me, I used to work on WebRTC/usrsctp integration)

I may be overlooking something, but I don’t quite see how #397 fixes the problem. I have no doubt that switching to the heap has benefits, but it doesn’t seem to make any difference with regards to the race condition/deadlock. As far as I can tell, the “use after free” race condition still exists because SCTP_OS_TIMER_STOP_DRAIN isn’t called everywhere a timer is about to be freed (specifically the association timers), and the deadlock still exists because sctp_os_timer_stop blocks while holding locks that the callback is in all likelihood trying to acquire.

My first thought after understanding the problem was “this seems like a job for reference counting”. When deleting something with timers, we can postpone the deletion if a callback has a reference to it, relying on the callback to finish the deletion when it acquires the relevant lock and sees that the object is marked for deletion. I’ve used this approach and seen it used in very similar situations.

And to my pleasant surprise, I found that all of the objects that have timers are already reference counted! A separate callback is used for the postponed deletion, which is also a pretty common approach. The problem is that the reference count is incremented too late when it comes to to callbacks. The reference count should be incremented when the timer is enqueued, while still holding the relevant lock; by the time it’s taken off the queue it could already be too late.

So as far as I can tell, if the reference counting is fixed things should “just work”. Would you like me to take a stab at it? I think it would be a much smaller/easier to review change.

taylor-b on Dec 17, 2019

@nxrighthere: our installs are linux only. To make it happen, try to disconnect abruptly on high traffic connection, where lots of ACKs are in flight. It also helps if you reduce the sendv amounts to say… 1024 to make even more ACKs

shiretu on Sep 3, 2019