usrsctp: Possible deadlock/infinite wait

Hi all,

We are running https://github.com/sctplab/usrsctp/commit/bc6a2d19ebc57cbae6fc1f118b64c15d29c5a32a currently. At some point, while doing ingest, we get into a deadlock/infinite wait in this function:

sctp_os_timer_stop

Here is the stack trace (I have eliminated the frames originating into our side of the software and only left ConsumeSCTPRawData which belongs to us):

#0  0x0000007f9482733c in __pthread_cond_wait (cond=0x5555b52478, mutex=0x5555b52e30 <sctp_os_timerwait_mtx>) at pthread_cond_wait.c:188
#1  0x00000055559bd8b4 in sctp_os_timer_stop ()
#2  0x00000055559a149c in sctp_timer_stop ()
#3  0x00000055559be350 in ?? ()
#4  0x00000055559d08a0 in sctp_handle_sack ()
#5  0x00000055559532ac in ?? ()
#6  0x0000005555954890 in sctp_common_input_processing ()
#7  0x00000055559443a0 in usrsctp_conninput ()
#8  0x00000055557e80dc in ConsumeSCTPRawData(unsigned char const*, unsigned long) ()

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 57 (53 by maintainers)

Commits related to this issue

Most upvoted comments

Is this being worked on? We would like to roll in a newer version of usrsctp into Chromium & WebRTC, but are blocked on this.

@peterlei said in an email exchange with some Google folks that he will be reviewing #397 during the Thanksgiving timeframe.

@tuexen, @peterlei I’m really sorry for bothering. Is there any updates on this? Maybe some offline communication happen between both of you? I’m asking because externally it looks like Peter is not active on github at all according to his profile page, but maybe communication happen in different place… Thanks a lot in advance for answering!

@taylor-b I’ve successfully ran to completion stress test (18 times in total) against usrsctp version from #417. I haven’t observed deadlock (problematic conditional variable were removed). I haven’t observed other crashes. It seems that even #405 (maybe I was unlucky this time) was fixed by #417. Thank you very much for your work, Taylor!

@mstyura Sure, here’s a possible sequence of events:

  1. sctp_remove_net called.
  2. Even after stopping all timers for that net, reference count is nonzero because a currently executing callout (or something else) holds a reference.
  3. Before that reference is released, a new timer is started.
  4. Reference released. But because the timer holds a reference, the reference count is still non-zero and nothing happens until the timer fires. Presumably this could take a long time in some cases, depending on the timer.

I guess another solution would be to have separate reference counts to be used by the timers versus everything else, so that at step 4 releasing the last “normal” reference would reattempt stopping the timers. Seems potentially error prone though.

Sorry for the delay. Ran into a small wrinkle but I think I have something that works, just need to do some more testing.

@taylor-b Your considerations make sense to me. Please go ahead and provide a PR. This is very appreciated.

@mstyura @tuexen

I just finished getting myself mostly up to speed on this issue. (For those who don’t know me, I used to work on WebRTC/usrsctp integration)

I may be overlooking something, but I don’t quite see how #397 fixes the problem. I have no doubt that switching to the heap has benefits, but it doesn’t seem to make any difference with regards to the race condition/deadlock. As far as I can tell, the “use after free” race condition still exists because SCTP_OS_TIMER_STOP_DRAIN isn’t called everywhere a timer is about to be freed (specifically the association timers), and the deadlock still exists because sctp_os_timer_stop blocks while holding locks that the callback is in all likelihood trying to acquire.

My first thought after understanding the problem was “this seems like a job for reference counting”. When deleting something with timers, we can postpone the deletion if a callback has a reference to it, relying on the callback to finish the deletion when it acquires the relevant lock and sees that the object is marked for deletion. I’ve used this approach and seen it used in very similar situations.

And to my pleasant surprise, I found that all of the objects that have timers are already reference counted! A separate callback is used for the postponed deletion, which is also a pretty common approach. The problem is that the reference count is incremented too late when it comes to to callbacks. The reference count should be incremented when the timer is enqueued, while still holding the relevant lock; by the time it’s taken off the queue it could already be too late.

So as far as I can tell, if the reference counting is fixed things should “just work”. Would you like me to take a stab at it? I think it would be a much smaller/easier to review change.

@nxrighthere: our installs are linux only. To make it happen, try to disconnect abruptly on high traffic connection, where lots of ACKs are in flight. It also helps if you reduce the sendv amounts to say… 1024 to make even more ACKs