sled: Transactions are deadlocking at insert_inner
Transactions are deadlocking inside of `commit` and never returning in my application. LLDB shows the thread as stuck on this line:

```rust
while self.tree.insert_inner(k, v_opt.clone(), &mut guard)?.is_err()
{
}
```
Source: https://github.com/spacejam/sled/blob/master/src/transaction.rs#L364
- expected result: I expect `TransactionTree::commit` to always return a result.
- actual result: `TransactionTree::commit` intermittently blocks until I kill the application.
- sled version: 0.34.2 + some small error handling experiments: https://github.com/D1plo1d/sled/tree/53d0e30a749f27fb808101730f1794a5f85b6216
- rustc version: rustc 1.44.1 (c7087fe00 2020-06-17)
- operating system: Ubuntu 20.04
- minimal code sample that helps to reproduce the issue: Intermittent issue. TBH I could use a hand figuring out how to reliably reproduce it.
- logs, panic messages, stack traces: Not sure what would be helpful here (please let me know!). I added a couple of `trace!` macros to `commit` and observed that the commit started but never finished.
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (8 by maintainers)
Commits related to this issue
- Merge pull request #1161 from soro/deadlock_fix fix destructor deadlock from issue #1152 — committed to spacejam/sled by spacejam 4 years ago
Oh no, the destructor that blocks is the subscription destructor in sled, not the receiver destructor. The receiver destructor takes the lock too, but in a way that cooperates correctly with the writing side. There is no issue in stdlib afaict. Also, I would leave this issue open until testing confirms that my change fixes this problem.
Ok, so I think what happens here is a deadlock in the sled code. The destructor of a subscription cannot proceed while the sender is blocked on sending, because the write lock cannot be acquired while a read lock is being held, which is going to be the case here. If the receiver is no longer processing any input and the destructor of the subscription has been called, we are at an impasse: the `Receiver` instance will only be dropped after the destructor of the subscription has run, which is never, so the sender cannot notice the inactivity. The fix would be to run the destructor of the `Receiver` explicitly before taking the lock, i.e. before https://github.com/spacejam/sled/blob/50982abb6ebfce5092ac94903fed8819847ae61f/src/subscriber.rs#L99
Edit: submitted https://github.com/spacejam/sled/pull/1161 to try to fix the problem
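To make the failure mode concrete, here is a minimal, self-contained sketch of the shape of the deadlock described above. This is not sled's actual code: the `Subscriber` struct, its fields, and the lock layout are invented for illustration. Running it hangs forever by design.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

// Invented stand-in for sled's subscriber machinery, just to show the shape.
struct Subscriber {
    senders: Arc<RwLock<Vec<SyncSender<u32>>>>,
    #[allow(dead_code)]
    rx: Receiver<u32>, // dropped only AFTER the Drop::drop body below returns
}

impl Drop for Subscriber {
    fn drop(&mut self) {
        // Blocks forever: the write lock can't be acquired while the writer
        // thread holds the read lock, and the writer never unblocks because
        // `self.rx` (whose drop would disconnect the channel) is dropped
        // only after this function returns.
        self.senders.write().unwrap().clear();
    }
}

fn main() {
    let (tx, rx) = sync_channel(1); // tiny bound so the buffer fills instantly
    let senders = Arc::new(RwLock::new(vec![tx]));
    let subscriber = Subscriber { senders: Arc::clone(&senders), rx };

    let writer = thread::spawn(move || {
        let senders = senders.read().unwrap(); // read lock held across sends
        senders[0].send(1).unwrap(); // fills the buffer
        senders[0].send(2).unwrap(); // blocks: buffer full, receiver still alive
    });

    thread::sleep(Duration::from_millis(100)); // let the writer block first
    drop(subscriber); // Drop::drop blocks on the write lock: deadlock
    writer.join().unwrap(); // never reached
}
```

As described above, dropping the `Receiver` explicitly before attempting the write lock breaks the cycle: the blocked send disconnects, errors out, and releases the read lock.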
Ok, I think I've reproduced it by adding 5 subscriber threads to the `concurrent_tree_transactions` test. Here are the portions of the thread stacks that have `sled::` as a prefix:
All subscribing threads have terminated, but thread 23 is blocked trying to send data to one of the already-terminated receiver threads.
So, thread 23 is the culprit: it's trying to send data into a `SyncSender` that has been filled up with 1024 items. It is interesting to me that the call to `SyncSender::send` is not erroring out, because the corresponding `mpsc::Receiver` has been dropped, and I would expect the send to error out instantly in that case.
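For comparison, here is a minimal sketch of the stdlib behavior being expected here: a `SyncSender::send` that is blocked on a full channel does return an error once the `Receiver` is genuinely dropped. That this did not happen is what pointed at the `Receiver` never actually being dropped, matching the destructor analysis above.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = sync_channel::<u32>(1);
    tx.send(1).unwrap(); // fills the one-slot buffer

    // A second send would block this thread, so run it elsewhere.
    let sender = thread::spawn(move || {
        // Blocks while the buffer is full, then returns Err(SendError)
        // as soon as the Receiver is dropped (channel disconnected).
        tx.send(2)
    });

    thread::sleep(Duration::from_millis(100)); // let the send block first
    drop(rx); // disconnect: the blocked send wakes up with an error

    assert!(sender.join().unwrap().is_err());
}
```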
The band-aid solution is to make the bounded mpsc channel into an unbounded one. But I think it's important to figure out why this call to `SyncSender::send` is not erroring out despite its backing `Receiver` having been dropped. And I don't really like the idea of silently letting the system blow up memory usage when there is a slow subscriber.
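For concreteness, the band-aid would be a swap along these lines (both channel flavors are from `std::sync::mpsc`; the 1024 bound is the one mentioned above, and the `u64` payload is a placeholder):

```rust
use std::sync::mpsc::{channel, sync_channel};

fn main() {
    // Bounded (current design): once 1024 events are queued, send() blocks
    // until the subscriber catches up, providing backpressure.
    let (bounded_tx, _bounded_rx) = sync_channel::<u64>(1024);
    bounded_tx.send(1).unwrap();

    // Unbounded band-aid: send() never blocks, but a slow subscriber lets
    // the queue (and its memory usage) grow without limit.
    let (unbounded_tx, _unbounded_rx) = channel::<u64>();
    unbounded_tx.send(1).unwrap();
}
```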
Ok, I think the fastest way to get more info might be to add some subscription threads to some of the intense concurrent transaction tests, hopefully recreating a much more pessimistic and gnarly "bug microwave" that causes the bug to pop out in a few seconds instead of a week. I'll write that now, throw it on a test server, and see what it comes up with.
@D1plo1d thanks for the additional info! That should help to track this down quickly 😃