sled: Transactions are deadlocking at insert_inner
Transactions are deadlocking inside of `commit` and never returning in my application. LLDB shows the thread as stuck on this line:

```rust
while self.tree.insert_inner(k, v_opt.clone(), &mut guard)?.is_err()
{
}
```
Source: https://github.com/spacejam/sled/blob/master/src/transaction.rs#L364
- expected result: I expect `TransactionTree::commit` to always return a result.
- actual result: `TransactionTree::commit` intermittently blocks until I kill the application.
- sled version: 0.34.2 + some small error handling experiments: https://github.com/D1plo1d/sled/tree/53d0e30a749f27fb808101730f1794a5f85b6216
- rustc version: rustc 1.44.1 (c7087fe00 2020-06-17)
- operating system: Ubuntu 20.04
- minimal code sample that helps to reproduce the issue: Intermittent issue. TBH I could use a hand figuring out how to reliably reproduce it.
- logs, panic messages, stack traces: Not sure what would be helpful here (please let me know!). I added a couple of `trace!` macros to `commit` and observed that the commit started but never finished.
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (8 by maintainers)
Commits related to this issue
- Merge pull request #1161 from soro/deadlock_fix fix destructor deadlock from issue #1152 — committed to spacejam/sled by spacejam 4 years ago
Oh no, the destructor that blocks is the subscription destructor in sled, not the receiver destructor. The receiver destructor takes the lock too, but in a way that cooperates correctly with the writing side. There is no issue in stdlib afaict. Also, I would leave this issue open until testing confirms that my change fixes this problem.
Ok, so I think what happens here is a deadlock in the sled code. The destructor of a subscription cannot proceed while the sender is blocked on sending, because the write lock cannot be acquired while a read lock is being held, which is going to be the case here. If the receiver is no longer processing any input and the destructor of the subscription has been called, we are at an impasse: the `Receiver` instance will only be dropped after the destructor of the subscription has run, which is never, so the sender cannot notice the inactivity. The fix would be to run the destructor of the `Receiver` explicitly before taking the lock, i.e. before https://github.com/spacejam/sled/blob/50982abb6ebfce5092ac94903fed8819847ae61f/src/subscriber.rs#L99
Edit: submitted https://github.com/spacejam/sled/pull/1161 to try to fix the problem
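To make the failure mode concrete, here is a minimal, self-contained sketch of the shape of the deadlock described above. This is not sled's actual code: the `Subscriber` struct, its fields, and the lock layout are invented for illustration. Running it hangs forever by design.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

// Invented stand-in for sled's subscriber machinery, just to show the shape.
struct Subscriber {
    senders: Arc<RwLock<Vec<SyncSender<u32>>>>,
    #[allow(dead_code)]
    rx: Receiver<u32>, // dropped only AFTER the Drop::drop body below returns
}

impl Drop for Subscriber {
    fn drop(&mut self) {
        // Blocks forever: the write lock can't be acquired while the writer
        // thread holds the read lock, and the writer never unblocks because
        // `self.rx` (whose drop would disconnect the channel) is dropped
        // only after this function returns.
        self.senders.write().unwrap().clear();
    }
}

fn main() {
    let (tx, rx) = sync_channel(1); // tiny bound so the buffer fills instantly
    let senders = Arc::new(RwLock::new(vec![tx]));
    let subscriber = Subscriber { senders: Arc::clone(&senders), rx };

    let writer = thread::spawn(move || {
        let senders = senders.read().unwrap(); // read lock held across sends
        senders[0].send(1).unwrap(); // fills the buffer
        senders[0].send(2).unwrap(); // blocks: buffer full, receiver still alive
    });

    thread::sleep(Duration::from_millis(100)); // let the writer block first
    drop(subscriber); // Drop::drop blocks on the write lock: deadlock
    writer.join().unwrap(); // never reached
}
```

As described above, dropping the `Receiver` explicitly before attempting the write lock breaks the cycle: the blocked send disconnects, errors out, and releases the read lock.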
Ok, I think I've reproduced it by adding 5 subscriber threads to the `concurrent_tree_transactions` test. Here are the portions of the thread stacks that have `sled::` as a prefix:
All subscribing threads have terminated, but thread 23 is blocked trying to send data to one of the already-terminated receiver threads.
So, thread 23 is the culprit: it's trying to send data into a `SyncSender` that has been filled up with 1024 items. It is interesting to me that the call to `SyncSender::send` is not erroring out, because the corresponding `mpsc::Receiver` has been dropped, and I would expect the send to error out instantly in that case.
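For comparison, here is a minimal sketch of the stdlib behavior being expected here: a `SyncSender::send` that is blocked on a full channel does return an error once the `Receiver` is genuinely dropped. That this did not happen is what pointed at the `Receiver` never actually being dropped, matching the destructor analysis above.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = sync_channel::<u32>(1);
    tx.send(1).unwrap(); // fills the one-slot buffer

    // A second send would block this thread, so run it elsewhere.
    let sender = thread::spawn(move || {
        // Blocks while the buffer is full, then returns Err(SendError)
        // as soon as the Receiver is dropped (channel disconnected).
        tx.send(2)
    });

    thread::sleep(Duration::from_millis(100)); // let the send block first
    drop(rx); // disconnect: the blocked send wakes up with an error

    assert!(sender.join().unwrap().is_err());
}
```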
The band-aid solution is to make the bounded mpsc channel into an unbounded one. But I think it's important to figure out why this call to `SyncSender::send` is not erroring out despite its backing `Receiver` having been dropped. And I don't really like the idea of silently letting the system blow up memory usage when there is a slow subscriber.
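For concreteness, the band-aid would be a swap along these lines (both channel flavors are from `std::sync::mpsc`; the 1024 bound is the one mentioned above, and the `u64` payload is a placeholder):

```rust
use std::sync::mpsc::{channel, sync_channel};

fn main() {
    // Bounded (current design): once 1024 events are queued, send() blocks
    // until the subscriber catches up, providing backpressure.
    let (bounded_tx, _bounded_rx) = sync_channel::<u64>(1024);
    bounded_tx.send(1).unwrap();

    // Unbounded band-aid: send() never blocks, but a slow subscriber lets
    // the queue (and its memory usage) grow without limit.
    let (unbounded_tx, _unbounded_rx) = channel::<u64>();
    unbounded_tx.send(1).unwrap();
}
```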
Ok, I think the fastest way to get more info might be to add some subscription threads to some of the intense concurrent transaction tests, hopefully recreating a much more pessimistic and gnarly "bug microwave" that causes the bug to pop out in a few seconds instead of a week. I'll write that now, throw it on a test server, and see what it comes up with.
@D1plo1d thanks for the additional info! That should help to track this down quickly 😃