esqlite: Occasional VM crash due to race condition between deferred prepare and garbage collection
This is admittedly an unlikely condition in real-world usage, but I have observed an occasional BEAM crash in the esqlite NIF code when the following sequence of events occurs in quick succession:
- A database is created using
esqlite:open/4. - A statement is prepared using
esqlite:prepare/4. - The database resource gets garbage collected.
I added some debug logging to the native code which suggests the above sequence can (though doesn’t always) expand into the following sequence:
- (Scheduler thread) Database created. It creates the command-processing thread.
- (Scheduler thread)
esqlite_prepareis called. It queues up acmd_prepareevent. - (Scheduler thread)
destruct_esqlite_connectionis called. It queues up acmd_stopevent and waits for the command-processing thread to stop. - (Command processor thread) Receives the
cmd_prepareevent and callsdo_prepare. - (Command processor thread)
do_preparecallsenif_keep_resourceon the database connection resources and subsequently returns a reference to a statement resource. - (Command processor thread) Processes
cmd_stopevent and exits. - (Scheduler thread)
destruct_esqlite_connectionis unblocked and finishes tearing down data structures. - (Scheduler thread)
destruct_esqlite_statementgets called. It callsenif_release_resourceon the database resource. - (Scheduler thread)
destruct_esqlite_connectiongets called a second time for the same resource cited in step #3 above. The data structures were released at step 7 so the pointers are now invalid. Crash with an assertion failure or segfault.
Here’s an example from a recent test run:
esqlite_start
conn = 0000000014902e60
queue = 0000000014d800b0
esqlite_prepare called on conn 0000000014902e60
do_prepare
conn = 0000000014902e60
sql = PRAGMA foreign_keys
destruct_esqlite_connection
conn = 0000000014902e60
do_prepare calling keep_resource on conn 0000000014902e60
destruct_esqlite_statement calling release_resource on conn 0000000014902e60
destruct_esqlite_connection
conn = 0000000014902e60
Assertion failed: (queue->message == NULL && "Attempting to send multiple messages."), function queue_send, file c_src/queue.c, line 173.
Abort trap: 6
Note that segfaults and other assertions have been observed as well, presumably because the precise failure mode can vary depending on how the data structures have been altered since being released.
Also: Line numbers may be slightly different from current master branch because I’ve added instrumentation.
The crash happens about 20-30% of the time if I run the test suite for https://github.com/scouten/sqlite_ecto2/pull/29. I’m attempting to derive a simpler test case for it, but haven’t succeeded in stripping it down yet. If I do, I’ll add that as a comment here.
I may also attempt a fix, but NIFs are relatively new to me.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 25 (25 by maintainers)
Commits related to this issue
- Use latest esqlite in an effort to help debug https://github.com/mmzeeman/esqlite/issues/33. Still seeing occasional crashes with this version. — committed to elixir-sqlite/sqlite_ecto2 by scouten 7 years ago
- Use latest esqlite in an effort to help debug (#41) * Use latest esqlite in an effort to help debug https://github.com/mmzeeman/esqlite/issues/33. Still seeing occasional crashes with this versio... — committed to elixir-sqlite/sqlite_ecto2 by scouten 7 years ago
Still seeing a crash, unfortunately. I think this may be a different issue that was unmasked by the previous fix.
I believe there’s a problem with the call to
enif_thread_joinat line 220 of esqlite3_nif.c. I’m sort of baffled by this finding; I would have expected that call to be battle-tested and robust, but I’ll show you how I got to that conclusion.I built a stripped-down version of my test harness to demonstrate. For purposes of this comment, the changes I made at https://github.com/scouten/crash_esqlite_case/commit/d800062eaf7455cacd808d513c6c151e3c2d65e7?w=1 (sorry for whitespace noise; my editor strips trailing space by default and I didn’t catch it) will be important. You may find those useful in debugging. Basically I’ve added console output to all of the calls relevant to the
connandqueuedata structures so I can observe how the two threads interact.(If you don’t mind installing Elixir 1.3.x or later, the
try-to-crash.shscript at the root of that repo is pretty effective. It re-runs the test script 100 times, but almost always crashes within the first 10 iterations on my machine.)What follows is a transcript of a test run on my machine that crashed. There were many other threads involved but I’ve stripped this down to the calls relevant to the thread pair that crashed:
At point A, I would have expected to see
esqlite_connection_run conn = 000000001e4077b0 queue = 000000001e8800b0 command thread EXITbut it didn’t happen. Note that I do see that message, as expected, in test runs that do not crash. I also see it in the crashing test runs, but not for the thread pair that crashes.At point B, it appears that the command processing thread apparently continued to run and called
queue_popwith a now-dangling pointer and crashed.From this report I believe that
enif_thread_joinincorrectly returns before the command-processing thread has actually terminated. In the documentation forerl_drv_thread_join(for whichenif_thread_joinis an alias), I do note the following warning …… but I don’t have any evidence that there is an extra
joincall happening.FWIW, my system information:
Thanks for your attention so far and hope all of this information helps!
I installed elixer and can now see crashes. I’ve upgraded sqlite, and inspecting core dumps. It almost looks like it is an issue inside sqlite itself.
/cc @fishcakez @tchoutri: If you know of anyone who could help propose a fix, this would be a big help to the sqlite_ecto2 effort (on which I’ve made quite a bit of progress lately).
Credit to http://erlang.org/pipermail/erlang-questions/2011-February/056282.html for suggestions on how this scenario might happen.