esqlite: Occasional VM crash due to race condition between deferred prepare and garbage collection

This is admittedly an unlikely condition in real-world usage, but I have observed an occasional BEAM crash in the esqlite NIF code when the following sequence of events occurs in quick succession:

  1. A database is created using esqlite:open/4.
  2. A statement is prepared using esqlite:prepare/4.
  3. The database resource gets garbage collected.

I added some debug logging to the native code which suggests the above sequence can (though doesn’t always) expand into the following sequence:

  1. (Scheduler thread) Database created. It creates the command-processing thread.
  2. (Scheduler thread) esqlite_prepare is called. It queues up a cmd_prepare event.
  3. (Scheduler thread) destruct_esqlite_connection is called. It queues up a cmd_stop event and waits for the command-processing thread to stop.
  4. (Command processor thread) Receives the cmd_prepare event and calls do_prepare.
  5. (Command processor thread) do_prepare calls enif_keep_resource on the database connection resources and subsequently returns a reference to a statement resource.
  6. (Command processor thread) Processes cmd_stop event and exits.
  7. (Scheduler thread) destruct_esqlite_connection is unblocked and finishes tearing down data structures.
  8. (Scheduler thread) destruct_esqlite_statement gets called. It calls enif_release_resource on the database resource.
  9. (Scheduler thread) destruct_esqlite_connection gets called a second time for the same resource cited in step #3 above. The data structures were released at step 7 so the pointers are now invalid. Crash with an assertion failure or segfault.

Here’s an example from a recent test run:

esqlite_start
  conn = 0000000014902e60
  queue = 0000000014d800b0

esqlite_prepare called on conn 0000000014902e60
do_prepare
  conn = 0000000014902e60
  sql = PRAGMA foreign_keys

destruct_esqlite_connection
  conn = 0000000014902e60
do_prepare calling keep_resource on conn 0000000014902e60
destruct_esqlite_statement calling release_resource on conn 0000000014902e60

destruct_esqlite_connection
  conn = 0000000014902e60
Assertion failed: (queue->message == NULL && "Attempting to send multiple messages."), function queue_send, file c_src/queue.c, line 173.
Abort trap: 6

Note that segfaults and other assertions have been observed as well, presumably because the precise failure mode can vary depending on how the data structures have been altered since being released.

Also: Line numbers may be slightly different from current master branch because I’ve added instrumentation.

The crash happens about 20-30% of the time if I run the test suite for https://github.com/scouten/sqlite_ecto2/pull/29. I’m attempting to derive a simpler test case for it, but haven’t succeeded in stripping it down yet. If I do, I’ll add that as a comment here.

I may also attempt a fix, but NIFs are relatively new to me.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 25 (25 by maintainers)

Commits related to this issue

Most upvoted comments

Still seeing a crash, unfortunately. I think this may be a different issue that was unmasked by the previous fix.

I believe there’s a problem with the call to enif_thread_join at line 220 of esqlite3_nif.c. I’m sort of baffled by this finding; I would have expected that call to be battle-tested and robust, but I’ll show you how I got to that conclusion.

I built a stripped-down version of my test harness to demonstrate. For purposes of this comment, the changes I made at https://github.com/scouten/crash_esqlite_case/commit/d800062eaf7455cacd808d513c6c151e3c2d65e7?w=1 (sorry for whitespace noise; my editor strips trailing space by default and I didn’t catch it) will be important. You may find those useful in debugging. Basically I’ve added console output to all of the calls relevant to the conn and queue data structures so I can observe how the two threads interact.

(If you don’t mind installing Elixir 1.3.x or later, the try-to-crash.sh script at the root of that repo is pretty effective. It re-runs the test script 100 times, but almost always crashes within the first 10 iterations on my machine.)

What follows is a transcript of a test run on my machine that crashed. There were many other threads involved but I’ve stripped this down to the calls relevant to the thread pair that crashed:

esqlite_start conn = 000000001e4077b0 queue = 000000001e8800b0
esqlite_connection_run conn = 000000001e4077b0 queue = 000000001e8800b0 command thread START
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
esqlite_start conn = 000000001e4077b0 queue = 000000001e8800b0 command thread 000000001e8801e0 created
queue_push q = 000000001e8800b0 calling enif_mutex_lock
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
queue_push q = 000000001e8800b0 calling enif_mutex_lock
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
esqlite_prepare calling keep_resource on conn 000000001e4077b0
queue_push q = 000000001e8800b0 calling enif_mutex_lock
do_prepare conn = 000000001e4077b0 sql = PRAGMA foreign_keys
do_prepare conn = 000000001e4077b0 DONE
destruct_esqlite_statement release_resource conn = 000000001e4077b0
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0
queue_push q = 000000001e8800b0 calling enif_mutex_lock
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 waiting for cmd thread 000000001e8801e0 to finish
   # --- point A
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 draining remaining commands
queue_has_item q = 000000001e8800b0 calling enif_mutex_lock
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
queue_has_item q = 000000001e8800b0 calling enif_mutex_lock
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 queue is empty
queue_destroy q = 000000001e8800b0 calling enif_mutex_lock
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 queue destroyed
destruct_esqlite_connection conn = 000000001e4077b0 DONE
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
   # --- point B
../include/internal/ethr_mutex.h:656: Fatal error in ethr_mutex_lock(): Invalid argument (22)
./try-to-crash.sh: line 8: 16231 Abort trap: 6           mix test --no-deps-check

At point A, I would have expected to see esqlite_connection_run conn = 000000001e4077b0 queue = 000000001e8800b0 command thread EXIT but it didn’t happen. Note that I do see that message, as expected, in test runs that do not crash. I also see it in the crashing test runs, but not for the thread pair that crashes.

At point B, it appears that the command processing thread apparently continued to run and called queue_pop with a now-dangling pointer and crashed.

From this report I believe that enif_thread_join incorrectly returns before the command-processing thread has actually terminated. In the documentation for erl_drv_thread_join (for which enif_thread_join is an alias), I do note the following warning …

A thread can only be joined once. The behavior of joining more than once is undefined, an emulator crash is likely.

… but I don’t have any evidence that there is an extra join call happening.

FWIW, my system information:

Erlang/OTP 19 [erts-8.2] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Elixir 1.3.4

Thanks for your attention so far and hope all of this information helps!

I installed elixer and can now see crashes. I’ve upgraded sqlite, and inspecting core dumps. It almost looks like it is an issue inside sqlite itself.

Thread 29:: esqlite_connect
0   libsystem_kernel.dylib        	0x00007fffd203ec86 __psynch_cvwait + 10
1   beam.smp                      	0x00000000154698f6 ethr_cond_wait + 6
2   esqlite3_nif.so               	0x00000001b0cce2c9 esqlite_connection_run + 89 (esqlite3_nif.c:694)

Thread 30 Crashed:: esqlite_connect
0   libsystem_pthread.dylib       	0x00007fffd21254fc pthread_mutex_lock + 0
1   esqlite3_nif.so               	0x00000001b0ce9875 pthreadMutexEnter + 21 (sqlite3.c:23358)
2   esqlite3_nif.so               	0x00000001b0cd0d14 sqlite3_mutex_enter + 36 (sqlite3.c:22809)
3   esqlite3_nif.so               	0x00000001b0cdbfcd columnName + 109 (sqlite3.c:76920)
4   esqlite3_nif.so               	0x00000001b0cdbf54 sqlite3_column_name + 36 (sqlite3.c:76938)
5   esqlite3_nif.so               	0x00000001b0ccef3b do_column_names + 187 (esqlite3_nif.c:581)
6   esqlite3_nif.so               	0x00000001b0cce635 evaluate_command + 341 (esqlite3_nif.c:659)
7   esqlite3_nif.so               	0x00000001b0cce33b esqlite_connection_run + 203 (esqlite3_nif.c:701)

Thread 31:: esqlite_connect
0   esqlite3_nif.so               	0x00000001b0ce4930 setupLookaside + 352 (sqlite3.c:140082)
1   esqlite3_nif.so               	0x00000001b0ce6990 openDatabase + 1552 (sqlite3.c:142436)
2   esqlite3_nif.so               	0x00000001b0ce6376 sqlite3_open + 38 (sqlite3.c:142489)
3   esqlite3_nif.so               	0x00000001b0cce832 do_open + 162 (esqlite3_nif.c:272)
4   esqlite3_nif.so               	0x00000001b0cce537 evaluate_command + 87 (esqlite3_nif.c:645)
5   esqlite3_nif.so               	0x00000001b0cce33b esqlite_connection_run + 203 (esqlite3_nif.c:701)

/cc @fishcakez @tchoutri: If you know of anyone who could help propose a fix, this would be a big help to the sqlite_ecto2 effort (on which I’ve made quite a bit of progress lately).

Credit to http://erlang.org/pipermail/erlang-questions/2011-February/056282.html for suggestions on how this scenario might happen.