annoy: Bus error in get_nns_by_vector

We are seeing bus errors in production from time to time when calling get_nns_by_vector. The process crashes with SIGBUS. We are stuck because we can’t answer the following questions:

  1. We are not able to reproduce it in a development environment because we have no idea what could be causing it. Are there any obvious scenarios that would explain the SIGBUS?
  2. The core dumps from production don’t help us either because we can’t get gdb to use the correct symbols and show us function names etc. Is there a way to get that to work without running a debug version of Python in production?
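
Regarding question 2, one option that does not need a debug build of Python is the standard library's faulthandler module, which can write the Python-level traceback when the process receives SIGBUS. The sketch below is only an illustration: the helper name enable_crash_traceback and the log path are hypothetical, and the output shows Python frames only, not the C++ frames inside annoy.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Hypothetical helper: dump Python tracebacks to log_path on fatal signals.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL. The returned file object must stay open for the lifetime of the
    process, because faulthandler writes to its file descriptor at crash time.
    """
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this once at startup in each of the two worker processes, with a separate log path per process, should at least show which Python call was active when the signal arrived.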

Some more context:

  • We are running two processes, both loading the same index files from disk into memory.
  • After the initial loading at startup, one of them becomes the “query” process which handles incoming requests by searching the loaded indices.
  • The other becomes the “fetch” process which monitors the file system for changes to index files and reloads any changed index.
  • Once the fetch process finishes reloading an index, the roles reverse and it becomes the “query” process while the other becomes the “fetch” process.
  • These two processes are started from a main process that runs Flask, which is in turn started by uWSGI inside a Docker container. uWSGI is configured to use a single process and a single thread.
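
One scenario that would fit this setup, stated purely as a hypothesis and not confirmed anywhere in this thread: annoy's load() memory-maps the index file, and reading a mapped page that no longer exists in the file (for example because the file was truncated or rewritten in place while still mapped) raises SIGBUS. If whatever produces the index files overwrites them in place, staging the new file and renaming it over the old one avoids that, because the existing mapping keeps pointing at the old inode. A minimal sketch, assuming Linux and a hypothetical helper name publish_index:

```python
import os
import shutil
import tempfile


def publish_index(new_index_path, live_path):
    """Hypothetical helper: install new_index_path at live_path atomically.

    Readers that already have the old file open or memory-mapped keep seeing
    the old contents; the old inode is freed once the last user releases it.
    """
    live_dir = os.path.dirname(os.path.abspath(live_path))
    # Stage in the same directory so the final rename stays on one filesystem.
    fd, staged = tempfile.mkstemp(dir=live_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(new_index_path, staged)
        # os.replace is an atomic rename on POSIX; live_path is never partial.
        os.replace(staged, live_path)
    except BaseException:
        if os.path.exists(staged):
            os.unlink(staged)
        raise
```

With this pattern the "fetch" process would see a new file appear in one step, and the "query" process could keep searching its old mapping until the roles swap.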

At this point we are running out of ideas and any pointers on what to look at would be greatly appreciated.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 16 (1 by maintainers)

Most upvoted comments

See #339 – it turns out the fix is a one-liner.