annoy: Bus error in get_nns_by_vector
We intermittently see bus errors in production when calling get_nns_by_vector: the process crashes with SIGBUS. We are stuck on two questions:
- We cannot reproduce the crash in a development environment, and we have no idea what could be causing it. Are there any obvious scenarios that would explain a SIGBUS?
- The core dumps from production don’t help either, because we can’t get gdb to load the correct symbols and show function names. Is there a way to get that to work without running a debug version of Python in production?
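For the second question, one option that does not require a debug build is Python's standard-library `faulthandler` module, which prints the Python-level traceback when the process receives SIGBUS. It does not resolve C symbols inside the extension, so it only narrows down which Python call was active. The sketch below is a minimal illustration: the `query` function is a hypothetical stand-in for the real call, and the signal is sent artificially rather than triggered by annoy.

```python
import signal
import subprocess
import sys

# Code run in a child process so the crash does not take down the caller.
CHILD = """
import faulthandler, os, signal

# Install handlers that dump the Python traceback to stderr on
# SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
faulthandler.enable()

def query():
    # Hypothetical stand-in for the crashing native call:
    # the process sends SIGBUS to itself.
    os.kill(os.getpid(), signal.SIGBUS)

query()
"""

def run_crashing_child():
    """Run the child and return (returncode, captured stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr

if __name__ == "__main__":
    rc, err = run_crashing_child()
    print("child return code:", rc)
    print(err)
```

The same handlers can be enabled without code changes by setting the `PYTHONFAULTHANDLER=1` environment variable or starting the interpreter with `-X faulthandler`.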
Some more context:
- We are running two processes, both loading the same index files from disk into memory.
- After the initial loading at startup, one of them becomes the “query” process which handles incoming requests by searching the loaded indices.
- The other becomes the “fetch” process which monitors the file system for changes to index files and reloads any changed index.
- Once the fetch process finishes reloading an index, the roles reverse and it becomes the “query” process while the other becomes the “fetch” process.
- These two processes are started by a main process that runs Flask, which is in turn started by uWSGI inside a Docker container. uWSGI is configured to use a single process and a single thread.
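Since annoy memory-maps index files on load, one general scenario worth ruling out in a setup like this is an index file being truncated or overwritten in place while another process still has it mapped. Reading a mapped page that now lies beyond the end of the file raises SIGBUS. Whether this is what happens here is an assumption; the sketch below only demonstrates the mechanism with the standard-library `mmap` module on a throwaway file and does not use annoy at all.

```python
import signal
import subprocess
import sys
import tempfile

# Child process: map a file, shrink it the way an in-place overwrite by
# another process would, then read from the stale mapping.
CHILD = """
import mmap, os, sys

path = sys.argv[1]
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

os.truncate(path, 0)   # the file no longer backs the mapped pages
mapped[0]              # touching a page beyond the new end of file
"""

def run_truncation_demo() -> int:
    """Return the child's exit status (negative signal number if killed)."""
    with tempfile.NamedTemporaryFile(suffix=".idx") as tmp:
        tmp.write(b"\0" * 4096 * 4)  # placeholder data, not a real index
        tmp.flush()
        proc = subprocess.run([sys.executable, "-c", CHILD, tmp.name])
        return proc.returncode

if __name__ == "__main__":
    rc = run_truncation_demo()
    print("child return code:", rc, "expected:", -signal.SIGBUS)
```

If this turned out to be the cause, writing the new index under a temporary name and moving it into place with `os.replace()` would leave the old file contents intact for any process that still has them mapped.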
At this point we are running out of ideas and any pointers on what to look at would be greatly appreciated.
About this issue
- State: closed
- Created 6 years ago
- Comments: 16 (1 by maintainers)
See #339 – turns out the fix is a one-liner.