cuml: [BUG] Using latest libfaiss causes RAFT handle constructor to hang

Describe the bug

Using the latest libfaiss package (1.7.0) causes the RAFT handle constructor to hang.

Steps/Code to reproduce bug

  1. Create a fresh Conda environment from conda/environments/cuml_dev_cuda11.0.yml:
conda create -n cuml_dev python=3.7
conda env update -n cuml_dev --file=conda/environments/cuml_dev_cuda11.0.yml
  2. Build libcuml from source:
./build.sh libcuml
  3. Run a single Google Test case, which hangs:
$ ./cpp/build/test/ml --gtest_filter=HandleTest.CreateHandleAndDestroy
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from HandleTest
[ RUN      ] HandleTest.CreateHandleAndDestroy
  4. Downgrade libfaiss to 1.6.3 and rebuild libcuml:
conda install -c rapidsai -c nvidia -c rapidsai-nightly -c conda-forge libfaiss=1.6.3
rm -rf cpp/build/
./build.sh libcuml
  5. The test no longer hangs:
$ ./cpp/build/test/ml --gtest_filter=HandleTest.CreateHandleAndDestroy

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from HandleTest
[ RUN      ] HandleTest.CreateHandleAndDestroy
[       OK ] HandleTest.CreateHandleAndDestroy (437 ms)
[----------] 1 test from HandleTest (438 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (438 ms total)
[  PASSED  ] 1 test.
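
Since a hanging test never returns on its own, the comparison between the two libfaiss versions can be automated by running the test binary under a timeout. The following is a minimal Python sketch, not part of cuML: the helper name `run_with_timeout` and the constant `TEST_CMD` are illustrative, and the command is the one from the steps above.

```python
import subprocess
import sys

# Test command from the reproduction steps, relative to the cuML source root.
TEST_CMD = ["./cpp/build/test/ml",
            "--gtest_filter=HandleTest.CreateHandleAndDestroy"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'passed', 'failed', or 'hung'.

    'hung' means the process was still running after timeout_s seconds;
    subprocess.run kills the child before raising TimeoutExpired.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
    except subprocess.TimeoutExpired:
        return "hung"
    return "passed" if result.returncode == 0 else "failed"
```

Calling `run_with_timeout(TEST_CMD, 60)` from the source root would be expected to return "hung" when the constructor has not returned within the limit, and "passed" once the test completes as in the output above. The 60-second limit is an arbitrary choice; a slow but finite startup would also be reported as "hung" if it exceeds it.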

Environment details:

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Ubuntu 18.04 amd64
  • GPU Model/Driver: Quadro RTX 8000, driver 450.51.06
  • CUDA: 11.0
  • Method of cuDF & cuML install: from source. CMake 3.18.5, GCC 7.5.0, commit hash 8b78fa34c4b2f4a0beb952f74387bd749fdb30e2

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

@hcho3 @viclafargue there is a new libfaiss package now on conda-forge that solves the issue

Closing the issue since this appears to have been fixed. Feel free to re-open it or open another one if something doesn’t work as expected.

Ah, that makes sense. It sounds like faiss is being JIT-compiled for some reason, and the handle test doesn’t touch faiss, so it would pass with 1.6.3. I’ll try to reproduce this on a Turing machine later today to see if it triggers generally on GPUs with compute capability 7.5.
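
To make the JIT hypothesis concrete: a CUDA binary runs natively on a GPU only if it embeds machine code (SASS) built for the same major compute capability and a minor version no higher than the device's. Otherwise the driver has to JIT-compile embedded PTX when the code is loaded, which can be slow for a large library. Below is a minimal Python sketch of that compatibility rule. The function name and the example target lists are hypothetical and are not the architectures the libfaiss packages were actually built for.

```python
def has_native_sass(device_cc, sass_ccs):
    """Return True if any embedded SASS target can run natively on the device.

    device_cc and each entry of sass_ccs are (major, minor) tuples,
    e.g. (7, 5) for a Turing GPU such as the Quadro RTX 8000.
    SASS is binary-compatible only within one major version, and only on
    devices whose minor version is at least that of the build target.
    """
    dev_major, dev_minor = device_cc
    return any(major == dev_major and minor <= dev_minor
               for major, minor in sass_ccs)


# Hypothetical build targets, for illustration only.
turing = (7, 5)
print(has_native_sass(turing, [(6, 0), (7, 0)]))  # a 7.0 target covers 7.5
print(has_native_sass(turing, [(6, 0), (6, 1)]))  # no 7.x target: PTX JIT needed
```

When the function returns False, the library can only run on that device through PTX JIT compilation, provided it ships PTX for a compute capability no higher than the device's.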