RAGatouille: Indexing cannot be completed (on Windows)

I’m testing 01-basic_indexing_and_search.ipynb on a Windows 10 PC, in Cursor IDE, using Python 3.11.6

The cell RAG.index(collection=[full_document], index_name="Miyazaki", max_document_length=180, split_documents=True) had still not completed after almost an hour. The only output shown was:

[Jan 05, 10:46:21] #> Creating directory .ragatouille/colbert\indexes/Miyazaki 
#> Starting...

I restarted the kernel after an hour.

The previous cell, which prints the length of full_document, worked properly.

About this issue

  • State: closed
  • Created 6 months ago
  • Reactions: 1
  • Comments: 29 (11 by maintainers)

Most upvoted comments

For most people, the CUBLAS errors turned out to be caused by faiss being incompatible with the installed drivers. This should be fixed by the new experimental default indexing in 0.0.8, which skips faiss (it does K-means in pure PyTorch) as long as you’re indexing fewer than ~100k documents!

Multiprocessing is no longer enforced for indexing when using no GPU or a single GPU thanks to @Anmol6’s excellent upstream work on https://github.com/stanford-futuredata/ColBERT/pull/290 & propagated by https://github.com/bclavie/RAGatouille/pull/51.

This is likely to fix the indexing problems on Windows (or at least, one of the problems). Please let me know if the latest version of RAGatouille fixes it for you!

Hey, thanks for this @jponline77 – indexing is slow, sadly. Taking a while to create the index is the tradeoff for querying very large corpuses in near-constant time. It can maybe be optimised (that’d require work on the upstream ColBERT repo), but that’s something for the future! I’m working on a feature to do index-free search. It’s not very scalable, at least at the moment (you could query maybe up to 1k documents in >1s on a T4 GPU, and obviously much slower every time you add something), but for smaller corpuses it will make it easy to try it out!


@vanetreg I think (not sure) you could try it out in a standalone script like I mentioned earlier? Wrap it in if __name__ == "__main__":… It’s not ideal for interactivity but it could work! (At least it does on every non-Windows platform I’ve tried.) Anyhow, the Mac Mini is an excellent choice 😄

I was using Windows 11, Cursor, Python 3.10 through WSL… Worked for me. So it may be a problem specific to Windows outside WSL. I gotta say, it would be hard for me to imagine not working in WSL on a Windows machine myself at this point.

@jponline77 I tested it in both VSC and Cursor, with the WSL extension installed in both. Maybe the Windows version (10 / 11) matters?

Yeah, maybe it’s a Windows 10 issue. Just be sure, if you are using WSL, that it’s actually running in WSL. If you are set up to run in WSL, then you should be able to run it directly from the WSL command line without using VSC or Cursor. My experience with WSL is that it runs everything that runs in Ubuntu in much the same way as a standalone Linux system would. So it would surprise me a little if it matters whether you are on Windows 10 or 11. That said, any reason you aren’t interested in upgrading to 11? I’ve now got RAGatouille running on two different systems with Windows 11 and WSL. One was a laptop with a low-end integrated GPU and 16GB of memory. It did take 10 minutes to index a small file, but it worked.
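
One quick way to check whether a notebook kernel or script is really running inside WSL is to look at the kernel release string that Python reports. The snippet below is only a heuristic sketch: it assumes WSL kernels include "microsoft" in their release string, and the helper name running_in_wsl is made up for illustration.

```python
import platform


def running_in_wsl(release=None):
    """Heuristic: WSL kernels usually carry 'microsoft' in their release string."""
    # On native Windows the release is typically just "10" or "11",
    # so the substring check below returns False there.
    if release is None:
        release = platform.uname().release
    return "microsoft" in release.lower()


if __name__ == "__main__":
    print("Running in WSL:", running_in_wsl())
```

Running this in the same kernel that hangs on indexing would show whether Cursor or VSC actually attached to the WSL interpreter or silently fell back to the native Windows one.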

Hey @vanetreg, for your other issue, the partial init – no idea what’s going on there; it seems like something weird happened when initialising nltk?

I’ve tested some things on my end and I can confirm this is due to how ColBERT does multiprocessing, which causes the issue in some environments (seemingly Colab and Windows 10). This will eventually be fixed once the multiprocessing handling is changed upstream but sadly there doesn’t seem to be a good in-notebook workaround on those two platforms at the moment.

If you use RAGatouille in a Python script (making sure to have it inside if __name__ == "__main__":), it should hopefully run fine (though again, not tested on Windows)!
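
For anyone wanting to try this, a standalone script along these lines might look as follows. This is a minimal sketch, not a tested fix for Windows: the file name full_document.txt is a hypothetical place to save the document text from the notebook, and the colbert-ir/colbertv2.0 checkpoint is assumed to be the one the example notebook loads. The index() arguments are the ones from the original report.

```python
from pathlib import Path


def build_index(full_document):
    # Imported inside the function so that importing this module stays cheap.
    from ragatouille import RAGPretrainedModel

    # Assumption: the same checkpoint as the example notebook.
    rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    return rag.index(
        collection=[full_document],
        index_name="Miyazaki",
        max_document_length=180,
        split_documents=True,
    )


def main(document_path=Path("full_document.txt")):
    # full_document.txt is a hypothetical file holding the text to index.
    if not document_path.exists():
        print(f"{document_path} not found; save the document text there first.")
        return 1
    full_document = document_path.read_text(encoding="utf-8")
    # Print whatever index() returns once indexing has finished.
    print(build_index(full_document))
    return 0


if __name__ == "__main__":
    # The guard keeps worker processes that re-import this file
    # from starting the indexing job a second time.
    main()
```

The guard matters because multiprocessing on Windows starts workers by spawning a fresh interpreter that re-imports the main module, so any indexing call left at module level would be triggered again in every worker.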

@bclavie You’re right, the code doesn’t work either in the Python CLI, and seems related to the ColBERT library.

I’ll open a new issue and dig a little bit more.

Hey @timothepearce, thanks for flagging! I believe this is a very separate problem (the multiprocessing in your case runs fine, but there seems to be another problem). Could you create a new issue so I can look into it a bit more? And could you try out the notebooks in examples/ ? I think there might be something wrong with the README, which is (probably) that there aren’t enough documents in the example (which I could fix by adopting a separate logic for n_docs that are far too small).
