faiss: Faiss runs very slowly on M1 Mac
Summary
Running inference on a saved index is painfully slow on an M1 Pro (10-core CPU, 16-core GPU). The index is about 3.4 GB and a search takes 1.5 seconds on the CPU backend on Colab, but >20 minutes on the M1 CPU. What would be the reason for such slow performance?
Platform
OS: macOS 12.4
Faiss version: 1.7.2
Installed from: compiled from source following install.md and this issue
Faiss compilation options:
LDFLAGS="-L/opt/homebrew/opt/llvm/lib" CPPFLAGS="-I/opt/homebrew/opt/llvm/include" CXX=/opt/homebrew/opt/llvm/bin/clang++ CC=/opt/homebrew/opt/llvm/bin/clang cmake -DFAISS_ENABLE_GPU=OFF -B build .
Running on:
- [x] CPU
- [ ] GPU
Interface:
- [ ] C++
- [x] Python
Reproduction instructions
The code that I’m running is as follows:
import time
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer

# Training: encode the headlines and build a flat inner-product index
df = pd.read_csv('abcnews-data-text.csv')
data = df.headline_text.to_list()
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
encoded_data = model.encode(data)
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(data))))
faiss.write_index(index, 'abc_news')

# Inference
def search(query):
    t = time.time()
    query_vector = model.encode([query])
    k = 5
    top_k = index.search(query_vector, k)
    print('totaltime: {}'.format(time.time() - t))
    return [data[_id] for _id in top_k[1].tolist()[0]]

index = faiss.read_index('abc_news')
query = str(input())
results = search(query)
print('results :')
for result in results:
    print('\t', result)
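Worth noting when reading the timings: totaltime in search() includes model.encode(), so transformer inference and the faiss lookup are measured together. A minimal sketch for timing the faiss search in isolation, using a random float32 vector as a stand-in for an encoded query:

import time
import numpy as np
import faiss

index = faiss.read_index('abc_news')

# Stand-in query: a random 768-dim float32 vector, so the measurement
# excludes SentenceTransformer entirely (illustrative assumption).
query_vector = np.random.rand(1, 768).astype('float32')

t = time.time()
distances, ids = index.search(query_vector, 5)
print('index.search only: {:.3f}s'.format(time.time() - t))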
Comments
@wx257osn2 Yes, OpenBLAS does give a good speed-up. Thank you!
@SupreethRao99
Ah, that may be the cause. According to this and this, Apple’s Accelerate framework on M1 runs on the AMX coprocessor. The coprocessor appears to be good for power efficiency, but not for run-time speed, especially in multi-threaded execution. Could you try an OpenBLAS build with OpenMP enabled? I’m not sure OpenBLAS will help, but it might with an appropriate thread count.
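For reference, one possible way to point the build at Homebrew’s OpenBLAS instead of Accelerate is CMake’s BLA_VENDOR hint, reusing the configure line from above. This is an untested sketch; the OpenBLAS path assumes a default Homebrew install on Apple Silicon:

brew install openblas libomp
LDFLAGS="-L/opt/homebrew/opt/llvm/lib" CPPFLAGS="-I/opt/homebrew/opt/llvm/include" CXX=/opt/homebrew/opt/llvm/bin/clang++ CC=/opt/homebrew/opt/llvm/bin/clang cmake -DFAISS_ENABLE_GPU=OFF -DBLA_VENDOR=OpenBLAS -DCMAKE_PREFIX_PATH=/opt/homebrew/opt/openblas -DCMAKE_BUILD_TYPE=Release -B build .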
I’m not a Meta employee working on faiss, nor a contributor to the GPU implementation, so the following is just my estimation: I think M1 GPU support is currently not planned, and it would require a lot of new implementation work even if it were, because faiss uses CUDA for its GPGPU code and CUDA cannot run on the M1 GPU. That would be the hard way, but the faiss team would probably welcome your contribution if you implemented it.

Thanks @wx257osn2 for the overview.
To summarize, the issues for us to support alternative hardware are:
- we need to be able to test the support with CircleCI to track regressions, i.e. the appropriate hardware must be available in CircleCI;
- the stages for support are (1) compiling, (2) passing tests, and (3) optimizing. For step (3), unfortunately, due to hardware and compiler specifics, it is not obvious that the speed of the hardware accelerator is competitive with existing accelerators; sometimes it is even slower than the CPU;
- we prioritize the hardware we work on ourselves, which is currently NVIDIA GPUs;
- and finally, we already have trouble maintaining the precompiled packages across the set of platforms we support…
So if anyone is willing to take ownership of other hardware accelerators, we’d be very happy to collaborate 😉
@wx257osn2 thank you. I tried the approach you suggested by adding -DCMAKE_BUILD_TYPE=Release when building FAISS, but inference is still taking >10 minutes.

Indeed, the speed of flat search is dominated by the BLAS sgemm. Maybe the cmake logs indicate which BLAS version is used.
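As a sketch of how to check: grep the CMake cache in the build directory used above, and inspect what the compiled Python extension actually links against (the faiss.swigfaiss._swigfaiss module path is an assumption and may differ across faiss versions):

grep -i blas build/CMakeCache.txt
otool -L "$(python -c 'import faiss.swigfaiss as m; print(m._swigfaiss.__file__)')"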
@SupreethRao99 Thanks for trying. Hmm… which BLAS library did you install? It seems that IndexFlatIP calls into it.
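Following up on the thread-count point above, a minimal sketch that sweeps the OpenMP thread count via faiss.omp_set_num_threads and times a flat search at each setting; the random query vector is an illustrative stand-in for a real encoded query:

import time
import numpy as np
import faiss

index = faiss.read_index('abc_news')
q = np.random.rand(1, 768).astype('float32')  # stand-in for an encoded query

# Sweep thread counts: on M1, fewer threads can sometimes beat the default
# if the BLAS backend contends on a shared unit (assumption, not measured).
for n in (1, 2, 4, 8):
    faiss.omp_set_num_threads(n)
    t = time.time()
    index.search(q, 5)
    print('{} threads: {:.3f}s'.format(n, time.time() - t))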