faiss: Faiss runs very slowly on M1 Mac

Summary

Running inference on a saved index is painfully slow on an M1 Pro (10-core CPU, 16-core GPU). The index is about 3.4 GB in size, and a query takes 1.5 seconds on the CPU backend on Colab but >20 minutes on the M1 CPU. What could be the reason for such slow performance?

Platform

OS: macOS 12.4
Faiss version: 1.7.2
Installed from: compiled from source following install.md, and this issue

Faiss compilation options:

LDFLAGS="-L/opt/homebrew/opt/llvm/lib" CPPFLAGS="-I/opt/homebrew/opt/llvm/include" CXX=/opt/homebrew/opt/llvm/bin/clang++ CC=/opt/homebrew/opt/llvm/bin/clang cmake -DFAISS_ENABLE_GPU=OFF -B build .

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

The code that I’m running is as follows

import time

import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer

# Training Index
df = pd.read_csv('abcnews-data-text.csv')
data = df.headline_text.to_list()

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
encoded_data = model.encode(data)

index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(data))))
faiss.write_index(index, 'abc_news')

# Inference 

def search(query):
    t = time.time()
    query_vector = model.encode([query])
    k = 5
    top_k = index.search(query_vector, k)
    print('totaltime: {}'.format(time.time() - t))
    return [data[_id] for _id in top_k[1].tolist()[0]]

index = faiss.read_index('abc_news')
query = str(input())
results = search(query)
print('results:')
for result in results:
    print('\t', result)

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

@wx257osn2 Yes, OpenBLAS does give a good speed up. Thank you !

@SupreethRao99

-- Found BLAS: /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/Accelerate.framework

Ah, that explains it. According to this and this, Apple's Accelerate framework on M1 runs on the AMX coprocessor. The coprocessor is good at power efficiency, but not at run-time speed, especially in multi-threaded execution. Could you try OpenBLAS built with OpenMP enabled? I'm not sure that OpenBLAS will help, but it may, with an appropriate thread count.
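One quick way to check which BLAS implementation NumPy was linked against (and whether single-precision matmul performance looks reasonable) is a small benchmark; this is only a rough diagnostic sketch, not part of the original thread:

```python
import time

import numpy as np

# Print NumPy's build configuration; look for 'accelerate'
# vs 'openblas' in the BLAS/LAPACK sections of the output.
np.show_config()

# Time a single-precision matmul of roughly the shape that
# faiss's flat inner-product search performs internally.
a = np.random.rand(1000, 768).astype('float32')
b = np.random.rand(768, 1000).astype('float32')

t = time.time()
c = a @ b
print('sgemm time: {:.4f}s'.format(time.time() - t))
```

The thread count used by OpenBLAS/OpenMP can then be varied by setting the `OMP_NUM_THREADS` (or `OPENBLAS_NUM_THREADS`) environment variable before starting Python, to see whether multi-threading is helping or hurting.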


Furthermore, are there plans to support GPU acceleration on M1 processors?

I’m not a Meta employee working on faiss, nor a contributor to the GPU implementation, so the following is just my estimate: I think M1 GPU support is currently not planned, and it would require a lot of new implementation work even if it were, because faiss implements its GPGPU code in CUDA, and CUDA cannot run on the M1 GPU. It would be a hard road, but the faiss team would probably welcome a contribution if you implemented it.

Thanks @wx257osn2 for the overview.

To summarize, the issues for us to support alternative hardware are:

  • we need to be able to test the support with CircleCI to track regressions, i.e. the appropriate hardware must be available in CircleCI

  • the stages of support are (1) compiling, (2) passing tests, and (3) optimizing. For step (3), unfortunately, due to hardware and compiler specificities, it is not obvious that the speed of the hardware accelerator is competitive with existing accelerators; sometimes it is even slower than the CPU.

  • we prioritize the hardware we work on ourselves, which is currently NVIDIA GPUs.

  • and finally, we already have trouble maintaining the precompiled packages on the set of platforms we support…

So if anyone is willing to take ownership of other hardware accelerators, we’d be very happy to collaborate 😉

@wx257osn2 thank you. I tried the approach that you suggested by adding -DCMAKE_BUILD_TYPE=Release when building FAISS, but inference is still taking >10 mins.

Indeed, the speed of flat search is dominated by the BLAS sgemm. The cmake logs should indicate which BLAS library is being used.
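For reference, a flat inner-product search over N vectors is essentially one sgemm followed by a top-k selection, so the expected cost can be estimated without faiss at all. A minimal NumPy emulation (the sizes here are illustrative, not the actual index size):

```python
import time

import numpy as np

d, n, k = 768, 100000, 5                       # illustrative sizes
xb = np.random.rand(n, d).astype('float32')    # stand-in for the index vectors
xq = np.random.rand(1, d).astype('float32')    # one query vector

t = time.time()
scores = xq @ xb.T                             # the sgemm that dominates IndexFlatIP.search
top_k = np.argpartition(-scores[0], k)[:k]     # indices of the k largest scores, unordered
top_k = top_k[np.argsort(-scores[0][top_k])]   # order the k best by descending score
print('flat IP search time: {:.4f}s'.format(time.time() - t))
```

If this matmul is already slow at realistic sizes, the regression is in the BLAS library rather than in faiss itself.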

@SupreethRao99 Thanks for trying. Hmm… which BLAS library did you install? It seems that IndexFlatIP calls into it.