haystack: FAISSDocumentStore: number of documents present in the SQL database does not match the number of embeddings in FAISS

Describe the bug

After creating a FAISSDocumentStore once, I cannot create it again: the second attempt fails with an error saying that the documents in the SQL database do not match the embeddings in FAISS.

Error message

ValueError: The number of documents present in the SQL database does not match the number of embeddings in FAISS. Make sure your FAISS configuration file correctly points to the same database that was used when creating the original index.

Expected behavior: The documents saved in the SQL DB are loaded and a FAISSDocumentStore object is created.

Additional context None

To Reproduce

# Run this script twice; the error only appears on the second run.

import argparse

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import DocumentSearchPipeline


def load_data():
    return [{'content': 'test sent 1'},
            {'content': 'test sent 2'},
            {'content': 'test sent 3'}]


def build_pipeline(configs):
    dicts = load_data()

    document_store = FAISSDocumentStore(vector_dim=configs.hidden_dims, faiss_index_factory_str=configs.faiss_index)
    retriever = EmbeddingRetriever(document_store=document_store,
                                   embedding_model=configs.embedding_model,
                                   model_format=configs.model_format)

    document_store.write_documents(dicts)
    document_store.update_embeddings(retriever)

    return DocumentSearchPipeline(retriever)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # type=int so a value passed on the command line is not left as a string
    parser.add_argument('--hidden_dims', type=int, default=768)
    parser.add_argument('--faiss_index', default='Flat')
    parser.add_argument('--embedding_model', default='distilbert-base-uncased')
    parser.add_argument('--model_format', default='transformers')

    configs = parser.parse_args()

    pipe = build_pipeline(configs)
    print(pipe.run('test sent 1'))

System:

  • OS: MacOS
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 1.0.0
  • DocumentStore: FAISSDocumentStore
  • Reader: X
  • Retriever: EmbeddingRetriever

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

Hey @Taekyoon, FAISSDocumentStore is a little special here. Under the hood it contains a SQL database and a FAISS index. The SQL database is always backed by file storage, so it is persisted automatically. The FAISS index, on the other hand, lives entirely in memory and is never saved to disk automatically. That’s why FAISSDocumentStore has the save() method, which writes the FAISS index to disk. You also have to load the previously stored FAISS index explicitly, either by calling the load() class method or by passing faiss_index_path to the constructor. So you first have to check whether a stored FAISS index exists on disk and then either load the existing FAISSDocumentStore or create a new one, like:

import os

# configs.faiss_index_path: the path that save() wrote the FAISS index to
if os.path.exists(configs.faiss_index_path):
    document_store = FAISSDocumentStore.load(configs.faiss_index_path)
else:
    document_store = FAISSDocumentStore(vector_dim=configs.hidden_dims, faiss_index_factory_str=configs.faiss_index)
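The check above could be wrapped in a small helper. This is only a sketch, not part of Haystack: `load_or_create_store`, its `store_cls` parameter, and the `my_index.faiss` path are hypothetical names, and the helper assumes nothing beyond the load() class method and constructor described in the comment. The class is passed in as an argument so the helper can be tried without Haystack installed.

```python
import os


def load_or_create_store(store_cls, index_path, **create_kwargs):
    """Load a saved store if its index file exists, otherwise build a new one.

    `store_cls` is assumed to expose a `load(index_path)` class method, as
    FAISSDocumentStore does; the helper name itself is hypothetical.
    Returns a (store, was_loaded) tuple.
    """
    if os.path.exists(index_path):
        return store_cls.load(index_path), True
    return store_cls(**create_kwargs), False


# Intended use (requires Haystack, so left commented out):
# from haystack.document_stores import FAISSDocumentStore
# store, was_loaded = load_or_create_store(
#     FAISSDocumentStore, "my_index.faiss",
#     vector_dim=768, faiss_index_factory_str="Flat")
# if not was_loaded:
#     store.write_documents(dicts)
#     store.update_embeddings(retriever)
#     store.save("my_index.faiss")  # persist the in-memory FAISS index
```

Returning `was_loaded` lets the caller skip writing documents and recomputing embeddings when an existing index was loaded, and call save() only after building a fresh one.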

@bogdankostic how do you do that? Creating a FAISS document store also creates a ‘faiss_document_store.db’ file. How can I make it use a different file? TIA
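One possible approach, as a sketch: this assumes the `sql_url` constructor argument of FAISSDocumentStore in Haystack 1.x, which takes a SQLAlchemy-style database URL. The `sqlite_url` helper and the `my_store.db` file name are hypothetical and only illustrate the URL format.

```python
def sqlite_url(db_path):
    """Build a SQLAlchemy-style SQLite URL for a database file (hypothetical helper)."""
    return "sqlite:///" + db_path


# Intended use (requires Haystack, so left commented out):
# from haystack.document_stores import FAISSDocumentStore
# document_store = FAISSDocumentStore(sql_url=sqlite_url("my_store.db"))
```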

Done deepset-ai/haystack-tutorials#38

Sounds good! let me try 😃 Thanks!

Sorry, I couldn’t get a message from the comment above. @tstadel I was thinking of a function or option that removes the SQL file when the FAISSDocumentStore object is removed, so that we don’t have to check whether the DB file exists before recreating the FAISSDocumentStore.
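A manual version of that cleanup might look like the sketch below. `remove_stale_db` is a hypothetical helper, not a Haystack function, and the default file name is the one mentioned earlier in this thread. Note that deleting the database discards every document stored in it.

```python
import os


def remove_stale_db(index_path, db_path="faiss_document_store.db"):
    """Delete the SQLite file when no saved FAISS index accompanies it.

    Hypothetical helper: returns True if the file was deleted, False otherwise.
    Deleting the database discards all documents stored in it.
    """
    if os.path.exists(db_path) and not os.path.exists(index_path):
        os.remove(db_path)
        return True
    return False
```

Calling it before constructing a new FAISSDocumentStore means the new store starts from an empty database whenever there is no saved index to load.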