haystack: Weaviate error when updating embeddings

Describe the bug After initializing Weaviate with dummy embeddings (just metadata), calling update_embeddings throws an error because of Weaviate's pagination limit.

Error message

---------------------------------------------------------------------------
WeaviateDocumentStoreError                Traceback (most recent call last)
Cell In [8], line 8
      5 wv_docstore = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)
      6 wv_retriever = EmbeddingRetriever(document_store=wv_docstore,embedding_model='sentence-transformers/paraphrase-multilingual-mpnet-base-v2',model_format='sentence_transformers',use_gpu=True)
----> 8 wv_docstore.update_embeddings(retriever=wv_retriever,batch_size=1000)

File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/weaviate.py:1230, in WeaviateDocumentStore.update_embeddings(self, retriever, index, filters, update_existing_embeddings, batch_size)
   1224     raise RuntimeError(
   1225         "All the documents in Weaviate store have an embedding by default. Only update is allowed!"
   1226     )
   1228 result = self._get_all_documents_in_index(index=index, filters=filters, batch_size=batch_size)
-> 1230 for result_batch in get_batches_from_generator(result, batch_size):
   1231     document_batch = [
   1232         self._convert_weaviate_result_to_document(hit, return_embedding=False) for hit in result_batch
   1233     ]
   1234     embeddings = retriever.embed_documents(document_batch)

File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/base.py:890, in get_batches_from_generator(iterable, n)
    886 """
    887 Batch elements of an iterable into fixed-length chunks or blocks.
    888 """
    889 it = iter(iterable)
--> 890 x = tuple(islice(it, n))
    891 while x:
    892     yield x

File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/weaviate.py:752, in WeaviateDocumentStore._get_all_documents_in_index(self, index, filters, batch_size, only_documents_without_embedding)
    749     raise WeaviateDocumentStoreError(f"Weaviate raised an exception: {e}")
    751 if "errors" in result:
--> 752     raise WeaviateDocumentStoreError(f"Query results contain errors: {result['errors']}")
    754 # If `query.do` didn't raise and `result` doesn't contain errors,
    755 # we are good accessing data
    756 docs = result.get("data").get("Get").get(index)

WeaviateDocumentStoreError: Query results contain errors: [{'locations': [{'column': 6, 'line': 1}], 'message': 'explorer: list class: search: invalid pagination params: query maximum results exceeded', 'path': ['Get', 'News']}]

Expected behavior The embeddings are updated without errors.

Additional context This error happens because of the pagination limit (10000) that was introduced in release 1.8.0.

Some way to update embeddings past this limit should probably be implemented.
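To illustrate why the update fails partway through, here is a rough sketch of which offset-based pages would be accepted. It assumes (this is not confirmed in the report) that Weaviate rejects any request where offset + limit exceeds the maximum, and `fetchable_pages` is a hypothetical helper, not part of Haystack or Weaviate.

```python
def fetchable_pages(total_docs, batch_size, max_results=10_000):
    """Split offset-paginated requests into accepted and rejected pages.

    Assumption: a request fails when offset + limit > max_results.
    """
    allowed, rejected = [], []
    for offset in range(0, total_docs, batch_size):
        page = (offset, batch_size)
        if offset + batch_size <= max_results:
            allowed.append(page)
        else:
            rejected.append(page)
    return allowed, rejected


# Example: 12,500 documents read in batches of 1,000 (as in the report).
allowed, rejected = fetchable_pages(total_docs=12_500, batch_size=1_000)
print(len(allowed), "pages accepted; first rejected page:", rejected[0])
```

Under that assumption, every document beyond the first 10000 is unreachable through offset pagination, whatever batch size is used.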

UPDATE: The best solution currently appears to be this one:

You need to add an ordered arbitrary field… … integer going from 0…N… and then not use pagination but filter over this… Yeah it is annoying… 😕 but no other way for now
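A minimal sketch of how that suggestion might look: instead of paging by offset, build one range filter per window over the ordered integer field. The field name `seq_id` and the helper `range_filters` are hypothetical. The dictionaries follow the general shape of Weaviate's GraphQL `where` filter, but check them against the Weaviate version in use.

```python
def range_filters(total_docs, window, field="seq_id"):
    """Yield one where-filter per window [start, end) over an integer field.

    `field` is a hypothetical property holding 0..N-1, assigned at write time.
    """
    for start in range(0, total_docs, window):
        end = min(start + window, total_docs)
        yield {
            "operator": "And",
            "operands": [
                {"path": [field], "operator": "GreaterThanEqual", "valueInt": start},
                {"path": [field], "operator": "LessThan", "valueInt": end},
            ],
        }


# Each window is queried on its own, so no request needs a large offset.
for where in range_filters(total_docs=2_500, window=1_000):
    print(where["operands"][0]["valueInt"], where["operands"][1]["valueInt"])
```

Keeping each window well below the maximum query results means no single request hits the limit.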

The easy solution would be to increase the maximum number of query results, but this has a dramatic impact on Weaviate's memory consumption. Another quote:

if it works for you then OK, but be aware that there is a 10k limit to this approach (you can change it but it will consume a lot of memory) and also it’ll get slower the bigger i you will have… this might be your only way to do it, if your data are already saved and you need to retrieve them. However I wouldn’t recommend it as a general solution. If you have more than 10k records, then you have a problem… you can either change the default conf. variable (and make sure that you have enough memory) or split the dataset by some other field that you already know all the known values for.
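The "split the dataset by some other field" option could be sketched roughly as below: run the update once per known value, passing an equality filter each time. This assumes every partition holds fewer than 10000 documents and that the document store accepts Haystack-style `{"field": {"$eq": value}}` filters. `update_by_partition` and the field name `category` are hypothetical.

```python
def update_by_partition(update_fn, field, known_values):
    """Call update_fn once per known value with an equality filter.

    update_fn is any callable that accepts a `filters` keyword argument.
    """
    processed = []
    for value in known_values:
        update_fn(filters={field: {"$eq": value}})
        processed.append(value)
    return processed


# Stand-in for a real update call, used here only to show what gets passed.
seen = []
update_by_partition(lambda filters: seen.append(filters), "category", ["sports", "politics"])
print(seen)
```

With the objects from the reproduction below, `update_fn` might be `functools.partial(wv_docstore.update_embeddings, retriever=wv_retriever, batch_size=1000)`, though this has not been verified against a store with more than 10000 documents.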

To Reproduce First, load more than 10000 documents into the document store, then run:

from haystack.document_stores import WeaviateDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.schema import Document

wv_docstore = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)
wv_retriever = EmbeddingRetriever(document_store=wv_docstore,embedding_model='sentence-transformers/paraphrase-multilingual-mpnet-base-v2',model_format='sentence_transformers',use_gpu=True)

wv_docstore.update_embeddings(retriever=wv_retriever,batch_size=1000)

System:

OS: Ubuntu
GPU/CPU: T4
Haystack version (commit or version number): 1.9.1
DocumentStore: WeaviateDocumentStore
Retriever: EmbeddingRetriever

About this issue

State: closed
Created 2 years ago
Comments: 18 (6 by maintainers)

Most upvoted comments

@asanoop24 what worked for me was to embed the documents before writing to Weaviate:

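from haystack.nodes import EmbeddingRetriever

def pre_embedder(docs):
    # split_document_store, embedding_model and model_format are defined elsewhere in the script
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=split_document_store,
        embedding_model=embedding_model,
        model_format=model_format,
    )
    # Compute one embedding per document and attach it before writing to Weaviate
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs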

fvanlitsenburg on Jan 21, 2023
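For large collections, the pre-embedding approach from the comment above could be applied batch by batch, so that documents are embedded and written in chunks and update_embeddings is never called. This is only a sketch: `batched` and `embed_and_write` are hypothetical helpers, and `embed_fn` and `write_fn` stand in for something like `pre_embedder` and the document store's `write_documents`.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def embed_and_write(docs, embed_fn, write_fn, batch_size=1000):
    """Embed each batch with embed_fn, pass it to write_fn, return the total count."""
    written = 0
    for batch in batched(docs, batch_size):
        write_fn(embed_fn(batch))
        written += len(batch)
    return written


# Stand-in callables so the flow can be checked without a Weaviate instance.
sink = []
count = embed_and_write(range(5), lambda b: [x * 2 for x in b], sink.append, batch_size=2)
print(count, sink)
```

In practice the call might look like `embed_and_write(docs, pre_embedder, split_document_store.write_documents)`, assuming those objects are set up as in the comment.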

Thank you, @bobvanluijt and @etiennedi for the support! 😃

This would allow using Weaviate in Haystack production scenarios without complex workarounds.

danielbichuetti on Oct 15, 2022

@anakin87 The error message is similar, but unfortunately this error is caused by the retrieval logic used and the pagination limits in Weaviate.

danielbichuetti on Oct 14, 2022