haystack: Weaviate error when updating embeddings
Describe the bug After initializing Weaviate with dummy embeddings (metadata only), calling update_embeddings throws an error because of Weaviate's pagination limit.
Error message
---------------------------------------------------------------------------
WeaviateDocumentStoreError Traceback (most recent call last)
Cell In [8], line 8
5 wv_docstore = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)
6 wv_retriever = EmbeddingRetriever(document_store=wv_docstore,embedding_model='sentence-transformers/paraphrase-multilingual-mpnet-base-v2',model_format='sentence_transformers',use_gpu=True)
----> 8 wv_docstore.update_embeddings(retriever=wv_retriever,batch_size=1000)
File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/weaviate.py:1230, in WeaviateDocumentStore.update_embeddings(self, retriever, index, filters, update_existing_embeddings, batch_size)
1224 raise RuntimeError(
1225 "All the documents in Weaviate store have an embedding by default. Only update is allowed!"
1226 )
1228 result = self._get_all_documents_in_index(index=index, filters=filters, batch_size=batch_size)
-> 1230 for result_batch in get_batches_from_generator(result, batch_size):
1231 document_batch = [
1232 self._convert_weaviate_result_to_document(hit, return_embedding=False) for hit in result_batch
1233 ]
1234 embeddings = retriever.embed_documents(document_batch)
File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/base.py:890, in get_batches_from_generator(iterable, n)
886 """
887 Batch elements of an iterable into fixed-length chunks or blocks.
888 """
889 it = iter(iterable)
--> 890 x = tuple(islice(it, n))
891 while x:
892 yield x
File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/weaviate.py:752, in WeaviateDocumentStore._get_all_documents_in_index(self, index, filters, batch_size, only_documents_without_embedding)
749 raise WeaviateDocumentStoreError(f"Weaviate raised an exception: {e}")
751 if "errors" in result:
--> 752 raise WeaviateDocumentStoreError(f"Query results contain errors: {result['errors']}")
754 # If `query.do` didn't raise and `result` doesn't contain errors,
755 # we are good accessing data
756 docs = result.get("data").get("Get").get(index)
WeaviateDocumentStoreError: Query results contain errors: [{'locations': [{'column': 6, 'line': 1}], 'message': 'explorer: list class: search: invalid pagination params: query maximum results exceeded', 'path': ['Get', 'News']}]
Expected behavior It’s expected that embeddings get updated correctly.
Additional context This error happens because of the pagination limit (10000) that was introduced in release 1.8.0.
Some way to update embeddings beyond this limit should probably be implemented.
UPDATE: The best solution currently appears to be this one:
You need to add an ordered arbitrary field… … integer going from 0…N… and then not use pagination but filter over this… Yeah it is annoying… 😕 but no other way for now
The easy solution would be to increase the maximum query results, but this has a dramatic impact on Weaviate's memory consumption. Another quote:
if it works for you then OK, but be aware that there is a 10k limit to this approach (you can change it but it will consume a lot of memory) and also it’ll get slower the bigger i you will have… this might be your only way to do it, if your data are already saved and you need to retrieve them. However I wouldn’t recommend it as a general solution. If you have more than 10k records, then you have a problem… you can either change the default conf. variable (and make sure that you have enough memory) or split the dataset by some other field that you already know all the values for.
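The workaround quoted above can be sketched roughly as follows. This assumes each document was written with a hypothetical integer meta field `seq_id` running from 0 to N-1, and that the Weaviate store accepts Haystack's `$gte`/`$lt` range filters on that field; neither is confirmed in this thread. `update_embeddings` is then called once per window through its `filters` argument, so no single paginated query has to reach past the 10,000-result limit. The helper names are made up for illustration.

```python
from typing import Any, Dict, Iterator


def window_filters(total: int, window: int = 5000, field: str = "seq_id") -> Iterator[Dict[str, Any]]:
    """Yield one range filter per window of `window` consecutive integers.

    `field` is a hypothetical integer meta field (0..total-1) that must have
    been stored with every document beforehand.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    for start in range(0, total, window):
        end = min(start + window, total)
        # Haystack-style filter: start <= field < end
        yield {field: {"$gte": start, "$lt": end}}


def update_embeddings_in_windows(docstore, retriever, total, window=5000,
                                 batch_size=1000, field="seq_id"):
    """Call update_embeddings once per window; return the number of windows.

    Keep `window` well below Weaviate's query limit so that pagination
    inside a single window never exceeds it.
    """
    count = 0
    for flt in window_filters(total, window, field):
        docstore.update_embeddings(retriever=retriever, filters=flt, batch_size=batch_size)
        count += 1
    return count
```

With the objects from the report this would be called as `update_embeddings_in_windows(wv_docstore, wv_retriever, total=N)`, where `N` is the number of stored documents. As the quote notes, this only helps if such an ordered field already exists or can be added.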
To Reproduce First, load more than 10000 documents into the document store
from haystack.document_stores import WeaviateDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.schema import Document
wv_docstore = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)
wv_retriever = EmbeddingRetriever(document_store=wv_docstore,embedding_model='sentence-transformers/paraphrase-multilingual-mpnet-base-v2',model_format='sentence_transformers',use_gpu=True)
wv_docstore.update_embeddings(retriever=wv_retriever,batch_size=1000)
FAQ Check
- Have you had a look at our new FAQ page?
System:
- OS: Ubuntu
- GPU/CPU: T4
- Haystack version (commit or version number): 1.9.1
- DocumentStore: WeaviateDocumentStore
- Retriever: EmbeddingRetriever
About this issue
- State: closed
- Created 2 years ago
- Comments: 18 (6 by maintainers)
@asanoop24 what worked for me was to embed the documents before writing to Weaviate:
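A minimal sketch of that idea (not the commenter's original code), assuming `documents` is a list of Haystack `Document` objects and that the retriever and document store are set up as in the reproduction steps. It embeds each batch with `retriever.embed_documents`, attaches the vectors to the documents, and only then writes them, so `update_embeddings` and its paginated read are never needed. The function name is hypothetical.

```python
def write_with_embeddings(docstore, retriever, documents, batch_size=1000):
    """Embed documents batch by batch, then write them; return the count written."""
    written = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # Compute one embedding per document in this batch.
        embeddings = retriever.embed_documents(batch)
        for doc, emb in zip(batch, embeddings):
            doc.embedding = emb
        # Documents now carry real embeddings instead of dummy ones.
        docstore.write_documents(batch)
        written += len(batch)
    return written
```

For example: `write_with_embeddings(wv_docstore, wv_retriever, docs)`.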
Thank you, @bobvanluijt and @etiennedi for the support! 😃
This would allow using Weaviate in Haystack production scenarios without complex solutions.
@anakin87 The error message is similar, but unfortunately this error is caused by the logic used and the pagination limits in Weaviate.