weaviate: BM25 returns no results in some situations (Original title: BM25/Tokenizer not working properly)
Hi –
I have an Article class, which contains many french legal articles.
Here is the class definition:
class_obj = {
"class": "Article",
"description": "Articles des différentes codes de la loi.",
"invertedIndexConfig": {
"stopwords": {
"preset": "none",
#"additions": stopwords_fr
}
},
"vectorizer": "text2vec-transformers",
"properties": [
{
"name": "article_id",
"description": "Id unique de l'article.",
"dataType": [
"text"
],
"indexFilterable": True,
"indexSearchable": False,
},
{
"name": "source",
"description": "Titre de la source juridique (code, loi ou ordonnance) contenant l'article.",
"dataType": [
"text"
],
"tokenization": "lowercase",
"indexFilterable": True,
"indexSearchable": True,
},
{
"name": "titre",
"description": "Le titre de l'article.",
"dataType": [
"text"
],
"indexFilterable": True,
"indexSearchable": True,
"tokenization": "lowercase"
},
{
"name": "texte",
"description": "Le texte de l'article, en html.",
"dataType": [
"text"
],
"moduleConfig": {
"text2vec-transformers": {
"skip": False,
"vectorizePropertyName": False
}
},
"indexFilterable": True,
"indexSearchable": True,
},
{
"name": "etat",
"description": "Etat de l'article : en vigueur, abrogé...",
"dataType": [
"text"
],
"indexFilterable": True,
"indexSearchable": False,
},
{
"name": "path_title",
"description": "Chemin daccès à l'article",
"dataType": [
"text[]"
],
"indexFilterable": True,
"indexSearchable": True,
},
{
"name": "ref_textes",
"description": "Références avec d'autres textes.",
"dataType": [
"text"
],
"indexFilterable": True,
"indexSearchable": False,
},
{
"name": "order",
"description": "Ordre de l'article dans le code.",
"dataType": [
"int"
],
"indexFilterable": True,
"indexSearchable": False,
},
{
"name": "date_deb",
"description": "Date de début de l'article.",
"dataType": [
"int"
],
"indexFilterable": True,
"indexSearchable": False,
},
{
"name": "date_fin",
"description": "Date de fin de l'article.",
"dataType": [
"int"
],
"indexFilterable": True,
"indexSearchable": False,
},
]
}
client.schema.create_class(class_obj)
I want to let my users search for legal articles through their “titre” and “source” properties, which have been tokenized using lowercase.
Here is an example of an article I’m trying to find :
{
"article_id": "JORFARTI000047663197",
"etat": "VIGUEUR",
"path_title": [
"Titre IER : DE LA NATURE DE L'ACTIVITÉ D'INFLUENCE COMMERCIALE PAR VOIE ÉLECTRONIQUE ET DES OBLIGATIONS AFFÉRENTES À SON EXERCICE",
"Chapitre III : Dispositions générales relatives à l'activité d'agent d'influenceur, aux contrats d'influence commerciale par voie électronique, à la responsabilité civile solidaire et à l'assurance civile professionnelle"
],
"source": "LOI n° 2023-451 du 9 juin 2023 visant à encadrer l'influence commerciale et à lutter contre les dérives des influenceurs sur les réseaux sociaux (1)",
"texte": ".....blablabla....",
"titre": "7"
},
Using the following query :
query {
Get {
Article(
limit: 5,
bm25: {
query: "LOI 9 juin 2023 visant à encadrer l'influence"
properties: ["source^3", "titre"]
}
) {
article_id
titre
path_title
texte
source
_additional {
score
}
}
}
}
…gives me the following results :
# only printing the "titre" and "source" in a list
[['2023', 'Code civil'], ["Annexe 9 à l'article A4241-50-2", 'Code des transports'], ["Annexe à l'article R*351-1, art. 9", 'Code des ports maritimes'], ['437 à 614-26', 'Code de commerce (ancien)'], ['L79 à L85', 'Code électoral']]
Am I missing something? It seems like it should be able to find it, because I’m literally copy/pasting the exact source name. Could it be an issue with the lowercase tokenizer?
I’d be happy to provide you with further information if needed.
Thanks in advance.
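To check the tokenizer hypothesis raised above, here is a minimal sketch that approximates Weaviate's documented `lowercase` tokenization (lowercase the text, then split on whitespace). The helper `lowercase_tokenize` is hypothetical and written only for illustration; it is not Weaviate's implementation. It suggests that every query token is also present in the "source" value, so tokenization alone should not prevent a match.

```python
from typing import List


def lowercase_tokenize(text: str) -> List[str]:
    """Approximation (assumption) of `lowercase` tokenization:
    lowercase the whole text, then split on whitespace."""
    return text.lower().split()


# Values copied from the example article and the query in this issue.
source = (
    "LOI n° 2023-451 du 9 juin 2023 visant à encadrer l'influence "
    "commerciale et à lutter contre les dérives des influenceurs "
    "sur les réseaux sociaux (1)"
)
query = "LOI 9 juin 2023 visant à encadrer l'influence"

source_tokens = set(lowercase_tokenize(source))
query_tokens = lowercase_tokenize(query)

# Query tokens that do NOT appear in the source property.
missing = [token for token in query_tokens if token not in source_tokens]
print(missing)  # → []

# The unrelated "Code civil" hit has titre "2023", which also matches one query token.
print("2023" in lowercase_tokenize("2023"))  # → True
```

Under this approximation the expected article should be a strong candidate, which points away from the tokenizer and toward the index itself.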
About this issue
- State: closed
- Created 10 months ago
- Comments: 27 (10 by maintainers)
Commits related to this issue
- fix bm25 compaction bug fixes #3517 — committed to weaviate/weaviate by etiennedi 9 months ago
Please see PR #3592 for a detailed explanation of what caused this, why it only surfaced in v1.21, and how it’s fixed.
Update
The bug first surfaces in v1.21.0. All previous versions (e.g. v1.20.6) are unaffected. All later versions are affected.
Version bisecting
- v1.19.0 ✅
- v1.20.0 ✅
- v1.20.6 (latest 1.20 patch) ✅
- v1.21.0 ❌
- master ❌
Reproduction
PERSISTENCE_FLUSH_IDLE_MEMTABLES_AFTER=3 (this is important because the script from step 3 uses a 5 second pause to force a new segment and therefore compactions).
Once the script indicates a corruption, the state stays corrupted as long as you don’t import any more objects. To prove this, we can look for a very common word, such as Archive.
Archive is a very common word in the dataset. However, it returns zero results in a BM25 search.
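The version bisecting reported above can be sketched as a binary search over an ordered list of versions. This is only an illustration: `first_bad_version` is a hypothetical helper, and the `observed` mapping simply replays the results listed in this issue instead of actually starting Weaviate and running the reproduction script. It assumes that once a version is affected, all later versions are affected too, as stated above.

```python
from typing import Callable, Optional, Sequence


def first_bad_version(
    versions: Sequence[str], is_good: Callable[[str], bool]
) -> Optional[str]:
    """Return the earliest version for which is_good() is False.

    Assumes versions are ordered oldest to newest and that every
    version after the first bad one is also bad.
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid + 1  # the first bad version is strictly after mid
        else:
            hi = mid  # mid is bad, so the first bad one is mid or earlier
    return versions[lo] if lo < len(versions) else None


# Results as reported in this issue (True = BM25 works, False = broken).
observed = {
    "v1.19.0": True,
    "v1.20.0": True,
    "v1.20.6": True,
    "v1.21.0": False,
    "master": False,
}
versions = list(observed)  # dicts preserve insertion order

print(first_bad_version(versions, lambda v: observed[v]))  # → v1.21.0
```

In a real bisect, `is_good` would start the given version, run the import script, and check whether a BM25 search for a common word still returns results.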
The error is pretty consistent now. The dataset I’m using is around 100k simple texts and a number (artwork titles and artwork inventory number). If I drop the class and reindex everything it’s working fine, but after about a day the bm25 returns nothing. I’m using 1.21.1, running locally.
We are currently investigating the problem as a priority, and we hope to have a solution soon.
A quick update: we have confirmed this bug and are still investigating it.
What sounds a bit weird to me is that such a central feature has not been fixed yet. Maybe everybody is using Weaviate only as a vector DB? We chose it over its competitors precisely for its BM25 and hybrid support.
Hi @etiennedi, that’s great news! Unfortunately I don’t have the time at this exact moment to test the script, but I can indeed confirm your points.
We are having the same issue. BM25 does not return any results even though the objects exist and we are querying the correct class. The problem seems to occur after some time when uploading a new object. When a fresh object is uploaded, BM25 seems to work for some time.