weaviate: BM25 returns no results in some situations (Original title: BM25/Tokenizer not working properly)

Hi –

I have an Article class, which contains many French legal articles.

Here is the class definition:

class_obj = {
    "class": "Article",
    "description": "Articles des différentes codes de la loi.",
    "invertedIndexConfig": {
        "stopwords": {
            "preset": "none",
            #"additions": stopwords_fr
        }
    },
    "vectorizer": "text2vec-transformers",
    "properties": [
        {
            "name": "article_id",
            "description": "Id unique de l'article.",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": False, 
        },
        {
            "name": "source",
            "description": "Titre de la source juridique (code, loi ou ordonnance) contenant l'article.",
            "dataType": [
                "text"
            ],
            "tokenization": "lowercase",
            "indexFilterable": True,
            "indexSearchable": True, 
        },
        {
            "name": "titre",
            "description": "Le titre de l'article.",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": True, 
            "tokenization": "lowercase"
        },
        {
            "name": "texte",
            "description": "Le texte de l'article, en html.",
            "dataType": [
                "text"
            ],
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,
                    "vectorizePropertyName": False
                }
            },
            "indexFilterable": True,
            "indexSearchable": True, 
        },
        {
            "name": "etat",
            "description": "Etat de l'article : en vigueur, abrogé...",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": False, 
        },
        {
            "name": "path_title",
            "description": "Chemin daccès à l'article",
            "dataType": [
                "text[]"
            ],
            "indexFilterable": True,
            "indexSearchable": True, 
        },
        {
            "name": "ref_textes",
            "description": "Références avec d'autres textes.",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": False, 
        },
        {
            "name": "order",
            "description": "Ordre de l'article dans le code.",
            "dataType": [
                "int"
            ],
            "indexFilterable": True,
            "indexSearchable": False,
        },
        {
            "name": "date_deb",
            "description": "Date de début de l'article.",
            "dataType": [
                "int"
            ],
            "indexFilterable": True,
            "indexSearchable": False,
        },
        {
            "name": "date_fin",
            "description": "Date de fin de l'article.",
            "dataType": [
                "int"
            ],
            "indexFilterable": True,
            "indexSearchable": False,
        },
    ]
}

# `client` is assumed to be an already-connected weaviate.Client instance.
client.schema.create_class(class_obj)

I want to let my users search for legal articles through their “titre” and “source” properties, which have been tokenized using lowercase.

Here is an example of an article I’m trying to find:

{
    "article_id": "JORFARTI000047663197",
    "etat": "VIGUEUR",
    "path_title": [
        "Titre IER : DE LA NATURE DE L'ACTIVITÉ D'INFLUENCE COMMERCIALE PAR VOIE ÉLECTRONIQUE ET DES OBLIGATIONS AFFÉRENTES À SON EXERCICE",
        "Chapitre III : Dispositions générales relatives à l'activité d'agent d'influenceur, aux contrats d'influence commerciale par voie électronique, à la responsabilité civile solidaire et à l'assurance civile professionnelle"
    ],
    "source": "LOI n° 2023-451 du 9 juin 2023 visant à encadrer l'influence commerciale et à lutter contre les dérives des influenceurs sur les réseaux sociaux (1)",
    "texte": ".....blablabla....",
    "titre": "7"
},

Using the following query :

query {
    Get {
        Article(
            limit: 5,
            bm25: {
                query: "LOI 9 juin 2023 visant à encadrer l'influence"
                properties: ["source^3", "titre"]
            }
        ) {
            article_id
            titre
            path_title
            texte
            source
            _additional {
                score
            }
        }
    }
}

…gives me the following results:

# only printing the "titre" and "source" in a list
[['2023', 'Code civil'], ["Annexe 9 à l'article A4241-50-2", 'Code des transports'], ["Annexe à l'article R*351-1, art. 9", 'Code des ports maritimes'], ['437 à 614-26', 'Code de commerce (ancien)'], ['L79 à L85', 'Code électoral']]

Am I missing something? It seems like it should be able to find the article, because I’m literally copy/pasting the exact source name. Could it be an issue with the lowercase tokenizer?
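As a quick sanity check on the tokenizer question, here is a minimal sketch that approximates lowercase tokenization as "lowercase the text, then split on whitespace", which is how I understand the documented behaviour. It is not Weaviate's actual implementation, and `lowercase_tokens` is a hypothetical helper written only for this illustration. It lists the query tokens that do not occur in the stored source value.

```python
def lowercase_tokens(text):
    """Approximate `lowercase` tokenization: lowercase, then split on whitespace.

    Assumption: this mirrors the documented behaviour; it is not Weaviate's code.
    """
    return text.lower().split()


# Values copied from the example object and the BM25 query above.
source = (
    "LOI n° 2023-451 du 9 juin 2023 visant à encadrer l'influence commerciale "
    "et à lutter contre les dérives des influenceurs sur les réseaux sociaux (1)"
)
query = "LOI 9 juin 2023 visant à encadrer l'influence"

source_tokens = set(lowercase_tokens(source))
query_tokens = lowercase_tokens(query)

# Query tokens that never occur in the indexed `source` value.
missing = [token for token in query_tokens if token not in source_tokens]
print(missing)  # → []
```

Under that assumption every query token also appears in the source, so tokenization alone would not explain why the article is missing from the results.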

I’d be happy to provide you with further information if needed.

Thanks in advance.

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 27 (10 by maintainers)

Most upvoted comments

Please see PR #3592 for a detailed explanation of what caused this, why it only surfaced in v1.21, and how it’s fixed.

Update

  1. We can reliably reproduce the bug. A script to do so is attached below.
  2. The bug first appears in v1.21.0. Earlier versions (e.g. v1.20.6) are unaffected; v1.21.0 and all later versions are affected.
  3. The bug is likely related to compactions. Turning them off makes the bug disappear.

Version bisecting

  • v1.19.0
  • v1.20.0
  • v1.20.6 (latest 1.20 patch) ✅
  • v1.21.0
  • master

Reproduction

  1. Download the following book from Project Gutenberg in plain text form: https://www.gutenberg.org/ebooks/48871.txt.utf-8
  2. Start up Weaviate with PERSISTENCE_FLUSH_IDLE_MEMTABLES_AFTER=3 (this is important: the script from step 3 pauses for 5 seconds to force a new segment, and therefore compactions)
  3. Run this script. It typically fails somewhere between iterations 5 and 9.
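The attached script is not reproduced in this thread. Purely as an illustration of the import, pause, query loop that the steps describe, here is a rough sketch. It is not the maintainers' script. It assumes a local, unauthenticated instance at http://localhost:8080 with auto-schema enabled, and it talks to the REST endpoints /v1/batch/objects and /v1/graphql using only the standard library. The class name `Paragraph`, the property `body`, and all function names are made up for this example.

```python
import json
import time
import urllib.request

WEAVIATE_URL = "http://localhost:8080"  # assumption: local, unauthenticated instance
CLASS_NAME = "Paragraph"                # hypothetical class, left to auto-schema
PROBE_WORD = "Archive"                  # the common word used as a probe in this thread


def split_paragraphs(text):
    """Split plain text into non-empty paragraphs on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def bm25_query(class_name, term, limit=1):
    """Build a GraphQL BM25 query string for a single search term."""
    return "{ Get { %s(limit: %d, bm25: {query: %s}) { _additional { id } } } }" % (
        class_name, limit, json.dumps(term))


def post_json(path, payload):
    """POST a JSON payload to Weaviate and return the decoded JSON response."""
    request = urllib.request.Request(
        WEAVIATE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def import_paragraphs(paragraphs, batch_size=100):
    """Import the paragraphs in small batches."""
    for start in range(0, len(paragraphs), batch_size):
        objects = [
            {"class": CLASS_NAME, "properties": {"body": p}}
            for p in paragraphs[start:start + batch_size]
        ]
        post_json("/v1/batch/objects", {"objects": objects})


def bm25_has_hits(term):
    """Return True if a BM25 search for `term` finds at least one object."""
    body = post_json("/v1/graphql", {"query": bm25_query(CLASS_NAME, term)})
    hits = ((body.get("data") or {}).get("Get") or {}).get(CLASS_NAME)
    return bool(hits)


def reproduce(paragraphs, iterations=10, pause_s=5):
    """Re-import the text repeatedly; return the first iteration where BM25 finds nothing."""
    for iteration in range(1, iterations + 1):
        import_paragraphs(paragraphs)
        # Pause longer than the 3-second idle flush so each round ends in a new segment.
        time.sleep(pause_s)
        if not bm25_has_hits(PROBE_WORD):
            return iteration
    return None
```

To try it, read the downloaded book in text mode, pass the result of `split_paragraphs` to `reproduce`, and check whether it returns an iteration number rather than None. Whether this loop triggers the bug as reliably as the original script is untested.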

Once the script indicates a corruption, the state stays corrupted as long as you don’t import any more objects. To prove this, we can look for a very common word, such as Archive.

As the following screenshot shows, Archive is a very common word:

[Screenshot 2023-09-27 at 5 09 28 PM]

However, it returns zero results in a BM25 search:

[Screenshot 2023-09-27 at 5 08 18 PM]

The error is pretty consistent now. The dataset I’m using is around 100k objects, each with simple texts and a number (artwork titles and an artwork inventory number). If I drop the class and reindex everything, it works fine, but after about a day BM25 returns nothing. I’m using v1.21.1, running locally.

We are currently investigating the problem as a priority, and we hope to have a solution soon.

On Thu, 14 Sept 2023 at 08:12, Michael Hilhorst wrote:

We are having the same issue. BM25 does not return any results even though the objects exist and we are querying the correct class. The problem seems to occur some time after a new object is uploaded. Right after a fresh object is uploaded, BM25 seems to work for a while.

A quick update: we have confirmed this bug and are still investigating it.

What seems a bit odd to me is that such a central feature has not been fixed yet. Maybe everybody is using Weaviate only as a vector DB? We chose it over its competitors precisely for its BM25 and hybrid support.

Hi @etiennedi, that’s great news! Unfortunately I don’t have time to test the script right now, but I can confirm your points.

  • We started noticing this bug after upgrading to 1.21.
  • I suspected that it wasn’t happening directly on write, but during some sort of “offline/delayed” operation. Our current workflow consists of importing some data, updating it from time to time (we do the updates ourselves; users don’t), and querying it most of the time. I had noticed that BM25 would suddenly stop working, but not immediately after the writes. We don’t work with local Weaviate instances, only with Weaviate Cloud, so the flush interval is whatever default those instances use.
