BERTopic: Possible "off by 1" bug in transform() when using reloaded model?

Hi,

While trying to follow the best practice for running inference on additional data with an existing BERTopic model, I took the advice to save and reload the model.

The code for the initial model looks like:

from bertopic import BERTopic

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model, representation_model=representation_model, verbose=True, calculate_probabilities=False,
                       n_gram_range=(1, 2), nr_topics=max_topics)
topics, probs = topic_model.fit_transform(docs, embeddings)

where max_topics is set to 150. This results in 150 topics being created, including the -1 outlier topic, so the regular topics are numbered 0 to 148.
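For reference, here is roughly how I confirm that numbering right after fitting (same topics list and logger as above; the commented values are what I see in my run):

# Sanity-check the topic numbering straight after fit_transform():
# -1 is the outlier topic, the regular topics run 0..148.
log.debug(f"min topic after fit: {np.min(topics)}")                      # -1
log.debug(f"max topic after fit: {np.max(topics)}")                      # 148
log.debug(f"topic count incl. -1: {len(topic_model.get_topic_info())}")  # 150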

I then reduce outliers:

new_topics = topic_model.reduce_outliers(documents=docs, topics=topics, strategy="embeddings", embeddings=embeddings)
topic_model.update_topics(docs, topics=new_topics, n_gram_range=(1, 2), vectorizer_model=vectorizer_model, representation_model=representation_model)

After this there is no -1 topic anymore; we are left with topics numbered 0 to 148, which I have confirmed by printing out the topics:

topic_info = topic_model.get_topic_info()
log.debug(f"frequent topics:\n{topic_info.to_string()}")

So then I save and reload the model:

topic_model.save(path="saved_bertopic", serialization="safetensors", save_ctfidf=True, save_embedding_model=True)
topic_model = BERTopic.load("saved_bertopic", embedding_model=None)

I use save_embedding_model=True despite supplying my own custom embeddings, as a workaround for a separate bug. I then load more docs and more embeddings, run transform(), and check the minimum and maximum topic numbers assigned:

more_topics, more_probs = topic_model.transform(documents=more_docs, embeddings=more_embeddings)
log.debug(f"min_topic_number found in more_topics: {np.min(more_topics)}")
log.debug(f"max_topic_number found in more_topics: {np.max(more_topics)}")

This results in the log showing a minimum of 0 and a maximum of 149. Is the topic number 149 incorrect? It causes my program to hit a bug later on when trying to access the topic name, because that index does not exist.
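For context, the later failure looks roughly like this; the name lookup is my own code and is shown only to illustrate where topic 149 breaks things:

# Map topic number -> topic name from the model, then label the new docs.
topic_names = topic_model.get_topic_info().set_index("Topic")["Name"].to_dict()
labels = [topic_names[t] for t in more_topics]   # KeyError: 149, since only 0..148 exist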

I also verified that if I do not save/reload the model but instead run inference on more_docs with the original in-memory model, the min/max topic numbers come out as 0 and 148 (see the sketch below).
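That comparison was done along these lines, where original_model stands for the fitted model kept in a separate variable before saving (so it is not the reloaded topic_model from above):

# Inference with the in-memory model that was never saved/reloaded:
orig_topics, orig_probs = original_model.transform(documents=more_docs, embeddings=more_embeddings)
log.debug(f"min via original model: {np.min(orig_topics)}")   # 0
log.debug(f"max via original model: {np.max(orig_topics)}")   # 148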

So it appears that transform() on a saved/reloaded model somehow generates one extra topic? Please let me know if I am using the software incorrectly, thank you.

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 23 (7 by maintainers)

Most upvoted comments

OK, I checked self._outliers at three points in the process; here are the results:

  1. On the initial model, just after fit_transform(): 1
  2. After reducing outliers and update_topics(): 0
  3. After saving/reloading the model: 0
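For completeness, this is roughly how I probed it; _outliers is a private attribute, so this is only a debugging aid, not supported API:

# 1) right after fit_transform()
log.debug(f"_outliers after fit_transform: {topic_model._outliers}")   # 1
# 2) after reduce_outliers() + update_topics()
log.debug(f"_outliers after update_topics: {topic_model._outliers}")   # 0
# 3) after saving and reloading
reloaded = BERTopic.load("saved_bertopic", embedding_model=None)
log.debug(f"_outliers after reload: {reloaded._outliers}")             # 0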