BERTopic: AttributeError: Can't get attribute 'EuclideanDistance64' on

When I load the generated BERTopic model, it gives the following error trace:

/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
  arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
  arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
  arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
  arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
  arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
  arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
Traceback (most recent call last):
  File "/home/21zz42/Asset-Management-Topic-Modeling/Code/RQ1/best_model.py", line 24, in <module>
    topic_model = BERTopic.load(os.path.join(path_model, model_name))
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 2998, in load
    topic_model = joblib.load(file)
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/joblib/numpy_pickle.py", line 648, in load
    obj = _unpickle(fobj)
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/joblib/numpy_pickle.py", line 577, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.10/pickle.py", line 1213, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.10/pickle.py", line 1538, in load_stack_global
    self.append(self.find_class(module, name))
  File "/usr/lib/python3.10/pickle.py", line 1582, in find_class
    return _getattribute(sys.modules[module], name)[0]
  File "/usr/lib/python3.10/pickle.py", line 331, in _getattribute
    raise AttributeError("Can't get attribute {!r} on {!r}"
AttributeError: Can't get attribute 'EuclideanDistance64' on <module 'sklearn.metrics._dist_metrics' from '/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/sklearn/metrics/_dist_metrics.cpython-310-x86_64-linux-gnu.so'>

This happens when I run the following code:

import os
import pickle
import pandas as pd

from bertopic import BERTopic

path_rq1 = os.path.join('Result', 'RQ1')
path_model = os.path.join(path_rq1, 'Model')

model_name = 'Challenge_preprocessed_gpt_summary_fzqzh0m6'
column = '_'.join(model_name.split('_')[:-1])

df = pd.read_json(os.path.join('Dataset', 'preprocessed.json'))
df['Challenge_topic'] = -1

# collect the indices and documents that have a non-empty value in the target column
indice = []
docs = []

for index, row in df.iterrows():
    if pd.notna(row[column]) and len(row[column]):
        indice.append(index)
        docs.append(row[column])
        
topic_model = BERTopic.load(os.path.join(path_model, model_name))
topic_number = topic_model.get_topic_info().shape[0] - 1
topics, probs = topic_model.transform(docs)

# persist the topic terms
with open(os.path.join(path_rq1, 'Topic terms.pickle'), 'wb') as handle:
    topic_terms = []
    for i in range(topic_number):
        topic_terms.append(topic_model.get_topic(i))
    pickle.dump(topic_terms, handle, protocol=pickle.HIGHEST_PROTOCOL)

fig = topic_model.visualize_topics()
fig.write_html(os.path.join(path_rq1, 'Topic visualization.html'))

fig = topic_model.visualize_barchart(top_n_topics=topic_number, n_words=10)
fig.write_html(os.path.join(path_rq1, 'Term visualization.html'))

fig = topic_model.visualize_heatmap()
fig.write_html(os.path.join(path_rq1, 'Topic similarity visualization.html'))

# This uses the soft-clustering as performed by HDBSCAN to find the best matching topic for each outlier document.
topics_new = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")

# persist the document topics
for index, topic in zip(indice, topics_new):
    df.at[index, 'Challenge_topic'] = topic

df = df[df.columns.drop(list(df.filter(regex=r'preprocessed|gpt_summary')))]
df.to_json(os.path.join(path_rq1, 'topics.json'), indent=4, orient='records')


Most upvoted comments

Is TfidfVectorizer better than CountVectorizer in terms of preprocessing?

You should not use TfidfVectorizer here, since c-TF-IDF is applied on top of the vectorizer output, which is expected to be a plain bag-of-words representation (as produced by CountVectorizer).
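For reference, a minimal sketch of passing a plain bag-of-words vectorizer to BERTopic (the CountVectorizer settings below are only illustrative):

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer produces the raw term counts that c-TF-IDF expects;
# stop_words and ngram_range are illustrative choices, not requirements.
vectorizer_model = CountVectorizer(stop_words='english', ngram_range=(1, 2))
topic_model = BERTopic(vectorizer_model=vectorizer_model)
# topics, probs = topic_model.fit_transform(docs)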

If I already trained the model with TfidfVectorizer, is there any way to load the trained model afterwards with safetensors?

I think you can load the model if you remove all files belonging to the TfidfVectorizer. This would, however, create a more limited version of BERTopic. It would essentially be the same as using save_ctfidf=False.
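For context, saving with save_ctfidf=False looks roughly like this (a sketch; the output path is hypothetical):

# Save a copy without the c-TF-IDF (and therefore vectorizer) files;
# this is the save_ctfidf=False route mentioned above.
topic_model.save(
    os.path.join(path_model, model_name + '_no_ctfidf'),
    serialization='safetensors',
    save_ctfidf=False,
)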

It seems that this is a known issue for HDBSCAN that should already be fixed in its main branch. There is a new version of HDBSCAN, but there are some commits after that. I believe this mostly relates to version controlling your environment when you pickle BERTopic. When using BERTopic v0.15, it is highly advised to use either pytorch or safetensors to save the model. These are more robust to changing environments and corresponding dependencies.
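A sketch of that workflow, assuming the pickled model can still be loaded in an environment pinned to the scikit-learn/HDBSCAN versions used for training (the output path and embedding model name are illustrative):

# Load the joblib/pickle model in a matching environment, then re-save it
# with safetensors so future loads are less sensitive to dependency changes.
topic_model = BERTopic.load(os.path.join(path_model, model_name))
topic_model.save(
    os.path.join(path_model, model_name + '_safetensors'),
    serialization='safetensors',
    save_ctfidf=True,
    save_embedding_model='sentence-transformers/all-MiniLM-L6-v2',
)

# Later, in the analysis environment:
topic_model = BERTopic.load(os.path.join(path_model, model_name + '_safetensors'))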

With respect to a new release, I think it is best to wait until HDBSCAN is a bit more stable, seeing as some users are still experiencing issues.