BERTopic: AttributeError: Can't get attribute 'EuclideanDistance64' on
When I load the generated BERTopic model, it gives the following error trace:
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
Traceback (most recent call last):
  File "/home/21zz42/Asset-Management-Topic-Modeling/Code/RQ1/best_model.py", line 24, in <module>
    topic_model = BERTopic.load(os.path.join(path_model, model_name))
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 2998, in load
    topic_model = joblib.load(file)
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/joblib/numpy_pickle.py", line 648, in load
    obj = _unpickle(fobj)
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/joblib/numpy_pickle.py", line 577, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.10/pickle.py", line 1213, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.10/pickle.py", line 1538, in load_stack_global
    self.append(self.find_class(module, name))
  File "/usr/lib/python3.10/pickle.py", line 1582, in find_class
    return _getattribute(sys.modules[module], name)[0]
  File "/usr/lib/python3.10/pickle.py", line 331, in _getattribute
    raise AttributeError("Can't get attribute {!r} on {!r}"
AttributeError: Can't get attribute 'EuclideanDistance64' on <module 'sklearn.metrics._dist_metrics' from '/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/sklearn/metrics/_dist_metrics.cpython-310-x86_64-linux-gnu.so'>
This happens when I run the following code:
import os
import pickle
import pandas as pd
from bertopic import BERTopic

path_rq1 = os.path.join('Result', 'RQ1')
path_model = os.path.join(path_rq1, 'Model')
model_name = 'Challenge_preprocessed_gpt_summary_fzqzh0m6'
column = '_'.join(model_name.split('_')[:-1])

df = pd.read_json(os.path.join('Dataset', 'preprocessed.json'))
df['Challenge_topic'] = -1

indice = []
docs = []
for index, row in df.iterrows():
    if pd.notna(row[column]) and len(row[column]):
        indice.append(index)
        docs.append(row[column])

topic_model = BERTopic.load(os.path.join(path_model, model_name))
topic_number = topic_model.get_topic_info().shape[0] - 1
topics, probs = topic_model.transform(docs)

# persist the topic terms
with open(os.path.join(path_rq1, 'Topic terms.pickle'), 'wb') as handle:
    topic_terms = []
    for i in range(topic_number):
        topic_terms.append(topic_model.get_topic(i))
    pickle.dump(topic_terms, handle, protocol=pickle.HIGHEST_PROTOCOL)

fig = topic_model.visualize_topics()
fig.write_html(os.path.join(path_rq1, 'Topic visualization.html'))
fig = topic_model.visualize_barchart(top_n_topics=topic_number, n_words=10)
fig.write_html(os.path.join(path_rq1, 'Term visualization.html'))
fig = topic_model.visualize_heatmap()
fig.write_html(os.path.join(path_rq1, 'Topic similarity visualization.html'))

# This uses the soft-clustering as performed by HDBSCAN to find the best matching topic for each outlier document.
topics_new = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")

# persist the document topics
for index, topic in zip(indice, topics_new):
    df.at[index, 'Challenge_topic'] = topic

df = df[df.columns.drop(list(df.filter(regex=r'preprocessed|gpt_summary')))]
df.to_json(os.path.join(path_rq1, 'topics.json'), indent=4, orient='records')
About this issue
- State: open
- Created a year ago
- Comments: 15 (5 by maintainers)
You should actually not use the TfidfVectorizer, since c-TF-IDF is applied on top of the vectorizer, which is expected to be a plain bag-of-words.
I think you can load the model if you remove all files belonging to the TfidfVectorizer. This would, however, create a more limited version of BERTopic. It would essentially be the same as using save_ctfidf=False.
It seems that this is a known issue for HDBSCAN, which should already be fixed in their main branch. There is a new version of HDBSCAN, but there have been some commits since that release. I believe this mostly relates to version-controlling your environment when you pickle BERTopic. When using BERTopic v0.15, it is highly advisable to save the model with either pytorch or safetensors serialization. This is more robust to changing environments and their dependencies.
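A minimal sketch of the safetensors route suggested above, assuming BERTopic v0.15 or later, where `save` accepts the `serialization` and `save_ctfidf` arguments. The helper name `save_portable` and the directory in the usage comment are hypothetical; the model must be re-saved from an environment in which it can still be loaded or refitted.

```python
from pathlib import Path


def save_portable(topic_model, model_dir):
    """Save a fitted BERTopic model without pickling its sub-models.

    Assumption: `topic_model` is a fitted bertopic.BERTopic instance
    (v0.15+), whose `save` method accepts `serialization`.
    """
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    topic_model.save(
        str(model_dir),
        serialization="safetensors",  # "pytorch" is the other option named above
        save_ctfidf=True,             # also store the c-TF-IDF representations
    )
    return model_dir


# Usage (requires bertopic; the path is a placeholder):
# from bertopic import BERTopic
# save_portable(topic_model, "Result/RQ1/Model/my_model_dir")
# loaded = BERTopic.load("Result/RQ1/Model/my_model_dir")
```

As far as I know, these formats do not store the UMAP and HDBSCAN sub-models, so steps that depend on HDBSCAN's soft clustering, such as `reduce_outliers` with `strategy="probabilities"`, may need a different strategy after loading.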
With respect to a new release, I think it is best to wait until HDBSCAN is a bit more stable seeing as there are still some individuals experiencing some issues.
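For anyone who still needs to pickle the model, one way to act on the version-control advice is to record the installed package versions next to the saved model and compare them before loading. The sketch below uses only the standard library; the function names and the package list are assumptions chosen for illustration, not part of BERTopic. The class in the traceback lives in scikit-learn's compiled module, so a scikit-learn mismatch between the saving and loading environments is a plausible thing to check first.

```python
import json
from importlib import metadata

# Distribution names assumed to matter for a pickled BERTopic model.
PACKAGES = ["bertopic", "hdbscan", "umap-learn", "scikit-learn", "numba", "joblib"]


def snapshot_versions(packages):
    """Return {distribution name: installed version, or None if missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(saved, current):
    """List (package, saved_version, current_version) where the two differ."""
    return [
        (name, version, current.get(name))
        for name, version in saved.items()
        if current.get(name) != version
    ]


# At save time: write the snapshot to a file stored alongside the model.
# At load time: read it back and compare with the current environment.
current = snapshot_versions(PACKAGES)
print(json.dumps(current, indent=2))
```

Any entry returned by `find_mismatches` points to a package to pin to the saved version before calling `BERTopic.load` on a pickled model.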