scikit-learn: AttributeError: 'AgglomerativeClustering' object has no attribute 'distances_'
Description
While plotting a Hierarchical Clustering Dendrogram, I receive the following error:
AttributeError: 'AgglomerativeClustering' object has no attribute 'distances_'
Steps/Code to Reproduce
plot_dendrogram is a function from the example.
similarity is a cosine similarity matrix.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
    "The cat stretched.",
    "Jacob stood on his tiptoes.",
    "The car turned the corner.",
    "Kelly twirled in circles.",
    "She opened the door.",
    "Aaron made a picture."
)
vec = TfidfVectorizer()
X = vec.fit_transform(documents) # `X` will now be a TF-IDF representation of the data, the first row of `X` corresponds to the first sentence in `documents`
# Calculate the pairwise cosine similarities (depending on the amount of data that you are going to have this could take a while)
sims = cosine_similarity(X)
similarity = np.round(sims, decimals = 5)
cluster = AgglomerativeClustering(n_clusters = 10, affinity = "cosine", linkage = "average")
cluster.fit(similarity)
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1 # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)
    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
# plot the top three levels of the dendrogram
plot_dendrogram(cluster, truncate_mode='level', p=2)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
Expected Results
A dendrogram
Actual Results
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-20-6255925aaa42> in <module>
21
22 # plot the top three levels of the dendrogram
---> 23 plot_dendrogram(cluster, truncate_mode='level', p=3)
24 plt.xlabel("Number of points in node (or index of point if no parenthesis).")
25 plt.show()
<ipython-input-20-6255925aaa42> in plot_dendrogram(model, **kwargs)
14 counts[i] = current_count
15
---> 16 linkage_matrix = np.column_stack([model.children_, model.distances_,
17 counts]).astype(float)
18
AttributeError: 'AgglomerativeClustering' object has no attribute 'distances_'
Versions
System:
    python: 3.7.6 (default, Jan 8 2020, 13:42:34) [Clang 4.0.1 (tags/RELEASE_401/final)]
    executable: /Users/libbyh/anaconda3/envs/belfer/bin/python
    machine: Darwin-19.3.0-x86_64-i386-64bit
Python dependencies:
    pip: 20.0.2
    setuptools: 46.0.0.post20200309
    sklearn: 0.22.1
    numpy: 1.16.4
    scipy: 1.3.1
    Cython: None
    pandas: 1.0.1
    matplotlib: 3.1.1
    joblib: 0.14.1
Built with OpenMP: True
Extra Info
If I use a distance matrix instead, the dendrogram appears.
distance = 1 - similarity
cluster_dist = AgglomerativeClustering(distance_threshold=0, n_clusters=None, affinity = "precomputed", linkage = "average")
cluster_dist.fit(distance)
plot_dendrogram(cluster_dist, truncate_mode='level', p=2)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
About this issue
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 21 (4 by maintainers)
I’m running into this problem as well. As @NicolasHug commented, the model only has .distances_ if distance_threshold is set. This does not solve the issue, however, because specifying n_clusters requires setting distance_threshold to None, and I need to specify n_clusters. The example is still broken for this general use case.
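One possible workaround for needing both a fixed number of clusters and the merge distances is to build the full tree once (distance_threshold=0, n_clusters=None) and then cut it with scipy. The following is only a minimal sketch under assumptions: the toy data, the default Euclidean metric, and the helper name to_linkage_matrix are illustrative and not from the thread.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster
from sklearn.cluster import AgglomerativeClustering

# Toy data (an assumption for illustration): three well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10],
              [20, 0], [20, 1], [21, 0]], dtype=float)

# Build the full tree so that distances_ is populated.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None,
                                linkage="average").fit(X)


def to_linkage_matrix(model):
    """Hypothetical helper: assemble a scipy-style linkage matrix,
    following the same steps as plot_dendrogram in the example."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        # A child index below n_samples is a leaf; otherwise it is an earlier merge.
        counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                        for child in merge)
    return np.column_stack([model.children_, model.distances_,
                            counts]).astype(float)


linkage_matrix = to_linkage_matrix(model)

# Cut the tree into at most 3 flat clusters.
labels = fcluster(linkage_matrix, t=3, criterion="maxclust")
print(labels)
```

The same linkage_matrix can be passed to scipy's dendrogram, so the plot and the flat labels come from a single fit. Note that criterion="maxclust" returns at most t clusters, not necessarily exactly t.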
Thanks all for the report. The distances_ attribute only exists if the distance_threshold parameter is not None. This parameter was added in version 0.21. All the snippets in this thread that are failing are either using a version prior to 0.21, or don’t set distance_threshold. #17308 properly documents the distances_ attribute.
Encountered the error as well. Updating to version 0.23 resolves the issue. I first had version 0.21, and then upgraded it with: pip install -U scikit-learn
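To check which case a given model falls into, the presence of distances_ can be tested directly. This is a small sketch on made-up toy data. The last part relies on the compute_distances parameter, which is not mentioned in this thread and, to my knowledge, only exists in scikit-learn 0.24 and later, so treat it as an assumption about your installed version.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])  # toy data

# n_clusters set, distance_threshold left as None: distances_ is not created.
by_count = AgglomerativeClustering(n_clusters=2).fit(X)
print(hasattr(by_count, "distances_"))

# distance_threshold set, n_clusters=None: distances_ is populated.
by_threshold = AgglomerativeClustering(distance_threshold=0,
                                       n_clusters=None).fit(X)
print(hasattr(by_threshold, "distances_"))

# Assumption: scikit-learn >= 0.24, where compute_distances is available.
# It requests distances_ even though n_clusters is used.
with_flag = AgglomerativeClustering(n_clusters=2,
                                    compute_distances=True).fit(X)
print(with_flag.distances_.shape)
```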
@libbyh it looks like, according to the documentation and code, n_clusters and distance_threshold cannot be used together: https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/cluster/_agglomerative.py#L656
@adrinjalali I wasn’t able to make a gist, so my example breaks the length recommendations, but I edited the original comment to make a copy+paste example.
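The mutual exclusion of the two parameters can be reproduced in a few lines. This is a sketch on made-up toy data. The exact wording of the error message may differ between scikit-learn versions, so the code only captures and prints it.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])  # toy data

error_message = None
try:
    # Setting both n_clusters and distance_threshold is rejected at fit time.
    AgglomerativeClustering(n_clusters=2, distance_threshold=0).fit(X)
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because n_clusters defaults to 2, passing distance_threshold alone triggers the same error unless n_clusters=None is given explicitly.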