scikit-learn: TSNE with correlation metric: ValueError: Distance matrix 'X' must be symmetric
from sklearn.manifold import TSNE
import numpy as np
np.random.seed(42)
data = np.random.rand(10, 3)
data[-1, :] = 0
model = TSNE(metric="correlation")
model.fit_transform(data)
TSNE raises an obscure error when the data set contains rows with a standard deviation of 0
and therefore undefined correlations:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-658142c1e315> in <module>()
1 model = TSNE(metric="correlation")
----> 2 res = model.fit_transform(data)
3 ran = model.fit_transform(ran_data)
/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/sklearn/manifold/t_sne.py in fit_transform(self, X, y)
522 Embedding of the training data in low-dimensional space.
523 """
--> 524 self.fit(X)
525 return self.embedding_
/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/sklearn/manifold/t_sne.py in fit(self, X, y)
447 self.training_data_ = X
448
--> 449 P = _joint_probabilities(distances, self.perplexity, self.verbose)
450 if self.init == 'pca':
451 pca = RandomizedPCA(n_components=self.n_components,
/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/sklearn/manifold/t_sne.py in _joint_probabilities(distances, desired_perplexity, verbose)
52 P = conditional_P + conditional_P.T
53 sum_P = np.maximum(np.sum(P), MACHINE_EPSILON)
---> 54 P = np.maximum(squareform(P) / sum_P, MACHINE_EPSILON)
55 return P
56
/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/scipy/spatial/distance.py in squareform(X, force, checks)
1479 raise ValueError('The matrix argument must be square.')
1480 if checks:
-> 1481 is_valid_dm(X, throw=True, name='X')
1482
1483 # One-side of the dimensions is set here.
/Users/ch/miniconda/envs/sci34/lib/python3.4/site-packages/scipy/spatial/distance.py in is_valid_dm(D, tol, throw, name, warning)
1562 if name:
1563 raise ValueError(('Distance matrix \'%s\' must be '
-> 1564 'symmetric.') % name)
1565 else:
1566 raise ValueError('Distance matrix must be symmetric.')
ValueError: Distance matrix 'X' must be symmetric
About this issue
- State: closed
- Created 9 years ago
- Comments: 18 (17 by maintainers)
Commits related to this issue
- #4475 : Add a safe_pairwise_distances function, dealing with zero variance samples when using correlation metric. The best fix would be to have the metric not returning NaN values, but as the correla... — committed to LowikC/scikit-learn by deleted user 9 years ago
- #4475: Fix directly in pairwise_distances, as suggested — committed to LowikC/scikit-learn by deleted user 9 years ago
I think we should close this issue. As I understand it, this metric does not work if one of the vectors has zero variance; returning NaN seems to be the chosen behaviour for this metric, and no alternative seems better. We could add a different assertion, but I am not sure we should.
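To illustrate the zero-variance case, here is a minimal sketch that computes the correlation distance by hand with NumPy on the data from the report. The helper name `correlation_distances` is hypothetical and is not part of scikit-learn or SciPy; it only mimics the definition (1 minus the Pearson correlation of two rows) and is not the library's implementation.

```python
import numpy as np

def correlation_distances(X):
    """Pairwise correlation distance (1 - Pearson r) between the rows of X.

    Illustrative helper only; not the scipy/scikit-learn implementation.
    """
    Xc = X - X.mean(axis=1, keepdims=True)    # center each row
    norms = np.sqrt((Xc ** 2).sum(axis=1))    # exactly zero for a constant row
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = np.dot(Xc, Xc.T) / np.outer(norms, norms)  # 0 / 0 gives NaN
    return 1.0 - corr

# Same data as in the report: the last row is constant (all zeros).
np.random.seed(42)
data = np.random.rand(10, 3)
data[-1, :] = 0

D = correlation_distances(data)

# Rows with zero standard deviation are the ones that produce NaN distances.
constant_rows = np.flatnonzero(data.std(axis=1) == 0)
print(constant_rows)
print(np.isnan(D[-1]).all())
```

Because NaN never compares equal to itself, a matrix containing NaN entries fails an elementwise comparison with its own transpose, which is presumably why the failure surfaces as a symmetry error rather than as a message about NaN. Checking for constant rows before calling TSNE, as in the last lines of the sketch, would flag the problem earlier.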
So it seems the cosine bug has been fixed, but correlation is open.
For the cosine metric, it should be fixed a posteriori by setting negative numbers with small absolute value (e.g. smaller than 10 * np.finfo(X.dtype).eps) to zero. For the correlation distance NaNs, it looks like a problem in scipy's pdist.
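The a posteriori fix suggested for the cosine metric could look roughly like the sketch below. The helper name `clip_tiny_negatives` is hypothetical, and this is only an assumption about how the suggestion might be applied to a precomputed distance matrix, not the change that was actually merged.

```python
import numpy as np

def clip_tiny_negatives(D):
    """Set negative entries with small absolute value to zero.

    Illustrative helper only. Larger negative values and NaNs are left
    untouched so that genuine problems stay visible.
    """
    D = np.array(D, dtype=float, copy=True)   # work on a copy
    tol = 10 * np.finfo(D.dtype).eps          # threshold from the comment above
    D[(D < 0) & (D > -tol)] = 0.0
    return D

# A distance matrix with round-off noise just below zero.
noisy = np.array([[0.0, -1e-16, 0.3],
                  [-1e-16, 0.0, 0.7],
                  [0.3, 0.7, 0.0]])
cleaned = clip_tiny_negatives(noisy)
print(cleaned.min() >= 0)
```

Only values within the tolerance are changed; a clearly negative entry such as -0.5 would pass through unmodified, since it indicates a real bug rather than round-off.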