scanpy: Returning cluster assignments as str conflicts with matplotlib color sequences

Currently, sc.tl.louvain etc return cluster assignments as a Categorical with dtype str resulting in incompatibility with matplotlib color sequences. For example, the following code raises a ValueError:

import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt

adata = sc.AnnData(np.random.normal(size=(100,2)))
sc.pp.neighbors(adata)
sc.tl.louvain(adata)
plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'])

The error is: ValueError: RGBA values should be within 0-1 range. Funnily enough, this used to work due to a bug in matplotlib that was fixed in https://github.com/matplotlib/matplotlib/pull/13913.

Note, the following code works as intended:

plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'].astype(int))

I would have submitted a PR changing this behavior had I not noticed that returning cluster assignments as str is explicitly checked here:

https://github.com/theislab/scanpy/blob/78125e6355c0cd2c4ae930495829282eea6f4a52/scanpy/tools/_utils_clustering.py#L11-L23

This brings up a larger design question in scanpy / anndata: Why are arrays of numerics routinely converted to strings representing numbers?

In https://github.com/theislab/anndata/issues/311 I found a case where converting arrays of numerics to strings creates a bug when assigning to AnnData obsm with DataFrames with a RangeIndex. In that case, I understand there’s a desire to avoid ambiguity in positional vs label indexing, but that issue was solved in pandas with the .loc and .iloc conventions. Why not carry that forward?

In this case, why not just return cluster assignments as arrays of numerics as is done in sklearn.cluster?

I think following these conventions will make both tools much more accessible to the general Python data science community.

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Comments: 19 (13 by maintainers)

Most upvoted comments

I believe there’s a tutorial with this somewhere. Do you know where this was @LuckyMD?

I’m not so familiar with the scanpy tutorials, but I do show sub-clustering in the single-cell-tutorial notebook here

Hi, I just wanted to bring this back up again because I’ve been logging some of the issue’s I’ve encountered. It seems we’re at a bit of a philosophical divide, so perhaps it’s best for me to just register which use cases I have that AnnData / scanpy are personally causing me friction:

Instead of pasting all errors, I’m just going to paste code blocks I wish worked. Note, these are actual use cases I have regularly encountered.

1. Cannot pass AnnData to numpy or sklearn operators

import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import decomposition, cluster

data = np.random.normal(size=(100,10))
adata = sc.AnnData(data)

# All of the following raise errors
np.sqrt(adata)
adata[:, adata.var_names[0:3]] - adata[:, adata.var_names[3:6]]

adata.obsm['X_PCA'] = decomposition.PCA(2).fit_transform(adata)

To answer the question above, I think it should return the whole AnnData object, like how DataFrames return themselves. I don’t know if we think it should “update” the original AnnData. I’m also confused by how this results in a performance decrease? If I do adata = np.sqrt(adata) then isn’t this the same footprint as modifying inplace? If I do adata_sq = np.sqrt(adata) then my intention is to duplicate the adata object. In this case, it is my intention to create a duplicate object, and I would like AnnData to respect this intention. 2. Requirement to use .var_vector or .obs_vector for single columns

# This works as expected
adata[:, adata.var_names[0:3]]

# I wish this did as well.
adata[:, adata.var_names[0]]

3. .var_vector doesn’t return a Series

pdata = pd.DataFrame(data)
# Returns series
pdata[0]

# Returns ndarray
adata.var_vector[0]

4. Clusters as categories creates confusing scatterplots

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

plt.scatter(adata.obs['leiden'], adata.X[:,0])

Produces the following plot. I would like it to have order 0-5 by default

image

5. Cannot pass clusters to c parameter in plt.scatter I would like this to just work. Instead it throws a huge error.

plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['leiden'])

6. Clusters as categories frustrate subclustering I understand this is a niche application, but like 4 and 5, this would be fixed by matching the output of sklearn.cluster operators.

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

cluster_zero = adata[adata.obs['leiden'] == '0']
sub_clusters = cluster.KMeans(n_clusters=2).fit_predict(adata.X)

# Here I'm trying to break up cluster '0' into subclusters with 
# new names that don't clash with the existing clusters
# However, np.max() and the + operators aren't well defined for 
# cateogricals of strings
sub_clusters = sub_clusters + np.max(adata.obs['leiden'])

In https://github.com/theislab/anndata/issues/311 I found a case where converting arrays of numerics to strings creates a bug when assigning to AnnData obsm with DataFrames with a RangeIndex. In that case, I understand there’s a desire to avoid ambiguity in positional vs label indexing, but that issue was solved in pandas with the .loc and .iloc conventions. Why not carry that forward?

Also, the index of a dataframe in obsm must match obs_names.

An alternative would be to explicitly return a categorical from the clustering function, i.e. rather than ensuring that the clustering returns an array of str, ensure that it returns a categorical where the categories are ints.

We do explicitly return a categorical, it’s just a categorical of strings. There are a few reasons for this:

  • Some plotting libraries (more-so when the methods were written) don’t respect that a categorical array with numeric values is categorical. seaborn has some very weird behavior around this.
  • When subclustering within a cluster, the sub-cluster is named like: "{previous}-{sub}". You can’t do that with a numeric value. By just always using string categoricals we can be consistent about the type resulting from sc.tl.leiden this way.
  • Unclear what the advantage of using integer values for clustering names would be. Performance should be the same since they are categoricals.

About plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain']), I think most of the scientific python ecosystem would like it if c could be categorical, and that it would mean categorical palette would be used. There’s even an issue by our own @flying-sheep about it https://github.com/matplotlib/matplotlib/issues/6214, but it sounds like it’s not gonna happen anytime soon.