anndata: inconsistent slicing of .X vs .layer
This is related to #74 for for slices and is behind the problem reported in https://github.com/theislab/scanpy/pull/555
from scipy.sparse import csr_matrix
adata = sc.AnnData(np.ones((3,3)))
adata.layers['test'] = adata.X.copy()
adata.layers['sparse'] = csr_matrix(adata.X)
print(adata[:, ['0']].X.shape,
adata[:, ['0']].layers['test'].shape,
adata[:, ['0']].layers['sparse'].shape)
print(adata[:, '0'].X.shape,
adata[:, '0'].layers['test'].shape,
adata[:, '0'].layers['sparse'].shape)
print(type(adata[:, ['0']].X),
type(adata[:, ['0']].layers['test']),
type(adata[:, ['0']].layers['sparse']))
print(type(adata[:, '0'].X),
type(adata[:, '0'].layers['test']),
type(adata[:, '0'].layers['sparse']))
The output is:
(3,) (3, 1) (3, 1)
(3,) (3,) (3, 1)
<class 'anndata.base.ArrayView'> <class 'numpy.ndarray'> <class 'scipy.sparse.csr.csr_matrix'>
<class 'anndata.base.ArrayView'> <class 'numpy.ndarray'> <class 'scipy.sparse.csr.csr_matrix'>
Ideally, while slicing .layers, the output should be the same as for .X
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 17 (15 by maintainers)
For completeness’ sake I’d like to enumerate the options:
.Xand flatten layers as well.ad[:, 1].Xorad[:, 'GeneA'].Xreturn a(n_obs,)array (1D),ad[:, [1]].Xorad[:, :1].Xreturn a(n_obs, 1)array (2D).len(ad[i, j].X.shape) == 2for all validis andjs.ad[int_or_str, int_or_str].Xis a special case that returns a single floatI also think “predictability”/“always-2d” is the best option.
ad[0, 'GeneA'].X), but suffers from the same problems as the “convenience” approach if not (ad = ad[0, 'GeneA']; <do stuff>; good_riddance(ad)).OK! Cool! This is exactly what @flying-sheep said from the very beginning about “always-2d”. 😃
My argument at the time (2 years ago) was that we need a convenient access to vectors. When anndata was “always-2d”, I had a lot of
adata[0, :].X.flatten()all over the place in the scanpy code, scripts and notebooks.Then I went ahead and moved the
.flatten()into the accessor. Which was a bad idea (now, I know, @flying-sheep 🙂).Anyway, I’m happy if we make this change if we make
anndata 1.0directly, which I think should be an option. A lot of code will break in people’s notebooks, and also within Scanpy, accessing.Xis assumed to be vector like in several instances. But I think, it’s probably worth it and it’s not crazy to fix it. It should go along withscanpy 1.5, I guess.Last question: I still think one needs convenient access to vectors along the data matrix. And as we have
.obs_vectoralready, it should be that. Maybe, in some distant future, we’d also have.X[:, gene][which would conform withadata[:, gene].Xand probably be most efficient], but that might be more tricky than the family ofArrayView,DataFrameViewetc. (if it still exists). However, if it’s easy enough to subclassnp.ndarray,sp.spmatrixandh5py.Dataset, to plugin the global AnnData indexer, this would be a great solution.Last but not least, communicating these changes properly will be key. Both in the release notes, docs, tutorials and github/twitter.