anndata: inconsistent slicing of .X vs .layer
This is related to #74 for for slices and is behind the problem reported in https://github.com/theislab/scanpy/pull/555
from scipy.sparse import csr_matrix
adata = sc.AnnData(np.ones((3,3)))
adata.layers['test'] = adata.X.copy()
adata.layers['sparse'] = csr_matrix(adata.X)
print(adata[:, ['0']].X.shape,
adata[:, ['0']].layers['test'].shape,
adata[:, ['0']].layers['sparse'].shape)
print(adata[:, '0'].X.shape,
adata[:, '0'].layers['test'].shape,
adata[:, '0'].layers['sparse'].shape)
print(type(adata[:, ['0']].X),
type(adata[:, ['0']].layers['test']),
type(adata[:, ['0']].layers['sparse']))
print(type(adata[:, '0'].X),
type(adata[:, '0'].layers['test']),
type(adata[:, '0'].layers['sparse']))
The output is:
(3,) (3, 1) (3, 1)
(3,) (3,) (3, 1)
<class 'anndata.base.ArrayView'> <class 'numpy.ndarray'> <class 'scipy.sparse.csr.csr_matrix'>
<class 'anndata.base.ArrayView'> <class 'numpy.ndarray'> <class 'scipy.sparse.csr.csr_matrix'>
Ideally, while slicing .layers, the output should be the same as for .X
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 17 (15 by maintainers)
For completeness’ sake I’d like to enumerate the options:
.X
and flatten layers as well.ad[:, 1].X
orad[:, 'GeneA'].X
return a(n_obs,)
array (1D),ad[:, [1]].X
orad[:, :1].X
return a(n_obs, 1)
array (2D).len(ad[i, j].X.shape) == 2
for all validi
s andj
s.ad[int_or_str, int_or_str].X
is a special case that returns a single floatI also think “predictability”/“always-2d” is the best option.
ad[0, 'GeneA'].X
), but suffers from the same problems as the “convenience” approach if not (ad = ad[0, 'GeneA']; <do stuff>; good_riddance(ad)
).OK! Cool! This is exactly what @flying-sheep said from the very beginning about “always-2d”. 😃
My argument at the time (2 years ago) was that we need a convenient access to vectors. When anndata was “always-2d”, I had a lot of
adata[0, :].X.flatten()
all over the place in the scanpy code, scripts and notebooks.Then I went ahead and moved the
.flatten()
into the accessor. Which was a bad idea (now, I know, @flying-sheep 🙂).Anyway, I’m happy if we make this change if we make
anndata 1.0
directly, which I think should be an option. A lot of code will break in people’s notebooks, and also within Scanpy, accessing.X
is assumed to be vector like in several instances. But I think, it’s probably worth it and it’s not crazy to fix it. It should go along withscanpy 1.5
, I guess.Last question: I still think one needs convenient access to vectors along the data matrix. And as we have
.obs_vector
already, it should be that. Maybe, in some distant future, we’d also have.X[:, gene]
[which would conform withadata[:, gene].X
and probably be most efficient], but that might be more tricky than the family ofArrayView
,DataFrameView
etc. (if it still exists). However, if it’s easy enough to subclassnp.ndarray
,sp.spmatrix
andh5py.Dataset
, to plugin the global AnnData indexer, this would be a great solution.Last but not least, communicating these changes properly will be key. Both in the release notes, docs, tutorials and github/twitter.