anndata: inconsistent slicing of .X vs .layer

This is related to #74 for for slices and is behind the problem reported in https://github.com/theislab/scanpy/pull/555

from  scipy.sparse import csr_matrix

adata = sc.AnnData(np.ones((3,3)))
adata.layers['test'] = adata.X.copy()
adata.layers['sparse'] = csr_matrix(adata.X)

print(adata[:, ['0']].X.shape, 
      adata[:, ['0']].layers['test'].shape,
      adata[:, ['0']].layers['sparse'].shape)
print(adata[:, '0'].X.shape, 
     adata[:, '0'].layers['test'].shape, 
     adata[:, '0'].layers['sparse'].shape)

print(type(adata[:, ['0']].X), 
      type(adata[:, ['0']].layers['test']), 
      type(adata[:, ['0']].layers['sparse']))

print(type(adata[:, '0'].X), 
      type(adata[:, '0'].layers['test']), 
      type(adata[:, '0'].layers['sparse']))

The output is:

(3,) (3, 1) (3, 1)
(3,) (3,) (3, 1)
<class 'anndata.base.ArrayView'> <class 'numpy.ndarray'> <class 'scipy.sparse.csr.csr_matrix'>
<class 'anndata.base.ArrayView'> <class 'numpy.ndarray'> <class 'scipy.sparse.csr.csr_matrix'>

Ideally, while slicing .layers, the output should be the same as for .X

About this issue

Original URL
State: closed
Created 5 years ago
Comments: 17 (15 by maintainers)

Most upvoted comments

For completeness’ sake I’d like to enumerate the options:

“convenience”: Fix the current idea, i.e. leave the behavior for .X and flatten layers as well.
“precision”: Slice depending on the index type. ad[:, 1].X or ad[:, 'GeneA'].X return a (n_obs,) array (1D), ad[:, [1]].X or ad[:, :1].X return a (n_obs, 1) array (2D).
“predictability”/“always-2d”: len(ad[i, j].X.shape) == 2 for all valid is and js.
Some hybrid, e.g. that ad[int_or_str, int_or_str].X is a special case that returns a single float

I also think “predictability”/“always-2d” is the best option.

The invariant is already given if the storage mode is a sparse matrix
sometimes you don’t know if your dynamically computed index would retrieve 1 or more items.
Back then my concerns about it being more bug prone were purely hypothetical (if well-reasoned), but now we saw (as alex said) that the bugs actually happen.
“precision” or the described hybrid would result in X having an 1D or 0D shape. That’s OK if X is directly accessed afterwards (ad[0, 'GeneA'].X), but suffers from the same problems as the “convenience” approach if not (ad = ad[0, 'GeneA']; <do stuff>; good_riddance(ad)).

flying-sheep on Jun 26, 2019

OK! Cool! This is exactly what @flying-sheep said from the very beginning about “always-2d”. 😃

My argument at the time (2 years ago) was that we need a convenient access to vectors. When anndata was “always-2d”, I had a lot of adata[0, :].X.flatten() all over the place in the scanpy code, scripts and notebooks.

Then I went ahead and moved the .flatten() into the accessor. Which was a bad idea (now, I know, @flying-sheep 🙂).

Anyway, I’m happy if we make this change if we make anndata 1.0 directly, which I think should be an option. A lot of code will break in people’s notebooks, and also within Scanpy, accessing .X is assumed to be vector like in several instances. But I think, it’s probably worth it and it’s not crazy to fix it. It should go along with scanpy 1.5, I guess.

Last question: I still think one needs convenient access to vectors along the data matrix. And as we have .obs_vector already, it should be that. Maybe, in some distant future, we’d also have .X[:, gene] [which would conform with adata[:, gene].X and probably be most efficient], but that might be more tricky than the family of ArrayView, DataFrameView etc. (if it still exists). However, if it’s easy enough to subclass np.ndarray, sp.spmatrix and h5py.Dataset, to plugin the global AnnData indexer, this would be a great solution.

Last but not least, communicating these changes properly will be key. Both in the release notes, docs, tutorials and github/twitter.

falexwolf on Jun 26, 2019