STAR: Error reading Velocity matrix
Hi Alex,
Following your question on Twitter I started a comparison for the spliced/unspliced counts from velocyto, STARsolo and kallisto. Unfortunately, when trying to read in the “Velocyto” matrix generated by STARsolo, I’m getting this error:
Matrix::readMM(".../STARsolo/Velocyto/raw/matrix.mtx")
# Error: readMM(): row values 'i' are not in 1:nr
Any idea what the issue could be? The mtx
file is ~1 GB in size.
Using scanpy’s read_mtx
function works fine.
import scanpy as sc
sc.read_mtx(".../STARsolo/Velocyto/raw/matrix.mtx")
# AnnData object with n_obs × n_vars = 11720 × 60609
For those running into the same problem, you can load the .mtx
file generated by STARsolo and then save it again using scanpy/scipy functions as shown below. Afterwards, it will be readable by Matrix::readMM()
in R.
import scanpy as sc
import scipy
t = sc.read_mtx('.../STARsolo/Velocyto/raw/matrix.mtx')
scipy.io.mmwrite('.../STARsolo/Velocyto/raw/matrix_new_format.mtx', t.X)
Matrix::readMM(".../STARsolo/Velocyto/raw/matrix_new_format.mtx") %>% str()
# Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
# ..@ i : int [1:48570060] 1 1 1 1 1 1 1 1 1 1 ...
# ..@ j : int [1:48570060] 6962 54995 69331 89585 121706 145759 228188 337667 383554 383896 ...
# ..@ Dim : int [1:2] 60609 6794880
# ..@ Dimnames:List of 2
# .. ..$ : NULL
# .. ..$ : NULL
# ..@ x : num [1:48570060] 0 0 0 0 0 0 0 0 0 0 ...
# ..@ factors : list()
Best, Roman
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Debian GNU/Linux 9 (stretch)
#
# Matrix products: default
# BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] Matrix_1.2-17 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
# [5] purrr_0.3.3 readr_1.3.1 tidyr_1.0.0 tibble_2.1.3
# [9] ggplot2_3.2.1 tidyverse_1.2.1
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.2 cellranger_1.1.0 pillar_1.4.2 compiler_3.6.1
# [5] tools_3.6.1 zeallot_0.1.0 jsonlite_1.6 lubridate_1.7.4
# [9] lifecycle_0.1.0 gtable_0.3.0 nlme_3.1-141 lattice_0.20-38
# [13] pkgconfig_2.0.3 rlang_0.4.1 cli_1.1.0 rstudioapi_0.10
# [17] haven_2.1.1 withr_2.1.2 xml2_1.2.2 httr_1.4.1
# [21] generics_0.0.2 vctrs_0.2.0 hms_0.5.2 grid_3.6.1
# [25] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0 readxl_1.3.1
# [29] modelr_0.1.5 magrittr_1.5 backports_1.1.5 scales_1.0.0
# [33] rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.3
# [37] lazyeval_0.2.2 munsell_0.5.0 broom_0.5.2 crayon_1.3.4
About this issue
- Original URL
- State: open
- Created 5 years ago
- Comments: 28 (5 by maintainers)
I found this discussion very useful, I just wanted to add something to @jenzopr code I would recomend to use
csr_matrix
in place ofcoo_matrix
since the latter does not properly supports indexing and this would cause some errorrs when triyng to subset the object or runningscvelo.utils.merge
Also, In place of manually specifing the shape of the matrix, I would extract it like that:
shape = np.loadtxt('matrix.mtx', skiprows=2, max_rows = 1, delimiter=' ')[0:2].astype(int)
I’ll post here the code the code we shared by email, maybe somebody could find it useful too.
Function to generate an anndata object from the STARsolo output folder
You just need to point to the STARsolo output folder
adata = buildAnndataFromStar('data/sample1.Solo.out/')
The following function is for older versions of STARsolo that generate only one sparse matrix with 3 value columns for spliced, unspliced and ambiguous.
After importing your andata object, you can eventually export it for a quicker load later.
Here you have the updated function from @JBreunig to load the 3 individual matrices generated in later versions of STARsolo
Just for reference/completion, here is a python way to do that:
The shape
(60558,737280)
is taken from the matrix header Thanks to @msbentsen 😃Thanks to @ftucos, I came up with this new way to implement our spliced, unspliced and ambiguous file to our data, just some very simple codes for the starters.
And many thanks for team STAR for this wonderful development so we don’t have to generate a loom file for scVelo.
With this data, we can use it for scanpy calculate as same as scVelo.
Cheers XIN
Hi Alex, that’s great and really helpful. I think adding the combined matrix is also quite useful.
Thanks a lot
Hi All,
the 2.7.8a release now outputs separate matrices for Velocyto spliced/unspliced/ambiguous counts. Not sure if this is very helpful, especially if you already have your pipelines running, but I thought that it would be a bit less confusing. If necessary, I could add an option to output the combined matrix.mtx.
Cheers Alex
I see high correlation between the results of velocyto (input was the Cell Ranger
.bam
file) and STARsolo, both for spliced and unspliced counts. Less so for kallisto. There are some interesting differences in the number of transcripts and detected genes.Number of transcripts:
Number of detected genes:
Correlation of spliced counts (by genes; 0’s are set to 0.1 for log-scale):
Correlation of unspliced counts (by genes; 0’s are set to 0.1 for log-scale):
Ohhhh, I see. I guess in that case I can reproduce the sparse matrix like this:
The matrix dimensions are taken from the header of the
matrix.mtx
file. It’s looking good so far. No idea what the scanpy method gave me then 😄@bapoorva It’s the label issue.
https://github.com/yyoshiaki/mishima_gassyuku/blob/master/README.md
I followed this one.
import loompy
ds = loompy.connect(“out.loom”) ds.row_attrs[‘Gene’] = ds.row_attrs[‘var_names’] ds.col_attrs[‘CellID’] = ds.col_attrs[‘obs_names’] ds.close()
Then it can be imported by Seurat.
Certainly, this works as well. What I like about the
scvelo.utils.merge
is, that it will get rid of the unmatched cells in spliced and unspliced. This way, you can use theGene/filtered
matrices…@Jachimdejonghe using the AnnData constructor, you can initialise the object with all three layers:
Afterwards you can merge with an existing anndata. Is this what you were looking for? Ref: https://anndata.readthedocs.io/en/stable/anndata.AnnData.html#anndata.AnnData
Hi Roman,
thanks for looking into it… it seems you are going to be the alpha-tester for this feature! 😃 Sorry for the lack of documentation, and for the breaking of the matrix market format. Basically, I dumped the spliced, unspliced and ambiguous counts as three columns in this file i.e. col3=spliced, col4=unspliced, col5=ambiguous
Cheers Alex