STAR: Error reading Velocity matrix

Hi Alex,

Following your question on Twitter I started a comparison for the spliced/unspliced counts from velocyto, STARsolo and kallisto. Unfortunately, when trying to read in the “Velocyto” matrix generated by STARsolo, I’m getting this error:

Matrix::readMM(".../STARsolo/Velocyto/raw/matrix.mtx")
# Error: readMM(): row     values 'i' are not in 1:nr

Any idea what the issue could be? The mtx file is ~1 GB in size.

Using scanpy’s read_mtx function works fine.

import scanpy as sc
sc.read_mtx(".../STARsolo/Velocyto/raw/matrix.mtx")
# AnnData object with n_obs × n_vars = 11720 × 60609

For those running into the same problem, you can load the .mtx file generated by STARsolo and then save it again using scanpy/scipy functions as shown below. Afterwards, it will be readable by Matrix::readMM() in R.

import scanpy as sc
import scipy
t = sc.read_mtx('.../STARsolo/Velocyto/raw/matrix.mtx')
scipy.io.mmwrite('.../STARsolo/Velocyto/raw/matrix_new_format.mtx', t.X)
Matrix::readMM(".../STARsolo/Velocyto/raw/matrix_new_format.mtx") %>% str()
# Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
#   ..@ i       : int [1:48570060] 1 1 1 1 1 1 1 1 1 1 ...
#   ..@ j       : int [1:48570060] 6962 54995 69331 89585 121706 145759 228188 337667 383554 383896 ...
#   ..@ Dim     : int [1:2] 60609 6794880
#   ..@ Dimnames:List of 2
#   .. ..$ : NULL
#   .. ..$ : NULL
#   ..@ x       : num [1:48570060] 0 0 0 0 0 0 0 0 0 0 ...
#   ..@ factors : list()

Best, Roman

# R version 3.6.1 (2019-07-05)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Debian GNU/Linux 9 (stretch)
# 
# Matrix products: default
# BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#  [1] Matrix_1.2-17   forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
#  [5] purrr_0.3.3     readr_1.3.1     tidyr_1.0.0     tibble_2.1.3   
#  [9] ggplot2_3.2.1   tidyverse_1.2.1
# 
# loaded via a namespace (and not attached):
#  [1] Rcpp_1.0.2       cellranger_1.1.0 pillar_1.4.2     compiler_3.6.1  
#  [5] tools_3.6.1      zeallot_0.1.0    jsonlite_1.6     lubridate_1.7.4 
#  [9] lifecycle_0.1.0  gtable_0.3.0     nlme_3.1-141     lattice_0.20-38 
# [13] pkgconfig_2.0.3  rlang_0.4.1      cli_1.1.0        rstudioapi_0.10 
# [17] haven_2.1.1      withr_2.1.2      xml2_1.2.2       httr_1.4.1      
# [21] generics_0.0.2   vctrs_0.2.0      hms_0.5.2        grid_3.6.1      
# [25] tidyselect_0.2.5 glue_1.3.1       R6_2.4.0         readxl_1.3.1    
# [29] modelr_0.1.5     magrittr_1.5     backports_1.1.5  scales_1.0.0    
# [33] rvest_0.3.4      assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.3   
# [37] lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2      crayon_1.3.4

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Comments: 28 (5 by maintainers)

Most upvoted comments

Just for reference/completion, here is a python way to do that:

import numpy as np
from scipy import sparse
mtx = np.loadtxt('matrix.mtx', skiprows=3, delimiter=' ')

spliced = sparse.coo_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))
unspliced = sparse.coo_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))
ambiguous = sparse.coo_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))

The shape (60558,737280) is taken from the matrix header Thanks to @msbentsen 😃

I found this discussion very useful, I just wanted to add something to @jenzopr code I would recomend to use csr_matrix in place of coo_matrix since the latter does not properly supports indexing and this would cause some errorrs when triyng to subset the object or running scvelo.utils.merge

Also, In place of manually specifing the shape of the matrix, I would extract it like that: shape = np.loadtxt('matrix.mtx', skiprows=2, max_rows = 1, delimiter=' ')[0:2].astype(int)

Hi All, the 2.7.8a release now outputs separate matrices for Velocyto spliced/unspliced/ambiguous counts. Not sure if this is very helpful, especially if you already have your pipelines running, but I thought that it would be a bit less confusing. If necessary, I could add an option to output the combined matrix.mtx. Cheers Alex

Might anyone have updated code for inputting the split matrices into anndata?

I’ve got the adata object set up but I’m struggling to add the separate Velocyto spliced/unspliced/ambiguous matrices from 2.7.9a as layers.

Update: solved with the help of ftucos 👍

I’ll post here the code the code we shared by email, maybe somebody could find it useful too.

Function to generate an anndata object from the STARsolo output folder

You just need to point to the STARsolo output folder adata = buildAnndataFromStar('data/sample1.Solo.out/')

# Load required libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy import sparse
import anndata
import scanpy as sc
import scvelo as scv
scv.set_figure_params()
os.chdir('/path/to/work/dir')

The following function is for older versions of STARsolo that generate only one sparse matrix with 3 value columns for spliced, unspliced and ambiguous.

def buildAnndataFromStar(path):
    """Generate an anndata object from the STAR aligner output folder"""
    path=path
    # Load Read Counts
    X = sc.read_mtx(path+'Gene/raw/matrix.mtx')
    
    # Transpose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
    X = X.X.transpose()
    
    # This matrix is organized as a sparse matrix with Row, Cols and 3 values columns for 
    # Spliced, Unspliced and Ambigous reads
    mtx = np.loadtxt(path+'Velocyto/raw/matrix.mtx', skiprows=3, delimiter=' ')
    # Extract sparse matrix shape informations from the third row
    shape = np.loadtxt(path+'Velocyto/raw/matrix.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)

    # Read the sparse matrix with csr_matrix((data, (row_ind, col_ind)), shape=(M, N))
    # Subract -1 to rows and cols index because csr_matrix expects a 0 based index
    # Traspose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
    spliced = sparse.csr_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
    unspliced = sparse.csr_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
    ambiguous = sparse.csr_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
    
    # Load Genes and Cells identifiers
    obs = pd.read_csv(path+'Velocyto/raw/barcodes.tsv',
                  header = None, index_col = 0)
    # Remove index column name to make it compliant with the anndata format
    obs.index.name = None
    
    var = pd.read_csv(path+'Velocyto/raw/features.tsv', sep='\t',
                  names = ('gene_ids', 'feature_types'), index_col = 1)
    
    # Build AnnData object to be used with ScanPy and ScVelo
    adata = anndata.AnnData(X = X, obs = obs, var = var,
                        layers = {'spliced': spliced, 'unspliced': unspliced, 'ambiguous': ambiguous})
    adata.var_names_make_unique()
    
    # Subset Cells based on STAR filtering
    selected_barcodes = pd.read_csv(path+'Gene/filtered/barcodes.tsv', header = None)
    adata = adata[selected_barcodes[0]]
    
    return adata.copy()

After importing your andata object, you can eventually export it for a quicker load later.

adata.write('processed/sample_1.h5ad', compression = 'gzip')
adata = anndata.read_h5ad('processed/sample_1.h5ad')

Here you have the updated function from @JBreunig to load the 3 individual matrices generated in later versions of STARsolo

def buildAnndataFromStarCurr(path):
    """Generate an anndata object from the STAR aligner output folder"""
    path=path
    # Load Read Counts
    X = sc.read_mtx(path+'Gene/raw/matrix.mtx')

    # Transpose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
    X = X.X.transpose()

    # Load the 3 matrices containing Spliced, Unspliced and Ambigous reads
    mtxU = np.loadtxt(path+'Velocyto/raw/unspliced.mtx', skiprows=3, delimiter=' ')
    mtxS = np.loadtxt(path+'Velocyto/raw/spliced.mtx', skiprows=3, delimiter=' ')
    mtxA = np.loadtxt(path+'Velocyto/raw/ambiguous.mtx', skiprows=3, delimiter=' ')

    # Extract sparse matrix shape informations from the third row
    shapeU = np.loadtxt(path+'Velocyto/raw/unspliced.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)
    shapeS = np.loadtxt(path+'Velocyto/raw/spliced.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)
    shapeA = np.loadtxt(path+'Velocyto/raw/ambiguous.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)

    # Read the sparse matrix with csr_matrix((data, (row_ind, col_ind)), shape=(M, N))
    # Subract -1 to rows and cols index because csr_matrix expects a 0 based index
    # Traspose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects

    spliced = sparse.csr_matrix((mtxS[:,2], (mtxS[:,0]-1, mtxS[:,1]-1)), shape = shapeS).transpose()
    unspliced = sparse.csr_matrix((mtxU[:,2], (mtxU[:,0]-1, mtxU[:,1]-1)), shape = shapeU).transpose()
    ambiguous = sparse.csr_matrix((mtxA[:,2], (mtxA[:,0]-1, mtxA[:,1]-1)), shape = shapeA).transpose()

    # Load Genes and Cells identifiers
    obs = pd.read_csv(path+'Velocyto/raw/barcodes.tsv',
                  header = None, index_col = 0)

    # Remove index column name to make it compliant with the anndata format
    obs.index.name = None

    var = pd.read_csv(path+'Velocyto/raw/features.tsv', sep='\t',
                                    names = ('gene_ids', 'feature_types'), index_col = 1)
  
    # Build AnnData object to be used with ScanPy and ScVelo
    adata = anndata.AnnData(X = X, obs = obs, var = var,
                                                 layers = {'spliced': spliced, 'unspliced': unspliced, 'ambiguous': ambiguous})
    adata.var_names_make_unique()

    # Subset Cells based on STAR filtering
    selected_barcodes = pd.read_csv(path+'Gene/filtered/barcodes.tsv', header = None)
    adata = adata[selected_barcodes[0]]

    return adata.copy()

Just for reference/completion, here is a python way to do that:

import numpy as np
from scipy import sparse
mtx = np.loadtxt('matrix.mtx', skiprows=3, delimiter=' ')

spliced = sparse.coo_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))
unspliced = sparse.coo_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))
ambiguous = sparse.coo_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))

The shape (60558,737280) is taken from the matrix header Thanks to @msbentsen 😃

Thanks to @ftucos, I came up with this new way to implement our spliced, unspliced and ambiguous file to our data, just some very simple codes for the starters.

And many thanks for team STAR for this wonderful development so we don’t have to generate a loom file for scVelo.

 import scvelo as scv
 import scanpy as sc
 import sys
 import numpy as np
 from scipy import sparse

def creation_newadata(path):
'''load_adata is a repertory which contains all of our barcodes file, features file and matrix file'''

path=path
adata = sc.read_10x_mtx(path+'Gene/raw/')

spliced=np.loadtxt(path+'Velocyto/raw/spliced.mtx', skiprows=3, delimiter=' ')
shape = np.loadtxt(path+'Velocyto/raw/spliced.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)

adata.layers['spliced']=sparse.csr_matrix((spliced[:,2], (spliced[:,0]-1, spliced[:,1]-1)), shape = (shape)).tocsr().T

unspliced=np.loadtxt(path+'Velocyto/raw/unspliced.mtx', skiprows=3, delimiter=' ')
adata.layers['unspliced']=sparse.csr_matrix((unspliced[:,2], (unspliced[:,0]-1, unspliced[:,1]-1)), shape = (shape)).tocsr().T

ambiguous= np.loadtxt(path+'Velocyto/raw/ambiguous.mtx', skiprows=3, delimiter=' ')
adata.layers['ambiguous']=sparse.csr_matrix((ambiguous[:,2], (ambiguous[:,0]-1, ambiguous[:,1]-1)), shape = (shape)).tocsr().T

return adata

With this data, we can use it for scanpy calculate as same as scVelo.

Cheers XIN

Hi Alex, that’s great and really helpful. I think adding the combined matrix is also quite useful.

Thanks a lot

Hi All,

the 2.7.8a release now outputs separate matrices for Velocyto spliced/unspliced/ambiguous counts. Not sure if this is very helpful, especially if you already have your pipelines running, but I thought that it would be a bit less confusing. If necessary, I could add an option to output the combined matrix.mtx.

Cheers Alex

I see high correlation between the results of velocyto (input was the Cell Ranger .bam file) and STARsolo, both for spliced and unspliced counts. Less so for kallisto. There are some interesting differences in the number of transcripts and detected genes.

Number of transcripts:

number_of_transcripts_spliced_unspliced

Number of detected genes:

number_of_expressed_genes_spliced_unspliced

Correlation of spliced counts (by genes; 0’s are set to 0.1 for log-scale):

correlation_genes_number_of_transcripts_spliced

Correlation of unspliced counts (by genes; 0’s are set to 0.1 for log-scale):

correlation_genes_number_of_transcripts_unspliced

Ohhhh, I see. I guess in that case I can reproduce the sparse matrix like this:

t <- readr::read_delim('matrix.mtx', delim = ' ', skip = 3, col_names = FALSE)

spliced <- Matrix::sparseMatrix(
  i = t$X1,
  j = t$X2,
  x = t$X3,
  dims = c(60609, 6794880)
)
Matrix::writeMM(spliced, 'spliced.mtx')

unspliced <- Matrix::sparseMatrix(
  i = t$X1,
  j = t$X2,
  x = t$X4,
  dims = c(60609, 6794880)
)
Matrix::writeMM(unspliced, 'unspliced.mtx')

The matrix dimensions are taken from the header of the matrix.mtx file. It’s looking good so far. No idea what the scanpy method gave me then 😄

@bapoorva It’s the label issue.

https://github.com/yyoshiaki/mishima_gassyuku/blob/master/README.md

I followed this one.

import loompy

ds = loompy.connect(“out.loom”) ds.row_attrs[‘Gene’] = ds.row_attrs[‘var_names’] ds.col_attrs[‘CellID’] = ds.col_attrs[‘obs_names’] ds.close()

Then it can be imported by Seurat.

Certainly, this works as well. What I like about the scvelo.utils.merge is, that it will get rid of the unmatched cells in spliced and unspliced. This way, you can use the Gene/filtered matrices…

@Jachimdejonghe using the AnnData constructor, you can initialise the object with all three layers:

adata = anndata.AnnData(layers = {'spliced': spliced, 'unspliced': unspliced, 'ambiguous': ambiguous})

Afterwards you can merge with an existing anndata. Is this what you were looking for? Ref: https://anndata.readthedocs.io/en/stable/anndata.AnnData.html#anndata.AnnData

Hi Roman,

thanks for looking into it… it seems you are going to be the alpha-tester for this feature! 😃 Sorry for the lack of documentation, and for the breaking of the matrix market format. Basically, I dumped the spliced, unspliced and ambiguous counts as three columns in this file i.e. col3=spliced, col4=unspliced, col5=ambiguous

Cheers Alex