kBET: High rejection rates for seemingly well-integrated data

Hello there. I have been trying to get this package working for quite some time now, but the rejection rates are always 1 or close to 1, regardless of whether the data have been batch-corrected. I would greatly appreciate feedback on this issue.

The data before conversion to a matrix is a [cell by gene] data.frame whose last two columns give the batch and cell type as categorical variables. The data cover a single cell type collected over three separate batches, and the batches contain similar numbers of observations.
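As a toy stand-in for this layout (the gene values and the celltype column are purely illustrative; only the 'sample' column name matches the code below):

# toy example of the described layout: 50 gene columns, then batch and cell type
set.seed(1)
original <- data.frame(matrix(rnorm(300 * 50), nrow = 300),
                       sample = rep(c("batch1", "batch2", "batch3"), each = 100),
                       celltype = "celltypeA")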

Example:

PCA before correction: [figure]

kBET before correction: [figure]

PCA after correction: [figure]

kBET after correction: [figure]

Code:

# kBET for uncorrected data
library(kBET)
library(FNN)

# coerce to matrix, dropping the two categorical columns (batch, cell type)
data <- as.matrix(original[, 1:(ncol(original) - 2)])
# coerce batch labels (the 'sample' column) to a factor
batch <- as.factor(original$sample)
# neighbourhood size: mean batch size
k0 <- floor(mean(table(batch)))
knn <- get.knn(data, k = k0, algorithm = 'cover_tree')
# compute and plot rejection rates
batch.estimate <- kBET(data, batch, plot = TRUE, do.pca = TRUE, dim.pca = 2)

# kBET for corrected data
# coerce to matrix, dropping the two categorical columns (batch, cell type)
data <- as.matrix(corrected[, 1:(ncol(corrected) - 2)])
# coerce batch labels (the 'sample' column) to a factor
batch <- as.factor(corrected$sample)
# neighbourhood size: mean batch size
k0 <- floor(mean(table(batch)))
knn <- get.knn(data, k = k0, algorithm = 'cover_tree')
# compute and plot rejection rates
batch.estimate <- kBET(data, batch, plot = TRUE, do.pca = TRUE, dim.pca = 2)

I also tried iterating over many different values of k (recomputing the knn graph each time, as suggested), yet the rejection rate saturates and remains high throughout.

Assessment over varying values of k (corrected data): [figure]
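For reference, a sketch of that sweep (the k values are arbitrary; it assumes kBET's k0 argument sets the neighbourhood size and that the summary table is laid out as in the README, so adjust the field name if your version differs):

# sweep neighbourhood sizes and record the mean observed rejection rate
k.values <- floor(seq(0.1, 1, by = 0.1) * mean(table(batch)))
rejection <- sapply(k.values, function(k) {
  est <- kBET(data, batch, k0 = k, plot = FALSE, do.pca = TRUE, dim.pca = 2)
  est$summary$kBET.observed[1]  # mean observed rejection rate
})
plot(k.values, rejection, type = 'b',
     xlab = 'neighbourhood size k0', ylab = 'kBET rejection rate')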

Best regards, Dean

Most upvoted comments

Dear Ray,

Dear @mbuttner, I've read through several issues in this repo and I feel that we really need some clarification from you.

  1. In your publication, you showed in many figures that the so-called “acceptance rate” decreased after the batch effect was corrected. But in the kBET package there is no reference to an “acceptance rate”, only a “rejection rate”. How do I obtain an “acceptance rate” for my data?

Thank you for pointing this out. In the publication, we used acceptance rate = 1 - rejection rate. This is a rescaling such that low values (close to 0) correspond to “bad” results and high values (close to 1) correspond to “good” results. However, as the foundation of kBET is a statistical test (and therefore, there is no such thing as acceptance in hypothesis testing), the kBET package itself only reports rejection rates.
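For example, a minimal conversion on the kBET output (assuming the summary layout described in the README, where the first row holds the mean over repetitions):

# mean observed rejection rate reported by kBET
rejection <- batch.estimate$summary$kBET.observed[1]
# acceptance rate as used in the publication
acceptance <- 1 - rejection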

  2. You mentioned kBET only accepts a dense matrix or a knn-graph as input (Dose kBET allow normalized matrix as input? #56), and you also mentioned we should use an assay from a SingleCellExperiment object as input (Working on SingleCellExperiment object. #58). But typically, an assay from a SingleCellExperiment IS a SPARSE matrix. In data matrix for kBET evaluation #49, someone asked about the data format to use. I feel this is by far the most ambiguous part of using kBET. In the kBET vignette, kBET was applied to count data, CPM-normalized data, and limma batch-corrected data. What exactly should we use as the input data?

Thank you for your question. In general, we implemented kBET such that it accepts several types of input and does not depend on a feature matrix per se. The standard workflow is as follows: kBET requires a dense matrix as input and then computes a knn-graph from it. The FNN package, which does the knn computation, requires a dense matrix as input. This is indeed less than optimal. If the assay you want to use from a SingleCellExperiment object is a sparse matrix, please consider converting it into a dense matrix. If, for some reason, you have a large data matrix and computing nearest neighbours with kBET (and FNN, respectively) takes too long, you can use a different method to compute the knn-graph instead. If you provide a knn-graph as additional input, kBET skips the computation of the knn-graph and computes the rejection rates directly on the given graph (see the “Variations” section in the README):

require('FNN')
# data: a matrix (rows: samples, columns: features (genes))
k0=floor(mean(table(batch))) #neighbourhood size: mean batch size 
knn <- get.knn(data, k=k0, algorithm = 'cover_tree')

#now run kBET with pre-defined nearest neighbours.
batch.estimate <- kBET(data, batch, k = k0, knn = knn)

Please note that in this case a data matrix of the same size as your original data is still required, but it is not used (apart from determining the number of rows and columns).
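For the sparse-assay case above, a minimal sketch of the conversion (the assay name 'logcounts' and the colData column 'batch' are assumptions; the transpose is needed because SingleCellExperiment assays store genes x cells while kBET expects cells x genes):

library(SingleCellExperiment)
# densify and transpose the (genes x cells) sparse assay for kBET
data <- t(as.matrix(assay(sce, "logcounts")))
batch <- as.factor(sce$batch)
batch.estimate <- kBET(data, batch)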

  3. For my own purposes, in my single-cell analysis I performed batch correction only on a subset of genes (highly variable genes), so my data is a reduced matrix with only a few thousand genes. Can I use this matrix as input?

Yes, you can use a reduced matrix as input. In fact, we observed that using highly variable genes tends to improve the batch effect correction in most methods.
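For example, with a hypothetical character vector hvg_genes of selected highly variable gene names:

# restrict the (cells x genes) matrix to the highly variable genes, then run kBET
data_hvg <- data[, colnames(data) %in% hvg_genes]
batch.estimate <- kBET(data_hvg, batch)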

Thanks, Ray

Please let me know if you have further questions.

Best, Maren