Score a corrected or uncorrected PCA to estimate batch mixing

A batch-wise kernel density estimate is computed for each dimension of the PCA. Then, the global area shared by all kernels is calculated and divided by a reference area (see Arguments and Details) as a proxy to batch mixing. Finally, the aforementioned proportions are weighted by each dimension's contribution to variance.

Usage

ScoreDensityPC(
  object,
  batch.var = NULL,
  reduction = "pca",
  dims = NULL,
  use.union = TRUE,
  bw.join = mean,
  bw.kernel = c("nrd0", "nrd", "ucv", "bcv", "sj"),
  weight.by = c("var", "stdev"),
  assay = NULL,
  layer = NULL
)

AddScoreDensityPC(
  object,
  integration,
  batch.var = NULL,
  reduction = "pca",
  dims = NULL,
  use.union = TRUE,
  bw.join = mean,
  bw.kernel = c("nrd0", "nrd", "ucv", "bcv", "sj"),
  weight.by = c("var", "stdev"),
  assay = NULL,
  layer = NULL
)

Arguments

object: A Seurat object
batch.var: The name of the batch variable (must be in the object metadata)
reduction: The name of the reduction to score
dims: The dimensions to consider. All dimensions are used by default
use.union: Whether to use a union of densities (constructs a fake density with pmax). The alternative is to use all density kernels and sum them
bw.join: A function to pick a joint bandwidth. The default is to use the mean(). You can also chose median for instance. If you want to use a distinct bandwidth for each batch, just set bw.join = c (not recommended)
bw.kernel: Bandwidth selector to use for Gaussian kernel density estimation. One of 'nrd0', 'nrd', 'ucv','bcv' or 'sj' (see bandwidths). nrd0 is the default
weight.by: one of 'var' (default) or 'stdev' (standing for variance and standard deviation respectively). Use the variance or the standard deviation explained by the principal components to weight the each PC's score.
assay: assay to use. Passed to Seurat to automatically construct the batch.var when not provided. Useless otherwise
layer: layer to use. Passed to Seurat to automatically construct the batch.var when not provided. Useless otherwise
integration: name of the integration to score

Value

ScoreDensityPC: A single float corresponding to the score of the given reduction

AddScoreDensityPC: the updated Seurat object with the density PCA score set for the integration.

Details

The score is computed as follow : $$\sum_{i=1}^{p} \left ( \cfrac{ \bigcup_{b=1}^n D_{bi} }{ Ref_i } * V_i \right )$$

For a PCA with p dimensions and n batchs. $D_{bi}$ is the vector of density of batch b on the dimension i. $V_i$ is the proportion of variance explained by the $PC_i$.

with $Ref = \bigcap_{i=1}^n D_{bi}$ or $Ref = \sum_{i=1}^n D_{bi}$ when use.union is TRUE and FALSE respectively, for a given dimension $i$.

use.union = FALSE tends to result in lower score since the reference is the sum of kernel densities instead of maximum

Note

This score is an adaptation of the principal component regression (PCR) score from Luecken M.D. et al., 2022.

References

Luecken, M. D., Büttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Mueller, M. F., Strobl, D. C., Zappia, L., Dugas, M., Colomé-Tatché, M. & Theis, F. J. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods 19, 41–50 (2021). DOI

Examples

if (FALSE) { # \dontrun{
obj <- SeuratData::LoadData("pbmcsca")
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$Method)
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)

score.union <- ScoreDensityPC(obj, "Method", "pca", dim = 1:30)
score.sum <- ScoreDensityPC(obj, "Method", "pca", dim = 1:30, use.union = FALSE)

score.union   # ~ 0.1511
score.sum     # ~ 0.0319
} # }