Install and load the data
We will use the pbmcsca dataset available from the
package SeuratData.
# install `SeuratData` package (if not yet)
if (! requireNamespace("SeuratData", quietly = TRUE)) {
  devtools::install_github('satijalab/seurat-data')
}
# increase download timeout and install the dataset
options(timeout = 300)
SeuratData::InstallData('pbmcsca')
# load the dataset (keep the first 1,000 cells to speed up execution)
seu <- SeuratData::LoadData('pbmcsca')[, 1:1e3]
## Warning: Assay RNA changing from Assay to Assay5
## Warning: Assay RNA changing from Assay to Assay5
Inspect the dataset
Have a look at the metadata:
# rmarkdown::paged_table -> prints data frames in a less ugly way than default
rmarkdown::paged_table(head(seu[[]], n = 10))
The column Method provides information about the single-cell technology used to acquire each cell’s transcriptome. We consider it as the batch of origin.
table(seu$Method)
## 
## 10x Chromium (v2) A            CEL-Seq2          Smart-seq2 
##                 494                 253                 253
The column CellType contains cell-type annotations that we will use later on.
rmarkdown::paged_table(as.data.frame(table(seu$CellType)))
Let’s now define the batch variable and the cell-type variable(s):
batch.var <- 'Method'   # available in the metadata of the object
cell.var <- 'CellType'  # available in the metadata of the object
Seurat data processing - Standard workflow
The current Seurat object is not split, meaning that each layer contains cells from all batches.
## Layers before split: counts, data
We select the column Method to separate the cells into batch-specific layers. This step is required to run the integration methods and makes it possible to normalise each batch (i.e. layer) independently.
seu[['RNA']] <- split(seu[['RNA']], f = seu$Method)
cat('Layers after split:', paste(Layers(seu), collapse = ", "), '\n')
## Layers after split: counts.Smart-seq2, counts.CEL-Seq2, counts.10x_Chromium_v2_A, data.Smart-seq2, data.CEL-Seq2, data.10x_Chromium_v2_A
Then, we proceed with the standard Seurat workflow until we obtain the PCA reduction.
seu <- SCTransform(seu)
seu <- RunPCA(seu, verbose = F)
seu <- FindNeighbors(seu, reduction = "pca", dims = 1:30, k.param = 20L)
We now further process the object until we can visualise the dispersion of cells on a UMAP dimension reduction.
seu <- FindClusters(seu, graph.name = 'SCT_snn', resolution = .5)
seu <- RunUMAP(seu, dims = 1:30, reduction = 'pca')
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 1000
## Number of edges: 43784
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8717
## Number of communities: 8
## Elapsed time: 0 seconds
DimPlot(seu, label = T) + NoLegend() + ggplot2::coord_fixed(ratio = .7)
Let’s colour cells according to batch and cell-type label.
DimPlot(seu, group.by = batch.var) +
  DimPlot(seu, group.by = cell.var) & ggplot2::coord_fixed(ratio = .7)
Albeit moderately, the dispersion of cells on the UMAP seems to be influenced by a batch effect. Judging by the CellType variable, B cells from 10x Chromium and CEL-Seq2 do not cluster together, and monocytes do not fully overlap.
Now, we want to correct these technical differences.
Integrate batches - SeuratIntegrate workflow
The integration commands are to some extent similar to those in the Seurat V5 vignette. The purpose of this package is to extend the set of available integration methods. See the bottom of ?IntegrateLayers for a comprehensive list of relevant methods.
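For instance, a quick way to list the integration wrappers exported by the package is to rely on their naming convention (a small sketch; the exact set of names depends on your installed version):
# list exported functions whose names end in "Integration"
grep("Integration$", getNamespaceExports("SeuratIntegrate"), value = TRUE)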
Many methods supported by SeuratIntegrate are implemented in Python,
and the wrappers provided rely on the reticulate package
and conda environments. If you are not familiar with the
CondaEnvManager, have a look at the
vignette("setup_and_tips").
Here are some important considerations before performing an integration:
- use the interface DoIntegrate() from SeuratIntegrate to integrate the cell batches (see the next section)
- check the method-specific arguments (e.g. ?bbknnIntegration)
- set up conda environments to use Python-based methods:
  - with the CondaEnvManager via UpdateEnvCache() (check the vignette("setup_and_tips"))
  - with a custom conda environment, by passing its path - or its name - to the conda_env parameter (see the sketch below). This overrides the default behaviour by loading the specified environment instead of fetching the CondaEnvManager’s cache.
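For example, a Python-based wrapper can be pointed at a custom environment. A minimal sketch, where the environment name "scvi-tools" is hypothetical and must match an environment available on your machine:
seu <- DoIntegrate(
  seu,
  # hypothetical conda environment name; adjust to your setup
  scANVIIntegration(groups = seu[[]], groups.name = "Method",
                    layers = "counts", conda_env = "scvi-tools"),
  use.future = TRUE
)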
Some methods expect a raw count matrix, others the scaled counts of the variable features, etc. To help you choose, have a look at the table below:
| Method | Layer | Features | 
|---|---|---|
| ComBat | data | any but better with restricted number of features (e.g. variable) | 
| Harmony | N/A (PCA reduction) | N/A (PCA reduction) | 
| MNN | data | any but better with restricted number of features (e.g. variable) | 
| bbknn | N/A (PCA reduction) | N/A (PCA reduction) | 
| scVI / scANVI | counts | all | 
| Scanorama | counts or data | any | 
| trVAE | data (recon loss “mse”), counts (otherwise) | all | 
Layers:
- counts: raw counts
- data: normalised counts
- scale.data: scaled normalised counts of variable features
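As a quick check, individual layers can be inspected with LayerData() from SeuratObject. A short sketch, using one of the batch-specific layer names created by the split above:
# peek at the raw counts of the 10x batch (layer created by the earlier split)
counts.10x <- SeuratObject::LayerData(seu, assay = "RNA",
                                      layer = "counts.10x_Chromium_v2_A")
dim(counts.10x)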
/!\ IMPORTANT /!\ To use all features when calling an integration method with IntegrateLayers(), run:
IntegrateLayers(object, features = Features(object), scale.layer = NULL)
This does not work for an SCTAssay.
DoIntegrate philosophy
The function DoIntegrate() is a very handy way to run multiple integrations in a single command. Note that:
- ... is the place to specify the integration commands to run.
- integration commands are expected to be function calls, i.e. of the form FooIntegration(), BarIntegration(), etc.
- calls accept method-specific arguments (e.g. FooIntegration(layers = "data"))
- use.hvg = TRUE results in using variable features
- use.future = TRUE is useful to run Python-based methods (a normal R session cannot load more than one conda instance, while future enables launching background sessions, which prevents the main one from loading any conda environment). It is highly recommended to set it to FALSE for R-based methods.
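Putting these points together, a minimal DoIntegrate() call might look like the sketch below (the complete calls used in this vignette follow in the next chunks):
# sketch: a single R-based integration, restricted to variable features
seu <- DoIntegrate(
  object = seu,
  SeuratIntegrate::HarmonyIntegration(orig = "pca", dims = 1:30),
  use.hvg = TRUE,     # use variable features
  use.future = FALSE  # recommended for R-based methods
)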
Most integration methods can be used without modifying default parameters. In this vignette, we will change some arguments to meet our needs. Notably, we will change the number of cores allocated to each method (where possible).
ncores <- parallel::detectCores() - 2
In this vignette, we are going to use 3 Python-based methods, namely BBKNN, Scanorama and scANVI from the scvi-tools suite. Let’s make sure they are available straight away:
# BBKNN
if (! isValid(getCache()$bbknn)) {
  UpdateEnvCache("bbknn")
}
# Scanorama
if (! isValid(getCache()$scanorama)) {
  UpdateEnvCache("scanorama")
}
# scvi-tools
if (! isValid(getCache()$scanvi)) {
  UpdateEnvCache("scanvi")
}
 Let’s proceed to a few batch-effect corrections:
Integration part 1) Using R-based methods
seu <- DoIntegrate(
  object = seu,
  SeuratIntegrate::HarmonyIntegration(orig = "pca", dims = 1:30,
                                      ncores = ncores),
  CCAIntegration(orig = "pca", dims = 1:30 , new.reduction = "cca.integrated",
                 normalization.method = "SCT"),
  RPCAIntegration(orig = "pca", dims = 1:30, new.reduction = "rpca.integrated",
                  normalization.method = "SCT"),
  use.future = FALSE  # R-based
)
## Integration 1 in 3: integrating using 'SeuratIntegrate::HarmonyIntegration'
## Integration 2 in 3: integrating using 'CCAIntegration'
## Integration 3 in 3: integrating using 'RPCAIntegration'
Note: We use SeuratIntegrate:: before HarmonyIntegration to avoid any confusion with Seurat::HarmonyIntegration().
Integration part 2) Using Python-based methods
seu <- DoIntegrate(
  object = seu,
  bbknnIntegration(orig = "pca", layers = "data", ndims = 30),
  ScanoramaIntegration(orig = "pca", ncores = ncores),
  scANVIIntegration(groups = seu[[]], groups.name = "Method",
                    labels.name = "CellType", layers = "counts",
                    torch.intraop.threads = ncores,
                    torch.interop.threads = ncores,
                    max_epochs = 20L),
  use.future = TRUE,  # Python-based
  use.hvg = c(TRUE, TRUE, FALSE)
)
## Integration 1 in 3: integrating using 'bbknnIntegration'
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 1000
## Number of edges: 6848
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.6753
## Number of communities: 10
## Elapsed time: 0 seconds
## Integration 2 in 3: integrating using 'ScanoramaIntegration'
## Integration 3 in 3: integrating using 'scANVIIntegration'
Note: setting max_epochs = 20L for scANVIIntegration is only to save time! The default number of epochs (400) results in a superior integration.
If we take a look at our Seurat object, we can note that it has been enriched with many objects:
print(seu)
## An object of class Seurat 
## 69594 features across 1000 samples within 4 assays 
## Active assay: SCT (16450 features, 3000 variable features)
##  3 layers present: counts, data, scale.data
##  3 other assays present: RNA, bbknn.ridge, scanorama.reconstructed
##  8 dimensional reductions calculated: pca, umap, harmony, cca.integrated, rpca.integrated, pca.bbknn, integrated.scanorama, integrated.scANVI
cat("Graph objects:", paste(Graphs(seu), collapse = ", "), "\n")
cat("Neighbor objects:", paste(Neighbors(seu), collapse = ", "), "\n")
cat("Reduction dimensions:", paste(Reductions(seu), collapse = ", "), "\n")
cat("Assays:", paste(Assays(seu), collapse = ", "), "\n")## Graph objects: SCT_nn, SCT_snn, bbknn_scale.data_connectivities, bbknn_scale.data_distances, bbknn_ridge.residuals_connectivities, bbknn_ridge.residuals_distances 
## Neighbor objects:  
## Reduction dimensions: pca, umap, harmony, cca.integrated, rpca.integrated, pca.bbknn, integrated.scanorama, integrated.scANVI 
## Assays: RNA, SCT, bbknn.ridge, scanorama.reconstructedGreat! We have successfully performed several integrations! However, stopping here would be unsatisfactory because we still need to process each integration’s output(s) to obtain at least one UMAP projection for each. Here, we will also aim at generating assessable representations to score.
Score integrations
Process outputs
Several objects can be produced by each integration algorithm, namely a layer in a new assay (i.e. corrected counts), a dimension reduction (corrected embedding), or a knn network. Some even produce more than one output (for instance Scanorama produces corrected counts and a dimension reduction).
The type of output is important to consider, because scoring metrics are not compatible with all output types. The simplest strategy is to process each output separately in order to obtain at least a PCA out of it, or even a knn graph (essential to compute clusters). Note that most scores cannot be computed on knn graphs, hence knn graph outputs (e.g. BBKNN) can only be evaluated by a reduced set of metrics.
Below is a summary of post-processing steps for each output type (bracketed steps are not always necessary):
- corrected counts: [ScaleData()] -> RunPCA() -> [FindNeighbors() -> FindOptimalClusters()]
- dimension reduction: [RunPCA()] -> [FindNeighbors() -> FindOptimalClusters()]
- knn graph: [FindOptimalClusters()]
RunPCA() is sometimes run even on dimension reduction
objects (within scoring functions) because some scores require a
variance associated with each dimension.
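For instance, the per-dimension variance that these scores rely on can be inspected directly. A quick sketch using Seurat’s Stdev():
# fraction of the variance carried by each principal component
pc.var <- Stdev(seu, reduction = "pca")^2
head(pc.var / sum(pc.var))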
Let’s process all the outputs. Here, we will go through all the steps for a more exhaustive demonstration. Note, however, that skipping the final step FindOptimalClusters() makes the neighbour-graph computation step (FindNeighbors()) unnecessary. In that case, one forgoes two scoring metrics, namely ARI and NMI.
Here, we use SymmetrizeKnn() between FindNeighbors() and FindOptimalClusters() because we set return.neighbor = TRUE in FindNeighbors(). This keeps the distances between cells in the KNN graph, whereas FindNeighbors() by default converts the KNN graph to an adjacency matrix of 0s and 1s and to a SNN network with values bounded between 0 and 1. This is not compulsory, but it stays in line with BBKNN’s output. To prevent the community detection algorithm from outputting a high fraction of singletons, we “symmetrize” the matrix, which makes the graph “undirected”.
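Conceptually, the symmetrisation makes the directed KNN graph undirected: if cell i is among the neighbours of cell j, the reverse is enforced too. A toy sketch on a dense matrix (not the package’s actual implementation):
# toy 3-cell example: a directed edge 1 -> 2 with distance 0.5
A <- matrix(0, nrow = 3, ncol = 3)
A[1, 2] <- 0.5
A.sym <- pmax(A, t(A))  # both directions now carry the distance ("undirected")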
# corrected counts outputs
DefaultAssay(seu) <- "scanorama.reconstructed"
seu <- ScaleData(seu)
seu <- RunPCA(seu, npcs = 50L, reduction.name = "pca.scanorama_counts")
seu <- FindNeighbors(seu, reduction = "pca.scanorama_counts", dims = 1:30, return.neighbor = TRUE,
                     graph.name = "knn.scanorama_counts")
seu <- SymmetrizeKnn(seu, graph.name = "knn.scanorama_counts")
seu <- FindOptimalClusters(seu, graph.name = "knn.scanorama_counts_symmetric",
                           cluster.name = "scanorama_counts_{cell.var}_{metric}",
                           cell.var = cell.var,
                           optimisation.metric = c("nmi", "ari")) # default, compute both
# dimension reduction outputs
DefaultAssay(seu) <- "SCT"
seu <- FindNeighbors(seu, reduction = "pca", dims = 1:30, k.param = 20L,
                     return.neighbor = TRUE, graph.name = "knn.unintegrated")
seu <- SymmetrizeKnn(seu, graph.name = "knn.unintegrated")
seu <- FindOptimalClusters(seu, graph.name = "knn.unintegrated_symmetric",
                           cluster.name = "unintegrated_{cell.var}_{metric}",
                           cell.var = cell.var)
seu <- FindNeighbors(seu, reduction = "integrated.scanorama", dims = 1:30,
                     return.neighbor = TRUE, graph.name = "knn.scanorama_reduction")
seu <- SymmetrizeKnn(seu, graph.name = "knn.scanorama_reduction")
seu <- FindOptimalClusters(seu, graph.name = "knn.scanorama_reduction_symmetric",
                           cluster.name = "scanorama_reduction_{cell.var}_{metric}",
                           cell.var = cell.var)
seu <- FindNeighbors(seu, reduction = "harmony", dims = 1:30,
                     return.neighbor = TRUE, graph.name = "knn.harmony")
seu <- SymmetrizeKnn(seu, graph.name = "knn.harmony")
seu <- FindOptimalClusters(seu, graph.name = "knn.harmony_symmetric", cell.var = cell.var,
                           cluster.name = "harmony_{cell.var}_{metric}")
seu <- FindNeighbors(seu, reduction = "cca.integrated", dims = 1:30,
                     return.neighbor = TRUE, graph.name = "knn.cca")
seu <- SymmetrizeKnn(seu, graph.name = "knn.cca")
seu <- FindOptimalClusters(seu, graph.name = "knn.cca_symmetric", cell.var = cell.var,
                           cluster.name = "cca_{cell.var}_{metric}")
seu <- FindNeighbors(seu, reduction = "rpca.integrated", dims = 1:30,
                     return.neighbor = TRUE, graph.name = "knn.rpca")
seu <- SymmetrizeKnn(seu, graph.name = "knn.rpca")
seu <- FindOptimalClusters(seu, graph.name = "knn.rpca_symmetric", cell.var = cell.var,
                           cluster.name = "rpca_{cell.var}_{metric}")
seu <- FindNeighbors(seu, reduction = "integrated.scANVI",
                     return.neighbor = TRUE, graph.name = "knn.scanvi")
seu <- SymmetrizeKnn(seu, graph.name = "knn.scanvi")
seu <- FindOptimalClusters(seu, graph.name = "knn.scanvi_symmetric", cell.var = cell.var,
                           cluster.name = "scanvi_{cell.var}_{metric}")
# graph outputs
seu <- SymmetrizeKnn(seu, graph.name = "bbknn_ridge.residuals_distances")
seu <- FindOptimalClusters(seu, graph.name = "bbknn_ridge.residuals_distances_symmetric",
                           cell.var = cell.var, cluster.name = "bbknn_{cell.var}_{metric}")
Note: Instead of sticking with the default FindNeighbors(return.neighbor = FALSE) at the beginning, we could have switched it to TRUE right away, processed the KNN graph with SymmetrizeKnn() and used it for the subsequent steps (UMAP, clustering, etc.)
FindOptimalClusters() adds metadata columns with the
clustering results that maximized a metric (NMI or ARI) to the Seurat
object:
rmarkdown::paged_table(seu[[]][1:10, grep("CellType_[arinm]{3}$", colnames(seu[[]]))])
Add scores to the Seurat object
Now that we have computed objects ready to be scored, we will proceed with assessing each integration output. For this step, one can either use Score[score_name]() and save scores in separate variables, or use AddScore[score_name]() to directly store scores within the Seurat object. The latter is far more convenient and makes it possible to compare scores graphically. This is the strategy we are going to adopt here.
Last but not least, a cell-type variable is used in most scores; hence, it is highly recommended to have an estimate of each cell’s type stored as a column in the Seurat object’s metadata.
Note that without a variable of cell labels (used as ground truth in multiple scores), one will have to produce one or do without several scores. Alternatively, one can use automatic cell annotation algorithms (e.g. the Azimuth package). One can also have multiple cell-label variables (e.g. Azimuth typically returns as many cell-label variables as levels of annotation contained in the reference). Scores requiring cell-type annotations always accept multiple column names.
There is a risk of confusion between cell annotations when using automatic cell annotation tools. Furthermore, when using Azimuth, there is a specific risk of biasing the results in favour of RPCA integration, because Azimuth uses RPCA to integrate the query dataset onto the reference.
First of all, let’s organise outputs in lists:
reductions <- list(
  unintegrated = "pca",
  scanorama_counts = "pca.scanorama_counts",
  scanorama_reduction = "integrated.scanorama",
  harmony = "harmony",
  cca = "cca.integrated",
  rpca = "rpca.integrated",
  scanvi = "integrated.scANVI",
  bbknn = NULL
)
graphs <- list(
  unintegrated = "knn.unintegrated_symmetric",
  scanorama_counts = "knn.scanorama_counts_symmetric",
  scanorama_reduction = "knn.scanorama_reduction_symmetric",
  harmony = "knn.harmony_symmetric",
  cca = "knn.cca_symmetric",
  rpca = "knn.rpca_symmetric",
  scanvi = "knn.scanvi_symmetric",
  bbknn = "bbknn_ridge.residuals_distances_symmetric"
)
integrations <- names(reductions)
Let’s finalise our preparations:
- Make sure you have the kBET package installed to run ScoreKBET()
- To benefit from a faster implementation of the LISI algorithm, install the lisi package
- To benefit from a faster implementation of the Euclidean distance computation, install the distances package
- To run ScoreRegressPC.CellCycle() (score of cell-cycle conservation), run CellCycleScoringPerBatch() beforehand (the chunk after the installation sketch below)
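If needed, these helper packages can be installed as follows (a sketch assuming the usual CRAN and GitHub sources):
# kBET and lisi are distributed on GitHub, distances on CRAN
if (! requireNamespace("kBET", quietly = TRUE))
  devtools::install_github("theislab/kBET")
if (! requireNamespace("lisi", quietly = TRUE))
  devtools::install_github("immunogenomics/LISI")
if (! requireNamespace("distances", quietly = TRUE))
  install.packages("distances")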
seu <- CellCycleScoringPerBatch(seu, batch.var = batch.var,
                                s.features = cc.genes$s.genes,
                                g2m.features = cc.genes$g2m.genes)
Let’s now loop through the integration outputs:
for (integ in integrations) {
  reduc <- reductions[[integ]]
  graph <- graphs[[integ]]
  clust.var.ari <- paste(integ, cell.var, "ari", sep = "_") # produced by `FindOptimalClusters()`
  clust.var.nmi <- paste(integ, cell.var, "nmi", sep = "_") # produced by `FindOptimalClusters()`
  if (! is.null(reduc)) {   # all TRUE except bbknn
    seu <- AddScoreASW(seu, integration = integ, cell.var = cell.var,
                       what = reduc, dist.package = "distances")
    seu <- AddScoreASWBatch(seu, integration = integ, batch.var = batch.var,
                            cell.var = cell.var, what = reduc,
                            dist.package = "distances")
    seu <- AddScoreDensityPC(seu, integration = integ, batch.var = batch.var,
                             reduction = reduc)
    seu <- AddScoreRegressPC(seu, integration = integ, batch.var = batch.var,
                             reduction = reduc)
    seu <- AddScoreRegressPC.CellCycle(seu, integration = integ,
                                       batch.var = batch.var, what = reduc,
                                       compute.cc = FALSE) # because CellCycleScoringPerBatch was run before
  }
  seu <- AddScoreARI(seu, integration = integ, cell.var = cell.var,
                     clust.var = clust.var.ari)
  seu <- AddScoreNMI(seu, integration = integ, cell.var = cell.var,
                     clust.var = clust.var.nmi)
  seu <- AddScoreConnectivity(seu, integration = integ, graph.name = graph,
                              cell.var = cell.var)
  seu <- AddScoreKBET(seu, integration = integ, batch.var = batch.var,
                      cell.var = cell.var, what = reduc %||% sub("_symmetric$", "", graph),
                      graph.type = "distances", verbose = FALSE)
  seu <- AddScoreLISI(seu, integration = integ, batch.var = batch.var,
                      cell.var = cell.var, reduction = reduc,
                      graph.name = if (is.null(reduc)) sub("_symmetric$", "", graph) else NULL,
                      graph.type = "distances", ncores = ncores)
  
}
Now that we have computed multiple scores, we can look at them using IntegrationScores():
rmarkdown::paged_table(IntegrationScores(seu))
Compare the integrations
Note that those scores are raw. Let’s scale them (and make them comparable):
seu <- ScaleScores(seu)
Once again, we can print them:
rmarkdown::paged_table(IntegrationScores(seu, scaled = TRUE))
To readily compare the integrations, let’s plot the scores:
PlotScores(seu)
One might notice a difference in scale for some of the scores when comparing the plot with the table just above. This is the case for the PCA density score, for instance.
print(sort(IntegrationScores(seu, scaled = TRUE)$PCA.density))
## [1] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.06046605 0.12517481
Indeed, PlotScores() rescales the scores using min-max normalisation by default (rescale = TRUE). One might choose to disable it:
PlotScores(seu, rescale = FALSE)
We notice that the PCA-based scores are very low in this case. Since they cannot be computed on knn graphs, the overall scores are biased in favour of BBKNN. You can exclude some scores (and recompute the overall scores on the fly):
PlotScores(seu, rescale = FALSE, exclude.score = c("pca.density", "pca.regression"))
You can choose a different type of plot (radar or lollipop):
library(ggplot2)
PlotScores(seu, plot.type = "radar") +
  # reduce overlap between axis names and figures 
  theme(legend.position = "top", panel.spacing = unit(3, "cm"),
        plot.margin = margin(r = 3, l = 3, unit = "cm"),
        axis.text.x = element_text(size = 10))
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
In this last plot, we also exclude the non-integrated dataset. Since the rescale argument is TRUE by default, the scores are rescaled without the excluded integration’s scores.
  PlotScores(seu, plot.type = "lollipop",
             exclude.integration = "unintegrated")
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_errorbarh()`).
We want to compare the UMAP embeddings. For this, we first compute the dimension reductions:
for (integ in integrations) {
  reduc <- reductions[[integ]]
  if (! is.null(reduc)) { # all except BBKNN
    seu <- RunUMAP(seu, dims = 1:min(30, ncol(seu[[reduc]])), reduction = reduc,
                   reduction.name = paste0(integ, ".umap"))
  }
}
As BBKNN’s output is a graph, we need to use the umap Python package:
if (! reticulate::condaenv_exists('umap_0.5.4')) {
  reticulate::conda_create('umap_0.5.4', packages = 'umap-learn=0.5.4')
}
library(future)
plan(multisession)
seu %<-% {
  reticulate::use_condaenv('umap_0.5.4')
  RunUMAP(seu, graph = "bbknn_ridge.residuals_connectivities", umap.method = "umap-learn",
          n.epochs = 200L, reduction.name = "bbknn.umap") }
seu
## Warning: UNRELIABLE VALUE: Future (<unnamed-51>) unexpectedly generated random
## numbers without specifying argument 'seed'. There is a risk that those random
## numbers are not statistically sound and the overall results might be invalid.
## To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe
## random numbers are produced. To disable this check, use 'seed=NULL', or set
## option 'future.rng.onMisuse' to "ignore".
plan(sequential)
library(ggplot2)
plot_list <- sapply(integrations, function(integ) {
  DimPlot(seu, reduction = paste0(integ, ".umap"), group.by = batch.var) +
    ggtitle(integ) +
    theme(axis.title = element_blank())
}, simplify = FALSE)
patchwork::wrap_plots(plot_list, guides = "collect")
library(ggplot2)
plot_list <- sapply(integrations, function(integ) {
  DimPlot(seu, reduction = paste0(integ, ".umap"), group.by = cell.var) +
    ggtitle(integ) +
    theme(axis.title = element_blank())
}, simplify = FALSE)
patchwork::wrap_plots(plot_list, guides = "collect")
There are several observations to be made here that require further explanation:
- some cells seem to be assigned a wrong label in CellType, highlighting the importance of having cell annotations of sufficient quality to be considered suitable as ground truth (which is actually not the case here)
- all scaled PCA regression scores are set to zero. This is because the unintegrated dataset has the lowest raw PCA regression score. It is very likely the consequence of the SCT normalisation, which is much more efficient at masking batch-specific differences than the classical LogNorm. Thus, the inter-batch differences are not driving the principal components.
- ScaleScores() produces scores that can have different scales (as long as rescale = FALSE). Thus, min-max rescaling is used by default in PlotScores() to balance each score’s contribution to the overall scores (see the short sketch below). This is especially suited for comparing a large number of integrations. However, it has some drawbacks: it can heavily distort the scale of scores when their maximum or minimum are far from 1 or 0, respectively (e.g. all cLISI scores are above 0.9). Hence, the final decision on whether to use min-max rescaling is left to the user’s discretion.
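For reference, the min-max rescaling applied by PlotScores() corresponds to the usual formula, sketched here on the PCA density scores printed earlier:
# min-max rescaling maps a vector of scores onto [0, 1]
minmax <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
minmax(c(0, 0, 0, 0, 0, 0.06046605, 0.12517481))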
Session info
  ## R version 4.5.1 (2025-06-13)
  ## Platform: x86_64-pc-linux-gnu
  ## Running under: Ubuntu 24.04.3 LTS
  ## 
  ## Matrix products: default
  ## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
  ## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
  ## 
  ## locale:
  ##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
  ##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
  ##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
  ## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
  ## 
  ## time zone: UTC
  ## tzcode source: system (glibc)
  ## 
  ## attached base packages:
  ## [1] stats     graphics  grDevices utils     datasets  methods   base     
  ## 
  ## other attached packages:
  ## [1] ggplot2_3.5.2         future_1.67.0         SeuratIntegrate_0.4.1
  ## [4] Seurat_5.3.0          SeuratObject_5.2.0    sp_2.2-0             
  ## 
  ## loaded via a namespace (and not attached):
  ##   [1] RcppAnnoy_0.0.22            splines_4.5.1              
  ##   [3] later_1.4.4                 batchelor_1.24.0           
  ##   [5] tibble_3.3.0                polyclip_1.10-7            
  ##   [7] XML_3.99-0.19               fastDummies_1.7.5          
  ##   [9] lifecycle_1.0.4             edgeR_4.6.3                
  ##  [11] globals_0.18.0              lattice_0.22-7             
  ##  [13] MASS_7.3-65                 backports_1.5.0            
  ##  [15] magrittr_2.0.3              limma_3.64.3               
  ##  [17] plotly_4.11.0               sass_0.4.10                
  ##  [19] rmarkdown_2.29              jquerylib_0.1.4            
  ##  [21] yaml_2.3.10                 httpuv_1.6.16              
  ##  [23] glmGamPoi_1.20.0            sctransform_0.4.2          
  ##  [25] spam_2.11-1                 spatstat.sparse_3.1-0      
  ##  [27] reticulate_1.43.0           DBI_1.2.3                  
  ##  [29] cowplot_1.2.0               pbapply_1.7-4              
  ##  [31] RColorBrewer_1.1-3          ResidualMatrix_1.18.0      
  ##  [33] abind_1.4-8                 Rtsne_0.17                 
  ##  [35] GenomicRanges_1.60.0        purrr_1.1.0                
  ##  [37] BiocGenerics_0.54.0         tweenr_2.0.3               
  ##  [39] rappdirs_0.3.3              sva_3.56.0                 
  ##  [41] GenomeInfoDbData_1.2.14     IRanges_2.42.0             
  ##  [43] S4Vectors_0.46.0            ggrepel_0.9.6              
  ##  [45] irlba_2.3.5.1               listenv_0.9.1              
  ##  [47] spatstat.utils_3.1-5        genefilter_1.90.0          
  ##  [49] pbmcsca.SeuratData_3.0.0    goftest_1.2-3              
  ##  [51] RSpectra_0.16-2             spatstat.random_3.4-1      
  ##  [53] annotate_1.86.1             fitdistrplus_1.2-4         
  ##  [55] parallelly_1.45.1           pkgdown_2.1.3              
  ##  [57] DelayedMatrixStats_1.30.0   codetools_0.2-20           
  ##  [59] DelayedArray_0.34.1         ggforce_0.5.0              
  ##  [61] scuttle_1.18.0              tidyselect_1.2.1           
  ##  [63] UCSC.utils_1.4.0            farver_2.1.2               
  ##  [65] ScaledMatrix_1.16.0         matrixStats_1.5.0          
  ##  [67] stats4_4.5.1                spatstat.explore_3.5-2     
  ##  [69] jsonlite_2.0.0              BiocNeighbors_2.2.0        
  ##  [71] progressr_0.15.1            ggridges_0.5.7             
  ##  [73] survival_3.8-3              systemfonts_1.2.3          
  ##  [75] tools_4.5.1                 ragg_1.4.0                 
  ##  [77] ica_1.0-3                   Rcpp_1.1.0                 
  ##  [79] glue_1.8.0                  gridExtra_2.3              
  ##  [81] SparseArray_1.8.1           mgcv_1.9-3                 
  ##  [83] xfun_0.53                   MatrixGenerics_1.20.0      
  ##  [85] GenomeInfoDb_1.44.2         dplyr_1.1.4                
  ##  [87] withr_3.0.2                 fastmap_1.2.0              
  ##  [89] digest_0.6.37               rsvd_1.0.5                 
  ##  [91] R6_2.6.1                    mime_0.13                  
  ##  [93] textshaping_1.0.2           scattermore_1.2            
  ##  [95] tensor_1.5.1                kBET_0.99.6                
  ##  [97] RSQLite_2.4.3               spatstat.data_3.1-8        
  ##  [99] RhpcBLASctl_0.23-42         tidyr_1.3.1                
  ## [101] generics_0.1.4              data.table_1.17.8          
  ## [103] FNN_1.1.4.1                 httr_1.4.7                 
  ## [105] htmlwidgets_1.6.4           S4Arrays_1.8.1             
  ## [107] uwot_0.2.3                  pkgconfig_2.0.3            
  ## [109] gtable_0.3.6                blob_1.2.4                 
  ## [111] lmtest_0.9-40               SingleCellExperiment_1.30.1
  ## [113] XVector_0.48.0              htmltools_0.5.8.1          
  ## [115] dotCall64_1.2               scales_1.4.0               
  ## [117] Biobase_2.68.0              SeuratData_0.2.2.9002      
  ## [119] png_0.1-8                   lisi_1.0                   
  ## [121] harmony_1.2.3               spatstat.univar_3.1-4      
  ## [123] knitr_1.50                  reshape2_1.4.4             
  ## [125] nlme_3.1-168                cachem_1.1.0               
  ## [127] zoo_1.8-14                  stringr_1.5.1              
  ## [129] KernSmooth_2.23-26          parallel_4.5.1             
  ## [131] miniUI_0.1.2                AnnotationDbi_1.70.0       
  ## [133] desc_1.4.3                  pillar_1.11.0              
  ## [135] grid_4.5.1                  vctrs_0.6.5                
  ## [137] RANN_2.6.2                  promises_1.3.3             
  ## [139] BiocSingular_1.24.0         distances_0.1.12           
  ## [141] beachmat_2.24.0             xtable_1.8-4               
  ## [143] cluster_2.1.8.1             evaluate_1.0.5             
  ## [145] locfit_1.5-9.12             cli_3.6.5                  
  ## [147] compiler_4.5.1              rlang_1.1.6                
  ## [149] crayon_1.5.3                future.apply_1.20.0        
  ## [151] labeling_0.4.3              plyr_1.8.9                 
  ## [153] forcats_1.0.0               fs_1.6.6                   
  ## [155] stringi_1.8.7               viridisLite_0.4.2          
  ## [157] deldir_2.0-4                BiocParallel_1.42.1        
  ## [159] Biostrings_2.76.0           lazyeval_0.2.2             
  ## [161] spatstat.geom_3.5-0         Matrix_1.7-3               
  ## [163] RcppHNSW_0.6.0              patchwork_1.3.2            
  ## [165] bit64_4.6.0-1               sparseMatrixStats_1.20.0   
  ## [167] statmod_1.5.0               KEGGREST_1.48.1            
  ## [169] shiny_1.11.1                SummarizedExperiment_1.38.1
  ## [171] ROCR_1.0-11                 memoise_2.0.1              
  ## [173] igraph_2.1.4                broom_1.0.9                
  ## [175] bslib_0.9.0                 bit_4.6.0