Multi-condition single-cell RNA-seq analysis with LEMUR

Author

Constantin Ahlmann-Eltze, Tim Herbermann & Wolfgang Huber

Published

May 27, 2026

What you will learn in this lab

Our objective is to compare single cell RNA-seq data from multiple tissue samples across conditions. An obvious approach is to divide the constituent cells into clusters meant to represent cell types and, for each cluster separately, do pseudo-bulk differential expression with methods such as edgeR or DESEq2. t However, such discrete categorization tends to be an unsatisfactory model of the underlying biology.

Here, we use latent embedding multivariate regression (LEMUR) (Constantin Ahlmann-Eltze and Huber (2025)), a model that operates without or before commitment to discrete cell type categorization. LEMUR fits a regression model that predicts for each cell the expression of its genes in each of the conditions. Note that for any given cell, the expression of its genes was only observed in one replicate of one condition. The predictions enable us to compute any contrast¹ of interest, such as: how would the expression of a certain gene in a particular cell differ between treatment and control condition? It even lets us inter- and extrapolate between conditions, or consider more complex experimental designs such as factorial analysis, blocking factors, or continuous covariates. Given such contrasts, we can do biclustering to find groups of genes that behave similarly across a group of cells. Such a group of genes may then be interpreted in terms of pathways or a coordinated gene programme. Groups of cells is not necessarily associated to a single cell type, but also combinations of several cell types, or subsets within cell types.

¹ the technical term for differential expression effects

Figure 1 from Constantin Ahlmann-Eltze and Huber (2025).

Getting started

To start, we attach the tidyverse and SingleCellExperiment packages:

library("tidyverse")
library("SingleCellExperiment")
set.seed(0xbebe)

set.seed sets the initial conditions (“seed”) of R’s random number generator. This step is optional. Here, it ensures that the results are exactly the same for everyone. Some of the methods here use stochastic sampling, but the expectation is that even with different random seeds, the essence of the results will be the same. You can verify this by omiting the call to set.seed or calling it with a different argument.

Example data

We consider a dataset by Kang et al. (2018). The authors measured the effect of interferon-$\beta$ stimulation on blood cells from eight patients. The muscData package provides the data as a SingleCellExperiment. If downloading the data with the muscData package fails, you can also directly download the file from http://oc.embl.de/index.php/s/tpbYcH5P9NfXeM5 and load it into R using sce = readRDS("PATH_TO_THE_FILE").

sce <- muscData::Kang18_8vs8()

see ?muscData and browseVignettes('muscData') for documentation

loading from cache

sce

class: SingleCellExperiment 
dim: 35635 29065 
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

Question: How many genes and cells are in the data? How do you find the metadata about each cell / gene?

The numbers of genes and cells are mentioned when printing the summary of the sce. Alternatively, you can use nrow(sce), ncol(sce), or dim(sce) to directly access these values.

In a SingleCellExperiment object, the metadata (“data about data”) on the cells are accessed with colData(sce), and those on the genes with rowData(sce). The SingleCellExperiment class inherits from (extends) the SummarizedExperiment class, thus shares its methods, including colData and rowData.

As always, we first visually explore the data. To this end, we logarithm transform them to account for the heteroskedasticity of the counts (C. Ahlmann-Eltze and Huber (2023)), perform PCA to obtain for each cell a 50-dimensional vector (also known as an “embedding”), and then use UMAP to map these vectors into a two-dimensional plane for visualization on the screen. Here, we use the scrapper package to compute a new slot called logcounts, and the scater package to add the slot reducedDims(sce) with PCA and UMAP results. Equivalent functionality exists in other packages, e.g., in Seurat and scanpy.

library("sparseMatrixStats")                # provides the rowVars method needed below
sce <- scrapper::normalizeRnaCounts.se(sce)
hvg <- order(MatrixGenerics::rowVars(logcounts(sce)), decreasing = TRUE)
sce <- sce[hvg[1:500], ]
sce <- scater::runPCA(sce, ncomponents = 50)
sce <- scater::runUMAP(sce, dimred = "PCA")
reducedDims(sce)

List of length 3
names(3): TSNE PCA UMAP

reducedDims(sce) |> sapply(class)

     TSNE     PCA      UMAP    
[1,] "matrix" "matrix" "matrix"
[2,] "array"  "array"  "array"

reducedDims(sce) |> sapply(dim)

      TSNE   PCA  UMAP
[1,] 29065 29065 29065
[2,]     2    50     2

In the below two UMAP plots, the cells’ positions strongly separate by treatment status (stim), as well as by annotated cell type (cell). One goal of this tutorial is to understand how different approaches use these covariates to integrate the data into shared embedding spaces or representations.

In this tutorial, we use direct calls to functions in the ggplot2 package to visualize our data. This is a little more verbose than calling higher-level, specialized wrapper functions (e.g., scater::plotReducedDim(sce, "UMAP", color_by = "stim")), on the other hand, you see more directly what is going on.

ggum <- as_tibble(colData(sce)) |>
      mutate(umap = reducedDim(sce, "UMAP")) |>
        ggplot(aes(x = umap[, 1], y = umap[, 2])) + coord_fixed()

ggum + geom_point(aes(colour = stim), size = 0.3)
ggum + geom_point(aes(colour = cell), size = 0.3)

Figure 1: UMAP of log transformed counts

Figure 2: UMAP of log transformed counts

Question: The colData(sce)$ind column lists the patient identifier for each cell. Can you use this to create a separate plot for each patient?

One solution could be to use a for loop and filter the data for each patient, but there is a much simpler solution! Adding facet_wrap to the ggplot call will automatically create a subpanel for each patient:

ggum + geom_point(aes(colour = stim), size = 0.3) + facet_wrap(vars(ind), ncol = 4)

Figure 3: Like Figure 1, but separately per patient ID

Question: Why do we use coord_fixed() in the above plots?

Dimension reduction methods such as UMAP put in considerable effort to approximate the distance relationships between the points by arranging them in the Euclidean 2D plane, so it would be a shame to distort this by incidental figure size settings and by distinct scalings of the $x$- and the $y$-axis.

This dataset already comes with cell type annotations. These may help us later in interpreting the results. Note that the LEMUR analysis does not need them.

centroids <- as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  summarize(umap = matrix(colMedians(umap), nrow = 1), .by = c(stim, cell))

# Use `geom_text_repel` to avoid overlapping annotations
ggum + geom_point(aes(colour = paste(stim, cell, sep = ": ")), size = 0.3) +
    ggrepel::geom_text_repel(data = centroids, aes(label = paste(stim, cell, sep = ": ")))

LEMUR analysis

LEMUR is a tool to analyze multi-condition single-cell data. A typical analysis workflow with LEMUR goes through five steps (Figure 5):

Covariate-adjusted dimensionality reduction.
Prediction of the differential expression for each cell per gene.
Identification of groups (or: “neighbourhoods”) of cells with similar differential expression patterns.
Pseudobulk differential expression testing per neighbourhood.
Biclustering of cells and genes with similar differential expression patterns

A basic workflow might look like this²:

² Future versions might simplify this into a single end-to-end function.

library("lemur")
fit <- lemur(sce, design = ~ stim, n_embedding = 30)  
fit <- align_by_grouping(fit, fit$colData$cell) |>
          test_de(contrast = cond(stim = "stim") - cond(stim = "ctrl"))
nei <- find_de_neighborhoods(fit, group_by = vars(ind, stim))

In the following, we go through each step in a little more detail.

LEMUR Integration (Step 1)

Tools like MNN and Harmony take a PCA embedding and remove the effects of the specified covariates. You can learn more about them in the Appendix (Section 6). However, the drawback is that there is no way to go back from the integrated embedding to the original gene space. This means that we cannot ask the counterfactual what the expression of a cell from the control condition would have been, had it been treated.

LEMUR achieves such counterfactuals by matching the subspaces of each condition (Constantin Ahlmann-Eltze and Huber 2025). Figure 6 illustrates the principle.

Figure 6: Schematic picture of data from two conditions with the respective linear subspace.

The lemur function takes as input a SingleCellExperiment object, the specification of the experimental design, and the number of latent dimensions. To refine the embedding, we call align_by_grouping and tell it to use the provided cell type annotations.

fit <- lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit <- align_by_grouping(fit, fit$colData$cell, verbose = FALSE)

Making a UMAP plot of the embedding by lemur suggests successful integration of the conditions (Figure 7)³.

³ Note the line with arrange in the code below, which randomly reorders the rows of the dataframe. This way, the inevitable overplotting in such UMAP plots does not favour data points that come later in the original dataframe.

lemur_umap <- uwot::umap(t(fit$embedding))

gglem <- as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  arrange(runif(length(stim))) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) + coord_fixed()

gglem + geom_point(aes(colour = stim), size = 0.3)

We can cross-check our embedding with the provided cell type annotations by colouring each cell by cell type (using the information stored in the colData(sce)$cell column).

gglem + geom_point(aes(colour = cell), size = 0.3)

Figure 8: Cell from the same cell types are close together after the integration

Challenge: How to can the embedding of lemur be refined with an automated tool?

If there are no cell type labels present, lemur uses a modified version of the maximum diversity clustering of harmony to automatically infer the cell relationships across conditions. Currently (lemur 1.9, May 2026), lemur needs harmony version 1.2.4 or earlier, and does not work with version 2.x.x . This is not great, and we hope to fix it.

fit2 <- lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
# don't run:
# fit2 <- align_harmony(fit2, verbose = FALSE)

Differential expression analysis (Step 2)

An advantage of the LEMUR integration is that we can predict what the expression profile of a cell from the control condition would have been, had it been stimulated, and vice versa. Contrasting those predictions tells us how much the gene expression changes for that cell in any gene.

Differential expression with an invertible integration can be inferred as the difference between the prediction for a position from one condition and the prediction at the same position in the other condition.

We call the test_de function of lemur to compare the expression values in the stim and ctrl conditions.

fit <- test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))

We can now pick an individual gene (PLSCR1) and plot the predicted log fold change for each cell to show how it varies as a function of the underlying gene expression values (Figure 9). The code below defines a helper function mysequentialcolourscale to get a suitable colour scale: white corresponds to zero and blue to maxval.

mysequentialcolourscale <- function(maxval)
  scale_colour_continuous(
     limits = c(0, 1) * maxval,
     oob = scales::squish,
     palette = grDevices::colorRampPalette(RColorBrewer::brewer.pal(9, "Blues"))(256)
   )

darkerbackground <- theme(panel.background = element_rect(fill = "#c0c0c0"))

ggplscr1 <- as_tibble(colData(sce)) |>
  mutate(
    umap = lemur_umap,
    expr = logcounts(fit)["PLSCR1", ],
    de   = assay(fit, "DE")["PLSCR1", ]
  ) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) + coord_fixed()
  
 ggplscr1 + geom_point(aes(colour = expr), size = 0.2) + mysequentialcolourscale(4) +
    facet_wrap(vars(stim)) + darkerbackground

Figure 9: Expression of *PLSCR1* in control and stim condition

We set up another colour scale for showing log fold changes, with blue corresponding to -maxval, red to +maxval and an extended range of white in the middle (around 0).

mydivergentcolourscale <- function(maxval)
  scale_colour_continuous(
     limits = c(-1, 1) * maxval,
     oob = scales::squish,
     palette = grDevices::colorRampPalette(RColorBrewer::brewer.pal(11, "RdBu")[rev(c(1:5, 6, 6, 6, 6, 7:11))])(256)
   )

ggplscr1 + geom_point(aes(colour = de), size = 0.2) + mydivergentcolourscale(1.5) + darkerbackground

Figure 10: Counterfactual: `lemur` predicts *PLSCR1* differential expression between control and stimulus for each cell

Find neighbourhoods (Step 3) and test for differential expression (Step 4)

As a step towards identifying the most differential expression patterns, lemur can screen the results of each gene to find a set of cells (which we call neighbourhood) with the most prominent coherent pattern of differential gene expression for that gene. Subsequently, to conduct statistically valid differential expression inference, lemur uses the pseudobulk approach to compute a gene count for each replicate (here: patient) on the test data that were previously set aside. The output is a data.frame with one row for each gene. The neighborhood column contains a list, where each list element is a vector with the IDs of the cells that are inside the neighbourhood. The other columns contain statistical measures of significance.

Schematic of the neighbourhood inference for a gene

nei <- find_de_neighborhoods(fit, group_by = vars(ind, stim))
head(nei)

    name neighborhood n_cells sel_statistic     pval adj_pval f_statistic df1
1    FTL AAACATAC....   29024          -256 1.55e-02 1.94e-02        7.19   1
2  ISG15 AAACATAC....   18205          1123 2.11e-17 2.44e-15     1138.06   1
3   FTH1 AAACATAC....   25782          -448 4.12e-07 1.45e-06       60.96   1
4  TIMP1 AAACATAC....    8498          -153 2.04e-04 3.67e-04       21.79   1
5 CXCL10 AAACATAC....    8438           300 5.69e-09 3.51e-08      109.69   1
6   CCL2 AAACATAC....    6882           301 2.50e-06 7.26e-06       46.58   1
   df2    lfc did_pval did_adj_pval did_lfc
1 17.5 -0.363 0.997095      1.00000  10.000
2 17.5  4.773 0.000854      0.00409   1.582
3 17.5 -0.894 0.396784      0.56043   0.259
4 17.5 -0.884 0.798763      0.89347   0.156
5 17.5  7.545 0.157695      0.27473  -1.963
6 17.5  2.071 0.838165      0.92960   0.142

We can sort the table either by pval or did_pval. The p value pval is from a test for expression change between control and stimulated condition for all cells inside the neighbourhood. The p value did_pval is from a test for expression difference between the cells inside versus outside the neighbourhood.

slice_min(nei, pval, n = 3)

     name neighborhood n_cells sel_statistic     pval adj_pval f_statistic df1
1 EIF2AK2 AAACATAC....   29063           563 2.53e-18 7.46e-16        1454   1
2   HERC5 AAACATAC....   28266           492 2.98e-18 7.46e-16        1427   1
3  NT5C3A AAACATAC....    8609           360 1.16e-17 1.93e-15        1220   1
   df2  lfc did_pval did_adj_pval did_lfc
1 17.5 3.37 1.00e+00     1.00e+00  10.000
2 17.5 4.52 7.54e-01     8.61e-01  -0.418
3 17.5 4.56 2.02e-08     1.01e-06  -1.734

slice_min(nei, did_pval, n = 3)

   name neighborhood n_cells sel_statistic     pval adj_pval f_statistic df1
1   SSB AAACATAC....    7744           335 1.97e-15 1.09e-13       672.0   1
2 CXCR4 AAACATAC....   27433          -134 1.35e-06 4.18e-06        51.1   1
3  ENO1 AAACATAC....    7775          -186 5.93e-11 7.06e-10       195.8   1
   df2    lfc did_pval did_adj_pval did_lfc
1 17.5  2.647 5.98e-15     2.99e-12   -2.24
2 17.5 -0.526 3.28e-14     8.20e-12   -2.93
3 17.5 -2.021 4.16e-13     6.94e-11    2.18

top_did_gene <- slice_min(nei, did_pval)

as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  mutate(de = assay(fit, "DE")[top_did_gene$name, ]) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(colour = de), size = 0.3) +
    mydivergentcolourscale(1) +
    coord_fixed() + darkerbackground +
    labs(title = paste("The top differentially expressed gene by did_pval:", top_did_gene$name))

As you can see in the above plot, there is strong differential expression of this gene in one cluster of cells, and essentially none in the other cells.

To see which cells lemur included in the neighbourhood for each gene, we will first define a small helper function that converts nei—a dataframe that contains one row per gene, and the neighbourhood of that gene as a vector of cell identifiers in the neighborhood columns–into another dataframe that has one row for each gene and for each cell, and a column inside with a logical variable that indicates whether this cell is in the neighbourhood for that gene.

Future versions of lemur may contain this functionality inside, so you dont need to worry about it.

neighbourhoods_to_adjacency_matrix <- function(x, fit, out = c("matrix", "longdataframe")) {
  nei <- x[["neighborhood"]] 
  gene_names <- x[["name"]]
  cell_names <- unique(unlist(nei))  
  if (!missing(fit)) 
    cell_names <- union(cell_names, colnames(fit))
  stopifnot(!is.null(nei), !is.null(gene_names), !is.null(cell_names))
  
  mat <- matrix(0L, nrow = length(cell_names), ncol = length(gene_names),  # integer matrix cells x genes
                    dimnames = list(cell = cell_names, gene = gene_names))

  for (i in seq_along(gene_names))
    mat[nei[[i]], gene_names[i]] <- as.integer(sign(x$lfc[i]))
  
  switch(out,
    matrix = mat, # matrix of integers -1, 0, +1 of size cells x genes
    longdataframe = as_tibble(as.data.frame.table(mat != 0, responseName = "inside",
                              stringsAsFactors = FALSE))  # inside: column of logical
  )
}

With this function, we can annotate for any gene if a cell is included in its lemur neighbourhood. Let’s redo Figure 11, but now separately for the cells inside and outside:

cells_inside_df <- neighbourhoods_to_adjacency_matrix(top_did_gene, fit, out = "longdataframe")

as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  mutate(de = assay(fit, "DE")[top_did_gene$name,]) |>
  left_join(cells_inside_df, by = c("cell_name" = "cell")) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(colour = de), size = 0.3) +
    mydivergentcolourscale(1) +
    coord_fixed() + darkerbackground +
    labs(title = paste("The top differentially expressed gene by did_pval:", top_did_gene$name)) +
    facet_wrap(vars(inside), labeller = label_both)

Brain teaser: How are lemur neighbourhoods different from the cluster-based differential expression tests?

The lemur neighbourhoods are adaptive to the underlying expression patterns for each gene. This means that they detect differential expression not just for a predefined set of cells, but for that set of cells for which the gene expression changes most consistently.

The plot below illustrates this for the top three DE genes by pval, which all have very different neighbourhoods. Note that the neighbourhoods are convex in the low-dimensional embedding space, even though in the UMAP embedding they might not look like it. This is due to the non-linear nature of the UMAP dimensionality reduction.

cells_inside_df_top3 <- neighbourhoods_to_adjacency_matrix(slice_min(nei, pval, n = 3), fit, "longdataframe")

as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  mutate(de = as_tibble(t(assay(fit, "DE")[slice_min(nei, pval, n = 3)$name,]))) |>
  unpack(de, names_sep = "_") |>
  pivot_longer(starts_with("de_"), names_sep = "_", names_to = c(".value", "gene_name")) |>
  left_join(cells_inside_df_top3, by = c("gene_name" = "gene", "cell_name" = "cell")) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(colour = de), size = 0.3) +
    mydivergentcolourscale(2) +
    coord_fixed() + darkerbackground +
    facet_grid(vars(inside), vars(gene_name), labeller = label_both) +
    labs(title = "Top 3 DE genes by pval")

Step 5: Simultaneous clustering of genes and cells (biclustering)

So far, we have identified cell neighbourhoods for individual genes. While this is useful, the results for hundreds of genes can be overwhelming, and we may want to go a step further and know whether certain sets of genes have quite similar neighbourhoods, i.e., are co-differentially expressed. In this step, which was not yet described by Constantin Ahlmann-Eltze and Huber (2025), we use biclustering to address this question.

Clustering neighborhoods is not enough

As we already determined neighborhoods (i.e. subsets of cells) with coherent expression of a single gene, it seems only natural to further process these to identify superseeding neighborhoods, where the subset of cells shows coherent expression (or suppression) of multiple genes.

Indeed, taking the previously determined neighborhoods, we can easily quantify their similarity using a simple Jaccard index ($J(A,B) = \frac{|A \cap B|}{|A \cup B|}$).

top_gene <- slice_min(nei, did_pval, n = 1, with_ties = FALSE)
ref_cells <- top_gene$neighborhood[[1]]

# Function to compute Jaccard similarity
jaccard_index <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  if (length(a) == 0 && length(b) == 0) return(NA_real_)
  length(intersect(a, b)) / length(union(a, b))
}

jaccard_df <- nei |> filter(adj_pval < 0.15) |>
  mutate(
    jaccard = map_dbl(neighborhood, ~ jaccard_index(.x, ref_cells)),
    mean_expr_inside = map2_dbl(
      name,
      neighborhood,
      ~ mean(assay(fit, "DE")[.x, .y], na.rm = TRUE)
    )
  ) |>
  select(name, jaccard, mean_expr_inside) |>
  arrange(desc(jaccard))

head(jaccard_df, 10)

     name jaccard mean_expr_inside
1  FAM26F   1.000            1.288
2     SSB   1.000            0.816
3   MS4A7   0.998            0.325
4    GPX1   0.992           -0.658
5   RPL12   0.975           -0.353
6  CXCL11   0.956            2.223
7   PFDN5   0.955           -0.619
8     JTB   0.953           -0.224
9  S100A6   0.953           -0.654
10  PPP4C   0.952           -0.182

As we can see, many neighbourhoods show strong overlap, and the mean expression indicates that we find both up- and downregulated genes. Setting a threshold of, say, $J > 0.8$, we obtain:

recruited <- jaccard_df |>
  filter(jaccard > 0.8) |>
  arrange(desc(jaccard))

recruited_genes <- recruited$name

recruited_neighborhoods <- nei |>
  filter(name %in% recruited_genes) |>
  arrange(match(name, recruited_genes)) |>
  pull(neighborhood)

joined_neighborhood <- recruited_neighborhoods |>
  as.list() |>
  purrr::reduce(intersect)

joined_result <- list(
  genes = recruited_genes,
  neighborhood = joined_neighborhood
)

print(length(joined_result$genes))

[1] 124

print(length(joined_result$neighborhood))

[1] 4953

joined_inside_df <- tibble(
  cell_name = colnames(fit),
  inside = cell_name %in% joined_result$neighborhood
)

p_joined_neighborhood <- as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  left_join(joined_inside_df, by = "cell_name") |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
  geom_point(aes(colour = inside), size = 0.3) +
  scale_colour_manual(values = c(`FALSE` = "goldenrod", `TRUE` = "royalblue3")) +
  coord_fixed() +
  labs(title = "Joined neighborhood", colour = "Inside")

p_joined_neighborhood

Figure 12: Joined neighbourhood from repeated intersection of gene neighbourhoods of similar Jaccard index

We now have a list of 124 genes showing significant differential expression within a neighbourhood of 4953 cells. To visualise the result, we can colour all cells belonging to the joined neighbourhood as shown in Figure 12.

Challenge: Confirming neighborhood expression patterns on a UMAP

genes4 <- joined_result$genes[1:4]

gene_expr_df <- as.data.frame(t(assay(fit, "DE")[genes4, , drop = FALSE])) |>
  rownames_to_column("cell_name") |>
  pivot_longer(
    cols = -cell_name,
    names_to = "gene",
    values_to = "de"
  )

p_four_genes <- as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  left_join(gene_expr_df, by = "cell_name") |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
  geom_point(aes(colour = de), size = 0.3) +
  scale_colour_gradient2(
    low = scales::muted("blue"),
    high = scales::muted("red")
  ) +
  coord_fixed() +
  facet_wrap(vars(gene), ncol = 2) +
  labs(colour = "DE")

p_four_genes

Figure 13: Differential expression of selected genes associated to the joined neighbourhood we constructed earlier.

We can visually confirm that the identified joined neighbourhood indeed shows coherent up- or downregulation across these genes. For example, taking the first four genes we identified and plotting their differential expression on the UMAP in Figure 13.

Note that gene expression need not be restricted to that joined neighbourhood we constructed. Indeed, the intersection approach only guarantees to identify a neighbourhood of coherent expression patterns – any gene belonging to this “bicluster” may also show a very different expression pattern in other cell neighbourhoods. This could partially be remedied by also including subsets of cells instead of almost perfect overlap. However, this ansatz is still plagued by choosing somewhat arbitrary selection thresholds, and cuts tend to be overly conservative and therefore limit statistical power.

Ideally, we want to process the differential expression from LEMUR in an unsupervised fashion. Also, we want to identify non-exclusive and non-exhaustive biclusters, since cells can react to a changing condition by changing multiple programmes, and genes themselves are not exclusive to individual response programmes.

In the next section, we illustrate the usefulness of matrix factorization techniques to extract biclusters that are non-exhaustive and non-exclusive by design.

Challenge: Quantifying the coherence of “biclusters”

The goal of identifying biclusters and neighbourhoods of co-expression is not only to find genes that are, for example, significantly up- or downregulated within a neighbourhood, but rather to identify genes that show a coherent expression pattern. For example, if there is a gradient of differential expression within that neighbourhood, it would be interesting to see if other genes from the bicluster follow the same trend. In that sense, coherence is a stronger condition than simply averaged regulation over a given neighbourhood.

If all genes are indeed showing perfectly coherent expression, the bicluster submatrix has exactly one axis of variation. In other words, for each cell and gene combination the signal decomposes into a gene and a cell intrinsic part $X_{gc} = G_g C_c$. Equivalently, the submatrix simply has $\mathrm{rank}(X) = 1$.

This can be quantified using SVD. Let $\sigma_k$ denote the (sorted) singular values of the bicluster submatrix $X_\mathrm{sub}$. Then, $\alpha = \sigma^2/{\lVert X \rVert}_F^2$ quantifies how close the submatrix is to being rank one.

X_sub <- assay(fit, "DE")[joined_result$genes,
                          joined_result$neighborhood,
                          drop = FALSE]

sv <- svd(X_sub)

sv_df <- tibble(
  component = seq_along(sv$d[1:4]),
  proportion = sv$d[1:4]^2 / sum(sv$d[1:4]^2)
)

p_spectrum <- ggplot(
    sv_df |> slice_head(n = 10),
    aes(component, proportion)
  ) +
  geom_point(
    size = 2.5,
    colour = "#2166AC"
  ) +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(
    labels = scales::percent_format(accuracy = 1)
  ) +
  labs(
    title = "Singular value spectrum of bicluster",
    x = "Singular component",
    y = expression(sigma[k]^2 / sum(sigma[k]^2))
  ) +
  theme_bw(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold")
  )

print(p_spectrum)

Spectrum of singular values of the “bicluster” we constructed.

Biclustering from matrix factorization

This last section is more advanced topic. It works out a complete example of how biclusters can be found using matrix factorization techniques. Many of the ideas presented here are also the basis fo the dedicated package under development. The biclusters could be inferred directly from the differential expression matrix we found in Step 2. However, it can be convenient to preprocess that matrix in a way that highlights the bicluster structure. Here we use the neighborhoods found in Step 4 to indiciate if a gene is significantly expressed (in the neighborhood) or not (outside).

This binary matrix corresponds to the adjacency matrix of a bipartite graph of genes and cells. A bipartite graph has two types of nodes, and edges only between nodes of different types. In our case, the nodes are genes and cells, and an edge exists between a cell and a gene if the cell is in the gene’s neighbourhood. Let’s visualise it.

adj <- neighbourhoods_to_adjacency_matrix(dplyr::filter(nei, adj_pval < 0.15), fit, out = "matrix")
dim(adj)

[1] 29065   441

The conversion from the nei object to the adjacency matrix adj has lost / garbled the original order of cells in the sce data object. For comparisons below, and to avoid bugs, we align the orders.

stopifnot(setequal(colnames(sce), rownames(adj)))
adj = adj[colnames(sce), ]

Let’s look at a heatmap rendering of adj. We use the Heatmap function from the ComplexHeatmap package. Calling it needs a litany of parameters to customise the plot as we want it, but nothing complicated. Note the random subsetting of rows by csub. This is because the 29065 rows of adj are too many to see on the screen anyway (your screen will likely not have more than a few thousand pixels in vertical direction), and the subsetting saves computation time.

library("ComplexHeatmap")
library("circlize")
require("magick")
csub <- sample(nrow(adj), 2048) 
Heatmap(adj[csub, ], name = "adj", col = colorRamp2(c(-1, 0, 1), c("#FFD700", "#f0f0f0", "#0057B7")),
  row_title = "cells", column_title = "genes", use_raster = TRUE,
  show_row_names = FALSE, show_column_names = FALSE, show_row_dend = FALSE, show_column_dend = FALSE)

Figure 14: Adjacency matrix of the bipartite graph of genes and cells. Rows: cells, columns: genes. Yellow: inside the neighbourhood, LFC<0; blue: inside the neighbourhood, LFC>0; light grey: outside (no edge).

Looking at Figure 14, we immediately see that there is a lot of structure, and that bi- or coclustering of this matrix should be a fruitful approach to unravel that structure. That is, we are going to look for sets of genes who have similar neighbourhoods (cells). We may be able to interpret these as biological processes or regulatory programmes. Similarly, we can look for sets of cells that show differential co-expression. We can use these to refine our notions of cell type and state defined from static expression maps, to discover new subpopulations, or commonalities between known populations.

Biclustering

Biclustering or coclustering is a data analysis technique that simultaneously groups rows and columns of a matrix to identify subsets of rows that exhibit coherent patterns within subsets of columns.

A dedicated package for biclustering of lemur predictions is currently under development by the authors of this lab, and we aim to release it soon. The approach is based on low-rank (and possibly sparse) matrix decomposition of the predicted differential expression matrix (as in Figure 14) followed by an algorithm to assign cells and genes to biclusters. It is designed such that a cell or a gene can be part of multiple biclusters. This is motivated by the biological observation that a gene can contribute to multiple biological processes, and cells can respond to condition changes through up- or downregulation of multiple processes.

In the following we use non-negative matrix factorization (NMF) instead of the more widely known SVD. NMF factors a matrix as $X = WH$, where both $W$ and $H$ are subject to a non-negativity constraint. The main reason is that this makes the structure we find in $X$ additive, which is well-motivated from a biological perspective.

j <- 2
adjnmf <- RcppML::nmf(abs(adj), k = 10, verbose = FALSE, tol = 9.54e-07, maxit = 256)
adj_jn <- with(adjnmf, w[, j, drop = FALSE] %*% h[j,, drop = FALSE] * d[j])  # the matrix approximation from the j-th NMF component

Challenge: Recruiting cells and genes to a bicluster

The NMF decomposition finds continuous loadings of genes $W$ and cells $H$. Ideally, we wish to find a criterion to determine if a gene or cells should be attributed to a bicluster or not.

Let us begin by plotting the histograms of the j-th NMF components adjnmf$h[j, ] and adjnmf$w[, j]).

ggplot(tibble(w = adjnmf$w[, j]), aes(x = w)) + geom_histogram(bins = 100)
ggplot(tibble(h = adjnmf$h[j, ]), aes(x = h)) + geom_histogram(bins = 30)

Figure 15: Distribution of NMF coefficients: w

Figure 16: Distribution of NMF coefficients: h

Here we opt for a simple thresholding criterion. The choice of the thresholds can be motivated by the histograms above, where we observe roughly two peaks, the one closer to zero loading represents unwanted noise-modelling. The peak at larger values is our signal. Here we automate this by fitting a two-component mixture model with the $k$-means algorithm with $k=2$, and taking the midpoint.

find_threshold <- function(x) { mean(kmeans(x, centers = 2)$centers) }
threshold_h <- find_threshold(adjnmf$h[j, ])
threshold_w <- find_threshold(adjnmf$w[, j])
c(threshold_h, threshold_w )

[1] 2.95e-03 5.34e-05

bic_genes = adjnmf$h[j, , drop = FALSE] > threshold_h
bic_cells = adjnmf$w[, j, drop = FALSE] > threshold_w

A quick summary of the above procedure shows:

table(bic_genes)

bic_genes
FALSE  TRUE 
  272   169

table(bic_cells)

bic_cells
FALSE  TRUE 
20300  8765

Now we are able to visualize bicluster memberships of genes and cells on top of a heatmap like in Figure 14.

Heatmap bueraucracy

Most of the code below is Heatmap bureaucracy, but it also contains a call to the seriate function for ordering the rows and columns of the matrix using seriation, i.e., a one-dimensional embedding of multivariate objects. For this, we use a travelling salesperson algorithm as implemented in the seriation package. The seriation is dominated by bicluster membership (the LARGE term), and then, by the finer structure within mat (the absolute value of the subsampled adjacency matrix) itself. The blue bars at the top and right sides of the matrix indicate membership in our bicluster (j=2).

library("seriation")
LARGE <- 10
mat <- abs(adj)[csub, ] 
ser <- seriate(mat + LARGE * (bic_cells[csub, , drop = FALSE] %*% bic_genes), method = "BEA_TSP")
mycolours <- c(`FALSE` = "#f0f0f0", `TRUE` = "royalBlue4")
mycolourramp <- colorRamp2(c(0, 1), c("#f0f0f0", "royalblue2"))

Heatmap(mat, name = "adj", col = mycolourramp,
  row_order = get_order(ser, 1), column_order = get_order(ser, 2),
  row_title = "cells", column_title = "genes", use_raster = TRUE,
  show_row_names = FALSE, show_column_names = FALSE, cluster_rows = FALSE, cluster_columns = FALSE, show_heatmap_legend = FALSE,
  right_annotation = HeatmapAnnotation(inside = as.factor(bic_cells[csub]), which = "row", col = list(inside = mycolours), show_legend = FALSE),
  top_annotation   = HeatmapAnnotation(inside = as.factor(bic_genes),    which = "column", col = list(inside = mycolours), show_legend = FALSE)
)

Figure 17: Heatmap of the adjacency matrix with indicated bicluster.

We see in Figure 17 that the matrix factorization has indeed identified a coherent block in the adjacency matrix. In other words, we identified a set of cells that co-expresses many different genes. The entire procedure was unsupervised. At no point did we have to use potential prior knowledge about cells or genes other than the expression data predicted from lemur.

We can now revisit Figure 8 and compare it with our bicluster. We can notice two things about its location on the UMAP.

The bicluster forms a connected region. Thus, cells belonging to the bicluster are similar in their overall expression profile. Note that this does not necessarily have to hold. A bicluster could also be disconnected, as the a bicluster indicates a shared subset of the gene expression profile only.
The cell type composition is not homogenous. The bicluster contains different cell types.

ord <- order(bic_cells)
gglem <- as_tibble(colData(sce)) |>
  mutate(
    umap = lemur_umap,
    bic = bic_cells) |>
  arrange(bic) |>    
  ggplot(aes(x = umap[, 1], y = umap[, 2])) + coord_fixed()

gglem + geom_point(aes(colour = bic), size = 0.3) +
  scale_colour_discrete(palette = mycolours) + darkerbackground

mosaicplot(table(celltypeannotation = colData(sce)$cell, inbicluster = bic_cells), 
           main = "", las = 2, color =  c("#0057B7", "#FFD700"))

Figure 19: Cell type annotations versus membership in bicluster `j`. How would you biologically interpret this bicluster?

To close this section, let us briefly dive more into this last point. The exact cell type composition of the bicluster can be seen in Figure 19. Indeed, this bicluster consists mostly of three cell types, and not all of these, but just fractions. This highlights one of the key advantages of LEMUR. The subsequent biclustering lets us identify a coherent gene programmes as a response to the condition contrast. This programme is found to be relevant in a connected subset of cells encompassing different cell types. Thus, we do not only find subpopulations in a given cell type with a unique response. We even find subpopulations of other cell types that share the same response programme!

Outlook

lemur can handle arbitrary design matrices as input, which means that it can also handle more complicated experiments than the simple treatment / control comparison shown here. Just as with limma, edgeR or DESeq2, you can model

multiple covariates, e.g., the treatment of interest as well as known confounders (experiment batch, sex, age) or blocking variables,
interactions, e.g., drug treatment in wild type and mutant,
continuous covariates, e.g., time courses, whose effects can also be modeled using smooth non-linear functions (spline regression).

For more details, see Constantin Ahlmann-Eltze and Huber (2025). For any questions, whether technical/practical or conceptual, please post to the Bioconductor forum using the lemur tag and follow the posting guide.

lemur provides a label-free differential expression prediction at true single cell resolution. In its current form, lemur can automatically infer neighborhoods (i.e. subsets) of cells that show a consistent differential expression pattern. These neighborhoods can span multiple cell types, and are more generally complementary to other types of meta data. The combination of neighborhoods, as well as a detailed analysis of their composition allows to easily identify response patterns that are otherwise easier to miss in traditional pseudobulk analysis.

We hope to release a dedicated package for biclustering of the predicted differential expression patterns. It will provide not just subsets of differentially expressing cells for individual genes, but rather subsets of genes that show coherent up- or downregulation in a subset of cells. Thus, it will be possible to identify coherent gene response programmes and which cells are using them in an unsupervised manner. Just as for the individual gene neighborhoods, these biclusters are agnostic to cell types and other labels and, by design, allow to find response patterns independent of meta data labels that later can help in guiding biological interpretation.

Appendix: Variations and extensions to the workflow

In this appendix we collect some additional information, challenges and thoughts that are related to topics discussed in the main part, but somewhat tangential to the main narrative.

Preprocessing

What is the problem with using too few components for PCA? Is there also a problem with using too many?

Too few dimensions for PCA mean that it cannot capture enough of the relevant variation in the data. This leads to a loss of subtle differences between cell states.

Too many dimensions for PCA can also pose a problem. The objective of doing PCA with a number of components that is smaller than that of the original data, is to smooth out the sampling noise in the data. If too many PCA components are included, the smoothing is too weak, and the additional dimensions capture noise that can obscure biologically relevant differences. For more details see Fig. 2d in C. Ahlmann-Eltze and Huber (2023).

Pseudobulk analyses

Calculate the differential expression for each cell type using glmGamPoi.

glmGamPoi is a differential expression tool inspired by edgeR and DESeq2 that is specifically designed for single cell RNA-seq data.

psce <- glmGamPoi::pseudobulk(fit, group_by = vars(cell, ind, stim))
# Filter out NAs
psce <- psce[, !is.na(psce$cell)]
glm_fit <- glmGamPoi::glm_gp(psce, design = ~ stim * cell)
glm_de <- glmGamPoi::test_de(glm_fit, contrast = cond(stim = "stim", cell = "B cells") - cond(stim = "ctrl", cell = "B cells"))
glm_de |>
  arrange(pval) |>
  head()

    name         pval     adj_pval f_statistic df1      df2       lfc
1  HERC5 9.370681e-36 4.685341e-33    329.5872   1 116.4645  5.152089
2 NT5C3A 7.953371e-35 1.988343e-32    313.5509   1 116.4645  3.760837
3 PLSCR1 1.370641e-32 2.284401e-30    277.2727   1 116.4645  4.032858
4 DYNLT1 1.018124e-31 1.272655e-29    263.9920   1 116.4645  2.913860
5   IRF7 2.424490e-30 2.424490e-28    243.9083   1 116.4645  3.543921
6   SPI1 7.996678e-30 6.663898e-28    236.6273   1 116.4645 -2.440995

For each cell type compare the lemur predictions against the observed expression of a gene:

We use the pseudobulk function from glmGamPoi to calculate the average expression and latent embedding position per condition and cell type. The mean embedding and colData is then fed to the predict function of lemur.

# You could choose any gene
gene_oi <- "HERC5"

reduced_fit <- glmGamPoi::pseudobulk(fit, group_by = vars(stim, cell))
lemur_pred_per_cell <- predict(fit, newdata = colData(reduced_fit), embedding = t(reducedDim(reduced_fit, "embedding")))

as_tibble(colData(reduced_fit)) |>
  mutate(obs = logcounts(reduced_fit)[gene_oi,]) |>
  mutate(lemur_pred =  lemur_pred_per_cell[gene_oi,]) |>
  ggplot(aes(x = obs, y = lemur_pred)) +
    geom_abline() +
    geom_point(aes(colour = stim)) +
    labs(title = "`lemur` predictions averaged per cell type and condition")

Data integration

When we made Figure 1, we discussed that the cells’ embedding locations separate strongly both by treatment status and cell type. For many analyses, we would like to overlay the corresponding cells from the stimulated condition with the cells from the control condition. For example, for cell type assignment, we might want to annotate both conditions together, irrespective of treatment. This aspiration is called integration. The goal is a mathematical representation (e.g., a low-dimensional embedding) of the cells where the treatment status does not visibly affect the positions overall, and all the visible variance represents different cell types. Figure 20 shows an example.

There are many approaches to this end. The question is somewhat ill-defined and can be formalized into a mathematical algorithm in many different ways. Luecken et al. (2022) benchmarked several approaches. Here, we first show how to do an integration manually, then with harmony. Later we will compare both to lemur. If you are just interested in using lemur, you can skip ahead to Section 4.

Question: In this tutorial we will just look at plots to assess integration success. Why is that suboptimal? How can we do better?

A 2D embedding like UMAP or tSNE gives us a visual impression if variation that comes from “uninteresting” covariates (here: treatment condition) is eliminated and variation that comes from covariates we care about is retained, but the results are not quantitative, which means we cannot objectively assess or compare the quality of the result.

One simple metric to measure integration success is to see how mixed the cells from the two conditions are. For example, we can count for each cell how many of its $k$ nearest neighbours come from the same condition and how many come from the other condition(s), and then look at the distribution of this fraction. For more, see Luecken et al. (2022).

Follow-up challenge: Write a function to calculate the mixing metric.

Follow-up questions: Can you write an integration function that scores perfectly on this mixing metric? Hint: it can be biologically completely useless. What else would you need to measure to protect against such silly solutions?

Manual Projection

Figure 21: Schematic picture of data from two conditions. The data from the treated condition is projected onto the subspace spanned by the control condition.

In transcriptomic data analysis, each cell is characterized by its gene expression profile, which here we operationalize as the vector of counts for the 500 most variable genes. Accordingly, each cell is a point in a 500-dimensional gene expression space. Principal component analysis (PCA) finds a $P$-dimensional space ($P\le500$) such that overall, distance relationships between cells in the original space are optimally approximated by those in the reduced, $P$-dimensional space.

Although there are over 20,000 genes in the human genome, the sensitivity of the assay that was used for our dataset is such that only a much smaller number of genes are well-measured, i.e., have counts high enough to be informative. Whether we use the top 500 genes by mean or by standard deviation makes no big difference.

To make these concepts more tangible, consider the cartoon in Figure 21. The orange blob represents an optimal two-dimensional embedding of all cells from the control condition. Similarly, the blue blob represents the treated cells. Each blob is contained within a two-dimensional subspace (plane) of the ambient space, but in general, these are different. To integrate both conditions, we can map the two planes into each other, so that each point from the blue blob is mapped into the orange subspace, and vice versa.

We can implement this procedure in a few lines of R code:

We create a matrix for the cells from the control and treated conditions,
we fit PCA on the control cells,
we center the data, and finally
we project the cells from both conditions onto the subspace of the control condition

# Each column from the matrix is the coordinate of a cell in the 500-dimensional space
ctrl_mat <- as.matrix(logcounts(sce)[,sce$stim == "ctrl"])
stim_mat <- as.matrix(logcounts(sce)[,sce$stim == "stim"])

ctrl_centers <- rowMeans(ctrl_mat)
stim_centers <- rowMeans(stim_mat)

# `prcomp` is R's name for PCA and IRLBA is an algorithm to calculate it.
ctrl_pca <- irlba::prcomp_irlba(t(ctrl_mat - ctrl_centers), n = 20)

# With a little bit of linear algebra, we project the points onto the subspace of the control cells
integrated_mat <- matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat[,sce$stim == "ctrl"] <- t(ctrl_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat[,sce$stim == "stim"] <- t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)

We check how successful we were by looking at a UMAP of the integrated embedding.

int_mat_umap <- uwot::umap(t(integrated_mat))

as_tibble(colData(sce)) |>
  mutate(umap = int_mat_umap) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(colour = stim), size = 0.3) +
    coord_fixed()

Figure 22: UMAP of log transformed counts

The overlap is not perfect, but better than in Figure 1.

Challenge: What happens if you project on the "stim" subspace instead?

Schematic picture of data from two conditions using the stimulated condition as reference.

The projection is orthogonal onto the subspace, which means it matters which condition is chosen as reference.

stim_pca <- irlba::prcomp_irlba(t(stim_mat - stim_centers), n = 20, center = FALSE)

integrated_mat2 <- matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat2[,sce$stim == "ctrl"] <- t(stim_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat2[,sce$stim == "stim"] <- t(stim_pca$rotation) %*% (stim_mat - stim_centers)

int_mat_umap2 <- uwot::umap(t(integrated_mat2))

as_tibble(colData(sce)) |>
  mutate(umap = int_mat_umap2) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(colour = stim), size = 0.3) +
    coord_fixed()

For this example, using the "stim" condition as the reference leads to a worse integration.

Brain teaser: How can you make the manual projection approach work for any complex experimental designs?

The projection approach consists of three steps:

Centering the data (e.g., ctrl_mat - ctrl_centers).
Choosing a reference condition and calculating the subspace that approximates the data from the reference condition (irlba::prcomp_irlba(t(stim_mat - stim_centers))$rotation).
Projecting the data from the other conditions onto that subspace (t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)).

For arbitrary experimental designs, we can perform the centering with a linear model fit. The second step remains the same. And after calculating the centered matrix, the third step is also straight forward. Below are the outlines how such a general procedure would work.

# A complex experimental design
lm_fit <- lm(t(logcounts(sce)) ~ condition + batch, data = colData(sce))
# The `residuals` function returns the coordinates minus the mean at the condition.
centered_mat <- t(residuals(lm_fit))
# Assuming that `is_reference_condition` contains a selection of the cells
ref_pca <- irlba::prcomp_irlba(centered_mat[,is_reference_condition], ...)
int_mat <- t(ref_pca$rotation) %*% centered_mat

Harmony

Harmony is a popular tool for data integration (Korsunsky et al. 2019). It is relatively fast and can handle more complex experimental designs than just a treatment/control setup. It is built around maximum diversity clustering (Figure 23). Like many clustering algorithms, maximum diversity clustering aims to minimize the distance of each data point to its respective cluster center, and in addition it also aims to maximize the mixing of conditions (or “batches”) assigned to each cluster.

Initializing centroids

harm_umap <- uwot::umap(harm_mat)

as_tibble(colData(sce)) |>
  mutate(umap = harm_umap) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(colour = stim), size = 0.3) +
    coord_fixed()

Challenge: How do you do integration in Seurat?

The integration method of Seurat is based on mutual nearest neighbours (MNN) by Haghverdi et al. (2018). However, Seurat calls the mutual nearest neighbours integration anchors. The main difference is that Seurat uses canonical correspondence analysis (CCA) instead of principal component analysis (PCA) for dimensionality reduction (Butler et al. 2018).

# This code only works with Seurat v5!
seur_obj2 <- Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)

see ?muscData and browseVignettes('muscData') for documentation

loading from cache

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')
Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

Warning: Keys should be one or more alphanumeric characters followed by an
underscore, setting key from tsne to tsne_

seur_obj2[["originalexp"]] <- split(seur_obj2[["originalexp"]], f = seur_obj2$stim)

Warning: Input is a v3 assay and `split()` only works for v5 assays; converting
• to a v5 assay

Warning: Assay originalexp changing from Assay to Assay5

seur_obj2 <- Seurat::NormalizeData(seur_obj2, normalization.method = "LogNormalize", scale.factor = 10000)

Normalizing layer: counts.ctrl

Normalizing layer: counts.stim

seur_obj2 <- Seurat::FindVariableFeatures(seur_obj2, selection.method = "vst", nfeatures = 500)

Finding variable features for layer counts.ctrl

Finding variable features for layer counts.stim

seur_obj2 <- Seurat::ScaleData(seur_obj2)

Centering and scaling data matrix

seur_obj2 <- Seurat::RunPCA(seur_obj2, verbose = FALSE)

seur_obj2 <- Seurat::IntegrateLayers(object = seur_obj2, method = Seurat::CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca", verbose = FALSE)

Warning: Adding a dimensional reduction (integrated.cca) without the associated
assay being present

seur_obj2 <- Seurat::RunUMAP(seur_obj2, dims = 1:30, reduction = "integrated.cca")

Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session

09:11:14 UMAP embedding parameters a = 0.9922 b = 1.112

09:11:14 Read 29065 rows and found 30 numeric columns

09:11:14 Using Annoy for neighbor search, n_neighbors = 30

09:11:14 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

**************************************************|
09:11:15 Writing NN index file to temp file /var/folders/vj/8z0c57g57572nts4n20799jc0000gr/T//RtmppmkYBN/file12aaf66d15e2f
09:11:15 Searching Annoy index using 1 thread, search_k = 3000
09:11:19 Annoy recall = 100%
09:11:20 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
09:11:20 Initializing from normalized Laplacian + noise (using RSpectra)
09:11:21 Commencing optimization for 200 epochs, with 1361706 positive edges
09:11:21 Using rng type: pcg
09:11:29 Optimization finished

Seurat::DimPlot(seur_obj2, reduction = "umap", group.by = "stim")

Adjacency matrix

Challenge: Plotting the adjacency matrix

Plot adj_jn, i.e., the matrix spanned by the j-th NMF components adjnmf$h[j, ] and adjnmf$w[, j]), as a heatmap similar to Figure 14.

Heatmap(adj_jn[csub, ][order(adjnmf$w[csub, j]), order(adjnmf$h[j, ])], 
  name = "adj", col = colorRamp2(c(0, 1), c("#ffffff", "RoyalBlue")),
  row_title = "cells", column_title = "genes", use_raster = TRUE,
  show_row_names = FALSE, show_column_names = FALSE, cluster_rows = FALSE, cluster_columns = FALSE
)

Figure 25: The matrix approximation from the j-th NMF component

Note that here we use a sequential colour map rather than a diverging one, since the values are all positive. To order the rows and columns in this plot, we use the size of the NMF coefficients w and h. This is unlike in Figure 14, where we used, in lack of better alternatives, a dendrogram ordering⁴.

Similar plots can easily be produced for other values of j.

⁴ See, e.g., https://jokergoo.github.io/ComplexHeatmap-reference/book/a-single-heatmap.html Section 2.3.4.

Bicluster coverage

Challenge: Looking at the coverage of the biclusters

Create a plot like Figure 25 for the linear combination of all 10 factors, i.e., visualize the low-rank (k=10) approximation and compare it to Figure 14.

dichotomise <- function(x) {
  # `cluster` is either 1 or 2, `centers` are the coordinates of the cluster centers, the `sign` function returns values 
  # of -1 or +1 depending on the relative position of the centers along the real axis. The expression below gives
  # `TRUE` if a point is in the cluster with the larger center, `FALSE` otherwise.
  with(kmeans(as.vector(x), centers = 2), 3 + sign(as.vector(diff(centers))) == 2 * cluster)
}

wd <- apply(adjnmf$w, 2, dichotomise)
hd <- apply(adjnmf$h, 1, dichotomise) |> t()

# We use this for visualization below.
adjappr     <- wd %*% hd

# We use this for seriation below. BEA_TSP seems to be sensitive to the absolute scale of the data (?), hence, the division of the geometric mean of d. 
adjappr4ser <- wd %*% diag(adjnmf$d) %*% hd  / exp(mean(log(adjnmf$d))) 

# Subsampling for visualisation
csub        <- sample(nrow(adj), 2048)
adjs        <- abs(adj[csub, ])
adjapprs    <- adjappr[csub, ]
adjappr4ser <- adjappr4ser[csub, ]

ser <- seriate(adjappr4ser, method = "BEA_TSP")
myhm <- function(x, name = deparse(substitute(x)), col = colorRamp2(range(x), c("#ffffff", "RoyalBlue"))) {
  Heatmap(x, name = name, col = col, row_title = "cells", column_title = "genes", 
      row_order = get_order(ser, 1), column_order = get_order(ser, 2),
      show_row_names = FALSE, show_column_names = FALSE, use_raster = TRUE)
}
myhm(adjs)
myhm(adjapprs)

Note how some biclusters are overlapping, as indicated by elements of adjappr having values above 1:

table(adjappr)

adjappr
      0       1       2       3       4 
6249200 4128178 2188054  244958    7275

for (j in seq_len(ncol(wd)))
  print(myhm(wd[csub, j, drop = FALSE]  %*% hd[j, , drop = FALSE], 
             name = sprintf("j=%d", j),
             col = colorRamp2(c(0, 1), c("#d0d0d0", "RoyalBlue"))))

Session Info

sessionInfo()

R version 4.6.0 (2026-04-24)
Platform: aarch64-apple-darwin23
Running under: macOS Tahoe 26.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Rome
tzcode source: internal

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] future_1.70.0               seriation_1.5.8            
 [3] magick_2.9.1                circlize_0.4.18            
 [5] ComplexHeatmap_2.28.0       lemur_1.9.0                
 [7] sparseMatrixStats_1.24.0    muscData_1.26.0            
 [9] ExperimentHub_3.2.0         AnnotationHub_4.2.0        
[11] BiocFileCache_3.2.0         dbplyr_2.5.2               
[13] SingleCellExperiment_1.34.0 SummarizedExperiment_1.42.0
[15] Biobase_2.72.0              GenomicRanges_1.64.0       
[17] Seqinfo_1.2.0               IRanges_2.46.0             
[19] S4Vectors_0.50.1            BiocGenerics_0.58.1        
[21] generics_0.1.4              MatrixGenerics_1.24.0      
[23] matrixStats_1.5.0           lubridate_1.9.5            
[25] forcats_1.0.1               stringr_1.6.0              
[27] dplyr_1.2.1                 purrr_1.2.2                
[29] readr_2.2.0                 tidyr_1.3.2                
[31] tibble_3.3.1                ggplot2_4.0.3              
[33] tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] DBI_1.3.0                 httr2_1.2.2              
 [3] rlang_1.2.0               magrittr_2.0.5           
 [5] clue_0.3-68               GetoptLong_1.1.1         
 [7] otel_0.2.0                compiler_4.6.0           
 [9] RSQLite_3.52.0            DelayedMatrixStats_1.34.0
[11] png_0.1-9                 vctrs_0.7.3              
[13] pkgconfig_2.0.3           shape_1.4.6.1            
[15] crayon_1.5.3              fastmap_1.2.0            
[17] XVector_0.52.0            labeling_0.4.3           
[19] ca_0.71.1                 rmarkdown_2.31           
[21] tzdb_0.5.0                bit_4.6.0                
[23] xfun_0.57                 cachem_1.1.0             
[25] beachmat_2.28.0           jsonlite_2.0.0           
[27] blob_1.3.0                DelayedArray_0.38.1      
[29] cluster_2.1.8.2           parallel_4.6.0           
[31] R6_2.6.1                  stringi_1.8.7            
[33] RColorBrewer_1.1-3        parallelly_1.47.0        
[35] Rcpp_1.1.1-1.1            iterators_1.0.14         
[37] knitr_1.51                splines_4.6.0            
[39] Matrix_1.7-5              timechange_0.4.0         
[41] tidyselect_1.2.1          rstudioapi_0.18.0        
[43] dichromat_2.0-0.1         abind_1.4-8              
[45] yaml_2.3.12               TSP_1.2.7                
[47] doParallel_1.0.17         codetools_0.2-20         
[49] curl_7.1.0                listenv_0.10.1           
[51] lattice_0.22-9            withr_3.0.2              
[53] KEGGREST_1.52.0           S7_0.2.2                 
[55] evaluate_1.0.5            Biostrings_2.80.0        
[57] pillar_1.11.1             BiocManager_1.30.27      
[59] filelock_1.0.3            foreach_1.5.2            
[61] BiocVersion_3.23.1        hms_1.1.4                
[63] scrapper_1.6.2            scales_1.4.0             
[65] globals_0.19.1            glue_1.8.1               
[67] tools_4.6.0               BiocNeighbors_2.6.0      
[69] registry_0.5-1            AnnotationDbi_1.74.0     
[71] colorspace_2.1-2          cli_3.6.6                
[73] rappdirs_0.3.4            S4Arrays_1.12.0          
[75] gtable_0.3.6              glmGamPoi_1.24.0         
[77] digest_0.6.39             SparseArray_1.12.2       
[79] rjson_0.2.23              htmlwidgets_1.6.4        
[81] farver_2.1.2              memoise_2.0.1            
[83] htmltools_0.5.9           lifecycle_1.0.5          
[85] httr_1.4.8                GlobalOptions_0.1.4      
[87] bit64_4.8.2

References

Ahlmann-Eltze, C., and Wolfgang Huber. 2023. “Comparison of Transformations for Single-Cell RNA-Seq Data.” Nature Methods, February, 665–72. https://doi.org/10.1038/s41592-023-01814-1.

Ahlmann-Eltze, Constantin, and Wolfgang Huber. 2025. “Analysis of Multi-Condition Single-Cell Data with Latent Embedding Multivariate Regression.” Nature Genetics. https://doi.org/10.1038/s41588-024-01996-0.

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology 36 (5): 411–20. https://doi.org/10.1038/nbt.4096.

Haghverdi, Laleh, Aaron T L Lun, Michael D Morgan, and John C Marioni. 2018. “Batch Effects in Single-Cell RNA-Sequencing Data Are Corrected by Matching Mutual Nearest Neighbors.” Nature Biotechnology 36 (5): 421–27. https://doi.org/10.1038/nbt.4091.

Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.

Korsunsky, Ilya, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. 2019. “Fast, Sensitive and Accurate Integration of Single-Cell Data with Harmony.” Nature Methods 16 (12): 1289–96. https://doi.org/10.1038/s41592-019-0619-0.

Luecken, Malte D., M. Büttner, K. Chaichoompu, A. Danese, M. Interlandi, M. F. Mueller, D. C. Strobl, et al. 2022. “Benchmarking Atlas-Level Data Integration in Single-Cell Genomics.” Nature Methods 19 (1): 41–50. https://doi.org/10.1038/s41592-021-01336-8.