Over the last decade, we have seen exponential growth of the scale of scRNA-seqdatasets to millions of cells sequenced in a single study. This has enabledresearchers to characterize the gene expression profiles of various cell typesacross tissues. The rapid growth of scRNA-seq data has also created an uniqueset of challenges, for instance, there is a pressing need for scalableapproaches for scRNA-seq data visualization.

This vignette introduces *scBubbletree*, a transparent workflowfor quantitative exploration of single cell RNA-seq data.

In short, the algorithm of *scBubbletree* performs clusteringto identify clusters (“bubbles”) of transcriptionally similar cells, and thenvisualizes these clusters as leafs in a hierarchical dendrogram (“bubbletree”)which describes their natural relationships. The workflow comprises foursteps: 1. determining the clustering resolution, 2. clustering, 3. hierarchicalcluster grouping and 4. visualization. We explain each step in the followingusing real scRNA-seq dataset of five cancer cell lines.

To run this vignette we need to load a few R-packages:

`library(scBubbletree)library(ggplot2)library(ggtree)library(patchwork)`

Here we will analyze a scRNA-seq dataset containing a mixture of 3,918 cellsfrom five human lung adenocarcinoma cell lines (HCC827, H1975, A549, H838 andH2228). The dataset is available here2 https://github.com/LuyiTian/sc_mixology/blob/master/data/sincell_with_class_5cl.RData.

The library has been prepared with 10x Chromium platform and sequenced withIllumina NextSeq 500 platform. Raw data has been processed with Cellranger.The tool demuxlet has been used to predict the identity of each cell based onknown genetic differences between the different cell lines.

Data processing was performed with R-package *Seurat*. Geneexpressions were normalized with the function *SCTransform*using default parameters, and principal component analysis (PCA) was performedwith function *RunPCA* based on the 5,000 most variable genes in the datasetidentified with the function *FindVariableFeatures*.

In both datasets we saw that the first 15 principal components capture mostof the variance in the data, and the proportion of variance explained by eachsubsequent principal component was negligible. Thus, we used the single cellprojections (embeddings) in 15-dimensional feature space, \(A^{3,918\times 15}\).

`# # This script can be used to generate data("d_ccl", package = "scBubbletree")# # # create directory# dir.create(path = "case_study/")# # # download the data from:# https://github.com/LuyiTian/sc_mixology/raw/master/data/# sincell_with_class_5cl.RData# # # load the data# load(file = "case_study/sincell_with_class_5cl.RData")# # # we are only interested in the 10x data object 'sce_sc_10x_5cl_qc'# d <- sce_sc_10x_5cl_qc# # # remove the remaining objects (cleanup)# rm(sc_Celseq2_5cl_p1, sc_Celseq2_5cl_p2, sc_Celseq2_5cl_p3, sce_sc_10x_5cl_qc)# # # get the meta data for each cell# meta <- colData(d)[,c("cell_line_demuxlet","non_mt_percent","total_features")]# # # create Seurat object from the raw counts and append the meta data to it# d <- Seurat::CreateSeuratObject(counts = d@assays$data$counts,# project = '')# # # check if all cells are matched between d and meta# # table(rownames(d@meta.data) == meta@rownames)# d@meta.data <- cbind(d@meta.data, meta@listData)# # # cell type predictions are provided as part of the meta data# table(d@meta.data$cell_line)# # # select 5,000 most variable genes# d <- Seurat::FindVariableFeatures(object = d,# selection.method = "vst",# nfeatures = 5000)# # # Preprocessing with Seurat: SCT transformation + PCA# d <- SCTransform(object = d,# variable.features.n = 5000)# d <- RunPCA(object = d,# npcs = 50,# features = VariableFeatures(object = d))# # # perform UMAP + t-SNE# d <- RunUMAP(d, dims = 1:15)# d <- RunTSNE(d, dims = 1:15)# # # save the preprocessed data# save(d, file = "case_study/d.RData")# # # save the PCA matrix 'A', meta data 'm' and# # marker genes matrix 'e'# d <- get(load(file ="case_study/d.RData"))# A <- d@reductions$pca@cell.embeddings[, 1:15]# m <- d@meta.data# e <- t(as.matrix(d@assays$SCT@data[# rownames(d@assays$SCT@data) %in%# c("ALDH1A1",# "PIP4K2C",# "SLPI",# "CT45A2",# "CD74"), ]))# # d_ccl <- list(A = A, m = m, e = e)# save(d_ccl, file = "data/d_ccl.RData")`

Load the processed PCA matrix and the meta data

`data("d_ccl", package = "scBubbletree")A <- d_ccl$Am <- d_ccl$me <- d_ccl$e`

We will analyze this data with *scBubbletree*.

As first input *scBubbletree* uses matrix \(A^{n\times f}\) whichrepresents a low-dimensional projection of the original scRNA-seq data, with\(n\) rows as cells and \(f\) columns as low-dimension features.

We will use the PCA data generated by *Seurat* as \(A\). Inparticular, we will use the first 15 principal components (PCs) as everyadditional PC explains negligible amount of variance in the data.

**Important remark about \(A\)**: the *scBubbletree* workflowworks directly with the numeric matrix \(A^{n\times f}\) and is agnostic tothe initial data processing protocol. This enables seamless integrationof *scBubbletree* with computational pipelines using objectsgenerated by the R-packages *Seurat* and*SingleCellExperiment*. The users simply have to extract \(A\)from the corresponding *Seurat* or*SingleCellExperiment* objects.

`# A has n=cells as rows, f=features as columns (e.g. from PCA)dim(A)`

`FALSE [1] 3918 15`

The *scBubbletree* workflow performs the following steps:

- determine the clustering resolution (clusters \(k\) or resolution \(r\))
- graph-based community detection (e.g.with Louvain) or k-means clustering
- hierarchical organization of clusters (bubbles)
- visualization

## 4.1 1. determine the clustering resolution

If we use graph-based community detection (GCD, **recommended for scRNA-seq**)with e.g.the Louvain or Leiden method, then we need to find appropriate valuefor the resolution parameter \(r\). Otherwise, we can use the simpler k-meansclustering algorithm in which case we need to find an appropriate value of thenumber of clusters \(k\). In the next we will try both, GCD and k-means,clustering.

### 4.1.1 1.1 determining the number of clusters \(k\)

How many clusters (cell types) are there are in the data? Can we guess areasonable value of \(k\)?

To find a reasonable value of \(k\) we can study the literature or databasessuch as the human protein atlas database (HPA). We can also use the function`get_k`

for data-driven inference of \(k\) based on the Gap statistic and thewithin-cluster sum of squares (WCSS).

As this is a toy dataset, we will skip the first approach and perform adata-driven search for \(k\) using `get_k`

. As input we need to provide thematrix \(A\) as input and a vector of \(k\)s. The output will be the Gapstatistic and WCSS estimates for each \(k\).

Lets run `get_k`

now:

`b_k <- get_k(B_gap = 5, ks = 1:10, x = A, n_start = 50, iter_max = 200, kmeans_algorithm = "MacQueen", cores = 1)`

The Gap statistic and WCSS curves have a noticeable knee (elbow) at \(k=5\).Hence, \(k\)=5 appears to be reasonable first choice of \(k\). Means (points) and95% confidence intervals are shown for the Gap statistic at each \(k\) computedusing `B_gap`

=5 MCMC simulations.

`g0 <- ggplot(data = b_k$gap_stats_summary)+ geom_line(aes(x = k, y = gap_mean))+ geom_point(aes(x = k, y = gap_mean), size = 1)+ geom_errorbar(aes(x = k, y = gap_mean, ymin = L95, ymax = H95), width = 0.1)+ ylab(label = "Gap")|ggplot(data = b_k$wcss_stats_summary)+ geom_line(aes(x = k, y = wcss_mean))+ geom_point(aes(x = k, y = wcss_mean), size = 1)+ ylab(label = "WCSS")+ scale_y_log10()+ annotation_logticks(base = 10, sides = "l")`

`g0`

### 4.1.2 1.2 determining the resolution parameter \(r\) for Louvain clustering

For Louvain clustering we need to select a clustering resolution \(r\). Higherresolutions lead to more communities and lower resolutions lead to fewercommunities. We can use the same strategy as before to find a reasonablereasonable value of \(r\).

Lets use the function `get_r`

for data-driven estimation of \(r\) based onthe Gap statistic and WCSS. As input we need to provide the matrix \(A\) anda vector of \(r\)s. The output will be the Gap statistic and WCSS estimatefor each \(r\) (or the number of communities \(k'\) detected at resolution \(r\)).

`b_r <- get_r(B_gap = 5, rs = 10^seq(from = -4, to = 0, by = 0.35), x = A, n_start = 10, iter_max = 50, algorithm = "original", knn_k = 50, cores = 1)`

The Gap statistic and WCSS curves have noticeable knees (elbows) at \(k'=5\)(\(r=0.0025\)). Means (points) and 95% confidence intervals are shown for theGap statistic at each \(k\) computed using `B_gap`

=5 MCMC simulations.

`g0_r <- (ggplot(data = b_r$gap_stats_summary)+ geom_line(aes(x = k, y = gap_mean))+ geom_point(aes(x = k, y = gap_mean), size = 1)+ geom_errorbar(aes(x = k, y = gap_mean, ymin = L95, ymax = H95), width = 0.1)+ ylab(label = "Gap")+ xlab(label = "k'")|ggplot(data = b_r$gap_stats_summary)+ geom_line(aes(x = r, y = gap_mean))+ geom_point(aes(x = r, y = gap_mean), size = 1)+ geom_errorbar(aes(x = r, y = gap_mean, ymin = L95, ymax = H95), width = 0.1)+ ylab(label = "Gap")+ xlab(label = "r")+ scale_x_log10()+ annotation_logticks(base = 10, sides = "b"))/(ggplot(data = b_r$wcss_stats_summary)+ geom_line(aes(x = k, y = wcss_mean))+ geom_point(aes(x = k, y = wcss_mean), size = 1)+ ylab(label = "WCSS")+ xlab(label = "k'")|ggplot(data = b_r$wcss_stats_summary)+ geom_line(aes(x = r, y = wcss_mean))+ geom_point(aes(x = r, y = wcss_mean), size = 1)+ ylab(label = "WCSS")+ xlab(label = "r")+ scale_x_log10()+ annotation_logticks(base = 10, sides = "b"))`

`g0_r`

A range of resolutions yields \(k=5\) number of communities, i.e.\(r = 0.0025\)and \(r = 0.125\) result in \(k=5\). Lets use \(r=0.1\) for clustering

`ggplot(data = b_r$gap_stats_summary)+ geom_point(aes(x = r, y = k), size = 1)+ xlab(label = "r")+ ylab(label = "k'")+ scale_x_log10()+ annotation_logticks(base = 10, sides = "b")`

`knitr::kable(x = b_r$gap_stats_summary[b_r$gap_stats_summary$k == 5, ], digits = 4, row.names = FALSE)`

gap_mean | r | k | gap_SE | L95 | H95 |
---|---|---|---|---|---|

2.1660 | 0.0025 | 5 | 0.0057 | 2.1548 | 2.1771 |

2.1656 | 0.0056 | 5 | 0.0051 | 2.1556 | 2.1757 |

2.1641 | 0.0126 | 5 | 0.0052 | 2.1540 | 2.1742 |

2.1644 | 0.0282 | 5 | 0.0075 | 2.1498 | 2.1791 |

2.1653 | 0.0631 | 5 | 0.0032 | 2.1590 | 2.1715 |

## 4.2 2. clustering

### 4.2.1 2.1 with k-means

Now that we found out that \(k=5\) is a reasonable choice based on the data,we will perform k-means clustering with \(k=5\) and \(A\) as inputs. For thiswe will use the function kmeans (R-package stats) which offers variousvariants of k-means. Here we will use MacQueen’s k-means variant andperform \(n_\textit{start} = 1000\) (default in *scBubbletree*)random starts and a maximum number of iterations \(iter_\textit{max}=300\).

**Important remark**: for smaller datasets (e.g.\(n<50,000\)) \(n_{start}=1000\)and \(n_{iter} = 300\) are unnecessarily high, however for larger datasets thisis necessary to make sure that k-means converges.

## 4.3 3. hierarchical grouping

After the clustering is complete we will organize the bubbles in a naturalhierarchy. For this we perform \(B\) bootstrap iterations (default \(B=200\)).In iteration \(b\) the algorithm draws a random subset of \(N_\textit{eff}\)(default \(N_\textit{eff}=200\)) cells with replacement from each cluster andcomputes the average inter-cluster Euclidean distances. This data is used topopulate the distance matrix (\(D^{k\times k}_b\)), which is provided as inputfor hierarchical clustering with average linkage to generate a hierarchicalclustering dendrogram \(H_b\).

The collection of distance matrices that are computed during \(B\) iterations areused to compute a consensus (average) distance matrix (\(\hat{D}^{k\times k}\))and from this a corresponding consensus hierarchical dendrogram (bubbletree;\(\hat{H}\)) is constructed. The collection of dendrograms are used to quantifythe robustness of the bubbletree topology, i.e.to count the number of timeseach branch in the bubbletree is found among the topologies of the bootstrapdendrograms. Branches can have has variable degrees of support ranging between0 (no support) and \(B\) (complete support). Distances between bubbles (inter-bubble relationships) are described quantitatively in the bubbletree as sumsof branch lengths.

Steps 2.1 and 3. are performed next

`k5_kmeans <- get_bubbletree_kmeans( x = A, k = 5, cores = 1, B = 200, N_eff = 200, round_digits = 1, show_simple_count = FALSE, kmeans_algorithm = "MacQueen")`

… and plot the bubbletree

`k5_kmeans$tree`

Lets describe the bubbletree:

**bubbles**: The bubbletree has `k=5`

bubbles (clusters) shown as leaves. Theabsolute and relative cell frequencies in each bubble and the bubble IDs areshown as labels. Bubble radii scale linearly with absolute cell count in eachbubble, i.e.large bubbles have many cells and small bubbles contain few cells.

Bubble 1 is the largest one in the dendrogram and contains 1,253 cells(\(\approx\) 32% of all cells in the dataset). Bubble 4 is the smallest oneand contains only 436 cells (\(\approx\) 11% of all cells in the dataset).

We can access the bubble data shown in the bubbletree

`knitr::kable(k5_kmeans$tree_meta, digits = 2, row.names = FALSE)`

label | Cells | n | p | pct | lab_short | lab_long | tree_order |
---|---|---|---|---|---|---|---|

5 | 436 | 3918 | 0.11 | 11.1 | 5 (0.4K, 11.1%) | 5 (436, 11.1%) | 5 |

3 | 593 | 3918 | 0.15 | 15.1 | 3 (0.6K, 15.1%) | 3 (593, 15.1%) | 4 |

4 | 1253 | 3918 | 0.32 | 32.0 | 4 (1.3K, 32%) | 4 (1253, 32%) | 3 |

2 | 760 | 3918 | 0.19 | 19.4 | 2 (0.8K, 19.4%) | 2 (760, 19.4%) | 2 |

1 | 876 | 3918 | 0.22 | 22.4 | 1 (0.9K, 22.4%) | 1 (876, 22.4%) | 1 |

**topology**: inter-bubble distances are represented by sums of branchlengths in the dendrogram. Branches of the bubbletree are annotated withtheir bootstrap support values (red branch labels). The branch supportvalue tells us how manytimes a given branch from the bubbletree was foundamong the \(B\) bootstrap dendrograms. We ran `get_bubbletree_kmeans`

with\(B=200\). All but one branch have complete (200 out of 200) support, andone branch has lower support of 179 (85%). This tells us that the branchbetween bubbles (3, 4) and 1 is not as robust.

### 4.3.1 2.2 with Louvain

Lets also perform clustering with the Louvain algorithm (function FindClusters,R-package *Seurat*) and resolution parameter \(r=0.1\). There arenumerous variants of the Louvain algorithm. Here we will use the originalimplementation. We will do clustering with \(n_\textit{start} = 20\) randomstarts and a maximum number of iterations \(iter_\textit{max} = 100\).

Steps 2.2 and 3. (hierarchical clustering) are performed next

`k5_louvain <- get_bubbletree_graph(x = A, r = 0.0025, n_start = 20, iter_max = 100, algorithm = "original", knn_k = 50, cores = 1, B = 200, N_eff = 200, round_digits = 1, show_simple_count = FALSE)`

… and plot the bubbletree. We see nearly identical dendrogram as the onegenerated by kmeans clustering. The bubble IDs are different but we see similarbubble sizes, topology and branch robustness values.

`k5_louvain$tree`

The two dendrograms shown side-by-side:

`k5_kmeans$tree|k5_louvain$tree`

Given the high degree of similarity between the two clustering solutions weproceed in the next with the k-means results.

## 4.4 4. visualization

To extract biologically useful information from the bubbletree (and also for2D UMAP or t-SNE plots) we need to adorn it with biologically relevant cellfeatures. This includes both **numeric** and **categorical** cell features.

Numeric cell features:

- gene expression
- % of mitochondrial transcripts
- number of UMIs, genes detected
- …

Categorical cell features:

- cell type label (e.g.B-cells, T-cells, moncytes, …)
- cell cycle phase (e.g.S, M, G1, …)
- sample name (e.g.S1, S2, S3, …)
- treatment group (e.g.cancer vs.control cell)
- multiplet status (e.g.singlet, doublet or multiplet)
- …

In the next two paragraph we will explain how to ‘attach’ numeric andcategorical features to the bubbletree using *scBubbletree*.

## 4.5 Attaching categorical features

Categorical cell features can be ‘attached’ to the bubbletree using the function`get_cat_tiles`

. Here we will show the relative frequency of cell type labelsacross the bubbles (parameter `integrate_vertical=TRUE`

).

Interpretation of the figure below:

- we see high degree of co-occurrence between cell lines and bubbles, i.e.each bubble is made up of cells from a distinct cell line
- for instance, 99.8% of cells that have feature HCC827 are found in bubble 2
- columns in the tile plot integrate to 100%

`w1 <- get_cat_tiles(btd = k5_kmeans, f = m$cell_line_demuxlet, integrate_vertical = TRUE, round_digits = 1, x_axis_name = 'Cell line', rotate_x_axis_labels = TRUE, tile_text_size = 2.75)(k5_kmeans$tree|w1$plot)+ patchwork::plot_layout(widths = c(1, 1))`

We can also show the inter-bubble cell type composition, i.e.the relativefrequencies of different cell types in a specific bubble (with parameter`integrate_vertical=FALSE`

).

Interpretation of the figure below:

- the bubbles appear to be “pure” \(\rightarrow\) made up of cells fromdistinct cell lines
- the cell line composition of bubble 2 is: 0.1% H838, 99.9% H2228, 0.1%A549, 0.1% H1975 and 0% HCC827 cells
- rows integrate to 100%

`w2 <- get_cat_tiles(btd = k5_kmeans, f = m$cell_line_demuxlet, integrate_vertical = FALSE, round_digits = 1, x_axis_name = 'Cell line', rotate_x_axis_labels = TRUE, tile_text_size = 2.75)(k5_kmeans$tree|w2$plot)+ patchwork::plot_layout(widths = c(1, 1))`

*scBubbletree* uses R-package *ggtree* tovisualize the bubbletree, and *ggplot2* to visualizeannotations. Furthermore, R-package *patchwork* is used tocombine plots.

`(k5_kmeans$tree|w1$plot|w2$plot)+ patchwork::plot_layout(widths = c(1, 2, 2))+ patchwork::plot_annotation(tag_levels = "A")`

## 4.6 Gini impurity index

To quantify the purity of a cluster (or bubble) \(i\) with \(n_i\) number of cells,each of which carries one of \(L\) possible labels (e.g.cell lines), we cancompute the Gini impurity index:

\(\textit{GI}_i=\sum_{j=1}^{L} \pi_{ij}(1-\pi_{ij})\),

with \(\pi_{ij}\) as the relative frequency of label \(j\) in cluster \(i\). Inhomogeneous (`pure`

) clusters most cells carry a distinct label. Hence, the\(\pi\)’s are close to either 1 or 0, and *GI* takes on a small value close tozero. In `impure`

clusters cells carry a mixture of different labels. Inthis case most \(\pi\) are far from either 1 or 0, and *GI* diverges from 0and approaches 1. If the relative frequencies of the different labels incluster \(i\) are equal to the (background) relative frequencies of the labelsin the sample, then cluster \(i\) is completely `impure`

.

To compute the overall Gini impurity of a bubbletree, which represents aclustering solution with \(k\) bubbles, we estimated the weighted Gini impurity(*WGI*) by computing the weighted (by the cluster size) average of the

\(\textit{WGI}=\sum_{i=1}^{k} \textit{GI}_i \dfrac{n_i}{n}\),

with \(n_i\) as the number of cells in cluster \(i\) and \(n=\sum_i n_i\).

The Gini impurity results are shown below:

`# giniget_gini(labels = m$cell_line_demuxlet, clusters = k5_kmeans$cluster)$gi`

`FALSE cluster GIFALSE 1 3 0.020099588FALSE 2 1 0.004558391FALSE 3 5 0.000000000FALSE 4 2 0.007873961FALSE 5 4 0.000000000`

All cluster-specific *GI*s are close to 0 and hence also *WGI* is close to 0.This indicates nearly perfect mapping of cell lines to bubbles. This analysisperformed for different values of \(k\) with function `get_gini_k`

, which takesas main input the output of `get_k`

or `get_r`

`gini_boot <- get_gini_k(labels = m$cell_line_demuxlet, obj = b_k)`

From the figure we can conclude that WGI drops to 0 at `k=5`

, and all labelsare nearly perfectly split across the bubbles with each bubble containing cellsexclusively from one cell type.

`g1 <- ggplot(data = gini_boot$wgi_summary)+ geom_point(aes(x = k, y = wgi), size = 0.75)+ ylab(label = "WGI")+ ylim(c(0, 1))g1`

## 4.7 Attaching numeric features

We can also “attach” numeric cell features to the bubbletree. We will “attach”the expression of five marker genes, i.e.one marker gene for each of the fivecancer cell lines.

We can visualize numeric features in *two* ways.

First, we can show numeric feature aggregates (e.g.“mean”, “median”, “sum”,“pct nonzero” or “pct zero”) in the different bubbles with `get_num_tiles`

`w3 <- get_num_tiles(btd = k5_kmeans, fs = e, summary_function = "mean", x_axis_name = 'Gene expression', rotate_x_axis_labels = TRUE, round_digits = 1, tile_text_size = 2.75)(k5_kmeans$tree|w3$plot)+ patchwork::plot_layout(widths = c(1, 1))`

Second, we can visualize the distributions of the numeric cell features in eachbubble as violins with `get_num_violins`

`w4 <- get_num_violins(btd = k5_kmeans, fs = e, x_axis_name = 'Gene expression', rotate_x_axis_labels = TRUE)(k5_kmeans$tree|w3$plot|w4$plot)+ patchwork::plot_layout(widths = c(1.5, 2, 2.5))+ patchwork::plot_annotation(tag_levels = 'A')`

## 4.8 Quality control with *scBubbletree*

What is the percent of UMIs coming from mitochondrial genes in each bubble?

`w_mt_dist <- get_num_violins(btd = k5_kmeans, fs = 1-m$non_mt_percent, x_axis_name = 'MT [%]', rotate_x_axis_labels = TRUE)w_umi_dist <- get_num_violins(btd = k5_kmeans, fs = m$nCount_RNA/1000, x_axis_name = 'RNA count (in thousands)', rotate_x_axis_labels = TRUE)w_gene_dist <- get_num_violins(btd = k5_kmeans, fs = m$nFeature_RNA, x_axis_name = 'Gene count', rotate_x_axis_labels = TRUE)(k5_kmeans$tree|w_mt_dist$plot|w_umi_dist$plot|w_gene_dist$plot)+ patchwork::plot_layout(widths = c(0.7, 1, 1, 1))+ patchwork::plot_annotation(tag_levels = 'A')`

The clustering of both algorithms seems to work well, i.e.cells from specificcell lines are clustered together. Two potential issues:

- overplotting: points are stacked on top of each other \(\rightarrow\) lossof information about cell density distribution \(\rightarrow\) we areunable to accurately answer two questions:
- which cell line contains most cells?
- what are the relative frequencies of cells across cell lines?

- preservation of long-range cell distances: UMAP and t-SNE focus onpreserving local cell distance (at the cost of global distances)\(\rightarrow\) are there pairs of cell lines that are more similar to eachother than the remaining ones in the sample?

These challenges will become more severe in real datasets (e.g.PBMC samples,see case study B) which contain many clusters of cells that are not clearlyseparable.

## 4.9 scBubbletree can incorporate results from other clustering approaches

Wide range of clustering approaches are used for clustering of scRNA-seq data.*scBubbletree* implements the function `get_bubbletree_dummy`

toallow users to incorporate results from various clustering approaches togetherwith our workflow. With this function we skip the k-means clustering portion ofthe workflow and proceed with computing distances between the clusters andgeneration of the bubbletree.

Lets try `get_bubbletree_dummy`

. First, will perform k-medoids clustering withR-package *cluster* and then generate the bubbletree:

`pam_k5 <- cluster::pam(x = A, k = 5, metric = "euclidean")dummy_k5_pam <- get_bubbletree_dummy(x = A, cs = pam_k5$clustering, B = 200, N_eff = 200, cores = 2, round_digits = 1)dummy_k5_pam$tree| get_cat_tiles(btd = dummy_k5_pam, f = m$cell_line_demuxlet, integrate_vertical = TRUE, round_digits = 1, tile_text_size = 2.75, x_axis_name = 'Cell line', rotate_x_axis_labels = TRUE)$plot`

## 4.10 Summary

*scBubbletree* promotes simple and transparent visual explorationof scRNA-seq. It is *not a black-box* approach and the user *is encouraged* toexplore the data with different values of \(k\) and \(r\) or custom clusteringsolutions. Attaching of cell features to the bubbletree is necessary forbiological interpretation of the individual bubbles and their relationshipswhich are described by the bubbletree topology.

## 4.11 Session Info

`sessionInfo()`

`FALSE R version 4.3.0 RC (2023-04-13 r84269)FALSE Platform: x86_64-pc-linux-gnu (64-bit)FALSE Running under: Ubuntu 22.04.2 LTSFALSE FALSE Matrix products: defaultFALSE BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so FALSE LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0FALSE FALSE locale:FALSE [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C FALSE [3] LC_TIME=en_GB LC_COLLATE=C FALSE [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 FALSE [7] LC_PAPER=en_US.UTF-8 LC_NAME=C FALSE [9] LC_ADDRESS=C LC_TELEPHONE=C FALSE [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C FALSE FALSE time zone: America/New_YorkFALSE tzcode source: system (glibc)FALSE FALSE attached base packages:FALSE [1] stats graphics grDevices utils datasets methods base FALSE FALSE other attached packages:FALSE [1] patchwork_1.1.2 ggtree_3.8.0 ggplot2_3.4.2 scBubbletree_1.2.0FALSE [5] BiocStyle_2.28.0 FALSE FALSE loaded via a namespace (and not attached):FALSE [1] deldir_1.0-6 pbapply_1.7-0 gridExtra_2.3 FALSE [4] rlang_1.1.0 magrittr_2.0.3 RcppAnnoy_0.0.20 FALSE [7] spatstat.geom_3.1-0 matrixStats_0.63.0 ggridges_0.5.4 FALSE [10] compiler_4.3.0 png_0.1-8 vctrs_0.6.2 FALSE [13] reshape2_1.4.4 stringr_1.5.0 pkgconfig_2.0.3 FALSE [16] fastmap_1.1.1 magick_2.7.4 ellipsis_0.3.2 FALSE [19] labeling_0.4.2 utf8_1.2.3 promises_1.2.0.1 FALSE [22] rmarkdown_2.21 purrr_1.0.1 xfun_0.39 FALSE [25] cachem_1.0.7 aplot_0.1.10 jsonlite_1.8.4 FALSE [28] goftest_1.2-3 highr_0.10 later_1.3.0 FALSE [31] spatstat.utils_3.0-2 irlba_2.3.5.1 parallel_4.3.0 FALSE [34] cluster_2.1.4 R6_2.5.1 ica_1.0-3 FALSE [37] spatstat.data_3.0-1 bslib_0.4.2 stringi_1.7.12 FALSE [40] RColorBrewer_1.1-3 reticulate_1.28 parallelly_1.35.0 FALSE [43] scattermore_0.8 lmtest_0.9-40 jquerylib_0.1.4 FALSE [46] Rcpp_1.0.10 bookdown_0.33 knitr_1.42 FALSE [49] tensor_1.5 future.apply_1.10.0 zoo_1.8-12 FALSE [52] sctransform_0.3.5 httpuv_1.6.9 Matrix_1.5-4 FALSE [55] splines_4.3.0 igraph_1.4.2 tidyselect_1.2.0 FALSE [58] abind_1.4-5 yaml_2.3.7 spatstat.random_3.1-4 FALSE [61] spatstat.explore_3.1-0 codetools_0.2-19 miniUI_0.1.1.1 FALSE [64] listenv_0.9.0 lattice_0.21-8 tibble_3.2.1 FALSE [67] plyr_1.8.8 withr_2.5.0 shiny_1.7.4 FALSE [70] treeio_1.24.0 ROCR_1.0-11 evaluate_0.20 FALSE [73] Rtsne_0.16 gridGraphics_0.5-1 future_1.32.0 FALSE [76] survival_3.5-5 proxy_0.4-27 polyclip_1.10-4 FALSE [79] fitdistrplus_1.1-11 pillar_1.9.0 BiocManager_1.30.20 FALSE [82] Seurat_4.3.0 KernSmooth_2.23-20 plotly_4.10.1 FALSE [85] ggfun_0.0.9 generics_0.1.3 sp_1.6-0 FALSE [88] munsell_0.5.0 scales_1.2.1 tidytree_0.4.2 FALSE [91] globals_0.16.2 xtable_1.8-4 glue_1.6.2 FALSE [94] lazyeval_0.2.2 tools_4.3.0 data.table_1.14.8 FALSE [97] RANN_2.6.1 leiden_0.4.3 cowplot_1.1.1 FALSE [100] grid_4.3.0 tidyr_1.3.0 ape_5.7-1 FALSE [103] colorspace_2.1-0 nlme_3.1-162 cli_3.6.1 FALSE [106] spatstat.sparse_3.0-1 fansi_1.0.4 viridisLite_0.4.1 FALSE [109] dplyr_1.1.2 uwot_0.1.14 gtable_0.3.3 FALSE [112] yulab.utils_0.0.6 sass_0.4.5 digest_0.6.31 FALSE [115] progressr_0.13.0 ggrepel_0.9.3 ggplotify_0.1.0 FALSE [118] farver_2.1.1 htmlwidgets_1.6.2 SeuratObject_4.1.3 FALSE [121] htmltools_0.5.5 lifecycle_1.0.3 httr_1.4.5 FALSE [124] mime_0.12 MASS_7.3-59`