Chapter 7 MI-based clustering analysis
In this chapter, we will introduce more about the MICA component, and walk you through the MICA workflow, including preparing inputs, running MICA, visualizing and integrating MICA outputs into SparseEset object.
7.1 Introduction to MICA
MICA (Mutual Information-based Clustering Analysis) is a clustering tool designed for single cell genomics data. Compared to most existing single-cell clustering algorithms, MICA has two unique features:
- MICA uses mutual information to measure cell-cell similarity for unsupervised clustering analysis, while most existing tools employ linear-transformation of PCA and/or co-expression analysis using linear Pearson or Spearman correlations that may not capture the nonlinear cell-cell distance.
- MICA uses all high-quality features for clustering, while most existing tools select the top highly variable features to improve the clustering speed. This is arbitrary and may lose the information that can distinguish close cell states.
MICA is developed using Python framework, to take its strengths in calculation speed and memory consumption. A lot of effort has been made to improve the interoperability between Python and R. Now MICA works seamlessly with the SparseExpressionSet object. The input of MICA can be easily generated from the SparseExpressionSet object by generateMICAinput()
, and the output of MICA, the clustering results, can be effortlessly visualized by MICAplot()
and integrated into SparseExpressionSet object by addMICAoutput()
.
7.2 Generate MICA input
The standard input of MICA is a normalized and log-transformed gene expression matrix. scMINER can generate this matrix from sparse eSet object and save it into a file that can be directly read by MICA. MICA accepts .h5ad
or .txt
format as the input file, which can be easily generated by embedded function generateMICAinput()
:
## generate MICA input in txt format
generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/work-path/PBMC14k/MICA/micaInput.txt", overwrite = FALSE)
## check the format of MICA input
mica_input <- read.delim(system.file("extdata/demo_pbmc14k/PBMC14k/MICA/micaInput.txt", package = "scMINER"), header = T, sep = "\t", row.names = 1)
mica_input[1:5,1:5]
To use the .h5ad format, run the codes below.
## generate MICA input in h5ad format: anndata package is needed
generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/work-path/PBMC14k/MICA/micaInput.h5ad", overwrite = FALSE)
In addition to generating the standard MICA input file, generateMICAinput()
also returns the recommended commands of running MICA. You can copy the commands, modify according and run them.
7.3 Run MICA
MICA features two different modes named by their different dimension reduction techniques:
- Multi-Dimensional Scaling (MDS) mode: this mode is more accurate and robust for small datasets (less than 5,000 cells, be default) due to its global dimension reduction nature;
- Graph Embedding (GE) mode: this mode works better with large datasets (more than 5,000 cells, by default) using a graph embedding approach to explore distant neighbor cells.
In this case, since there are 13,605 cells, we will use the MICA GE
mode for the clustering:
mica ge -i /work-path/PBMC14k/MICA/micaInput.txt -o /work-path/PBMC14k/MICA/micaOutput -minr 0.1 -maxr 9.0 -ss 0.05 -nw 4
This command will generate the clustering results of multiple resolutions, from 0.1 to 9.0, with a step size of 0.05. MICA supports multi-threading. The -nw 4
here sets 4 threads to run in parallel. You can change it accordingly to speed up the clustering analysis.
7.4 Integrate MICA outputs into SparseEset object
MICA generates several files and save all of them in the output directory specified by the user with -o
argument. The core, and only, output file we need for subsequent analysis is the clustering label file named in the format of ProjectName_clustering_VisualizeMethod_euclidean_NumberOfDimensions_Resolution.txt
. In this case, since we used a range of resolutions, there are several clustering label files generated, one for each resolution. Based on the knowledge about PBMC14k dataset, we compared the results of different resolutions and picked clustering_UMAP_euclidean_20_2.05.txt
for subsequent analysis.
micaOutput <- read.table(system.file("extdata/demo_pbmc14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), header = TRUE, sep = "\t", quote = "", stringsAsFactors = F)
head(micaOutput)
## ID X Y label
## 1 CACTTTGACGCAAT 14.91650 13.04096 6
## 2 GTTACGGAAACGAA 14.57031 10.27093 6
## 3 CACTTATGAGTCGT 14.28869 13.61674 6
## 4 GCATGTGATTCTGT 14.12546 13.36319 6
## 5 TAGAATACGTATCG 14.91227 11.19407 6
## 6 CAAGAAGACCCTCA 15.34154 12.25821 6
As shown above, the clustering label file contains four columns:
ID
: cell barcodes;X
: coordinates of UMAP_1 or tSNE_1;Y
: coordinates of UMAP_2 or tSNE_2;label
: labels of predicted clusters.
The clustering result can be easily easily added to the SparseExpressionSet object by addMICAoutput()
:
pbmc14k_log2cpm.eset <- addMICAoutput(pbmc14k_log2cpm.eset, mica_output_file = system.file("extdata/demo_pbmc14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), visual_method = "umap")
head(pData(pbmc14k_log2cpm.eset))
## trueLabel_full trueLabel projectID nUMI nFeature pctMito
## CACTTTGACGCAAT CD14+ Monocyte Monocyte PBMC14k 764 354 0.01832461
## GTTACGGAAACGAA CD14+ Monocyte Monocyte PBMC14k 956 442 0.01569038
## CACTTATGAGTCGT CD14+ Monocyte Monocyte PBMC14k 629 323 0.02066773
## GCATGTGATTCTGT CD14+ Monocyte Monocyte PBMC14k 875 427 0.02628571
## TAGAATACGTATCG CD14+ Monocyte Monocyte PBMC14k 1060 445 0.03207547
## CAAGAAGACCCTCA CD14+ Monocyte Monocyte PBMC14k 849 384 0.01531213
## pctSpikeIn CellID UMAP_1 UMAP_2 clusterID
## CACTTTGACGCAAT 0 CACTTTGACGCAAT 14.91650 13.04096 6
## GTTACGGAAACGAA 0 GTTACGGAAACGAA 14.57031 10.27093 6
## CACTTATGAGTCGT 0 CACTTATGAGTCGT 14.28869 13.61674 6
## GCATGTGATTCTGT 0 GCATGTGATTCTGT 14.12546 13.36319 6
## TAGAATACGTATCG 0 TAGAATACGTATCG 14.91227 11.19407 6
## CAAGAAGACCCTCA 0 CAAGAAGACCCTCA 15.34154 12.25821 6
7.5 Visualize the MICA output
scMINER provides a function, MICAplot()
to easily visualize the clustering results on a 2D plot, UMAP or tSNE. And it can be colored by multiple variables, including cluster label, sample source, nUMI, nFeature, pctMito and more.
7.5.3 Color-coded by nUMI, for QC purpose
MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "nUMI", do.logTransform = TRUE, point.size = 0.1)
## The values in "nUMI" have been transformed by log2(value + 1). To turn transformation off, set do.logTransform = FALSE.
7.5.4 Color-coded by nFeature, for QC purpose
MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "nFeature", do.logTransform = TRUE, point.size = 0.1)
## The values in "nFeature" have been transformed by log2(value + 1). To turn transformation off, set do.logTransform = FALSE.
### Color-coded by pctMito, for QC purpose