A quick tutorial to run scMINER

Welcome to the quick tutorial of scMINER! This tutorial aims to provide immediate and practical guidance on running scMINER with your own data. For full functionality and detailed documentation of scMINER, please refer to the Full Documentation. In this tutorial, we will walk through the core analysis of scMINER step-by-step using a ground truth dataset called PBMC14k.

Get started

library(scMINER)
library(dplyr)
library(ggplot2)
library(anndata)
library(hdf5r)

Data preprocessing

Create project space

The project space created by scMINER is a folder that can not only keep your data centralized and organized but also make the scMINER pipeline more smooth and robust. We encourge you to create a project space for each of your studies.

scminer_dir <- createProjectSpace(project_dir = "/your-path", project_name = "PBMC14k")

This creates a folder named PBMC14k in the directory of /your-path, and creates four subfolders in it:

DATA: to save the sparse eSet objects and other files;
MICA: to save the inputs and outputs of mutual information-based clustering analysis;
SJARACNe: to save the inputs and outputs of network inference and quality control;
PLOT: to save the files of data visualization.

Generate gene expression matrix

scMINER provides four functions to generate gene expression matrix from multiple-format inputs:

## Input type 1: Data directory by 10x Genomics, containing matrix.mtx, barcodes.tsv and features.tsv (or genes.tsv)
demo1_mtx <- readInput_10x.dir(input_dir = system.file("extdata/demo_inputs/cell_matrix_10x", package = "scMINER"),
                               featureType = "gene_symbol", removeSuffix = TRUE, addPrefix = "demo1")

## Input type 2: Text-table file, eg. txt, tsv, csv
demo2_mtx <- readInput_table(table_file = system.file("extdata/demo_inputs/table_file/demoData2.txt.gz", package = "scMINER"),
                             is.geneBYcell = TRUE, # set is.geneBYcell = FALSE to read features in columns and cells in rows
                             sep = "\t", removeSuffix = TRUE, addPrefix = "demo2") 

## Input type 3: HDF5 file by 10x Genomics
demo2_mtx <- readInput_10x.h5(h5_file = system.file("extdata/demo_inputs/hdf5_10x/demoData2.h5", package = "scMINER"),
                              featureType = "gene_symbol", removeSuffix = TRUE, addPrefix = "demo3")

## Input type 4: H5AD file
demo4_obj <- readInput_h5ad(h5ad_file = system.file("extdata/demo_inputs/h5ad_file/demoData4.h5ad", package = "scMINER"),
                            removeSuffix = TRUE, addPrefix = "demo4")

The raw count matrix of PBMC14k dataset is embedded in scMINER and can be easily fetched by:

## load the raw count matrix of PBMC14k dataset
data("pbmc14k_rawCount")
dim(pbmc14k_rawCount)
#> [1] 17986 14000
pbmc14k_rawCount[1:5,1:4]
#> 5 x 4 sparse Matrix of class "dgCMatrix"
#>               CACTTTGACGCAAT GTTACGGAAACGAA AGTCACGACAGGAG TTCGAGGACCAGTA
#> AL627309.1                 .              .              .              .
#> AP006222.2                 .              .              .              .
#> RP11-206L10.3              .              .              .              .
#> RP11-206L10.2              .              .              .              .
#> RP11-206L10.9              .              .              .              .

Create SparseEset object

The SparseExpressionSet (or SparseEset for short) is a new class created by scMINER to handle the sparsity in scRNA-seq data. It is derived from ExpressionSet, and enables to compress, store and access efficiently and conveniently. The SparseEset object is the center of scRNA-seq data analysis by scMINER.

The SparseEset object can be easily created from the gene expression matrix:

## Create SparseEset object solely from gene expression matrix, meta data is automatically added
pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, projectID = "PBMC14k", addMetaData = TRUE)

createSparseEset() offers an argument, addMetaData, to automatically generate and add 5 meta data statistics for cells and genes into the SparseEset object. It also provides another two arguments, cellData and featureData, to allow you add your customized phenoData or featureData. In this case, we have the true labels of cell types and would like to add them to the SparseEset object:

## Read the true lables of cell types embedded in scMINER R package
true_label <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_trueLabel.txt.gz", package = "scMINER"), header = T, row.names = 1, sep = "\t", quote = "", stringsAsFactors = FALSE)

## Create SparseEset object with self-customized metadata
pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, cellData = true_label, featureData = NULL, projectID = "PBMC14k", addMetaData = TRUE)
#> Creating sparse eset from the input_matrix ...
#>  Adding meta data based on input_matrix ...
#> Done! The sparse eset has been generated: 17986 genes, 14000 cells.

head(pData(pbmc14k_raw.eset))
#>                trueLabel_full trueLabel projectID nUMI nFeature    pctMito
#> CACTTTGACGCAAT CD14+ Monocyte  Monocyte   PBMC14k  764      354 0.01832461
#> GTTACGGAAACGAA CD14+ Monocyte  Monocyte   PBMC14k  956      442 0.01569038
#> AGTCACGACAGGAG CD14+ Monocyte  Monocyte   PBMC14k 7940     2163 0.01977330
#> TTCGAGGACCAGTA CD14+ Monocyte  Monocyte   PBMC14k 4177     1277 0.01149150
#> CACTTATGAGTCGT CD14+ Monocyte  Monocyte   PBMC14k  629      323 0.02066773
#> GCATGTGATTCTGT CD14+ Monocyte  Monocyte   PBMC14k  875      427 0.02628571
#>                pctSpikeIn         CellID
#> CACTTTGACGCAAT          0 CACTTTGACGCAAT
#> GTTACGGAAACGAA          0 GTTACGGAAACGAA
#> AGTCACGACAGGAG          0 AGTCACGACAGGAG
#> TTCGAGGACCAGTA          0 TTCGAGGACCAGTA
#> CACTTATGAGTCGT          0 CACTTATGAGTCGT
#> GCATGTGATTCTGT          0 GCATGTGATTCTGT
table(pData(pbmc14k_raw.eset)$trueLabel_full)
#> 
#>               CD14+ Monocyte                      CD19+ B 
#>                         2000                         2000 
#>              CD4+/CD25 T Reg   CD4+/CD45RA+/CD25- Naive T 
#>                         2000                         2000 
#>          CD4+/CD45RO+ Memory                     CD56+ NK 
#>                         2000                         2000 
#> CD8+/CD45RA+ Naive Cytotoxic 
#>                         2000

If you have multiple samples for one project, please create one SparseEset object for each of the samples and combined these SparseEset objects into one:

## Create SparseEset from multiple samples
# Step 1: create an SparseEset for each sample
demo1_mtx <- readInput_10x.dir(input_dir = system.file("extdata/demo_inputs/cell_matrix_10x", package = "scMINER"),
                               featureType = "gene_symbol", removeSuffix = TRUE)
demo1.eset <- createSparseEset(input_matrix = demo1_mtx, addMetaData = TRUE)

demo2_mtx <- readInput_table(table_file = system.file("extdata/demo_inputs/table_file/demoData2.txt.gz", package = "scMINER"),
                             is.geneBYcell = TRUE, sep = "\t", removeSuffix = TRUE) 
demo2.eset <- createSparseEset(input_matrix = demo2_mtx, addMetaData = TRUE)

# Step 2: combine the SparseEset objects of all samples
combined.eset <- combineSparseEset(eset_list = c(demo1.eset, demo2.eset),
                                   projectID = c("sample1", "sample2"),
                                   addPrefix = c("demo1", "demo2"),
                                   addSurfix = NULL, addMetaData = TRUE, imputeNA = TRUE)

Filter SparseEset object

As we mentioned before, scMINER can automatically generate and add 5 meta data statistics to SparseEset object. These 5 meta data statistics are the metrics scMINER uses to assess the quality of cells and features:

For cell quality assessment, scMINER provides 4 metrics that commonly used by the community:
- nUMI: number of total UMIs in each cell. Cells with abnormally high nUMI usually indicate doublets, while those with abnormally low nUMI usually indicate poorly sequenced cells or empty droplets.
- nFeature: number of expressed features/genes in each cell. Similar to nUMI.
- pctMito: percentage of UMIs of mitochondrial genes (defined by “^mt-|MT-”) in each cell. Cells with aberrantly high pctMito usually indicate dying cells.
- pctSpikeIn: percentage of UMIs of spike-in RNAs (defined by “^ERCC-|Ercc-”)) in each cell. This is used to estimate the normalization factor. Cells with extremely high or low pctSpikeIn need to be removed.
For feature quality assessment, scMINER provides one metrics:
- nCell: number of cells expressing the features/genes. Genes with extremely low nCell are poorly sequenced and are usually of low variance.

To help assess the data quality and determine the cutoffs used for filtration, scMINER can generate a html-format QC report:

## To generate QC report from SparseEset object
drawSparseEsetQC(input_eset = pbmc14k_raw.eset, output_html_file = "/your-path/PBMC14k/PLOT/pbmc14k_rawCount.html", overwrite = FALSE)

## scMINER also supports group-specific QC highlights
drawSparseEsetQC(input_eset = pbmc14k_raw.eset, output_html_file = "/your-path/PBMC14k/PLOT/pbmc14k_rawCount.html", overwrite = FALSE, group_by = "trueLabel")

This QC report contains a variety of tables and plot of the key statistics of your data that can help you get a better sense about data quality and determine the cutoffs for filtration.

scMINER provides two modes to perform SparseEset object filtration:

auto: in this mode, the filtration cutoffs are automatically generated by scMINER in Median ± 3*MAD (maximum absolute deviation) method. This mode works well in the majority of our test datasets.
manual: in this mode, you can manually specify the cutoffs, both low and high, of all 5 metrics. No cells or features would be removed under the default cutoffs of each metrics.

To apply the auto mode in SparseEset filtration:

## Filter SparseEset object with the cutoffs automatically generated by scMINER
pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both")
#> Checking the availability of the 5 metrics ('nCell', 'nUMI', 'nFeature', 'pctMito', 'pctSpikeIn') used for filtration ...
#> Checking passed! All 5 metrics are available.
#> Filtration is done!
#> Filtration Summary:
#>  8846/17986 genes passed!
#>  13605/14000 cells passed!
#> 
#> For more details:
#>  Gene filtration statistics:
#>      Metrics     nCell
#>      Cutoff_Low  70
#>      Cutoff_High Inf
#>      Gene_total  17986
#>      Gene_passed 8846(49.18%)
#>      Gene_failed 9140(50.82%)
#> 
#>  Cell filtration statistics:
#>      Metrics     nUMI        nFeature    pctMito     pctSpikeIn  Combined
#>      Cutoff_Low  458     221     0       0       NA
#>      Cutoff_High 3694        Inf     0.0408      0.0000      NA
#>      Cell_total  14000       14000       14000       14000       14000
#>      Cell_passed 13826(98.76%)   14000(100.00%)  13778(98.41%)   14000(100.00%)  13605(97.18%)
#>      Cell_failed 174(1.24%)  0(0.00%)    222(1.59%)  0(0.00%)    395(2.82%)

This command generates a filtered SparseEset object pbmc14k_filtered.eset and returns a summary table with detailed information of filtration statistics. You can refer to it and adjust the cutoffs accordingly.

In some cases, you may find that most of the cutoffs generated by the auto mode are good, except one or two. Though there is no ‘hybrid’ mode, scMINER does allow you to customize some of the cutoffs generated by the auto mode. This can be easily done by adding the cutoffs you would customize under the auto mode:

## Filter eSet under the auto mode, with customized values
pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both", gene.nCell_min = 5)

If the cutoffs generated in auto model do not work well in your case and you would like to go with self-customized cutoffs, you can easily apply them by:

## Filter SparseEset object with self-customized cutoffs
pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "manual", filter_type = "both", gene.nCell_min = 10, cell.nUMI_min = 500, cell.nUMI_max = 6500, cell.nFeature_min = 200, cell.nFeature_max = 2500, cell.pctMito_max = 0.1)

Normalize SparseEset object

scMINER recommends the log2CPM method for normalization: the raw counts in each cell are normalized to a library size of 1 million, followed by log2 transformation.

pbmc14k_log2cpm.eset <- normalizeSparseEset(pbmc14k_filtered.eset, scale_factor = 1000000, log_base = 2, log_pseudoCount = 1)
#> Done! The data matrix of eset has been normalized and log-transformed!
#> The returned eset contains: 8846 genes, 13605 cells.

This normalized and log-transformed SparseEset object can be directly used for Mutual Information-based clustering, network inference and other downstream analysis. And it’s recommended to save it into the project space.

saveRDS(pbmc14k_log2cpm.eset, file = "/your-path/PBMC14k/DATA/pbmc14k_log2cpm.rds")

MI-based clustering analysis

MICA (Mutual Information-based Clustering Analysis) is a clustering tool designed for scRNA-seq data. It is developed with Python to take it’s strengths in calculation speed and memory consumption. As a component of scMINER framework, MICA works seamlessly with the scMINER R package and SparseEset object.

Generate MICA input

The standard input of MICA is a normalized and log-transformed gene expression matrix. scMINER can generate this matrix from SparseEset object and save it into a file that can be directly read by MICA.

scMINER uses .txt as the default input file format. It can by generated by:

MICA accepts .h5ad or .txt format as the input file, which can be easily generated by embedded function generateMICAinput():

## Generate MICA input in txt format
generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/your-path/PBMC14k/MICA/micaInput.txt", overwrite = FALSE)

## Check the format of MICA input
mica_input <- read.delim(system.file("extdata/demo_pbmc14k/MICA/micaInput.txt", package = "scMINER"), header = T, sep = "\t", row.names = 1)
mica_input[1:5,1:5]

scMINER also supports .h5ad as the input file format which is getting more popular in scRNA-seq data storage and sharing. It can by generated by:

## Generate MICA input in h5ad format
generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/your-path/PBMC14k/MICA/micaInput.h5ad", overwrite = FALSE)

In addition to generating the standard MICA input file, generateMICAinput() also returns the recommended commands of running MICA. You can copy the commands, modify accordingly and run.

Run MICA

MICA features two different modes named by their different dimension reduction techniques:

Multi-Dimensional Scaling (MDS) mode: this mode is more accurate and robust for small datasets (less than 5,000 cells, be default) due to its global dimension reduction nature;
Graph Embedding (GE) mode: this mode works better with large datasets (more than 5,000 cells, by default) using a graph embedding approach to explore distant neighbor cells.

To run MDS model:

mica mds -i /your-path/PBMC14k/MICA/micaInput.txt -o /your-path/PBMC14k/MICA/MDS -nck 5 6 7 8 9 10

The MDS model uses K-Means by default for clustering. The argument -nck above specifies the number of cluster for K-Means.

In this case, since there are 13,605 cells, we will use the GE mode for the clustering:

mica ge -i /your-path/PBMC14k/MICA/micaInput.txt -o /your-path/PBMC14k/MICA/GE -minr 0.1 -maxr 9.0 -ss 0.05 -nw 40

The GE mode uses Louvain for clustering. The command above will generate the clustering results of multiple resolutions, from 0.1 to 9.0, with a step size of 0.05.

Add MICA output to SparseEset object

MICA generates several files and save all of them in the output directory specified by the user with -o argument. The core, and only, output file we need for subsequent analysis is the clustering label file named in the format of ProjectName_clustering_VisualizeMethod_euclidean_NumberOfDimensions_Resolution.txt. In this case, since we used a range of resolutions, there are several clustering label files generated, one for each resolution. Based on the knowledge about PBMC14k dataset, we compared the results of different resolutions and picked clustering_UMAP_euclidean_20_2.05.txt for subsequent analysis.

## Read the selected MICA output file
micaOutput <- read.table(system.file("extdata/demo_pbmc14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), header = TRUE, sep = "\t", quote = "", stringsAsFactors = F)
head(micaOutput)
#>               ID        X        Y label
#> 1 CACTTTGACGCAAT 14.91650 13.04096     6
#> 2 GTTACGGAAACGAA 14.57031 10.27093     6
#> 3 CACTTATGAGTCGT 14.28869 13.61674     6
#> 4 GCATGTGATTCTGT 14.12546 13.36319     6
#> 5 TAGAATACGTATCG 14.91227 11.19407     6
#> 6 CAAGAAGACCCTCA 15.34154 12.25821     6

As shown above, the clustering label file contains four columns:

ID: cell barcodes;
X: coordinates of UMAP_1 or tSNE_1;
Y: coordinates of UMAP_2 or tSNE_2;
label: labels of predicted clusters.

The clustering result can be easily easily added to the SparseEset object by addMICAoutput():

## Add MICA output into SparseEset object
pbmc14k_clustered.eset <- addMICAoutput(input_eset = pbmc14k_log2cpm.eset,
                                        mica_output_file = system.file("extdata/demo_pbmc14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"),
                                        visual_method = "umap") # use "tsne" if t-SNE was used in MICA
head(pData(pbmc14k_clustered.eset))
#>                trueLabel_full trueLabel projectID nUMI nFeature    pctMito
#> CACTTTGACGCAAT CD14+ Monocyte  Monocyte   PBMC14k  764      354 0.01832461
#> GTTACGGAAACGAA CD14+ Monocyte  Monocyte   PBMC14k  956      442 0.01569038
#> CACTTATGAGTCGT CD14+ Monocyte  Monocyte   PBMC14k  629      323 0.02066773
#> GCATGTGATTCTGT CD14+ Monocyte  Monocyte   PBMC14k  875      427 0.02628571
#> TAGAATACGTATCG CD14+ Monocyte  Monocyte   PBMC14k 1060      445 0.03207547
#> CAAGAAGACCCTCA CD14+ Monocyte  Monocyte   PBMC14k  849      384 0.01531213
#>                pctSpikeIn         CellID   UMAP_1   UMAP_2 clusterID
#> CACTTTGACGCAAT          0 CACTTTGACGCAAT 14.91650 13.04096         6
#> GTTACGGAAACGAA          0 GTTACGGAAACGAA 14.57031 10.27093         6
#> CACTTATGAGTCGT          0 CACTTATGAGTCGT 14.28869 13.61674         6
#> GCATGTGATTCTGT          0 GCATGTGATTCTGT 14.12546 13.36319         6
#> TAGAATACGTATCG          0 TAGAATACGTATCG 14.91227 11.19407         6
#> CAAGAAGACCCTCA          0 CAAGAAGACCCTCA 15.34154 12.25821         6

It’s optional but recommend to save the SparseEset object with clustering resluts added:

saveRDS(pbmc14k_clustered.eset, file = "/your-path/PBMC14k/DATA/pbmc14k_clustered.eset")

Visulize MICA output

scMINER provides a function, MICAplot() to easily visualize the clustering results on a 2D plot, UMAP or tSNE. And it can be colored by multiple variables, including cluster label, sample source, nUMI, nFeature, pctMito and more.

To visualize the clustering results:

MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "clusterID", X = "UMAP_1", Y = "UMAP_2", point.size = 0.1, fontsize.cluster_label = 6)

To visualize the true labels of cell types:

MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "trueLabel", X = "UMAP_1", Y = "UMAP_2", point.size = 0.1, fontsize.cluster_label = 4)

To visualize the nUMI on UMAP/t-SNE plot:

MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "nUMI", do.logTransform = TRUE, point.size = 0.1)
#> The values in "nUMI" have been transformed by log2(value + 1). To turn transformation off, set do.logTransform = FALSE.

You can also visualize the nFeature pctMito and pctSpikeIn:

MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "nFeature", do.logTransform = TRUE, point.size = 0.1)
#> The values in "nFeature" have been transformed by log2(value + 1). To turn transformation off, set do.logTransform = FALSE.

MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "pctMito", do.logTransform = FALSE, point.size = 0.1)

MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "pctSpikeIn", do.logTransform = FALSE, point.size = 0.1)

Cell type annotation

Currently, there are two types of strategies to annotate the clusters: supervised and unsupervised. The supervised methods use a list of known markers of potential cell types curated from some existing studies of the same/similar contexts. While in contrast, the unsupervised methods are usually based on the differentially expressed genes. scMINER provides several useful functions to support both types of strategies.

Supervised cell type annotation

In this showcase, we know the 7 cell types involved in the PBMC14k dataset, and curated a marker list from some existing PBMCs studies.

Using signature scores

Given a marker list of candidate cell types, scMINER can estimate a signature score, which is mathematically the weighted mean of the expression of marker genes involved, for each candidate cell type across all cell cluster. To do so, you will need to generate a signature table with three columns:

signature_name: name of cell types/signatures;
signature_feature: markers genes/features of corresponding cell type/signature;
weight: weight of corresponding maker/feature in corresponding cell type/signature. It ranges from -1 to 1, so both positive and negative markers are supported.

## Signature table of PBMC14k dataset
signature_table <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_signatureTable.txt", package = "scMINER"), header = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE)
head(signature_table)
#>   signature_name signature_feature weight
#> 1       Monocyte              CD14      1
#> 2       Monocyte               LYZ      1
#> 3       Monocyte            S100A8      1
#> 4       Monocyte            S100A9      1
#> 5       Monocyte           S100A12      1
#> 6             NK            FCGR3A      1

With this signature table, draw_bubbleplot() can estimate the signature scores and visualize them using bubble plot:

## Bubble plot of signature scores across clusters
draw_bubbleplot(input_eset = pbmc14k_clustered.eset, signature_table = signature_table, group_by = "clusterID")
#> 31 features of 7 signatures were found in the input eset and will be used in calculation.

In the bubble plot above, the color of the bubbles is proportional to the mean of signature scores, and the size of the bubbles is proportional to the percentage of cells with higher signature score than mean.

Using individual marker genes

scMINER also provides a variety of functions to visualize the selected features:

## For the demonstration purposes, we picked two well known markers for each of the 7 known cell types, plus "CD3D" and "CD4".
genes_of_interest <-c("CD14", "LYZ", "GZMB", "NKG7", "CD19", "MS4A1", "CD8A", "CD8B", "SELL", "CCR7", "IL2RA", "FOXP3", "IL7R", "S100A4", "CD3D", "CD4")

feature visualization: violin plot

## Violin plot of marker genes across clusters
feature_vlnplot(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, group_by = "clusterID", ncol = 4)

feature visualization: box plot

## Box plot of marker genes across clusters
feature_boxplot(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, group_by = "clusterID", ncol = 4)

feature visualization: scatter plot

## UMAP scatter plot of marker genes
feature_scatterplot(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, ncol = 4, location_x = "UMAP_1", location_y =  "UMAP_2", point.size = 0.5, legend.key_height = 0.3, legend.key_width = 0.2, fontsize.legend_title = 8, fontsize.legend_text = 6, fontsize.axis_title = 8, legend.position = "none")

feature visualization: bubble plot

## Bubble plot of marker genes across clusters
feature_bubbleplot(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, group_by = "clusterID", xlabel.angle = 45)

feature visualization: heatmap

## Heatmap of marker genes across clusters
feature_heatmap(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, group_by = "clusterID", scale_method = "none", annotation_columns = c("trueLabel"))

Unsupervised cell type annotation

scMINER provides a function, getDE(), to perform the differential expression analysis and identify the markers of each cluster. The getDE() function supports three different methods to perform the differential expression analysis, limma, wilcoxon and t.test. And it allows the users to define the groups to compare in a highly flexible way:

## 1. To perform differential expression analysis in a 1-vs-rest manner for all groups
de_res1 <- getDE(input_eset = pbmc14k_clustered.eset, group_by = "clusterID", use_method = "limma")
#> 7 groups were found in group_by column [ clusterID ].
#> Since no group was specified, the differential analysis will be conducted among all groups in the group_by column [ clusterID ] in the 1-vs-rest manner.
#>   1 / 7 : group 1 ( 1 ) vs the rest...
#>   2505 cells were found for g1.
#>   11100 cells were found for g0.
#>   2 / 7 : group 1 ( 2 ) vs the rest...
#>   2022 cells were found for g1.
#>   11583 cells were found for g0.
#>   3 / 7 : group 1 ( 3 ) vs the rest...
#>   2014 cells were found for g1.
#>   11591 cells were found for g0.
#>   4 / 7 : group 1 ( 4 ) vs the rest...
#>   1918 cells were found for g1.
#>   11687 cells were found for g0.
#>   5 / 7 : group 1 ( 5 ) vs the rest...
#>   1912 cells were found for g1.
#>   11693 cells were found for g0.
#>   6 / 7 : group 1 ( 6 ) vs the rest...
#>   1786 cells were found for g1.
#>   11819 cells were found for g0.
#>   7 / 7 : group 1 ( 7 ) vs the rest...
#>   1448 cells were found for g1.
#>   12157 cells were found for g0.
head(de_res1)
#>      feature g1_tag      g0_tag    g1_avg   g0_avg    g1_pct    g0_pct   log2FC
#> 1251    CD3E      1 2,3,4,5,6,7  8.354660 3.874230 0.7920160 0.3819820 4.480430
#> 3820    LDHB      1 2,3,4,5,6,7  9.555670 5.614992 0.8806387 0.5458559 3.940678
#> 7765  TMEM66      1 2,3,4,5,6,7  8.604421 5.041570 0.8103792 0.5051351 3.562851
#> 1250    CD3D      1 2,3,4,5,6,7  7.281998 4.097082 0.6990020 0.3965766 3.184916
#> 1235    CD27      1 2,3,4,5,6,7  5.566280 2.428199 0.5481038 0.2482883 3.138081
#> 3992     LTB      1 2,3,4,5,6,7 10.436707 7.315803 0.9141717 0.6430631 3.120905
#>               Pval           FDR   Zscore
#> 1251 2.225074e-308  0.000000e+00 37.53784
#> 3820 4.919041e-274 6.216262e-271 35.37012
#> 7765 1.509154e-228 9.535698e-226 32.27621
#> 1250 3.193659e-174 1.130044e-171 28.13981
#> 1235 7.997857e-219 4.716603e-216 31.57555
#> 3992 5.621459e-159 1.841756e-156 26.86509

Here is an brief introduction to the results of getDE():

feature: feature name;
g1_tag: a vector of clusters or subgroups involved in g1, the fore-ground group;
g0_tag: a vector of clusters or subgroups involved in g0, the back-ground group;
g1_avg: mean of gene expression of cells in g1;
g0_tag: mean of gene expression of cells in g0;
g1_pct: percentage of cells expressing the corresponding genes in group 1;
g0_pct: percentage of cells expressing the corresponding genes in group 0;
log2FC: log2Fold change of gene expression between g1 and g0;
Pval: P values of g1-g0 comparison;
FDR: FDR of g1-g0 comparison;
Zscore: Z score of g1-g0 comparison, signed by log2FC;

## 2. To perform differential expression analysis in a 1-vs-rest manner for one specific group
de_res2 <- getDE(input_eset = pbmc14k_clustered.eset, group_by = "clusterID", g1 = c("1"), use_method = "limma")

## 3. To perform differential expression analysis in a rest-vs-1 manner for one specific group
de_res3 <- getDE(input_eset = pbmc14k_clustered.eset, group_by = "clusterID", g0 = c("1"), use_method = "limma")

## 4. To perform differential expression analysis in a 1-vs-1 manner for any two groups
de_res4 <- getDE(input_eset = pbmc14k_clustered.eset, group_by = "clusterID", g1 = c("1", "4"), g0 = c("3","5"), use_method = "limma")

scMINER also provides a function, getTopFeatures(), to easily extract the group-specific markers from the differential expression result:

cluster_markers <- getTopFeatures(input_table = de_res1, number = 10, group_by = "g1_tag", sort_by = "log2FC", sort_decreasing = TRUE)
dim(cluster_markers)
#> [1] 70 11
head(cluster_markers)
#>      feature g1_tag      g0_tag    g1_avg   g0_avg    g1_pct    g0_pct   log2FC
#> 1251    CD3E      1 2,3,4,5,6,7  8.354660 3.874230 0.7920160 0.3819820 4.480430
#> 3820    LDHB      1 2,3,4,5,6,7  9.555670 5.614992 0.8806387 0.5458559 3.940678
#> 7765  TMEM66      1 2,3,4,5,6,7  8.604421 5.041570 0.8103792 0.5051351 3.562851
#> 1250    CD3D      1 2,3,4,5,6,7  7.281998 4.097082 0.6990020 0.3965766 3.184916
#> 1235    CD27      1 2,3,4,5,6,7  5.566280 2.428199 0.5481038 0.2482883 3.138081
#> 3992     LTB      1 2,3,4,5,6,7 10.436707 7.315803 0.9141717 0.6430631 3.120905
#>               Pval           FDR   Zscore
#> 1251 2.225074e-308  0.000000e+00 37.53784
#> 3820 4.919041e-274 6.216262e-271 35.37012
#> 7765 1.509154e-228 9.535698e-226 32.27621
#> 1250 3.193659e-174 1.130044e-171 28.13981
#> 1235 7.997857e-219 4.716603e-216 31.57555
#> 3992 5.621459e-159 1.841756e-156 26.86509

Add cell type annotation to SparseEset object

Based on the supervised and unsupervised methods, we have annotated the cell types for each cluster. To add the cell type annotation information into the SparseEset object:

## Add cell type annotation to SparseEset object
pbmc14k_log2cpm_annotated.eset <- pbmc14k_clustered.eset
celltype_map <- c(`1`="CD4TN", `2`="CD4TCM", `3`="CD8TN", `4`="NK", `5`="B", `6`="Monocyte", `7`="CD4Treg")
pbmc14k_log2cpm_annotated.eset$cell_type <- as.character(celltype_map[pbmc14k_log2cpm_annotated.eset$clusterID])
head(pData(pbmc14k_log2cpm_annotated.eset))
#>                trueLabel_full trueLabel projectID nUMI nFeature    pctMito
#> CACTTTGACGCAAT CD14+ Monocyte  Monocyte   PBMC14k  764      354 0.01832461
#> GTTACGGAAACGAA CD14+ Monocyte  Monocyte   PBMC14k  956      442 0.01569038
#> CACTTATGAGTCGT CD14+ Monocyte  Monocyte   PBMC14k  629      323 0.02066773
#> GCATGTGATTCTGT CD14+ Monocyte  Monocyte   PBMC14k  875      427 0.02628571
#> TAGAATACGTATCG CD14+ Monocyte  Monocyte   PBMC14k 1060      445 0.03207547
#> CAAGAAGACCCTCA CD14+ Monocyte  Monocyte   PBMC14k  849      384 0.01531213
#>                pctSpikeIn         CellID   UMAP_1   UMAP_2 clusterID cell_type
#> CACTTTGACGCAAT          0 CACTTTGACGCAAT 14.91650 13.04096         6  Monocyte
#> GTTACGGAAACGAA          0 GTTACGGAAACGAA 14.57031 10.27093         6  Monocyte
#> CACTTATGAGTCGT          0 CACTTATGAGTCGT 14.28869 13.61674         6  Monocyte
#> GCATGTGATTCTGT          0 GCATGTGATTCTGT 14.12546 13.36319         6  Monocyte
#> TAGAATACGTATCG          0 TAGAATACGTATCG 14.91227 11.19407         6  Monocyte
#> CAAGAAGACCCTCA          0 CAAGAAGACCCTCA 15.34154 12.25821         6  Monocyte

The draw_barplot() function can visualize the cell composition of self-defined groups. We can use it to show the purity of MICA clusters:

## Show the composition of true labels of cell types among the annotated cell types
draw_barplot(input_eset = pbmc14k_log2cpm_annotated.eset, group_by = "cell_type", color_by = "trueLabel_full", xlabel.angle = 45)

Don’t forget to save the SparseEset object after the cell type annotation is added.

saveRDS(pbmc14k_log2cpm_annotated.eset, file = "/your-path/PBMC14k/DATA/pbmc14k_log2cpm_annotated.eset")

Network inference

scMINER constructs the cellular networks using SJARACNe, a scalable software tool for gene network reverse engineering from big data. Similar to MICA, SJARACNe is also a component of scMINER framework, and can work seamlessly with scMINER R package and SparseEset object.

Generate SJARACNe inputs

Network inference should be performed on groups of homogeneous cells. Usually it’s in a cluster- or cell type-specific basis. In this showcase, we know the true labels of cell types and would like to use them for grouping:

## Columns with any illegal characters can not be used for groupping
generateSJARACNeInput(input_eset = pbmc14k_log2cpm_annotated.eset, group_name = "trueLabel", sjaracne_dir = "/your-path/PBMC14k/SJARACNe", species_type = "hg", driver_type = "TF_SIG", downSample_N = NULL)

IMPORTANT NOTE: Any illegal characters in path in group labels may cause issues in subsequent analysis. To avoid it, scMINER only accepts letters(A-Za-z), numbers(0-9), underscores(’_‘) and periods(’.’).

For big datasets, generateSJARACNeInput() provides an argument, downSample_N, to allow you to down sample size of each group. The default value of downSample_N is 1,000, any group with >= 1,000 cells will be down-sampled to 1,000.

generateSJARACNeInput() creates a folder for each of the groups availble in trueLabel column, and generates the standard input files inside of it:

a “.exp.txt” file: a tab-separated genes/transcripts/proteins by cells/samples expression matrix with the first two columns being ID and symbol.
a “TF” folder containing a “.tf.txt” file: a list of significant gene/transcript/protein IDs of TF drivers.
a “SIG” folder containing a “.sig.txt” file: a list of significant gene/transcript/protein IDs of SIG drivers.
a bash script (runSJARACNe.sh) to run SJARACNe. Further modification is needed to run it.
a json file (config_cwlexec.json) containing parameters to run SJARACNe.

Run SJARACNe

As mentioned above, generateSJARACNeInput() generates a runSJARACNe.sh file in the folder of each group. You will need to make some modifications before you can run it:

remove unneeded lines: there are usually 4 lines in this file: the lines starting with sjaracne lsf are the command lines to run on IBM LSF cluster, while those starting with sjaracne local are the command lines running on a local machine (Linux/OSX). Please select the lines based on your situation and remove the others.
modify some key parameters:
- -n: number of bootstrap networks to generate. Default: 100.
- -pc: p value threshold to select edges in building consensus network. Default: e-2 for single-cell data, e-3 for meta-cell data, and e-5 for bulk sample data.

Please use sjaracne lsf -h or sjaracne local -h to check more details of arguments available in SJARACNe.

There is another file, config_cwlexec.json, available in the folder. It contains the information (e.g. memory request for each step of SJARACNe run) used for LSF job submission. This file is only needed for LSF runs and the default values works well in most cases. If you are running SJARACNe on a big dataset, you may need to request more memory from it.

In this case, we use LSF to run the SJARACNe:

## Let's use B cell as an example
# For TF
sjaracne lsf -e /your-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /your-path/PBMC14k/SJARACNe/B/TF/B.835_1902.tf.txt -o /your-path/PBMC14k/SJARACNe/B/TF/bt100_pc001 -n 100 -pc 0.01 -j /your-path/PBMC14k/SJARACNe/B/config_cwlexec.json

# For SIG
sjaracne lsf -e /your-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /your-path/PBMC14k/SJARACNe/B/SIG/B.4148_1902.sig.txt -o /your-path/PBMC14k/SJARACNe/B/SIG/bt100_pc001 -n 100 -pc 0.01 -j /work-path/PBMC14k/SJARACNe/B/config_cwlexec.json

We manually created a folder named “bt100_pc001” in both TF and SIG folders of each group, to save the networks generated under 100 bootstraps (-n 100) and 0.01 consensus p value (-pc 0.01).

To run SJARACNe on a local machine:

## Let's use B cell as an example
# For TF
sjaracne local -e /your-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /your-path/PBMC14k/SJARACNe/B/TF/B.835_1902.tf.txt -o /your-path/PBMC14k/SJARACNe/B/TF/bt100_pc001 -n 100 -pc 0.01

# For SIG
sjaracne local -e /your-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /your-path/PBMC14k/SJARACNe/B/SIG/B.4148_1902.sig.txt -o /your-path/PBMC14k/SJARACNe/B/SIG/bt100_pc001 -n 100 -pc 0.01

Quality control of networks

The core output of SJARACNe is the network file named consensus_network_ncol_.txt.

network_format <- read.table(system.file("extdata/demo_pbmc14k/SJARACNe/B/TF/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"),
                             header = T, sep = "\t", quote = "", stringsAsFactors = F)
head(network_format)
#>   source     target source.symbol target.symbol     MI pearson spearman   slope
#> 1   AATF      ACBD3          AATF         ACBD3 0.0509 -0.0310  -0.0311 -0.0193
#> 2   AATF       ADD3          AATF          ADD3 0.0486  0.0228   0.0258  0.0307
#> 3   AATF        AES          AATF           AES 0.0511  0.0311   0.0289  0.0668
#> 4   AATF     AKR7A2          AATF        AKR7A2 0.0498  0.0319   0.0366  0.0421
#> 5   AATF AL928768.3          AATF    AL928768.3 0.0447  0.0247   0.0293  0.0335
#> 6   AATF       ALG8          AATF          ALG8 0.0479  0.0358   0.0373  0.0234
#>   p.value
#> 1  0.1761
#> 2  0.3204
#> 3  0.1756
#> 4  0.1646
#> 5  0.2815
#> 6  0.1183

As shown above, it contains 9 columns:

source: ID of the source gene, can be the gene symbol;
target: ID of the target gene, can be the gene symbol;
source.symbol: symbol of the source gene;
target.symbol: symbol of the target gene;
MI: mutual information of source-gene pair;
pearson: Pearson correlation coefficient, [-1,1]
pearson: Spearman correlation coefficient, [-1,1]
slope: slop of the regression line, returned by stats.linregression()
p.value: p-value for a hypothesis test whose null hypothesis is that the slope is zero, using Wald Test with t-distribution of the test statistic

To help assess the quality of SJARACNe networks, scMINER provides a function drawNetworkQC():

## Network QC on single network file
network_stats <- drawNetworkQC(network_file = system.file("extdata/demo_pbmc14k/SJARACNe/B/TF/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"), generate_html = FALSE)

## Network QC on all network files under a directory
network_stats <- drawNetworkQC(sjaracne_dir = "/your-path/PBMC14K/SJARACNe", generate_html = FALSE) # Set `generate_html = TRUE` to generate html-format QC report for each network file

## The network QC statistics table is saved separately, for demonstration purposes.
network_stats <- readRDS(system.file("extdata/demo_pbmc14k/SJARACNe/network_stats.rds", package = "scMINER"))
head(network_stats)
#>              network_tag network_node network_edge driver_count targetSize_mean
#> 1      B.SIG.bt100_pc001         8572       391889         4148        94.47662
#> 2       B.TF.bt100_pc001         8572        95341          835       114.18084
#> 3 CD4TCM.SIG.bt100_pc001         8660       382153         4209        90.79425
#> 4  CD4TCM.TF.bt100_pc001         8660        94319          838       112.55251
#> 5  CD4TN.SIG.bt100_pc001         8612       401658         4180        96.09043
#> 6   CD4TN.TF.bt100_pc001         8612        95152          831       114.50301
#>   targetSize_median targetSize_minimum targetSize_maximum
#> 1              94.0                 33                396
#> 2              96.0                 64                913
#> 3              91.0                 31                281
#> 4              95.5                 60                689
#> 5              95.0                 43                303
#> 6              99.0                 64                743
#>                                                                                                                                                                                               network_path
#> 1      /Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC14K/SJARACNe/B/SIG/bt100_pc001/sjaracne_workflow-df798096-8dee-4baf-8f70-891c689dc769/consensus_network_ncol_.txt
#> 2       /Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC14K/SJARACNe/B/TF/bt100_pc001/sjaracne_workflow-fb2a69b9-f98e-47ff-87a0-6d538822fc6e/consensus_network_ncol_.txt
#> 3 /Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC14K/SJARACNe/CD4TCM/SIG/bt100_pc001/sjaracne_workflow-424f1068-13d1-4f0e-9c26-56acd9a2027c/consensus_network_ncol_.txt
#> 4  /Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC14K/SJARACNe/CD4TCM/TF/bt100_pc001/sjaracne_workflow-52b3cdf5-5914-4c8c-a77a-05f17c755d83/consensus_network_ncol_.txt
#> 5  /Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC14K/SJARACNe/CD4TN/SIG/bt100_pc001/sjaracne_workflow-7b5bb68e-1de5-4d0e-80ec-8d8aa037866f/consensus_network_ncol_.txt
#> 6   /Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC14K/SJARACNe/CD4TN/TF/bt100_pc001/sjaracne_workflow-89716541-eb53-435c-8a45-bab63d6b5198/consensus_network_ncol_.txt

drawNetworkQC() returns a summary table of key statistics of SJARACNe networks. Empirically, a network with 50-300 target size is good.

There are many signaling proteins (e.g., kinases), transcription factors, and other factors that are crucial drivers of phenotypes. These factors are not genetically or epigenetically altered, nor are they differentially expressed at the mRNA or protein level. Instead, they are altered by post-translational or other modifications, and are therefore termed hidden drivers. The gene activity-based analysis is proved to be an effective way to expose these hidden driver.

scMINER aims to expose the cell type-specific hidden drivers of various biological activities and provides a few useful functions to effortlessly calculate driver activities, identify the hidden drivers and visualize them in multiple ways.

Calculate driver activities

The driver activity estimation is one of the most important features of scMINER. Mathematically, the activity of one driver is a type of mean of the expressions of its targets. And biologically, the activity can be interpreted as a measure that describes how actively the driver functions, like the enzymes in digesting their subtracts, kinase in activating their downstream genes. Given the gene expression profiles and networks, scMINER can estimate the activities of some predefined drivers, including not only transcription factors (TFs) but also signaling genes (SIGs).

scMINER provides two functions, getActivity_individual() and getActivity_inBatch(), to calculate the driver activities individually or in batch:

getActivity_individual() is designed to calculate the driver activities for individual group. It takes the SparseEset and network files of group-to-calculate as the inputs:

## Let's use B cell as an example
activity_B.eset <- getActivity_individual(input_eset = pbmc14k_log2cpm_annotated.eset[, pData(pbmc14k_log2cpm_annotated.eset)$trueLabel == "B"],
                                          network_file.tf = system.file("extdata/demo_pbmc14k/SJARACNe/B/TF/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"),
                                          network_file.sig = system.file("extdata/demo_pbmc14k/SJARACNe/B/SIG/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"),
                                          driver_type = "TF_SIG")

While getActivity_inBatch() is designed to calculate the driver activities in batch. Instead of networks files, it takes the directory that contains the network files of multiples groups and automatically retrieve them for activity estimation:

## let's use B cell as an example
activity.eset <- getActivity_inBatch(input_eset = pbmc14k_log2cpm_annotated.eset,
                                     sjaracne_dir = system.file("extdata/demo_pbmc14k/SJARACNe", package = "scMINER"),
                                     group_name = "trueLabel", driver_type = "TF_SIG", activity_method = "mean", do.z_normalization = TRUE)
#> 7 groups were found in trueLabel ...
#> Checking network files for each group ...
#>  Group 1 / 7 : Monocyte ...
#>      TF network check passed!
#>      SIG network check passed!
#>  Group 2 / 7 : B ...
#>      TF network check passed!
#>      SIG network check passed!
#>  Group 3 / 7 : CD4Treg ...
#>      TF network check passed!
#>      SIG network check passed!
#>  Group 4 / 7 : CD4TN ...
#>      TF network check passed!
#>      SIG network check passed!
#>  Group 5 / 7 : CD4TCM ...
#>      TF network check passed!
#>      SIG network check passed!
#>  Group 6 / 7 : NK ...
#>      TF network check passed!
#>      SIG network check passed!
#>  Group 7 / 7 : CD8TN ...
#>      TF network check passed!
#>      SIG network check passed!
#> Calculating activity for each group ...
#>  Group 1 / 7 : Monocyte ...
#>  Activity calculation is completed successfully!
#>  Group 2 / 7 : B ...
#>  Activity calculation is completed successfully!
#>  Group 3 / 7 : CD4Treg ...
#>  Activity calculation is completed successfully!
#>  Group 4 / 7 : CD4TN ...
#>  Activity calculation is completed successfully!
#>  Group 5 / 7 : CD4TCM ...
#>  Activity calculation is completed successfully!
#>  Group 6 / 7 : NK ...
#>  Activity calculation is completed successfully!
#>  Group 7 / 7 : CD8TN ...
#>  Activity calculation is completed successfully!
#> NAs were found in the activity matrix and have been replaced by the minimum value:  -0.3968794 .

Both functions return an eSet object of driver activities. The full phenoData and part of featureData of the activity eSet object is inherited from SparseEset object.

Don’t forget to save the activity eSet object.

## Save activity eSet object
saveRDS(activity.eset, file = "/your-path/PBMC14k/DATA/pbmc14k_activity.eset")

Differential activity analysis

scMINER provides a function, getDA(), to perform the differential activity analysis and identify the cell type-specific drivers. Similar to getDE(), getDA() also allows you to compare different groups in a highly flexible way:

## 1. To perform differential expression analysis in a 1-vs-rest manner for all groups
da_res1 <- getDA(input_eset = activity.eset, group_by = "cell_type", use_method = "t.test")
#> 7 groups were found in group_by column [ cell_type ].
#> Since no group was specified, the differential analysis will be conducted among all groups in the group_by column [ cell_type ] in the 1-vs-rest manner.
#>   1 / 7 : group 1 ( B ) vs the rest...
#>   1912 cells were found for g1.
#>   11693 cells were found for g0.
#>   2 / 7 : group 1 ( CD4TCM ) vs the rest...
#>   2022 cells were found for g1.
#>   11583 cells were found for g0.
#>   3 / 7 : group 1 ( CD4TN ) vs the rest...
#>   2505 cells were found for g1.
#>   11100 cells were found for g0.
#>   4 / 7 : group 1 ( CD4Treg ) vs the rest...
#>   1448 cells were found for g1.
#>   12157 cells were found for g0.
#>   5 / 7 : group 1 ( CD8TN ) vs the rest...
#>   2014 cells were found for g1.
#>   11591 cells were found for g0.
#>   6 / 7 : group 1 ( Monocyte ) vs the rest...
#>   1786 cells were found for g1.
#>   11819 cells were found for g0.
#>   7 / 7 : group 1 ( NK ) vs the rest...
#>   1918 cells were found for g1.
#>   11687 cells were found for g0.
head(da_res1)
#>       feature g1_tag                                 g0_tag       g1_avg
#> 4   AASDH_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.008071658
#> 6    AATF_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.051767485
#> 12  ABCB8_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.077615607
#> 8   ABCA2_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.081643603
#> 10  ABCB1_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.134357577
#> 3  AARSD1_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.126010447
#>        g0_avg     g1_pct     g0_pct       log2FC          Pval           FDR
#> 4  -0.1025141 0.43043933 0.13991277  0.094442475 2.225074e-308  0.000000e+00
#> 6  -0.1084165 0.21652720 0.08757376  0.056649005 3.918924e-189 5.878386e-189
#> 12 -0.1094585 0.35251046 0.14153767  0.031842866  3.623209e-12  3.952592e-12
#> 8  -0.1101867 0.10198745 0.15676045  0.028543082  1.914570e-58  2.418404e-58
#> 10 -0.1559384 0.04393305 0.06114770  0.021580866  8.079661e-27  9.233898e-27
#> 3  -0.1225746 0.04184100 0.08192936 -0.003435892  4.213744e-02  4.213744e-02
#>      Zscore
#> 4  37.53784
#> 6  29.33316
#> 12  6.95115
#> 8  16.11775
#> 10 10.72137
#> 3  -2.03216

## 2. To perform differential expression analysis in a 1-vs-rest manner for one specific group
da_res2 <- getDA(input_eset = activity.eset, group_by = "cell_type", g1 = c("B"), use_method = "t.test")

## 3. To perform differential expression analysis in a rest-vs-1 manner for one specific group
da_res3 <- getDA(input_eset = activity.eset, group_by = "cell_type", g0 = c("B"), use_method = "t.test")

## 4. To perform differential expression analysis in a 1-vs-1 manner for any two groups
da_res4 <- getDA(input_eset = activity.eset, group_by = "cell_type", g1 = c("CD4Treg"), g0 = c("CD4TCM"), use_method = "t.test")

The getTopFeatures() function can also be used to effortlessly highlight the cell type-specific drivers from the differential activity analysis results:

top_drivers <- getTopFeatures(input_table = da_res1, number = 10, group_by = "g1_tag", sort_by = "log2FC", sort_decreasing = TRUE)
dim(top_drivers)
#> [1] 70 11
head(top_drivers)
#>       feature g1_tag                                 g0_tag       g1_avg
#> 4   AASDH_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.008071658
#> 6    AATF_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.051767485
#> 12  ABCB8_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.077615607
#> 8   ABCA2_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.081643603
#> 10  ABCB1_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.134357577
#> 3  AARSD1_SIG      B CD4TCM,CD4TN,CD4Treg,CD8TN,Monocyte,NK -0.126010447
#>        g0_avg     g1_pct     g0_pct       log2FC          Pval           FDR
#> 4  -0.1025141 0.43043933 0.13991277  0.094442475 2.225074e-308  0.000000e+00
#> 6  -0.1084165 0.21652720 0.08757376  0.056649005 3.918924e-189 5.878386e-189
#> 12 -0.1094585 0.35251046 0.14153767  0.031842866  3.623209e-12  3.952592e-12
#> 8  -0.1101867 0.10198745 0.15676045  0.028543082  1.914570e-58  2.418404e-58
#> 10 -0.1559384 0.04393305 0.06114770  0.021580866  8.079661e-27  9.233898e-27
#> 3  -0.1225746 0.04184100 0.08192936 -0.003435892  4.213744e-02  4.213744e-02
#>      Zscore
#> 4  37.53784
#> 6  29.33316
#> 12  6.95115
#> 8  16.11775
#> 10 10.72137
#> 3  -2.03216

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14)
#>  os       macOS 15.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language en
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Chicago
#>  date     2024-11-13
#>  pandoc   3.1.11 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  anndata      * 0.7.5.6 2023-03-17 [2] CRAN (R 4.4.0)
#>  assertthat     0.2.1   2019-03-21 [2] CRAN (R 4.4.0)
#>  Biobase      * 2.64.0  2024-04-30 [2] Bioconductor 3.19 (R 4.4.0)
#>  BiocGenerics * 0.50.0  2024-04-30 [2] Bioconductor 3.19 (R 4.4.0)
#>  bit            4.5.0   2024-09-20 [2] CRAN (R 4.4.1)
#>  bit64          4.5.2   2024-09-22 [2] CRAN (R 4.4.1)
#>  bslib          0.8.0   2024-07-29 [2] CRAN (R 4.4.0)
#>  cachem         1.1.0   2024-05-16 [2] CRAN (R 4.4.0)
#>  cli            3.6.3   2024-06-21 [2] CRAN (R 4.4.0)
#>  colorspace     2.1-1   2024-07-26 [2] CRAN (R 4.4.0)
#>  desc           1.4.3   2023-12-10 [2] CRAN (R 4.4.0)
#>  digest         0.6.37  2024-08-19 [2] CRAN (R 4.4.1)
#>  dplyr        * 1.1.4   2023-11-17 [2] CRAN (R 4.4.0)
#>  evaluate       1.0.1   2024-10-10 [2] CRAN (R 4.4.1)
#>  fansi          1.0.6   2023-12-08 [2] CRAN (R 4.4.0)
#>  farver         2.1.2   2024-05-13 [2] CRAN (R 4.4.0)
#>  fastmap        1.2.0   2024-05-15 [2] CRAN (R 4.4.0)
#>  fs             1.6.4   2024-04-25 [2] CRAN (R 4.4.0)
#>  generics       0.1.3   2022-07-05 [2] CRAN (R 4.4.0)
#>  ggplot2      * 3.5.1   2024-04-23 [2] CRAN (R 4.4.0)
#>  glue           1.8.0   2024-09-30 [2] CRAN (R 4.4.1)
#>  gridExtra      2.3     2017-09-09 [2] CRAN (R 4.4.0)
#>  gtable         0.3.5   2024-04-22 [2] CRAN (R 4.4.0)
#>  hdf5r        * 1.3.11  2024-07-07 [2] CRAN (R 4.4.0)
#>  highr          0.11    2024-05-26 [2] CRAN (R 4.4.0)
#>  htmltools      0.5.8.1 2024-04-04 [2] CRAN (R 4.4.0)
#>  htmlwidgets    1.6.4   2023-12-06 [2] CRAN (R 4.4.0)
#>  igraph         2.0.3   2024-03-13 [2] CRAN (R 4.4.0)
#>  jquerylib      0.1.4   2021-04-26 [2] CRAN (R 4.4.0)
#>  jsonlite       1.8.9   2024-09-20 [2] CRAN (R 4.4.1)
#>  knitr          1.48    2024-07-07 [2] CRAN (R 4.4.0)
#>  labeling       0.4.3   2023-08-29 [2] CRAN (R 4.4.0)
#>  lattice        0.22-6  2024-03-20 [2] CRAN (R 4.4.1)
#>  lifecycle      1.0.4   2023-11-07 [2] CRAN (R 4.4.0)
#>  limma          3.60.6  2024-10-02 [2] Bioconductor 3.19 (R 4.4.1)
#>  magrittr       2.0.3   2022-03-30 [2] CRAN (R 4.4.0)
#>  Matrix       * 1.7-1   2024-10-18 [2] CRAN (R 4.4.1)
#>  munsell        0.5.1   2024-04-01 [2] CRAN (R 4.4.0)
#>  pheatmap       1.0.12  2019-01-04 [2] CRAN (R 4.4.0)
#>  pillar         1.9.0   2023-03-22 [2] CRAN (R 4.4.0)
#>  pkgconfig      2.0.3   2019-09-22 [2] CRAN (R 4.4.0)
#>  pkgdown        2.1.1   2024-09-17 [2] CRAN (R 4.4.1)
#>  plyr           1.8.9   2023-10-02 [2] CRAN (R 4.4.0)
#>  png            0.1-8   2022-11-29 [2] CRAN (R 4.4.0)
#>  R6             2.5.1   2021-08-19 [2] CRAN (R 4.4.0)
#>  ragg           1.3.3   2024-09-11 [2] CRAN (R 4.4.1)
#>  RColorBrewer   1.1-3   2022-04-03 [2] CRAN (R 4.4.0)
#>  Rcpp           1.0.13  2024-07-17 [2] CRAN (R 4.4.0)
#>  reshape2       1.4.4   2020-04-09 [2] CRAN (R 4.4.0)
#>  reticulate     1.39.0  2024-09-05 [2] CRAN (R 4.4.1)
#>  rlang          1.1.4   2024-06-04 [2] CRAN (R 4.4.0)
#>  rmarkdown      2.28    2024-08-17 [2] CRAN (R 4.4.0)
#>  rstudioapi     0.17.0  2024-10-16 [2] CRAN (R 4.4.1)
#>  sass           0.4.9   2024-03-15 [2] CRAN (R 4.4.0)
#>  scales         1.3.0   2023-11-28 [2] CRAN (R 4.4.0)
#>  scMINER      * 1.1.0   2024-11-13 [1] local
#>  sessioninfo    1.2.2   2021-12-06 [2] CRAN (R 4.4.0)
#>  statmod        1.5.0   2023-01-06 [2] CRAN (R 4.4.0)
#>  stringi        1.8.4   2024-05-06 [2] CRAN (R 4.4.0)
#>  stringr        1.5.1   2023-11-14 [2] CRAN (R 4.4.0)
#>  systemfonts    1.1.0   2024-05-15 [2] CRAN (R 4.4.0)
#>  textshaping    0.4.0   2024-05-24 [2] CRAN (R 4.4.0)
#>  tibble         3.2.1   2023-03-20 [2] CRAN (R 4.4.0)
#>  tidyselect     1.2.1   2024-03-11 [2] CRAN (R 4.4.0)
#>  utf8           1.2.4   2023-10-22 [2] CRAN (R 4.4.0)
#>  vctrs          0.6.5   2023-12-01 [2] CRAN (R 4.4.0)
#>  withr          3.0.1   2024-07-31 [2] CRAN (R 4.4.0)
#>  xfun           0.48    2024-10-03 [2] CRAN (R 4.4.1)
#>  yaml           2.3.10  2024-07-26 [2] CRAN (R 4.4.0)
#> 
#>  [1] /private/var/folders/v0/njhqcmrs32xgrjgx2wz8d50r0000gp/T/Rtmpf8JULY/temp_libpath11ae7194479
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Qingfei Pan

2024-11-13