Chapter 4 Create SparseEset object

In this chapter, we introduce how to create SparseExpressionSet (SparseEset) objects from a gene expression matrix (genes by cells).

4.1 Solely from the gene expression matrix

This is the most commonly used way to create the sparse eSet object with scMINER:

pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, projectID = "PBMC14k", addMetaData = TRUE)

## Creating sparse eset from the input_matrix ...
##  Adding meta data based on input_matrix ...
## Done! The sparse eset has been generated: 17986 genes, 14000 cells.

pbmc14k_raw.eset

## SparseExpressionSet (storageMode: environment)
## assayData: 17986 features, 14000 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: CACTTTGACGCAAT GTTACGGAAACGAA ... ACGTGCCTTAAAGG (14000
##     total)
##   varLabels: CellID projectID ... pctSpikeIn (6 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: AL627309.1 AP006222.2 ... SRSF10.1 (17986 total)
##   fvarLabels: GeneSymbol nCell
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

input_matrix: usually, but not necessarily, a sparse matrix of raw UMI counts.
- Data format: it accepts dgCMatrix, dgTMatrix, dgeMatrix, matrix and data.frame objects.
- Quantification measure: it takes raw counts, normalized counts (e.g. CPM or CP10k), TPM (Transcripts Per Million), FPKM/RPKM (Fragments/Reads Per Kilobase of transcript per Million) and others.
- What if a data frame is given? When a non-matrix table is passed to the input_matrix argument, createSparseEset() automatically converts it to a matrix. If the matrix, whether converted from another format or passed directly by the user, is not sparse, createSparseEset() converts it into a sparse matrix by default. This is controlled by another argument, do.sparseConversion, whose default is TRUE. It is not recommended, but you can set it to FALSE to disable the conversion, in which case createSparseEset() creates the eSet from the regular matrix.
addMetaData: when this argument is set to TRUE (the default), createSparseEset() automatically generates 5 statistics, 4 for cells (nUMI, nFeature, pctMito, pctSpikeIn) and 1 for features (nCell), and adds them to the phenoData and featureData slots. These 5 statistics are used in quality control and data filtration.
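
The conversion described for input_matrix can be illustrated outside of scMINER. The sketch below is only an approximation of the idea, using the Matrix package on a made-up data frame (expr_df and its gene and cell names are hypothetical); it is not necessarily how createSparseEset() performs the conversion internally.

```r
library(Matrix)

# Hypothetical toy table: 3 genes (rows) by 2 cells (columns)
expr_df <- data.frame(cell1 = c(0, 3, 0),
                      cell2 = c(1, 0, 0),
                      row.names = c("GeneA", "GeneB", "GeneC"))

# Step 1: a data frame is first turned into a regular (dense) matrix
expr_dense <- as.matrix(expr_df)

# Step 2: the dense matrix is then stored in a sparse, column-compressed format
expr_sparse <- Matrix(expr_dense, sparse = TRUE)

class(expr_sparse)
dim(expr_sparse)
```

Row and column names are carried over in both steps, so gene symbols and cell barcodes are preserved in the sparse matrix.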

## check the phenoData: metadata of cells
head(pData(pbmc14k_raw.eset))

##                        CellID projectID nUMI nFeature    pctMito pctSpikeIn
## CACTTTGACGCAAT CACTTTGACGCAAT   PBMC14k  764      354 0.01832461          0
## GTTACGGAAACGAA GTTACGGAAACGAA   PBMC14k  956      442 0.01569038          0
## AGTCACGACAGGAG AGTCACGACAGGAG   PBMC14k 7940     2163 0.01977330          0
## TTCGAGGACCAGTA TTCGAGGACCAGTA   PBMC14k 4177     1277 0.01149150          0
## CACTTATGAGTCGT CACTTATGAGTCGT   PBMC14k  629      323 0.02066773          0
## GCATGTGATTCTGT GCATGTGATTCTGT   PBMC14k  875      427 0.02628571          0

## check the featureData: metadata of features
head(fData(pbmc14k_raw.eset))

##                  GeneSymbol nCell
## AL627309.1       AL627309.1    50
## AP006222.2       AP006222.2     2
## RP11-206L10.3 RP11-206L10.3     1
## RP11-206L10.2 RP11-206L10.2    33
## RP11-206L10.9 RP11-206L10.9    17
## LINC00115         LINC00115   115
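
To make the 5 statistics more concrete, here is a minimal sketch of how such values could be computed from a genes-by-cells count matrix. The toy matrix is made up, and the gene-name patterns used to flag mitochondrial genes ("^MT-") and spike-ins ("^ERCC-") are assumptions for illustration; the exact rules used by createSparseEset() are not shown in this chapter.

```r
library(Matrix)

# Hypothetical toy count matrix: 4 genes by 3 cells
counts <- Matrix(c(1, 0, 3, 0,
                   0, 0, 2, 2,
                   2, 2, 0, 4),
                 nrow = 4, sparse = TRUE,
                 dimnames = list(c("MT-CO1", "ERCC-00002", "GeneA", "GeneB"),
                                 c("cell1", "cell2", "cell3")))

# Assumed patterns for mitochondrial and spike-in genes
is_mito  <- grepl("^MT-", rownames(counts))
is_spike <- grepl("^ERCC-", rownames(counts))

# Cell-level statistics
nUMI       <- Matrix::colSums(counts)      # total counts per cell
nFeature   <- Matrix::colSums(counts > 0)  # number of detected genes per cell
pctMito    <- Matrix::colSums(counts[is_mito, , drop = FALSE]) / nUMI
pctSpikeIn <- Matrix::colSums(counts[is_spike, , drop = FALSE]) / nUMI

# Feature-level statistic
nCell <- Matrix::rowSums(counts > 0)       # number of cells expressing each gene

data.frame(nUMI, nFeature, pctMito, pctSpikeIn)
nCell
```

As in the pData() output above, pctMito and pctSpikeIn are expressed here as fractions between 0 and 1 rather than as percentages.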

4.2 Using self-customized meta data

In some cases, you may have additional meta data for cells (e.g. sample ID, treatment condition) or features (e.g. full gene name, gene type, genome location) that will be used in downstream analysis, and you want to add them to the sparse eSet object. The createSparseEset() function provides two more arguments, cellData and featureData, to take such customized meta data. For the PBMC14k dataset, we have the true cell type labels and would like to add them to the sparse eSet object.

## read the true labels of cell type for PBMC14k dataset
true_label <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_trueLabel.txt.gz", package = "scMINER"), header = TRUE, row.names = 1, sep = "\t", quote = "", stringsAsFactors = FALSE)

head(true_label)

##                trueLabel_full trueLabel
## CACTTTGACGCAAT CD14+ Monocyte  Monocyte
## GTTACGGAAACGAA CD14+ Monocyte  Monocyte
## AGTCACGACAGGAG CD14+ Monocyte  Monocyte
## TTCGAGGACCAGTA CD14+ Monocyte  Monocyte
## CACTTATGAGTCGT CD14+ Monocyte  Monocyte
## GCATGTGATTCTGT CD14+ Monocyte  Monocyte

table(true_label$trueLabel_full)

## 
##               CD14+ Monocyte                      CD19+ B 
##                         2000                         2000 
##              CD4+/CD25 T Reg   CD4+/CD45RA+/CD25- Naive T 
##                         2000                         2000 
##          CD4+/CD45RO+ Memory                     CD56+ NK 
##                         2000                         2000 
## CD8+/CD45RA+ Naive Cytotoxic 
##                         2000

## the true_label must cover all cells in the expression matrix
table(colnames(pbmc14k_rawCount) %in% row.names(true_label))

## 
##  TRUE 
## 14000
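
If the meta data table contains extra cells or is in a different order than the matrix columns, it can be checked and aligned before it is passed to cellData. The sketch below uses made-up objects (mat and labels are hypothetical); whether createSparseEset() reorders the rows itself is not stated here, so this is only a defensive step.

```r
# Hypothetical toy matrix: 2 genes by 3 cells
mat <- matrix(1:6, nrow = 2,
              dimnames = list(c("GeneA", "GeneB"),
                              c("cell1", "cell2", "cell3")))

# Hypothetical labels: different order, plus one cell that is not in the matrix
labels <- data.frame(trueLabel = c("B", "Monocyte", "NK", "T"),
                     row.names = c("cell3", "cell1", "cell4", "cell2"),
                     stringsAsFactors = FALSE)

# Every cell in the matrix must have a row in the meta data table
missing_cells <- setdiff(colnames(mat), rownames(labels))
stopifnot(length(missing_cells) == 0)

# Keep only the cells in the matrix, in the same order as its columns
labels_aligned <- labels[colnames(mat), , drop = FALSE]
labels_aligned
```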

## create the sparse eSet object using the true_label
pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, cellData = true_label, featureData = NULL, projectID = "PBMC14k", addMetaData = TRUE)

## Creating sparse eset from the input_matrix ...
##  Adding meta data based on input_matrix ...
## Done! The sparse eset has been generated: 17986 genes, 14000 cells.

## check the true labels of cell type from sparse eSet object
head(pData(pbmc14k_raw.eset))

##                trueLabel_full trueLabel projectID nUMI nFeature    pctMito
## CACTTTGACGCAAT CD14+ Monocyte  Monocyte   PBMC14k  764      354 0.01832461
## GTTACGGAAACGAA CD14+ Monocyte  Monocyte   PBMC14k  956      442 0.01569038
## AGTCACGACAGGAG CD14+ Monocyte  Monocyte   PBMC14k 7940     2163 0.01977330
## TTCGAGGACCAGTA CD14+ Monocyte  Monocyte   PBMC14k 4177     1277 0.01149150
## CACTTATGAGTCGT CD14+ Monocyte  Monocyte   PBMC14k  629      323 0.02066773
## GCATGTGATTCTGT CD14+ Monocyte  Monocyte   PBMC14k  875      427 0.02628571
##                pctSpikeIn         CellID
## CACTTTGACGCAAT          0 CACTTTGACGCAAT
## GTTACGGAAACGAA          0 GTTACGGAAACGAA
## AGTCACGACAGGAG          0 AGTCACGACAGGAG
## TTCGAGGACCAGTA          0 TTCGAGGACCAGTA
## CACTTATGAGTCGT          0 CACTTATGAGTCGT
## GCATGTGATTCTGT          0 GCATGTGATTCTGT

table(pData(pbmc14k_raw.eset)$trueLabel_full)

## 
##               CD14+ Monocyte                      CD19+ B 
##                         2000                         2000 
##              CD4+/CD25 T Reg   CD4+/CD45RA+/CD25- Naive T 
##                         2000                         2000 
##          CD4+/CD45RO+ Memory                     CD56+ NK 
##                         2000                         2000 
## CD8+/CD45RA+ Naive Cytotoxic 
##                         2000

4.3 From multiple samples

What if you have multiple samples in one project? It is now common to profile multiple samples of the same tissue under different conditions (e.g. drug treatment) in one project. Analyzing these samples one by one is crucial, but analyzing them in a combined manner may give you additional insights. For this purpose, scMINER provides a function, combineSparseEset(), to combine the sparse eSet objects of multiple samples.

## create a sparse eSet object for each sample to be combined
demo1_mtx <- readInput_10x.dir(input_dir = system.file("extdata/demo_inputs/cell_matrix_10x", package = "scMINER"), featureType = "gene_symbol", removeSuffix = TRUE)

## Reading 10x Genomcis data from: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/scMINER/extdata/demo_inputs/cell_matrix_10x ...
##  Multiple data modalities were found: Gene Expression, Peaks . Only the gene expression data (under "Gene Expression") was kept.
## Done! The sparse gene expression matrix has been generated: 500 genes, 100 cells.

demo1.eset <- createSparseEset(input_matrix = demo1_mtx, projectID = "demo1", addMetaData = TRUE)

## Creating sparse eset from the input_matrix ...
##  Adding meta data based on input_matrix ...
## Done! The sparse eset has been generated: 500 genes, 100 cells.

demo2_mtx <- readInput_table(table_file = system.file("extdata/demo_inputs/table_file/demoData2.txt.gz", package = "scMINER"), sep = "\t", is.geneBYcell = TRUE, removeSuffix = TRUE)

## Reading table file: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/scMINER/extdata/demo_inputs/table_file/demoData2.txt.gz ...
##  Suffix removal was specified but skipped, since some barcodes do not carry "-1" suffix.
## Done! The sparse gene expression matrix has been generated: 1000 genes, 100 cells.

demo2.eset <- createSparseEset(input_matrix = demo2_mtx, projectID = "demo2", addMetaData = TRUE)

## Creating sparse eset from the input_matrix ...
##  Adding meta data based on input_matrix ...
## Done! The sparse eset has been generated: 1000 genes, 100 cells.

## combine the 2 sparse eSet objects
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

combined.eset <- combineSparseEset(eset_list = c(demo1.eset, demo2.eset),
                                   projectID = c("sample1", "sample2"),
                                   addPrefix = c("demo1", "demo2"),
                                   addSurfix = NULL, addMetaData = TRUE, imputeNA = TRUE)

## Combining the input sparse eSets ...
## NA values were found in the merged matrix and have been replaced by the minimum value:  0 .
## Adding meta data based on merged data matrix ...
## Done! The combined sparse eset has been generated: 1500 genes, 200 cells.

dim(combined.eset)

## Features  Samples 
##     1500      200

A few questions you may have about the combineSparseEset() function:

What if the input eSets have different features? combineSparseEset() always keeps all features from the input eSets and generates NA values wherever data is not available. By default, the function imputes the NA values with the minimum value of the combined matrix, which is usually but not always zero. If this imputation method doesn’t fit your study, you can set imputeNA to FALSE to disable it. In that case, the NAs are retained in the eSet object, and you can impute them manually with your own method.
What if the input eSets share some cell barcodes? combineSparseEset() always keeps all cells from the input eSets and reports an error when the same barcode is found in different input eSets. The function provides two arguments, addPrefix and addSurfix, to solve this issue: you can avoid duplicated barcodes across input eSets by adding an eSet-specific prefix and/or suffix to the barcodes.

head(pData(combined.eset))

##                                        CellID projectID nUMI nFeature pctMito
## demo1_AAACAGCCAAACGGGC demo1_AAACAGCCAAACGGGC   sample1  119       43       0
## demo1_AAACAGCCAACTAGCC demo1_AAACAGCCAACTAGCC   sample1   55       28       0
## demo1_AAACAGCCAATTAGGA demo1_AAACAGCCAATTAGGA   sample1   45       20       0
## demo1_AAACAGCCAGCCAGTT demo1_AAACAGCCAGCCAGTT   sample1  175       44       0
## demo1_AAACATGCAAAGCTCC demo1_AAACATGCAAAGCTCC   sample1   51       31       0
## demo1_AAACATGCAATAGCCC demo1_AAACATGCAATAGCCC   sample1  121       44       0
##                        pctSpikeIn
## demo1_AAACAGCCAAACGGGC          0
## demo1_AAACAGCCAACTAGCC          0
## demo1_AAACAGCCAATTAGGA          0
## demo1_AAACAGCCAGCCAGTT          0
## demo1_AAACATGCAAAGCTCC          0
## demo1_AAACATGCAATAGCCC          0
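
The handling of different features and shared barcodes described above can be sketched with two small dense matrices in base R. This is an illustration of the idea only, on made-up data (m1, m2 and their gene and barcode names are hypothetical), and not the actual implementation of combineSparseEset().

```r
# Two hypothetical genes-by-cells matrices with partly different genes
# and one shared barcode ("AAAC")
m1 <- matrix(c(1, 0, 2, 3), nrow = 2,
             dimnames = list(c("GeneA", "GeneB"), c("AAAC", "AAAG")))
m2 <- matrix(c(5, 0, 0, 4), nrow = 2,
             dimnames = list(c("GeneB", "GeneC"), c("AAAC", "AAAT")))

# Add a sample-specific prefix so that the shared barcode becomes unique
colnames(m1) <- paste0("demo1_", colnames(m1))
colnames(m2) <- paste0("demo2_", colnames(m2))

# Keep the union of features; entries with no data start as NA
all_genes <- union(rownames(m1), rownames(m2))
merged <- matrix(NA_real_, nrow = length(all_genes), ncol = ncol(m1) + ncol(m2),
                 dimnames = list(all_genes, c(colnames(m1), colnames(m2))))
merged[rownames(m1), colnames(m1)] <- m1
merged[rownames(m2), colnames(m2)] <- m2

# Count the NAs, then impute them with the minimum observed value
n_missing <- sum(is.na(merged))
merged[is.na(merged)] <- min(merged, na.rm = TRUE)

n_missing
merged
```

Skipping the last imputation step corresponds to setting imputeNA to FALSE: the NA entries stay in the merged matrix until you replace them with a method of your choice.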

I have some customized columns in the phenoData and/or featureData slots. How does combineSparseEset() handle them? combineSparseEset() only keeps the columns of phenoData and featureData that are shared by all input eSets. Your customized columns are kept only when they are available in all input eSets.
Are the 5 meta data statistics in the combined eSet still the same as those generated in each eSet? No. By default, combineSparseEset() updates these 5 meta data statistics (or adds them, if they are not available in the input eSets) based on the combined matrix. It’s not recommended, but you can disable this by setting addMetaData to FALSE.
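
The rule for customized columns can be illustrated with two small phenoData-like tables. This is a sketch of the "shared columns only" idea in base R on made-up data (pd1, pd2 and the treatment column are hypothetical), not the code used inside combineSparseEset().

```r
# Hypothetical cell meta data of two samples; only pd1 has a "treatment" column
pd1 <- data.frame(CellID = c("c1", "c2"), nUMI = c(10, 20),
                  treatment = c("drug", "drug"), stringsAsFactors = FALSE)
pd2 <- data.frame(CellID = c("c3", "c4"), nUMI = c(30, 40),
                  stringsAsFactors = FALSE)
pd_list <- list(pd1, pd2)

# Columns present in every input table
shared_cols <- Reduce(intersect, lapply(pd_list, colnames))

# Stack the tables, keeping the shared columns only
combined_pd <- do.call(rbind, lapply(pd_list, function(x) x[, shared_cols, drop = FALSE]))

shared_cols
combined_pd
```

Here the treatment column is dropped because it is missing from pd2. To keep such a column in a combined object, it would have to be present in every input before combining.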