Chapter 4 Create SparseEset object
In this chapter, we show how to create SparseExpressionSet (SparseEset) objects from a gene expression matrix (genes by cells).
4.1 Solely from the gene expression matrix
This is the most commonly used way to create the sparse eSet object with scMINER:
pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, projectID = "PBMC14k", addMetaData = TRUE)
## Creating sparse eset from the input_matrix ...
## Adding meta data based on input_matrix ...
## Done! The sparse eset has been generated: 17986 genes, 14000 cells.
## SparseExpressionSet (storageMode: environment)
## assayData: 17986 features, 14000 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: CACTTTGACGCAAT GTTACGGAAACGAA ... ACGTGCCTTAAAGG (14000
## total)
## varLabels: CellID projectID ... pctSpikeIn (6 total)
## varMetadata: labelDescription
## featureData
## featureNames: AL627309.1 AP006222.2 ... SRSF10.1 (17986 total)
## fvarLabels: GeneSymbol nCell
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
- input_matrix: usually, but not necessarily, a sparse matrix of raw UMI counts.
  - Data format: it accepts dgCMatrix, dgTMatrix, dgeMatrix, matrix and data.frame.
  - Quantification measures: it takes raw counts, normalized counts (e.g. CPM or CP10k), TPM (Transcripts Per Million), FPKM/RPKM (Fragments/Reads Per Kilobase of transcript per Million) and others.
  - What if a data frame object is given to it? When a non-matrix table is passed to the input_matrix argument, createSparseEset() automatically converts it to a matrix. If the matrix, whether converted from another format or passed directly by the user, is not sparse, createSparseEset() converts it into a sparse matrix by default. This is controlled by the do.sparseConversion argument, which defaults to TRUE. It is not recommended, but you can set it to FALSE to disable the conversion; createSparseEset() will then create the eSet from the regular matrix.
- addMetaData: when this argument is set to TRUE (the default), createSparseEset() automatically generates 5 statistics, 4 for cells and 1 for features, and adds them to the phenoData and featureData slots. These 5 statistics are used in quality control and data filtering.
## CellID projectID nUMI nFeature pctMito pctSpikeIn
## CACTTTGACGCAAT CACTTTGACGCAAT PBMC14k 764 354 0.01832461 0
## GTTACGGAAACGAA GTTACGGAAACGAA PBMC14k 956 442 0.01569038 0
## AGTCACGACAGGAG AGTCACGACAGGAG PBMC14k 7940 2163 0.01977330 0
## TTCGAGGACCAGTA TTCGAGGACCAGTA PBMC14k 4177 1277 0.01149150 0
## CACTTATGAGTCGT CACTTATGAGTCGT PBMC14k 629 323 0.02066773 0
## GCATGTGATTCTGT GCATGTGATTCTGT PBMC14k 875 427 0.02628571 0
## GeneSymbol nCell
## AL627309.1 AL627309.1 50
## AP006222.2 AP006222.2 2
## RP11-206L10.3 RP11-206L10.3 1
## RP11-206L10.2 RP11-206L10.2 33
## RP11-206L10.9 RP11-206L10.9 17
## LINC00115 LINC00115 115
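The conversion and the five statistics described above can be sketched with the Matrix package. This is not scMINER's internal code: the toy data, the variable names, and the "MT-" and "ERCC-" prefixes used to flag mitochondrial and spike-in genes are assumptions made for illustration only.

```r
library(Matrix)

# Toy genes-by-cells data frame (hypothetical values)
toy_df <- data.frame(
  cell1 = c(5, 0, 2, 1),
  cell2 = c(0, 0, 3, 0),
  cell3 = c(1, 4, 0, 0),
  row.names = c("GeneA", "GeneB", "MT-CO1", "ERCC-00002")
)

# Data frame -> regular matrix -> sparse matrix
dense_mtx  <- as.matrix(toy_df)
sparse_mtx <- Matrix(dense_mtx, sparse = TRUE)

# Assumed gene-name conventions for mitochondrial and spike-in genes
is_mito  <- grepl("^MT-", rownames(sparse_mtx), ignore.case = TRUE)
is_spike <- grepl("^ERCC-", rownames(sparse_mtx))

# Four per-cell statistics
nUMI       <- Matrix::colSums(sparse_mtx)       # total counts per cell
nFeature   <- Matrix::colSums(sparse_mtx > 0)   # genes detected per cell
pctMito    <- Matrix::colSums(sparse_mtx[is_mito, , drop = FALSE]) / nUMI
pctSpikeIn <- Matrix::colSums(sparse_mtx[is_spike, , drop = FALSE]) / nUMI

# One per-feature statistic
nCell <- Matrix::rowSums(sparse_mtx > 0)        # cells in which each gene is detected

data.frame(nUMI, nFeature, pctMito, pctSpikeIn)
```

In this sketch pctMito and pctSpikeIn are fractions between 0 and 1, which matches the scale of the values printed above.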
4.2 Using self-customized meta data
In some cases, you may have additional meta data for either cells (e.g. sample ID, treatment condition) or features (e.g. full gene name, gene type, genomic location) that will be used in downstream analysis, and you want to add them to the sparse eSet object. The createSparseEset() function provides two more arguments, cellData and featureData, to take this self-customized meta data. For the PBMC14k dataset, we have the true cell type labels and would like to add them to the sparse eSet object.
## read the true labels of cell type for PBMC14k dataset
true_label <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_trueLabel.txt.gz", package = "scMINER"), header = TRUE, row.names = 1, sep = "\t", quote = "", stringsAsFactors = FALSE)
head(true_label)
## trueLabel_full trueLabel
## CACTTTGACGCAAT CD14+ Monocyte Monocyte
## GTTACGGAAACGAA CD14+ Monocyte Monocyte
## AGTCACGACAGGAG CD14+ Monocyte Monocyte
## TTCGAGGACCAGTA CD14+ Monocyte Monocyte
## CACTTATGAGTCGT CD14+ Monocyte Monocyte
## GCATGTGATTCTGT CD14+ Monocyte Monocyte
##
## CD14+ Monocyte CD19+ B
## 2000 2000
## CD4+/CD25 T Reg CD4+/CD45RA+/CD25- Naive T
## 2000 2000
## CD4+/CD45RO+ Memory CD56+ NK
## 2000 2000
## CD8+/CD45RA+ Naive Cytotoxic
## 2000
## the true_label must cover all cells in the expression matrix
table(colnames(pbmc14k_rawCount) %in% row.names(true_label))
##
## TRUE
## 14000
## create the sparse eSet object using the true_label
pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, cellData = true_label, featureData = NULL, projectID = "PBMC14k", addMetaData = TRUE)
## Creating sparse eset from the input_matrix ...
## Adding meta data based on input_matrix ...
## Done! The sparse eset has been generated: 17986 genes, 14000 cells.
## trueLabel_full trueLabel projectID nUMI nFeature pctMito
## CACTTTGACGCAAT CD14+ Monocyte Monocyte PBMC14k 764 354 0.01832461
## GTTACGGAAACGAA CD14+ Monocyte Monocyte PBMC14k 956 442 0.01569038
## AGTCACGACAGGAG CD14+ Monocyte Monocyte PBMC14k 7940 2163 0.01977330
## TTCGAGGACCAGTA CD14+ Monocyte Monocyte PBMC14k 4177 1277 0.01149150
## CACTTATGAGTCGT CD14+ Monocyte Monocyte PBMC14k 629 323 0.02066773
## GCATGTGATTCTGT CD14+ Monocyte Monocyte PBMC14k 875 427 0.02628571
## pctSpikeIn CellID
## CACTTTGACGCAAT 0 CACTTTGACGCAAT
## GTTACGGAAACGAA 0 GTTACGGAAACGAA
## AGTCACGACAGGAG 0 AGTCACGACAGGAG
## TTCGAGGACCAGTA 0 TTCGAGGACCAGTA
## CACTTATGAGTCGT 0 CACTTATGAGTCGT
## GCATGTGATTCTGT 0 GCATGTGATTCTGT
##
## CD14+ Monocyte CD19+ B
## 2000 2000
## CD4+/CD25 T Reg CD4+/CD45RA+/CD25- Naive T
## 2000 2000
## CD4+/CD45RO+ Memory CD56+ NK
## 2000 2000
## CD8+/CD45RA+ Naive Cytotoxic
## 2000
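The coverage check used in this section can be extended into a small pre-flight step that also puts the meta data rows in the same order as the matrix columns. The sketch below uses only base R on toy objects; toy_mtx and toy_meta are hypothetical. Whether createSparseEset() reorders the rows itself is not stated here, so aligning beforehand is simply a cautious habit.

```r
# Toy genes-by-cells matrix and cell meta data (hypothetical values)
toy_mtx <- matrix(1:6, nrow = 2,
                  dimnames = list(c("GeneA", "GeneB"), c("c1", "c2", "c3")))
toy_meta <- data.frame(
  trueLabel = c("B", "Monocyte", "NK", "T"),
  row.names = c("c3", "c1", "c2", "c4"),
  stringsAsFactors = FALSE
)

# Every barcode in the matrix needs a row in the meta data
missing_cells <- setdiff(colnames(toy_mtx), rownames(toy_meta))
stopifnot(length(missing_cells) == 0)

# Reorder (and subset) the meta data to follow the matrix column order
aligned_meta <- toy_meta[colnames(toy_mtx), , drop = FALSE]

# TRUE when the meta data rows line up with the matrix columns
identical(rownames(aligned_meta), colnames(toy_mtx))
```

Extra rows in the meta data (here c4) are dropped by the subsetting step, while a barcode missing from the meta data stops the script before any object is created.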
4.3 From multiple samples
What if you have multiple samples for one project? It is now common to profile multiple samples of the same tissue under different conditions (e.g. drug treatment) in one project. Analyzing these samples one by one is crucial, and analyzing them in a combined manner may give you additional insights. For this purpose, scMINER provides a function, combineSparseEset(), to easily combine the sparse eSet objects of multiple samples.
## create a sparse eSet object for each sample to be combined
demo1_mtx <- readInput_10x.dir(input_dir = system.file("extdata/demo_inputs/cell_matrix_10x", package = "scMINER"), featureType = "gene_symbol", removeSuffix = TRUE)
## Reading 10x Genomcis data from: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/scMINER/extdata/demo_inputs/cell_matrix_10x ...
## Multiple data modalities were found: Gene Expression, Peaks . Only the gene expression data (under "Gene Expression") was kept.
## Done! The sparse gene expression matrix has been generated: 500 genes, 100 cells.
## Creating sparse eset from the input_matrix ...
## Adding meta data based on input_matrix ...
## Done! The sparse eset has been generated: 500 genes, 100 cells.
demo2_mtx <- readInput_table(table_file = system.file("extdata/demo_inputs/table_file/demoData2.txt.gz", package = "scMINER"), sep = "\t", is.geneBYcell = TRUE, removeSuffix = TRUE)
## Reading table file: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/scMINER/extdata/demo_inputs/table_file/demoData2.txt.gz ...
## Suffix removal was specified but skipped, since some barcodes do not carry "-1" suffix.
## Done! The sparse gene expression matrix has been generated: 1000 genes, 100 cells.
## Creating sparse eset from the input_matrix ...
## Adding meta data based on input_matrix ...
## Done! The sparse eset has been generated: 1000 genes, 100 cells.
combined.eset <- combineSparseEset(eset_list = c(demo1.eset, demo2.eset),
                                   projectID = c("sample1", "sample2"),
                                   addPrefix = c("demo1", "demo2"),
                                   addSurfix = NULL, addMetaData = TRUE, imputeNA = TRUE)
## Combining the input sparse eSets ...
## NA values were found in the merged matrix and have been replaced by the minimum value: 0 .
## Adding meta data based on merged data matrix ...
## Done! The combined sparse eset has been generated: 1500 genes, 200 cells.
## Features Samples
## 1500 200
A few questions you may have about the combineSparseEset() function:
- What if the input eSets have different features? combineSparseEset() ALWAYS keeps all features from the input eSets and generates NA values wherever data is not available. By default, it imputes the NA values with the minimum value of the combined matrix, which is usually but not always zero. If this imputation method doesn’t fit your study, you can set imputeNA to FALSE to disable it. The NAs will then be retained in the eSet object, and you can impute them manually with your own method.
- What if the input eSets share some cell barcodes? combineSparseEset() ALWAYS keeps all cells from the input eSets and reports an error when the same barcode is found in different input eSets. The function provides two arguments, addPrefix and addSurfix, to solve this issue: you can avoid duplicated barcodes across input eSets by adding an eSet-specific prefix and/or suffix to the barcodes.
## CellID projectID nUMI nFeature pctMito
## demo1_AAACAGCCAAACGGGC demo1_AAACAGCCAAACGGGC sample1 119 43 0
## demo1_AAACAGCCAACTAGCC demo1_AAACAGCCAACTAGCC sample1 55 28 0
## demo1_AAACAGCCAATTAGGA demo1_AAACAGCCAATTAGGA sample1 45 20 0
## demo1_AAACAGCCAGCCAGTT demo1_AAACAGCCAGCCAGTT sample1 175 44 0
## demo1_AAACATGCAAAGCTCC demo1_AAACATGCAAAGCTCC sample1 51 31 0
## demo1_AAACATGCAATAGCCC demo1_AAACATGCAATAGCCC sample1 121 44 0
## pctSpikeIn
## demo1_AAACAGCCAAACGGGC 0
## demo1_AAACAGCCAACTAGCC 0
## demo1_AAACAGCCAATTAGGA 0
## demo1_AAACAGCCAGCCAGTT 0
## demo1_AAACATGCAAAGCTCC 0
## demo1_AAACATGCAATAGCCC 0
- I have some customized columns in the phenoData and/or featureData slots. How does combineSparseEset() handle them? combineSparseEset() only keeps the columns of phenoData and featureData that are shared by all input eSets. Your customized columns are kept only when they are available in all input eSets.
- Are the 5 meta data statistics in the combined eSet still the same as those generated in each eSet? No. By default, combineSparseEset() updates these 5 meta data statistics (or adds them, if they are not available in the input eSets) based on the combined matrix. It is not recommended, but you can disable this by setting addMetaData to FALSE.
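The merging behaviour described in these questions can be illustrated with base R on two toy matrices. This is a sketch of the idea (unique barcodes through prefixes, union of features, NA imputation with the minimum value, shared meta data columns only), not the implementation used by combineSparseEset(); all object names and values are hypothetical.

```r
# Two toy genes-by-cells matrices with partly different genes and one shared barcode
m1 <- matrix(c(1, 0, 2, 3), nrow = 2,
             dimnames = list(c("GeneA", "GeneB"), c("AAAC", "AAAG")))
m2 <- matrix(c(4, 5, 0, 6), nrow = 2,
             dimnames = list(c("GeneB", "GeneC"), c("AAAC", "AAAT")))

# Sample-specific prefixes make the barcodes unique across samples
colnames(m1) <- paste0("demo1_", colnames(m1))
colnames(m2) <- paste0("demo2_", colnames(m2))
stopifnot(!anyDuplicated(c(colnames(m1), colnames(m2))))

# Keep the union of features; entries missing in a sample start as NA
all_genes <- union(rownames(m1), rownames(m2))
merged <- matrix(NA_real_,
                 nrow = length(all_genes), ncol = ncol(m1) + ncol(m2),
                 dimnames = list(all_genes, c(colnames(m1), colnames(m2))))
merged[rownames(m1), colnames(m1)] <- m1
merged[rownames(m2), colnames(m2)] <- m2
n_missing <- sum(is.na(merged))

# Impute NA values with the minimum observed value of the merged matrix
merged[is.na(merged)] <- min(merged, na.rm = TRUE)

# Keep only the meta data columns shared by all inputs
pd1 <- data.frame(CellID = colnames(m1), batch = "a", treatment = "drug")
pd2 <- data.frame(CellID = colnames(m2), batch = "b")
shared_cols <- Reduce(intersect, list(colnames(pd1), colnames(pd2)))
combined_pd <- rbind(pd1[, shared_cols, drop = FALSE],
                     pd2[, shared_cols, drop = FALSE])
```

Per-cell statistics such as nUMI would then be recomputed from merged rather than copied from the inputs, since imputed values and the wider feature set can change them.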