Chapter 5 Data filtration

In this chapter, we introduce how scMINER assesses scRNA-seq data quality, estimates the cutoffs for data filtration, and removes low-quality cells and features from the SparseEset object.

QC metrics in scMINER

As mentioned before, scMINER can automatically generate 5 metadata statistics and add them to the SparseEset object. These 5 statistics are the metrics scMINER uses to assess the quality of cells and features:

For cell quality assessment, scMINER provides 4 metrics that are commonly used by the community:
- nUMI: number of total UMIs in each cell. An abnormally high nUMI usually indicates a doublet, while an abnormally low nUMI usually indicates a poorly sequenced cell or an empty droplet.
- nFeature: number of expressed features/genes in each cell. It is interpreted in the same way as nUMI: abnormally high values suggest doublets, and abnormally low values suggest poorly sequenced cells or empty droplets.
- pctMito: percentage of UMIs of mitochondrial genes (defined by “mt-|MT-”) in each cell. An aberrantly high pctMito usually indicates a dying cell.
- pctSpikeIn: percentage of UMIs of spike-in RNAs (defined by “ERCC-|Ercc-”) in each cell. This is used to estimate the normalization factor. Cells with extremely high or low pctSpikeIn need to be removed.

For feature quality assessment, scMINER provides one metric:
- nCell: number of cells expressing the feature/gene. Genes with extremely low nCell are poorly sequenced and usually of low variance.
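
The definitions above can be sketched in base R on a toy count matrix. This is only an illustration of the arithmetic, not scMINER's internal code: the matrix `counts`, its gene names, the "^" anchoring of the patterns, and the choice to express pctMito and pctSpikeIn as fractions between 0 and 1 are all assumptions made for this example.

```r
## Toy genes-by-cells UMI count matrix (hypothetical data, for illustration only)
counts <- matrix(
  c(5, 0, 2,
    0, 0, 1,
    3, 1, 0,
    2, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "MT-CO1", "ERCC-00002"),
                  c("cell1", "cell2", "cell3"))
)

## Per-cell metrics
nUMI     <- colSums(counts)       # total UMIs in each cell
nFeature <- colSums(counts > 0)   # features with at least one UMI in each cell

## Patterns quoted in the text; anchoring them with "^" is an assumption
is_mito    <- grepl("^(mt-|MT-)", rownames(counts))
is_spikein <- grepl("^(ERCC-|Ercc-)", rownames(counts))
pctMito    <- colSums(counts[is_mito, , drop = FALSE]) / nUMI
pctSpikeIn <- colSums(counts[is_spikein, , drop = FALSE]) / nUMI

## Per-feature metric
nCell <- rowSums(counts > 0)      # number of cells expressing each feature

cell_metrics <- data.frame(nUMI, nFeature, pctMito, pctSpikeIn)
print(cell_metrics)
print(nCell)
```

In this toy example, cell2 has a single UMI and it comes from a mitochondrial gene, so it is an outlier on both nUMI and pctMito at the same time.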

5.1 QC report

To help assess the data quality and determine the cutoffs for data filtration, scMINER provides a function, drawSparseEsetQC(), to generate an HTML-format quality control report:

## Generate the QC report
drawSparseEsetQC(input_eset = pbmc14k_raw.eset, output_html_file = "/your-path/PBMC14k/PLOT/pbmc14k_rawCount.html", overwrite = TRUE)

## scMINER supports group-specific QC highlights
drawSparseEsetQC(input_eset = pbmc14k_raw.eset, output_html_file = "/your-path/PBMC14k/PLOT/pbmc14k_rawCount.html", overwrite = FALSE, group_by = "trueLabel")

The quality control report consists of 4 parts:

- Key statistics: highlights 5 key statistics of the given eset object: number of cells, number of genes, mean number of genes per cell, mean number of UMIs per cell, and mean number of cells per gene.
- Detailed statistics of key metrics: summarizes and visualizes the detailed statistics of the 5 key metrics that scMINER uses for filtration: nUMI, nFeature, pctMito, pctSpikeIn, and nCell.
- Detailed statistics per cell and gene: lists the detailed statistics of each gene and cell.
- Filtration cutoffs by scMINER: provides the cutoffs estimated automatically by scMINER based on Median ± 3 * MAD (median absolute deviation), and the pseudo-filtration statistics for both genes and cells under these cutoffs.
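
The Median ± 3 * MAD rule can be sketched in base R as follows. The helper `estimate_cutoffs()` and the toy values are hypothetical and are not part of scMINER. Whether scMINER uses the raw MAD or the rescaled version that `stats::mad()` returns by default is not stated here, so the raw form (`constant = 1`) is an assumption.

```r
## Hypothetical helper (not part of scMINER): cutoffs as median +/- k * MAD
estimate_cutoffs <- function(x, k = 3) {
  center <- median(x)
  ## constant = 1 gives the raw median absolute deviation; stats::mad() defaults
  ## to 1.4826, which rescales it to match the standard deviation of normal data
  spread <- mad(x, constant = 1)
  c(low = center - k * spread, high = center + k * spread)
}

## Toy nUMI values with one obvious outlier
nUMI <- c(900, 1000, 1100, 1200, 1300, 9000)
cutoffs <- estimate_cutoffs(nUMI)
print(cutoffs)

## Pseudo-filtration: count the cells that would pass, without removing anything
passed <- nUMI >= cutoffs["low"] & nUMI <= cutoffs["high"]
print(table(passed))
```

Because both the median and the MAD are robust to extreme values, the single outlier barely moves the cutoffs, which is why this rule is preferred over mean ± standard deviation for QC metrics.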

5.2 Filter the sparse eset object

From the quality control report generated above, we now have a better sense of the data quality and of the cutoffs to use for filtration. scMINER provides a function, filterSparseEset(), for this purpose. It works in two modes:

- auto: in this mode, scMINER uses the cutoffs estimated by Median ± 3 * MAD (median absolute deviation). Based on our tests, this mode works well in most cases with matrices of either raw UMI counts or TPM values.
- manual: in this mode, users can specify the cutoffs, both low and high, for all 5 metrics: nUMI, nFeature, pctMito, and pctSpikeIn for cells, and nCell for genes. Under the default cutoffs of each metric, no cells or features are removed.

Whichever mode you use, filterSparseEset() returns a summary table with detailed filtration statistics. You can refer to it and adjust the cutoffs accordingly.

5.2.1 Data filtration with auto mode

To conduct the filtering using the cutoffs recommended by scMINER:

## Filter eSet under the auto mode
pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both")

## Checking the availability of the 5 metrics ('nCell', 'nUMI', 'nFeature', 'pctMito', 'pctSpikeIn') used for filtration ...
## Checking passed! All 5 metrics are available.
## Filtration is done!

## Filtration Summary:

##  8846/17986 genes passed!

##  13605/14000 cells passed!

## 
## For more details:
##  Gene filtration statistics:
##      Metrics     nCell
##      Cutoff_Low  70
##      Cutoff_High Inf
##      Gene_total  17986
##      Gene_passed 8846(49.18%)
##      Gene_failed 9140(50.82%)
## 
##  Cell filtration statistics:
##      Metrics      nUMI           nFeature        pctMito        pctSpikeIn      Combined
##      Cutoff_Low   458            221             0              0               NA
##      Cutoff_High  3694           Inf             0.0408         0.0000          NA
##      Cell_total   14000          14000           14000          14000           14000
##      Cell_passed  13826(98.76%)  14000(100.00%)  13778(98.41%)  14000(100.00%)  13605(97.18%)
##      Cell_failed  174(1.24%)     0(0.00%)        222(1.59%)     0(0.00%)        395(2.82%)
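
As a quick sanity check on how to read this summary, the Combined column can be reproduced from the per-metric counts. The short sketch below uses only numbers copied from the output above; the variable names are chosen for illustration.

```r
## Numbers copied from the cell filtration statistics above
cell_total      <- 14000
cell_failed     <- c(nUMI = 174, nFeature = 0, pctMito = 222, pctSpikeIn = 0)
combined_failed <- 395

## Cells that pass all four metrics, formatted like the summary table
combined_passed <- cell_total - combined_failed
print(sprintf("%d(%.2f%%)", combined_passed, 100 * combined_passed / cell_total))

## Only nUMI and pctMito fail any cells, so by inclusion-exclusion the number
## of cells failing both is the sum of the two counts minus the combined count
failed_both <- sum(cell_failed) - combined_failed
print(failed_both)
```

The Combined column is therefore not the sum of the per-metric failures: a cell that fails several metrics is counted only once.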

In some cases, you may find that most of the cutoffs generated by the auto mode are good, except for one or two. Although there is no ‘hybrid’ mode, scMINER does allow you to customize some of the cutoffs generated by the auto mode. Simply specify the cutoffs you want to customize when calling the function in auto mode:

## Filter eSet under the auto mode, with customized values
pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both", gene.nCell_min = 5)

With the code above, scMINER filters the eSet using all of the cutoffs generated by the auto mode except gene.nCell_min, which is set to 5.
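
Conceptually, this behaves like overriding selected entries of a list of estimated cutoffs. The sketch below illustrates the idea with base R's `modifyList()`. The list `auto_cutoffs` is a hypothetical stand-in (its values are taken from the summary shown earlier), and this is not how filterSparseEset() is necessarily implemented.

```r
## Hypothetical list standing in for the cutoffs estimated in auto mode
## (values from the summary above; the list itself is not a scMINER object)
auto_cutoffs <- list(
  gene.nCell_min = 70,  gene.nCell_max = Inf,
  cell.nUMI_min  = 458, cell.nUMI_max  = 3694
)

## Cutoffs supplied explicitly by the user
user_cutoffs <- list(gene.nCell_min = 5)

## User-supplied values replace the estimated ones; everything else is kept
final_cutoffs <- modifyList(auto_cutoffs, user_cutoffs)
str(final_cutoffs)
```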

5.2.2 Data filtration with manual mode

To apply your own cutoffs:

## Filter eSet under the manual mode
pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "manual", filter_type = "both", gene.nCell_min = 10, cell.nUMI_min = 500, cell.nUMI_max = 6500, cell.nFeature_min = 200, cell.nFeature_max = 2500, cell.pctMito_max = 0.1)

For any unspecified cutoff argument, such as gene.nCell_max, filterSparseEset() automatically assigns its default value. The default value of a cutoff argument never filters out any cells or features, so to skip a metric, simply leave its cutoffs unspecified. For example, in the code above, gene.nCell_max is unspecified, so filterSparseEset() assigns it the default value, Inf, and no features are filtered out by this argument.
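
The effect of non-filtering defaults can be illustrated with a small base R function. `keep_cells()` and the toy data are hypothetical and only mimic the idea for two metrics. The source confirms Inf as the default for gene.nCell_max; using 0 as the default lower bound here is an assumption.

```r
## Hypothetical helper (not scMINER's implementation): flag cells whose metrics
## fall within [min, max]; the defaults are chosen so that nothing is removed
keep_cells <- function(metrics,
                       nUMI_min = 0, nUMI_max = Inf,
                       pctMito_min = 0, pctMito_max = Inf) {
  metrics$nUMI    >= nUMI_min    & metrics$nUMI    <= nUMI_max &
  metrics$pctMito >= pctMito_min & metrics$pctMito <= pctMito_max
}

## Toy per-cell metrics (made-up values)
toy <- data.frame(
  nUMI    = c(300, 1500, 8000, 2500),
  pctMito = c(0.02, 0.03, 0.05, 0.30),
  row.names = paste0("cell", 1:4)
)

## No cutoffs specified: every cell is kept
print(sum(keep_cells(toy)))

## Only the specified cutoffs take effect; pctMito_min keeps its default
kept <- keep_cells(toy, nUMI_min = 500, nUMI_max = 6500, pctMito_max = 0.1)
print(rownames(toy)[kept])
```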