Chapter 3 Generate gene expression matrix

In this chapter, we will generate the gene expression matrix (genes by cells) from four input formats commonly used in single-cell RNA sequencing: the sparse-matrix data directory by 10x Genomics, the text-table file, the HDF5 file by 10x Genomics, and the H5AD file.

For demonstration purposes, scMINER embeds four datasets in extdata/demo_inputs, one for each input format. All four were generated by downsampling real scRNA-seq data.

3.1 From data directory by 10x Genomics

This is the most popular input format of scRNA-seq data generated by 10x Genomics. Usually, the data directory contains three files:

  • matrix.mtx: a sparse matrix format containing the raw UMI count per cell-gene combination
  • barcodes.tsv: a tab-separated file containing the cell barcodes
  • features.tsv: a tab-separated file containing the features/genes and their annotations

For more details about this format, please check out here.
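To make the role of the three files concrete, here is a minimal sketch that writes a toy 10x-style directory and reads it back with the Matrix package. The directory name, barcodes and gene names are made up for illustration, and this is not how readInput_10x.dir() is implemented.

```r
library(Matrix)

# Write a toy 10x-style directory to a temporary location (hypothetical data)
toy_dir <- file.path(tempdir(), "toy_10x")
dir.create(toy_dir, showWarnings = FALSE)

counts <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2), x = c(5, 1, 2), dims = c(3, 2))
writeMM(counts, file = file.path(toy_dir, "matrix.mtx"))
writeLines(c("AAACAGCC-1", "AAACATGC-1"), file.path(toy_dir, "barcodes.tsv"))
write.table(
  data.frame(id = c("ENSG01", "ENSG02", "ENSG03"),
             symbol = c("GeneA", "GeneB", "GeneC"),
             type = "Gene Expression"),
  file = file.path(toy_dir, "features.tsv"),
  sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE
)

# Read the three files back and assemble a genes-by-cells sparse matrix
mtx <- readMM(file.path(toy_dir, "matrix.mtx"))
barcodes <- readLines(file.path(toy_dir, "barcodes.tsv"))
features <- read.delim(file.path(toy_dir, "features.tsv"), header = FALSE, stringsAsFactors = FALSE)
rownames(mtx) <- features$V2  # second column holds the gene symbols
colnames(mtx) <- barcodes
mtx
```

The matrix file carries only the counts and their positions; the row and column labels come from features.tsv and barcodes.tsv respectively.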

data_dir <- system.file("extdata/demo_inputs/cell_matrix_10x", package = "scMINER")
list.files(path = data_dir, full.names = FALSE)
## [1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"
demo1_mtx <- readInput_10x.dir(input_dir = data_dir, featureType = "gene_symbol", removeSuffix = TRUE, addPrefix = "demo1")
## Reading 10x Genomcis data from: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/scMINER/extdata/demo_inputs/cell_matrix_10x ...
##  Multiple data modalities were found: Gene Expression, Peaks . Only the gene expression data (under "Gene Expression") was kept.
## Done! The sparse gene expression matrix has been generated: 500 genes, 100 cells.
demo1_mtx[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##            demo1_AAACAGCCAAACGGGC demo1_AAACAGCCAACTAGCC demo1_AAACAGCCAATTAGGA
## AL590822.3                      .                      .                      .
## MORN1                           .                      .                      .
## AL589739.1                      .                      .                      .
## AL513477.2                      .                      .                      .
## RER1                            .                      .                      .
##            demo1_AAACAGCCAGCCAGTT demo1_AAACATGCAAAGCTCC
## AL590822.3                      .                      .
## MORN1                           .                      .
## AL589739.1                      .                      .
## AL513477.2                      .                      .
## RER1                            1                      .
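The column names above suggest what removeSuffix and addPrefix do to the barcodes: the "-1" suffix is dropped and the prefix is joined with an underscore. The helper below, tidy_barcodes(), is a hypothetical base-R sketch of that behaviour inferred from the printed output and log messages; it is not the package's own code.

```r
# Hypothetical helper mimicking the barcode handling seen in the output above
tidy_barcodes <- function(barcodes, removeSuffix = TRUE, addPrefix = NULL) {
  if (removeSuffix) {
    if (all(grepl("-1$", barcodes))) {
      barcodes <- sub("-1$", "", barcodes)
    } else {
      # Assumption: removal is skipped unless every barcode carries the suffix
      message("Suffix removal skipped: some barcodes do not carry \"-1\" suffix.")
    }
  }
  if (!is.null(addPrefix)) {
    barcodes <- paste0(addPrefix, "_", barcodes)
  }
  barcodes
}

tidy_barcodes(c("AAACAGCCAAACGGGC-1", "AAACAGCCAACTAGCC-1"), addPrefix = "demo1")
```

A prefix is useful when several samples are merged later, since the same barcode sequence can occur in more than one sample.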

The readInput_10x.dir() function can handle these conditions:

  • Alternative file names for feature data: for the datasets generated by CellRanger 3.0 or earlier, the file name is genes.tsv;
  • Compressed input files: one or more input files are compressed, usually in “.gz” format;
  • Data with multiple modalities: like the single cell multiome data. In this case, readInput_10x.dir() only retains the data of “Gene Expression” by default.
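The three conditions above can be sketched in base R as follows. The helper find_10x_file() and the toy feature table are hypothetical, and the assumption that the modality label sits in the third column of the feature file reflects the usual 10x layout rather than anything stated in this chapter.

```r
# Hypothetical helper: find a file under any of its known names, compressed or not
find_10x_file <- function(dir, base_names) {
  candidates <- file.path(dir, c(base_names, paste0(base_names, ".gz")))
  hits <- candidates[file.exists(candidates)]
  if (length(hits) == 0) {
    stop("None of these files were found: ", paste(basename(candidates), collapse = ", "))
  }
  hits[1]
}

# Toy multiome-style feature file, gzip-compressed (made-up content)
toy_dir <- file.path(tempdir(), "toy_10x_multiome")
dir.create(toy_dir, showWarnings = FALSE)
con <- gzfile(file.path(toy_dir, "features.tsv.gz"), "w")
writeLines(c("ENSG01\tGeneA\tGene Expression",
             "chr1:1-100\tchr1:1-100\tPeaks",
             "ENSG02\tGeneB\tGene Expression"), con)
close(con)

# Accept features.tsv (newer) or genes.tsv (older), with or without .gz
feature_file <- find_10x_file(toy_dir, c("features.tsv", "genes.tsv"))
features <- read.delim(feature_file, header = FALSE, stringsAsFactors = FALSE)

# Keep only the gene expression modality
keep <- features$V3 == "Gene Expression"
features$V2[keep]
```

The same logical vector, keep, would then be used to subset the rows of the count matrix so that features and counts stay aligned.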

3.2 From text-table file

This is the most widely compatible format for scRNA-seq data: it can be produced by all single-cell RNA-seq technologies, such as 10x Genomics, Smart-Seq, Drop-Seq and more. The commonly used text-table file formats include txt (text file format), csv (comma-separated values) and tsv (tab-separated values).

table_file <- system.file("extdata/demo_inputs/table_file/demoData2.txt.gz", package = "scMINER")
demo3_mtx <- readInput_table(table_file = table_file, sep = "\t", is.geneBYcell = TRUE, removeSuffix = TRUE, addPrefix = "demo3") # set is.geneBYcell = FALSE to read features in columns and cells in rows
## Reading table file: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/scMINER/extdata/demo_inputs/table_file/demoData2.txt.gz ...
##  Suffix removal was specified but skipped, since some barcodes do not carry "-1" suffix.
## Done! The sparse gene expression matrix has been generated: 1000 genes, 100 cells.
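When a table stores cells in rows and genes in columns, it has to be transposed to reach the genes-by-cells layout used in this chapter. The sketch below shows that step with base R and the Matrix package on a made-up csv file; it only illustrates the idea behind the is.geneBYcell argument and is not the implementation of readInput_table().

```r
library(Matrix)

# Toy cells-by-genes csv file (hypothetical data)
toy_file <- tempfile(fileext = ".csv")
writeLines(c("cell_id,GeneA,GeneB,GeneC",
             "cell1,0,2,0",
             "cell2,1,0,0"), toy_file)

# Read with the first column as row names, keeping column names untouched
tab <- read.csv(toy_file, row.names = 1, check.names = FALSE)
cell_by_gene <- as.matrix(tab)

# Transpose to genes by cells and store as a sparse matrix
gene_by_cell <- Matrix(t(cell_by_gene), sparse = TRUE)
dim(gene_by_cell)
```

Storing the result as a sparse matrix keeps memory use low, since most entries of a scRNA-seq count table are zero.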

NOTE: A major concern when reading the gene expression matrix from text-table files is that special characters in column names may be converted to dots (“.”), especially when the matrix is organized as cells by genes. This may cause failures in the identification of mitochondrial genes (usually defined by “MT-|mt-”) or spike-in RNAs (usually defined by “ERCC-|Ercc-”). The readInput_table() function sets check.names = FALSE to avoid this issue. However, if the issue already exists in the source data, you will have to fix it manually.
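The effect of check.names can be reproduced with base R alone. This small sketch uses a made-up cells-by-genes table and compares the default behaviour of read.delim() with check.names = FALSE.

```r
# Toy cells-by-genes table with hyphenated gene names (hypothetical data)
toy_file <- tempfile(fileext = ".txt")
writeLines(c("cell\tMT-CO1\tERCC-00002\tRER1",
             "c1\t3\t0\t1",
             "c2\t0\t2\t4"), toy_file)

# Default: column names are passed through make.names(), so "-" becomes "."
default_read <- read.delim(toy_file, row.names = 1)
# With check.names = FALSE the original names are kept
safe_read <- read.delim(toy_file, row.names = 1, check.names = FALSE)

colnames(default_read)
colnames(safe_read)

# The mitochondrial pattern only matches when the hyphen survives
grepl("^MT-|^mt-", colnames(default_read))
grepl("^MT-|^mt-", colnames(safe_read))
```

If a source file already contains names such as MT.CO1, a pattern like "^MT\\.|^mt\\." can locate them before they are renamed by hand.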

3.3 From HDF5 file by 10x Genomics

This is another popular input format of scRNA-seq data generated by 10x Genomics. The Hierarchical Data Format version 5 (HDF5 or H5) is a binary format that can compress and access data much more efficiently than text formats. It’s super useful when dealing with large datasets.

For more details about this format, please check out here.

library(hdf5r)
h5_file <- system.file("extdata/demo_inputs/hdf5_10x/demoData3.h5", package = "scMINER")
demo2_mtx <- readInput_10x.h5(h5_file = h5_file, featureType = "gene_symbol", removeSuffix = TRUE, addPrefix = "demo2")

NOTE: The readInput_10x.h5() function is developed exclusively for HDF5 files generated by CellRanger of 10x Genomics. HDF5 files from other sources may have different hierarchical structures and cannot be read by this function.
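To illustrate what "hierarchical structure" means here: in HDF5 files from recent CellRanger versions, the counts are, to our understanding, stored under a matrix group as compressed sparse column arrays (data, indices, indptr, shape) next to the barcodes and feature names. The sketch below uses made-up vectors standing in for those datasets and rebuilds the sparse matrix with the Matrix package; it does not open a file and is not the code used by readInput_10x.h5().

```r
library(Matrix)

# Toy vectors standing in for datasets read from the file (hypothetical values)
x_data  <- c(5, 1, 2)       # non-zero counts
indices <- c(0L, 2L, 1L)    # zero-based row (gene) index of each count
indptr  <- c(0L, 2L, 3L)    # zero-based column pointers, one more than the number of cells
shape   <- c(3L, 2L)        # genes, cells
genes   <- c("GeneA", "GeneB", "GeneC")
cells   <- c("AAACAGCC-1", "AAACATGC-1")

# Rebuild the genes-by-cells matrix from the compressed sparse column arrays
mtx <- sparseMatrix(i = indices, p = indptr, x = x_data, dims = shape,
                    index1 = FALSE, dimnames = list(genes, cells))
mtx
```

A file that lacks these arrays, or stores them under different names, cannot be mapped this way, which is why files from other sources need their own reader.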

3.4 From H5AD file

The H5AD file is another widely used format for scRNA-seq data. Derived from the HDF5 file format, the H5AD file is designed to store large amounts of data efficiently and to facilitate fast access to subsets of the data. It is becoming increasingly popular in scRNA-seq data analysis, visualization and sharing.

For more details about this format, please check out here.

library(anndata)
h5ad_file <- system.file("extdata/demo_inputs/h5ad_file/demoData4.h5ad", package = "scMINER")
demo4_obj <- readInput_h5ad(h5ad_file = h5ad_file, removeSuffix = TRUE, addPrefix = "demo4") # set is.geneBYcell = FALSE to read features in columns and cells in rows
## Reading h5ad file: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/scMINER/extdata/demo_inputs/h5ad_file/demoData4.h5ad ...
##  Suffix removal was specified but skipped, since some barcodes do not carry "-1" suffix.
## Done! The sparse gene expression matrix has been generated: 1000 genes, 100 cells.

NOTE: Unlike the other three readInput functions, which return a gene expression matrix, readInput_h5ad() returns an AnnData object. Here are the key components of an AnnData object:

  • X: the primary data matrix, cells by genes;
  • obs: observations (cells) metadata;
  • var: variables (features/genes) metadata.
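Because X is stored as cells by genes, it needs to be transposed to obtain the genes-by-cells matrix used elsewhere in this chapter. The sketch below uses a plain R list as a stand-in for an AnnData object so that it runs without the anndata package; the toy values are made up. With a real AnnData object, the same $X, $obs and $var accessors should apply.

```r
library(Matrix)

# A plain list standing in for an AnnData object (hypothetical toy data)
ad <- list(
  X = sparseMatrix(i = c(1, 2, 2), j = c(1, 1, 3), x = c(4, 1, 2), dims = c(2, 3),
                   dimnames = list(c("cell1", "cell2"), c("GeneA", "GeneB", "GeneC"))),
  obs = data.frame(sample = c("demo4", "demo4"), row.names = c("cell1", "cell2")),
  var = data.frame(feature_type = rep("Gene Expression", 3),
                   row.names = c("GeneA", "GeneB", "GeneC"))
)

# Transpose X (cells by genes) to get genes by cells
gene_by_cell <- t(ad$X)

# Rows should now match var (genes) and columns should match obs (cells)
stopifnot(identical(rownames(gene_by_cell), rownames(ad$var)))
stopifnot(identical(colnames(gene_by_cell), rownames(ad$obs)))
dim(gene_by_cell)
```

Checking the row and column names against var and obs after transposing is a cheap way to catch a matrix that was left in the wrong orientation.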