RNASeq analysis on geneXplain platform
Rsubread – a Bioconductor software package that provides high-performance alignment and read counting functions for RNA-seq reads (NAR February 2019). Rsubread integrates read mapping and quantification in a single package.
Subread is a general-purpose read aligner which can align both genomic DNA-seq and RNA-seq reads, based on its unique seed-and-vote design, by which a large number of 16mer subreads from each read are mapped to the reference genome. The subread function accepts raw reads in the form of Fastq, SAM or BAM files and output read alignments in either SAM or BAM format. The output contains the total number of reads, the number of uniquely mapped reads, the number of multi-mapping reads and other mapping statistics. The align function is exceptionally flexible. It performs local read alignment and reports the largest mappable region for each read.
The align function automatically detects insertions and deletions (indels). First step of indels identification is mapping 16mer subreads from each read to the genome and determination of the major mapping location of the read. The second step undertakes a detailed local re-alignment of each read with the aid of collected indels. The align function also writes VCF files containing detected indels.
The align function can align read pairs arbitrarily far apart if the alignment is sufficiently good and no more canonical alignment is available. A weighting strategy is used to give preference to alignments within the expected fragment length bounds. Gene fusions are now supported by allowing different subreads from the same read to map to different chromosomes.
Figure 1 shows an example alignment output of the Subread-align method.
Subjunc is an RNA-seq read aligner that provides comprehensive detection of exon–exon junctions and reports full alignments of junction-spanning reads in BED file format.
First part is mapping a large number of 16mer subreads from each read to the genome. This step detects exon–exon junctions and determines the major mapping location of the read. The second part undertakes a detailed local re-alignment of each read with the aid of collected junctions.
Subjunc function can detect exon–exon junctions de novo and to quantify expression at the level of either genes, exons or exon junctions.
The featureCounts function counts the number of reads or read-pairs that overlap any specified set of genomic features. It can assign reads to any type of genomic region. Regions may be specified as simple genomic intervals (promoter regions) or can be collections of genomic intervals (genes comprising multiple exons). Any set of genomic features can be specified in GTF, GFF or SAF file format. SAF is a Simplified Annotation Format with columns GeneID, Chr, Start, End and Strand.
FeatureCounts produces a matrix of gene-wise counts and can be used as input for gene expression analysis with limma, edgeR or DESeq2. Alternatively, a matrix of exon-level counts can be produced suitable for differential exon usage analyses using limma, edgeR or DEXSeq.
FeatureCounts outputs the genomic length and position of each feature as well as the read count, making it straightforward to calculate summary measures such as RPKM (reads per kilobase per million reads).
The exactSNP function calls SNPs for individual samples, without requiring control samples to be provided. It tests the statistical significance of SNPs by comparing SNP signals to their background noises.
The Limma-voom function performs differential expression analysis for pre-processed RNA-seq data (single channel experiments) with sample-specific quality weights when the library sizes are quite variable between samples or the presence of outlier samples is given. The output reports the top100 differentially expressed genes and a pdf document containing density plots from raw and filtered counts, plot about the Mean−variance trend and gives visual information about sample clustering.
Figure 2 shows an example output of Limma-Voom method.
GeneXplain’s in-house implementation of a linear model analysis using Limma with experimental design specified through an annotation table. This tool performs linear model analysis on the given input table guided by selected experimental factors defined in a sample table. The analysis aims at finding significant differences between pairs of levels of a main factor. Furthermore, an ANOVA is carried out for all contrasts together. The assignment of main factor levels to columns of the input table is specified in a column of a sample table. Additional variables can be controlled by providing their column names in the sample table. Moreover, Surrogate Variable Analysis can be included to infer unspecified factors.
HISAT is a very fast and sensitive alignment tool for mapping next-generation sequencing reads (DNA and RNA) to a population of human genomes (as well as to a single reference genome). HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).
HISAT provides several alignment strategies specifically designed for mapping different types of RNA-seq reads. All these together, HISAT enables extremely fast and sensitive alignment of reads, in particular those spanning two exons or more. As a result, HISAT is much faster (over 50 times) than TopHat2 with better alignment quality. HISAT uses the Bowtie2 implementation to handle most of the operations on the FM index. In addition to spliced alignment, HISAT handles reads involving indels and supports a paired-end alignment mode. HISAT outputs alignments in SAM format.
This tool takes an alignment file in SAM or BAM format and feature file in GFF format and calculates the number of reads mapping to each feature. It uses the htseq-count script that is part of the HTSeq python module.
A feature is an interval (i.e., a range of positions) on a chromosome or a union of such intervals. In the case of RNA-Seq, the features are typically genes, where each gene is considered here as the union of all its exons. One may also consider each exon as a feature, e.g., in order to check for alternative splicing. For comparative ChIP-Seq, the features might be binding regions from a pre-determined list.
Figure 3 shows resulting table of calculated RNA-seq counts mapping to Ensembl genes as selected feature.
This tool takes a file with high-throughput sequencing reads (either raw or aligned reads) and performs a simple quality assessment by producing plots showing the distribution of called bases and base-call quality scores by position within the reads. Output is a PDF file with all quality plots.
Figure 4 shows example output plots for the quality assessment from RNA-seq raw data.
Empirical Analysis of Digital Gene Expression Data in R (EdgeR)
Differential expression analysis of RNA-seq expression profiles with biological replication. Implements a range of statistical methodology based on the negative binomial distributions, including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests. As well as RNA-seq, it can be applied to differential signal analysis of other types of genomic data that produce counts, including ChIP-seq, Bisulfite-seq, SAGE and CAGE.
The tool can be applied to any technology that produces read counts for genomic features. Of interest are summaries of short reads from massively parallel sequencing technologies such as Illumina™, 454 or ABI SOLiD applied to RNA-Seq, SAGE-Seq or ChIP-Seq experiments, pooled shRNA-seq or CRISPR-Cas9 genetic screens and bisulfite sequencing for DNA methylation studies. EdgeR provides statistical routines for assessing differential expression in RNA-Seq experiments or differential marking in ChIP-Seq experiments.
EdgeR can be applied to differential expression at the gene, exon, transcript or tag level. In fact, read counts can be summarized by any genomic feature. EdgeR analyses at the exon level are easily extended to detect differential splicing or isoform-specific differential expression.
This tool uses the edgeR quasi-likelihood pipeline (edgeR-quasi) for differential expression analysis. This statistical methodology uses negative binomial generalized linear models, but with F-tests instead of likelihood ratio tests. This method provides stricter error rate control than other negative binomial based pipelines, including the traditional edgeR pipelines or DESeq2. While the limma pipelines are recommended for large-scale datasets, because of their speed and flexibility, the edgeR-quasi pipeline gives better performance in low-count situations.