RNA Sequencing (RNA-Seq) is the application of Next Generation Sequencing (NGS) technologies to determine relative gene expression levels within a biological sample, at a far higher resolution than is available with Sanger sequencing- and microarray-based methods. RNA-Seq has been used successfully to determine gene expression differences between biological samples, precisely quantify alternative transcript levels, determine regions of de novo gene expression, confirm or revise previously annotated 5′ and 3′ ends of genes, and map exon/intron boundaries.
Although it was first published several years ago, the ENCODE Consortium “Standards, Guidelines and Best Practices for RNA-Seq” guide remains a useful starting point, as it describes many of the considerations to take into account when planning an RNA-Seq experiment. We recommend at least 3 biological replicates per condition, and more if the experiment involves human tissue.
Single reads or paired end reads?
Single read sequencing is usually sufficient for counting transcripts for gene expression analysis in model organisms. Paired end sequencing provides additional coverage of transcripts, which is useful for determining transcript structure (e.g. splice variants) and for de novo transcriptome assembly.
Differential Expression Analysis Pipeline
- Assess the quality of the sequence reads using FastQC (see previous section)
- Map reads to reference genome
- Determine exon/gene level expression values
- Differential gene expression analysis
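To make the last two steps more concrete, the following is a minimal Python sketch, not a substitute for a dedicated statistics package. It assumes a small, made-up table of raw read counts per gene, scales each sample to counts per million (CPM), and reports a log2 fold change between two conditions. The gene names, the counts, and the pseudocount of 1 are all illustrative assumptions; a real analysis would also model count variability and test for significance.

```python
import math

def counts_per_million(counts):
    """Scale the raw read counts of one sample so that they sum to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

def mean_cpm(samples):
    """Average the CPM of each gene across the replicate samples of one condition."""
    cpms = [counts_per_million(s) for s in samples]
    return {g: sum(c[g] for c in cpms) / len(cpms) for g in cpms[0]}

def log2_fold_change(control, treated, pseudocount=1.0):
    """log2(treated / control) per gene; the pseudocount avoids division by zero."""
    ctrl = mean_cpm(control)
    trt = mean_cpm(treated)
    return {g: math.log2((trt[g] + pseudocount) / (ctrl[g] + pseudocount))
            for g in ctrl}

# Hypothetical raw counts: three biological replicates per condition.
control = [
    {"geneA": 100, "geneB": 900},
    {"geneA": 110, "geneB": 890},
    {"geneA": 90, "geneB": 910},
]
treated = [
    {"geneA": 400, "geneB": 600},
    {"geneA": 380, "geneB": 620},
    {"geneA": 420, "geneB": 580},
]

lfc = log2_fold_change(control, treated)
for gene, value in sorted(lfc.items()):
    print(f"{gene}\tlog2FC = {value:.2f}")
```

A fold change on its own says nothing about statistical significance, which is why replicates and a proper count-based model matter for the final step.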
There are a large number of different software tools available for RNA-Seq analysis. One of the most popular is the collection of analysis packages known as the Tuxedo tools, developed by the Center for Computational Biology (CCB) at Johns Hopkins University:
- Bowtie: Used for mapping short sequence reads to a reference genome.
- TopHat: A spliced alignment system for RNA-seq experiments.
- Cufflinks: A transcript assembler and abundance estimator for RNA-seq data. Cufflinks assembles transcripts from the alignments produced by TopHat, including novel isoforms, and quantitates those transcripts.
- Cufflinks-Cuffdiff: Used to find significant changes in transcript expression, splicing, and promoter use.
Alternatively, the analysis can be done using R/Bioconductor:
- Bowtie or BWA is used to map the reads to a reference genome, and the resulting alignments are stored in a ‘.bam’ file.
- Rsamtools and GenomicRanges are used in R to import the .bam files, and the countOverlaps function is used to calculate the read counts for each exon/gene.
- edgeR or DESeq can then be used for differential expression analysis.
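The read-counting step in this workflow can be illustrated with a short, self-contained Python sketch. It is not the Bioconductor implementation; it only shows the underlying idea of counting how many aligned reads overlap each annotated feature. The `Interval` type, the gene coordinates, and the read positions are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

def overlaps(a, b):
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

def count_overlaps(features, reads):
    """Count, for each named feature, the reads that overlap it."""
    return {name: sum(overlaps(iv, r) for r in reads)
            for name, iv in features.items()}

# Hypothetical gene annotation.
genes = {
    "geneA": Interval("chr1", 100, 500),
    "geneB": Interval("chr1", 800, 1200),
    "geneC": Interval("chr2", 100, 500),
}

# Hypothetical aligned reads, as they might be read from a .bam file.
reads = [
    Interval("chr1", 90, 140),     # spans the start of geneA
    Interval("chr1", 300, 350),    # inside geneA
    Interval("chr1", 480, 530),    # spans the end of geneA
    Interval("chr1", 600, 650),    # intergenic
    Interval("chr1", 1000, 1050),  # inside geneB
    Interval("chr2", 600, 650),    # on chr2, but outside geneC
]

counts = count_overlaps(genes, reads)
print(counts)  # {'geneA': 3, 'geneB': 1, 'geneC': 0}
```

This sketch compares every read with every feature, which is fine for a toy example but too slow for real data; dedicated tools use indexed interval structures to do the same job efficiently. The resulting table of counts per gene is the kind of input that differential expression packages expect.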