RNA-Seq data processing and gene expression analysis | H3ABioNet Standard Operating Procedures


Step 2.1: Generation of gene/transcript-level counts

Protocol 1: Alignment-based approach

Once the data have been cleaned up, the next step is alignment to the reference genome. Various tools are available for this step, but the alignment tool chosen here must be “splice-aware”. That means the tool must be able to align reads that cross an exon-exon junction, i.e. reads that contain sequence from the two exons on either side of an intron (also called intron-spanning reads). STAR¹, HISAT2², GSNAP³ and SOAPsplice⁴ are some of the many splice-aware aligners available. Note that TopHat used to be a very commonly used tool from the Tuxedo suite, but it is no longer supported and has been superseded by HISAT2 (recommended for human data).
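To make the idea of an intron-spanning read concrete: in SAM/BAM output, a spliced alignment is recorded with an `N` operation (a skipped region of the reference) in the CIGAR string. The sketch below is illustrative only; the function names are hypothetical and it inspects a bare CIGAR string rather than reading a real alignment file.

```python
import re
from typing import List

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def intron_lengths(cigar: str) -> List[int]:
    """Return the lengths of all 'N' (skipped reference region) operations."""
    return [int(length) for length, op in CIGAR_OP.findall(cigar) if op == "N"]


def is_intron_spanning(cigar: str) -> bool:
    """True if the alignment skips at least one reference region (a putative intron)."""
    return len(intron_lengths(cigar)) > 0


if __name__ == "__main__":
    # 50 bases in one exon, a 1200 bp gap on the reference, 50 bases in the next exon.
    print(is_intron_spanning("50M1200N50M"))
    # A read aligned contiguously within a single exon.
    print(is_intron_spanning("100M"))
```

An aligner that is not splice-aware cannot produce the first kind of alignment, so such reads would end up unmapped or soft-clipped.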

When performing alignments it is imperative to set up the parameters properly to ensure the best alignment. Irrespective of the aligner used, it needs the following information:

Are the data made up of single-end reads or paired-end reads?
Are the data stranded? If so, was the standard dUTP method employed (STAR can detect this automatically)?

The other input that should be provided when setting up the alignment is the gene annotation: a GTF or GFF3 file that gives the location of every gene in the coordinates of the reference genome (a FASTA file). It is very important to pick the gene annotation file that corresponds to the reference genome, i.e. the same version number and the same source (Ensembl, UCSC or NCBI).
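One quick sanity check for a mismatched reference and annotation is to compare sequence names, since sources differ in naming (for example, Ensembl uses `1` where UCSC uses `chr1`). The sketch below is a heuristic under that assumption, not a full validation: it assumes uncompressed files, the helper names are hypothetical, and matching names do not prove that the assembly versions agree.

```python
from typing import Set


def fasta_sequence_names(fasta_path: str) -> Set[str]:
    """Collect sequence names (first token of each '>' header) from a FASTA file."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_sequence_names(gtf_path: str) -> Set[str]:
    """Collect the first column (seqname) of every non-comment line of a GTF/GFF3 file."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0].strip())
    return names


def annotation_only_names(fasta_path: str, gtf_path: str) -> Set[str]:
    """Sequence names used by the annotation but absent from the reference FASTA."""
    return gtf_sequence_names(gtf_path) - fasta_sequence_names(fasta_path)
```

If `annotation_only_names(...)` returns many names, or every name in the annotation, the two files most likely come from different sources or builds.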

Note that almost all aligners require an index to be created from the reference genome (a FASTA file). This speeds up alignment drastically. For STAR, the index can be created using the “genomeGenerate” run mode, with or without an annotation file.
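As a minimal sketch, the STAR index command can be assembled as an argument list and then run with `subprocess.run(cmd, check=True)`. The function name, paths and thread count are placeholders; the `--sjdbOverhang` value of read length minus one follows the usual recommendation in the STAR manual and only applies when an annotation is supplied.

```python
from typing import List, Optional


def star_index_command(
    genome_dir: str,
    fasta: str,
    gtf: Optional[str] = None,
    read_length: Optional[int] = None,
    threads: int = 1,
) -> List[str]:
    """Build the argument list for STAR's genomeGenerate run mode."""
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,       # output directory for the index
        "--genomeFastaFiles", fasta,     # reference genome FASTA
    ]
    if gtf is not None:
        # Optional: include annotated splice junctions in the index.
        cmd += ["--sjdbGTFfile", gtf]
        if read_length is not None:
            cmd += ["--sjdbOverhang", str(read_length - 1)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute the real reference and annotation files.
    print(" ".join(star_index_command("star_index", "genome.fa", "genes.gtf", 100, 8)))
```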

Once the alignment is complete, the final result will be a file in SAM or BAM format. For a STAR run, the main alignment output will end in “Aligned.out.sam” by default, but it can also be written as a sorted or unsorted BAM file. In addition, STAR writes log files that record the run parameters (“Log.out”), report the progress of the run (“Log.progress.out”) and save a summary of the final results (“Log.final.out”). There is also a splice junctions file (“SJ.out.tab”) that details high-confidence splice junctions.
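The alignment call itself can be sketched the same way. The helper below is hypothetical and covers only a handful of STAR options: one FASTQ file for single-end data or two for paired-end data, an output prefix, and optionally a coordinate-sorted BAM instead of the default SAM. File names are placeholders, and `zcat` is assumed to be available for gzipped input.

```python
from typing import List, Sequence


def star_align_command(
    genome_dir: str,
    reads: Sequence[str],
    out_prefix: str,
    sorted_bam: bool = False,
    gzipped: bool = False,
    threads: int = 1,
) -> List[str]:
    """Build a STAR alignment command for single-end (1 file) or paired-end (2 files) reads."""
    if len(reads) not in (1, 2):
        raise ValueError("expected one FASTQ file (single-end) or two (paired-end)")
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        "--outFileNamePrefix", out_prefix,
    ]
    if gzipped:
        # Tell STAR how to decompress the input on the fly.
        cmd += ["--readFilesCommand", "zcat"]
    if sorted_bam:
        cmd += ["--outSAMtype", "BAM", "SortedByCoordinate"]
    return cmd


def expected_alignment_file(out_prefix: str, sorted_bam: bool = False) -> str:
    """Name of the main alignment output for the two cases handled above."""
    suffix = "Aligned.sortedByCoord.out.bam" if sorted_bam else "Aligned.out.sam"
    return out_prefix + suffix
```

For example, `star_align_command("star_index", ["s1_R1.fastq.gz", "s1_R2.fastq.gz"], "s1_", sorted_bam=True, gzipped=True)` describes a paired-end run whose main output would be `s1_Aligned.sortedByCoord.out.bam`.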

Once the alignment is completed, the first step is to check how many reads aligned to the genome. For STAR, these details can be found in the “Log.final.out” file. For good-quality RNA-Seq data from human samples, about 70-90% of the reads should map somewhere on the genome. If the data in question are of good quality but fewer than 60% of the reads map to the genome, it is worth re-evaluating the alignment parameters and testing the unmapped reads for the presence of potential contaminants.
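This check can be automated with a small parser for “Log.final.out”. The sketch below makes several assumptions: that each statistic appears as a `label | value` line, that the labels match those written by the STAR version in use, and that the overall mapping rate is taken as uniquely mapped plus multi-mapped reads. The thresholds come from the guidance above; the "borderline" label for the 60-70% range is an illustrative choice, not part of the protocol.

```python
from typing import Dict


def parse_star_final_log(text: str) -> Dict[str, str]:
    """Turn the 'label | value' lines of a STAR Log.final.out into a dictionary."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headings and blank lines carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def mapped_percent(stats: Dict[str, str]) -> float:
    """Uniquely mapped plus multi-mapped reads, as a percentage of input reads."""
    unique = float(stats["Uniquely mapped reads %"].rstrip("%"))
    multi = float(stats["% of reads mapped to multiple loci"].rstrip("%"))
    return unique + multi


def mapping_verdict(percent: float) -> str:
    """Classify a mapping rate using the thresholds given in the text."""
    if percent < 60:
        return "investigate"   # re-check parameters, screen unmapped reads for contaminants
    if percent < 70:
        return "borderline"    # illustrative label; the text does not define this range
    return "ok"


if __name__ == "__main__":
    with open("Log.final.out") as handle:  # path relative to the STAR output prefix
        pct = mapped_percent(parse_star_final_log(handle.read()))
    print(f"{pct:.2f}% mapped: {mapping_verdict(pct)}")
```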

Bibliography

1. Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., … & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21.
2. Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357.
3. Wu, T. D., & Nacu, S. (2010). Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics, 26(7), 873-881.
4. Huang, S., Zhang, J., Li, R., Zhang, W., He, Z., Lam, T. W., … & Yiu, S. M. (2011). SOAPsplice: genome-wide ab initio detection of splice junctions from RNA-Seq data. Frontiers in Genetics, 2, 46.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
