Step 2.3: Collecting and tabulating alignment stats

Protocol 1

For a given RNA-Seq run, it is valuable to collect several statistics from the alignment and counting steps, since they are an important part of evaluating the run. Many tools gather this information from the SAM or BAM output file, e.g., samtools’ flagstat [1] and Picard’s CollectAlignmentSummaryMetrics [2]. However, each reporting tool has its quirks, so it is recommended to collect this information as described in the table below.

| Gather numbers for the following categories | Calculating or gathering the information | Tool generating the information |
|---|---|---|
| Total reads | "Total Sequences" in the "Basic Statistics" section | FastQC |
| Total reads after trimming | "Total Sequences" in the "Basic Statistics" section (FastQC) OR "Number of input reads" in the "Log.final.out" file (STAR) | FastQC OR STAR (prefix_Log.final.out) |
| Unmapped reads | Take the "Number of input reads" and subtract "Uniquely mapped reads number", "Number of reads mapped to multiple loci", and "Number of reads mapped to too many loci" | STAR (prefix_Log.final.out) |
| Reads mapped to genome | Add "Uniquely mapped reads number", "Number of reads mapped to multiple loci", and "Number of reads mapped to too many loci" | STAR (prefix_Log.final.out) |
| Multiply mapped reads | Add "Number of reads mapped to multiple loci" and "Number of reads mapped to too many loci" | STAR (prefix_Log.final.out) |
| Reads mapped to genes | Listed as "Assigned" in the ".summary" file | featureCounts (prefix.txt.summary) |
| Uniquely mapped reads without an associated gene | Listed as "Unassigned_NoFeatures" in the ".summary" file | featureCounts (prefix.txt.summary) |
| Uniquely mapped reads with an ambiguous gene assignment | Listed as "Unassigned_Ambiguity" in the ".summary" file | featureCounts (prefix.txt.summary) |
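
The arithmetic in the table above can be sketched in Python. This is a minimal illustration, not part of the protocol itself: the function names (parse_star_log, parse_featurecounts_summary, alignment_stats) are hypothetical, the example numbers are made up, and the parsers assume the usual "name | value" layout of STAR's Log.final.out and the tab-delimited layout of the featureCounts ".summary" file, reading only the first sample column.

```python
def parse_star_log(text):
    """Return {field name: value string} from the text of a STAR Log.final.out."""
    fields = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        key, _, value = line.partition("|")
        fields[key.strip()] = value.strip()
    return fields


def parse_featurecounts_summary(text):
    """Return {status: count} for the first sample column of a .summary file."""
    counts = {}
    for line in text.splitlines()[1:]:  # skip the "Status <bam>" header row
        parts = line.split("\t")
        if len(parts) >= 2:
            counts[parts[0]] = int(parts[1])
    return counts


def alignment_stats(star, fc):
    """Combine both reports into the categories listed in the table."""
    total = int(star["Number of input reads"])
    unique = int(star["Uniquely mapped reads number"])
    multi = int(star["Number of reads mapped to multiple loci"])
    too_many = int(star["Number of reads mapped to too many loci"])
    mapped = unique + multi + too_many
    return {
        "Total reads after trimming": total,
        "Unmapped reads": total - mapped,
        "Reads mapped to genome": mapped,
        "Multiply mapped reads": multi + too_many,
        "Reads mapped to genes": fc["Assigned"],
        "Uniquely mapped reads without an associated gene": fc["Unassigned_NoFeatures"],
        "Uniquely mapped reads with an ambiguous gene assignment": fc["Unassigned_Ambiguity"],
    }


# Made-up excerpts standing in for prefix_Log.final.out and prefix.txt.summary.
STAR_EXAMPLE = (
    "                          Number of input reads |\t1000\n"
    "                   Uniquely mapped reads number |\t800\n"
    "        Number of reads mapped to multiple loci |\t100\n"
    "        Number of reads mapped to too many loci |\t10\n"
)
FC_EXAMPLE = (
    "Status\tsample.bam\n"
    "Assigned\t700\n"
    "Unassigned_Ambiguity\t20\n"
    "Unassigned_NoFeatures\t80\n"
)

stats = alignment_stats(
    parse_star_log(STAR_EXAMPLE),
    parse_featurecounts_summary(FC_EXAMPLE),
)
for category, value in stats.items():
    print(f"{category}\t{value}")
```

For real data, the two example strings would be replaced by the contents of the corresponding files, for instance via open(path).read(). "Total reads" before trimming still has to be taken from the FastQC report.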

You may also find it easier to collate all of the summary data from FastQC, STAR, featureCounts, and Picard’s metrics by running MultiQC on the reports generated by these programs. MultiQC generates an HTML file that visually summarizes the data, as well as tab-delimited files containing all the statistics produced by these programs.
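
The tab-delimited output can then be loaded for further tabulation. The sketch below is illustrative only: load_general_stats is a hypothetical helper, the example table is made up, and it assumes a file such as multiqc_data/multiqc_general_stats.txt whose first column holds the sample name. The actual column names depend on the MultiQC version and on which tools' reports were found.

```python
import csv
import io


def load_general_stats(handle):
    """Return {sample name: {column: value}} from a tab-delimited stats table."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_col = reader.fieldnames[0]  # assumed to hold the sample name
    return {
        row[sample_col]: {k: v for k, v in row.items() if k != sample_col}
        for row in reader
    }


# Made-up stand-in for a MultiQC table; real column names will differ.
example = io.StringIO(
    "Sample\ttotal_reads\tassigned\n"
    "sampleA\t1000\t700\n"
    "sampleB\t2000\t1500\n"
)
general_stats = load_general_stats(example)
for sample, columns in general_stats.items():
    print(sample, columns)
```

With a real report, the StringIO object would be replaced by an open file handle, e.g. open("multiqc_data/multiqc_general_stats.txt", newline=""). Values are returned as strings and need converting before any arithmetic.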

Bibliography

  1. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079. 

  2. Picard