Step 2.3: Collecting and tabulating alignment stats
Protocol 1
For a given RNA-Seq run, it is valuable to collect several statistics from the alignment and counting steps, since they are central to evaluating the run. Many tools gather this information from the SAM or BAM output file, e.g., samtools’ flagstat [1] and Picard’s CollectAlignmentSummaryMetrics [2]. However, each reporting tool has its own quirks, so we recommend collecting this information as described in the table below.
| Gather numbers for the following categories | Calculating or gathering the information | Tool generating the information |
| --- | --- | --- |
| Total reads | "Total Sequences" in the "Basic Statistics" section | FastQC |
| Total reads after trimming | "Total Sequences" in the "Basic Statistics" section (FastQC) OR "Number of input reads" in the "Log.final.out" file (STAR) | FastQC OR STAR (prefix_Log.final.out) |
| Unmapped reads | Take the "Number of input reads" and subtract "Uniquely mapped reads number", "Number of reads mapped to multiple loci", and "Number of reads mapped to too many loci" | STAR (prefix_Log.final.out) |
| Reads mapped to genome | Add "Uniquely mapped reads number", "Number of reads mapped to multiple loci", and "Number of reads mapped to too many loci" | STAR (prefix_Log.final.out) |
| Multiply mapped reads | Add "Number of reads mapped to multiple loci" and "Number of reads mapped to too many loci" | STAR (prefix_Log.final.out) |
| Reads mapped to genes | This number is listed as "Assigned" in the ".summary" file | featureCounts (prefix.txt.summary) |
| Uniquely mapped reads without an associated gene | This number is listed as "Unassigned_NoFeatures" in the ".summary" file | featureCounts (prefix.txt.summary) |
| Uniquely mapped reads with an ambiguous gene assignment | This number is listed as "Unassigned_Ambiguity" in the ".summary" file | featureCounts (prefix.txt.summary) |
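The arithmetic in the table can be scripted. The following is a minimal Python sketch, not part of the original protocol: the function names (`parse_star_log`, `parse_featurecounts_summary`, `alignment_stats`) are hypothetical, and it assumes that STAR's Log.final.out separates field names from values with a `|` character and that the featureCounts ".summary" file is tab-delimited with a header row and one count column per BAM file. "Total reads" before trimming still has to be taken from the FastQC report.

```python
def parse_star_log(text):
    """Map each 'name | value' line of a STAR Log.final.out file to a string value."""
    fields = {}
    for line in text.splitlines():
        if "|" in line:
            key, value = line.split("|", 1)
            fields[key.strip()] = value.strip()
    return fields


def parse_featurecounts_summary(text, column=1):
    """Map each status (e.g. 'Assigned') to its count in the chosen sample column."""
    counts = {}
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split("\t")
        if len(parts) > column:
            counts[parts[0]] = int(parts[column])
    return counts


def alignment_stats(star_text, summary_text):
    """Combine STAR and featureCounts numbers following the table above."""
    star = parse_star_log(star_text)
    fc = parse_featurecounts_summary(summary_text)

    total = int(star["Number of input reads"])
    unique = int(star["Uniquely mapped reads number"])
    multi = int(star["Number of reads mapped to multiple loci"])
    too_many = int(star["Number of reads mapped to too many loci"])
    mapped = unique + multi + too_many

    return {
        "Total reads after trimming": total,
        "Unmapped reads": total - mapped,
        "Reads mapped to genome": mapped,
        "Multiply mapped reads": multi + too_many,
        "Reads mapped to genes": fc["Assigned"],
        "Uniquely mapped reads without an associated gene": fc["Unassigned_NoFeatures"],
        "Uniquely mapped reads with an ambiguous gene assignment": fc["Unassigned_Ambiguity"],
    }
```

To use it, read the contents of prefix_Log.final.out and prefix.txt.summary into strings and pass them to `alignment_stats`. The returned dictionary can then be written out as one row per sample.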
You may also find it easier to collate the summary data from FastQC, STAR, featureCounts, and Picard by running MultiQC on the reports generated by these programs. MultiQC generates an HTML file that visually summarizes the data, as well as tab-delimited files containing the statistics produced by each program.
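The tab-delimited files can be loaded for further tabulation. Below is a small Python sketch under stated assumptions: the helper name `read_general_stats` is hypothetical, and it assumes a table with a header row whose first column is named "Sample", which is how MultiQC's general statistics file is typically laid out. The remaining column names depend on the MultiQC version and the modules that ran, so the sketch passes them through unchanged.

```python
import csv


def read_general_stats(handle):
    """Return {sample name: {column name: value string}} from a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        sample = row["Sample"]  # assumed name of the first column
        table[sample] = {k: v for k, v in row.items() if k != "Sample"}
    return table
```

Pass an open file handle for the tab-delimited file of interest. Values are kept as strings because some columns may be empty for samples that a given tool did not process.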
Bibliography
1. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.