Step 1.2: Adaptor and Quality trimming + Removal of very short reads
This step covers three major preprocessing operations that clean up the data and reduce noise in the overall analysis.
- Adaptors (glossary term) are artificial pieces of DNA introduced prior to sequencing to ensure that the DNA fragment being sequenced attaches to the sequencing flow cell. These adaptors usually get sequenced along with the fragment, and in most cases they have already been removed from the reads. Sometimes, however, bits of adaptor are left behind, anywhere from 20% to 90% of the adaptor length, and these remnants need to be removed from the reads. The adaptor sequence for this step must be obtained from the same source as the sequence data.
- Frequently, the quality of the sequenced bases drops off toward one end of the read. A low quality base call means that the assigned nucleotide has a higher probability of being incorrect (see this link for a more in-depth overview of quality scores). It is best to trim any low quality bases from the ends of reads to ensure the best alignment to the reference. A quality score below 25 is usually considered “poor”.
- Once the adaptor remnants and low quality ends have been trimmed, some reads may end up being very short (e.g. <20 bases). These short reads are likely to align to multiple (wrong) locations on the reference, introducing noise. Hence any reads that are shorter than a predetermined cutoff (e.g. 20) need to be removed.
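To make the three clean-up operations above concrete, here is a minimal Python sketch that applies them to a single read. It is an illustration only, not how Trimmomatic or any of the other tools work internally. It assumes Phred+33 quality encoding, exact (mismatch-free) adaptor matching at the 3' end, and the example thresholds from the text (quality 25, length 20). The function names and the adaptor sequence in the usage lines are placeholders chosen for this example; real tools also tolerate mismatches and use more sophisticated trimming strategies.

```python
def trim_adaptor(seq, qual, adaptor, min_overlap=3):
    """Remove a full or partial adaptor from the 3' end (exact matches only)."""
    # Case 1: the whole adaptor occurs in the read; cut from where it starts.
    pos = seq.find(adaptor)
    if pos != -1:
        return seq[:pos], qual[:pos]
    # Case 2: the read ends with the first k bases of the adaptor (a remnant).
    # Try the longest possible remnant first, down to min_overlap bases.
    for k in range(min(len(adaptor), len(seq)), min_overlap - 1, -1):
        if seq.endswith(adaptor[:k]):
            return seq[:len(seq) - k], qual[:len(qual) - k]
    return seq, qual


def trim_low_quality_ends(seq, qual, min_q=25, offset=33):
    """Strip bases with quality below min_q from both ends of the read."""
    scores = [ord(c) - offset for c in qual]  # assumes Phred+33 encoding
    start, end = 0, len(scores)
    while end > start and scores[end - 1] < min_q:
        end -= 1
    while start < end and scores[start] < min_q:
        start += 1
    return seq[start:end], qual[start:end]


def clean_read(seq, qual, adaptor, min_q=25, min_len=20):
    """Apply all three steps; return None if the read ends up too short."""
    seq, qual = trim_adaptor(seq, qual, adaptor)
    seq, qual = trim_low_quality_ends(seq, qual, min_q)
    if len(seq) < min_len:
        return None
    return seq, qual


# Usage: a 30-base fragment followed by a 6-base adaptor remnant,
# with low quality ('#' = Q2) on the last bases of the read.
example_adaptor = "AGATCGGAAGAGC"  # placeholder; use the adaptor from your data source
example_read = "ACGT" * 7 + "AC" + "AGATCG"
example_qual = "I" * 28 + "#" * 8
print(clean_read(example_read, example_qual, example_adaptor))
```

The order matters in this sketch: the adaptor remnant is removed first, then the low quality ends, and only then is the length cutoff applied, since both trimming steps can shorten a read below the cutoff.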
One tool that deals with all of these issues at once is Trimmomatic [1]. Various alternatives can perform these three clean-up steps either combined or one after the other; these are listed below. For paired-end data, it is very important to trim read1 and read2 together. Downstream applications expect paired information: if one of the two reads is discarded because it is too short, the other read becomes unpaired (orphaned) and cannot be used properly by most applications. Trimmomatic has two modes, one for single-end data (SE) and one for paired-end data (PE). If using paired-end reads, be sure to use the PE mode with both the read1 and read2 FASTQ files from the same run.
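As a rough illustration of the PE mode, the following Python sketch builds a Trimmomatic paired-end command line. The jar location, file names, and output naming scheme are assumptions made for this example, and the `2:30:10` values for `ILLUMINACLIP` are the commonly used example settings rather than a recommendation for any particular data set. The quality and length thresholds reuse the example values from the text. Check the Trimmomatic manual for the options that suit your data before running anything.

```python
def trimmomatic_pe_command(read1, read2, adaptor_fasta, out_prefix,
                           jar="trimmomatic.jar", min_q=25, min_len=20):
    """Build the argument list for a Trimmomatic paired-end (PE) run.

    The jar path and the output file names are hypothetical; adjust
    them to match your installation and naming conventions.
    """
    return [
        "java", "-jar", jar, "PE",
        # Both mates of the same run are given together.
        read1, read2,
        # PE mode writes four files: for each mate, the reads whose partner
        # survived (paired) and the reads whose partner was dropped (unpaired).
        f"{out_prefix}_R1_paired.fastq.gz", f"{out_prefix}_R1_unpaired.fastq.gz",
        f"{out_prefix}_R2_paired.fastq.gz", f"{out_prefix}_R2_unpaired.fastq.gz",
        # Adaptor removal using the adaptor sequences in a FASTA file.
        f"ILLUMINACLIP:{adaptor_fasta}:2:30:10",
        # Trim low quality bases from the start and end of each read.
        f"LEADING:{min_q}", f"TRAILING:{min_q}",
        # Drop reads that are too short after trimming.
        f"MINLEN:{min_len}",
    ]


cmd = trimmomatic_pe_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz", "adaptors.fa", "sample"
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. The two "unpaired" outputs hold the orphaned reads described above, so the "paired" files stay in sync for downstream tools.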
Alternative Tools:
- For adaptor trimming: Trim_Galore [2], BBMap [3], Flexbar [4], or one of the many tools listed here.
- For trimming low quality bases from the ends of reads: Trim_Galore [2], BBMap [3], FASTX-Toolkit (fastq_quality_filter) [5], PrinSeq [6], SolexaQA [7].
- For removing very short reads: PrinSeq [6], Trim_Galore [2].
Bibliography
- [1] Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120.
- [4] Dodt, M., Roehr, J. T., Ahmed, R., & Dieterich, C. (2012). FLEXBAR—flexible barcode and adapter processing for next-generation sequencing platforms.
- [5] Gordon, A., & Hannon, G. (2010). Fastx-toolkit. FASTQ/A short-reads pre-processing tools. Unpublished.
- [6] (PRINSEQ) Schmieder, R., & Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6), 863-864.
- [7] Cox, M. P., Peterson, D. A., & Biggs, P. J. (2010). SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics, 11(1), 485.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.