Step 2.5 Statistical filtering
The VCF files produced by the previous steps often contain many sites that are not true genetic variants but machine artifacts that make the site appear statistically non-reference. In small studies, hard filtering, that is, applying fixed thresholds to variant annotations that describe the genomic context of each call, is typically sufficient.
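As a rough illustration of hard filtering, the sketch below sets the FILTER column of a VCF record from fixed thresholds on INFO annotations. The annotation names (QD, FS, MQ) and cutoffs are assumptions that mirror commonly quoted starting points for SNPs, not values prescribed by this tutorial, and the function and filter names are hypothetical. In practice a dedicated tool would normally do this job.

```python
# Illustrative thresholds only (assumed, not taken from this tutorial).
# Each entry maps an INFO annotation to a test that returns True when
# the value looks like an artifact.
SNP_FILTERS = {
    "QD": lambda v: v < 2.0,   # low variant quality relative to depth
    "FS": lambda v: v > 60.0,  # strong strand bias
    "MQ": lambda v: v < 40.0,  # poor mapping quality
}


def parse_info(info_field):
    """Turn 'QD=12.3;FS=1.2;DB' into {'QD': '12.3', 'FS': '1.2', 'DB': True}."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flags carry no value
    return info


def failed_filters(info):
    """Return the names of all thresholds that the record fails."""
    failed = []
    for name, is_bad in SNP_FILTERS.items():
        raw = info.get(name)
        if raw is None or raw is True:
            continue  # annotation missing: do not filter on it
        try:
            value = float(raw)
        except ValueError:
            continue  # e.g. a comma-separated list; skipped in this sketch
        if is_bad(value):
            failed.append(name + "_fail")  # hypothetical filter name
    return failed


def hard_filter_line(line):
    """Return a VCF line with its FILTER column (7th field) filled in."""
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.split("\t")
    failed = failed_filters(parse_info(fields[7]))
    fields[6] = ";".join(failed) if failed else "PASS"
    return "\t".join(fields)
```

A record such as `chr1<TAB>100<TAB>.<TAB>A<TAB>G<TAB>50<TAB>.<TAB>QD=1.5;FS=70.0;MQ=60.0` would be marked as failing the QD and FS tests, while the record itself is kept so that the filtering decision stays visible downstream. A complete VCF would also need matching `##FILTER` header lines, which this sketch does not write.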
Although defining appropriate filtering thresholds requires expertise, Heng Li provides some general guidelines 1. For experiments with a sufficiently large number of samples (30 or more), the GATK team designed the Variant Quality Score Recalibration (VQSR) protocol, which separates false-positive machine artifacts from true-positive genetic variants using a Gaussian mixture model trained on the annotations of variants in known datasets 2. A full tutorial is posted on the GATK forums:
http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr
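The idea of scoring each call by how closely its annotations resemble those of known variants can be sketched in a few lines. This is a deliberately simplified stand-in and not the GATK implementation: it fits a single Gaussian with diagonal covariance to each class instead of a full mixture model, and all function names and the choice of annotations are hypothetical.

```python
import math


def fit_gaussian(rows):
    """Fit per-annotation mean and variance (diagonal covariance).

    rows is a list of equal-length annotation vectors, e.g. [QD, FS].
    """
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    variances = [
        # Floor the variance so a constant annotation cannot divide by zero.
        max(sum((r[d] - means[d]) ** 2 for r in rows) / n, 1e-6)
        for d in range(dims)
    ]
    return means, variances


def log_density(x, model):
    """Log density of x under an independent Gaussian per annotation."""
    means, variances = model
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        for xi, mu, var in zip(x, means, variances)
    )


def log_odds(x, good_model, bad_model):
    """Positive when x looks more like the trusted set than the artifacts."""
    return log_density(x, good_model) - log_density(x, bad_model)
```

Here `good_model` would be fitted on calls that overlap a trusted variant resource and `bad_model` on calls presumed to be artifacts. Calls are then ranked by `log_odds` and cut at a chosen sensitivity rather than at per-annotation thresholds, which is why this approach needs enough variants to estimate the models reliably.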
Bibliography
1. Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).
2. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.