
Step 2.5 Statistical filtering

The VCF files produced by the previous steps often contain many sites that are not true genetic variants but machine artifacts that make the site appear statistically non-reference. In small studies, hard filtering of variants, that is, applying fixed thresholds to annotations of genomic context, is typically sufficient.
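As a rough illustration of hard filtering, the sketch below keeps only VCF records whose QUAL column and `DP` (depth) and `MQ` (mapping quality) INFO annotations meet fixed cut-offs. The function names and the threshold values are hypothetical and chosen only for the example; they are not recommended settings, and real pipelines would normally use an established tool rather than hand-written parsing.

```python
def parse_info(info_field):
    """Turn 'DP=35;MQ=60.0;DB' into {'DP': '35', 'MQ': '60.0', 'DB': True}."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True  # flags have no '=value' part
    return info


def passes_hard_filter(vcf_line, min_qual=30.0, min_depth=10, min_mq=40.0):
    """Return True if a VCF data line meets all thresholds.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]              # column 6 of a VCF record is QUAL
    info = parse_info(fields[7])  # column 8 of a VCF record is INFO
    if qual == "." or float(qual) < min_qual:
        return False
    # In this sketch a record lacking DP or MQ is rejected outright.
    try:
        depth = int(info["DP"])
        mapping_quality = float(info["MQ"])
    except (KeyError, ValueError):
        return False
    return depth >= min_depth and mapping_quality >= min_mq


def hard_filter(lines, **thresholds):
    """Yield header lines unchanged and only the data lines that pass."""
    for line in lines:
        if line.startswith("#") or passes_hard_filter(line, **thresholds):
            yield line
```

Calling `hard_filter(open_file_handle, min_qual=50.0)` would stream through a VCF and drop the records that fail; which annotations to test and where to set the cut-offs depends on the data set.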

While it requires expertise to define appropriate filtering thresholds, Heng Li provides some general guidelines in this paper [1]. For experiments with a sufficiently large number of samples (30 or more), the GATK team designed the Variant Quality Score Recalibration (VQSR) protocol, which separates false-positive machine artifacts from true-positive genetic variants using a Gaussian mixture model trained on the annotations of variants in known datasets [2]. A full tutorial is posted on the GATK forums:

http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr
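To give a feel for the idea behind model-based filtering, here is a deliberately simplified sketch: it fits one Gaussian to an annotation observed at known variants and another to the same annotation at presumed artifacts, then scores a new site by the log-odds between the two. This is not the GATK implementation. The real protocol works on several annotations at once and uses a mixture of Gaussians, whereas this toy uses a single one-dimensional Gaussian per class. All function names and numbers are hypothetical.

```python
import math
import statistics


def fit_gaussian(values):
    """Estimate (mean, standard deviation) of a 1-D Gaussian from samples."""
    return statistics.mean(values), statistics.pstdev(values)


def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_odds(x, true_model, artifact_model):
    """Positive when x looks more like a known variant than like an artifact."""
    return gaussian_log_pdf(x, *true_model) - gaussian_log_pdf(x, *artifact_model)


# Hypothetical annotation values (for example, mapping quality) for training.
known_variant_values = [60.0, 59.0, 61.0, 60.5, 59.5]
artifact_values = [30.0, 25.0, 35.0, 28.0, 32.0]

true_model = fit_gaussian(known_variant_values)
artifact_model = fit_gaussian(artifact_values)

# Score two new sites; a cut-off on the score then decides which to keep.
for value in (59.0, 31.0):
    print(value, log_odds(value, true_model, artifact_model))
```

The advantage over fixed thresholds is that the cut-off is placed on a single learned score rather than on each annotation separately, which is why the approach needs enough samples to estimate the model reliably.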

Bibliography