Variant calling in human whole genome/exome sequencing data | H3ABioNet Standard Operating Procedures

Edit me

Alphabetized list of recommended tools

Note on Galaxy: If it is desirable to perform all processing in Galaxy ¹, then it is possible to construct a complete workflow by including the needed tools from its toolshed. The majority of the tools below can be found in the toolshed and incorporated readily. A complete coverage is beyond the scope of this SOP, and hence, the interested reader is referred to the galaxy project site for more details.

ANNOVAR

Utilizes update-to-date information to generate gene-based, region-based and filter-based functional annotations of genetic variants detected from diverse genomes.

Usage:

Annotate the variants in ex1.human file and classify them as intergenic, intronic, non-synonymous SNP, frameshift deletion, large-scale duplication, etc. The ex1.human file is in text format, one variant per line. The annotation procedure takes seconds on a typical modern computer.
```
perl annotate_variation.pl -geneanno example/ex1.human humandb/
```
Download cytogenetic band annotation databases from the UCSC Genome Browser and save it to the humandb/ directory as hg18_cytoBand.txt file, then annotate variants in ex1.human file and identify the cytogenetic band for these variants.
```
perl annotate_variation.pl -downdb cytoBand humandb/
perl annotate_variation.pl -regionanno -dbtype cytoBand example/ex1.human humandb/
```
Download 1000 Genome Projects allele frequency annotations, then identify a subset of variants in ex1.human that are not observed in 1000G CEU populations and those that are observed with allele frequencies.
```
perl annotate_variation.pl -downdb 1000g humandb/
perl annotate_variation.pl -filter -dbtype 1000g_ceu example/ex1.human humandb/
```

Note: By default, all the above commands work on variants files in hg18 (human genome NCBI build 36) coordinate. If your file is in hg19 coordinate, add “-buildver hg19” in every command.

Bed tools

A flexible suite of utilities for comparing genomic features.

Filtering SNPs

To remove known SNPs, use the intersectBed command from the bedtools suite. Known SNPs can be downloaded from Ensembl or NCBI/UCSC (eg. dbSNP)

intersectBed -a accepted_hits.raw.filtered.vcf \
             -b Mus_musculus.NCBIM37.60.bed \
             -wo \
             > filteredSNPs.vcf

Bioconductor

Provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development.

Sequence trimming

Use trimLRPatterns for adaptor trimming. From this manual:

# Create sample sequences.
myseq <-DNAStringSet(c("CCCATGCAGACATAGTG", "CCCATGAACATAGATCC", "CCCGTACAGATCACGTG"));
names(myseq) <- letters[1:3];

# Remove the specified R/L-patterns. The number of maximum mismatches can be specified
# for each pattern individually. Indel matches can be specified with the arguments:
# with.Lindels and with.Rindels.
trimLRPatterns(Lpattern ="CCC", Rpattern="AGTG", subject=myseq, max.Lmismatch = 0.33,
               max.Rmismatch = 1)

# To remove partial adaptors, the number of mismatches for all possible partial matches
# can be specified by providing a numeric vector of length nchar(mypattern).
# The numbers specifiy the number of mismatches for each partial match.
# Negative numbers are used to prevent trimming of a minimum fragment length,
# e.g. most terminal nucleotides.
trimLRPatterns(Lpattern = "CCC", Rpattern="AGTG", subject=myseq,
               max.Lmismatch=c(0,0,0), max.Rmismatch=c(1,0,0))

# With the setting 'ranges=TRUE' one can retrieve the corresponding
# trimming coordinates.
trimLRPatterns(Lpattern = "CCC", Rpattern="AGTG", subject=myseq,
               max.Lmismatch=c(0,0,0), max.Rmismatch=c(1,0,0), ranges=T)

Base quality score recalibration:

Use the ReQON package to recalibrate quality of nucleotides for aligned sequencing data in BAM format. See the ReQON tutorial for detailed examples of usage.

blat (also here)

Blat is an alignment tool like BLAST, but it is structured differently. On DNA, Blat works by keeping an index of an entire genome in memory. From a practical standpoint, Blat has several advantages over BLAST:

speed (no queues, response in seconds) at the price of lesser homology depth
the ability to submit a long list of simultaneous queries in fasta format
five convenient output sort options
a direct link into the UCSC browser
alignment block details in natural genomic order
an option to launch the alignment later as part of a custom track

Usage:

blat <database> <query> [-ooc=11.ooc] output.psl

Database and query are each either a .fa , .nib or .2bit file, or a list these files.

The option -ooc=11.ooc tells the program to load over-occurring 11-mers from an external file. This will increase the speed by a factor of 40 in many cases, but is not required.

Bowtie2

An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters to relatively long (e.g. mammalian) genomes.

Usage:

bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} -S [<sam>]

-x <bt2-idx>  --  The basename of the index for the reference genome. The basename is the name of any of the index files up to but not including the final .1.bt2 / .rev.1.bt2 / etc. bowtie2 looks for the specified index first in the current directory, then in the directory specified in the BOWTIE2_INDEXES environment variable.

-1 <m1>  --  Comma-separated list of files containing mate 1s (filename usually includes _1), e.g. -1 flyA_1.fq,flyB_1.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <m2>. Reads may be a mix of different lengths. If - is specified, bowtie2 will read the mate 1s from the "standard in" or "stdin" filehandle.

-2 <m2>  --  Comma-separated list of files containing mate 2s (filename usually includes _2), e.g. -2 flyA_2.fq,flyB_2.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <m1>. Reads may be a mix of different lengths. If - is specified, bowtie2 will read the mate 2s from the "standard in" or "stdin" filehandle.

-U <r>   --  Comma-separated list of files containing unpaired reads to be aligned, e.g. lane1.fq,lane2.fq,lane3.fq,lane4.fq. Reads may be a mix of different lengths. If - is specified, bowtie2 gets the reads from the "standard in" or "stdin" filehandle.

-S <sam> --  File to write SAM alignments to. By default, alignments are written to the "standard out" or "stdout" filehandle (i.e. the console).

BWA

Burrows-Wheeler Alignment Tool. A software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.

Usage:

# Index database sequences:
bwa index ref.fa

# Align 70bp-1Mbp query sequences with the BWA-MEM algorithm:
bwa mem ref.fa reads.fq > aln-se.sam #Single end reads
bwa mem ref.fa read1.fq read2.fq > aln-se.sam #For functional equivalence 9, add the options: -K 100000000 -Y

# Find the SA coordinates of the input reads.
bwa aln ref.fa short_read.fq > aln_sa.sai

# Generate alignments in the SAM format given single-end reads shorter than ~ 70bp

bwa aln ref.fa short_reads.fq > reads.sai
bwa samse ref.fa reads.sai short_reads.fq > aln-se.sam

# Generate alignments in the SAM format given paired-end reads shorter than ~70bp.
bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai
bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam

FASTX-Toolkit

A collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.

fastq_quality_filter:

Filters sequences based on quality

fastq_quality_filter [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]

[-q N]  =  Minimum quality score to keep.
[-p N]  =  Minimum percent of bases that must have [-q] quality.
[-z]    =  Compress output with GZIP.
[-i INFILE]  =  FASTA/Q input file. default is STDIN.
[-o OUTFILE]  =  FASTA/Q output file. default is STDOUT.

fastx-collapser:

Collapses identical sequences in a FASTQ/A file into a single sequence)

fastx_collapser [-i INFILE] [-o OUTFILE]

Flexbar

Flexbar — flexible barcode and adapter removal. Flexbar is a software to preprocess high-throughput sequencing data efficiently. It demultiplexes barcoded runs and removes adapter sequences. Trimming and filtering features are provided. Flexbar increases mapping rates and improves genome and transcriptome assemblies. It supports next-generation sequencing data from Illumina, Roche 454, and the SOLiD platform. Recognition is based on exact overlap sequence alignment.

Usage:

flexbar -t target -r reads [-b barcodes] [-a adapters] [options]

Mandatory parameters:
input file with reads
input format of reads
target name for output prefix

GATK

Genome Analysis Toolkit - a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

There is a slight change when calling tools from either GATK version:

GATK3 invocation:
java -jar <jvm args like -Xmx4G go here> GenomeAnalysisTK.jar -T <ToolName>

GATK4 invocation:
gatk [--java-options <jvm args like -Xmx4G go here>] <ToolName> [GATK args go here]

BQSR

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc. Invocation based on recommended functional equivalent parameters ² is below:

java -Xmx4g -jar GenomeAnalysisTK.jar \
               -T BaseRecalibrator
               -R ${ref_fasta} \
               -I ${input_bam} \
               -O ${recalibration_report_filename} \
               -knownSites "Homo_sapiens_assembly38.dbsnp138.vcf" \
               -knownSites "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz" \
               -knownSites "Homo_sapiens_assembly38.known_indels.vcf.gz"

To create a recalibrated BAM you can use GATK’s PrintReads (GATK3), or ApplyBQSR (GATK4) with the engine on-the-fly recalibration capability. Here is the recommended command line to do so while complying with the functional equivalence specification ²:

java -jar GenomeAnalysisTK.jar \
               -T PrintReads \
               -R reference.fasta \
               -I input.bam \    
               -BQSR recalibration_report.grp \    
               -o output.bam
               -SQQ 10 -SQQ 20 -SQQ 30 \
               --disable_indel_quals

IndelRealigner

Performs local realignment of reads to correct misalignments due to the presence of indels. Realginemet is no longer needed with a haplotype-aware variant caller like the HaplotypeCaller ³ , but this is provided for completion

java -Xmx4g -jar GenomeAnalysisTK.jar \
               -T IndelRealigner \
               -R ref.fasta \
               -I input.bam \
               -targetIntervals intervalListFromRTC.intervals \
               -o realignedBam.bam \
              [-known /path/to/indels.vcf]

HaplotypeCaller

Call SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region.

java -jar GenomeAnalysisTK.jar
 	        -T HaplotypeCaller \
               -R reference/human_g1k_v37.fasta \
               -I sample1.bam [-I sample2.bam ...] \
               --dbsnp dbSNP.vcf \
               -stand_call_conf [50.0] \
               -stand_emit_conf 10.0 \
              [-L targets.interval_list] \
               -o output.raw.snps.indels.vcf

MuTect2

Call SNPs and indels simultaneously via local de-novo assembly of haplotypes, combining the MuTect genotyping engine and the assembly-based machinery of HaplotypeCaller. For cancer variant discovery. Tumor/Normal or Normal-only calls.

java -jar GenomeAnalysisTK.jar
               -T MuTect2 \
               -R reference.fa \
               -I:tumor tumor.bam \
               -I:normal normal.bam \
               [--dbsnp sbSNP.vcf] \
               [--cosmic COSMIC.vcf] \
               [-L targets.interval_list] \
               -o output.vcf

UnifiedGenotyper

A variant caller which unifies the approaches of several disparate callers – Works for single-sample and multi-sample data. In most applications, the recommendation is to use the newer HaplotypeCaller

java -jar GenomeAnalysisTK.jar
               -R resources/Homo_sapiens_assembly18.fasta \
               -T UnifiedGenotyper \
               -I sample1.bam [-I sample2.bam ...] \
               --dbsnp dbSNP.vcf \
               -o snps.raw.vcf \
               -stand_call_conf [50.0] \
               -stand_emit_conf 10.0 \
               -dcov [50 for 4x, 200 for >30x WGS or Whole exome] \
              [-L targets.interval_list]

GATK Variant Quality Score Recalibrator

Creates a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants.

 java -Xmx4g -jar GenomeAnalysisTK.jar
   -T VariantRecalibrator \
   -R reference/human_g1k_v37.fasta \
   -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 \
                                                     hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=false,prior=12.0 \
                                                     1000G_omni2.5.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
   -an QD -an HaplotypeScore -an MQRankSum \
   -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
   -mode SNP \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -rscriptFile path/to/output.plots.R

HuVariome

The HuVariome project aims to determine rare and common genetic variation in a Northern European population (Benelux) based on whole genome sequencing results. Variations, their population frequencies and the functional impact are stored in the HuVariome Database. Users who provide samples for inclusion within the database are able to access the full content of the database, whilst guests can access the public set of genomes published by Complete Genomics.

Has a straightforward web interface.

KGGSeq

A biological Knowledge-based mining platform for Genomic and Genetic studies using Sequence data.

Usage:

To prioritize variants based on the hg18 assembly for a rare Mendelian disease, need input files:

a VCF file (rare.disease.hg18.vcf); and
a linkage pedigree file (rare.disease.ped.txt).

java  -jar   -Xms256m   -Xmx1300m  kggseq.jar   examples/param.rare.disease.hg18.txt

NEAT-genReads

NEAT-genReads is a fine-grained read simulator published in PLoS One ⁴. GenReads simulates real-looking data using models learned from specific datasets. Simulated reads can be whole genome, whole exome, specific targeted regions, or tumor/normal, with optional vcf and bam file outputs. Additionally, there are several supporting utilities for generating models used for simulation. Requires Python 2.7 and Numpy 1.9.1+.

Usage:

Simulate whole genome dataset with random variants inserted according to the default model. Generates paired-end reads, bam, and vcf output.

python genReads.py                  \
        -r hg19.fa                  \
        -R 101                      \
        -o /home/me/simulated_reads \
        --bam                       \
        --vcf                       \
        --pe 300 30

Simulate a targeted region of a genome, e.g. exome, with -t. Generates paired-end reads, bam, and vcf output.

python genReads.py                  \
        -r hg19.fa                  \
        -R 101                      \
        -o /home/me/simulated_reads \
        --bam                       \
        --vcf                       \
        --pe 300 30                 \
        -t hg19_exome.bed

Picard MarkDuplicates

Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules. All records are then written to the output file with the duplicate records flagged.

Usage:

Starting with GATK4.0, all Picard tools are directly available from the GATK suit itself. This means that for older GATK releases, one would call a tool as below:

java -jar picard.jar MarkDuplicates [options] INPUT=File OUTPUT=File

Whereas for GATK4 releases, the same tool would be called as below:

gatk MarkDuplicates [options] --INPUT File --OUTPUT File

For a comprehensive list of options, see the utility’s documentation page listed above.

Multiple databases can be used to annotate variants

A database that annotates SNPs with known and predicted regulatory elements in the intergenic regions of the H. sapiens genome. Known and predicted regulatory DNA elements include regions of DNAase hypersensitivity, binding sites of transcription factors, and promoter regions that have been biochemically characterized to regulation transcription. Source of these data include public datasets from GEO, Chromatin States from the Roadmap Epigenome Consortium, the ENCODE project, and published literature.

Has a straightforward web interface.

SAMtools and htslib

SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

SNP calling with mpileup:

# Use Q to set base quality threshold:
samtools mpileup -Q 25 -ugf <reference.fasta> <file.bam> | \
         bcftools view -bvcg - > accepted_hits.raw.bcf

# Convert bcf to vcf (vairant call format)
# and filter using varFilter (from vcfutils), if needed;
# use Q to set mapping quality; use d for minimum read depth:
bcftools view accepted_hits.raw.bcf | vcfutils.pl varFilter -d 5 -Q 20 > \
         accepted_hits.raw.vcf

SolexaQA++

SolexaQA is a software package to calculate sequence quality statistics and create visual representations of data quality for Illumina’s second-generation sequencing technology (historically known as “Solexa”).

Usage:

Running directly on Illumina FASTQ files:

./SolexaQA++ analysis FASTQ_input_files \
     [-p|probcutoff 0.05] \
     [-h|phredcutoff] \
     [-v|variance] \
     [-m|minmax] \
     [-s|sample 10000] \
     [-b|bwa] \
     [-d|directory path] \
     [-sanger -illumina -solexa]

snpEff

Genetic variant annotation and effect prediction toolbox. Version 2.0.5 is is included in the GATK suit, but not newer versions. For human data, the recommendation is to use the GRCh37.64 database

Usage:

# If you don't already have the database installed:
java -jar snpEff.jar download -v GRCh37.66

# Annotate the file:
# Use appropriate i) organism  and ii) annotation (eg. UCSC, RefSeq, Ensembl)
# E.g. mouse: mm37 (UCSC/RefSeq), mm37.61 (Ensembl)
# Human: hg37 (UCSC/RefSeq), hg37.61 (Ensembl)
# Output is created in several files:
# an html summary file and text files with detailed information.
java -jar snpEff.jar -vcf4 GRCh37.75  accepted_hits.raw.filtered.vcf >accepted_hits.raw.filtered.snpEff

SnpSift

Helps filter and manipulate annotated files. Included in GATK.

Once your genomic variants have been annotated, you need to filter them out in order to find the “interesting / relevant variants”. Given the large data files, this is not a trivial task (e.g. you cannot load all the variants into XLS spreasheet). SnpSift helps to perform this VCF file manipulation and filtering required at this stage in data processing pipelines.

Usage

To filter out samples with quality less than 30:

cat variants.vcf   |   java -jar SnpSift.jar filter " ( QUAL >= 30 )"   >  filtered.vcf

To do the same but keep InDels that have quality 20 or more:

cat variants.vcf   |   java -jar SnpSift.jar filter \
                       "(( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )" \
                       >   filtered.vcf

VAAST

The Variant Annotation, Analysis and Search Tool (VAAST) is a probabilistic search tool for identifying damaged genes and their disease-causing variants in personal genome sequences. VAAST builds upon existing amino acid substitution (AAS) and aggregative approaches to variant prioritization, combining elements of both into a single unified likelihood-framework that allows users to identify damaged genes and deleterious variants with greater accuracy, and in an easy-to-use fashion. VAAST can score both coding and non-coding variants, evaluating the cumulative impact of both types of variants simultaneously. VAAST can identify rare variants causing rare genetic diseases, and it can also use both rare and common variants to identify genes responsible for common diseases. VAAST thus has a much greater scope of use than any other existing methodology.

It has outstanding quickstart quide

Usage:

The set of genomes being analyzed for disease causing features (cases) are referred to as the target genomes. The set of healthy genomes (controls) that the target genomes are being compared to are referred to as the background genomes. The basic inputs to VAAST consist of:

A set of target (case) variant files in either VCF or GVF format.
A set of background (control) variant files in either VCF or GVF format.
A set features to be scored, usually genes; this file should be in gff3 format.

A multi-fasta file of the reference genome

 ../bin/VAT -f data/easy-hg18-chr16.gff3 \
        -a data/hg18-chr16.fasta data/miller-1.gvf > data/miller-1.vat.gvf

VCF can be easily converted to GVF using the vaast_converter script, included with the distribution: VAAST/bin/vaast_tools/vaast_converter.

Bibliography

Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018). ↩
Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. BioRxiv (2018). doi:10.1101/269316 ↩ ↩²
Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv (2017). ↩
Stephens, Z. D. et al. Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models. PLoS ONE 11, e0167047 (2016). ↩

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Tags: