Genomic analysis pages | H3ABioNet Standard Operating Procedures

The following pages and posts are tagged with

Title	Type	Excerpt
16s data processing and microbiome analysis	Page	Introduction The genes encoding the RNA component of the small subunit of ribosomes, commonly known as the 16S rRNA in bacteria and archaea, are among the most conserved across all kingdoms of life. Nevertheless, they contain regions that are less evolutionarily constrained and whose sequences are...
16s data processing and microbiome analysis	Page	Glossary of terms and jargon 16S rRNA gene The gene that is responsible for the coding of the 16S ribosomal RNA. The gene is used in constructing phylogenies....
16s data processing and microbiome analysis	Page	Schematic workflow of the analysis Figure 1. Steps in 16s Analysis Workflow Bibliography
16s data processing and microbiome analysis	Page	Phase 1: Preprocessing of reads It is essential to preprocess raw reads before subjecting them to downstream analysis. The preprocessing includes the removal of low quality bases, ambiguous bases and adapter sequences, the stitching together of paired reads, and the detection of chimeric reads. Sequencing errors, reads with ambiguous...
16s data processing and microbiome analysis	Page	QC plots and stats The first step in the data preprocessing is to check the quality of bases in all the reads. Once we understand the quality spectrum of the reads, we can decide on the parameters for trimming low quality bases. If the raw reads are not demultiplexed,...
16s data processing and microbiome analysis	Page	Trim and Filter reads Data received from sequencing facilities might still contain sequencing artefacts that need to be removed, or reads that need to be filtered out. For example, at the 3’ end of reads there are often adapter sequences left from library preparation. These adapter bases need to...
16s data processing and microbiome analysis	Page	Paired read stitching When the combined length of reads sequenced from both ends of DNA fragments is longer than the size of the fragment, there is an overlap between the paired reads. The read pairs can be stitched together based on the overlap information, thus generating a single sequence....
16s data processing and microbiome analysis	Page	Chimera detection Chimeras are artefacts of PCR. These are formed during PCR cycles...
16s data processing and microbiome analysis	Page	Phase 2: OTU picking, classification and phylogenetic tree generation During this phase reads are processed so that comparisons between samples can be made. The first...
16s data processing and microbiome analysis	Page	OTU picking...
16s data processing and microbiome analysis	Page	ASV prediction Exact ASV prediction has proven to be a good alternative to OTU picking (Callahan, B. J. et al...
16s data processing and microbiome analysis	Page	Classification Here a taxonomic identity is assigned to each representative sequence. The taxonomies are pulled from a reference set. There are three main reference databases with aligned, validated and annotated...
16s data processing and microbiome analysis	Page	Alignment To understand the evolutionary relationships between the sequences in the sample and to perform a diversity analysis, it is necessary to generate a phylogenetic tree of the OTUs. The first step in generating the tree is to generate a multiple alignment of the representative...
16s data processing and microbiome analysis	Page	Create phylogenetic tree The phylogenetic tree represents the relationship between the sequences in terms of the evolutionary distance from a common ancestor. In downstream analysis this tree is used for example in calculating the UniFrac distances. Software: FastTree An alternate option to most of the steps mentioned in...
16s data processing and microbiome analysis	Page	Phase 3: Measure diversity and other statistical analysis OTU information (number of OTUs, abundance of OTUs) and the phylogenetic tree generated from phase 2...
16s data processing and microbiome analysis	Page	Determine alpha diversity Alpha diversity is a measure of diversity within a sample. It...
16s data processing and microbiome analysis	Page	Determine beta diversity Beta diversity is a measure of diversity between samples. One of the most commonly used metrics is the...
16s data processing and microbiome analysis	Page	Other statistical analysis Additional statistical tests between samples or groups of samples can be done in QIIME. For alpha diversity a parametric or non-parametric t-test can be performed on a rarefied number of sequences. For beta diversity the Mantel, partial Mantel and Mantel correlogram matrix correlation can be used...
16s data processing and microbiome analysis	Page	Additional notes In the SOP we refer both to QIIME and QIIME2. QIIME2 is more of a platform / command line interface than the original QIIME that contained a set of Python wrapper scripts. The QIIME developers suggest migrating to QIIME2. vsearch is an open source alternative to usearch...
16s data processing and microbiome analysis	Page	Tools referred to in SOP FASTQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc MultiQC - https://multiqc.info/ PRINSEQ - http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi SolexaQA - http://www.biomedcentral.com/1471-2105/11/485 PEAR - http://bioinformatics.oxfordjournals.org/content/early/2013/10/18/bioinformatics.btt593.full.pdf PANDASeq - http://www.biomedcentral.com/1471-2105/13/31 FLASH - http://bioinformatics.oxfordjournals.org/content/early/2011/09/07/bioinformatics.btr507.full.pdf UCHIME - http://drive5.com/usearch/manual/uchime_algo.html ChimeraSlayer - http://nebc.nox.ac.uk/bioinformatics/docs/chimeraslayer.html Perseus - http://www.biomedcentral.com/1471-2105/12/38/ UPARSE - ...
16s data processing and microbiome analysis	Page	H3ABioNet Assessment exercises Bibliography
16s data processing and microbiome analysis	Page	Practice dataset The input datasets and metadata can be accessed here. Bibliography
16s data processing and microbiome analysis	Page	Input data assessment Questions Were the number, length and quality of the reads obtained in line with what would be expected for the sequencing platform used? Was the input dataset of sufficiently good quality to perform the analysis? How did the reads’ quality and GC content affect the...
16s data processing and microbiome analysis	Page	Operational assessment questions At each step of the workflow, describe which software was used and why: Was the choice affected by the nature and/or quality of the reads? Was the choice made due to the time and cost of the analysis? What are the accuracy and performance...
16s data processing and microbiome analysis	Page	Runtime analysis This is useful information for making predictions for clients and collaborators. How much time and disk space did each step of the workflow take? How did the underlying hardware perform? Was it possible to do other things, or run other analyses on the same computer...
16s data processing and microbiome analysis	Page	Analysis of the results What percentage of the reads were removed during the quality trimming step? Did all samples have a similar number of reads after the read preprocessing steps? What was the median, maximum and minimum read count per sample? How many reads were discarded due to...
GWAS data processing	Page	This SOP is intended to lay out best practices for GWAS data processing, especially for groups undertaking H3ABioNet accreditation exercises.
RNA-Seq data processing and gene expression analysis	Page	Introduction This document outlines the essential steps in the process of analyzing gene expression data using RNA sequencing (mRNA, specifically), and recommends commonly used tools and techniques for this purpose. It is assumed in this document that the experimental design is simple and that differential expression is being assessed...
RNA-Seq data processing and gene expression analysis	Page	Glossary of associated terms and jargon FASTQ format & quality scores FASTQ format is the standard format of raw sequence data. Quality scores assigned in the FASTQ...
RNA-Seq data processing and gene expression analysis	Page	Procedural steps This protocol paper 1 was a very good resource for understanding the procedural steps involved in any RNA-Seq analysis. The datasets they use in that paper are freely available, but the source of RNA was the fruitfly Drosophila melanogaster, and not...
RNA-Seq data processing and gene expression analysis	Page	Phase 1: Preprocessing of the raw reads The following steps prepare reads for analysis and should be always performed prior to alignment. Bibliography
RNA-Seq data processing and gene expression analysis	Page	Step 1.1: Quality check The overall quality of the sequence information received from the sequencing center will determine how the quality trimming should be set up in Step 1.2. Tools like FastQC1 will enable the collection of this information. Sequencing facilities usually produce...
RNA-Seq data processing and gene expression analysis	Page	Step 1.2: Adaptor and Quality trimming + Removal of very short reads In this step we deal with 3 major preprocessing steps that clean up the data...
RNA-Seq data processing and gene expression analysis	Page	Step 1.3: Quality recheck Once the trimming step is complete, it is always good practice to make sure that your dataset looks better by rerunning FastQC on the trimmed data. The metrics to compare between trimmed and raw fastq data, in the context of the tool FastQC are listed...
RNA-Seq data processing and gene expression analysis	Page	Phase 2: Determining how many read counts are associated with known genes Bibliography
RNA-Seq data processing and gene expression analysis	Page	Step 2.1: Generation of gene/transcript-level counts Protocol 1: Alignment-based approach Once the data are cleaned up, the next step is alignment to the reference genome. There are various tools available for this step, but it is important that the alignment tool chosen here is a “splice-aware” tool. That...
RNA-Seq data processing and gene expression analysis	Page	Step 2.1: Generation of gene/transcript-level counts Protocol 2: Estimated counts using graph-based or similar approaches A more recent alternative to alignment-based methods both dramatically increases the speed of the analysis and resolves some of the issues that make analysis of transcripts or gene families problematic using alignment-based approaches. These...
RNA-Seq data processing and gene expression analysis	Page	Step 2.2: Count generation Protocol 1 Once it is determined that the alignment step was successful, the next step is to enumerate the number of reads that are associated with the genes. There are multiple tools to perform this step (e.g. HTSeq’s htseq-count1...
RNA-Seq data processing and gene expression analysis	Page	Step 2.2: Count generation Protocol 2 Counts are already generated, so this step can be skipped. Bibliography
RNA-Seq data processing and gene expression analysis	Page	Step 2.3: Collecting and tabulating alignment stats Protocol 1 For a given RNA-Seq run it is valuable to collect several stats related to the alignment and counting steps; this is an important step in the evaluation process. There are many tools that gather this information from the...
RNA-Seq data processing and gene expression analysis	Page	Step 2.3: Collecting and tabulating alignment stats Protocol 2 Apart from FASTQC, other standard QC metrics that rely on an alignment are not available, such as Picard’s tools, or a more complete assessment of read fates. Salmon and kallisto both provide output that give basic overall...
RNA-Seq data processing and gene expression analysis	Page	Phase 3: Statistical analysis Bibliography
RNA-Seq data processing and gene expression analysis	Page	Step 3.1: QC and outlier/batch detection Alignment-based counts can normally be imported directly into R/Bioconductor1 , but estimated counts generated using Protocol 2 above will require importing using the tximport2 Bioconductor library. This can also be...
RNA-Seq data processing and gene expression analysis	Page	Step 3.2: Removal of low count genes and normalization The QC investigations in Step 3.1 should be done on the output from htseq-count, which typically is the entire transcriptome (all genes). However, for most eukaryotic species, only ~40-60% of genes are expressed in any given cell, tissue...
RNA-Seq data processing and gene expression analysis	Page	Step 3.3: Statistics for differential expression The main goal of most RNA-Seq is detection of differential expression between two or more groups. This is done for thousands of genes with often only a few replicates per group, so a statistical method must be used. This field is still under...
RNA-Seq data processing and gene expression analysis	Page	Working with Galaxy If it is desirable to perform all processing in Galaxy1 , it should not be a problem for smaller experiments with 1:1 comparisons between samples. For experiments with a large number of samples, and also for complex comparisons (e.g. 2x2...
RNA-Seq data processing and gene expression analysis	Page	H3ABioNet Accreditation exercise Bibliography
RNA-Seq data processing and gene expression analysis	Page	Practice dataset Some practice datasets for RNA-Seq analysis can be found here. Bibliography
RNA-Seq data processing and gene expression analysis	Page	Reproducible Research In any experiment where computation plays a critical role in generating the results and conclusions, researchers should ensure that the presentation of their work includes reproducibility, meaning “the ability to recompute data analytic results given an observed dataset and knowledge of the data analysis pipeline.”...
RNA-Seq data processing and gene expression analysis	Page	General questions These questions, originally derived for the 16S analysis, relate to other sequencing workflows as well. Questions related to the nature of the input sequence data Were the number, length and quality of the reads obtained in line with what would be expected for the sequencing platform...
RNA-Seq data processing and gene expression analysis	Page	Analysis-specific questions The following questions relate specifically to the phases denoted above. PHASE 1 - Preprocessing of the raw reads What percentage of the reads were removed during the quality trimming step? Did all samples have a similar number of reads after the read preprocessing steps? What...
Variant calling in human whole genome/exome sequencing data	Page	Introduction This document briefly outlines the essential steps in the process of making genetic variant calls, and recommends tools that have gained community acceptance for this purpose. It is assumed that the purpose of the study is to detect short germline or somatic variants in a single...
Variant calling in human whole genome/exome sequencing data	Page	Glossary of associated terms and jargon The table below (partly borrowed from the GATK dictionary 1 ) provides definitions of the basic terms used. Adapters Short nucleotide sequences added on...
Variant calling in human whole genome/exome sequencing data	Page	Procedural steps The publication by 1 provides a good discussion of the common tools and approaches for variant calling. Also see the older 2. The figure below depicts the essential steps of the pipeline, which are detailed in the...
Variant calling in human whole genome/exome sequencing data	Page	Important note The Genome Analysis Toolkit (GATK) distributed by the Broad Institute of Harvard and MIT (see http://www.broadinstitute.org/gatk/) is a commonly used framework and toolbox for many of the tasks described below. In its latest...
Variant calling in human whole genome/exome sequencing data	Page	Phase 1: Preprocessing of the raw reads The following steps prepare reads for analysis and must be performed in sequence. Bibliography
Variant calling in human whole genome/exome sequencing data	Page	Step 1.1: Adapter trimming Sequencing facilities usually produce read files in fastq format 1, which contain a base sequence and a quality score for each base in...
Variant calling in human whole genome/exome sequencing data	Page	Step 1.2: Quality trimming Once the adapters have been trimmed, it is useful to inspect the quality of reads in bulk, and try to trim low quality nucleotides 1. Also, frequently the quality tends to drop off toward one end of the read. FASTQC...
Variant calling in human whole genome/exome sequencing data	Page	Step 1.3: Removal of very short reads Once the adapter remnants and low quality ends have been trimmed, some reads may end up being very short (i.e. <20 bases). These short reads are likely to align to multiple (wrong) locations on the reference, introducing noise into the variation calls....
Variant calling in human whole genome/exome sequencing data	Page	Phase 2: Initial variant discovery Analysis proceeds as a series of the following sequential steps. Bibliography
Variant calling in human whole genome/exome sequencing data	Page	Step 2.1a Alignment Reads need to be aligned to the reference genome in order to identify the similar and polymorphic regions in the...
Variant calling in human whole genome/exome sequencing data	Page	Step 2.2 Artifact removal: local realignment around indels Some artifacts may arise due to the alignment stage, especially around indels where reads covering the start or the end of an indel are often incorrectly mapped. This results in mismatches between the reference and reads near the misalignment region, which...
Variant calling in human whole genome/exome sequencing data	Page	Step 2.3: Base quality score recalibration Base quality scores, which refer to the per-base error estimates assigned by the sequencing machine to each called base, can often be inaccurate or biased. The recalibration stage aims to correct for these errors via an empirical error model built based on the...
Variant calling in human whole genome/exome sequencing data	Page	Step 2.4 Calling the variants There is no single “best” approach to capture all the genetic variations. For germline variants, 1 suggest using a consensus of results from three tools: CRISP 2, HaplotypeCaller...
Variant calling in human whole genome/exome sequencing data	Page	Step 2.5 Statistical filtering The VCF files resulting from the previous steps frequently have many sites that are not really genetic variants, but rather machine artifacts that make the site statistically non-reference. In small studies, hard filtering of variants based on annotations of genomic context is...
Variant calling in human whole genome/exome sequencing data	Page	Phase 3: Variant annotation and prioritization This phase serves to select those variants that are of particular interest, depending on the research problem at hand. The methods are specific to the problem, thus we do not elaborate on them, and only provide a list of some commonly used tools...
Variant calling in human whole genome/exome sequencing data	Page	H3ABioNet Next Gen Training dataset Practice data set for Variant Calling is available here. These are synthetic data, generated using the NEAT simulator1 to produce synthetic reads with “golden” variants inserted into the reference genome before...
Variant calling in human whole genome/exome sequencing data	Page	H3ABioNet Next Gen Accreditation Questions The following are questions to keep in mind when running the NextGen Workflow during the H3ABioNet accreditation exercise. Use them to plan your work in a way that would allow...
Variant calling in human whole genome/exome sequencing data	Page	Appendices Bibliography
Variant calling in human whole genome/exome sequencing data	Page	Useful Resources Examples of complete implementations of a variant calling pipeline: The Broad institute WDL reference implementations of the GATK best practices- here The H3ABioNet CWL...
Variant calling in human whole genome/exome sequencing data	Page	Alphabetized list of recommended tools Note on Galaxy: If it is desirable to perform all processing in Galaxy 1, then it is possible to construct a complete workflow by including the needed tools from its toolshed. The majority...
Variant calling in human whole genome/exome sequencing data	Page	This document briefly outlines the essential steps in calling short germline variants, and recommends tools that have gained community acceptance for this pu...
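
The "Step 1.3: Removal of very short reads" excerpt listed above describes discarding reads that end up shorter than about 20 bases after trimming. As a rough illustration only, and not the tooling the SOP itself recommends, the idea can be sketched in plain Python. The function names (`parse_fastq`, `drop_short_reads`), the in-memory example reads and the assumption of strict four-line FASTQ records are all hypothetical choices made for this sketch.

```python
from typing import Iterable, Iterator, List, Tuple

# (header, sequence, quality) for one read
FastqRecord = Tuple[str, str, str]

# Threshold taken from the Step 1.3 excerpt ("<20 bases"); adjust as needed.
MIN_LENGTH = 20


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records, assuming strict four-line FASTQ entries (an assumption)."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, _plus, qual = stripped[i:i + 4]
        yield header, seq, qual


def drop_short_reads(records: Iterable[FastqRecord],
                     min_length: int = MIN_LENGTH) -> List[FastqRecord]:
    """Keep only reads whose sequence has at least min_length bases."""
    return [rec for rec in records if len(rec[1]) >= min_length]


# Two made-up reads: one of 24 bases (kept) and one of 8 bases (dropped).
example = """@read1
ACGTACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGT
+
IIIIIIII
""".splitlines()

kept = drop_short_reads(parse_fastq(example))
print([header for header, _, _ in kept])
```

In practice a dedicated trimming tool would normally handle this together with adapter and quality trimming; the sketch only makes the length criterion concrete.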

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.