RNA-Seq data processing and gene expression analysis | H3ABioNet Standard Operating Procedures

Analysis-specific questions

The following questions relate specifically to the phases denoted above.

PHASE 1 - Preprocessing of the raw reads

What percentage of the reads were removed during the quality trimming step?
Did all samples have a similar number of reads after the preprocessing steps?
What tools were used for assessing quality of the reads?
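
As a minimal sketch of how the first two questions could be answered, the snippet below computes the percentage of reads removed per sample from read counts taken before and after trimming. The sample names and counts are hypothetical placeholders, not real data; in practice the counts would come from the trimming tool's log or a QC report.

```python
def percent_removed(raw_reads: int, kept_reads: int) -> float:
    """Percentage of reads discarded between the raw and trimmed read sets."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    if not 0 <= kept_reads <= raw_reads:
        raise ValueError("kept_reads must be between 0 and raw_reads")
    return 100.0 * (raw_reads - kept_reads) / raw_reads


# Hypothetical (raw, after-trimming) read counts per sample.
samples = {
    "sample_A": (1_000_000, 950_000),
    "sample_B": (1_200_000, 1_080_000),
}

removed = {name: percent_removed(raw, kept) for name, (raw, kept) in samples.items()}
for name, pct in removed.items():
    print(f"{name}: {pct:.1f}% of reads removed")
```

Comparing these percentages across samples is one quick way to spot a sample that lost an unusually large share of its reads.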

PHASE 2 - Determining how many read counts are associated with known genes

Alignment and generation of gene/transcript counts

What percentage of the reads aligned to the genome sequence? If the data are paired-end, how many of the reads are aligning in a concordant manner?
How many reads mapped within exons? In introns? Intergenic regions? Are these consistent across samples? What do the relative proportions of reads in each region tell you about your samples?
What tools/methods were used to determine this?
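
The relative proportions of reads per genomic region can be summarised with a few lines of code. This is only an illustrative sketch: the region labels and counts are hypothetical, and it assumes the per-region read counts have already been obtained from whichever alignment QC tool was used.

```python
from typing import Dict


def region_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-region read counts into fractions of all assigned reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads assigned to any region")
    return {region: n / total for region, n in counts.items()}


# Hypothetical read counts for one sample.
sample_counts = {"exonic": 800, "intronic": 150, "intergenic": 50}
fractions = region_fractions(sample_counts)
for region, frac in fractions.items():
    print(f"{region}: {frac:.1%}")
```

Running the same summary for every sample makes it easy to check whether the proportions are consistent across the experiment.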

Pseudoalignment-based methods

What percentage of the reads mapped successfully to transcripts? Are these consistent across samples? How is this information similar to / different from what you get from genome alignment?
What are the different quantification values you can get from the pseudoalignment methods? Which one did you decide to use and why?
How do counts between the two methods (alignment vs. pseudoalignment) compare?
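
Pseudoalignment tools typically report both estimated counts and a length-normalised abundance such as transcripts per million (TPM). To illustrate how the two relate, here is a minimal sketch of the standard TPM calculation. The counts and effective lengths are hypothetical, and the function is for illustration only; it is not the implementation used by any particular tool.

```python
from typing import List, Sequence


def tpm(counts: Sequence[float], effective_lengths: Sequence[float]) -> List[float]:
    """Transcripts per million from estimated counts and effective lengths."""
    if len(counts) != len(effective_lengths):
        raise ValueError("counts and effective_lengths must have the same length")
    if any(length <= 0 for length in effective_lengths):
        raise ValueError("effective lengths must be positive")
    # Reads per base for each transcript.
    rates = [c / length for c, length in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Scale so that the values sum to one million within the sample.
    return [rate / total * 1e6 for rate in rates]


# Hypothetical estimated counts and effective lengths for three transcripts.
example_tpm = tpm([10, 20, 30], [1000, 2000, 3000])
print(example_tpm)
```

In this example the three transcripts have different counts but the same TPM, because the counts scale exactly with transcript length. This is one reason counts and TPM answer different questions.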

PHASE 3 - Initial differential gene expression analysis

Describe the normalization methods used for both statistical analysis and visualizations and why the methods were selected.
During the initial stages of analysis, do experimental samples cluster as expected, i.e. based on the experimental conditions? What other information can be gained from sample clustering?
Which statistical methods and options did you pick and why?
Which filtering method(s) did you use to filter out the lowly expressed genes/transcripts?
What would be the next steps for your analysis, given the results of Phase 3?
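
As one possible illustration of the filtering question, the sketch below applies a common heuristic: keep a gene only if its counts per million (CPM) reach a threshold in a minimum number of samples. The gene names, counts, and the thresholds `min_cpm=1.0` and `min_samples=2` are assumptions chosen for the example, not recommended settings; dedicated differential expression packages provide their own filtering routines.

```python
from typing import Dict, List


def library_sizes(counts: Dict[str, List[int]]) -> List[int]:
    """Total read count per sample (column sums of the count matrix)."""
    return [sum(column) for column in zip(*counts.values())]


def filter_low_expression(
    counts: Dict[str, List[int]], min_cpm: float = 1.0, min_samples: int = 2
) -> List[str]:
    """Return genes whose CPM is >= min_cpm in at least min_samples samples."""
    sizes = library_sizes(counts)
    kept = []
    for gene, row in counts.items():
        cpm_values = [c * 1e6 / size for c, size in zip(row, sizes)]
        if sum(v >= min_cpm for v in cpm_values) >= min_samples:
            kept.append(gene)
    return kept


# Hypothetical count matrix: gene -> counts in four samples.
counts = {
    "geneA": [500_000, 400_000, 450_000, 480_000],
    "geneB": [500_000, 599_999, 550_000, 520_000],
    "geneC": [0, 1, 0, 0],
}
kept_genes = filter_low_expression(counts)
print(kept_genes)
```

Here `geneC` is dropped because it reaches the CPM threshold in only one sample. Reporting the thresholds used, and how many genes they removed, answers the filtering question directly.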

Questions on additional analyses

Additional downstream tertiary analyses are very commonly performed using RNA-Seq data, for example gene set enrichment, isoform analysis, surrogate variable analysis, or weighted gene co-expression network analysis. As these are highly dependent on the data available and on the results (particularly those from differential expression analysis), we do not cover them in detail, and they are not currently part of the accreditation exercise.

Nodes are more than welcome to attempt these, however. Should these be attempted, the following questions (though not comprehensive) may help guide what is expected regarding reporting of results.

Alternative reference genomes - How does changing the reference genome and annotation used influence your final results, choice of software, and parameters?
Isoform analysis - What method(s) did you use to identify alternative splicing and why? Provide the splice variants with a list of the exons associated with each isoform.
Pathway annotation source - Where did you get additional annotation information on the genes (i.e., gene names, symbols, GO (Gene Ontology)¹ terms, KEGG² pathways) and when was that resource last updated?
General pathway analyses - What is the impact of the experiment on biological pathways and processes? For example, how many Gene Ontology (GO) terms were found to be over-represented among the differentially expressed genes?
Assessing for unknown signals in analyses - Were methods used to adjust for potentially unknown factors in your analysis, such as unknown confounding variables? If so, how did this influence your results?
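
For the over-representation question above, one widely used approach is a one-sided hypergeometric test per GO term. The sketch below is a minimal, dependency-free illustration of that test. The function name and the example numbers are hypothetical, and a real analysis would also need to correct for testing many terms.

```python
from math import comb


def enrichment_p_value(
    n_background: int, n_in_term: int, n_de: int, n_de_in_term: int
) -> float:
    """One-sided hypergeometric p-value: P(X >= n_de_in_term).

    n_background  -- genes in the tested background
    n_in_term     -- background genes annotated with the GO term
    n_de          -- differentially expressed (DE) genes
    n_de_in_term  -- DE genes annotated with the GO term
    """
    if not 0 <= n_in_term <= n_background or not 0 <= n_de <= n_background:
        raise ValueError("term and DE gene sets must fit within the background")
    upper = min(n_in_term, n_de)
    if not 0 <= n_de_in_term <= upper:
        raise ValueError("n_de_in_term is out of range")
    total = comb(n_background, n_de)
    tail = sum(
        comb(n_in_term, i) * comb(n_background - n_in_term, n_de - i)
        for i in range(n_de_in_term, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 100 in the term,
# 500 DE genes, 10 of which are in the term.
p = enrichment_p_value(20_000, 100, 500, 10)
print(f"p = {p:.3g}")
```

The number of over-represented terms reported should be the number that remain significant after multiple-testing correction.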

Bibliography

1. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., … & Harris, M. A. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1), 25-29.
2. Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1), 27-30.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
