Step 2.1: Generation of gene/transcript-level counts

Protocol 2: Estimated counts using graph-based or similar approaches

A more recent alternative to alignment-based methods dramatically increases the speed of the analysis and resolves some of the issues that make the analysis of transcripts or gene families problematic with alignment-based approaches. These methods perform what is essentially a very lightweight ‘alignment’, which speeds up analysis, sometimes by orders of magnitude. They also generate estimated counts as output that can be imported into standard R-based workflows, thus combining the initial two steps of the alignment-based approach above.

The speedups depend on the tool being used and are achieved in slightly different ways, such as mapping k-mers from the reads to a transcriptome-based de Bruijn graph (exemplified by kallisto [1]) or ‘quasi-mapping’ of reads to transcript positions in a simple transcriptome-based index (e.g. Salmon [2]). These are normally followed by an expectation-maximization (EM) step that resolves ambiguous assignments by re-proportioning reads based on evidence from the overall analysis. Estimates of read counts for the transcripts can then be generated and used in downstream analyses. For more background, a good independent summary of the ‘pseudo-alignment’ approach is found here. The EM step in particular has proven useful in finding additional genes or transcripts that were missed using alignment-based approaches, which normally skip ambiguously mapped sequences.
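The two ideas above, k-mer compatibility followed by EM re-proportioning, can be sketched on toy data. This is a minimal illustration, not how kallisto or Salmon are implemented: the function names (`build_index`, `compatible_transcripts`, `em_counts`), the sequences, and the choice of k = 4 are all assumptions made for the example, and the sketch ignores transcript effective length, strand, and bias models that real tools account for.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every overlapping k-mer in seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return dict(index)


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read.

    An empty result means the read is not compatible with any transcript.
    """
    compat = None
    for km in kmers(read, k):
        hits = index.get(km, set())
        compat = set(hits) if compat is None else compat & hits
        if not compat:
            return frozenset()
    return frozenset(compat or ())


def em_counts(compat_sets, names, n_iter=100):
    """Toy EM: split each ambiguous read in proportion to current abundances."""
    weights = {t: 1.0 / len(names) for t in names}
    counts = {t: 0.0 for t in names}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in names}
        # E-step: fractionally assign each read to its compatible transcripts.
        for compat in compat_sets:
            total = sum(weights[t] for t in compat)
            if total == 0:  # unmapped read, or all candidates have zero weight
                continue
            for t in compat:
                counts[t] += weights[t] / total
        # M-step: re-estimate relative abundances from the fractional counts.
        assigned = sum(counts.values())
        if assigned == 0:
            break
        weights = {t: c / assigned for t, c in counts.items()}
    return counts


if __name__ == "__main__":
    # Hypothetical transcripts sharing a common prefix (ACGTTGCA).
    transcripts = {"txA": "ACGTTGCAGGGGCCCC", "txB": "ACGTTGCATTTTAAAA"}
    reads = ["GGGGCC"] * 3 + ["TTTAAA"] + ["CGTTGC"] * 4 + ["GAGAGA"]
    index = build_index(transcripts, 4)
    compat_sets = [compatible_transcripts(r, index, 4) for r in reads]
    print(em_counts(compat_sets, list(transcripts)))
```

In this toy data set three reads are unique to `txA`, one is unique to `txB`, four come from the shared region, and one matches nothing. The EM iterations move the four ambiguous reads toward a 3:1 split, so the estimates approach about 6 reads for `txA` and 2 for `txB`, rather than discarding the ambiguous reads altogether.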

A key difference between these procedures and the alignment approach is that the tools require a transcriptome data set rather than a reference genome. This may be a problem if your reference genome annotation isn’t of reasonably high quality, for instance if the transcripts it describes are poorly annotated or incomplete. However, these tools are of great use for well-characterized genomes such as human and mouse, and can also be used with transcriptome assemblies.

Note: because these analyses generate estimated read counts as part of their output, you can skip Step 2.2.
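To give a sense of what working with those estimated counts involves, the sketch below sums transcript-level estimates to gene level. It assumes a tab-separated table laid out like kallisto's `abundance.tsv` (columns `target_id`, `length`, `eff_length`, `est_counts`, `tpm`); the numeric values, the `tx2gene` mapping, and the function name `gene_level_counts` are invented for illustration. In practice this import and summarization would normally be done within the R-based workflow.

```python
import csv
import io
from collections import defaultdict

# Toy table in the column layout of kallisto's abundance.tsv.
# All values are invented for illustration only.
ABUNDANCE_TSV = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1500\t1301.0\t120.5\t10.0\n"
    "tx2\t900\t701.0\t30.25\t4.5\n"
    "tx3\t2100\t1901.0\t49.25\t2.5\n"
)

# Hypothetical transcript-to-gene mapping, e.g. derived from an annotation file.
TX2GENE = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}


def gene_level_counts(tsv_text, tx2gene):
    """Sum estimated transcript counts per gene.

    Transcripts missing from tx2gene are skipped in this sketch.
    """
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        gene = tx2gene.get(row["target_id"])
        if gene is None:
            continue
        totals[gene] += float(row["est_counts"])
    return dict(totals)


if __name__ == "__main__":
    print(gene_level_counts(ABUNDANCE_TSV, TX2GENE))
```

Because the estimates come from proportional assignment of reads, they are generally not integers, which is worth keeping in mind when passing them to downstream tools that expect integer counts.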

Bibliography

  1. Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34(5), 525.

  2. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417.