Edit me

Step 2.2: Count generation

Protocol 1

Once it is determined that the alignment step was successful, the next step is to enumerate the number of reads that are associated with the genes. There are multiple tools to perform this step (e.g. HTSeq’s htseq-count1 , Subread’s featureCounts2 ), in addition, there are statistical analysis tools that do not require this step (e.g. Cufflinks’3 Cuffdiff and Ballgown). Both scenarios will be tackled in Phase III.

We recommend using featureCounts to collect the raw gene count information; this tool will require the alignment file (BAM), and the associated gene annotation file (GTF). Irrespective of the counting tool used, it needs the following information:

  • Different sources have slightly different formats, so it is essential to specify how the counting needs to be performed, irrespective of the counting tool employed. Gene counts should be collected for each gene (-g set as gene_id, for featureCounts), and at the level of the exon (-t set as exon, for featureCounts).
  • Are the data stranded, if so, was the standard dUTP method employed (-s set as 2 for reverse, using featureCounts)?
  • Different sources have slightly different formats, so it is essential to specify how the counting needs to be performed, irrespective of the counting tool employed. Gene counts should be collected for each gene (-g set as gene_id, for featureCounts), and at the level of the exon (-t set as exon, for featureCounts).
  • Are the data stranded, if so, was the standard dUTP method employed (-s set as 2 for reverse, using featureCounts)?

The final output of any of these programs is a tab-delimited file with gene names in column one and counts in the second column. featureCounts will return an additional file that ends in .summary that specifies the number of reads that did not map only to one gene, split into various categories. It is normal for the total sum of all the rows in this file to be higher than the number of aligned reads for a sample, because if one read maps to two locations, featureCounts will count it twice in the Unassigned_MultiMapping category.

Bibliography

  1. Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), 166-169. 

  2. Liao, Y., Smyth, G. K., & Shi, W. (2013). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7), 923-930. 

  3. (Cufflinks) Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., Van Baren, M. J., … & Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5), 511.