The impact of read length on quantification of differentially expressed genes and splice junction detection. Academic Article

Overview

abstract

  • BACKGROUND: The initial next-generation sequencing technologies produced reads of 25 or 36 bp, and only from a single end of the library sequence. Currently, it is possible to reliably produce 300 bp paired-end sequences for RNA expression analysis. As read lengths have steadily increased, it has been widely assumed that longer reads are more informative and that paired-end reads produce better results than single-end reads. We used paired-end 101 bp reads and trimmed them to simulate different read lengths, and also separated the pairs to produce single-end reads. For each read length and paired status, we evaluated differential expression levels between two standard samples and compared the results to those obtained by qPCR.
    RESULTS: We found that, with the exception of 25 bp reads, read length makes little difference to the detection of differential expression. Once single-end reads are at a length of 50 bp, the results do not change substantially for any level up to, and including, 100 bp paired-end. However, splice junction detection significantly improves as the read length increases, with 100 bp paired-end showing the best performance. We performed the same analysis on two ENCODE samples and found consistent results, confirming that our conclusions have broad application.
    CONCLUSIONS: A researcher could save substantial resources by using 50 bp single-end reads for differential expression analysis instead of using longer reads. However, splicing detection is unquestionably improved by paired-end and longer reads. Therefore, an appropriate read length should be used based on the final goal of the study.
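The read-length simulation described in the abstract (trimming 101 bp paired-end reads and separating the pairs) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the names `FastqRecord`, `trim`, `simulate_paired`, `simulate_single`, and `to_fastq` are hypothetical, and two choices the abstract does not specify are assumed here, namely that bases are removed from the 3' end of each read and that the first mate of each pair is the one kept for the single-end data.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class FastqRecord(NamedTuple):
    """One FASTQ entry: read name, base calls, and per-base quality string."""
    name: str
    seq: str
    qual: str


Pair = Tuple[FastqRecord, FastqRecord]


def trim(record: FastqRecord, length: int) -> FastqRecord:
    """Keep the first `length` bases and qualities.

    Assumption: trimming removes bases from the 3' end of the read.
    Reads already shorter than `length` are returned unchanged.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return FastqRecord(record.name, record.seq[:length], record.qual[:length])


def simulate_paired(pairs: Iterable[Pair], length: int) -> Iterator[Pair]:
    """Trim both mates of every pair to the same simulated read length."""
    for mate1, mate2 in pairs:
        yield trim(mate1, length), trim(mate2, length)


def simulate_single(pairs: Iterable[Pair], length: int) -> Iterator[FastqRecord]:
    """Drop the second mate and trim the first to simulate single-end reads.

    Assumption: mate 1 is the read that is kept.
    """
    for mate1, _mate2 in pairs:
        yield trim(mate1, length)


def to_fastq(record: FastqRecord) -> str:
    """Serialise a record in four-line FASTQ layout."""
    return f"@{record.name}\n{record.seq}\n+\n{record.qual}\n"


if __name__ == "__main__":
    # Toy 101 bp pair standing in for real sequencing data.
    toy_pair = (
        FastqRecord("read1/1", "A" * 101, "I" * 101),
        FastqRecord("read1/2", "C" * 101, "I" * 101),
    )
    for read_length in (25, 50, 75, 100):
        single = next(simulate_single([toy_pair], read_length))
        print(read_length, len(single.seq))
```

Each simulated data set would then be aligned and quantified separately, so that differential expression and splice junction calls can be compared across read lengths and paired status.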

publication date

  • June 23, 2015

Research

keywords

  • Gene Expression Profiling
  • RNA Splice Sites
  • Sequence Analysis, RNA

Identity

PubMed Central ID

  • PMC4531809

Scopus Document Identifier

  • 84938570643

Digital Object Identifier (DOI)

  • 10.1186/s13059-015-0697-y

PubMed ID

  • 26100517

Additional Document Info

volume

  • 16