Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data.

Overview

abstract

We present an integrative machine learning method, incRNA, for whole-genome identification of noncoding RNAs (ncRNAs). It combines a large amount of expression data, RNA secondary-structure stability, and evolutionary conservation at the protein and nucleic-acid levels. Using the incRNA model and data from the modENCODE consortium, we separate known C. elegans ncRNAs from coding sequences and other genomic elements with high accuracy (97% AUC on an independent validation set) and find more than 7000 novel ncRNA candidates, of which more than 1000 are located in intergenic regions of the C. elegans genome. Based on the validation set, we estimate that 91% of the approximately 7000 novel ncRNA candidates are true positives. We then analyze 15 novel ncRNA candidates by RT-PCR and detect expression for 14 of them. In addition, we characterize the properties of all the novel ncRNA candidates and find that they have distinct expression patterns across developmental stages and tend to use novel RNA structural families. We also find that they are often targeted by specific transcription factors (∼59% of intergenic novel ncRNA candidates). Overall, our study identifies many new potential ncRNAs in C. elegans and provides a method that can be adapted to other organisms.
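The abstract's headline numbers (an AUC on a validation set, and a true-positive estimate for the candidate set) can be illustrated with a small sketch. This is not the incRNA model: the feature names, weights, bias, and the logistic scoring function below are hypothetical placeholders chosen only to show how several evidence types might be combined into one score and how an AUC is computed from such scores.

```python
import math

# Hypothetical weights for three evidence types named in the abstract.
# These values are illustrative assumptions, not parameters from the paper.
WEIGHTS = {"expression": 1.5, "structure": 1.0, "conservation": 2.0}
BIAS = -2.0  # hypothetical intercept


def score(features):
    """Combine per-region feature values into one score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def expected_true_positives(n_candidates, precision):
    """Expected number of true positives among the candidates."""
    return round(n_candidates * precision)


if __name__ == "__main__":
    # Toy validation set: (features, label), label 1 = known ncRNA.
    toy = [
        ({"expression": 0.9, "structure": 0.8, "conservation": 0.7}, 1),
        ({"expression": 0.6, "structure": 0.9, "conservation": 0.5}, 1),
        ({"expression": 0.2, "structure": 0.3, "conservation": 0.1}, 0),
        ({"expression": 0.7, "structure": 0.1, "conservation": 0.2}, 0),
    ]
    scores = [score(f) for f, _ in toy]
    labels = [y for _, y in toy]
    print("toy AUC:", auc(scores, labels))
    print("expected true positives:", expected_true_positives(7000, 0.91))
```

Applied to the abstract's rounded figures, `expected_true_positives(7000, 0.91)` corresponds to the statement that about 91% of the approximately 7000 candidates are estimated to be real; the toy AUC is only a demonstration of the metric and says nothing about the published 97%.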

authors

Khurana, Ekta
Agarwal, Ashish
Auerbach, Raymond
Rozowsky, Joel
Cheng, Chao
Kato, Masaomi
Miller, David M
Slack, Frank
Snyder, Michael
Waterston, Robert H
Reinke, Valerie
Gerstein, Mark B

publication date

December 22, 2010

published in

Genome Research (journal)

Research

keywords

Caenorhabditis elegans
High-Throughput Nucleotide Sequencing
Oligonucleotide Array Sequence Analysis
RNA, Untranslated

Identity

PubMed Central ID

PMC3032931

Scopus Document Identifier

79551575924

Digital Object Identifier (DOI)

10.1101/gr.110189.110

PubMed ID

21177971

Additional Document Info

has global citation frequency

58

volume

21

issue

2

VIVO Weill Cornell Medical College

Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data. Academic Article
