Fast and reliable prediction of noncoding RNAs

Stefan Washietl, Ivo L. Hofacker, and Peter F. Stadler

Department of Theoretical Chemistry and Structural Biology, University of Vienna, Währingerstrasse 17, A-1090 Wien, Austria; and Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany

Communicated by Hans Frauenfelder, Los Alamos National Laboratory, Los Alamos, NM, December 14, 2004 (received for review November 2, 2004)

We report an efficient method for detecting functional RNAs. The approach, which combines comparative sequence analysis and structure prediction, already has yielded excellent results for a small number of aligned sequences and is suitable for large-scale genomic screens. It consists of two basic components: (i) a measure for RNA secondary structure conservation based on computing a consensus secondary structure, and (ii) a measure for thermodynamic stability, which, in the spirit of a z score, is normalized with respect to both sequence length and base composition but can be calculated without sampling from shuffled sequences. Functional RNA secondary structures can be identified in multiple sequence alignments with high sensitivity and high specificity. We demonstrate that this approach is not only much more accurate than previous methods but also significantly faster. The method is implemented in the program RNAZ, which can be downloaded from www.tbi.univie.ac.at/~wash/RNAz. We screened all alignments of length n ≥ 50 in the Comparative Regulatory Genomics database, which compiles conserved noncoding elements in upstream regions of orthologous genes from human, mouse, rat, Fugu, and zebrafish. We recovered all of the known noncoding RNAs and cis-acting elements with high significance and found compelling evidence for many other conserved RNA secondary structures not described so far to our knowledge.

comparative genomics | conserved RNA secondary structure

Traditionally, the role
of RNA in the cell was considered mostly in the context of protein gene expression, limiting RNA to its function as mRNA, tRNA, and rRNA. The discovery of a diverse array of transcripts that are not translated to proteins but rather function as RNAs has changed this view profoundly (1–3). Noncoding RNAs (ncRNAs) are involved in a large variety of processes, including gene regulation (4), maturation of mRNAs, rRNAs, and tRNAs, or X-chromosome inactivation in mammals (5). In fact, a large fraction of the mouse transcriptome consists of ncRNAs (6), and about half of the transcripts from human chromosomes 21 and 22 are noncoding (7, 8). Structured RNA motifs furthermore function as cis-acting regulatory elements within protein-coding genes. Also in this context, new intriguing mechanisms are being discovered (9). Hence, a comprehensive understanding of cellular processes is impossible without considering RNAs as key players.

Efficient identification of functional RNAs (ncRNAs as well as cis-acting elements) in genomic sequences is, therefore, one of the major goals of current bioinformatics. Notwithstanding its utmost biological relevance, de novo prediction is still a largely unsolved issue. Unlike protein-coding genes, functional RNAs lack in their primary sequence common statistical signals that could be exploited for reliable detection algorithms. Many functional RNAs, however, depend on a defined secondary structure. In particular, evolutionary conservation of secondary structures serves as compelling evidence for biologically relevant RNA function. Comparative studies, therefore, seem to be the most promising approach.

2454–2459 | PNAS | February 15, 2005 | vol. 102 | no. 7

To date, complete genomic sequences of related species have been determined for almost all genetic model organisms, for example, bacteria (10, 11), yeasts (12), nematodes (13, 14), and even mammals (15–17). Several studies (18–21) have identified a large collection of evolutionarily conserved noncoding elements in mammalian or, more generally, vertebrate genomes, and it must
be expected that a significant fraction of them are functional RNAs. Possible candidates, however, have been identified only sporadically so far (19–21), simply because there are no reliable tools to scan multiple sequence alignments for functional RNAs. The most widely used program, QRNA (22), which has been successfully used to identify ncRNAs in bacteria (23) and yeast (24), is not suitable for screens of large genomes. QRNA is limited to pairwise alignments, and its reliability is low, especially if the evolutionary distance of the two sequences lies outside of the optimal range. An alternative approach, DDBRNA (25), suffers from similar problems and so far has not been used in a real-life application. MSARI (26), on the other hand, gains its drastically enhanced accuracy from the large amount of information contained in large multiple sequence alignments of 10–15 sequences with high sequence diversity. At present, however, data sets of this kind are not available at a genomewide scale, at least for multicellular organisms.

In this article, we address the problem by using an alternative approach: we combine a measure for thermodynamic stability with a measure for structure conservation. Using a combination of both scores, we are able to efficiently detect functional RNAs in multiple sequence alignments of only a few sequences. Our method is substantially more accurate than QRNA or DDBRNA and performs better on pairwise alignments than MSARI does on alignments with 15 sequences. On the large, diverse alignments used for testing MSARI in ref. 26, our RNAZ program achieved 100% sensitivity at 100% specificity.

Methods

Minimum Free Energy (MFE) RNA Folding. For MFE RNA folding, we used the C libraries of the Vienna RNA package, version 1.5 (27). We used RNAFOLD for folding single sequences and RNAALIFOLD (28) for consensus folding of aligned sequences. The same folding parameters were used for both algorithms to ensure that the obtained MFE values were comparable. For the covariation part of RNAALIFOLD, we used default
parameters. Gaps were removed for single-sequence folding.

Calculation of z Scores Using Support Vector Machine (SVM) Regression. To calculate z scores by regression analysis, we used the following procedure: we generated synthetic sequences of different length and base composition. The length of the test sequences ranged from 50 to 400 nt in steps of 50. To quantify base composition, we used the GC/AT, A/T, and G/C ratios of the sequences and chose values for all ratios ranging from 0.25 to 0.75 in steps of 0
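
The thermodynamic stability measure is described as being "in the spirit of a z score" and normalized for sequence length and base composition. As a point of comparison, the following Python sketch shows the conventional sampling-based z score that such a measure stands in for, together with simple composition descriptors. It is an illustration under stated assumptions, not the RNAZ implementation: all function names are hypothetical, the fraction-based reading of the three composition ratios is an assumption (the text does not define them), and no folding is performed, so the energies must be supplied by the caller (for example, from RNAFOLD).

```python
import random
import statistics

# Sequence lengths of the synthetic sequences as stated in the text:
# 50 to 400 nt in steps of 50.
SYNTHETIC_LENGTHS = list(range(50, 401, 50))


def _fraction(part, whole):
    """Return part / whole, or 0.0 when the denominator is zero."""
    return part / whole if whole else 0.0


def composition_features(seq):
    """Length and base-composition descriptors of a nucleotide sequence.

    ASSUMPTION: the three ratios are read as fractions in [0, 1];
    the source text does not give their exact definitions.
    """
    s = seq.upper().replace("T", "U")
    a, c, g, u = (s.count(base) for base in "ACGU")
    n = a + c + g + u
    return {
        "length": n,
        "gc_fraction": _fraction(g + c, n),   # assumed reading of GC/AT
        "a_fraction": _fraction(a, a + u),    # assumed reading of A/T
        "g_fraction": _fraction(g, g + c),    # assumed reading of G/C
    }


def shuffled(seq, rng):
    """Random permutation of seq; keeps length and base composition."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def z_score(value, background):
    """Conventional z score: (value - mean) / sample standard deviation."""
    if len(background) < 2:
        raise ValueError("need at least two background values")
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    if sigma == 0:
        raise ValueError("background has zero variance")
    return (value - mu) / sigma
```

With made-up energies, `z_score(-30.0, [-20.0, -22.0, -18.0])` evaluates to -5.0: the background has mean -20 and sample standard deviation 2, so a native energy of -30 lies five standard deviations below it. In the sampling-based scheme the background would come from folding many outputs of `shuffled`; the regression approach described in the text aims to avoid exactly that sampling step.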