Consensus Folding of Unaligned RNA Sequences Revisited Vineet Bafna1 Haixu Tang2 and Shaojie Zhang1 1 Dept of Computer Science and Engineering University of California San Diego La Jolla CA 92093 vbafna cs ucsd edu shzhang cs ucsd edu 2 School of Informatics and Center for Genomics and Bioinformatics Indiana University Bloomington IN 47408 hatang indiana edu Abstract As one of the earliest problems in computational biology RNA secondary structure prediction sometimes referred to as RNA folding problem has attracted attention again thanking to the recent discoveries of many novel non coding RNA molecules The two common approaches to this problem are de novo prediction of RNA secondary structure based on energy minimization and consensus folding approach computing the common secondary structure for a set of unaligned RNA sequences Consensus folding algorithms work well when the correct seed alignment is part of the input to the problem However seed alignment itself is a challenging problem for diverged RNA families In this paper we propose a novel framework to predict the common secondary structure for unaligned RNA sequences By matching putative stacks in RNA sequences we make use of both primary sequence information and thermodynamic stability for prediction at the same time We show that our method can predict the correct common RNA secondary structures even when we are only given a limited number of unaligned RNA sequences and it outperforms current algorithms in sensitivity and accuracy 1 Introduction With the recent discovery of novel non coding RNA ncRNA families RNA is rapidly gaining importance as a molecule of interest 1 2 Recent article puts the number of human genes down to about 20000 25000 3 By comparison even the worm C elegans has around 19 500 genes On the other hand expression activity has been detected on a much larger portion of human genome 4 It is likely that many of these novel transcripts are ncRNA genes which may carry on many unknown cellular functions Like proteins RNA structures are more important for function than their sequences RNA with similar function often have similar structures but distinct primary sequences Therefore understanding the structures of these RNA molecules will help elucidate their functions Consider the recent exciting discovery of riboswitches 5 6 as an example These control elements with a conserved secondary structure are located in the untranslated 2 regions of genes coding for proteins that are involved in variant metabolite nucleic acids amino acids etc synthesis pathways The riboswitches can turn off the expression of their downstream genes by binding to certain metabolites subsequently changing their secondary structures and blocking the translation initiation While there is a resurgence in interest in ncRNA the problem of RNA secondary structure prediction has been extensively studied since the 70s The key idea is that to stabilize its structure distant base pairs in the single stranded RNA molecule must form hydrogen bonds There are two distinct approaches to predict RNA secondary structure The RNA folding approach initiated by Tinoco et al 7 assigns free energies to the components of RNA secondary structure and then computes the RNA secondary structure with the minimum energy Dynamic programming algorithms have been developed to compute minimum energy secondary structures 8 12 and implemented in software packages such as MFOLD 13 and ViennaRNA 14 However RNA folding via energy minimization has its shortcomings First fold prediction depends critically upon correct values of the energy parameters as shown by Jaeger et al 15 which are hard to obtain experimentally Also RNA folding in a real cell is mediated by interactions with other molecules and the absence of knowledge of these interactions may cause mis folding in silico Pavesi et al 16 tried to alleviate this problem by comparing minimum energy structures of a set of RNA sequences from the same family to determine conserved secondary structure However it is unclear how the misprediction of secondary structure for a single RNA sequence can affect the accuracy of this approach A different approach attempts to resolve these shortcomings by using evolutionary conservation of structure as the basis for structure prediction It needs as input multiple RNA sequences from an RNA family which have common secondary structures Since for divergent sequences the mutations in base pairing regions must be compensated in the complementary base to preserve structure the presence of multiple covarying mutations is a strong signal for base pairing In fact most RNA sequences are selected more for maintenance of the structure than conservation of primary sequence If the sequence similarity between the given RNA sequences is appropriate one can first align these sequences using multiple sequence alignment algorithm and then figure out the potential base pairs in RNA secondary structures by looking at regions with a high number of compensating mutations Levitt successfully derived the theoretical tRNA secondary structure using this approach which was largely confirmed by crystallography 17 and various other structures have been determined through such analysis Computer programs were implemented later on to achieve this goal automatically 18 However aligning multiple and divergent RNA sequences so as to preserve their conserved structures is not easy because many compensatory mutations decrease the overall sequence similarity For unaligned sequences one must compute the structure and alignment simultaneously Sankoff proposed an algorithm that can simultaneously align RNA sequences and find the optimal common 3 fold 19 21 However the complexity of this algorithm is O l6 where l is the length of RNA sequences too high to be practical even for two sequences The complexity can be reduced to O l4 22 but only when RNA has no multi loop structure Eddy and Durbin and other groups 23 25 pioneered the approach of modeling RNA sequences using stochastic context free grammars The rules of the SCFG allow for position dependent scoring of distant base pairs and primary sequence conservation and also allow automated estimation of model parameters from unaligned sequences using EM However in practice the extensive divergence of RNA sequences makes it hard to reconstruct structure and alignment with perfect accuracy and the covariance models work best when supplied with good seed alignments Much recent work
View Full Document