L3: Comparative approach to discovering ncRNA
CSE280B 06 Bafna Pevzner

Comparative prediction of ncRNA
In genome-genome comparisons, many sequences are found to be conserved.
Can you use the pattern of conservation to detect whether these are ncRNA sequences?
QRNA is software that does this.

QRNA approach
Compute the 3 probabilities:
$$\Pr(XY \mid \mathrm{COD}), \quad \Pr(XY \mid \mathrm{RNA}), \quad \Pr(XY \mid \mathrm{OTH})$$

Pr(XY | RNA)
$$\Pr(XY \mid \mathrm{RNA}) = \sum_{s} \Pr(XY \mid s, \mathrm{RNA}) \, \Pr(s \mid \mathrm{RNA})$$
While there are many structures $s$, this expression can be computed efficiently.
We start by describing a different formalism for computing RNA structure.
We will show that the probabilistic and energy frameworks are essentially equivalent.

Stochastic Context Free Grammars
An SCFG is a generative model that generates a random RNA string.
The rules that were stochastically applied to generate the string define base pairing and loops.
Consider the following CFG:
    S → W            (start)
    W → x W y        (base pairing)
    W → x W          (unpaired bases)
    W → W W          (branching)
    W → ε            (termination)
    x, y ∈ {A, C, G, U}
The CFG generates RNA sequences with an associated structure.

Example
[Figure: derivation tree from S through successive W nodes, with leaves A A C G C G U]

Computing RNA structures
[Figure: two alternative parse trees, joined by "OR", that both generate A C G C G U]
Consider the inverse problem: given an RNA string, find the best parse (a sequence of context-free rules that generates the sequence).
This is equivalent to computing structure.

Computing the optimum parse

Stochastic Context Free Grammars
Associate a probability with each rule (hence SCFG):
    Pr(S → W)        (start)
    Pr(W → x W y)    (base pairing)
    Pr(W → x W)      (unpaired bases)
    Pr(W → W W)      (branching)
    Pr(W → ε)        (termination)
    x, y ∈ {A, C, G, U}
Let $\alpha_{ij}$ be the probability that the RNA subsequence $s[i..j]$ was generated by the SCFG:
$$\alpha_{ij} = \Pr(s[i..j] \mid \mathrm{SCFG}), \qquad \Pr(s[i..j], \mathrm{SCFG}) = \Pr(s[i..j] \mid \mathrm{SCFG}) \, \Pr(\mathrm{SCFG})$$
It is sufficient to compute $\alpha_{ij}$ for all $i, j$.

Computing $\alpha_{ij}$
$$\alpha_{ij} = \Pr(W \to s_i\, W\, s_j)\, \alpha_{i+1,j-1} + \Pr(W \to s_i\, W)\, \alpha_{i+1,j} + \Pr(W \to W\, s_j)\, \alpha_{i,j-1} + \sum_{k} \Pr(W \to W W)\, \alpha_{i,k-1}\, \alpha_{k,j}$$

Computing the most likely parse
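The inside recurrence above and its max-product counterpart for the most likely parse differ only in how the terms are combined (sum versus max), so both can be sketched with one table-filling routine. This is a minimal sketch under stated assumptions, not the lecture's implementation: the rule probabilities in `make_params` are hypothetical values chosen only so that they sum to 1, the names `make_params`, `fill`, `inside` and `viterbi` are illustrative, both unpaired rules (W → x W and W → W x) are included because the recurrence uses both, and the branching split is restricted to two non-empty halves.

```python
from itertools import product

BASES = "ACGU"

def make_params():
    """Hypothetical rule probabilities for W (assumed values; they sum to 1)."""
    pair = {(x, y): 0.001 for x, y in product(BASES, repeat=2)}
    for x, y in [("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")]:
        pair[(x, y)] = 0.1          # Watson-Crick pairs
    for x, y in [("G", "U"), ("U", "G")]:
        pair[(x, y)] = 0.02         # wobble pairs
    left = {x: 0.05 for x in BASES}   # W -> x W
    right = {x: 0.05 for x in BASES}  # W -> W x
    return {"pair": pair, "left": left, "right": right,
            "branch": 0.05, "end": 0.1}

def fill(seq, p, combine):
    """Fill the DP table; combine=sum gives inside, combine=max gives Viterbi."""
    n = len(seq)
    # t[i][j] covers the half-open slice seq[i:j]; t[i][i] is the empty string.
    t = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        t[i][i] = p["end"]                      # W -> epsilon
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            terms = [
                p["left"][seq[i]] * t[i + 1][j],       # first base unpaired
                p["right"][seq[j - 1]] * t[i][j - 1],  # last base unpaired
            ]
            if length >= 2:                            # first and last base paired
                terms.append(p["pair"][(seq[i], seq[j - 1])] * t[i + 1][j - 1])
            # branching into two non-empty parts seq[i:k] and seq[k:j]
            terms.extend(p["branch"] * t[i][k] * t[k][j] for k in range(i + 1, j))
            t[i][j] = combine(terms)
    return t[0][n]                              # S -> W has probability 1

def inside(seq, p):
    """Total probability of seq, summed over all parses."""
    return fill(seq, p, sum)

def viterbi(seq, p):
    """Probability of the single most likely parse of seq."""
    return fill(seq, p, max)

if __name__ == "__main__":
    params = make_params()
    for s in ["AU", "GGGAAACCC"]:
        print(s, inside(s, params), viterbi(s, params))
```

Keeping sum and max behind one `combine` argument makes the relation between the two recurrences explicit: the Viterbi value can never exceed the inside value for the same string. Recovering the structure itself would additionally need a traceback table, which is omitted here.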
$$v_{ij} = \max \begin{cases}
\Pr(W \to s_i\, W\, s_j)\; v_{i+1,j-1} \\
\Pr(W \to s_i\, W)\; v_{i+1,j} \\
\Pr(W \to W\, s_j)\; v_{i,j-1} \\
\max_{k} \Pr(W \to W W)\; v_{i,k-1}\, v_{k,j}
\end{cases}$$

SCFGs versus energy minimization
The two approaches (most likely parse and energy minimization) give equivalent answers.
The full likelihood function might sometimes be more meaningful than the maximum-likelihood parse: it helps answer the question "is the string s an RNA sequence?"
The probabilistic approach makes it easier to train parameters using Bayesian methods.
However, Wexler et al.'s speedup may or may not be possible (open problem).

Probability of RNA alignments
How can we compute
$$\Pr(XY, \mathrm{RNA}) = \Pr(XY \mid \mathrm{RNA}) \, \Pr(\mathrm{RNA})\,?$$

Probability of an RNA alignment
With $A_i$ denoting column $i$ of the alignment:
$$\alpha_{ij} = \Pr(W \to A_i\, W\, A_j)\, \alpha_{i+1,j-1} + \Pr(W \to A_i\, W)\, \alpha_{i+1,j} + \Pr(W \to W\, A_j)\, \alpha_{i,j-1} + \sum_{k} \Pr(W \to W W)\, \alpha_{i,k-1}\, \alpha_{k,j}$$

Computing RNA emission probabilities

Other models in QRNA

Is the sequence RNA, coding, or OTH?
Pr(XY | Model) can be computed for the 3 models (RNA, COD, OTH).
$$\Pr(\mathrm{Model}_i \mid XY) = \frac{P(XY \mid \mathrm{model}_i)\, P(\mathrm{model}_i)}{P(XY)}, \qquad P(XY) = \sum_{j} P(XY \mid \mathrm{model}_j)\, P(\mathrm{model}_j)$$

QRNA results
Multiple alignment of 63 eukaryotic SRP RNAs and 52 RNaseP RNAs.
Use pairwise alignments from the structural alignment.
Alignments are classified according to sequence diversity.
Use each sequence as a query to Blast against other family members.
Sensitivity: fraction of pairs predicted to be RNA.
Specificity: 1 − fraction predicted to be RNA after shuffling.

Sensitivity and Specificity

QRNA results (experiment 2)
Each of the sequences was chosen in turn and compared against members of its own family (WU Blastn2).
Poor quality of alignments.
Bias towards conserved sequences.
1003 out of 3342 pairwise alignments were selected.

QRNA results
Comparison of E. coli and S. typhi.
E. coli was partitioned into 115 RNA, 4290 ORFs, and 2367 intergenic regions.
Each region was blasted against S. typhi, and QRNA was used on
quality alignments.
354 alignments to RNA, 4946 to ORFs, 11509 alignments to intergenic regions.
Repeats.

Genomic comparison: results
    Regions annotated as   Total   RNA    COD    OTH
    ncRNA                  148     33     -      115
    ORF                    7422    88     3397   3937
    intergenic             1974    351    61     1562

Conclusions
Blastn does not produce good alignments from a structural viewpoint.
Can we use paired SCFGs to redo the alignment and the structure? In principle yes, but it is expensive.
Rivas and Eddy did not use a true comparison of orthologs. Would that help?

Computing structural alignments
For all intervals $[i, j]$ in $s[1..n]$ and $[k, l]$ in $t[1..m]$:
$$S(i,j,k,l) = \max \begin{cases}
S(i+1,\, j-1,\, k+1,\, l-1) + \delta(i,j,k,l) \\
S(i+1,\, j,\, k,\, l) + \gamma(s_i, -) \\
S(i+1,\, j,\, k+1,\, l) + \gamma(s_i, t_k) \\
\vdots \\
\max_{j',\, l'} \; S(i,\, j'-1,\, k,\, l'-1) + S(j',\, j,\, l',\, l)
\end{cases}$$
where $\delta(i,j,k,l)$ scores aligning the base pair $(s_i, s_j)$ with $(t_k, t_l)$ and $\gamma$ scores a single aligned or gapped position.

Project question
Can you improve upon QRNA with the following?
Structural alignments to obtain better results.
Filtering to make the search efficient: most pairs should be discarded without computing a structural alignment.

RNA multiple alignments
Why should we compute multiple structural alignments for RNA?

Structural alignment
    X07545    ACCCGGC CAUA GUGGCCG GGCAA CAC CCGG U C UCGUU
    M21086    ACCCGGC CAUA GCGGCCG GGCAA CAC CCGG A C UCAUG
    X05870    ACCCGGC CACA GUGAGCG GGCAA CAC CCGG A C UCAUU
    U05019    ACCCGGU CAUA GUGAGCG GGUAA CAC CCGG A C UCGUU
    M16530    ACCCGGC AAUA GGCGCCGGUGCUA CGC CCGG U C UCUUC
    X01588    ACCCGGU CACA GUGAGCG GGCAA CAC CCGG A C UCAUU
    AF034619  GGCGGC CACA GCGGUGG GGUUGCCUC CCGU A C CCAUC
    L27170    AGUGGUGGC CAUA UCGGCGG GGUUC CUCCCCGU A C CCAUC
    X05532    AGGAACGGC CAUA CCACGUC GAUCG CAC CACA U C CCGUC GC
Conserved sequences and conserved structure are more apparent in multiple alignments.

Computing structural alignments: analogy
In sequence alignment, the score for aligning a column is position independent.
In profiles or HMMs, position-specific scoring is used to distinguish conserved positions from nonconserved positions.
Similar ideas can be used for RNA.

Covariance models: RNA profiles
[Figure: grammar over states S, W1, W2, W3, W4 emitting symbols a and b, shown with example columns of A and U] …
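Returning to the model-comparison step in QRNA: once the three likelihoods Pr(XY | RNA), Pr(XY | COD) and Pr(XY | OTH) are available, the posterior over models is a direct application of Bayes' rule. The sketch below works in log space to avoid underflow on long alignments. The log-likelihood values, the uniform prior, and the function name `posteriors` are all hypothetical and chosen for illustration; none of them comes from the slides or from the QRNA software.

```python
import math

def posteriors(log_lik, prior):
    """Return Pr(model | XY) for each model via Bayes' rule, computed in log space."""
    # log of the joint: log Pr(XY | model) + log Pr(model)
    joint = {m: log_lik[m] + math.log(prior[m]) for m in log_lik}
    # log Pr(XY) = log sum_j Pr(XY | model_j) Pr(model_j), via log-sum-exp
    top = max(joint.values())
    log_evidence = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {m: math.exp(v - log_evidence) for m, v in joint.items()}

# Hypothetical log-likelihoods log Pr(XY | model) for one pairwise alignment.
log_lik = {"RNA": -120.0, "COD": -125.0, "OTH": -123.0}
# Assumed uniform prior over the three models.
prior = {"RNA": 1 / 3, "COD": 1 / 3, "OTH": 1 / 3}

post = posteriors(log_lik, prior)
best = max(post, key=post.get)  # model with the highest posterior

if __name__ == "__main__":
    print(best, post)
```

With a uniform prior the ranking of the models is the same as the ranking of their likelihoods; a non-uniform prior would shift the posterior toward whichever model is expected to be more common in the genome being scanned.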