RNA Secondary StructureSlide 2A Context Free GrammarExample: modeling a stem loopSlide 5A parse tree: alignment of CFG to sequenceAlignment scores for parsesThe Nussinov AlgorithmSlide 9Slide 10The Nussinov Algorithm and CFGsSlide 12Stochastic Context Free GrammarsComputational ProblemsNormal Forms for CFGsExample of converting a CFG to C.N.F.Another exampleDecoding: the CYK algorithmThe CYK algorithm (Cocke-Younger-Kasami)A SCFG for predicting RNA structureCYK for RNA foldingEvaluationThe Inside AlgorithmAlgorithm: InsideThe Outside AlgorithmAlgorithm: OutsideLearning for SCFGsSlide 28Summary: SCFG and HMM algorithmsThe Zuker algorithm – main ideasMethods for inferring RNA foldMultiple alignment and RNA foldingMutual informationRNA Secondary StructureaagacuucggaucuggcgacacccuacacuucggaugacaccaaagugaggucuucggcacgggcaccauucccaacuucggauuuugcuaccauaaagccuucggagcgggcguaacucHairpin LoopsStemsBulge loopInterior loopsMulti-branched loopA Context Free GrammarS AB Nonterminals: S, A, BA aAc | a Terminals: a, b, c, dB bBd | bDerivation:S AB aAcB … aaaacccB aaaacccbBd … aaaacccbbbbbbdddProduces all strings ai+1cibj+1dj, for i, j 0Example: modeling a stem loopS a W1 uW1 c W2 gW2 g W3 cW3 g L cL agucgWhat if the stem loop can have other letters in place of the ones shown?ACGGUGCCAG UCGExample: modeling a stem loopS a W1 u | g W1 u W1 c W2 gW2 g W3 c | g W3 uW3 g L c | a L uL agucg | agccg | cugugcMore general: Any 4-long stem, 3-5-long loop:S aW1u | gW1u | gW1c | cW1g | uW1g | uW1aW1 aW2u | gW2u | gW2c | cW2g | uW2g | uW2aW2 aW3u | gW3u | gW3c | cW3g | uW3g | uW3aW3 aLu | gLu | gLc | cLg | uLg | uLaL aL1 | cL1 | gL1 | uL1L1 aL2 | cL2 | gL2 | uL2L2 a | c | g | u | aa | … | uu | aaa | … | uuuACGGUGCCAG UCGGCGAUGCUAG CCGGCGAUGUUCUG UCGA parse tree: alignment of CFG to sequenceACGGUGCCAG UCGA C G G A G U G C C C G USW1W2W3L•S a W1 u•W1 c W2 g•W2 g W3 c•W3 g L c•L agucgAlignment scores for parsesWe can define each rule X s, where s is a string,to have a score.Example:W a W’ u: 3 (forms 3 hydrogen bonds)W g W’ c: 2 (forms 2 hydrogen bonds)W g W’ u: 1 (forms 1 hydrogen bond)W x W’ z -1, when (x, z) is not an a/u, g/c, g/u pairQuestions:-How do we best align a CFG to a sequence? (DP)-How do we set the parameters? (Stochastic CFGs)The Nussinov AlgorithmLet’s forget CFGs for a momentProblem:Find the RNA structure with the maximum (weighted) number of nested pairingsAGACCUCUGGGCGGCAGUCUAUGCGAACGCGUCAUCAGCUGGAAGAAGGGAGAUCUUCACCAAUACUGAAUUGCACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCGAThe Nussinov AlgorithmGiven sequence X = x1…xN,Define DP matrix:F(i, j) = maximum number of bonds if xi…xj folds optimallyTwo cases, if i < j:1. xi is paired with xjF(i, j) = s(xi, xj) + F(i+1, j-1)•xi is not paired with xjF(i, j) = max{ k: i k < j } F(i, k) + F(k+1, j)i ji ji jkF(i, j)F(i, k)F(k+1, j)The Nussinov AlgorithmInitialization:F(i, i-1) = 0; for i = 2 to NF(i, i) = 0; for i = 1 to NIteration:For i = 2 to N:For i = 1 to N – lj = i + l – 1F(i+1, j -1) + s(xi, xj)F(i, j) = maxmax{ i k < j } F(i, k) + F(k+1, j)Termination: Best structure is given by F(1, N)(Need to trace back; refer to the Durbin book)The Nussinov Algorithm and CFGsDefine the following grammar, with scores:S a S u : 3 | u S a : 3 g S c : 2 | c S g : 2 g S u : 1 | u S g : 1 S S : 0 | a S : 0 | c S : 0 | g S : 0 | u S : 0 | : 0 Note: is the “” stringThen, the Nussinov algorithm finds the optimal parse of a string with this grammarThe Nussinov AlgorithmInitialization:F(i, i-1) = 0; for i = 2 to NF(i, i) = 0; for i = 1 to N S a | c | g | uIteration:For i = 2 to N:For i = 1 to N – lj = i + l – 1F(i+1, j -1) + s(xi, xj) S a S u | …F(i, j) = maxmax{ i k < j } F(i, k) + F(k+1, j) S S STermination: Best structure is given by F(1, N)Stochastic Context Free GrammarsIn an analogy to HMMs, we can assign probabilities to transitions:Given grammarX1 s11 | … | sin…Xm sm1 | … | smnCan assign probability to each rule, s.t.P(Xi si1) + … + P(Xi sin) = 1Computational Problems•Calculate an optimal alignment of a sequence and a SCFG(DECODING)•Calculate Prob[ sequence | grammar ](EVALUATION)•Given a set of sequences, estimate parameters of a SCFG(LEARNING)Normal Forms for CFGsChomsky Normal Form:X YZX aAll productions are either to 2 nonterminals, or to 1 terminalTheorem (technical)Every CFG has an equivalent one in Chomsky Normal Form(That is, the grammar in normal form produces exactly the same set of strings)Example of converting a CFG to C.N.F.S ABCA Aa | aB Bb | bC CAc | cConverting:S AS’S’ BCA AA | aB BB | bC DC’ | cC’ cD CASAB CAaaBbBbbCAcc aSAS’B CA Aa aB BB BbbbDC’CAcc aAnother exampleS ABCA C | aAB bB | bC cCd | cConverting:S AS’S’ BCA C’C’’ | c | A’AA’ aB B’B | bB’ bC C’C’’ | cC’ cC’’ CDD dDecoding: the CYK algorithmGiven x = x1....xN, and a SCFG G,Find the most likely parse of x(the most likely alignment of G to x)Dynamic programming variable:(i, j, V): likelihood of the most likely parse of xi…xj,rooted at nonterminal VThen, (1, N, S): likelihood of the most likely parse of x by the grammarThe CYK algorithm (Cocke-Younger-Kasami)Initialization:For i = 1 to N, any nonterminal V, (i, i, V) = log P(V xi)Iteration:For i = 1 to N-1 For j = i+1 to N For any nonterminal V,(i, j, V) = maxXmaxYmaxik<j (i,k,X) + (k+1,j,Y) + log P(V XY)Termination:log P(x | , *) = (1, N, S) Where * is the optimal parse tree (if traced back appropriately from above)A SCFG for predicting RNA structureS a S | c S | g S | u S | S a | S c | S g | S u a S u | c S g | g S u | u S g | g S c | u S a SS•Adjust the probability parameters to reflect bond strength etc•No distinction between non-paired bases, bulges, loops•Can modify to model these eventsL: loop nonterminalH: hairpin nonterminalB: bulge
View Full Document