CS 262 Lecture 20: RNA Secondary Structure Prediction
Scribed by Michael Yu
March 15, 2007

This lecture continues the discussion of using stochastic context-free grammars (SCFGs) for predicting the secondary structure (loops, bulges, etc.) of RNA.

Stochastic Context Free Grammars

An SCFG is a context-free grammar (CFG) with probabilities assigned to every production rule such that the probabilities of all productions out of a given non-terminal sum to 1. That is, given a grammar

  X_1 → s_11 | ... | s_1n
  ...
  X_m → s_m1 | ... | s_mn,

we assign probabilities to the productions such that Pr(X_i → s_i1) + ... + Pr(X_i → s_in) = 1 for every X_i.

We are interested in the decoding, evaluation, and learning problems for SCFGs, just as we were for hidden Markov models (HMMs). These problems are solved by the CYK (Cocke-Younger-Kasami) algorithm, the inside and outside algorithms, and expectation maximization (EM), respectively.

The CYK Algorithm (Decoding)

The CYK algorithm is the analogue of the Viterbi algorithm; it solves the decoding problem for SCFGs. Given a sequence and an SCFG in Chomsky Normal Form (CNF), the CYK algorithm finds the most likely parse tree of the sequence under the SCFG. The algorithm is as follows:

Initialization:
  For i = 1 to N, for any non-terminal V:
    γ(i, i, V) = log Pr(V → x_i)

Iteration:
  For i = N − 1 down to 1:
    For j = i + 1 to N:
      For any non-terminal V:
        γ(i, j, V) = max_X max_Y max_{i ≤ k < j} [ γ(i, k, X) + γ(k+1, j, Y) + log Pr(V → XY) ]

Termination:
  log Pr(x | θ, π*) = γ(1, N, S),
where π* is the optimal parse tree (recovered by tracing back through the maximizations above).

The basic idea is that γ(i, j, V) is the log-likelihood of the most probable parse of x_i ··· x_j rooted at V. In CNF, a rule rooted at V either produces a terminal x_i or two non-terminals X and Y. The first case is handled in the initialization step by setting γ(i, i, V) = log Pr(V → x_i). The second case is handled recursively: if we take a production of the form V → XY, then X is the root of a parse subtree for x_i ··· x_k and Y is the root of a parse subtree for x_{k+1} ··· x_j. The log-likelihood of the best such parse is the sum of the best log-likelihoods of the two subtrees, γ(i, k, X) and γ(k+1, j, Y), plus log Pr(V → XY). So we take the maximum of this sum over all productions of the form V → XY and all split points k ∈ [i, j).
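
To make the recursion concrete, here is a minimal sketch of CYK decoding in Python. It is not from the lecture: the grammar representation and names such as emit, binary, and cyk are illustrative. The grammar is assumed to be in CNF, with emission probabilities Pr(V → a) and binary-rule probabilities Pr(V → XY) supplied as dictionaries, and the table is filled over spans of increasing length so that every subproblem is available when needed.

  import math

  def cyk(x, nonterminals, start, emit, binary):
      """CYK decoding for an SCFG in Chomsky Normal Form.

      x            -- observed sequence, e.g. "ACGU"
      nonterminals -- iterable of non-terminal symbols
      start        -- the start symbol S
      emit         -- dict: (V, char)  -> Pr(V -> char)
      binary       -- dict: (V, X, Y)  -> Pr(V -> X Y)

      Returns (best log-probability, traceback table). Spans are
      0-based and inclusive, unlike the 1-based notes above.
      """
      N = len(x)
      NEG_INF = float("-inf")
      gamma = {}   # gamma[(i, j, V)] = best log-prob of a parse of x[i..j] rooted at V
      back = {}    # back[(i, j, V)]  = (k, X, Y) achieving that maximum

      # Initialization: single characters, gamma(i, i, V) = log Pr(V -> x_i)
      for i in range(N):
          for V in nonterminals:
              p = emit.get((V, x[i]), 0.0)
              gamma[(i, i, V)] = math.log(p) if p > 0 else NEG_INF

      # Iteration: fill longer spans from shorter ones
      for length in range(2, N + 1):
          for i in range(N - length + 1):
              j = i + length - 1
              for V in nonterminals:
                  best, arg = NEG_INF, None
                  for (W, X, Y), p in binary.items():
                      if W != V or p <= 0:
                          continue
                      for k in range(i, j):   # split point, i <= k < j
                          s = gamma[(i, k, X)] + gamma[(k + 1, j, Y)] + math.log(p)
                          if s > best:
                              best, arg = s, (k, X, Y)
                  gamma[(i, j, V)] = best
                  back[(i, j, V)] = arg

      # Termination: best log-probability of a full parse rooted at S
      return gamma[(0, N - 1, start)], back

A traceback through back, starting from (0, N − 1, S) and stopping at single-character spans, recovers the optimal parse tree π*.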
The Inside and Outside Algorithms (Evaluation)

The inside and outside algorithms are analogous to the forward and backward algorithms; they solve the evaluation problem for SCFGs. The inside algorithm calculates

  a(i, j, V) = Pr(x_i, ..., x_j is generated by non-terminal V).

It is very similar to the CYK algorithm, except that we sum instead of maximize over all productions and split points k, and we work with probabilities rather than log-probabilities. The algorithm is as follows:

Initialization:
  For i = 1 to N, for any non-terminal V:
    a(i, i, V) = Pr(V → x_i)

Iteration:
  For i = N − 1 down to 1:
    For j = i + 1 to N:
      For any non-terminal V:
        a(i, j, V) = Σ_X Σ_Y Σ_{i ≤ k < j} a(i, k, X) a(k+1, j, Y) Pr(V → XY)

Termination:
  Pr(x | θ) = a(1, N, S)

The outside algorithm calculates

  b(i, j, V) = Pr(x, excluding x_i ··· x_j, is generated by S and the excluded part is rooted at V).

The algorithm is as follows:

Initialization:
  b(1, N, S) = 1
  For any other non-terminal V: b(1, N, V) = 0

Iteration:
  For i = 1 to N:
    For j = N down to i (skipping the initialized case i = 1, j = N):
      For any non-terminal V:
        b(i, j, V) = Σ_X Σ_Y Σ_{k < i} a(k, i−1, X) b(k, j, Y) Pr(Y → XV)
                   + Σ_X Σ_Y Σ_{k > j} a(j+1, k, X) b(i, k, Y) Pr(Y → VX)

Termination:
  For any i: Pr(x | θ) = Σ_V b(i, i, V) Pr(V → x_i)

The main idea is that if V is the root of the parse subtree for x_i ··· x_j, then there must be a production of the form Y → XV or Y → VX. Consider the first case (the second follows by similar reasoning). Since V is the root of the parse of x_i ··· x_j, X must be the root of the parse of x_k ··· x_{i−1} for some k < i. We take the product of the inside probability a(k, i−1, X) that X generates x_k ··· x_{i−1}, the outside probability b(k, j, Y) of generating x excluding the pieces covered by X and V, and the probability of the production Y → XV. We sum this product over all k < i and over all such productions. The case Y → VX, where V is the left child, is handled the same way by summing over k > j.

Expectation Maximization (Learning)

Parameter training for SCFGs is done very much like for HMMs. There are two cases. In the first case we have annotated examples (a set of sequences together with their RNA structures). In the second case we have only a set of sequences and do not know the corresponding structures. EM can be used in both cases, although a closed-form solution exists for the maximum-likelihood parameters in the first case, so EM would never be used in practice there.

Let c(V) be the expected number of times V is used in the parse of x. We can compute this from one run of the inside and outside algorithms:

  c(V) = (1 / Pr(x | θ)) Σ_{i=1}^{N} Σ_{j=i}^{N} a(i, j, V) b(i, j, V)

Every time V is used in the parse of x, it roots a subtree covering x_i ··· x_j for some i and j. V can show up multiple times, but these events are disjoint, so we can sum the probabilities and normalize by Pr(x | θ) to get the expected number of times V appears in the parse of x.

We can also calculate c(V → XY), the expected number of times the production V → XY appears in the parse of x:

  c(V → XY) = (1 / Pr(x | θ)) Σ_{i=1}^{N} Σ_{j=i}^{N} Σ_{k=i}^{j−1} b(i, j, V) a(i, k, X) a(k+1, j, Y) Pr(V → XY)

Here b(i, j, V) gives the probability of generating x excluding x_i ··· x_j with the excluded subtree rooted at V; Pr(V → XY) is the probability that V branches into the two subtrees X and Y; and a(i, k, X) and a(k+1, j, Y) are the inside probabilities of those two subtrees. The product of these terms is the probability that, in the parse of x, V → XY is used with X covering x_i ··· x_k and Y covering x_{k+1} ··· x_j. Again, we sum these products to get the expected count. The emission counts c(V → a) are obtained analogously, by summing b(i, i, V) Pr(V → x_i) over the positions i with x_i = a and normalizing by Pr(x | θ).

We can now re-estimate the parameters with EM (a sketch of this count computation appears at the end of these notes):

  Pr(V → XY) = c(V → XY) / c(V)
  Pr(V → a) = c(V → a) / c(V)

Stochastic Context Free Grammars vs Hidden Markov Models

The following table compares SCFGs with HMMs:

  Goal                HMM algorithm             SCFG algorithm
  Optimal parse       Viterbi                   CYK
  Estimation          Forward                   Inside
                      Backward                  Outside
  Learning            EM: forward / backward    EM: inside / outside
  Memory complexity   O(NK)                     O(N²K)
  Time complexity     O(NK²)                    O(N³K³)

Here K is the number of states in the HMM, or the number of non-terminals in the SCFG. Notice that SCFGs do not scale as well as HMMs.

There is a recent paper, which won the best paper award at the 2006 International Conference …
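
As promised above, here is a minimal sketch of the expected-count computation used in the EM re-estimation step. It is not from the lecture: it assumes the inside table a and outside table b have already been filled in (for example by direct implementations of the recursions above) and are stored as dictionaries keyed by 0-based, inclusive spans (i, j, V), with an entry (possibly 0.0) for every span and non-terminal; the names expected_counts, binary, and emit are illustrative.

  from collections import defaultdict

  def expected_counts(x, nonterminals, start, binary, emit, a, b):
      """Expected production counts for one training sequence x,
      given inside probabilities a[(i, j, V)] and outside
      probabilities b[(i, j, V)] (0-based, inclusive spans)."""
      N = len(x)
      px = a[(0, N - 1, start)]          # Pr(x | theta)

      c_V = defaultdict(float)           # c(V)
      c_bin = defaultdict(float)         # c(V -> X Y)
      c_emit = defaultdict(float)        # c(V -> a)

      # c(V) = (1/Pr(x)) * sum_{i <= j} a(i, j, V) * b(i, j, V)
      for i in range(N):
          for j in range(i, N):
              for V in nonterminals:
                  c_V[V] += a[(i, j, V)] * b[(i, j, V)] / px

      # c(V -> X Y) = (1/Pr(x)) * sum_{i <= k < j}
      #               b(i, j, V) * a(i, k, X) * a(k+1, j, Y) * Pr(V -> X Y)
      for (V, X, Y), p in binary.items():
          for i in range(N):
              for j in range(i + 1, N):
                  for k in range(i, j):
                      c_bin[(V, X, Y)] += (b[(i, j, V)] * a[(i, k, X)]
                                           * a[(k + 1, j, Y)] * p) / px

      # c(V -> a) = (1/Pr(x)) * sum_{i : x_i = a} b(i, i, V) * Pr(V -> x_i)
      for i in range(N):
          for V in nonterminals:
              p = emit.get((V, x[i]), 0.0)
              if p > 0:
                  c_emit[(V, x[i])] += b[(i, i, V)] * p / px

      return c_V, c_bin, c_emit

In practice the counts would be accumulated over all training sequences before applying the M-step ratios Pr(V → XY) = c(V → XY) / c(V) and Pr(V → a) = c(V → a) / c(V).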

