Stanford CS 262 - Lecture 17


CS262 – Computational Genomics
Lecture 17 – May 29th, 2003: RNA Secondary Structure
Scribed by Rohan Angrish

Characteristics of RNA Secondary Structure

In this lecture we covered two topics: prediction of RNA secondary structure, and detection of RNA. RNA structure is determined mainly by the formation of hydrogen bonds between complementary nucleotides, viz. G-C, A-U and G-U, which form 3, 2 and 1 hydrogen bonds respectively. Fig1 shows the characteristic features of RNA secondary structure.

Fig1. RNA Secondary Structure

The secondary structure determines much of the function of the RNA: for instance, the loops are binding sites, and the stems hold these different binding sites together.

Context Free Grammars (CFGs)

RNA secondary structure is captured well by Context Free Grammars (CFGs), because CFGs are ideal for modeling interactions between distant pairs of letters in a sequence when those interactions are nested: a pair of interacting letters is either completely contained within another interacting pair, or in parallel to it. This is identical in nature to parentheses, where a pair of corresponding parentheses is either completely contained within another pair or in parallel to other pairs. Such structures are modeled very well by CFGs; they cannot be modeled by HMMs or their syntactic counterparts, regular expressions.

Fig2 is an example of a CFG that represents the hairpin loop shown on the right of the figure.

Fig2. Example of a CFG

This simple CFG derives only the sequence shown on the right of Fig2. Exactly how the sequence is derived is shown in Fig3; the red arrow shows the direction in which the derivation proceeds.

Fig3. How the sequence in Fig2 is derived
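The nesting property behind the parentheses analogy is commonly written in "dot-bracket" notation, with '(' and ')' marking paired bases and '.' marking unpaired ones. As a small illustrative sketch (the function name is invented, not from the lecture), the pair list of such a structure can be recovered with a stack, exactly as when matching parentheses:

```python
def pairs_from_dot_bracket(structure):
    """Return the list of (i, j) base-pair positions in a nested structure."""
    stack, pairs = [], []
    for i, c in enumerate(structure):
        if c == "(":
            stack.append(i)                     # remember an open position
        elif c == ")":
            pairs.append((stack.pop(), i))      # a close matches the most recent open
    return pairs

# A 3-base-pair hairpin with a 3-nucleotide loop:
# pairs_from_dot_bracket("(((...)))") -> [(2, 6), (1, 7), (0, 8)]
```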
Now consider the following CFG, which describes general RNA secondary structure:

S → gSc : 3 | cSg : 3
S → aSu : 2 | uSa : 2
S → gSu : 1 | uSg : 1
S → SS : 0
S → aS : 0 | cS : 0 | gS : 0 | uS : 0
S → ε : 0

where ε is the empty string, "". To understand this, read productions of the form a1 S a2 as saying that the letters a1 and a2 are paired with one another, and productions of the form a1 S as saying that a1 is not paired with anything. Each rule is also given a score, which corresponds directly to how "good" the corresponding base pairing is considered; here, each production's score equals the number of hydrogen bonds the corresponding pairing contributes to the overall structure, and the more hydrogen bonds, the more stable the structure.

The Nussinov algorithm, discussed in the previous lecture, finds the optimal parse of a string with this grammar. It is a simple dynamic programming (DP) algorithm, described in Fig4.

Fig4. The Nussinov Algorithm

Here, F(i, j) is the score of the optimal fold, or configuration, of the substring xi…xj. The productions of the CFG that the different steps of the Nussinov algorithm correspond to are shown on the right-hand side of the steps. F(1, N) gives the score of the optimal parse of the whole sequence.

Note that this CFG is a very simplistic model of RNA secondary structure: it does not account for loop size, the composition of the loops, etc. But it gives a reasonable prediction of what the secondary structure is.
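A minimal sketch of the Nussinov recurrence described above, scoring base pairs by hydrogen-bond count as in the grammar (G-C: 3, A-U: 2, G-U: 1). The function name and the minimal-hairpin restriction (at least one unpaired base inside a pair) are illustrative choices, not from the lecture:

```python
# Hydrogen-bond scores for the allowed base pairings.
PAIR_SCORE = {("g", "c"): 3, ("c", "g"): 3,
              ("a", "u"): 2, ("u", "a"): 2,
              ("g", "u"): 1, ("u", "g"): 1}

def nussinov_score(x):
    """F[i][j] = score of the optimal fold of x[i..j] (0-based, inclusive)."""
    n = len(x)
    F = [[0] * n for _ in range(n)]
    for span in range(1, n):                  # substrings of increasing length
        for i in range(n - span):
            j = i + span
            best = max(F[i + 1][j],           # x_i unpaired   (rules aS, cS, ...)
                       F[i][j - 1])           # x_j unpaired
            score = PAIR_SCORE.get((x[i], x[j]))
            if score is not None and span >= 2:           # pair x_i with x_j
                best = max(best, F[i + 1][j - 1] + score)  # rule a1 S a2
            for k in range(i + 1, j):         # bifurcation    (rule S -> SS)
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F[0][n - 1]                        # F(1, N): score of the optimal parse
```

For example, nussinov_score("gggaaaccc") forms the three G-C pairs of the hairpin stem, for a score of 9.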
Stochastic CFGs (SCFGs)

Stochastic CFGs are CFGs in which each production rule has an associated probability; they are very useful for modeling RNA secondary structure. To draw an analogy, stochastic CFGs are to CFGs what HMMs are to finite automata: the counterparts of HMM states are the non-terminals, and the rules with a given non-terminal on the left-hand side are the analogue of the transitions "out of" the corresponding state. Hence, the probabilities of the rules with some non-terminal V on the left-hand side should sum to 1. As an offshoot, it should be clear that a non-terminal with no rules "out of" it would be extremely undesirable, since any derivation reaching that non-terminal could never terminate.

With this analogy, a number of computational problems analogous to those for HMMs become apparent:

- Finding the optimal alignment between a sequence and an SCFG (Decoding)
- Finding the probability that a sequence is generated by a given SCFG (Evaluation)
- Given a set of sequences, estimating the parameters of the SCFG (Learning)

Decoding, for instance, is interesting because the most likely parse gives the locations of the hydrogen bonds in the optimal RNA secondary structure. We now consider each of these problems.

Evaluation

Recall how evaluation was done for HMMs. We defined two dynamic programming (DP) variables, forward and backward:

Forward:  f_l(i) = P(x_1…x_i, π_i = l)
Backward: b_k(i) = P(x_{i+1}…x_N | π_i = k)

Then the probability that the string x = x_1 x_2 … x_N is generated by the model is

P(x) = Σ_k f_k(N) a_{k0} = Σ_k a_{0k} e_k(x_1) b_k(1)

For SCFGs, we define two analogous DP variables, viz. Inside and Outside. The Inside variable a(i, j, V) gives the probability that the substring x_i…x_j is generated by the non-terminal V. The Outside variable b(i, j, V) gives the probability that the entire string x except for the substring x_i…x_j is generated by S (the root non-terminal), with the excluded part rooted at V; note that b(i, j, V) does not by itself give the probability that x_i…x_j is generated by V.
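The Inside variable can be filled in bottom-up over increasing substring lengths, much like the Nussinov table. A hedged sketch, assuming the grammar is already in Chomsky Normal Form (defined in the next section); the dictionary-based grammar representation and the function name are my own illustration:

```python
from collections import defaultdict

def inside(x, unary, binary, start="S"):
    """a[(i, j, V)] = P(V derives x[i..j]) for an SCFG in Chomsky Normal Form.

    unary:  dict V -> {terminal: prob}   for rules V -> a
    binary: dict V -> {(Y, Z): prob}     for rules V -> Y Z
    """
    n = len(x)
    a = defaultdict(float)
    for i in range(n):                          # base case: V -> x_i
        for V, emits in unary.items():
            a[(i, i, V)] = emits.get(x[i], 0.0)
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            for V, rules in binary.items():     # longer substrings need V -> Y Z
                total = 0.0
                for (Y, Z), p in rules.items():
                    for k in range(i, j):       # Y derives x[i..k], Z derives x[k+1..j]
                        total += p * a[(i, k, Y)] * a[(k + 1, j, Z)]
                a[(i, j, V)] = total
    return a[(0, n - 1, start)]                 # P(x) = a(1, N, S)
```

For the toy CNF grammar S → A A (prob 0.5) | a (prob 0.5), A → a (prob 1.0), this gives P("aa") = 0.5 · 1 · 1 = 0.5.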
Chomsky Normal Form (CNF)

We now define a normal form for CFGs called Chomsky Normal Form (CNF). It is useful because, for one, every CFG can be converted to an equivalent CFG in CNF, and moreover, once grammars are assumed to be in CNF, various algorithms can be described and analyzed very elegantly. In CNF, the only productions allowed are of the following two types:

X → YZ
X → a

That is, a non-terminal goes either to two non-terminals or to a single terminal. As an example of converting a CFG to an equivalent one in CNF, consider the following CFG and an associated derivation tree for a string in the grammar.
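The conversion can be sketched mechanically: first replace each terminal in a long right-hand side by a fresh non-terminal, then break right-hand sides longer than two symbols into chains of binary rules. A hedged sketch (helper names like T_a and X_1 are invented, and lowercase symbols are taken to be terminals), applied here to the RNA rule S → aSu:

```python
def to_cnf(rules):
    """Convert rules [(lhs, rhs_tuple), ...] to Chomsky Normal Form.
    Lowercase symbols are terminals; everything else is a non-terminal."""
    out, term_nt, fresh = [], {}, 0

    def nt_for(t):
        # Introduce T_t -> t once per terminal t appearing in a long rule.
        if t not in term_nt:
            term_nt[t] = "T_" + t
            out.append((term_nt[t], (t,)))
        return term_nt[t]

    for lhs, rhs in rules:
        if len(rhs) == 1:                 # already of the form X -> a
            out.append((lhs, rhs))
            continue
        rhs = tuple(s if not s.islower() else nt_for(s) for s in rhs)
        while len(rhs) > 2:               # binarize X -> Y1 Y2 ... Yk
            fresh += 1
            new = "X_%d" % fresh
            out.append((lhs, (rhs[0], new)))
            lhs, rhs = new, rhs[1:]
        out.append((lhs, rhs))
    return out

# to_cnf([("S", ("a", "S", "u"))]) yields the CNF rules:
#   T_a -> a,  T_u -> u,  S -> T_a X_1,  X_1 -> S T_u
```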

