November 09! CSE 182!CSE182-L8 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene FindingNovember 09! CSE 182!QUIZ! • Question: • your ‘friend’ likes to gamble. • He tosses a coin: HEADS, he gives you a dollar. TAILS, you give him a dollar. • Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin. • Can you say if he is cheating? • What fraction of the times does he load the coin?Regular expressions as motifs • What is a regular expression? • Given a regular expression pattern and a database, find all sequences that match the pattern. • Given a sequence as query, and a database of r.e. patterns, find all of the patterns in the sequence. • http://ca.expasy.org/prosite/ November 09! CSE 182!Fa 07! CSE182!Regular Expressions • Concise representation of a set of strings over alphabet ∑. • Described by a string over • R is a r.e. if and only if € Σ,⋅,∗,+{ }€ R = {ε} Base caseR = {σ},σ∈ ΣR = R1+ R2 Union of stringsR = R1⋅ R2 ConcatenationR = R1* 0 or more repetitionsFa 07! CSE182!Regular Expression • Q: Let ∑={A,C,E} – Is (A+C)*EEC* a regular expression? – *(A+C)? – AC*..E? • Q: When is a string s in a regular expression?!– R =(A+C)*EEC*!– Is CEEC in R?!– AEC?!– ACEE?!Fa 07! CSE182!Regular Expression & Automata Every R.E can be expressed by an automaton (a directed graph) with the following properties: – The automaton has a start and end node – Each edge is labeled with a symbol from ∑, or ε Suppose R is described by automaton A! S ∈ R if and only if there is a path from start to end in A, labeled with s.!Fa 07! CSE182!Examples: Regular Expression & Automata • (A+C)*EEC* C!A!C!start! end!E! E!November 09! CSE 182!Constructing automata from R.E • R = {ε} • R = {σ}, σ ∈ ∑ • R = R1 + R2 • R = R1 · R2 • R = R1*November 09! CSE 182!Matching Regular expressions • A string s belongs to R if and only if, there is a path from START to END in RA, labeled by s. • Given a regular expression R (automaton RA), and a database D, is there a string D[b..c] that matches RA (D[b..c] ∈ R) • Simpler Q: Is D[1..c] accepted by the automaton of R?November 09! CSE 182!Alg. For matching R.E. • If D[1..c] is accepted by the automaton RA – There is a path labeled D[1]…D[c] that goes from START to END in RA D[1]!D[2]! D[c]!November 09! CSE 182!Alg. For matching R.E. • If D[1..c] is accepted by the automaton RA – There is a path labeled D[1]…D[c] that goes from START to END in RA – There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END D[1] .. D[c-1]!D[c]!u!November 09! CSE 182!D.P. to match regular expression • Define: – A[u,σ] = Automaton node reached from u after reading σ – Eps(u): set of all nodes reachable from node u using epsilon transitions. – N[c] = subset of nodes reachable from START node after reading D[1..c] – Q: when is v ∈ N[c] ε!November 09! CSE 182!• Q: when is v ∈ N[c]? • A: If for some u ∈ N[c-1], w = A[u,D[c]], • v ∈ {w}+ Eps(w) D.P. to match regular expressionNovember 09! CSE 182!AlgorithmNovember 09! CSE 182!The final step • We have answered the question: – Is D[1..c] accepted by R? – Yes, if END ∈ N[c] • We need to answer – Is D[l..c] (for some l, and some c) accepted by R € D[l..c] ∈ R ⇔ D[1..c] ∈ Σ∗RNovember 09! CSE 182!Regular expressions as Protein sequence motifs A Fam(B) C-X-[DE]-X{10,12}-C-X-C--[STYLV]!C! E! F!• Problem: if there is a mis-match, the sequence is not accepted.November 09! CSE 182!Representation 2: Profiles • Profiles versus regular expressions – Regular expressions are intolerant to an occasional mis-match. – The Union operation (I+V+L) does not quantify the relative importance of I,V,L. It could be that V occurs in 80% of the family members. – Profiles capture some of these ideas.November 09! CSE 182!Profiles • Start with an alignment of strings of length m, over an alphabet A, • Build an |A| X m matrix F=(fki) • Each entry fki represents the frequency of symbol k in position i 0.71!0.14!0.14!0.28!November 09! CSE 182!Profiles • Start with an alignment of strings of length m, over an alphabet A, • Build an |A| X m matrix F=(fki) • Each entry fki represents the frequency of symbol k in position i 0.71!0.14!0.14!0.28!November 09! CSE 182!Scoring matrices • Given a sequence s, does it belong to the family described by a profile? • We align the sequence to the profile, and score it • Let S(i,j) be the score of aligning position i of the profile to residue sj • The score of an alignment is the sum of column scores. s sj iNovember 09! CSE 182!Scoring Profiles € S(i, j) = fkik∑M rk,sj[ ]k i s fki Scoring MatrixNovember 09! CSE 182!Domain analysis via profiles • Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences. • What if the sequence matches some other sequences weakly (using BLAST), but does not match any known profile?November 09! CSE 182!Psi-BLAST idea • Iterate: – Find homologs using Blast on query – Discard very similar homologs – Align, make a profile, search with profile. – Why is this more sensitive? Seq Db!--In the next iteration, the red sequence will be thrown out. --It matches the query in non-essential residuesNovember 09! CSE 182!Psi-BLAST speed • Two time consuming steps. 1. Multiple alignment of homologs 2. Searching with Profiles. 1. Does the keyword search idea work? • Multiple alignment: – Use ungapped multiple alignments only • Pigeonhole principle again: – If profile of length m must score >= T – Then, a sub-profile of length l must score >= lT|/m – Generate all l-mers that score at least lT|/M – Search using an automatonNovember 09! CSE 182!Representation 3: HMMs • Building good profiles relies upon good alignments. – Difficult if there are gaps in the alignment. – Psi-BLAST/BLOCKS etc. work with gapless alignments. • An
View Full Document