L4 Linear space Scoring matrices October 09! CSE 182!October 09! CSE 182!Alignment (Linear Space) • Score computation € i2= i%2; i1= (i −1)%2;S[i2, j] = maxS[i1, j −1] + C(si,tj)S[i1, j] + C(si,−)S[i2, j −1] + C(−,tj) For i = 1 to n! For j = 1 to m !€ S[i, j] = maxS[i −1, j −1] + C(si,tj)S[i −1, j] + C(si,−)S[i, j −1] + C(−,tj) October 09! CSE 182!Linear Space Alignment • In Linear Space, we can do each row of the D.P. • We need to compute the optimum path from the origin (0,0) to (m,n)October 09! CSE 182!Linear Space (cont’d) • At i=n/2, we know scores of all the optimal paths ending at that row. • Define F[j] = S[n/2,j] • One of these j is on the true path. Which one?October 09! CSE 182!Backward alignment • Let Sb[i,j] be the optimal score of aligning s[i+1..n] with t[j+1..m] € Sb[i, j] = maxSb[i + 1, j + 1] + C(si,tj)Sb[i + 1, j] + C(si,−)Sb[i, j + 1] + C(−,tj) • Boundary cases? • Sb[n,j]? Sb[m,j]?October 09! CSE 182!Backward alignment • Let Sb[i,j] be the optimal score of aligning s[i+1..n] with t[j+1..m] • Define B[j] = Sb[n/2,j] • One of these j is on the true path. Which one?October 09! CSE 182!Forward, Backward computation • At the optimal coordinate, j – F[j]+B[j]=S[n,m] • In O(nm) time, and O(m) space, we can compute one of the coordinates on the optimum path.October 09! CSE 182!Linear Space Alignment • Align(1..n,1..m) – For all 1<=j <= m • Compute F[j]=S(n/2,j) – For all 1<=j <= m • Compute B[j]=Sb(n/2,j) – j* = maxj {F[j]+B[j] } – X = Align(1..n/2,1..j*) – Y = Align(n/2+1..n,j*+1..m) – Return X,j*,YOctober 09! CSE 182!Linear Space complexity • T(nm) = c.nm + T(nm/2) = O(nm) • Space = O(m)October 09! CSE 182!Silly Quiz • Name a famous Bioinformatics Researcher • Name a famous Bioinformatics Researcher who is a womanScoring Matrices • We have seen that affine gap penalties help concentrate the gaps in small regions. • What about substitution errors. Are all substitutions alike? October 09! CSE 182!October 09! CSE 182!Scoring DNA • DNA has structure.October 09! CSE 182!DNA scoring matrices • So far, we considered a simple match/mismatch criterion. • The nucleotides can be grouped into Purines (A,G) and Pyrimidines. • Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions)The scoring problem • When comparing hum and dros., more mismatches are likely just by chance.. If we expect to see 70% of the residues to be mutated, then a 50% identity is great. • A different scoring function is needed for hum and drosphila October 09! CSE 182!hum!mus!dros!October 09! CSE 182!Scoring proteins • Scoring protein sequence alignments is a much more complex task than scoring DNA – Not all substitutions are equal • Problem was first worked on by Pauling and collaborators • In the 1970s, Margaret Dayhoff created the first similarity matrices. – “One size does not fit all” – Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant – Different proteins might evolve at different rates and we need to normalize for thatFrequency based scoring • Our goal is to score each column in the alignment • Comparing against expectation: – Think about alignments of pairs of random sequences, and compute the probability that A and B appear together just by chance PR(A,B) – Compute the probability of A and B appearing together in the alignment of related sequences (orthologs) PO(A,B) • A good score function? October 09! CSE 182! A B € logPO(A,B)PR(A,B) October 09! CSE 182!PAM 1 distance • Two sequences are 1 PAM apart if they differ in 1 % of the residues. • PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart] 1% mismatch!October 09! CSE 182!PAM1 matrix • Align many proteins that are very similar – Is this a problem? • 1 PAM evolutionary distance represents the time in which 1% of the residues have changed • Estimate the frequency Pb|a of residue a being substituted by residue b. • PAM1(a,b) = Pa|b = Pr(b will mutate to an a after 1 PAM evolutionary distance) • Scoring matrix – S(a,b) = log10(Pab/PaPb) = log10(Pb|a/Pb)October 09! CSE 182!PAM 1 • Top column shows original, and left column shows replacement residue = PAM1(a,b) = Pr(a|b)October 09! CSE 182!PAM distance • Two sequences are 1 PAM apart when they differ in 1% of the residues. • When are 2 sequences 2 PAMs apart? 1 PAM 1 PAM 2 PAMOctober 09! CSE 182!Generating Higher PAMs • PAM2(a,b) = ∑c PAM1(a,c). PAM1 (c,b) • PAM2 = PAM1 * PAM1 (Matrix multiplication) • PAM250 – = PAM1*PAM249 – = PAM1250 =!a!a!b!c!b!c!PAM2!PAM1!PAM1!October 09! CSE 182!Note: This is not the score matrix: !What happens as you keep increasing the power?!October 09! CSE 182!Scoring residues • A reasonable score function C(a,b) is given as follows: – Look at ‘high quality’ alignments – C(a,b) should be high when a,b are seen together more often than is expected by chance – C(a,b) should be low, otherwise. • How often would you expect to see a,b together just by chance? – Pa Pb • Let Pab be the probability that a and b are aligned in a high-quality alignment • A good scoring function is the log-odds score – C(a,b)= log10 (Pab/PaPb)October 09! CSE 182!Scoring alignments • To compute Pab, we need ‘high-quality’ alignments • How can you get quality alignments? – Use SW (But that needs the scoring function) – Build alignments manually – Use Dayhoff’s theory to extrapolate from high identity alignmentsOctober 09! CSE 182!Scoring using PAM matrices • Suppose we know that two sequences are 250 PAMs apart. • S(a,b) = log10(Pab/PaPb)= log10(Pa|b/Pa) = log10(PAM250(a,b)/Pa) • How does it help? – S250(A,V) >> S1(A,V) – Scoring of hum vs. Dros should be using a higher PAM matrix than scoring hum vs. mus. – An alignment with a smaller % identity could still have a higher score and be more significant hum!mus!dros!October 09! CSE
View Full Document