Unformatted text preview:

Multiple Sequence Alignment CMSC 423 Multiple Sequence Alignment Multiple sequence alignment find more subtle patterns find common patterns between all sequence MSA The multiple sequence alignment problem Input Sequences S1 S2 Sm Let M be a MSA between these sequences Let dM Si Sj be the score of the alignment between Si and Sj implied by M SP Score M i j dM Si Sj Sum of all pairwise alignment scores Goal find M to minimize SP Score M But this is NP hard SP Score in a Picture SP Score M i j dM Si Sj sum of all the scores of the pairwise alignments implied by M Sj dM Si Sj Si MSA A multiple sequence alignment MSA implies a pairwise alignment between every pair of sequences This implied alignment need not be optimal however match 1 a mismatch 1 gap 2 Sequences AT A T AT AT AT AOptimal MSA T AT AT 2 2 4 Optimal Alignment between A and T A T 1 A A A A A A A A A A A A A A A A 1 2 1 1 2 1 1 2 2 1 2 Slow Dynamic Programming Suppose you had just 3 sequences Apply the same DP idea as sequence alignment for 2 sequences but now with a 3 dimensional matrix i 1 j k i j k i j k 1 i 1 j k 1 i 1 j 1 k i 1 j 1 k 1 i j 1 k i j 1 k 1 DP Recurrence for 3 sequences cost x y z A i 1 j 1 k 1 i j k cost x A i 1 j k i cost xi yj A i 1 j 1 k A i j k min cost yj zk A i j 1 k 1 cost yj A i j 1 k cost x z A i 1 j k 1 i k cost zk A i j k 1 i 1 j k Every possible for the gaps 2k pattern i j k 1 i 1 j k 1 i 1 j 1 k i 1 j 1 k 1 i j k i j 1 k i j 1 k 1 Running time n3 subproblems each takes 23 time O n3 time For k sequences nk subproblems each takes 2k time for the max and k2 to compute cost O k2nk2k Even O n3 is often too slow for the length of sequences encountered in practice One solution approximation algorithm Star Alignment Approximation Sj Si dM Si Sj Si Sc SP Score dM Si Sc Star Score i dM Si Sc Star Alignment Algorithm Input sequences S1 S2 Sk Build all O k2 pairwise alignments Let Sc the sequence in S1 S2 Sk that is closest to the others That is choose Sc to minimize i c d Sc Si Iteratively align all other sequences to Sc d Sc Si dM Si Sj Sc Sc Iterative Alignment Build a multiple sequence alignment up from pairwise alignments Start with an alignment between Sc and some other sequence SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL S1 YFPHFDLSHG AQVKG KKVADALTNAVAHVDDMPNAL Add 3rd sequence say S2 and use the SC S2 alignment as a guide adding spaces into the MSA as needed SC S2 alignment SC YFPHF DLS HGSAQVKAHGKKVGDALTLAVGHL DDLPGAL S2 FFPKFKGLTTADQLKKSADVRWHAERII NAVNDAVASMDDTEKMS New SC S1 S2 alignment red gaps added in S1 SC YFPHF DLS HGSAQVKAHGKKVGDALTLAVGHL DDLPGAL S1 YFPHF DLS HG AQVKG KKVADALTNAVAHV DDMPNAL S2 FFPKFKGLTTADQLKKSADVRWHAERII NAVNDAVASMDDTEKMS Continue with S3 S4 Performance Assume the cost function satisfies the triangle inequality cost x y cost x z cost z y Example cost A C cost A T cost T C cost of 1 mutation from A C cost of a mutation from A T and then from T C STAR cost of star alignment under SP score OPT cost of optimal multiple sequence alignment under SP score Theorem If cost satisfies the triangle inequality then STAR 2 OPT Example if optimal alignment has cost 10 the star alignment will have cost 20 Proof 1 Theorem If cost satisfies the triangle inequality then STAR 2OPT STAR 2 OPT For some B we will prove the 2 statements STAR 2B OPT B This will imply STAR 2B 2 OPT B Proof 2 Theorem If cost satisfies the triangle inequality then STAR 2OPT 2 STAR by triangle inequality because STAR alignment is optimal for pairs involving Sc distribute dSTAR Si Sj defn of SP score ij dSTAR Si Sc dSTAR Sc Sj ij d Si Sc d Sc Sj ij ij 2k d Si Sc i d Sc Sj ij d Si Sc sums are the same and each term appears k of sequences times Proof 3 Theorem If cost satisfies the triangle inequality then STAR 2OPT 2 OPT optimal pairwise alignment is pairwise alignment induced by any MSA sum of all cost of all pairwise alignments is the sum of k different stars We chose Sc because it was the lowest cost star dOPT Si Sj ij d Si Sj ij k i d Si Sc defn of SP score End of Proof For some B we will prove the 2 statements This will imply STAR 2B OPT B 2 ST AR 2 OP T 2k k STAR 2B 2 OPT B d Si Sc i i d Si Sc 2k i d Si Sc STAR 2 OPT k i d Si Sc Consensus Sequence For every column j choose c that minimizes i cost c Si j S1 S2 S3 S4 CO typically this means the most common letter YFPHF DLS HGSAQVKAHGKKVG DALTLAVAHLDDLPGAL YFPHF DLS HG AQVKG GKKVA DALTNAVAHVDDMPNAL FFPKFKGLTTADQLKKSADVRWHAERII NAVNDAVASMDDTEKMS LFSFLKGTSEVP QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL YFPHFKDLS HGSAQVKAHGKKVG DALTLAVAHVDDTPGAL Consensus is a summarization of the whole alignment Sometimes the MSA problem is formulated as find MSA M that minimizes Consensus sequence is sometimes used as an estimate for the ancestral sequence i dM CO Si Profiles Another way to summarize an MSA S1 S2 S3 S4 ACG TT GA ATC GTCGA ACGCGA CC ACGCGT TA Column in the alignment Character 1 2 3 4 5 6 7 8 9 A C G T 1 0 0 0 0 0 75 0 25 0 5 0 0 25 0 0 0 0 0 0 75 0 0 75 0 0 25 0 0 0 25 0 75 0 0 0 0 5 0 0 0 0 Call this profile matrix R 0 75 0 25 0 25 0 25 0 0 5 0 0 0 25 0 0 75 0 0 Fraction of time given column had the given character Profile based Alignment gap in profile introduced to better fit sequence R 1 2 3 4 5 6 7 8 9 A 1 0 0 0 0 0 25 0 0 0 75 C 0 0 75 0 25 0 5 0 0 0 25 0 25 0 25 G 0 0 0 75 0 0 75 0 0 0 5 0 T 0 0 25 0 0 0 25 0 75 0 0 25 0 0 0 0 0 5 0 0 0 75 0 0 A C C AG A C GA A i 1 j 1 …


View Full Document

UMD CMSC 423 - Multiple Sequence Alignment

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Loading Unlocking...
Login

Join to view Multiple Sequence Alignment and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Multiple Sequence Alignment and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?