Multiple Sequence AlignmentCMSC 423Multiple Sequence AlignmentMultiple sequence alignment: find more subtle patterns & find common patterns between all sequence.MSAInput: Sequences: S1, S2, ..., SmLet M be a MSA between these sequences.Let dM(Si, Sj) be the score of the alignment between Si and Sj implied by M.SP-Score(M) = ∑i,j dM(Si, Sj) = Sum of all pairwise alignment scores.•Goal: find M to minimize SP-Score(M).•But this is NP-hard.•The multiple sequence alignment problem:SP-Score in a PictureSP-Score(M) = ∑i,j dM(Si, Sj) SiSjdM(Si, Sj) = sum of all the scores of the pairwise alignments implied by M.MSA•A multiple sequence alignment (MSA) implies a pairwise alignment between every pair of sequences.•This implied alignment need not be optimal, however:match = -1, a mismatch = 1, gap = 2Sequences: AT, A, T, AT, ATATA- -TATATOptimal MSA:A TOptimal Alignment between A and T:(A,A), (A,-), (A, A), (A,A), (A, -), (A,A), (A, A) (-,A), (-,A), (A,A)-1 + 2 -1 -1 +2 -1 -1 +2 +2 -1 = +2+2 +2 = 4+1Slow Dynamic ProgrammingSuppose you had just 3 sequences.Apply the same DP idea as sequence alignment for 2 sequences, but now with a 3-dimensional matrix(i, j, k)(i, j-1, k)(i-1, j, k)(i-1, j-1, k)(i-1, j-1, k-1) (i, j-1, k-1)(i-1, j, k-1)(i, j, k-1)DP Recurrence for 3 sequences(i, j, k)(i, j-1, k)(i-1, j, k)(i-1, j-1, k)(i-1, j-1, k-1) (i, j-1, k-1)(i-1, j, k-1)(i, j, k-1)Every possible 2k pattern for the gaps.A[i, j, k]=mincost(xi,yj,zk)+A[i − 1,j− 1,k− 1]cost(xi, −, −)+A[i − 1,j,k]cost(xi,yj, −)+A[i − 1,j− 1,k]cost(−,yj,zk)+A[i, j − 1,k− 1]cost(−,yj, −)+A[i, j − 1,k]cost(xi, −,zk)+A[i − 1,j,k− 1]cost(−, −,zk)+A[i, j, k − 1]Running time•n3 subproblems, each takes 23 time O(n3) time.•For k sequences: nk subproblems, each takes 2k time for the max and k2 to compute cost() O(k2nk2k)•Even O(n3) is often too slow for the length of sequences encountered in practice.•One solution: approximation algorithm.Star Alignment ApproximationSiSjdM(Si, Sj) SP-ScoreStar-Score =ScdM(Si, Sc) Si∑i dM(Si, Sc)Star Alignment Algorithm•Build all O(k2) pairwise alignments.•Let Sc = the sequence in S1, S2, ..., Sk that is closest to the others. That is, choose Sc to minimize:•Iteratively align all other sequences to Sc.Input: sequences S1, S2, ..., Sk∑i≠c d(Sc, Si) Sc Scd(Sc, Si)dM(Si, Sj)Iterative Alignment•Build a multiple sequence alignment up from pairwise alignments.SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGALS1 YFPHFDLSHG-AQVKG--KKVADALTNAVAHVDDMPNALStart with an alignment between Sc and some other sequence:Add 3rd sequence, say S2, and use the SC - S2 alignment as a guide, adding spaces into the MSA as needed. SC YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHL----DDLPGALS1 YFPHF-DLS-----HG-AQVKG--KKVADALTNAVAHV----DDMPNALS2 FFPKFKGLTTADQLKKSADVRWHAERII----NAVNDAVASMDDTEKMSSC YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHL----DDLPGALS2 FFPKFKGLTTADQLKKSADVRWHAERII----NAVNDAVASMDDTEKMSSC - S2 alignment:New {SC, S1, S2} alignment (red gaps added in S1):Continue with S3, S4, ...Performancecost(x,y) ≤ cost(x, z) + cost(z,y)Assume the cost function satisfies the triangle inequality:Example: cost(A, C) ≤ cost(A, T) + cost(T,C)cost of 1 mutation from A→Ccost of a mutation from A→T and then from T→CTheorem. If cost satisfies the triangle inequality, then STAR ≤ 2×OPT.STAR = cost of star alignment under SP-scoreOPT = cost of optimal multiple sequence alignment (under SP-score)Example: if optimal alignment has cost 10, the star alignment will have cost ≤ 20.Proof (1)Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2OPT.STAROPT≤ 2OPT ≥ BSTAR ≤ 2B=⇒STAROPT≤2BB=2For some B we will prove the 2 statements:This will imply:Proof (2)Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2OPT.defn of SP-scoreby triangle inequalitybecause STAR alignment is optimal for pairs involving Scdistribute ∑sums are the same and each term appears ≤ k (# of sequences) times.2 · STAR =�ijdSTAR(Si,Sj)≤�ij(dSTAR(Si,Sc)+dSTAR(Sc,Sj))=�ij(d(Si,Sc)+d(Sc,Sj))=�ijd(Si,Sc)+�ijd(Sc,Sj)≤ 2k�id(Si,Sc)Proof (3)Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2OPT.2 · OPT =�ijdOPT(Si,Sj)≥�ijd(Si,Sj)≥ k�id(Si,Sc)defn of SP-scoreoptimal pairwise alignment is ≤ pairwise alignment induced by any MSAsum of all cost of all pairwise alignments is = the sum of k different stars.We chose Sc because it was the lowest-cost star.End of ProofOPT ≥ BSTAR ≤ 2B=⇒STAROPT≤2BB=2For some B we will prove the 2 statements:This will imply:2 · ST AR ≤ 2k�id(Si,Sc)2 · OPT ≥ k�id(Si,Sc)=⇒STAROPT≤2k�id(Si,Sc)k�id(Si,Sc)=2Consensus SequenceS1 YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL S2 YFPHF-DLS-----HG-AQVKG—GKKVA-----DALTNAVAHVDDMPNAL S3 FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS S4 LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLCO YFPHFKDLS-----HGSAQVKAHGKKVG-----DALTLAVAHVDDTPGALFor every column j,choose c ∈ ∑ that minimizes ∑i cost(c, Si[j])(typically this means the most common letter)•Consensus is a summarization of the whole alignment.•Consensus sequence is sometimes used as an estimate for the ancestral sequence.•Sometimes the MSA problem is formulated as: find MSA M that minimizes:∑i dM(CO, Si)Profiles•Another way to summarize an MSA:S1 ACG-TT-GAS2 ATC-GTCGAS3 ACGCGA-CCS4 ACGCGT-TA123456789ACGT-100000.25000.7500.750.250.5000.250.250.25000.7500.75000.5000.25000.250.7500.2500000.5000.7500Column in the alignmentCharacterFraction of time given column had the given characterCall this profile matrix RProfile-based AlignmentA[i, j] = maxA[i − 1,j− 1] + P (xi,j) align xito column jA[i − 1,j] + gap introduce gap into profileA[i, j − 1] + P (““,j) introduce gap into x1234ACGT-100000.750.250.5000.75000.25000000.55678900.25000.75000.250.250.250.75000.500.250.7500.250000.7500A-C CG GA C Agap in profile introduced to better fit sequenceAP (x, j)=�c∈Σsim(x, c) × R[c, j]Score of matching character x with column j of the profile:sim(x,c) = how similar character x is to character c.R =Recap•Multiple sequence alignments (MSAs) are a fundamental tool. They help reveal subtle patterns, compute consistent distances between sequences, etc.•Quality of MSAs often measured using the SP-score: sum of the scores of the pairwise alignments implied by the MSA.•Same
View Full Document