DOC PREVIEW
UMD CMSC 423 - Multiple Sequence Alignment

This preview shows page 1-2-19-20 out of 20 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 20 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 20 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 20 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 20 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 20 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Multiple Sequence AlignmentCMSC 423Multiple Sequence AlignmentMultiple sequence alignment: find more subtle patterns & find common patterns between all sequence.MSAInput: Sequences: S1, S2, ..., SmLet M be a MSA between these sequences.Let dM(Si, Sj) be the score of the alignment between Si and Sj implied by M.SP-Score(M) = ∑i,j dM(Si, Sj) = Sum of all pairwise alignment scores.•Goal: find M to minimize SP-Score(M).•But this is NP-hard.•The multiple sequence alignment problem:SP-Score in a PictureSP-Score(M) = ∑i,j dM(Si, Sj) SiSjdM(Si, Sj) = sum of all the scores of the pairwise alignments implied by M.MSA•A multiple sequence alignment (MSA) implies a pairwise alignment between every pair of sequences.•This implied alignment need not be optimal, however:match = -1, a mismatch = 1, gap = 2Sequences: AT, A, T, AT, ATATA- -TATATOptimal MSA:A TOptimal Alignment between A and T:(A,A), (A,-), (A, A), (A,A), (A, -), (A,A), (A, A) (-,A), (-,A), (A,A)-1 + 2 -1 -1 +2 -1 -1 +2 +2 -1 = +2+2 +2 = 4+1Slow Dynamic ProgrammingSuppose you had just 3 sequences.Apply the same DP idea as sequence alignment for 2 sequences, but now with a 3-dimensional matrix(i, j, k)(i, j-1, k)(i-1, j, k)(i-1, j-1, k)(i-1, j-1, k-1) (i, j-1, k-1)(i-1, j, k-1)(i, j, k-1)DP Recurrence for 3 sequences(i, j, k)(i, j-1, k)(i-1, j, k)(i-1, j-1, k)(i-1, j-1, k-1) (i, j-1, k-1)(i-1, j, k-1)(i, j, k-1)Every possible 2k pattern for the gaps.A[i, j, k]=mincost(xi,yj,zk)+A[i − 1,j− 1,k− 1]cost(xi, −, −)+A[i − 1,j,k]cost(xi,yj, −)+A[i − 1,j− 1,k]cost(−,yj,zk)+A[i, j − 1,k− 1]cost(−,yj, −)+A[i, j − 1,k]cost(xi, −,zk)+A[i − 1,j,k− 1]cost(−, −,zk)+A[i, j, k − 1]Running time•n3 subproblems, each takes 23 time 󲰛 O(n3) time.•For k sequences: nk subproblems, each takes 2k time for the max and k2 to compute cost() 󲰛 O(k2nk2k)•Even O(n3) is often too slow for the length of sequences encountered in practice.•One solution: approximation algorithm.Star Alignment ApproximationSiSjdM(Si, Sj) SP-ScoreStar-Score =ScdM(Si, Sc) Si∑i dM(Si, Sc)Star Alignment Algorithm•Build all O(k2) pairwise alignments.•Let Sc = the sequence in S1, S2, ..., Sk that is closest to the others. That is, choose Sc to minimize:•Iteratively align all other sequences to Sc.Input: sequences S1, S2, ..., Sk∑i≠c d(Sc, Si) Sc Scd(Sc, Si)dM(Si, Sj)Iterative Alignment•Build a multiple sequence alignment up from pairwise alignments.SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGALS1 YFPHFDLSHG-AQVKG--KKVADALTNAVAHVDDMPNALStart with an alignment between Sc and some other sequence:Add 3rd sequence, say S2, and use the SC - S2 alignment as a guide, adding spaces into the MSA as needed. SC YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHL----DDLPGALS1 YFPHF-DLS-----HG-AQVKG--KKVADALTNAVAHV----DDMPNALS2 FFPKFKGLTTADQLKKSADVRWHAERII----NAVNDAVASMDDTEKMSSC YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHL----DDLPGALS2 FFPKFKGLTTADQLKKSADVRWHAERII----NAVNDAVASMDDTEKMSSC - S2 alignment:New {SC, S1, S2} alignment (red gaps added in S1):Continue with S3, S4, ...Performancecost(x,y) ≤ cost(x, z) + cost(z,y)Assume the cost function satisfies the triangle inequality:Example: cost(A, C) ≤ cost(A, T) + cost(T,C)cost of 1 mutation from A→Ccost of a mutation from A→T and then from T→CTheorem. If cost satisfies the triangle inequality, then STAR ≤ 2×OPT.STAR = cost of star alignment under SP-scoreOPT = cost of optimal multiple sequence alignment (under SP-score)Example: if optimal alignment has cost 10, the star alignment will have cost ≤ 20.Proof (1)Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2OPT.STAROPT≤ 2OPT ≥ BSTAR ≤ 2B=⇒STAROPT≤2BB=2For some B we will prove the 2 statements:This will imply:Proof (2)Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2OPT.defn of SP-scoreby triangle inequalitybecause STAR alignment is optimal for pairs involving Scdistribute ∑sums are the same and each term appears ≤ k (# of sequences) times.2 · STAR =�ijdSTAR(Si,Sj)≤�ij(dSTAR(Si,Sc)+dSTAR(Sc,Sj))=�ij(d(Si,Sc)+d(Sc,Sj))=�ijd(Si,Sc)+�ijd(Sc,Sj)≤ 2k�id(Si,Sc)Proof (3)Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2OPT.2 · OPT =�ijdOPT(Si,Sj)≥�ijd(Si,Sj)≥ k�id(Si,Sc)defn of SP-scoreoptimal pairwise alignment is ≤ pairwise alignment induced by any MSAsum of all cost of all pairwise alignments is = the sum of k different stars.We chose Sc because it was the lowest-cost star.End of ProofOPT ≥ BSTAR ≤ 2B=⇒STAROPT≤2BB=2For some B we will prove the 2 statements:This will imply:2 · ST AR ≤ 2k�id(Si,Sc)2 · OPT ≥ k�id(Si,Sc)=⇒STAROPT≤2k�id(Si,Sc)k�id(Si,Sc)=2Consensus SequenceS1 YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL S2 YFPHF-DLS-----HG-AQVKG—GKKVA-----DALTNAVAHVDDMPNAL S3 FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS S4 LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLCO YFPHFKDLS-----HGSAQVKAHGKKVG-----DALTLAVAHVDDTPGALFor every column j,choose c ∈ ∑ that minimizes ∑i cost(c, Si[j])(typically this means the most common letter)•Consensus is a summarization of the whole alignment.•Consensus sequence is sometimes used as an estimate for the ancestral sequence.•Sometimes the MSA problem is formulated as: find MSA M that minimizes:∑i dM(CO, Si)Profiles•Another way to summarize an MSA:S1 ACG-TT-GAS2 ATC-GTCGAS3 ACGCGA-CCS4 ACGCGT-TA123456789ACGT-100000.25000.7500.750.250.5000.250.250.25000.7500.75000.5000.25000.250.7500.2500000.5000.7500Column in the alignmentCharacterFraction of time given column had the given characterCall this profile matrix RProfile-based AlignmentA[i, j] = maxA[i − 1,j− 1] + P (xi,j) align xito column jA[i − 1,j] + gap introduce gap into profileA[i, j − 1] + P (““,j) introduce gap into x1234ACGT-100000.750.250.5000.75000.25000000.55678900.25000.75000.250.250.250.75000.500.250.7500.250000.7500A-C CG GA C Agap in profile introduced to better fit sequenceAP (x, j)=�c∈Σsim(x, c) × R[c, j]Score of matching character x with column j of the profile:sim(x,c) = how similar character x is to character c.R =Recap•Multiple sequence alignments (MSAs) are a fundamental tool. They help reveal subtle patterns, compute consistent distances between sequences, etc.•Quality of MSAs often measured using the SP-score: sum of the scores of the pairwise alignments implied by the MSA.•Same


View Full Document

UMD CMSC 423 - Multiple Sequence Alignment

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Download Multiple Sequence Alignment
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Multiple Sequence Alignment and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Multiple Sequence Alignment 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?