Princeton COS 551 - Multiple Sequence Alignments

Unformatted text preview:

Multiple Sequence AlignmentsCOS551, Fall 2003Global Multiple Sequence Alignment (MSA)• Ex: MSA of 4 sequences MQPILLLV, MLRLL, MKILLL, and MPPVLILV:MQPILLLVMLR—LL--MK-ILLL-MPPVLILVNo column is all gapsMotivation• Multiple sequence alignments are used for many reasons, including:– to detect regions of variability or conservation in a family of proteins– to provide stronger evidence than pairwise similarity for structural and functional inferences– first step in phylogenetic reconstruction, in RNA secondary structure prediction, and in building profiles (probabilistic models) for protein families or DNA signals.Similarity Measures• For pairwise alignments, we aligned sequences to maximize the similarity score. • With multiple sequences, not obvious what best way to score an alignment is• Sum-of-pairs (SP) is a commonly studied similarity measure for MSAsSum-of-pairs (SP) Measure• Each column is scored by summing the scores of all pairs of symbols in that column. • E.g., match = 1, a mismatch = -1, gap = -2 I-IV= score(I,-) + score(I, I) +score(I,V) + score(-,I) + score (-,V) + score(I,V)= -2 + 1 + -1 + -2 + -2 + -1 = -7Is SP a good measure?• column in alignment : A,A,A,C• SP score = 1+1-1+1-1-1=0• But maybe evolutionary history described by:single C A mutation can explain the data, and thus SP tends to overcount mutationsOptimal pairwise alignments (Review)• Used dynamic programming • If two length n sequences: (n+1) x (n+1) array• Fill out each box in the array by considering what happens in the last column – 3 choices: align last letters from both sequences, align last letter from 1stsequence with gap, align last letter from 2ndsequence with gap – O(n2) algorithmFinding optimal MSAsCan use dynamic programming to find optimal solutionsIf have k sequences of length n, array is of size (n+1)kIn considering last column, have 2k- 1 choices– E.g., align last letters from all sequences; align last letter from one sequence and gaps in all others, etc.• Running time is exponential in the number of sequences ! • Impractical … MSA packages use heuristicsProgressive alignment heuristic• basic idea: compute pairwise alignments and merge alignments consistently• E.g., Align acg, cga, gac. Get optimalpairwise alignments:acg- -acg cga--cga gac- -gacProgressive alignment heuristicMerge using Merge using Merge usingalignments alignments alignmentswith 1stsequence with 3rdsequence with 2ndsequence-acg---cgagac----acgcga---gac-acg---cga---gacOrder of merging matters ! Note once a gap, always a gap …1&21&32&3ClustalW Package• ClustalW is a popular heuristic package for computing MSAs, • Based on progressive alignment• We’ll go over its main ideas via an example of aligning 7 globin sequences• Keep in mind what types of problems the algorithm might have on real data!Progressive Alignment: ClustalW Package1. Determine all pairwise alignments between sequences and determine degrees of similarity between each pair.2. Construct a "rough" similarity tree3. Combine the alignments starting from the most closely related groups to most distantly related groups, while maintaining the "once a gap, always a gap" policy.Step 1: Pairwisealignment & distance• Given k sequences, determine all pairwise global alignments• Use pairwise alignments to determine distances between pairs of sequences– E.g., sequences QKLMN & KLVN, alignment is:QKLMN-KLVNDistance= # mismatches / #cols with no gaps= ¼Underestimate of actual distance!Compute all distancesGlobintype1 2 3 4 5 6 7Hbb_human1 -Hbb_horse2 .17 -Hba_human3 .59 .60 -Hba_horse4 .59 .59 .13 -Myg_whale5 .77 .77 .75 .75 -Cyng_lamprey6 .81 .82 .73 .74 .80 -Lgb_lupus7 .87 .86 .86 .88 .93 .90 --distances between 0 and 1-smaller distances, closer seqsStep 2: Construct “rough” similarity tree• Distance matrix is fed into an algorithm that will build a tree relating these sequences (Neighbor-joining, more in future lecture) • Ideally, path length in tree between sequences is equal to distance in matrix (cannot always maintain this)Neighbor Joining Treedistance between Hbb_human and Hbb_horse tree is .081 + .084 = .165 which is close to .17 from matrixStep 3: Combine alignments• Start from the most closely related groups to most distantly related groups (start from tips to root in tree), while maintaining the "once a gap, always a gap" policy.• E.g., first align hba_human & hba_horse; then hbb_human & hbb_horse; then hba’s with hbb’s; then add to that alignment whale, lamprey and lupus in turnAligning pairs of alignments• Can solve optimally using dynamic programming• Similarity between a column in 2 alignments is now the average similarity between the sequencesAligning AligmentsAlignment 1: ATACCAAlignment 2: TCAFETAT-ETATF-AGTFDScore 1stcolumn of 1stalignment against 2ndcolumn in the other alignments using:= 1/8(score(A,C) + score(A,A) + score(A,A) + score(A,G) + score(C,C) + score(C,A) + score(C,A) + score(C,G))Weighting Sequences• Note that when aligning alignments, we are just averaging over all sequences• If have some very closely related sequences, this is problematic (duplicate information)• Will use tree to weight our sequences, with highly diverged sequences getting larger weightsWeighting Sequences• Use length from root to sequences to compute weights  increased weights for more divergent species• If 2 or more sequences share a branch, length of branch is split amongst sequences  reduced weight for related sequences• Use these weights when scoring alignments of alignments (instead of just averaging equally)Weighting SequencesLgb_lupus: weight of .442Hba_human: weight of .055 + .216/2 + .061/4 + .015/5 + .062/6 = .194Caveats for MSAs and ClustalW• Progressive alignment says nothing about the optimum MSA (sum-of-pairs or any other measure). • Initial errors from "once a gap, always a gap" are propagated/compounded • More than one optimum pairwise alignment possible, yet we are committing ourselves to only one at the outsetCaveats for MSAs and ClustalW• Order in which we add sequences to the alignment (e.g. based on the guide tree) changes alignment.• Parameter setting always an issue with alignments. (Which matrices, gap penalties?) • If any pair of sequences are less than 25% identical, then the alignments are prone to be bad. • In


View Full Document

Princeton COS 551 - Multiple Sequence Alignments

Documents in this Course
Load more
Download Multiple Sequence Alignments
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Multiple Sequence Alignments and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Multiple Sequence Alignments 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?