Protein Multiple Alignment
by Konstantin Davydov

Papers
– MUSCLE: a multiple sequence alignment method with reduced time and space complexity, by Robert C. Edgar
– ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment, by Chuong B. Do, Mahathi S. P.
Mahabhashyam, Michael Brudno, and Serafim Batzoglou

Outline
– Introduction
– Background
– MUSCLE
– ProbCons
– Conclusion

Introduction
What is multiple protein alignment? Given N sequences of amino acids x1, x2, …, xN, insert gaps in each of the xi so that:
– all sequences have the same length
– the score of the global map is maximal

For example:

ACCTGCA        ACCTGCA--
ACTTCAA        AC--TTCAA

Motivation:
– phylogenetic tree estimation
– secondary structure prediction
– identification of critical regions

Background
Aligning two sequences:

ACCTGCA        ACCTGCA--
ACTTCAA        AC--TTCAA

[Figure: dynamic-programming alignment matrix over sequences labelled x, y, and z]

Unfortunately, this can get very expensive: aligning N sequences of length L requires a matrix of size L^N, where each cell in the matrix has 2^N − 1 neighbors. This gives a total time complexity of O(2^N · L^N).

MUSCLE
Basic strategy: a progressive alignment is built, to which horizontal refinement is applied. There are three stages; at the end of each stage a multiple alignment is available, and the algorithm can be terminated.

Three stages:
1. Draft progressive
2. Improved progressive
3. Refinement

Stage 1: Draft Progressive
Similarity measure – calculated using k-mer counting.
[Example: occurrences of the k-mers ATG and CCA are counted across sequences such as ACCATGC, GAATGGTC, and CACAATG]

Distance estimate – based on the similarities, a triangular distance matrix is constructed:

        X1     X2     X3
  X2   0.6
  X3   0.8    0.2
  X4   0.3    0.7    0.5

Tree construction – from the distance matrix we construct a guide tree.
[Figure: candidate tree topologies over X1–X4]

Progressive alignment – a progressive alignment is built by following the branching order of the tree.
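The k-mer similarity measure that seeds stage 1 can be sketched as follows. This is a minimal illustration that counts shared k-mers between two plain sequence strings; the function name and scoring are simplifications, since MUSCLE's actual measure works on a compressed amino-acid alphabet and normalizes the count.

```python
from collections import Counter

def kmer_similarity(x: str, y: str, k: int = 3) -> int:
    """Count k-mers shared by x and y (size of the multiset intersection)."""
    kx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    ky = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # For every k-mer seen in x, credit the smaller of the two counts.
    return sum(min(c, ky[w]) for w, c in kx.items())

# Shared 3-mers between two of the example sequences (both contain ATG):
kmer_similarity("ACCATGC", "CACAATG")
```

Pairwise similarities computed this way feed the triangular distance matrix from which the draft guide tree is clustered.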
This yields a multiple alignment of all input sequences at the root.
[Figure: profiles merged up the guide tree into an alignment of X1, X2, X3, X4]

Stage 2: Improved Progressive
Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated.

Similarity measure – similarity is calculated for each pair of sequences using the fractional identity computed from their mutual alignment in the current multiple alignment.

Tree construction – a tree is constructed by computing a Kimura distance matrix and applying a clustering method to it.

Tree comparison – the new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed.

Progressive alignment – a new progressive alignment is built.

Stage 3: Refinement
Performs iterative refinement.

Choice of bipartition – an edge is removed from the tree, dividing the sequences into two disjoint subsets.

Profile extraction – the multiple alignment of each subset is extracted from the current multiple alignment.
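A minimal sketch of this extraction, assuming the current multiple alignment is given as a list of equal-length gapped strings and the bipartition as row indices; the helper name `extract_profile` is hypothetical, not MUSCLE's API.

```python
def extract_profile(alignment: list[str], rows: list[int]) -> list[str]:
    """Pull out the rows on one side of the bipartition and drop
    columns that contain only indels ('-') within that subset."""
    sub = [alignment[r] for r in rows]
    keep = [c for c in range(len(sub[0])) if any(s[c] != "-" for s in sub)]
    return ["".join(s[c] for c in keep) for s in sub]

# The middle column is all gaps in the extracted subset, so it is dropped:
extract_profile(["A-C", "A-G", "TTT"], [0, 1])  # -> ["AC", "AG"]
```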
Columns made up of indels only are removed.
[Example: profiles for the subsets {X1, X2} and {X3, X4, X5} extracted from the current multiple alignment, with all-gap columns deleted]

Re-alignment – the two profiles are then re-aligned with each other using profile-profile alignment.

Accept/reject – the score of the new alignment is computed; if it is higher than the score of the old alignment, the new alignment is retained, otherwise it is discarded.

MUSCLE Review
Performance, for an alignment of N sequences of length L:
– space complexity: O(N^2 + L^2)
– time complexity: O(N^4 + N·L^2)
– time complexity without refinement: O(N^3 + N·L^2)

Hidden Markov Models (HMMs)
[Figure: pair HMM with a match state (M) and insert states (I, J) emitting an alignment of sequences x and y]

Pairwise Alignment
Viterbi algorithm:
– picks the alignment that is most likely to be the correct alignment
– however, the most likely alignment is not necessarily the most accurate
– alternative: find the alignment of maximum expected accuracy

Lazy Teacher Analogy
Ten students take a 10-question true/false quiz. How do you make the answer key?
– Viterbi approach: use the answer sheet of the best student.
– MEA approach: take a weighted majority vote.
[Figure: the students' answers to question 4, mostly F with two T, alongside their grades from A to C]

Viterbi vs MEA
– Viterbi picks the alignment with the highest chance of being completely correct.
– Maximum expected accuracy picks the alignment with the highest expected number of correct predictions.

ProbCons
Basic strategy: uses hidden Markov models (HMMs) to predict the probability of an alignment, and uses maximum expected accuracy instead of the Viterbi alignment. Five steps.

Notation
Given N sequences S = {s1, s2, …, sN}, a* denotes the optimal alignment.

ProbCons
Step 1: Computation of posterior-probability matrices
Step 2: Computation of expected accuracies
Step 3: Probabilistic consistency transformation
Step 4: Computation of guide tree
Step 5: Progressive alignment
Post-processing step: iterative refinement
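The maximum-expected-accuracy idea behind steps 1 and 2 can be sketched as a Needleman-Wunsch-style recurrence over a precomputed posterior matrix P, where P[i][j] approximates the probability that position i of x aligns to position j of y. This is a simplified illustration under that assumption; the traceback and the HMM computation of P itself are omitted.

```python
def mea_score(P: list[list[float]]) -> float:
    """Dynamic program for the alignment maximizing expected accuracy:
    E[i][j] = max(E[i-1][j-1] + P[i-1][j-1],   # align x_i to y_j
                  E[i-1][j],                   # leave x_i unaligned
                  E[i][j-1])                   # leave y_j unaligned
    """
    n, m = len(P), len(P[0])
    E = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i - 1][j - 1] + P[i - 1][j - 1],
                          E[i - 1][j], E[i][j - 1])
    return E[n][m]
```

Unlike Viterbi, which maximizes the probability of the whole alignment being correct, this recurrence maximizes the expected number of correctly aligned residue pairs, matching the weighted-majority intuition of the lazy-teacher analogy.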