Multiple Sequence AlignmentsMLAGAN: progressive alignment of DNAMLAGAN: main stepsMLAGAN: multi-anchoringWhole-genome alignment Human/Mouse/RatInsertion/Deletion Rate AnalysisHeuristics to improve multiple alignmentsIterative RefinementSlide 9Slide 10Slide 11Slide 12Restricted MDPTree RefinementA* for Multiple AlignmentsSlide 16Consistency – T-CoffeeSlide 18T – Coffee LayoutGenerating Primary LibraryPrimary LibraryCombining the librariesLibrary ExtensionRunning TimeT-Coffee compared with other methodsGene RecognitionReadingGene expressionGene structureSlide 30Slide 31Slide 32Approaches to gene findingHMMs for single species gene finding: Generalized HMMsSlide 35Slide 36Slide 37Better way to do it: negative binomialBiology of SplicingConsensus splice sitesSlide 41Splice Site ModelsSlide 43Multiple Sequence AlignmentsAlgorithmsMLAGAN: progressive alignment of DNAGiven N sequences, phylogenetic treeAlign pairwise, in order of the tree (LAGAN)HumanBaboonMouseRatMLAGAN: main stepsGiven a collection of sequences, and a phylogenetic tree1. Find local alignments for every pair of sequences x, y2. Find anchors between every pair of sequences, similar to LAGAN anchoring3. Progressive alignment•Multi-Anchoring based on reconciling the pairwise anchors•LAGAN-style limited-area DP4. Optional refinement stepsMLAGAN: multi-anchoringXZYZX/YZTo anchor the (X/Y), and (Z) alignments:Whole-genome alignment Human/Mouse/RatInsertion/Deletion Rate AnalysisHeuristics to improve multiple alignments•Iterative refinement schemes•A*-based search•Consistency•Simulated Annealing•…Iterative RefinementOne problem of progressive alignment:•Initial alignments are “frozen” even when new evidence comesExample:x: GAAGTTy: GAC-TTz: GAACTGw: GTACTGFrozen!Now clear correct y = GA-CTTIterative RefinementAlgorithm (Barton-Stenberg):1. Align most similar xi, xj2. Align xk most similar to (xixj)3. Repeat 2 until (x1…xN) are aligned4. For j = 1 to N,Remove xj, and realign to x1…xj-1xj+1…xN5. Repeat 4 until convergenceNote: Guaranteed to convergeIterative RefinementFor each sequence y1. Remove y2. Realign y(while rest fixed)xyzx,z fixed projectionallow y to varyIterative RefinementExample: align (x,y), (z,w), (xy, zw):x: GAAGTTAy: GAC-TTAz: GAACTGAw: GTACTGAAfter realigning y:x: GAAGTTAy: G-ACTTA + 3 matchesz: GAACTGAw: GTACTGAIterative RefinementExample not handled well:x: GAAGTTAy1: GAC-TTAy2: GAC-TTAy3: GAC-TTAz: GAACTGAw: GTACTGARealigning any single yi changes nothingRestricted MDPRun MDP, restricted to radius R from mxyzRunning Time: O(2N RN-1 L)Tree RefinementRun 3D-DP, restricted to radius R from m, for each tree nodexyzRunning Time: ~7R2 LNR: RadiusL: Alignment LengthN: Number of SequencesA* for Multiple AlignmentsReview of the A* algorithmvSTARTGOALg(v)h(v)•g(v) is the cost so far•h(v) is an estimate of the minimum cost from v to GOAL•f(v) ≥ g(v) + h(v) is the minimum cost of a path passing by v1. Expand v with the smallest f(v)2. Never expand v, if f(v) ≥ shortest path to the goal found so farA* for Multiple Alignments•Nodes: Cells in the DP matrix•g(v): alignment cost so far•h(v): sum-of-pairs of individual pairwise alignments•Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignmentsvSTARTGOALg(v)h(v)Consistency – T-Coffeezxyxiyjyj’zkT – Coffee LayoutLALIGN CLUSTALWGenerating Primary LibraryA A A B B B A AC CB B BC C CClustalW Primary Library Lalign Primary Library (10 top scoring non–intersecting Local (Global pairwise alignment) (Pairwise alignment)Library has information for each N(N-1)/2 sequence pairs.Primary LibrarySeq A GARFIELD THE LAST FAT CAT Seq B GARFIELD THE FAST CAT - - - Prim. Weight = 88Seq A GARFIELD THE LAST FA-T CAT Seq C GARFIELD THE VERY FAST --- Prim. Weight = 77Seq A GARFIELD THE LAST FAT CAT Seq D - - - -- - -- THE - - - - FAT CAT Prim. Weight = 100Seq B GARFIELD THE - - - - FAST CAT Seq C GARFIELD THE VERY FAST CAT Prim. Weight = 100Seq B GARFIELD THE FAST CAT Seq D - - - - -- - THE FA-T CAT Prim. Weight = 100Seq C GARFIELD THE VERY FAST CAT Seq D - - - -- -- - THE - - - - FA-T CAT Prim. Weight = 100Combining the librariesSeq A GARFIELD THE LAST FAT CATSeq B GARFIELD THE FAST CAT - - -Primary weight(ClustalW)=88Primary Weight(Lalign)=88W(A(G),B(G)) = 88 + 88 = 176If a pair is duplicated across the two libraries, it is merged into single entry with weight = sum of two weightspairs of residue that did not occur are not present ( weight 0 )Library Extension• Complete extension requires examination of all triplets.• Not all bring information ( eg. A and B through D ).• Weight of a pair = weights gathered through examination of all triplets involving that pair.Running Time•Complexity of entire procedure:O(N2 * L2) + O(N3*L) + O(N3) + O(N*L2)O(N2 L2) - pair-wise library computationO(N3 L) - library extensionO(N3) - computation of NJ treeO(N L2) - progressive alignment computationWhere:L – average sequence lengthN – number of sequencesT-Coffee compared with other methodsMethod Cat1(81) Cat2(23) Cat3(4) Cat4(12) Cat5(11) Total(141)Dialign 71.0 25.2 35.1 74.7 80.4 61.5ClustalW 78.5 32.2 42.5 65.7 74.3 66.4Prrp 78.6 32.5 50.2 51.1 82.7 66.4T-Coffee 80.7 37.3 52.9 83.2 88.7 72.1Gene RecognitionCredits for slides:Marina AlexanderssonLior PachterSerge SaxonovReading•GENSCAN•EasyGene•SLAM•TwinscanOptional:Chris Burge’s ThesisGene expressionProteinRNADNAtranscriptiontranslationCCTGAGCCAACTATTGATGAAPEPTIDECCUGAGCCAACUAUUGAUGAAGene structureexon1exon2 exon3intron1 intron2transcriptiontranslationsplicingexon = codingintron = non-codingFinding genesStart codonATG5’3’Exon 1Exon 2Exon 3Intron 1 Intron 2Stop codonTAG/TGA/TAASplice sitesApproaches to gene finding•HomologyBLAST, Procrustes.•Ab initioGenscan, Genie, GeneID.•HybridsGenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.HMMs for single species gene finding: Generalized HMMsHMMs for gene findingGTCAGAGTAGCAAAGTAGACACTCCAGTAACGCexonexonexonintronintronintergeneintergeneGHMM for gene findingTAA A A A A A A A A A A AA AAT T T T T T T T T T T T T T TG GGG G G G GGGG G G G GC C C C C C CExon1 Exon2 Exon3durationObserved duration timesBetter way to do it: negative binomial•EasyGene:Prokaryoticgene-finderLarsen TS, Krogh A•Negative binomial with n = 3Biology of
View Full Document