Rapid Global AlignmentsSlide 2Saving cells in DPSlide 4The Problem: Find a Chain of Local AlignmentsSparse Dynamic ProgrammingSparse Dynamic Programming – L.I.S.Sparse Dynamic Programming – L.I.S.Sparse LCS expressed as LISSlide 10Sparse Dynamic Programming for LISSparse DP for rectangle chainingSlide 13Slide 14ExampleTime AnalysisExamplesGene RecognitionGene structureSlide 20Slide 21Needles in a HaystackGene FindingSignals for Gene FindingSlide 25Exon and Intron LengthsNucleotide CompositionSlide 28Splice SitesHMMs for Gene RecognitionSlide 31Duration HMMs for Gene RecognitionGenscanUsing Comparative InformationUsing Comparative InformationPatterns of ConservationComparison-based Gene FindersTwinscanSLAM – Generalized Pair HMMNSCAN—Multiple Species Gene PredictionSlide 41Performance ComparisonCONTRASTSlide 44CONTRAST - FeaturesCONTRAST – SVM accuraciesCONTRAST - DecodingCONTRAST - TrainingSlide 49Slide 50CS262 Lecture 9, Win07, BatzoglouRapid Global AlignmentsHow to align genomic sequences in (more or less) linear timeCS262 Lecture 9, Win07, BatzoglouCS262 Lecture 9, Win07, BatzoglouSaving cells in DP1. Find local alignments2. Chain -O(NlogN) L.I.S.3. Restricted DPCS262 Lecture 9, Win07, BatzoglouMethods to CHAIN Local AlignmentsSparse Dynamic ProgrammingO(N log N)CS262 Lecture 9, Win07, BatzoglouThe Problem: Find a Chain of Local Alignments(x,y) (x’,y’)requiresx < x’y < y’Each local alignment has a weightFIND the chain with highest total weightCS262 Lecture 9, Win07, BatzoglouSparse Dynamic Programming15 3 24 16 20 4 24 3 11 1842024311151141820•Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) insteadCS262 Lecture 9, Win07, BatzoglouSparse Dynamic Programming – L.I.S. •Longest Increasing Subsequence•Given a sequence over an ordered alphabetx = x1, …, xm•Find a subsequences = s1, …, sks1 < s2 < … < skCS262 Lecture 9, Win07, BatzoglouSparse Dynamic Programming – L.I.S.Let input be w: w1,…, wnINITIALIZATION:L: last LIS elt. array L[0] = -inf L[1] = w1 L[2…n] = +infB: array holding LIS elts; B[0] = 0P: array of backpointers// L[j]: smallest jth element wi of j-long LIS seen so farALGORITHMfor i = 2 to n { Find j such that L[j – 1] < w[i] ≤ L[j] L[j] w[i]B[j] iP[i] B[j – 1]}That’s it!!!•Running time?CS262 Lecture 9, Win07, BatzoglouSparse LCS expressed as LISCreate a sequence w•Every matching point (i, j), is inserted into w as follows:•For each column j = 1…m, insert in w the points (i, j), in decreasing row i order•The 11 example points are inserted in the order given•a = (y, x), b = (y’, x’) can be chained iffa is before b in w, andy < y’15 3 24 16 20 4 24 3 11 18642 71 8109511342024311151141820xyCS262 Lecture 9, Win07, BatzoglouSparse LCS expressed as LISCreate a sequence ww = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)Consider now w’s elements as ordered lexicographically, where •(y, x) < (y’, x’) if y < y’Claim: An increasing subsequence of w is a common subsequence of x and y15 3 24 16 20 4 24 3 11 18642 71 8109511342024311151141820xyCS262 Lecture 9, Win07, BatzoglouSparse Dynamic Programming for LISExample:w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) L = [L1] [L2] [L3] [L4] [L5] … 1. (4,2)2. (3,3)3. (3,3) (10,5)4. (2,5) (10,5)5. (2,5) (8,6)6. (1,6) (8,6)7. (1,6) (3,7)8. (1,6) (3,7) (4,8)9. (1,6) (3,7) (4,8) (7,9)10. (1,6) (3,7) (4,8) (5,9)11. (1,6) (3,7) (4,8) (5,9) (9,10)Longest common subsequence:s = 4, 24, 3, 11, 1815 3 24 16 20 4 24 3 11 18642 71 8109511342024311151141820xyCS262 Lecture 9, Win07, BatzoglouSparse DP for rectangle chaining•1,…, N: rectangles•(hj, lj): y-coordinates of rectangle j•w(j): weight of rectangle j•V(j): optimal score of chain ending in j•L: list of triplets (lj, V(j), j)L is sorted by lj: smallest (North) to largest (South) valueL is implemented as a balanced binary treeyhlCS262 Lecture 9, Win07, BatzoglouSparse DP for rectangle chainingMain idea: •Sweep through x-coordinates•To the right of b, anything chainable to a is chainable to b•Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining•In L, keep rectangles j sorted with increasing lj-coordinates sorted with increasing V(j) scoreV(b)V(a)CS262 Lecture 9, Win07, BatzoglouSparse DP for rectangle chainingGo through rectangle x-coordinates, from lowest to highest:1. When on the leftmost end of rectangle i:a. j: rectangle in L, with largest lj < hib. V(i) = w(i) + V(j)2. When on the rightmost end of i:a. k: rectangle in L, with largest lk lib. If V(i) > V(k):i. INSERT (li, V(i), i) in Lii. REMOVE all (lj, V(j), j) with V(j) V(i) & lj liijkIs k ever removed?CS262 Lecture 9, Win07, BatzoglouExamplexya: 5c: 3b: 6d: 4e: 225691011121415161. When on the leftmost end of rectangle i:a. j: rectangle in L, with largest lj < hib. V(i) = w(i) + V(j)2. When on the rightmost end of i:a. k: rectangle in L, with largest lk lib. If V(i) > V(k):i. INSERT (li, V(i), i) in Lii. REMOVE all (lj, V(j), j) with V(j) V(i) & lj lia b c d eV5LliV(i)i55a8118c11 12911b1512d1316133CS262 Lecture 9, Win07, BatzoglouTime Analysis1. Sorting the x-coords takes O(N log N)2. Going through x-coords: N steps3. Each of N steps requires O(log N) time:•Searching L takes log N•Inserting to L takes log N•All deletions are consecutive, so log N per deletion•Each element is deleted at most once: N log N for all deletions•Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search treeCS262 Lecture 9, Win07, BatzoglouExamplesHuman Genome BrowserABCCS262 Lecture 9, Win07, BatzoglouGene RecognitionCS262 Lecture 9, Win07, BatzoglouGene structureexon1exon2 exon3intron1 intron2transcriptiontranslationsplicingexon = protein-codingintron = non-codingCodon:A triplet of nucleotides that is converted to one amino acidCS262 Lecture 9, Win07, BatzoglouWhere are the genes?Where are the genes?Where are the genes?Where are the genes?CS262 Lecture 9, Win07, BatzoglouCS262 Lecture 9, Win07, BatzoglouNeedles in a HaystackCS262 Lecture 9, Win07, Batzoglou•Classes of Gene predictorsAb initio•Only look at the genomic DNA of target genomeDe novo•Target genome + aligned informant genome(s)EST/cDNA-based & combined approaches•Use aligned ESTs or cDNAs + any other kind of evidenceGene FindingEXON EXON EXON EXON
View Full Document