Rapid Global AlignmentsSlide 2Main IdeaSaving cells in DPSlide 5The Problem: Find a Chain of Local AlignmentsSparse Dynamic Programming – L.I.S.Sparse LCS expressed as LISSlide 9Sparse Dynamic Programming for LISSparse DP for rectangle chainingSlide 12Slide 13ExampleTime AnalysisExamplesWhole-genome alignment Rat—Mouse—HumanSlide 18Protein ClassificationPDB GrowthProtein classificationProtein structure classificationStructure Classification DatabasesSlide 24PSI-BLASTProfiles & Profile HMMsSlide 27Profile HMMsSlide 29Slide 30Alignment of a protein to a profile HMMHow to build a profile HMMResources on the webClassification with Profile HMMsSlide 35Generative ModelsSlide 37Slide 38Slide 39Discriminative Models -- SVMk-mer based SVMs for protein classificationSVMs will find a few support vectorsBenchmarksCS262 Lecture 15, Win06, BatzoglouRapid Global AlignmentsHow to align genomic sequences in (more or less) linear timeCS262 Lecture 15, Win06, BatzoglouCS262 Lecture 15, Win06, BatzoglouMain IdeaGenomic regions of interest contain islands of similarity, such as genes1. Find local alignments2. Chain an optimal subset of them3. Refine/complete the alignmentSystems that use this idea to various degrees:MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & othersCS262 Lecture 15, Win06, BatzoglouSaving cells in DP1. Find local alignments2. Chain -O(NlogN) L.I.S.3. Restricted DPCS262 Lecture 15, Win06, BatzoglouMethods to CHAIN Local AlignmentsSparse Dynamic ProgrammingO(N log N)CS262 Lecture 15, Win06, BatzoglouThe Problem: Find a Chain of Local Alignments(x,y) (x’,y’)requiresx < x’y < y’Each local alignment has a weightFIND the chain with highest total weightCS262 Lecture 15, Win06, BatzoglouSparse Dynamic Programming – L.I.S.Let input be w: w1,…, wnINITIALIZATION:L: 1-indexed array, L[1] w1B: 0-indexed array of backpointers; B[0] = 0P: array used for traceback// L[j]: smallest last element wi of j-long LIS seen so farALGORITHMfor i = 2 to n { Find j such that L[j] < w[i] ≤ L[j+1] L[j+1] w[i]B[j+1] iP[i] B[j]}That’s it!!!•Running time?CS262 Lecture 15, Win06, BatzoglouSparse LCS expressed as LISCreate a sequence w•Every matching point (i, j), is inserted into w as follows:•For each column j = 1…m, insert in w the points (i, j), in decreasing row i order•The 11 example points are inserted in the order given•a = (y, x), b = (y’, x’) can be chained iffa is before b in w, andy < y’15 3 24 16 20 4 24 3 11 18642 71 8109511342024311151141820xyCS262 Lecture 15, Win06, BatzoglouSparse LCS expressed as LISCreate a sequence ww = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)Consider now w’s elements as ordered lexicographically, where •(y, x) < (y’, x’) if y < y’Claim: An increasing subsequence of w is a common subsequence of x and y15 3 24 16 20 4 24 3 11 18642 71 8109511342024311151141820xyCS262 Lecture 15, Win06, BatzoglouSparse Dynamic Programming for LISExample:w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) L =1. (4,2)2. (3,3)3. (3,3) (10,5)4. (2,5) (10,5)5. (2,5) (8,6)6. (1,6) (8,6)7. (1,6) (3,7)8. (1,6) (3,7) (4,8)9. (1,6) (3,7) (4,8) (7,9)10. (1,6) (3,7) (4,8) (5,9)11. (1,6) (3,7) (4,8) (5,9) (9,10)Longest common subsequence:s = 4, 24, 3, 11, 1815 3 24 16 20 4 24 3 11 18642 71 8109511342024311151141820xyCS262 Lecture 15, Win06, BatzoglouSparse DP for rectangle chaining•1,…, N: rectangles•(hj, lj): y-coordinates of rectangle j•w(j): weight of rectangle j•V(j): optimal score of chain ending in j•L: list of triplets (lj, V(j), j)L is sorted by lj: smallest (top) to largest (bottom) valueL is implemented as a balanced binary treeyhlCS262 Lecture 15, Win06, BatzoglouSparse DP for rectangle chainingMain idea: •Sweep through x-coordinates•To the right of b, anything chainable to a is chainable to b•Therefore, if V(b) > V(a), rectangle a is “useless” – remove it•In L, keep rectangles j sorted with increasing lj-coordinates sorted with increasing V(j) scoreV(b)V(a)CS262 Lecture 15, Win06, BatzoglouSparse DP for rectangle chainingGo through rectangle x-coordinates, from lowest to highest:1. When on the leftmost end of rectangle i:a. j: rectangle in L, with largest lj < hib. V(i) = w(i) + V(j)2. When on the rightmost end of i:a. k: rectangle in L, with largest lk lib. If V(i) V(k):i. INSERT (li, V(i), i) in Lii. REMOVE all (lj, V(j), j) with V(j) V(i) & lj liijkCS262 Lecture 15, Win06, BatzoglouExamplexy1: 53: 32: 64: 45: 22569101112141516CS262 Lecture 15, Win06, BatzoglouTime Analysis1. Sorting the x-coords takes O(N log N)2. Going through x-coords: N steps3. Each of N steps requires O(log N) time:•Searching L takes log N•Inserting to L takes log N•All deletions are consecutive, so log N per deletion•Each element is deleted at most once: N log N for all deletions•Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search treeCS262 Lecture 15, Win06, BatzoglouExamplesHuman Genome BrowserABCCS262 Lecture 15, Win06, BatzoglouWhole-genome alignment Rat—Mouse—HumanCS262 Lecture 15, Win06, BatzoglouNext 2 years: 20+ mammals, & many other animals, will be sequenced & alignedCS262 Lecture 15, Win06, BatzoglouProtein ClassificationCS262 Lecture 15, Win06, BatzoglouPDB GrowthNew PDB structuresCS262 Lecture 15, Win06, BatzoglouProtein classification•Number of protein sequences grow exponentially•Number of solved structures grow exponentially•Number of new folds identified very small (and close to constant)•Protein classification canGenerate overview of structure typesDetect similarities (evolutionary relationships) between protein sequencesMorten Nielsen,CBS, BioCentrum, DTUSCOP release 1.67, Class # folds # superfamilies # familiesAll alpha proteins 202 342 550All beta proteins 141 280 529Alpha and beta proteins (a/b) 130 213 593Alpha and beta proteins (a+b) 260 386 650Multi-domain proteins 40 40 55Membrane & cell surface 42 82 91Small proteins 72 104 162Total 887 1447 2630CS262 Lecture 15, Win06, BatzoglouProtein worldProtein foldProtein structure classificationProtein superfamilyProtein familyMorten Nielsen,CBS, BioCentrum, DTUCS262 Lecture 15, Win06, BatzoglouStructure Classification Databases•SCOPManual classification (A. Murzin)scop.berkeley.edu•CATHSemi manual classification (C. Orengo)www.biochem.ucl.ac.uk/bsm/cath •FSSPAutomatic classification
View Full Document