CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 15
Genome assembly
CMSC423 Fall 2008

Admin
• Project questions?

Questions/answers
• Why do you need a multiple alignment for phylogeny?
• What is the running time of the neighbor-joining algorithm, given k sequences of length L?
• What is the parsimony score of the following tree, and what are the labels at internal nodes?
[Figure: tree with leaves C, T, A, G, T, T; score tables at the five internal nodes, columns A C T G: 2 1 1 2; 2 2 1 1; 2 3 2 2; 4 4 3 4; 5 5 3 5]

Reading assignment
• http://www.cbcb.umd.edu/research/assembly_primer.shtml
• Chapter 4.5 – coverage statistics
• Chapter 8 – genome assembly

Shotgun sequencing
[Figure labels: shearing, sequencing, assembly, original DNA (hopefully)]

Overview of terms
Assembly
Scaffolding

Shortest common superstring problem
Given a set of strings, Σ = (s_1, ..., s_n), determine the shortest string S such that every s_i is a substring of S.
NP-hard
approximations: 4, 3, 2.89, ...
Greedy algorithm (4-approximation)
phrap, TIGR Assembler, CAP
...ACAGGACTGCACAGATTGATAG
        ACTGCACAGATTGATAGCTGA...

Greedy algorithm details
Compute all pairwise overlaps
* Pick best (e.g. in terms of alignment score) overlap
Join corresponding reads
Repeat from * until no more joins possible
• How do you compute an overlap alignment?
• Hint: modify Smith-Waterman dynamic programming algorithm

Repeats (where greedy fails)
[Figure: reads consisting only of A's, drawn against a long run of A's]

Impact of randomness – non-uniform coverage
[Figure: reads, contig and coverage plot; coverage axis 1 to 6]
Imagine raindrops on a sidewalk

Lander-Waterman statistics
L = read length
T = minimum overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L
$E(\#\text{islands}) = N e^{-c\sigma}$
$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
contig = island with 2 or more reads
See chapter 4.5

All pairs alignment
• Needed by the assembler
• Try all pairs – must consider ~ n^2 pairs
• Smarter solution: only n × coverage (e.g. 8) pairs are possible
– Build a table of k-mers contained in sequences (single pass through the genome)
– Generate the pairs from k-mer table (single pass through k-mer table)
[Figure: k-mer table over reads A B C D H I F G E]

Additional pairwise-alignment details
• 4 types of overlaps
• Often – assume first read is “forward”
• Representing the alignment
• Why not store length of overlap?
[Figure labels: Normal, Innie, Outie, Anti-normal; A-hang, B-hang]

Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: reads numbered 1 to 9, their overlap graph and layout; consensus computed from the aligned reads ACCTGA, ACCTGA, AGCTGA, ACCAGA]

Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
[Figure: graph on nodes A to I; genome order A B C D H I F G E]

Sequencing by hybridization
Probes, all possible k-mers: AAAA AAAC AAAG AAAT AACA AACG AACT AAGA ...
Target: AACAGTAGCTAGATG
Hybridized probes: AACA TAGC AGAT ACAG AGCT GATG CAGT GCTA AGTA CTAG GTAG TAGA

Assembling SBH data
Main entity: oligomer (overlap)
Relationship between oligomers: adjacency
ACCTGATGCCAATTGCACT...
CTGAT follows CCTGA (they share 4 nucleotides: CTGA)
Problem: given all the k-mers, find the original string
In assembly: fake the SBH experiment, i.e. break the reads into k-mers

Eulerian circuit
• Eulerian circuit: visit each edge (bridge) exactly once and come back to the start
[Figure: graph with sequence labels ACCTAGATTGAGGTCGCCTAGATTGAGGTCGACCTAGATTGAGGTC]

deBruijn graph
• Nodes – set of k-mers obtained from the reads
• Edges – link k-mers that overlap by k-1 letters
ACCAGTGCA
 CCAGTGCAT
• This formulation particularly useful for very short reads
• Solution – Eulerian path through the graph
• Note – multiple Eulerian paths possible (exponential number) due to repeats

deBruijn graph of Mycoplasma genitalium

Read-length vs. genome
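The de Bruijn / Eulerian-path idea from the last slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's implementation: it uses the common variant in which each distinct k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix (the slide's "nodes are k-mers" picture shifted by one), it ignores k-mer multiplicity and sequencing errors, and it assumes an Eulerian path exists. The function names (`de_bruijn_edges`, `eulerian_path`, `spell`) are hypothetical; the path is found with Hierholzer's algorithm.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_edges(reads, k):
    """Build adjacency lists: each distinct k-mer is one edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix.
    Assumption: every k-mer is used once, so repeat counts are lost."""
    distinct = sorted({km for read in reads for km in kmers(read, k)})
    graph = defaultdict(list)
    for km in distinct:
        graph[km[:-1]].append(km[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge,
    # or at any node with edges if the graph has an Eulerian circuit.
    start = next((n for n in nodes
                  if len(graph.get(n, [])) - in_deg[n] == 1), None)
    if start is None:
        start = next(iter(graph))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]


def spell(path):
    """Spell the string traced by a path of (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# The 4-mer probes from the sequencing-by-hybridization slide.
probes = ["AACA", "TAGC", "AGAT", "ACAG", "AGCT", "GATG",
          "CAGT", "GCTA", "AGTA", "CTAG", "GTAG", "TAGA"]
print(spell(eulerian_path(de_bruijn_edges(probes, 4))))  # AACAGTAGCTAGATG
```

In this small example the repeated 3-mer TAG creates a cycle, but only one Eulerian path uses every edge, so the target is recovered. With longer repeats several Eulerian paths exist and the sketch returns an arbitrary one of them.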
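The Lander-Waterman formulas from the lecture can be evaluated numerically to get a feel for how coverage affects the expected number of islands. The sketch below is a direct transcription of those formulas; the function name `lander_waterman` and the example project parameters (1 Mbp genome, 500 bp reads, 16,000 reads, 50 bp minimum overlap) are made up for illustration.

```python
import math


def lander_waterman(G, L, N, T):
    """Evaluate the Lander-Waterman statistics.

    G: genome size, L: read length, N: number of reads,
    T: minimum detectable overlap (same units as L).
    Returns (coverage, sigma, expected #islands, expected island size).
    """
    c = N * L / G                     # coverage
    sigma = 1 - T / L                 # fraction of a read outside the required overlap
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, sigma, islands, island_size


# Hypothetical project: 8x coverage of a 1 Mbp genome.
c, sigma, islands, island_size = lander_waterman(G=1_000_000, L=500, N=16_000, T=50)
print(f"coverage={c:.1f} sigma={sigma:.2f} "
      f"islands={islands:.1f} island_size={island_size:.0f}")
```

Re-running the function with a smaller N shows the expected number of islands rising sharply as coverage drops, which is the "raindrops on a sidewalk" effect described earlier.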