CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 15: Genome assembly
CMSC423 Fall 2008

Admin
- Project questions?

Questions/answers
- Why do you need a multiple alignment for phylogeny?
- What is the running time of the neighbor-joining algorithm, given k sequences of length L?
- What is the parsimony score of the following tree, and what are the labels at internal nodes?
[Figure: tree with leaf labels C, T, G, T, A, T and five score tables over A, C, T, G: 5535, 4434, 2112, 2211, 2322]

Reading assignment
- http://www.cbcb.umd.edu/research/assembly_primer.shtml
- Chapter 4.5: coverage statistics
- Chapter 8: genome assembly

Shotgun sequencing
[Figure: shearing, sequencing, assembly; the result is the original DNA (hopefully)]

Overview of terms
- Assembly
- Scaffolding

Shortest common superstring problem
- Given a set of strings (s1, ..., sn), determine the shortest string S such that every si is a substring of S.
- Example strings: ACAGGACTGCACAGATTGATAG and ACTGCACAGATTGATAGCTGA
- NP-hard
- Approximations: 4, 3, 2.89
- Greedy algorithm (4-approximation): phrap, TIGR Assembler, CAP

Greedy algorithm details
1. Compute all pairwise overlaps
2. Pick the best overlap (e.g., in terms of alignment score)
3. Join the corresponding reads
4. Repeat from step 2 until no more joins are possible
- How do you compute an overlap alignment? Hint: modify the Smith-Waterman dynamic programming algorithm.

Repeats: where greedy fails
- Genome: AAAAAAAAAAAAAAAAAAAA
- Reads: AAAAAA (15 copies)

Coverage
- Impact of randomness: non-uniform coverage
- Imagine raindrops on a sidewalk
[Figure: reads tiled along a contig, with a coverage axis running from 1 to 6]

Lander-Waterman statistics
- L = read length
- T = minimum overlap
- G = genome size
- N = number of reads
- $c$ = coverage = $NL/G$
- $\sigma = 1 - T/L$
- $E(\#\text{islands}) = N e^{-c\sigma}$
- $E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
- contig = island with 2 or more reads
- See chapter 4.5

All pairs alignment
- Needed by the assembler
- Try all pairs: must consider ~n^2 pairs
- Smarter solution: only n x coverage (e.g., 8) pairs are possible
  - Build a table of k-mers contained in sequences (single pass through the genome)
  - Generate the pairs from the k-mer table (single pass through the k-mer table)
[Figure: reads A to I sharing a k-mer]

Additional pairwise alignment details
- 4 types of overlaps (often assume the first read is "forward"): normal, innie, outie, anti-normal
- Representing the alignment: A-hang, B-hang
- Why not store the length of the overlap?

Overlap-layout-consensus
- Main entity: read
- Relationship between reads: overlap
[Figure: overlap graph over reads 1 to 9, the resulting layout, and a consensus example with the sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA]

Paths through graphs and assembly
- Hamiltonian circuit: visit each node (city) exactly once, returning to the start
[Figure: graph over reads A to I and their placement along the genome]

Sequencing by hybridization
- Probes (all possible k-mers): AAAA, AAAC, AAAG, AAAT, AACA, AACG, AACT, AAGA, and so on
- Sequence: AACAGTAGCTAGATG
- k-mers of the sequence: AACA, TAGC, AGAT, ACAG, AGCT, GATG, CAGT, GCTA, AGTA, CTAG, GTAG, TAGA

Assembling SBH data
- Main entity: oligomer (overlap)
- Relationship between oligomers: adjacency
- Example: in ACCTGATGCCAATTGCACT, CTGAT follows CCTGA (they share 4 nucleotides: CTGA)
- Problem: given all the k-mers, find the original string
- In assembly: fake the SBH experiment by breaking the reads into k-mers

Eulerian circuit
- Eulerian circuit: visit each edge (bridge) exactly once and come back to the start
- Example: ACCTAGATTGAGGTCG, with its prefix ACCTAGATTGAGGTC and suffix CCTAGATTGAGGTCG

de Bruijn graph
- Nodes: set of k-mers obtained from the reads
- Edges: link k-mers that overlap by k-1 letters (e.g., ACCAGTGCA and CCAGTGCAT)
- This formulation is particularly useful for very short reads
- Solution: Eulerian path through the graph
- Note: multiple Eulerian paths are possible (an exponential number) due to repeats

de Bruijn graph of Mycoplasma genitalium
[Figure]

Read length vs. genome complexity
[Figure]
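
The greedy algorithm from the "Greedy algorithm details" slide can be sketched in a few lines. This is a minimal illustration, not how phrap, TIGR Assembler, or CAP are implemented: it assumes error-free reads, scores an overlap by its exact length instead of an alignment score, and ignores reverse complements. The names `exact_overlap`, `greedy_assemble`, and the `min_len` threshold are hypothetical.

```python
def exact_overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly join the pair of sequences with the longest exact overlap."""
    # Assumption: identical reads carry no extra information, so drop duplicates.
    contigs = list(dict.fromkeys(reads))
    while True:
        best_len, best_i, best_j = 0, None, None
        # Step 1 and 2: compute all pairwise overlaps and pick the best one.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = exact_overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no more joins possible
        # Step 3: join the two sequences, keeping the shared part only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


reads = ["ACAGGACTGCACAGATTGATAG", "ACTGCACAGATTGATAGCTGA"]
print(greedy_assemble(reads))
```

On the repeat example (15 copies of AAAAAA drawn from a 20-base run of A), this sketch returns a single 6-base contig, which is the collapse that the "where greedy fails" slide points at.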
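
For the hint about modifying Smith-Waterman to compute an overlap alignment, one possible answer is sketched below. A suffix of read `a` is aligned against a prefix of read `b`: skipping the start of `a` is free, the alignment must run to the end of `a`, and `b` may stop anywhere. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration only.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of aligning a suffix of a with a prefix of b.

    Returns (score, j), where j is the number of characters of b
    covered by the overlap.
    """
    n, m = len(a), len(b)
    # Row 0: b's prefix aligned against nothing costs gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0 stays 0: an unaligned prefix of a is not penalized.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # gap in b
                         cur[j - 1] + gap)  # gap in a
        prev = cur
    # The alignment must reach the end of a; b may end at any column.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j


print(overlap_alignment("ACAGGACTGCACAGATTGATAG", "ACTGCACAGATTGATAGCTGA"))
```

Unlike local alignment, negative cells are not reset to zero here; only the boundary conditions and the place where the final score is read differ from global alignment.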
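
The Lander-Waterman quantities are easy to evaluate numerically. The sketch below assumes the standard forms of the statistics, with coverage c = NL/G, sigma = 1 - T/L, expected number of islands N * exp(-c * sigma), and expected island size L * ((exp(c * sigma) - 1) / c + 1 - sigma). The function name and the example numbers are hypothetical.

```python
import math


def lander_waterman(G, N, L, T):
    """Return (coverage, expected number of islands, expected island size)."""
    c = N * L / G                  # coverage
    sigma = 1 - T / L              # fraction of a read not needed for overlap
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size


# Hypothetical project: 1 Mbp genome, 10,000 reads of 500 bp, 50 bp minimum overlap.
c, islands, size = lander_waterman(G=1_000_000, N=10_000, L=500, T=50)
print(f"coverage={c:.1f}  islands={islands:.0f}  island size={size:.0f} bp")
```

With these numbers the coverage is 5x, and the model predicts on the order of a hundred islands of roughly 9 kbp each, which shows how quickly gaps close as coverage grows.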
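
The k-mer table idea from the "All pairs alignment" slide can be sketched as follows: one pass over the reads builds the table, and one pass over the table emits candidate pairs. This is a simplified illustration that assumes forward-strand reads only and treats any shared k-mer as evidence of a possible overlap; the function name `candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return the set of read-index pairs (i, j), i < j, that share a k-mer."""
    # Pass 1: map each k-mer to the set of reads that contain it.
    table = defaultdict(set)
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            table[read[pos:pos + k]].add(idx)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate.
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ACGTAC", "GTACGG", "TTTTTT"]
print(candidate_pairs(reads, k=4))  # reads 0 and 1 share GTAC
```

Only the candidate pairs then need a full overlap alignment, which is what avoids the quadratic all-against-all comparison. Highly repeated k-mers would still generate many pairs, so a practical version would probably cap or skip them.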
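
The de Bruijn formulation can be tried out on the sequence from the sequencing-by-hybridization slide. The sketch below is one possible reading of the slide: nodes are k-mers, and an edge is added between two k-mers when the (k+1)-mer joining them occurs in a read (that edge rule is an assumption). The Eulerian path is found with Hierholzer's algorithm, a standard technique that the slides do not name. All function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are k-mers; each distinct (k+1)-mer in a read adds one edge."""
    graph, seen = defaultdict(list), set()
    for read in reads:
        for i in range(len(read) - k):
            edge = read[i:i + k + 1]
            if edge not in seen:
                seen.add(edge)
                graph[edge[:-1]].append(edge[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit the node
    return path[::-1]


def spell(path):
    """Reconstruct the string spelled by a path of overlapping k-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


graph = de_bruijn(["AACAGTAGCTAGATG"], k=3)
print(spell(eulerian_path(graph)))
```

In this example the 3-mer TAG occurs twice, yet only one Eulerian path exists, so the original string is recovered. With longer repeats several paths become valid, which is the ambiguity noted on the slide.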