UMD CMSC 423 - Lecture 15 Genome assembly

CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 15
Genome assembly
CMSC423 Fall 2008

Admin
• Project questions?

Questions/answers
• Why do you need a multiple alignment for phylogeny?
• What is the running time of the neighbor-joining algorithm, given k sequences of length L?
• What is the parsimony score of the following tree, and what are the labels at internal nodes?
[Figure: tree with leaves C, T, A, G, T, T and five score tables over (A, C, T, G): (2, 1, 1, 2), (2, 2, 1, 1), (2, 3, 2, 2), (4, 4, 3, 4), (5, 5, 3, 5)]

Reading assignment
• http://www.cbcb.umd.edu/research/assembly_primer.shtml
• Chapter 4.5 – coverage statistics
• Chapter 8 – genome assembly

Shotgun sequencing
[Figure labels: shearing, sequencing, assembly, original DNA (hopefully)]

Overview of terms
[Figure labels: Assembly, Scaffolding]

Shortest common superstring problem
Given a set of strings $\Sigma = (s_1, \ldots, s_n)$, determine the shortest string $S$ such that every $s_i$ is a substring of $S$.
• NP-hard
• Approximations: 4, 3, 2.89, ...
• Greedy algorithm (4-approximation)
• phrap, TIGR Assembler, CAP
Example of two overlapping reads:
...ACAGGACTGCACAGATTGATAG
        ACTGCACAGATTGATAGCTGA...

Greedy algorithm details
Compute all pairwise overlaps
* Pick best (e.g. in terms of alignment score) overlap
Join corresponding reads
Repeat from * until no more joins possible
• How do you compute an overlap alignment?
• Hint: modify the Smith-Waterman dynamic programming algorithm

Repeats (where greedy fails)
[Figure: a long run of A's with shorter all-A reads]

Impact of randomness – non-uniform coverage
[Figure labels: Reads, Contig, Coverage (axis 1 to 6)]
Imagine raindrops on a sidewalk

Lander-Waterman statistics
L = read length
T = minimum overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L
$E(\#\text{islands}) = N e^{-c\sigma}$
$E(\text{island size}) = L\left[\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right]$
contig = island with 2 or more reads
See chapter 4.5

All pairs alignment
• Needed by the assembler
• Try all pairs – must consider ~ $n^2$ pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
  – Build a table of k-mers contained in sequences (single pass through the genome)
  – Generate the pairs from k-mer table (single pass through k-mer table)
[Figure labels: k-mer; A, B, C, D, H, I, F, G, E]

Additional pairwise-alignment details
• 4 types of overlaps: Normal, Innie, Outie, Anti-normal
• Often – assume first read is "forward"
• Representing the alignment: A-hang, B-hang
• Why not store length of overlap?

Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: overlap graph and layouts over reads 1 to 9; consensus example with aligned reads ACCTGA, ACCTGA, AGCTGA, ACCAGA]

Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
[Figure labels: graph nodes A, B, D, C, E, H, G, I, F; ordering A B C D H I F G E; Genome]

Sequencing by hybridization
Probes – all possible k-mers: AAAA AAAC AAAG AAAT AACA AACG AACT AAGA ...
Example sequence: AACAGTAGCTAGATG
Its 4-mers: AACA TAGC AGAT ACAG AGCT GATG CAGT GCTA AGTA CTAG GTAG TAGA

Assembling SBH data
Main entity: oligomer (overlap)
Relationship between oligomers: adjacency
ACCTGATGCCAATTGCACT...
CTGAT follows CCTGA (they share 4 nucleotides: CTGA)
Problem: given all the k-mers, find the original string
In assembly: fake the SBH experiment – break the reads into k-mers

Eulerian circuit
• Eulerian circuit: visit each edge (bridge) exactly once and come back to the start
[Figure: sequence example; recoverable text: ACCTAGATTGAGGTCGCCTAGATTGAGGTCGACCTAGATTGAGGTC]

de Bruijn graph
• Nodes – set of k-mers obtained from the reads
• Edges – link k-mers that overlap by k-1 letters
  Example: ACCAGTGCA and CCAGTGCAT
• This formulation is particularly useful for very short reads
• Solution – Eulerian path through the graph
• Note – multiple Eulerian paths are possible (exponential number) due to repeats

de Bruijn graph of Mycoplasma genitalium
[Figure]

Read-length vs. genome
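
The loop on the "Greedy algorithm details" slide (compute overlaps, pick the best, join, repeat) can be sketched in Python. This is only an illustration under simplifying assumptions that are not in the slides: overlaps are exact suffix-prefix matches scored by length rather than alignment scores, reads are all on the forward strand, and the names `overlap`, `greedy_assemble`, and `min_len` are hypothetical.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=1):
    """Repeatedly join the pair of reads with the longest exact overlap.

    Assumption: overlap length stands in for the alignment score
    mentioned on the slide; sequencing errors are ignored.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Drop reads fully contained in another read.
        reads = [r for r in reads if not any(r != s and r in s for s in reads)]
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return reads  # no more joins possible
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)


# The two reads from the shortest common superstring slide overlap by 17 bases.
contigs = greedy_assemble(["ACAGGACTGCACAGATTGATAG", "ACTGCACAGATTGATAGCTGA"])
print(contigs)
```

On a repeat such as the all-A example, many overlaps tie for "best", so this greedy choice can join reads that are not adjacent in the genome.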
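
The Lander-Waterman quantities defined on the statistics slide can be evaluated directly. The sketch below uses the slide's symbols (L, T, G, N, c, σ). The function name `lander_waterman` and the example numbers are hypothetical, chosen only to give a coverage of 8.

```python
import math


def lander_waterman(L, T, G, N):
    """Expected island statistics under the Lander-Waterman model.

    L = read length, T = minimum overlap, G = genome size, N = number of reads.
    """
    c = N * L / G                      # coverage
    sigma = 1 - T / L
    islands = N * math.exp(-c * sigma)                                # E(#islands)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)     # E(island size)
    return {"c": c, "sigma": sigma, "islands": islands, "island_size": island_size}


# Hypothetical project: 16,000 reads of 500 bp, 50 bp minimum overlap, 1 Mbp genome.
stats = lander_waterman(L=500, T=50, G=1_000_000, N=16_000)
print(stats)
```

With these assumed inputs the model predicts roughly a dozen islands, and halving the number of reads raises the expected island count sharply, which is the point of the raindrops analogy.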
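
The k-mer table idea from the "All pairs alignment" slide (one pass to build the table, one pass over the table to emit pairs) might look like the following. It is a minimal sketch that assumes forward-strand reads only and treats any shared k-mer as enough to propose a pair. `candidate_pairs` is a hypothetical name, and the reads are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads that share at least one k-mer."""
    table = defaultdict(set)           # k-mer -> indices of reads containing it
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            table[read[pos:pos + k]].add(idx)
    pairs = set()
    for ids in table.values():         # single pass through the k-mer table
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ACCTGATG", "TGATGCCA", "GGGTTTAA"]   # made-up reads
print(candidate_pairs(reads, k=4))
```

Only the pairs returned here would then be passed to the more expensive overlap alignment, instead of all ~ n^2 pairs.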
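
To make the SBH and de Bruijn slides concrete, the sketch below rebuilds the SBH example sequence from its 4-mers by walking an Eulerian path. Note the swapped-in convention: the slides put k-mers on nodes, while this code uses the common equivalent form in which each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The path is found with Hierholzer's algorithm, and all function names are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(kmer_list):
    """Edge-centric de Bruijn graph: prefix (k-1)-mer -> list of suffix (k-1)-mers."""
    graph = defaultdict(list)
    for km in kmer_list:
        graph[km[:-1]].append(km[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path or circuit exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1), None)
    if start is None:
        start = next(iter(graph))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate a node path into the sequence it spells."""
    return path[0] + "".join(node[-1] for node in path[1:])


spectrum = sorted(kmers("AACAGTAGCTAGATG", 4))   # order information discarded
print(spell(eulerian_path(de_bruijn(spectrum))))
```

For this sequence the Eulerian path is unique, so the original string is recovered. With longer repeats several paths exist, which is the ambiguity noted on the de Bruijn graph slide.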

