UMD CMSC 858W - Genome assembly - D2734640

School: University of Maryland, College Park

Course: CMSC 858W - Advanced Topics in Theory of Computing: Algorithms for Biosequence Analysis

Pages 5

Genome assembly
(for more info see: http://www.cbcb.umd.edu/research/assembly_primer.shtml)

Introduction

Sequencing technologies can only "read" short fragments of a genome. Reconstructing the entire genome sequence therefore requires joining these fragments together in a jigsaw-puzzle-like process. For the reconstruction to be possible at all, the individual sequences must be sampled from random locations in the genome, and enough of them must be sampled to ensure that they overlap, i.e. that enough information is available to decide which sequences should be joined together. The process that generates the sequences is called "shotgun sequencing": a collection of copies of the genome of interest is randomly sheared (through a physical process) into small fragments.

Is assembly possible: Lander-Waterman statistics

It is not obvious that assembling a genome from small pieces should be possible. Because the process that generates the sequences is random, certain parts of the genome may remain uncovered unless an impractically large number of sequences is generated. To assess the theoretical feasibility of assembling shotgun sequencing data, Eric Lander and Mike Waterman developed a statistical analysis based on Poisson statistics. Briefly, if events occur uniformly at random (e.g. the start of a sequencing fragment along a genome can be assumed to be chosen uniformly at random), the number of events occurring within a given interval follows a Poisson distribution.
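
The uniform-sampling assumption can be checked with a small simulation. The sketch below is only an illustration, not part of the original notes: the function names, the circular genome (used to avoid edge effects) and the parameter values are all assumptions. It drops reads at uniformly random start positions and measures the fraction of positions that no read covers. Under the Poisson model, a position covered on average by c reads is left uncovered with probability e^(-c).

```python
import math
import random


def uncovered_fraction(genome_len, read_len, num_reads, seed=0):
    """Simulate shotgun reads on a circular genome (an assumption made to
    avoid edge effects) and return the fraction of uncovered positions."""
    rng = random.Random(seed)
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_len + 1)
    for _ in range(num_reads):
        start = rng.randrange(genome_len)  # uniformly random start position
        end = start + read_len
        diff[start] += 1
        if end <= genome_len:
            diff[end] -= 1
        else:
            # The read wraps around the end of the circular genome.
            diff[0] += 1
            diff[end - genome_len] -= 1
    uncovered = 0
    depth = 0
    for pos in range(genome_len):
        depth += diff[pos]  # number of reads covering this position
        if depth == 0:
            uncovered += 1
    return uncovered / genome_len


def poisson_uncovered(coverage):
    """Probability of zero events when the expected number is `coverage`."""
    return math.exp(-coverage)


if __name__ == "__main__":
    genome_len, read_len, num_reads = 500_000, 100, 10_000
    coverage = num_reads * read_len / genome_len
    print("simulated:", uncovered_fraction(genome_len, read_len, num_reads))
    print("Poisson:  ", poisson_uncovered(coverage))
```

With these hypothetical values the coverage is 2-fold, and the two printed numbers should agree closely. Raising `num_reads` shows how quickly the uncovered fraction shrinks as coverage grows.
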
Given an average "arrival rate" λ (the number of events occurring within a given interval of time), the probability that exactly n events occur within the same interval is given by the formula:

$$f(n, \lambda) = \frac{\lambda^n e^{-\lambda}}{n!}$$

In the context of sequencing we are interested in intervals that contain no events (n = 0), since these represent gaps in the coverage of the genome by sequences.

The Lander-Waterman statistics estimate the number of gaps in coverage (equivalently, the number of contiguous DNA segments, or contigs) that can be expected given the following set of parameters:

• G - genome length
• n - number of sequences
• L - length of sequences
• c = nL/G - depth of coverage (number of times the genome is over-sampled by the set of sequences)
• t - the amount by which two sequences need to overlap for the overlap to be detected computationally
• σ = (L-t)/L

Among other numbers, the L-W statistics provide an estimate for the expected number of contigs:

$$n e^{-c\sigma}$$

As the figure (Drawing 1) shows, the expected number of contigs decreases rapidly once coverage exceeds about 8-10-fold, i.e. after over-sampling a genome by about 10 times, the assembly should be theoretically possible.

Shortest common superstring and greedy algorithm

A simple formulation of assembly as an optimization problem is the Shortest Common Superstring (SCS) problem: find the shortest string that contains all the sequences as substrings. In other words, find the most parsimonious "explanation" for the set of sequences. A fair amount of work went into this problem in the 1980s and 1990s: the problem was shown to be NP-hard, and a series of approximation algorithms were developed whose approximation factors approach 2 (the reconstructed string is at most twice as long as the optimal one).
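
To make the SCS problem statement concrete, here is a minimal brute-force sketch for a handful of reads. It is an illustration only, and the function names and example reads are hypothetical. It first discards reads contained in other reads, then tries every ordering and merges neighbours by their longest suffix-prefix overlap. This is exact only for tiny inputs, since the number of orderings grows factorially, in line with the NP-hardness noted above.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(reads: list[str]) -> str:
    """Exact SCS by exhaustive search over orderings (tiny inputs only)."""
    unique = sorted(set(reads))
    # A read contained in another read cannot change the superstring.
    kept = [r for r in unique if not any(r != s and r in s for s in unique)]
    if not kept:
        return ""
    best = None
    for order in permutations(kept):
        merged = order[0]
        for nxt in order[1:]:
            # Append only the part of `nxt` not already present at the end.
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


if __name__ == "__main__":
    # Hypothetical reads sampled from the toy "genome" ATGCGTAC.
    reads = ["GTAC", "ATGCG", "GCGTA"]
    print(shortest_common_superstring(reads))
```

The exhaustive search is what makes this sketch impractical for real data, which is why approximation algorithms and heuristics are used instead.
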
A simple greedy algorithm can be proven to yield a 4-approximation; however, it is conjectured that this algorithm is actually a 2-approximation, given that no example has yet been found where the greedy assembly algorithm produces a worse result.

The greedy assembly algorithm proceeds as follows:

1. Compare all sequences in a pairwise fashion to identify sequences that overlap each other.
2. Pick the two sequences that overlap each other the best and merge them.
3. Repeat step 2 until no more sequences can be merged, or the remaining overlaps conflict with existing contigs.

[Drawing 1: Lander-Waterman: expected # of contigs given coverage]

While this algorithm is only an approximation, it has been extremely successful in practice: most early genome assemblers (phrap, TIGR Assembler, CAP) relied on this simple greedy heuristic and successfully reconstructed a range of genomes.

Graph formulations for assembly

The parsimony criterion implicit in the SCS problem is easily seen to be inappropriate in a biological setting, primarily because of the presence of repeated DNA sequences (repeats) within the genomes of most organisms. Any algorithm that attempts to solve the SCS problem would collapse these redundant sequences into a single "unit".

Instead, other optimization criteria have been proposed that attempt to capture the biological nature of the problem. Myers proposed, for example, that we should phrase the assembly problem as the task of reconstructing a layout of the sequences that is consistent (in terms of Kolmogorov statistics) with the characteristics of the random process that generated the sequences.
Unfortunately this formulation is hard to translate into a practical algorithm, though it is important to keep in mind, especially in the context of validation.

Instead, most modern assemblers formulate assembly as a graph traversal problem.

Overlap-layout-consensus/string graph

A first formulation builds a graph in which each sequence is a separate node, with an edge between any two nodes whose corresponding sequences overlap. In this formulation, we want to find a traversal of the graph that uses every sequence exactly once (the same sequence cannot be used in multiple places in the genome), i.e. we are looking for a Hamiltonian path, a well-known NP-hard problem. Recently, Myers has shown that through a few simplifications, including the removal of transitive edges, the problem can be rephrased as a Chinese Postman problem (find the shortest tour that traverses each edge in a graph at least once), a problem that can be solved in polynomial time. The transformed graph is termed the assembly string graph, and the paper describing this concept is listed on the course website.

De Bruijn graph

An alternative formulation of the assembly problem arose from early explorations of a sequencing technology called "sequencing by hybridization". In this approach, one could find out whether a given k-mer occurred in the genome being sequenced, i.e. the assembly problem
