DNA Sequencing
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 8, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm

Outline
• The Basic Shotgun Sequencing Strategy
• The shortest superstring problem
– Graph algorithms
• Sequencing by Hybridization

The Basic Shotgun Sequencing Strategy
Step 1: Fragment Sequencing
• Cut the genomic segment many times at random (Shotgun)
• Get one or two reads (~500 bp each) from each segment

The Basic Shotgun Sequencing Strategy
Step 2: Fragment Assembly
• Cover the region with ~7-fold redundancy
• Overlap reads and extend to reconstruct the original genomic region

Step 1: Generating Reads (Sanger Method)
1. Start at primer (restriction site)
2. Grow DNA chain
3. Include ddNTPs
4. Reaction stops at all possible points
5. Separate products by length, using gel electrophoresis

Step 2: Shortest Superstring Problem
• Problem: Given a set of strings, find a shortest string that contains all of them
• Input: Strings s1, s2, …, sn
• Output: A string s that contains all strings s1, s2, …, sn as substrings, such that the length of s is minimized
• Complexity: NP-complete
How likely is it that the found s is indeed the original genome? s approaches the genome as n increases, if there is no sequencing error and fragmentation is random.

Shortest Superstring Problem: Example
How do we solve such a problem efficiently?
– Greedy algorithms (approximation)
– Efficient algorithms exist for special cases and are related to “graph algorithms”

A Greedy Algorithm for SSP
• For each pair of (segment) strings, compute an overlap score
• Merge the pair with the highest score
• Repeat until no more strings can be merged
• If multiple strings are left, any concatenation of them would be a solution.
Think about an example when this algorithm is not optimal…

A Special Case of SSP
• When each segment is an L-mer (L-gram), a linear-time algorithm exists!
• This makes it attractive to do “sequencing by hybridization”…

Sequencing By Hybridization
• Attach all possible DNA probes of length l (e.g., l = 8) to a flat surface, each probe at a distinct and known location.
This set of probes is called the DNA array.
• Apply a solution containing a fluorescently labeled DNA fragment to the array.
• The DNA fragment hybridizes with those probes that are complementary to substrings of length l of the fragment, allowing us to see which l-mers match the DNA fragment.

Hybridization on DNA Array

l-mer composition
• Define Spectrum(s, l) as the unordered multiset of all possible (n - l + 1) l-mers in a string s of length n
• The order of individual elements in Spectrum(s, l) does not matter
• For example, for s = TATGGTGC all of the following are equivalent representations of Spectrum(s, 3):
  {TAT, ATG, TGG, GGT, GTG, TGC}
  {ATG, GGT, GTG, TAT, TGC, TGG}
  {TGG, TGC, TAT, GTG, GGT, ATG}
• We usually choose the lexicographically maximal representation as the canonical one.

Different sequences – the same spectrum
• Different sequences may have the same spectrum:
  Spectrum(GTATCT, 2) = Spectrum(GTCTAT, 2) = {AT, CT, GT, TA, TC}

The SBH Problem
• Goal: Reconstruct a string from its l-mer composition
• Input: A set S, representing all l-mers from an (unknown) string s
• Output: String s such that Spectrum(s, l) = S and the length of s is minimum
How likely is it that the found s is indeed the original genome? s approaches the genome as l increases, if there is no sequencing error.

How do we solve the SBH problem efficiently?
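As a quick sanity check before the graph-based answer, the Spectrum(s, l) definition and the same-spectrum example above can be reproduced in a few lines of Python. This is only an illustrative sketch: the function name spectrum and the choice of collections.Counter as the multiset are assumptions, not part of the lecture.

```python
from collections import Counter


def spectrum(s: str, l: int) -> Counter:
    """Return the multiset of all (len(s) - l + 1) l-mers of s.

    A Counter is used as the multiset (an assumption of this sketch),
    so the order of the l-mers does not matter when comparing spectra.
    """
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


# The 3-mers of TATGGTGC, sorted only for display.
print(sorted(spectrum("TATGGTGC", 3).elements()))
# ['ATG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']

# Two different strings with the same 2-mer spectrum.
print(spectrum("GTATCT", 2) == spectrum("GTCTAT", 2))
# True
```

Since two different strings can share a spectrum, a string reconstructed from S alone is not always unique.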
The solution is related to graph algorithms…

Detour… Graph Algorithms

The Bridge Obsession Problem
Bridges of Königsberg
Find a tour crossing every bridge just once
Leonhard Euler, 1735

Formalization of Königsberg Bridge Problem: Graph & Eulerian Cycle
• Graph G = (V, E)
– V = Vertices; E = Edges
• Eulerian cycle: A cycle that visits every edge exactly once
• Linear time algorithm exists
More complicated Königsberg

Hamiltonian Cycle Problem
• Find a cycle that visits every vertex exactly once
• NP-complete
Game invented by Sir William Hamilton in 1857

Balanced Graphs
• A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges: in(v) = out(v)

Euler Theorem
• Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.

Euler Theorem: Proof
• Eulerian → balanced: for every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore in(v) = out(v)
• Balanced → Eulerian: ???

Algorithm for Constructing an Eulerian Cycle
a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian this dead end is necessarily the starting point, i.e., vertex v.

Algorithm for Constructing an Eulerian Cycle (cont’d)
b. If cycle from (a) above is not an Eulerian cycle, it must contain a
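The walk-until-dead-end idea of step (a), repeated until no unused edges remain, can be sketched in Python as follows. This sketch uses the standard stack-based form of Hierholzer's algorithm, which is an assumption about how the construction continues rather than the lecture's own wording. The name eulerian_cycle and the edge-list input format are hypothetical, and the input is assumed to be a connected, balanced directed graph.

```python
from collections import defaultdict


def eulerian_cycle(edges):
    """Return an Eulerian cycle as a list of vertices.

    Assumes `edges` is a non-empty list of (u, v) pairs forming a
    connected directed graph with in(v) == out(v) for every vertex.
    """
    # Unused outgoing edges for each vertex.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)

    start = edges[0][0]          # arbitrary starting vertex
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:
            # Keep walking along an unused edge.
            stack.append(adj[v].pop())
        else:
            # Dead end: v is finished, so emit it and back up.
            # Backing up lets later sub-cycles be spliced in.
            cycle.append(stack.pop())
    return cycle[::-1]


edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("D", "A")]
print(eulerian_cycle(edges))
```

Each edge is pushed and popped once, so the running time is linear in the number of edges, consistent with the "linear time algorithm exists" remark above.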