LEHIGH CSE 397 - Sequencing and Sequence Assembly

Unformatted text preview:

Slide 1Slide 2Slide 3Slide 4Slide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24Slide 25Slide 26Slide 27Slide 28Slide 29Slide 30Slide 31Slide 32Slide 33Slide 34Slide 35Slide 36Slide 37Slide 38Slide 39Slide 40Slide 41Slide 42Slide 431Sequencing and Sequence Assembly--overview of the genome sequenceing processPresented by NIE , LanCSE497Feb.24, 20042IntroductionQ: What is SequenceA: To sequence a DNA molecule is to obtain the string of bases that it contains. Also know as readQ: How to sequenceA: Recall the Sanger Sequencing technology mentioned in Chapter 13IntroductionCut DNA at each base:A,C,G,T Fragment’s migrate distance is inversely proportional to their sizeSanger SequencingTCGCGATAGCTGTGCTA Run gel and read off sequence4IntroductionLimitation The size of DNA fragments that can be read in this way is about 700 bps Problem Most genomes are enormous (e.g 108 base pair in case of human).So it is impossible to be sequenced directly! This is called Large-Scale Sequencing5IntroductionSolutionBreak the DNA into small fragments randomlySequence the readable fragment directlyAssemble the fragment together to reconstruct the original DNAScaffolder gapsSolving a one-dimensional jigsaw puzzle with millions of pieces(without the box) !61. Break2. Sequence3. Assemble4. Scaffolder5. Conclusion7BreakDNA can be cutten into pieces through mechanical means8Issues in BreakHow?•CoverageThe whole fragments provide an 8X oversampling of the genome•RandomLibraries with pieces sizes of 2,4,6,10, 12 and 40 k bp were produced•CloneObtaining several copies of the original genome and fragments91. Break2. Sequence3. Assemble4. Scaffolder5. Conclusion10 SequencecloneDirected sequencing(GEL)GTCCAGCCTQ: can we read the fragment from both end?111. Break2. Sequence3. Assemble4. Scaffolder5. Conclusion123. AssembleA Simple Example ACCGT CGTGC TTACOverlap: The suffix of a fragment is same as the prefix of another. Assemble: align multiple fragments into single continuous sequence based on fragment overlap--ACCGT----CGTGCTTACTTACCGTGC133. Assemblefragmentsfragments assembletargetoriginalcontig1 contig2 gap14A simple modelThe simplest, naive approximation of DNA assemble corresponds to Shortest Superstring Problem(SCS): Given a set of string s1, ... , sn, find the shortest string s such that each si appears as a substring of s.--ACCGT----CGTGCTTACTTACCGTGC15 (1) Overlap step Create an overlap graph in which every node is a fragment and edges indicate an overlap (2) Layout step Determine which overlaps will be used in the final assembly, find an optimal spanning forest on the overlap graph16Overlap stepFinding overlap Compare each fragment with other fragments to find whether there’s overlap on its end part and another’s beginning part. We call ‘a overlap b’ when a’s suffix equal to b’s prefix17Overlap stepOverlap graphDirected, weighted graph G(V,E,w)V: set of fragmentsE : set of directed edge indicates the overlap between two fragments. An edge <a,b,w> means an overlap between a and b with weight w. this equal to suffix(a,w)=prefix(b,w)18ExamplewzxusyW=AGTATTGGCAATC Z=AATCGATGU=ATGCAAACCTX=CCTTTTGGY=TTGGCAATCAS=AATCAGG4339 4519Layout stepLooking for shortest common superstring is the same as looking for path of maxium weightUsing greedy algorithm to select a edge with the best weight at every step.The selected edge is checked by Rule. If this check is accepted, the edge is accepted, otherwise omit this edgeRule: for either node on this edge, indegree and outdegree <=1; Acyclic20At last the fragments merged together , from the point of graph, it is a forest of hamitonian paths(a path through the graph that contains each node at most once)., each path correspond to a contig21ExamplewzxusyW=AGTATTGGCAATC Z=AATCGATGU=ATGCAAACCTX=CCTTTTGGY=TTGGCAATCAS=AATCAGG4339 45W->Y->SAGTATTGGCAATC TTGGCAATCA AATCAGGAGTATTGGCAATCAGGZ->U->XAATCGATG ATGCAAACCT CCTTTTGG AATCGATGCAAACCT TTTGG22Geedy Algorithm is neither optimal nor complete, and will introduce gap223GCCATGCTGCATCan’t correctly model the assembly problem due to complication in the real problem instance23Complication with Assemble Sequencing errors. Most sequencers have around 1% error in the best case.Unknown orientation. Could have sequenced either strand.Bias in the reads. Not all regions of the sequence will be covered equally.Repeats. There is much repetitive sequence, especially in human and higher plants24Sequenceing Errors Fragments contains3 kinds of errors: insert, deletion, substitution Possibility :Substitutions ( 0.5-2% ), insert and deletion occur roughly 10 times less frequentlyhttp://compbio.uchsc.edu/Hunter_lab/Hunter/bioi7711/lecture6.ppt25x:ACCGTY:CGTGCZ:TTACU:TACCGTProblems with the simple model - ErrorsAG3xyu z352xyu z--ACCGT----CGTGCTTAC-TACCGTTTACCGTGC26Problems with the simple model - ErrorsSolutionAllow for bounded number of mismatches between overlapping fragments ----- Approximate overlapsCriterion: minimum overlap length(40 bps), error rate(less than 6% mismatches )How? Using semi-global alignment to find the best match between the suffix of one sequence and the prefix of another.27semi-global alignmentScore system: 1 for matches, -1 for mismatches, -2 for gapsInitializing the first row and first column of zero, ignore gap in both extremitiesAlgorithm is same as global comparisionSearch last column for higest score and obtain alignment by tracing back to start point ( overlap of x over y). overlap of y over x corresponds to the max in the last row0000000000000……y000x280 0 0 0 0 00 -1 1 1 -1 -20 -1 -1 0 2 00 1 -1 -2 1 10 -1 0 -2 -1 20 -1 -2 -1 -1 00 -1 0 -1 -2 -2CGATGC A C C G TOverlap: y->x CGATGC--- ----ACCGTY:X:Overlap:x->yACCG-T— --CGATGC29x:ACCGTY:CGTGCZ:TTACU:TACCGTProblems with the simple model - ErrorsAG3xyu z352--ACCGT----CGTGCTTAC-TACCGTTTACCGTGC--ACCG-T----CGATGCTT-C-TAGCGTTTACCGTGCCriterion1.Score>-32. Mismatch<22xyu z0-2030Problems with the simple model - Unkown orientationUnknowns Orientation:Fragments can be read from both of the DNA strands.SolutionTry all possible combinationCACGTACGTACTACGGTACTCACGTACGTCGTAGTAGTACCACGT-ACGT--CGTAGT-----AGTACCACGTAGTACTGAzxyZ’X’Y’31Problems with the simple


View Full Document

LEHIGH CSE 397 - Sequencing and Sequence Assembly

Download Sequencing and Sequence Assembly
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Sequencing and Sequence Assembly and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Sequencing and Sequence Assembly 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?