DNA Sequencing

The Walking Method
1. Build a very redundant library of BACs with sequenced clone-ends (cheap to build)
2. Sequence some “seed” clones
3. “Walk” from seeds using clone-ends to pick library clones that extend left & right

Walking: An Example

Advantages & Disadvantages of Hierarchical Sequencing
Hierarchical Sequencing
  ADV. Easy assembly
  DIS. Build library & physical map; redundant sequencing
Whole Genome Shotgun (WGS)
  ADV. No mapping, no redundant sequencing
  DIS. Difficult to assemble and resolve repeats
The Walking method – motivation
  Sequence the genome clone-by-clone without a physical map
  The only costs involved are:
    Library of end-sequenced clones (cheap)
    Sequencing

Walking off a Single Seed
• Low redundant sequencing
• Many sequential steps

Walking off a single clone is impractical
Cycle time to process one clone: 1-2 months
1. Grow clone
2. Prepare & shear DNA
3. Prepare shotgun library & perform shotgun
4. Assemble in a computer
5. Close remaining gaps
A mammalian genome would need 15,000 walking steps!

Walking off several seeds in parallel
• Few sequential steps
• Additional redundant sequencing
In general, can sequence a genome in ~5 walking steps, with <20% redundant sequencing
[Figure panels: Efficient; Inefficient]

Using Two Libraries
Solution: Use a second library of small clones
Most inefficiency comes from closing a small ocean with a much larger clone

Whole-Genome Shotgun Sequencing

Whole Genome Shotgun Sequencing
[Figure labels: genome; cut many times at random; plasmids (2 – 10 Kbp); cosmids (40 Kbp); forward-reverse paired reads, ~500 bp each; known dist]

ARACHNE: Steps to Assemble a Genome
1. Find overlapping reads
2. Merge good pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence
   ..ACGATTACAATAGGTT..

1. Find Overlapping Reads
• Sort all k-mers in reads (k ~ 24)
• Find pairs of reads sharing a k-mer
• Extend to full alignment – throw away if not >95% similar

    TAGATTACACAGATTAC
    |||||||||||||||||
    TAGATTACACAGATTAC

[Figure: the shared k-mer match extended to a full alignment of the two reads]

1. Find Overlapping Reads
One caveat: repeats
A k-mer that appears N times initiates N² comparisons
ALU: 1,000,000 times
Solution: Discard all k-mers that appear more than c × Coverage times (c ~ 10)

1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA
    TAGATTACACAGATTACTGA

1. Find Overlapping Reads (cont’d)
• Correct errors using multiple alignment
• Score alignments
• Accept alignments with good scores

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA

Column scores, before → after correction:
  Gap column:      A: 15, A: 25, A: 40, A: 25, -   →  A: 15, A: 25, A: 40, A: 25, A: 0
  Mismatch column: C: 20, C: 35, T: 30, C: 35, C: 40  →  C: 20, C: 35, C: 0, C: 35, C: 40

2. Merge Reads into Contigs
Merge reads up to potential repeat boundaries
[Figure label: repeat region]

Repeats, errors, and contig lengths
• Repeats shorter than read length are OK
• Repeats with more base pair diffs than sequencing error rate are OK
• To make a smaller portion of the genome appear repetitive, try to:
    Increase read length
    Decrease sequencing error rate
Role of error correction:
Discards ~90% of single-letter sequencing errors
decreases error rate → decreases effective repeat content → increases contig length

2. Merge Reads into Contigs
• Ignore non-maximal reads
• Merge only maximal reads into contigs
[Figure label: repeat region]

2. Merge Reads into Contigs
• Ignore “hanging” reads, when detecting repeat boundaries
[Figure labels: sequencing error; repeat boundary???; a; b]

2. Merge Reads into Contigs
• Insert non-maximal reads whenever unambiguous
[Figure labels: ?????; Unambiguous]

3. Link Contigs into Supercontigs
Find all links between unique contigs
[Figure labels: Normal density; Too dense: Overcollapsed? (Myers et al. 2000); Inconsistent links: Overcollapsed?]

3. Link Contigs into Supercontigs
Connect contigs incrementally, if ≥ 2 links
Fill gaps in supercontigs with paths of overcollapsed contigs

3. Link Contigs into Supercontigs
Define G = ( V, E )
V := contigs
E := ( A, B ) such that d( A, B ) < C
Reason to do so: Efficiency; full shortest paths cannot be computed

3. Link Contigs into Supercontigs
[Figure labels: Contig A; Contig B; d ( A, B )]

3. Link Contigs into Supercontigs
Define T: contigs linked to either A or B
Fill gap between A and B if there is a path in G passing only from contigs in T
[Figure labels: Contig A; Contig B]

4. Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments

    TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
    TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
    TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
    TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
    TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
    TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA

Derive each consensus base by weighted voting

Simulated Whole Genome Shotgun
• Known genomes: Flu, yeast, fly, Human chromosomes 21, 22
• Make “realistic” shotgun reads
• Run ARACHNE
• Align output with genome and compare

Making a Simulated Read
Simulated reads have error patterns taken from random real reads
[Figure: artificial shotgun read + real read → ERRORIZER → Simulated read]

Human 22, Results of Simulations

    Plasmid / Cosmid cov    10 X / 0.5 X    5 X / 0.5 X    3 X / 0 X
    N50 contig              353 Kb          15 Kb          2.7 Kb
    Mean contig             142 Kb          10.6 Kb        2.0 Kb
    N50 scaffold            3 Mb            3 Mb           4.1 Kb
    Avg base qual           41              32             26
    % > 2 kb                97.3            91.1           67

Neurospora crassa Genome (Real Data)
• 40 Mb genome, shotgun sequencing complete (WI-CGR)
Coverage:
  1705 contigs
  368 supercontigs
• 1% uncovered (of finished BACs)
• Evaluated assembly using 1.5Mb of finished BACs
Efficiency:
  Time: 20 hr
  Memory: 9 Gb
Accuracy:
  < 3 misassemblies compared with 1 Gb of finished sequence
  Errors/10^6 letters:
    Subst. 260
    Indel: 164

Mouse Genome
Improved version of ARACHNE assembled the mouse genome
Several heuristics of iteratively:
  Breaking supercontigs that are suspicious
  Rejoining supercontigs
Size of problem: 32,000,000 reads
Time: 15 days, 1 processor
Memory: 28 Gb
N50 Contig size: 16.3 Kb 24.8 Kb
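
Sketch: finding candidate overlaps (step 1)
The first ARACHNE step (find pairs of reads sharing a k-mer, and discard k-mers that occur too often because they likely come from repeats) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ARACHNE's actual code: it uses a hash index instead of sorting all k-mers, it counts the number of reads containing a k-mer rather than total occurrences, it ignores reverse complements, and it stops before the alignment-extension and >95% similarity check. The names candidate_overlaps and max_occurrences are hypothetical; max_occurrences stands in for the "c times coverage" cutoff.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k=24, max_occurrences=None):
    """Return the set of index pairs (i, j), i < j, of reads sharing a k-mer.

    max_occurrences is a hypothetical stand-in for the repeat cutoff
    (c x coverage on the slides): k-mers found in more reads than this
    are skipped. None disables the filter.
    """
    # Map each k-mer to the indices of the reads that contain it.
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)

    pairs = set()
    for owners in index.values():
        # Repeat filter: a k-mer present in N reads would trigger
        # about N^2 comparisons, so over-frequent k-mers are dropped.
        if max_occurrences is not None and len(owners) > max_occurrences:
            continue
        pairs.update(combinations(sorted(owners), 2))
    return pairs


if __name__ == "__main__":
    reads = ["TAGATTACACAGATTAC", "GATTACACAGATTACTGA", "CCCCCCCCGGGGGGGG"]
    # A small k is used only because the toy reads are short.
    print(candidate_overlaps(reads, k=8))
```

Each returned pair is only a candidate: in the pipeline described above it would still have to be extended to a full alignment and thrown away if the reads are not sufficiently similar.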