DNA Sequencing
CS262 Lecture 10, Win06, Batzoglou

Slide titles: DNA Sequencing; Whole Genome Shotgun Sequencing; Fragment Assembly (in whole-genome shotgun sequencing); Fragment Assembly; Steps to Assemble a Genome; 1. Find Overlapping Reads; 2. Merge Reads into Contigs; Overlap graph after forming contigs; Repeats, errors, and contig lengths; 4. Derive Consensus Sequence; Some Assemblers; Quality of assemblies; Quality of assemblies: mouse; Quality of assemblies: rat; History of WGA; Genomes Sequenced; Some new sequencing technologies; Molecular Inversion Probes; Single Molecule Array for Genotyping: Solexa; Nanopore Sequencing; Nanopore Sequencing: Assembly; Pyrosequencing; Pyrosequencing on a chip; Pyrosequencing Signal; Pyrosequencing: Assembly; Polony Sequencing; Some future directions for sequencing

Some Terminology
insert: a fragment that was incorporated in a circular genome, and can be copied (cloned)
vector: the circular genome (host) that incorporated the fragment
BAC: Bacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100-200 kb
read: a word of 500-900 letters that comes out of a sequencing machine
coverage: the average number of reads (or inserts) that cover a position in the target DNA piece
shotgun sequencing: the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble

Whole Genome Shotgun Sequencing
[Figure labels: genome, cut many times at random; plasmids (2-10 Kbp), cosmids (40 Kbp); forward-reverse paired reads, ~500 bp each, at a known distance]

Fragment Assembly (in whole-genome shotgun sequencing)

Fragment Assembly
Given N reads…
Where N ~ 30 million…
We need to use a linear-time algorithm

Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence  ..ACGATTACAATAGGTT..

Some Terminology
read: a word of 500-900 letters that comes out of the sequencer
mate pair: a pair of reads from the two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
consensus sequence: sequence derived from the multiple alignment of reads in a contig

1. Find Overlapping Reads
[Figure: three example reads, aaactgcagtacggatct, gggcccaaactgcagtac and gtacggatctactacaca, are each broken into overlapping 9-letter words (aaactgcag, aactgcagt, …). The (read, pos., word, orient.) records are then sorted into (word, read, orient., pos.) order, so that identical words from different reads become adjacent.]

1. Find Overlapping Reads
• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment; throw away if not >98% similar

    TAGATTACACAGATTAC
    |||||||||||||||||
    TAGATTACACAGATTAC

  [Figure: the shared match is extended into the flanking letters of both reads, where only some positions agree]
• Caveat: repeats
  A k-mer that occurs N times causes O(N^2) read/read comparisons
  ALU k-mers could cause up to 1,000,000^2 comparisons
• Solution: discard all k-mers that occur “too often”
  Set the cutoff to balance the sensitivity/speed tradeoff, according to the genome at hand and the computing resources available

1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
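The k-mer filtering idea from the "Find Overlapping Reads" slides can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function name `candidate_overlaps` and the parameter `max_reads_per_kmer` are hypothetical, the sketch ignores read orientation (reverse complements), and it stops at candidate pairs without extending them to a full alignment. It uses k = 9 only because the three example reads are short; the slides suggest k ~ 24 for real data.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k=9, max_reads_per_kmer=4):
    """Return pairs of read indices that share at least one k-mer.

    k-mers found in more than max_reads_per_kmer reads are skipped, as a
    stand-in for discarding k-mers that occur "too often" (likely repeats).
    Both names are illustrative assumptions, not taken from the slides.
    """
    index = defaultdict(set)  # k-mer -> indices of the reads containing it
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)

    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_reads_per_kmer:
            continue  # too frequent: would trigger O(N^2) comparisons
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


# The three example reads from the slide figure.
reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
pairs = candidate_overlaps(reads)
print(sorted(pairs))
```

Each surviving pair would still need the alignment-extension and >98% similarity check described on the slide before it counts as an overlap.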
1. Find Overlapping Reads
• Correct errors using multiple alignment

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAGATTACACAGATTATTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA

  Annotations on the alignment:
    insert A
    replace T with C
    correlated errors, probably caused by repeats: disentangle overlaps

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA

In practice, error correction removes up to 98% of the errors

2. Merge Reads into Contigs
• Overlap graph:
  Nodes: reads r1…..rn
  Edges: overlaps (ri, rj, shift, orientation, score)
[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]
Note: of course, we don’t know the “color” of these nodes

2. Merge Reads into Contigs
We want to merge reads up to potential repeat boundaries
[Figure labels: repeat region, Unique Contig, Overcollapsed Contig]

2. Merge Reads into Contigs
• Ignore non-maximal reads
• Merge only maximal reads into contigs
[Figure label: repeat region]

2. Merge Reads into Contigs
• Remove transitively inferable overlaps
  If read r overlaps to the right reads r1, r2, and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2)
[Figure: reads r, r1, r2, r3]
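The removal of transitively inferable overlaps can be sketched on a toy overlap graph. This is only an illustration under simplifying assumptions: the function name `remove_transitive_overlaps` is hypothetical, edges carry no shift, orientation, or score, and the example edges among r, r1, r2, r3 are made up (the slide names the reads but its edges are not recoverable here).

```python
def remove_transitive_overlaps(overlaps):
    """Drop every edge (r, r2) that is implied by edges (r, r1) and (r1, r2).

    overlaps maps each read to the set of reads it overlaps to the right.
    Shifts, orientations and scores are ignored in this simplified sketch.
    """
    reduced = {}
    for r, right in overlaps.items():
        inferable = set()
        for r1 in right:
            # Any right-neighbour of r that r1 also reaches is inferable.
            inferable |= overlaps.get(r1, set()) & right
        reduced[r] = right - inferable
    return reduced


# Hypothetical example: r overlaps r1 and r2, r1 overlaps r2 and r3, r2 overlaps r3.
overlaps = {
    "r": {"r1", "r2"},
    "r1": {"r2", "r3"},
    "r2": {"r3"},
    "r3": set(),
}
reduced = remove_transitive_overlaps(overlaps)
print(reduced)
```

After reduction each read keeps only the overlap to its nearest right neighbour, so the toy graph collapses to the chain r, r1, r2, r3.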
2. Merge Reads into Contigs
• Ignore “hanging” reads when detecting repeat boundaries
[Figure labels: sequencing error; repeat boundary???; a, b, …]

Overlap graph after forming contigs
Unitigs: Gene Myers, 95

Repeats, errors, and contig lengths
• Repeats shorter than the read length are easily resolved
  A read that spans across a repeat disambiguates the order of the flanking regions
• Repeats with more base pair diffs than the sequencing error rate are OK
  We throw away overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
  Increase read length
  Decrease sequencing error rate
Role of error correction:
  Discards up to 98% of single-letter sequencing errors
  A lower error rate decreases the effective repeat content, which in turn increases contig length

2. Merge Reads into Contigs
• Insert non-maximal reads whenever unambiguous

3. Link Contigs