Stanford CS 262 - Lecture Notes


CS262 Lecture 11, Win07, Batzoglou

Outline (slide titles): Whole Genome Shotgun Sequencing; Fragment Assembly (in whole-genome shotgun sequencing); Steps to Assemble a Genome; 1. Find Overlapping Reads; 2. Merge Reads into Contigs; Overlap graph after forming contigs; Repeats, errors, and contig lengths; 4. Derive Consensus Sequence; Some Assemblers; Quality of assemblies: mouse; Quality of assemblies: rat; History of WGA; Genomes Sequenced

Some Terminology

insert: a fragment that was incorporated in a circular genome, and can be copied (cloned)
vector: the circular genome (host) that incorporated the fragment
BAC: Bacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100-200 kb
read: a 500-900 long word that comes out of a sequencing machine
coverage: the average number of reads (or inserts) that cover a position in the target DNA piece
shotgun sequencing: the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble

Whole Genome Shotgun Sequencing

[Figure: the genome is cut many times at random; fragments are cloned into plasmids (2 – 10 Kbp) or cosmids (40 Kbp); each insert yields forward-reverse paired reads of ~500 bp each, a known distance apart.]

Fragment Assembly (in whole-genome shotgun sequencing)

Given N reads…
Where N ~ 30 million…
We need to use a linear-time algorithm

Steps to Assemble a Genome

1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence (..ACGATTACAATAGGTT..)

Some Terminology

read: a 500-900 long word that comes out of sequencer
mate pair: a pair of reads from two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
consensus sequence: sequence derived from the multiple alignment of reads in a contig

1. Find Overlapping Reads

[Figure: each read is cut into overlapping words, e.g. aaactgcagtacggatct gives aaactgcag, aactgcagt, …, gtacggatc, tacggatct; gggcccaaactgcagtac gives gggcccaaa, ggcccaaac, …, actgcagta, ctgcagtac; gtacggatctactacaca gives gtacggatc, tacggatct, …, ctactacac, tactacaca. The list of (read, pos., word, orient.) tuples is then sorted as (word, read, orient., pos.), which places identical words from different reads next to each other.]

1. Find Overlapping Reads

• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment – throw away if not >98% similar

[Figure: two reads aligned over the shared stretch TAGATTACACAGATTAC, with a few mismatches in the flanking bases.]

• Caveat: repeats
  A k-mer that occurs N times causes O(N^2) read/read comparisons
  ALU k-mers could cause up to 1,000,000^2 comparisons
• Solution:
  Discard all k-mers that occur “too often”
  • Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available

1. Find Overlapping Reads

Create local multiple alignments from the overlapping reads

[Figure: stacked overlapping reads, mostly TAGATTACACAGATTACTGA, with some rows reading TAG TTACACAGATTATTGA.]

1. Find Overlapping Reads

• Correct errors using multiple alignment

[Figure: multiple alignment with rows such as TAGATTACACAGATTACTGA, TAG-TTACACAGATTACTGA, TAGATTACACAGATTATTGA and TAG-TTACACAGATTATTGA. Annotations: "insert A"; "replace T with C"; "correlated errors, probably caused by repeats → disentangle overlaps" (the TAG-TTACACAGATTATTGA rows, which carry both differences).]

In practice, error correction removes up to 98% of the errors

2. Merge Reads into Contigs

• Overlap graph:
  Nodes: reads r1…..rn
  Edges: overlaps (ri, rj, shift, orientation, score)

[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat.]

Note: of course, we don’t know the “color” of these nodes

2. Merge Reads into Contigs

We want to merge reads up to potential repeat boundaries

[Figure labels: repeat region; Unique Contig; Overcollapsed Contig]

2. Merge Reads into Contigs

• Ignore non-maximal reads
• Merge only maximal reads into contigs

[Figure label: repeat region]

2. Merge Reads into Contigs

• Remove transitively inferable overlaps
  If read r overlaps to the right reads r1, r2, and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2)

[Figure: reads r, r1, r2, r3]

2. Merge Reads into Contigs

• Ignore “hanging” reads, when detecting repeat boundaries

[Figure labels: sequencing error; repeat boundary???; reads a and b]

Overlap graph after forming contigs

Unitigs: Gene Myers, 95

Repeats, errors, and contig lengths

• Repeats shorter than read length are easily resolved
  Read that spans across a repeat disambiguates order of flanking regions
• Repeats with more base pair diffs than sequencing error rate are OK
  We throw overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
  Increase read length
  Decrease sequencing error rate

Role of error correction:
Discards up to 98% of single-letter sequencing errors
decreases error rate → decreases effective repeat content → increases contig length

2. Merge Reads into Contigs

• Insert non-maximal reads whenever unambiguous

3. Link Contigs into Supercontigs

[Figure labels: Too dense → Overcollapsed; Inconsistent links → Overcollapsed?; Normal density]

3. Link Contigs into Supercontigs

Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 forward-reverse links

[Figure label: supercontig (aka scaffold)]
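
The seeding idea in step 1 (find pairs of reads sharing a k-mer, and discard k-mers that occur "too often") can be illustrated with a small Python sketch. This is not the lecture's implementation: the function name `candidate_overlaps`, the parameter `max_reads_per_kmer`, and its default value are hypothetical, the index is an in-memory dictionary rather than a sorted word list, only the forward strand is indexed, and the follow-up step of extending each candidate pair to a full alignment (and rejecting pairs below 98% similarity) is left out.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k=24, max_reads_per_kmer=50):
    """Return pairs of read indices that share at least one k-mer.

    A k-mer found in more than `max_reads_per_kmer` reads is assumed to
    come from a repeat and is skipped. The default cutoff is an arbitrary
    placeholder, not a value from the lecture.
    """
    # Map each k-mer to the set of reads that contain it (forward strand only).
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(read_id)

    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_reads_per_kmer:
            continue  # over-frequent k-mer: avoids the quadratic blow-up
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


# The three example reads from the figure, with 9-letter words.
reads = [
    "aaactgcagtacggatct",
    "gggcccaaactgcagtac",
    "gtacggatctactacaca",
]
print(sorted(candidate_overlaps(reads, k=9)))  # [(0, 1), (0, 2)]
```

Reads 0 and 1 share words such as aaactgcag, and reads 0 and 2 share gtacggatc, so both pairs become alignment candidates; reads 1 and 2 share no 9-letter word and are never compared. Lowering `max_reads_per_kmer` trades sensitivity for speed, which is the tradeoff the slide describes.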

