U of I CS 498 - Whole Genome Sequencing

Slides:
Whole Genome Sequencing
Outline
Slide 3
Repeat Types
The sequencing errors, repeats, and the complexity of genomes make it necessary to use many heuristics in practice…
Strategies for whole-genome sequencing
Hierarchical Sequencing vs. Whole Genome Shotgun
Whole Genome Shotgun Sequencing
Fragment Assembly
Read Coverage
Enough Coverage
Lander-Waterman Model
Repeats, Errors, and Read lengths
Overlap-Layout-Consensus
Overlap
Overlapping Reads
Overlapping Reads and Repeats
Finding Overlapping Reads
Finding Overlapping Reads (cont’d)
Layout
Merge Reads into Contigs
Merge Reads into Contigs (cont’d)
Slide 23
Slide 24
Link Contigs into Supercontigs
Link Contigs into Supercontigs (cont’d)
Slide 27
Slide 28
Slide 29
Consensus
Derive Consensus Sequence
What You Should Know

Whole Genome Sequencing
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 13, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Most slides are taken/adapted from Serafim Batzoglou’s lectures

Outline
• Practical challenges in genome sequencing
• Whole genome sequencing strategies
• Sequencing coverage (Lander-Waterman model)
• Overlap-Layout-Consensus approach

Challenges with Fragment Assembly
• Sequencing errors: ~1-2% of bases are wrong
• Repeats
  – Bacterial genomes: 5%
  – Mammals: 50%
  – [Figure: false overlap due to repeat]
• Computation: ~$O(N^2)$, where N = # reads

Repeat Types
• Low-complexity DNA (e.g. ATATATATACATA…)
• Microsatellite repeats $(a_1 \ldots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
• Transposons/retrotransposons
  – SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, $10^6$ copies)
  – LINE: Long Interspersed Nuclear Elements (~500 - 5,000 bp long, 200,000 copies)
  – LTR retroposons: Long Terminal Repeats (~700 bp) at each end
• Gene families: genes duplicate & then diverge
• Segmental duplications: ~very long, very similar copies

The sequencing errors, repeats, and the complexity of genomes make it necessary to use many heuristics in practice…
The Shortest Superstring formulation is an over-simplification of the problem.

Strategies for whole-genome sequencing
1. Hierarchical – Clone-by-clone (yeast, worm, human)
   i. Break genome into many long fragments
   ii. Map each long fragment onto the genome
   iii. Sequence each fragment with shotgun
2. Online version of (1) – Walking (rice genome)
   i. Break genome into many long fragments
   ii. Start sequencing each fragment with shotgun
   iii. Construct map as you go
3. Whole Genome Shotgun (fly, human, mouse, rat, fugu)
   One large shotgun pass on the whole genome

Hierarchical Sequencing vs. Whole Genome Shotgun
• Hierarchical Sequencing
  – Advantages: Easy assembly
  – Disadvantages:
    • Build library & physical map
    • Redundant sequencing
• Whole Genome Shotgun (WGS)
  – Advantages: No mapping, no redundant sequencing
  – Disadvantages: Difficult to assemble and resolve repeats
Whole Genome Shotgun appears to be getting more popular…

Whole Genome Shotgun Sequencing
[Figure: the genome is cut many times at random; forward-reverse paired reads of ~500 bp each, separated by a known distance]

Fragment Assembly
Cover region with ~7-fold redundancy
Overlap reads and extend to reconstruct the original genomic region
[Figure: reads]

Read Coverage
Length of genomic segment: G
Number of reads: N
Length of each read: L
Definition: Coverage C = NL/G

Enough Coverage
How much coverage is enough?
According to the Lander-Waterman model: assuming a uniform distribution of reads, C = 7 results in 1 gap per 1,000 nucleotides

Lander-Waterman Model
• Major Assumptions
  – Reads are randomly distributed in the genome
  – The number of times a base is sequenced follows a Poisson distribution:
    $p(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$, where $\lambda$ is the average number of times a base is sequenced
• Implications
  – G = genome length, L = read length, N = # reads
  – Mean of Poisson: $\lambda = LN/G$ (coverage)
  – % bases not sequenced: $p(X=0) = e^{-\lambda}$; for $\lambda = 7$ this is 0.0009 = 0.09%
  – Total gap length: p(X=0)*G
  – Total number of gaps: p(X=0)*N
This model was used to plan the Human Genome Project…

Repeats, Errors, and Read lengths
• Repeats shorter than the read length are OK
• Repeats whose copies differ by more base pairs than the sequencing error rate are OK
• To make a smaller portion of the genome appear repetitive, try to:
  – Increase read length
  – Decrease sequencing error rate
Role of error correction:
Discards ~90% of single-letter sequencing errors
decreases error rate → decreases effective repeat content
However, we have only limited read length.
Many heuristics have been introduced to handle repeats…

Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence and correct read errors
..ACGATTACAATAGGTT..

Overlap
• Find the best match between the suffix of one read and the prefix of another
• Due to sequencing errors, dynamic programming is needed to find the optimal overlap alignment
• Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring

Overlapping Reads
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
• Sort all k-mers in reads (k ~ 24)
• Find pairs of reads sharing a k-mer
• Extend to full alignment – throw away if not >95% similar
[Figure: example alignment with mismatches between T GATAGA and TACATAGT]

Overlapping Reads and Repeats
• A k-mer that appears N times initiates $N^2$ comparisons
• For an Alu that appears $10^6$ times → $10^{12}$ comparisons – too much
• Solution: Discard all k-mers that appear more than t × Coverage times (t ~ 10)

Finding Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

Finding Overlapping Reads (cont’d)
• Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
[Figure: per-read base calls with quality scores at two columns of the alignment. Mismatched column: C: 20, C: 35, T: 30, C: 35, C: 40, shown again as C: 20, C: 35, C: 0, C: 35, C: 40. Gapped column: A: 15, A: 25, A: 40, A: 25, -, shown again as A: 15, A: 25, A: 40, A: 25, A: 0]
• Score alignments
• Accept alignments with good scores
Multiple alignments will be covered later in the course…

Layout
• Repeats are a major challenge
• Do two aligned fragments really overlap, or are they from two copies of a repeat?

Merge Reads into Contigs
Merge reads up to potential repeat boundaries
[Figure: repeat region]

Merge Reads into Contigs (cont’d)
• Ignore non-maximal reads
• Merge only maximal reads into contigs
[Figure: repeat region]

Merge Reads into Contigs (cont’d)
• Ignore “hanging” reads when detecting repeat boundaries
[Figure: reads a and b; labels "sequencing error" and "repeat boundary???"]

Merge Reads into
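
The Lander-Waterman quantities from the Read Coverage and Lander-Waterman Model slides can be checked numerically. The following is a minimal Python sketch, not part of the original lecture. The function names and the example values of G, L and N (chosen so that C = 7) are assumptions made for illustration only.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = N * L / G, as defined on the Read Coverage slide."""
    return num_reads * read_len / genome_len


def poisson_pmf(x: int, lam: float) -> float:
    """p(X = x) = lam^x * e^(-lam) / x! for a Poisson variable with mean lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def lander_waterman(num_reads: int, read_len: int, genome_len: int) -> dict:
    """Evaluate the slide's formulas for one (N, L, G) setting."""
    lam = coverage(num_reads, read_len, genome_len)
    p0 = poisson_pmf(0, lam)  # probability that a base is never sequenced
    return {
        "coverage": lam,
        "frac_unsequenced": p0,
        "total_gap_length": p0 * genome_len,  # p(X=0) * G
        "num_gaps": p0 * num_reads,           # p(X=0) * N
    }


if __name__ == "__main__":
    # Hypothetical numbers: 1 Mbp region, 500 bp reads, 14,000 reads -> C = 7.
    stats = lander_waterman(num_reads=14_000, read_len=500, genome_len=1_000_000)
    for name, value in stats.items():
        print(f"{name}: {value:.6g}")
```

With C = 7 the unsequenced fraction is e^-7, which rounds to the 0.0009 quoted on the slide. Changing `num_reads` shows how quickly that fraction falls as coverage grows.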
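
The k-mer filtration idea from the Overlapping Reads slides can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the method of any assembler named in the lecture: it uses a hash table instead of sorting the k-mers, a fixed `max_occurrences` cutoff standing in for t × Coverage, and an ungapped suffix-prefix comparison in place of the dynamic-programming alignment. All function names, the toy reads, and k = 5 are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k):
    """Map each k-mer to the list of (read_id, position) where it occurs."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((read_id, i))
    return index


def candidate_pairs(reads, k, max_occurrences):
    """Pairs of reads sharing a k-mer, skipping k-mers that occur too often."""
    pairs = set()
    for hits in kmer_index(reads, k).values():
        if len(hits) > max_occurrences:
            continue  # treated as repeat-derived; stands in for t x Coverage
        read_ids = sorted({read_id for read_id, _ in hits})
        pairs.update(combinations(read_ids, 2))
    return pairs


def best_ungapped_overlap(left, right, min_len, min_identity):
    """Longest suffix of `left` matching a prefix of `right` above min_identity.

    Simplification: mismatches only, no gaps (the slides use dynamic programming).
    Returns (overlap_length, identity) or None. Requires min_len >= 1.
    """
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        matches = sum(x == y for x, y in zip(left[-n:], right[:n]))
        identity = matches / n
        if identity > min_identity:
            return n, identity
    return None


def find_overlaps(reads, k, max_occurrences, min_len, min_identity=0.95):
    """Check both orientations of every candidate pair; keep accepted overlaps."""
    overlaps = {}
    for a, b in candidate_pairs(reads, k, max_occurrences):
        for left, right in ((a, b), (b, a)):
            hit = best_ungapped_overlap(reads[left], reads[right], min_len, min_identity)
            if hit is not None:
                overlaps[(left, right)] = hit
    return overlaps


if __name__ == "__main__":
    # Toy reads, invented for this example.
    reads = ["TAGATTACACAGATTAC", "ACAGATTACTGATCCG", "GGGGCCCCAAAATTTT"]
    print(find_overlaps(reads, k=5, max_occurrences=4, min_len=5))
```

Only reads that share a sufficiently rare k-mer are ever compared, which is what keeps the number of comparisons well below the all-pairs $O(N^2)$ cost; lowering `max_occurrences` trades missed true overlaps for fewer repeat-induced comparisons.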

