CMU BSC 03711 - barnacle4


Barnacle: An assembly algorithm for Clone-based Sequences of Whole Genomes
Martin Farach-Colton
Joint work with: Vicky Choi

Outline
• Introduction to Sequencing
• Human Genome Project & the Sequence Assembly Problem
• The Barnacle Algorithm
  – Details of the input
  – The basic idea
• Comparison with NCBI’s public assembly
• Conclusion

DNA Sequencing
• Sequencing is the process of determining the sequence of nucleotides of a region of DNA.
• How do we find the sequence of a piece of DNA?

Basic Operations for Sequencing
• Direct Sequencing
• Directed Reads
• End Sequencing
• Clone-Probe Incidence

Direct Sequencing
• For short pieces (< 500 bp)
  – We can determine the complete sequence
• Called Direct Sequencing
  – This is the workhorse of sequencing
  – Relatively fast & cheap
    • ~1% error rate

Greedy Assembly aka Shotgun Sequencing
• Make many copies of DNA
• Cut each piece in a different way
  – Now 500 bp pieces have overlap
• Repeat until done:
  – Find sequences of maximal overlap
    • (must try reverse complement)
  – Merge them, and add merged sequence to set
• Assembled pieces need not form one piece
  – So they have gaps once assembled into contigs

Shotgun Sequencing (Draft)
[Figure: Target, Copies, Shotgun, Sequence each (short) piece, Sequence Assembly, Consensus, Contigs]

Directed Reads
• Given a long sequence that only occurs once in the genome…
  – It can be extended by Directed Reads
  – These are 500 bp at a time.
  – You can iterate.
  – Each iteration is slow and expensive.
• You can connect contigs with directed reads

Shotgun Sequencing (Final)
[Figure: Target, Copies, Shotgun, Sequence each (short) piece, Sequence Assembly, Consensus, Directed Read, Final]

Why aren’t we done?
• Lab errors limit process.
  – Can get false matches or miss true matches
  – Can get more exotic errors (more later)
• Repeats
  – Human genome is repeat-rich
    • >50% repeats
    • 50-500 kbp duplicated regions with >98% identity
  – 500 bp fragments from different repeats can be merged.
    • How can we tell if we are merging from different repeats?
  – Repeats are the unsolved problem of sequencing!

Shotgun Sequencing History
• 1980s: 5 to 10 Kbp
• 1990: 40 Kbp
• 1995: 1.8 Mbp (H. influenzae)
• 2000: 120 Mbp (Drosophila)
  » Except for repeated regions

Shotgun Sequencing Limitation
• We noted that you can have false merges.
[Figure: False overlap]
Directed reads aren’t going to help merge false contigs!
• Once we’ve made a few bad choices, errors accumulate.
• This limits the length of DNA that can be reliably sequenced by this method.
• How can we shotgun longer sequences?

Medium Length DNA
• To scale methods up, we need operations to limit error propagation in longer pieces of DNA.
• The specific operations we care about depend on DNA length.
• Names of DNA pieces depend on how they are copied
  – Plasmids, Cosmids = a few Kbp
  – BACs, YACs = tens to a few hundred Kbp.

End-Sequencing
• You can sequence 500 bp at each end of a piece of DNA.
  – The end sequences can be used to:
    • Keep fragment merging on track, because if two fragments are known to be e.g. 2000 bp apart and your merging doesn’t give that, you’ve got an error.
    • Tell the relative orientation of the pieces.
  – If the piece is too long, the information derived is too sparse.
  – Plasmids are the right length (~c × 10^3 bp)

Celera’s Shotgun Sequence
• Get lots of plasmid information.
• This constrains which pairs can be merged in shotgun sequence.
  – You merge bogus pairs with lower probability.
  – So you can merge longer stretches more reliably.
• Or at least, that’s the idea.
• They claim to have the complete human genome.
  – Once again, repeat regions are not yet sequenced.
  – Plasmids can easily fit within some repeats!

Probe-Clone Incidence
• You can tell if a piece of DNA (clone) has some particular substring (probe).
• If the clone is too short, it is unlikely to have the probe.
• If the clone is too long, it is too likely to have the probe.
• BACs are the right length (~c × 10^4 or c × 10^5 bp)
• Used to tell if two BACs overlap.
[Figure: Probe ∈? Clone]

Clone-Probes & Physical Maps
• Given a set of BACs from a Chromosome
  – A Physical Map is the approximate location of each BAC
• Clone-Probe incidence matrices can be used to construct physical maps of BACs through
  – Interval Graph techniques

Physical Mapping by Probes
[Figure: clones A-G, incidence marks (X), Interval Graph]

Interval Graphs
• Suppose you have intervals on a line
  – Make a graph with:
    • A node for each interval
    • An edge between overlapping intervals
• Suppose you have a graph so generated
  – Coming up with a set of matching intervals is called Interval Realization
  – A particular graph can have many different Interval Realizations

Interval Realizations
[Figure: Interval Realizations]

Hierarchical Shotgun Sequencing
[Figure: Target, BACs, Physical Map, Cover of Sequence, Shotgun Sequence of each cover BAC & assemble, Final Sequence]

Hierarchical Shotgun Sequencing
1. Copy target DNA
2. Make BAC library
3. Physically map all BACs
4. Find a subset of BACs that cover target DNA
5. Shotgun sequence only BACs in cover
6. Fill in gaps between BACs
7. Merge into consensus sequence

Hierarchical Shotgun Sequencing
• Sequencing each BAC lets you
  – Localize a merging mistake to one BAC
• Physical map lets you get a covering of the genome by BACs, so you end up doing less sequencing.
  – If sequencing were expensive & physical mapping cheap, this would be a good idea.

Outline
• Biological Background
• Human Genome Project
• The Barnacle Algorithm
  – Details of the input
  – The basic idea
• Comparison with NCBI’s public assembly
• Conclusion

Human Genome Project (HGP)
• 1988: “Mapping and Sequencing the Human Genome”
• 1990: HGP started in US
• 2001: A “working draft” version
• 2003: Completed

Sequencing Approaches of HGP
• Hierarchical Shotgun Sequencing.
• The physical map was scheduled to take 5 years.
• Genome centers had two choices:
  – Start sequencing before physical map was done.
  – Twiddle thumbs.

Clone-based Sequencing
or
Making a Virtue of Necessity
• Perhaps
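
Sketch: greedy assembly
The "Greedy Assembly aka Shotgun Sequencing" slide describes a loop: find the pair of sequences with maximal overlap (trying reverse complements), merge them, and repeat. The Python below is a minimal sketch of that loop, not the code of any real assembler. The function names, the `min_len` threshold, exact (error-free) suffix/prefix matching, and the toy reads are all assumptions made for illustration; reads contained entirely inside other reads are not handled.

```python
from itertools import permutations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b.

    Returns 0 when the longest match is shorter than min_len (assumed cutoff).
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair with the largest exact overlap.

    Both orientations of each sequence are tried. The loop stops when no
    pair overlaps by at least min_len, so the result may be several contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_pair, best_merge = 0, None, None
        for i, j in permutations(range(len(contigs)), 2):
            for a in (contigs[i], reverse_complement(contigs[i])):
                for b in (contigs[j], reverse_complement(contigs[j])):
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_pair, best_merge = k, (i, j), a + b[k:]
        if best_pair is None:
            break  # no overlaps left: the remaining contigs are separated by gaps
        contigs = [c for n, c in enumerate(contigs) if n not in best_pair]
        contigs.append(best_merge)
    return contigs


if __name__ == "__main__":
    # Toy reads from a made-up 19 bp target; the middle read is given
    # as its reverse complement.
    print(greedy_assemble(["ATTAGACCTG", "TTCCGGCAGG", "GCCGGAATAC"]))
```

Because the merge is greedy, two fragments from different copies of a repeat would be joined just as readily as a true overlap, which is the false-merge problem the "Shotgun Sequencing Limitation" slide raises.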
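
Sketch: interval graphs
The "Interval Graphs" slide defines the graph as one node per interval and one edge per overlapping pair. A minimal Python sketch of that construction follows. The function name `interval_graph`, the closed-interval convention, and the example coordinates are assumptions for illustration only.

```python
from itertools import combinations


def interval_graph(intervals):
    """Build an interval graph as an adjacency dict.

    intervals maps a name to a (start, end) pair, treated here as a
    closed interval (an assumed convention). Two nodes are adjacent
    when their intervals share at least one point.
    """
    graph = {name: set() for name in intervals}
    for (u, (s1, e1)), (v, (s2, e2)) in combinations(intervals.items(), 2):
        if s1 <= e2 and s2 <= e1:
            graph[u].add(v)
            graph[v].add(u)
    return graph


if __name__ == "__main__":
    first = {"A": (0, 10), "B": (5, 15), "C": (12, 20), "D": (30, 40)}
    second = {"A": (0, 2), "B": (1, 5), "C": (4, 6), "D": (100, 101)}
    # Different coordinates, same overlap pattern, hence the same graph.
    print(interval_graph(first) == interval_graph(second))
```

The two coordinate sets in the usage example produce the same graph, which illustrates the slide's point that one graph can have many Interval Realizations. Going the other way, from a graph back to intervals, is the harder problem and is not attempted here.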
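
Sketch: choosing a BAC cover
Step 4 of the "Hierarchical Shotgun Sequencing" list asks for a subset of BACs that covers the target. The slides do not say how the subset is chosen. The sketch below uses the standard greedy interval-cover rule as a stand-in, and it assumes something the slides do not promise: that the physical map gives exact coordinates, when the deck describes it as only approximate. The function name `greedy_cover`, the half-open coordinates, and the example BACs are hypothetical.

```python
def greedy_cover(bacs, target_len):
    """Pick BACs covering [0, target_len) with the greedy interval-cover rule.

    bacs maps a name to a half-open (start, end) interval on the target
    (assumed exact). At each step, among BACs starting at or before the
    covered prefix, take the one reaching farthest right. Returns the
    chosen names in order, or None if a gap cannot be bridged.
    """
    ordered = sorted(bacs.items(), key=lambda item: item[1][0])
    cover, reach, idx = [], 0, 0
    while reach < target_len:
        best_name, best_end = None, reach
        # Scan every BAC that starts inside the region covered so far.
        while idx < len(ordered) and ordered[idx][1][0] <= reach:
            name, (_, end) = ordered[idx]
            if end > best_end:
                best_name, best_end = name, end
            idx += 1
        if best_name is None:
            return None  # nothing extends the covered prefix: a gap
        cover.append(best_name)
        reach = best_end
    return cover


if __name__ == "__main__":
    bacs = {"A": (0, 100), "B": (50, 180), "C": (90, 150),
            "D": (170, 300), "E": (120, 200)}
    print(greedy_cover(bacs, 300))
```

With exact coordinates this rule selects the fewest intervals needed to cover the target, which matches the slide's motivation of doing less sequencing. A `None` result corresponds to the gaps that step 6 of the list fills in.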

