DOC PREVIEW
U of I CS 498 - Graphs and DNA sequencing

This preview shows page 1-2-3-18-19-37-38-39 out of 39 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Graphs and DNA sequencingCS 498 SSSaurabh SinhaEulerian Cycle Problem• Find a cycle thatvisits every edgeexactly once• Linear timeHamiltonian Cycle Problem• Find a cycle thatvisits every vertexexactly once• NP – completeGame invented by Sir William Hamilton in 1857Graph theory in biology:a detourViruses Attack Bacteria• Normally bacteriophage T4 kills bacteria• However if T4 is mutated (e.g., an important gene isdeleted) it gets disabled and looses an ability to killbacteria• Suppose the bacteria is infected with two differentmutants each of which is disabled – would thebacteria still survive?• Amazingly, a pair of disabled viruses can kill abacteria.• How can it be explained?Benzer’s Experiment• Idea: infect bacteria with pairs of mutant T4bacteriophage (virus)• Each T4 mutant has an unknown intervaldeleted from its genome• If the two intervals overlap: T4 pair is missingpart of its genome and is disabled – bacteriasurvive• If the two intervals do not overlap: T4 pairhas its entire genome and is enabled –bacteria dieBenzer’s Experiment andGraphs• Construct an interval graph: each T4mutant is a vertex, place an edgebetween mutant pairs where bacteriasurvived (i.e., the deleted intervals inthe pair of mutants overlap)• Interval graph structure reveals whetherDNA is linear or branched DNAInterval Graph: Linear GenesInterval Graph: Branched GenesInterval Graph: ComparisonLinear genome Branched genomeSee http://en.wikipedia.org/wiki/Interval_graphEnd of detourDNA SequencingDNA Sequencing• Shear DNA intomillions of smallfragments• Read 500 – 700nucleotides at a timefrom the smallfragments (Sangermethod)Fragment Assembly• Computational Challenge: assembleindividual short fragments (reads) into asingle genomic sequence(“superstring”)Strategies for whole-genomesequencing1. Hierarchical – Clone-by-clone yeast, worm, humani. Break genome into many long fragmentsii. Map each long fragment onto the genomeiii. Sequence each fragment with shotgun2. Whole Genome Shotgun fly, human, mouse, rat, fuguOne large shotgun pass on the whole genomeUntil late 1990s the shotgun fragment assembly of humangenome was viewed as intractable problemShortest Superstring Problem• Problem: Given a set of strings, find ashortest string that contains all of them• Input: Strings s1, s2,…., sn• Output: A string s that contains allstrings s1, s2,…., sn as substrings, suchthat the length of s is minimized• Complexity: NP – completeShortest Superstring Problem: ExampleShortest Superstring Problem• Can be framed as Travelling SalesmanProblem (TSP).• Early sequencing algorithms used agreedy approach: merge a pair ofstrings with maximum overlap first– Conjectured to have performanceguarantee of 2.A completely different sequencing method:Sequencing by Hybridization• 1988: SBH suggested as anan alternative sequencingmethod. Nobody believed it willever work• 1991: Light directed polymersynthesis developed by SteveFodor and colleagues.• 1994: Affymetrix develops first64-kb DNA microarrayFirst microarray prototype (1989)First commercialDNA microarrayprototype w/16,000features (1994)500,000 featuresper chip (2002)How SBH Works• Attach all possible DNA probes of length l to aflat surface, each probe at a distinct and knownlocation. This set of probes is called the DNAarray.• Apply a solution containing fluorescently labeledDNA fragment to the array.• The DNA fragment hybridizes with those probesthat are complementary to substrings of length lof the fragment.How SBH Works (cont’d)• Using a spectroscopic detector, determine whichprobes hybridize to the DNA fragment to obtainthe l–mer composition of the target DNAfragment.• Apply a combinatorial algorithm to reconstructthe sequence of the target DNA fragment fromthe l – mer composition.l-mer composition• Spectrum ( s, l ) - unordered multiset ofall possible (n – l + 1) l-mers in a string sof length n• The order of individual elements inSpectrum ( s, l ) does not matterThe SBH Problem• Goal: Reconstruct a string from its l-mercomposition• Input: A set S, representing all l-mers from an(unknown) string s• Output: String s such that Spectrum(s,l ) = SDifferent from the Shortest Superstring ProblemSBH: Hamiltonian Path ApproachS = { ATG AGG TGC TCC GTC GGT GCA CAG } Path visited every VERTEX onceATG AGG TGC TCCHGTCGGTGCA CAGATGCAGG TCCSBH: Eulerian Path Approach S = { ATG, TGC, GTG, GGC, GCA, GCG, CGT } Vertices correspond to ( l – 1 ) – mers : { AT, TG, GC, GG, GT, CA, CG } Edges correspond to l – mers from SATGTCGCAGCTGGG Path visited every EDGE onceEuler Theorem• A graph is balanced if for every vertex thenumber of incoming edges equals to thenumber of outgoing edges: in(v)=out(v)• Theorem: A connected graph is Eulerian(has an Eulerian cycle) if and only if each ofits vertices is balanced.Euler Theorem: Proof• Eulerian → balanced for every edge entering v (incoming edge)there exists an edge leaving v (outgoingedge). Therefore in(v)=out(v)• Balanced → Eulerian ???Algorithm for Constructing an EulerianCyclea. Start with an arbitrary vertexv and form an arbitrary cyclewith unused edges until adead end is reached. Sincethe graph is Eulerian thisdead end is necessarily thestarting point, i.e., vertex v.Algorithm for Constructing an Eulerian Cycle(cont’d)b. If cycle from (a) is not anEulerian cycle, it mustcontain a vertex w, whichhas untraversed edges.Perform step (a) again,using vertex w as thestarting point. Once again,we will end up in thestarting vertex w.Algorithm for Constructing an Eulerian Cycle(cont’d)c. Combine the cyclesfrom (a) and (b) into asingle cycle and iteratestep (b).Back to traditional sequencing+ =ShakeDNAfragmentsVectorCirculargenome(bacterium,plasmid)Knownlocation(restrictionsite)Different Types of Vectors> 300,000Not used muchrecentlyYAC (Yeast ArtificialChromosome)70,000 - 300,000BAC (Bacterial ArtificialChromosome)40,000Cosmid2,000 - 10,000PlasmidSize of insert(bp)VECTORRead CoverageLength of genomic segment: LNumber of reads: n Coverage C = n l / LLength of each read: lHow much coverage is enough?Lander-Waterman model:Assuming uniform distribution of reads, C=10 results in 1gapped region per 1,000,000 nucleotidesCLander-Waterman Model• Major Assumptions– Reads are randomly distributed in the genome– The number of times a base is sequenced


View Full Document

U of I CS 498 - Graphs and DNA sequencing

Documents in this Course
Lecture 5

Lecture 5

13 pages

LECTURE

LECTURE

39 pages

Assurance

Assurance

44 pages

LECTURE

LECTURE

36 pages

Pthreads

Pthreads

29 pages

Load more
Download Graphs and DNA sequencing
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Graphs and DNA sequencing and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Graphs and DNA sequencing 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?