UVA CS 302 - Computing Genomes, Genomes Computing

Unformatted text preview:

1Class 26: Class 26: Computing Computing Genomes, Genomes, Genomes Genomes ComputingComputingDavid EvansDavid Evanshttp://www.cs.virginia.edu/evanshttp://www.cs.virginia.edu/evanscs302: Theory of Computationcs302: Theory of ComputationUniversity of Virginia Computer ScienceUniversity of Virginia Computer Science2Lecture 26: Computing Genomes and Vice VersaFinal Exam Sneak Preview• Handout available now• Honor policy: you may discuss these problems with others and use any resources you want until the Final• No notes or other resources may be used during the final• Intent is to give you an idea what to expect on the final and a chance to start thinking about some problems– Don’t attempt to memorize answers: need to understand things since the actual questions may be different3Lecture 26: Computing Genomes and Vice VersaMenu• Computing Genomes (PS6, Problem 6)– Crash course in biology• Busy Beaver result! • Computing with Genomes• Conclusion4Lecture 26: Computing Genomes and Vice VersaGenome Assembly ProblemIn order to assemble a genome, it is necessary to combine snippets from many reads into a single sequence. The input is a set of n genome snippets, each of which is a string of up to ksymbols. The output is the smallest single string that contains all of the input snippets as substrings.5Lecture 26: Computing Genomes and Vice VersaDNA• Sequence of nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T)• Two strands, A must attach to T and G must attach to CAGTCC6Lecture 26: Computing Genomes and Vice VersaCentral Dogma of Biology• RNA makes copies of DNA segments• RNA describes sequences of amino acids• Chains of amino acids make proteins• Proteins make usDNATranscriptionRNATranslationProteinImage from http://www.umich.edu/~protein/27Lecture 26: Computing Genomes and Vice VersaHuman Genome• 3 Billion Base Pairs– Each nucleotide is 2 bits (4 possibilities)– 3 B pairs * 1 byte/4 pairs = 750 MB• Every sequence of 3 base pairs one of 20 amino acids (or stop codon)– 21 possible codons, but 43= 64 possible – So, really only 750MB * (21/64) ~ 250 MB• Much of it (> 95%) is may be junk (doesn’t encode proteins, but some might be important)1 CD ~ 650 MB8Lecture 26: Computing Genomes and Vice VersaHuman Genome Race• UVa CLAS 1970• Yale PhD• Tenured Professor at U. MichiganFrancis Collins(Director of public National Center for Human Genome Research)(Picture from UVa Graduation 2001)Craig Venter(President of Celera Genomics)vs.• San Mateo College• Court-martialed• Denied tenure at SUNY Buffalo9Lecture 26: Computing Genomes and Vice VersaReading the GenomeWhitehead Institute, MIT10Lecture 26: Computing Genomes and Vice VersaGene Reading Machines• One read: about 700 base pairs• But…don’t know where they are on the chromosomeAGGCATACCAGAATACCCGTGATCCAGAATAAGCActualGenomeACCAGAATACCRead 1TCCAGAATAARead 2TACCCGTGATCCARead 311Lecture 26: Computing Genomes and Vice VersaGenome Assembly ProblemInput: Genome fragments (but without knowing where they are from)Ouput: The full genomeACCAGAATACCRead 1TCCAGAATAARead 2TACCCGTGATCCARead 312Lecture 26: Computing Genomes and Vice VersaDecision ProblemGA = { < { x1, x2, …, xn}, m > | where each xiis a string and there is a string X of length mthat includes all of the xistringsas substrings }If we had a decider for GA, can we find the length of the shortest common superstring?Yes. Try all possible m values from 1, 2, …, Σ|xi|.313Lecture 26: Computing Genomes and Vice VersaIs GA In Class NP?GA = { < { x1, x2, …, xn}, m > | where each xiis a string and there is a string X of length mthat includes all of the xistringsas substrings }Yes. The string X is a polynomial-verifiable certificate.- Check it includes each substring- Check its length is ≤ m14Lecture 26: Computing Genomes and Vice VersaIs GA NP-Complete?GA = { < { x1, x2, …, xn}, m > | where each xiis a string and there is a string X of length mthat includes all of the xistringsas substrings }15Lecture 26: Computing Genomes and Vice VersaNP-CompleteA language B is in NP-complete if:2. There is a polynomial-time reduction from every problem A ∈ NP to B.1. B ∈ NPBNPBNP?16Lecture 26: Computing Genomes and Vice VersaTo Prove NP-Hardness• Pick some known NP-Complete problem X.• Show that a polynomial-time solver for Ycould be used to build a polynomial-time solver for X.• This proves that there is no polynomial-time solver for Y (unless P = NP).17Lecture 26: Computing Genomes and Vice VersaPossible Choices…HAMPATHABCDE3SAT(a ∨ b ∨ c) ∧ (¬a ∨ b ∨ ¬c) …SUBSET-SUM{<{3, 5, 12, 13, 17}, 30}By definition, all must work. Every NP-Complete problem can be reduced to every NP-Complete problem.In practice, some will work much more easily than others. Try to pick a problem “close” to the target problem.18Lecture 26: Computing Genomes and Vice VersaBusy Beaver ChallengeRuixin Yang419Lecture 26: Computing Genomes and Vice VersaDoubly-Infinite Tape (Regular BB)20Lecture 26: Computing Genomes and Vice VersaOne-Way Infinite Tape (Challenge)21Lecture 26: Computing Genomes and Vice VersaWinning (3, 2) Machine22Lecture 26: Computing Genomes and Vice VersaCodeBusyBeaver32.cppBusyBeaver.cpp23Lecture 26: Computing Genomes and Vice VersaIs GA NP-Complete?GA = { < { x1, x2, …, xn}, m > | where each xiis a string and there is a string X of length mthat includes all of the xistringsas substrings }24Lecture 26: Computing Genomes and Vice VersaReducing HAMPATH to GA• Take an arbitrary input to HAMPATH:HAMPATH = { G, s, t | G represents a graph, and there is a path between s and t in G that includes every node in G exactly once }• Construct a corresponding input to GA such that the input is in GA if and only if the original input is in HAMPATH.So, we need to map the nodes and edges of G into thesubstrings input to GA.525Lecture 26: Computing Genomes and Vice VersaConsider Each NodeIncoming Edges(x, a)aOutgoing Edges(a, y)In a Hamiltonian path, for each node (except start and end), exactly one incoming edge, and one outgoing edge must be used.26Lecture 26: Computing Genomes and Vice VersaConsider Each NodeIncoming Edges(x, a)aOutgoing Edges(a, y)We need GA substrings to represent each edge, butsuch that only 1 can be actually followed. But, the GA superstring needs to include all substrings.Idea: make it so the untaken edges can loop back!27Lecture 26: Computing Genomes and Vice VersaSimple


View Full Document
Download Computing Genomes, Genomes Computing
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Computing Genomes, Genomes Computing and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Computing Genomes, Genomes Computing 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?