DOC PREVIEW
UTK CS 594 - Computational Gene Finding using HMMs

This preview shows page 1-2-14-15-30-31 out of 31 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 31 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 31 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 31 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 31 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 31 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 31 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 31 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Computational Gene Finding using HMMsOutlineGene Finding ProblemGeneral ComplicationsProkaryotic vs. Eukaryotic GenesThe Biological ModelExample: CpG islandsMarkov ChainMarkov Chains for CpG discriminationHistogram of log-odds scoresHMM for CpG islandsHMM (Hidden Markov Model)Most Probable State Path: the Viterbi Algorithm for decodingAlgorithm: ViterbiCpG Example: Viterbi AlgorithmHow to Build an HMMParameter Estimation for HMMs (Case 1)Parameter Estimation for HMMs (Case 2)HMM Architectural/Topology DesignHMM-based Gene FindingVEIL: Viterbi Exon-Intron LocatorGenieGenScan OverviewGenScan ArchitectureGenScan StatesAccuracy MeasuresTest DatasetsSlide 28Why not Perfect?What We Learned…Thank You!Computational Gene Finding using HMMsNagiza F. Samatova Computational Biology InstituteOak Ridge National [email protected]Gene Finding ProblemSimple Markov ModelsHidden Markov ModelsViterbi AlgorithmParameter EstimationGENSCANPerformance EvaluationGene Finding ProblemGeneral task: to identify coding and non-coding regions from DNAFunctional elements of most interest: promoters, splice sites, exons, introns, etc.Classification task:Gene finding is a classification task:Observations: nucleotidesClasses: labels of functional elementsComputational techniques:Neural networksDiscriminant analysisDecision treesHidden Markov Models (HMMs) Focus of today’s lectureGeneral ComplicationsNot all organisms use the same genetic code:But computational methods are usually “tuned” for some set of organismsOverlapping genes (esp., in viruses):Genes encoded in different reading fames on the same or on the complementary DNAProkaryotic vs. Eukaryotic genesProkaryotic vs. Eukaryotic GenesProkaryotessmall genomeshigh gene densityno introns (or splicing)no RNA processingsimilar promotersterminators importantoverlapping genesEukaryoteslarge genomeslow gene densityintrons (splicing)RNA processingheterogeneous promotersterminators not importantpolyadenylationThe Biological ModelEukaryotic Gene StructureExample: CpG islandsNotation: C-G – denotes the C-G base pair across the two DNA strandsCpG – denotes the dinucleotide CGMethylation process in the human genome:Very high chance of methyl-C mutating to T in CpG => CpG dinucleotides are much rarerBUT it is suppressed around the promoters of many genes => CpG dinucleotides are much more than elsewhereSuch regions are called CpG islandsA few hundred to a few thousand bases longProblems: Given a short sequence, does it come from a CpG island or not?How to find the CpG islands in a long sequenceMarkov ChainDefinition: A Markov chain is a triplet (Q, {p(x1 = s)}, A), where: Q is a finite set of states. Each state corresponds to a symbol in the alphabet Σ p is the initial state probabilities. A is the state transition probabilities, denoted by ast for each s, t ∈Q. For each s, t ∈Q the transition probability is: ast ≡P(xi = t|xi-1 = s)Property: The probability of each symbol xi depends only on the value of the preceding symbol xi-1 : P (xi | xi-1,…, x1) = P (xi | xi-1)Formula: The probability of the sequence: P(x) = P(xL,xL-1,…, x1) = P (xL | xL-1) P (xL-1 | xL-2)… P (x2 | x1) P(x1) Output: The output of the model is the set of states at each instant time => the set of states are observableMarkov Chains for CpG discriminationTraining Set: set of DNA sequences w/ known CpG islandsDerive two Markov chain models:‘+’ model: from the CpG islands‘-’ model: from the remainder of sequence Transition probabilities for each model:To use these models for discrimination, calculate the log-odds ratio:• A state for each of the four letters A,C, G, and T in the DNA alphabet• Arrow <-> probability of a residue following another residuet'st'ststccastcis the number of times letter t followed letter s in the CpG islandsiiiixxxxLiaa)P(x|)P(x|S(x)111logmodelmodellog+ A C G TA.180 .274 .426 .120C.171 .368 .274 .188G.161 .339 .375 .125T.079 .355 .384 .182A TGCaGTaACaGCaATHistogram of log-odds scores-0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.40510CpG islandsNon-CpGQ1: Given a short sequence x, does it come from CpG island (Yes-No question)?• S(x)Q2: Given a long sequence x, how do we find CpG islands in it (Where question)?• Calculate the log-odds score for a window of, say, 100 nucleotides around every nucleotide, plot it, and predict CpG islands as ones w/ positive values• Drawbacks: Window sizeHMM for CpG islandsBuild a single model that combines both Markov chains:‘+’ states: A+, C+, G+, T+Emit symbols: A, C, G, T in CpG islands‘-’ states: A-, C-, G-, T-Emit symbols: A, C, G, T in non-islandsIf a sequence CGCG is emitted by states (C+,G-,C-,G+), then:In general, we DO NOT know the path. How to estimate the path?How do we find CpG islands in a sequence?A+T+G+C+A-T-G-C-A: 0 C: 0 G: 1 T: 0A: 1 C: 0 G: 0 T: 0A: 0 C: 1 G: 0 T: 0A: 0 C: 0 G: 0 T: 1A: 0 C: 0 G: 1 T: 0A: 1 C: 0 G: 0 T: 0A: 0 C: 1 G: 0 T: 0A: 0 C: 0 G: 0 T: 1Note: Each set (‘+’ or ‘-’) has an additional set of transitions as in previous Markov chain0,,,,,01111)(GGCCGGCCaaaaaCGCGPHMM (Hidden Markov Model)Definition: An HMM is a 5-tuple (Q, V, p, A, E), where: Q is a finite set of states, |Q|=N V is a finite set of observation symbols per state, |V|=M p is the initial state probabilities. A is the state transition probabilities, denoted by ast for each s, t ∈Q. For each s, t ∈Q the transition probability is: ast ≡P(xi = t|xi-1 = s) E is a probability emission matrix, esk ≡P (vk at time t | qt = s)Property: Emissions and transitions are dependent on the current state only and not on the past.Output: Only emitted symbols are observable by the system but not the underlying random walk between states -> “hidden”Most Probable State Path:the Viterbi Algorithm for decodingNotation: – the sequence of states, or the pathj – the j-th state in the pathDecoding – Observation sequence => underlying statesThe most probable path (w/ the highest probability):* = argmax P(x, ) over all possible  (# of paths grows exponentially => brute force approach is


View Full Document

UTK CS 594 - Computational Gene Finding using HMMs

Documents in this Course
Load more
Download Computational Gene Finding using HMMs
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Computational Gene Finding using HMMs and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Computational Gene Finding using HMMs 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?