Computational Gene Finding using HMMsOutlineGene Finding ProblemGeneral ComplicationsProkaryotic vs. Eukaryotic GenesThe Biological ModelExample: CpG islandsMarkov ChainMarkov Chains for CpG discriminationHistogram of log-odds scoresHMM for CpG islandsHMM (Hidden Markov Model)Most Probable State Path: the Viterbi Algorithm for decodingAlgorithm: ViterbiCpG Example: Viterbi AlgorithmHow to Build an HMMParameter Estimation for HMMs (Case 1)Parameter Estimation for HMMs (Case 2)HMM Architectural/Topology DesignHMM-based Gene FindingVEIL: Viterbi Exon-Intron LocatorGenieGenScan OverviewGenScan ArchitectureGenScan StatesAccuracy MeasuresTest DatasetsSlide 28Why not Perfect?What We Learned…Thank You!Computational Gene Finding using HMMsNagiza F. Samatova Computational Biology InstituteOak Ridge National [email protected]Gene Finding ProblemSimple Markov ModelsHidden Markov ModelsViterbi AlgorithmParameter EstimationGENSCANPerformance EvaluationGene Finding ProblemGeneral task: to identify coding and non-coding regions from DNAFunctional elements of most interest: promoters, splice sites, exons, introns, etc.Classification task:Gene finding is a classification task:Observations: nucleotidesClasses: labels of functional elementsComputational techniques:Neural networksDiscriminant analysisDecision treesHidden Markov Models (HMMs) Focus of today’s lectureGeneral ComplicationsNot all organisms use the same genetic code:But computational methods are usually “tuned” for some set of organismsOverlapping genes (esp., in viruses):Genes encoded in different reading fames on the same or on the complementary DNAProkaryotic vs. Eukaryotic genesProkaryotic vs. Eukaryotic GenesProkaryotessmall genomeshigh gene densityno introns (or splicing)no RNA processingsimilar promotersterminators importantoverlapping genesEukaryoteslarge genomeslow gene densityintrons (splicing)RNA processingheterogeneous promotersterminators not importantpolyadenylationThe Biological ModelEukaryotic Gene StructureExample: CpG islandsNotation: C-G – denotes the C-G base pair across the two DNA strandsCpG – denotes the dinucleotide CGMethylation process in the human genome:Very high chance of methyl-C mutating to T in CpG => CpG dinucleotides are much rarerBUT it is suppressed around the promoters of many genes => CpG dinucleotides are much more than elsewhereSuch regions are called CpG islandsA few hundred to a few thousand bases longProblems: Given a short sequence, does it come from a CpG island or not?How to find the CpG islands in a long sequenceMarkov ChainDefinition: A Markov chain is a triplet (Q, {p(x1 = s)}, A), where: Q is a finite set of states. Each state corresponds to a symbol in the alphabet Σ p is the initial state probabilities. A is the state transition probabilities, denoted by ast for each s, t ∈Q. For each s, t ∈Q the transition probability is: ast ≡P(xi = t|xi-1 = s)Property: The probability of each symbol xi depends only on the value of the preceding symbol xi-1 : P (xi | xi-1,…, x1) = P (xi | xi-1)Formula: The probability of the sequence: P(x) = P(xL,xL-1,…, x1) = P (xL | xL-1) P (xL-1 | xL-2)… P (x2 | x1) P(x1) Output: The output of the model is the set of states at each instant time => the set of states are observableMarkov Chains for CpG discriminationTraining Set: set of DNA sequences w/ known CpG islandsDerive two Markov chain models:‘+’ model: from the CpG islands‘-’ model: from the remainder of sequence Transition probabilities for each model:To use these models for discrimination, calculate the log-odds ratio:• A state for each of the four letters A,C, G, and T in the DNA alphabet• Arrow <-> probability of a residue following another residuet'st'ststccastcis the number of times letter t followed letter s in the CpG islandsiiiixxxxLiaa)P(x|)P(x|S(x)111logmodelmodellog+ A C G TA.180 .274 .426 .120C.171 .368 .274 .188G.161 .339 .375 .125T.079 .355 .384 .182A TGCaGTaACaGCaATHistogram of log-odds scores-0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.40510CpG islandsNon-CpGQ1: Given a short sequence x, does it come from CpG island (Yes-No question)?• S(x)Q2: Given a long sequence x, how do we find CpG islands in it (Where question)?• Calculate the log-odds score for a window of, say, 100 nucleotides around every nucleotide, plot it, and predict CpG islands as ones w/ positive values• Drawbacks: Window sizeHMM for CpG islandsBuild a single model that combines both Markov chains:‘+’ states: A+, C+, G+, T+Emit symbols: A, C, G, T in CpG islands‘-’ states: A-, C-, G-, T-Emit symbols: A, C, G, T in non-islandsIf a sequence CGCG is emitted by states (C+,G-,C-,G+), then:In general, we DO NOT know the path. How to estimate the path?How do we find CpG islands in a sequence?A+T+G+C+A-T-G-C-A: 0 C: 0 G: 1 T: 0A: 1 C: 0 G: 0 T: 0A: 0 C: 1 G: 0 T: 0A: 0 C: 0 G: 0 T: 1A: 0 C: 0 G: 1 T: 0A: 1 C: 0 G: 0 T: 0A: 0 C: 1 G: 0 T: 0A: 0 C: 0 G: 0 T: 1Note: Each set (‘+’ or ‘-’) has an additional set of transitions as in previous Markov chain0,,,,,01111)(GGCCGGCCaaaaaCGCGPHMM (Hidden Markov Model)Definition: An HMM is a 5-tuple (Q, V, p, A, E), where: Q is a finite set of states, |Q|=N V is a finite set of observation symbols per state, |V|=M p is the initial state probabilities. A is the state transition probabilities, denoted by ast for each s, t ∈Q. For each s, t ∈Q the transition probability is: ast ≡P(xi = t|xi-1 = s) E is a probability emission matrix, esk ≡P (vk at time t | qt = s)Property: Emissions and transitions are dependent on the current state only and not on the past.Output: Only emitted symbols are observable by the system but not the underlying random walk between states -> “hidden”Most Probable State Path:the Viterbi Algorithm for decodingNotation: – the sequence of states, or the pathj – the j-th state in the pathDecoding – Observation sequence => underlying statesThe most probable path (w/ the highest probability):* = argmax P(x, ) over all possible (# of paths grows exponentially => brute force approach is
View Full Document