Unformatted text preview:

Computational Gene Finding using HMMs Nagiza F Samatova Computational Biology Institute Oak Ridge National Laboratory samatovan ornl gov Outline Gene Finding Problem Simple Markov Models Hidden Markov Models Viterbi Algorithm Parameter Estimation GENSCAN Performance Evaluation Gene Finding Problem General task to identify coding and non coding regions from DNA Functional elements of most interest promoters splice sites exons introns etc Classification task Gene finding is a classification task Observations nucleotides Classes labels of functional elements Computational techniques Neural networks Discriminant analysis Decision trees Hidden Markov Models HMMs Focus of today s lecture General Complications Not all organisms use the same genetic code But computational methods are usually tuned for some set of organisms Overlapping genes esp in viruses Genes encoded in different reading fames on the same or on the complementary DNA Prokaryotic vs Eukaryotic genes Prokaryotic vs Eukaryotic Genes Prokaryotes small genomes high gene density no introns or splicing no RNA processing similar promoters terminators important overlapping genes Eukaryotes large genomes low gene density introns splicing RNA processing heterogeneous promoters terminators not important polyadenylation The Biological Model Eukaryotic Gene Structure Example CpG islands Notation C G denotes the C G base pair across the two DNA strands CpG denotes the dinucleotide CG Methylation process in the human genome Very high chance of methyl C mutating to T in CpG CpG dinucleotides are much rarer BUT it is suppressed around the promoters of many genes CpG dinucleotides are much more than elsewhere Such regions are called CpG islands A few hundred to a few thousand bases long Problems Given a short sequence does it come from a CpG island or not How to find the CpG islands in a long sequence Markov Chain Definition A Markov chain is a triplet Q p x1 s A where Q is a finite set of states Each state corresponds to a symbol in the alphabet p is the initial state probabilities A is the state transition probabilities denoted by ast for each s t Q Output The output of the model is the set of states at For each s t Q the transition probability is ast P xi t xi 1 each instant time the set of states are observable s Property The probability of each symbol xi depends only on the value of the preceding symbol xi 1 P xi xi 1 x1 P xi xi 1 Formula The probability of the sequence P x P xL xL 1 x1 P xL xL 1 P xL 1 xL 2 P x2 x1 P x Markov Chains for CpG discrimination A aAT T aAC set of DNA sequences w known CpG islands aGT C a G GC A state for each of the four letters A C G and T in the DNA alphabet Arrow probability of a residue following another residue A C G T A 180 274 426 120 C 171 368 274 188 G 161 339 375 125 T 079 355 384 Training Set 182 Derive two Markov chain models model from the CpG islands model from the remainder of sequence Transition probabilities for each model a st c st t c st c st is the number of times letter t followed letter s in the CpG islands To use these models for discrimination calculate the log odds ratio P x model S x log P x model L i 1 log a x i 1 x i a x i 1 x i Histogram of log odds scores 10 5 NonCpG 0 0 4 0 3 0 2 0 1 CpG islands 0 0 1 0 2 0 3 0 4 Q1 Given a short sequence x does it come from CpG island Yes No question S x Q2 Given a long sequence x how do we find CpG islands in it Where question Calculate the log odds score for a window of say 100 nucleotides around every nucleotide plot it and predict CpG islands as ones w positive values Drawbacks Window size HMM for CpG islands How do we find CpG islands in a sequence A 1 C 0 G 0 T 0 A 0 C 1 G 0 T 0 A 0 C 0 G 1 T 0 A 0 C 0 G 0 T 1 A C G T Build a single model that combines both Markov chains states A C G T Emit symbols A C G T in CpG islands states A C G T A C G T A 1 C 0 G 0 T 0 A 0 C 1 G 0 T 0 A 0 C 0 G 1 T 0 A 0 C 0 G 0 T 1 Emit symbols A C G T in non islands If a sequence CGCG is emitted by states C G C G then P CGCG a0 C 1 aC G 1 aG C 1 aC G 1 aG 0 Note Each set or has an additional set of transitions as in previous Markov chain In general we DO NOT know the path How to estimate the path HMM Hidden Markov Model Definition An HMM is a 5 tuple Q V p A E where Q is a finite set of states Q N V is a finite set of observation symbols per state V M p is the initial state probabilities A is the state transition probabilities denoted by ast for each s t Q For each s t Q the transition probability is ast P xi t xi 1 s Only emitted symbols are observable by the system but not the Output underlying random walk between states hidden E is a probability emission matrix esk P vk at time t qt s Property Emissions and transitions are dependent on the current state only and not on the past Most Probable State Path the Viterbi Algorithm for decoding Notation the sequence of states or the path j the j th state in the path Decoding Observation sequence underlying states The most probable path w the highest probability argmax P x over all possible of paths grows exponentially brute force approach is impractical Can be found recursively using dynamic programming Viterbi algorithm pk i is the probability of the most pl i 1 el x i 1 max pk i akl probable path ending in state k with k observation i Foundation any partial subpath ending at a point along the true optimal path must itself be an optimal path leading up to that point So the optimal path can be found by incremental extensions of optimal subpaths Algorithm Viterbi p0 0 1 pk 0 0 for k 0 Initialization i 0 Recursion i 1 L pl i 1 el x i 1 max pk i akl k ptri l arg max k pk i 1 akl Termination Computational Complexity P x max k pk L ak 0 Brute Force O NL L arg max k pk L ak 0 Viterbi O L N2 Traceback i L 1 i 1 ptri i N number of states Q L sequence length x CpG Example Viterbi Algorithm Sequence CGCG p pl i 1 el x i 1 max pk i akl k ptri l arg …


View Full Document

UTK CS 594 - Computational Gene Finding using HMMs

Documents in this Course
Load more
Loading Unlocking...
Login

Join to view Computational Gene Finding using HMMs and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Computational Gene Finding using HMMs and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?