DOC PREVIEW
U of I CS 466 - LECTURE NOTES

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Gene finding with GeneMark.HMM(Lukashin & Borodovsky, 1997)CS 466Saurabh SinhaGene finding in bacteria• Large number of bacterial genomes sequenced (10at the time of paper, 1997)• Previous work: “Genemark” program identified geneas ORF that looks more like genes than non-genes.– Uses Markov chains of coding and non-coding sequence• 5’ (starting) boundary not well predicted– Resolution of start point ~ 100 nucleotidesGenemark.hmm• Builds on Genemark, but uses HMM for better prediction ofstart and stop• Given DNA sequence S = {b1,b2,….bL}• Find “functional sequence” A={a1,…aL} where each ai = 0 ifnon-coding, 1 if coding in forward strand, 2 if coding inreverse strand• Sounds like the Fair Bet Casino problem (sequence of cointypes “fair” or “biased”)• Find Pr(A | S) and report A that maximizes thisFunctional sequence• “A” carries information about where the codingfunction switched into non-coding (stop of gene) andvice versa.• Model sequence by HMM with different states for“coding” and “non-coding”• Maximum likelihood “A” is the optimal path throughthe HMM, given the sequence• Viterbi algorithm to solve this problemHidden Markov Model• In some states, choose (i) a length ofsequence to emit and (ii) the sequenceto emit• This is different from the Fair BetCasino problem. There, each stateemitted exactly one observation (H or T)Hidden Markov Model• “Typical” and “Atypical” gene states (one foreach of forward and reverse strands)• These two states emit coding sequence(between and excluding start and stopcodons) with different codon usage patterns• Clustering of E. coli genes showed that– majority of genes belong to one cluster (“Typical”)– many genes, believed to have been “horizontallytransferred” into the genome, belong to anothercluster (“Atypical”)Hidden State Trajectory “A”• This is similar to the “functional” sequencedefined earlier– except that we have one for each state, not onefor each nucleotide• Sequence of M hidden states ai havingduration di:– A = {(a1d1), (a2d2), …. (aMdM)}– ∑di = L• Find A* that maximizes Pr(A|S)Formulation• Find trajectory (path) A that has thehighest probability of occurringsimultaneously with the sequence S• Maximizing Pr(A,S) is the same asmaximizing Pr(A|S). Why ?! Pmax= P(A*,S) = max(a1d1)...(aMdM)ds=Ls =1M"Pr[(a1d1)...(aMdM),b1,b2...bL]Solution• Maximization problem solved by Viterbialgorithm (seen in previous lecture)Solutionmaximizing over all possible trajectoriesSolutionDefine (for dynamic progamming):the joint probability of a partial trajectory of m states (with thelast state being am) and a partial sequence of length l .transition prob.prob. of duration prob. of sequenceSolutionParameters of the HMM• Transition probability distributions, emissionprobability distributions• Fixed a priori– What was the other possibility ?– Learn parameters from data• Emission probabilities of coding sequence stateobtained from previous statistical studies: “What doesa coding sequence look like in general?”• Emission probabilities of non-coding sequenceobtained similarlyParameters of the HMM• Probability that a state “a” has duration “d” (i.e.,length of emission is d) is learned from frequencydistribution of lengths of known coding sequencesParameters of the HMM•… and non-coding sequencesParameters of the HMM• Emission probabilities of start codon fixed fromprevious studies– Pr(ATG)=0.905, Pr(GTG)=0.090, Pr(TTG)=0.005• Transition probabilities: Non-coding toTypical/Atypical coding state = 0.85/0.15Post-processing• As per the HMM, two genes cannotoverlap. In reality, genes may overlap !G2G1Post-processing• As per the HMM, two genes cannotoverlap. In reality, genes may overlap !G2G1Will predict second gene to begin hereWhat about the start codon for that second gene?Post-processing• As per the HMM, two genes cannotoverlap. In reality, genes may overlap !G2G1• Look for an RBS somewhere here.• Take each start codon here, and find RBS -19 to -4 bpupstream of itRibosome binding site (RBS)How to search for RBS?• Take 325 genes from E. coli (bacterium) withknown RBS• Align them using sequence alignment• Use this as a PWM to scan for RBSGene prediction in different species• The coding and non-coding stateemission probabilities need to betrained from each species for predictinggenes in that speciesGene prediction accuracy• Data set #1: all annotated E. coli genes• Data set #2: non-overlapping genes• Data set #3: Genes with known RBS• Data set #4: Genes with known startpositionsResultsVA: Viterbi algorithmPP: With post-processingResults• Gene overlap is an important factor• Performance goes up from 58% to 71% whenoverlapping genes are excluded from data set• Post-processing helps a lot– 58% --> 75% for data set #1• Missing genes: “False negatives” < 5%• “Wrong” gene predictions: “False positives” ~8%– Are they really false positives, or are they unannotatedgenes?Results• Compared with other programsResults• Robustness to parameter settings• Alternative set of transition probabilityvalues used• Little change in performance (~20%change in parameter values leads to <5% change in performance)Higher Order Markov models• Sequence emissions were modeled by a secondorder Markov chain.– Pr (Xi|Xi-1, Xi-2,…X1) = Pr (Xi|Xi-1, Xi-2)• Examined the effect of changing the “Markovorder” (0,1,3,4,5)• Even zeroth order Markov chain does pretty well.Higher Order Markov


View Full Document

U of I CS 466 - LECTURE NOTES

Download LECTURE NOTES
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view LECTURE NOTES and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view LECTURE NOTES 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?