MIT 6.047 - Conditional Random Fields for Computational Gene Prediction

Conditional Random Fields for Computational Gene Prediction
6.047/6.878 Computational Biology: Genomes, Networks, Evolution
Lecture 18, November 4, 2008

MIT OpenCourseWare, http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

Lecture outline: Genome Annotation; Eukaryotic Gene Structure; Gene Prediction with HMM; Limitations of HMM Approach (1); Dependent Evidence; Dependent Evidence in HMMs; Independencies in X; Limitations Stem from Generative Modeling; Generative Modeling of P(X); Discriminative Models; Discriminative Model; Linear Chain CRF; The Basic Idea; Using CRFs; What Are Feature Functions?; Example Feature Function; Adding Evidence; Dependent Evidence in CRFs; A Strategy for Selecting Features; Implementing HMMs as CRFs; Adding New Evidence; Conditional Independence of Y; Conditional-Generative Pairs; Practical Benefit of Factorization; Labeling a Sequence; Dynamic Programming; By Analogy with HMM; Combined HMM and CRF Inference; Maximum Likelihood Learning; Gradient Search; CRF Applications to Gene Prediction; Conrad.

Genome Annotation

[Figure: a stretch of genomic DNA sequence annotated with the flow from genome sequence to RNA (transcription) to protein (translation). Figure by MIT OpenCourseWare.]

Eukaryotic Gene Structure

[Figure: structure of a eukaryotic gene. The coding segment of the complete mRNA runs from the start codon (ATG) to the stop codon (TGA); exons alternate with introns, and each intron begins at a donor site (GT) and ends at an acceptor site (AG). From http://geneprediction.org/book/classroom.html. Courtesy of William Majoros. Used with permission.]

Gene Prediction with HMM

[Figure: an HMM for gene prediction with states start, Exon, 5' Splice, Intron, and 3' Splice over the sequence ATGCCCCAGTTTTTGT, drawn as a chain of hidden states Y1, Y2, ..., Yn emitting observations X1, X2, ..., Xn. Figure by MIT OpenCourseWare.]

An HMM is a model of the joint distribution P(Y,X) = P(Labels, Seq). For gene prediction, we are given X; how do we select a Y efficiently?
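To make the joint-distribution view concrete, here is a minimal sketch, not from the lecture, of how an HMM assigns P(X,Y). The two-state exon/intron model, its transition and emission tables, and the example labeling are all invented toy values.

```python
# Minimal sketch of HMM joint probability P(X, Y) for gene labeling.
# All probabilities below are illustrative toy numbers, not the
# lecture's actual model parameters.

transitions = {  # P(Y_i | Y_{i-1})
    ("exon", "exon"): 0.9, ("exon", "intron"): 0.1,
    ("intron", "intron"): 0.8, ("intron", "exon"): 0.2,
}
emissions = {  # P(X_i | Y_i); exons are GC-rich in this toy model
    "exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
start = {"exon": 0.5, "intron": 0.5}  # P(Y_1)

def joint_probability(xs, ys):
    """P(X,Y) = P(Y_1) P(X_1|Y_1) * prod_i P(Y_i|Y_{i-1}) P(X_i|Y_i)."""
    p = start[ys[0]] * emissions[ys[0]][xs[0]]
    for i in range(1, len(xs)):
        p *= transitions[(ys[i - 1], ys[i])] * emissions[ys[i]][xs[i]]
    return p

print(joint_probability("ATGGC", ["exon"] * 5))
```

Selecting the best Y for a given X is then the familiar Viterbi dynamic program over this factorization, which runs in time linear in the sequence length.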
Limitations of HMM Approach (1)

[Figure: the HMM chain Y1, Y2, Y3, ..., Yi over emissions X1, X2, X3, ..., Xi, annotated with P(Yi = exon | Yi-1 = intron) and P(Xi = G | Yi = exon). Figure by MIT OpenCourseWare.]

All components of an HMM have strict probabilistic semantics: transition probabilities such as P(Yi = exon | Yi-1 = intron) and emission probabilities such as P(Xi = G | Yi = exon), each of which must sum to 1. What about incorporating both BLAST and HMMer evidence? What are P(HMMer hit | exon) and P(BLAST hit | exon)?

Dependent Evidence
• HMMer protein domain predictions come from models based on known protein sequences
  – Protein sequences for the same domain are aligned
  – Conservation is modeled with an HMM
• But these are the same proteins searched by BLAST
• If we see an HMMer hit, we are already more likely to get a BLAST hit, and vice versa

BLAST and HMMer do not provide independent evidence; dependence is the rule for most evidence.

Dependent Evidence in HMMs
• HMMs explicitly model P(Xi | Yi) = P(Blast, HMMer | Yi)
  – It is not enough to know P(HMMer | Yi); we also need P(HMMer | Yi, Blast)
  – We would need to model these dependencies in the input data
• Every time we add new evidence (e.g., ESTs) we need to account for its dependence on the previous evidence
  – E.g., not just P(EST | Yi) but P(EST | Yi, Blast, HMMer)
• This is unpleasant, and unnecessary for our actual task: classification
• A common strategy is to simply assume independence (the Naïve Bayes assumption), which is almost always false:

  P(X_1, X_2, X_3, \ldots, X_N | Y) = \prod_i P(X_i | Y)

Independencies in X

HMMs also make strong independence assumptions about X:

  P(X_i | Y_i, Y_{i-1}, Y_{i-2}, Y_{i-3}, \ldots, Y_1) = P(X_i | Y_i)

Effectively, each Yi "looks" at only a contiguous subset of X given the previous Yi-1.

Limitations Stem from Generative Modeling

HMMs are models of the full joint probability distribution P(X,Y), which factors as

  P(X,Y) = P(Y|X) P(X)

But P(Y|X) alone is all we need for gene prediction!

Generative Modeling of P(X)
• HMMs expend unnecessary effort modeling P(X), which is never needed for gene prediction
  – They must model the dependencies in X
• During learning, we might trade accuracy in modeling P(Y|X) in order to model P(X) accurately
  – The result is less accurate gene prediction

Discriminative Models

[Figure: the same labeling task drawn discriminatively: the model conditions directly on the observed sequence ATGCCCCAGTTTTTGT plus BLAST hits, ESTs, etc., and outputs P(Y|X).]

Model the conditional distribution P(Y|X) directly. Discriminative models outperform generative models in several natural language processing tasks.

Discriminative Model

Desirable characteristics:
1. Efficient learning and inference algorithms
2. Easy incorporation of diverse evidence
3. Builds on the best existing HMM models for gene calling

Linear Chain CRF

[Figure: linear-chain CRF graphical model. Hidden state labels Y1, Y2, ..., YN (exon, intron, etc.) form a chain, and each Yi is connected to the entire input X (sequence, BLAST hits, ESTs, etc.) through feature functions with feature weights.]

  P(Y|X) = \frac{1}{Z(X)} \exp\left\{ \sum_{j=1}^{J} \sum_{i=1}^{N} \lambda_j f_j(Y_i, Y_{i-1}, X) \right\}

with normalization

  Z(X) = \sum_{Y} \exp\left\{ \sum_{j=1}^{J} \sum_{i=1}^{N} \lambda_j f_j(Y_i, Y_{i-1}, X) \right\}

where the fj are feature functions and the λj are feature weights.

The Basic Idea
• Feature functions fj return real values on pairs of labels and input data that we think are important for determining P(Y|X)
  – E.g., if the last state (yi-1) was intron and we have a BLAST hit (x), the probability that we are now in an exon (yi) is different
• We may not know how this probability has changed, or how it depends on other evidence
• We learn this by selecting the weights λj that maximize the likelihood of the training data (see the sketch below)
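As an illustration of the CRF equations above, the sketch below scores a labeling and computes P(Y|X) by brute-force enumeration of Z(X). The three feature functions, their weights, and the toy input are invented for illustration; they are not the lecture's (or Conrad's) actual feature set, and a practical implementation would compute Z(X) by dynamic programming rather than enumeration.

```python
from itertools import product
from math import exp

# Minimal sketch of the linear-chain CRF above:
#   P(Y|X) = (1/Z(X)) * exp( sum_j sum_i lambda_j * f_j(Y_i, Y_{i-1}, X) )
# Feature functions and weights are hypothetical examples.

STATES = ["exon", "intron"]

def f_transition(y_i, y_prev, x, i):
    # Fires when we stay in an exon (an HMM-like transition feature).
    return 1.0 if y_prev == "exon" and y_i == "exon" else 0.0

def f_blast(y_i, y_prev, x, i):
    # Fires when position i is labeled exon and overlaps a BLAST hit.
    return 1.0 if y_i == "exon" and x["blast"][i] else 0.0

def f_gc(y_i, y_prev, x, i):
    # Fires when an exon label sits on a G or C (exons tend to be GC-rich).
    return 1.0 if y_i == "exon" and x["seq"][i] in "GC" else 0.0

FEATURES = [f_transition, f_blast, f_gc]
WEIGHTS = [0.5, 2.0, 0.3]  # the lambda_j, normally learned from training data

def score(y, x):
    """Unnormalized log score: sum_j sum_i lambda_j * f_j(Y_i, Y_{i-1}, X)."""
    return sum(
        w * f(y[i], y[i - 1] if i > 0 else None, x, i)
        for f, w in zip(FEATURES, WEIGHTS)
        for i in range(len(y))
    )

def probability(y, x):
    """P(Y|X) with Z(X) by brute force; real CRFs use dynamic programming."""
    z = sum(exp(score(list(y2), x)) for y2 in product(STATES, repeat=len(y)))
    return exp(score(y, x)) / z

x = {"seq": "ATGGC", "blast": [True, True, True, False, False]}
print(probability(["exon"] * 3 + ["intron"] * 2, x))
```

This is the flexibility the lecture is after: correlated evidence such as BLAST and HMMer hits can be added as extra feature functions without specifying their joint distribution, because the weights λj are fit jointly by maximum likelihood rather than constrained to be probabilities that sum to 1.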

