MIT 6.047 - Motifs Are Degenerate

School name: Massachusetts Institute of Technology

Course: 6.047 - Computational Biology

Slide titles: Motif Discovery; Regulatory Motifs; Motifs Are Degenerate; Other “Motifs”; Essential Tasks; Modeling Motifs; Consensus Sequences; Probabilistic Model; Scoring a Sequence; Visualizing Motifs – Motif Logos; Motif Information; Uncertainty and probability; Average Uncertainty; Entropy; Entropy versus randomness; Entropy Examples; Information Content; Motif Information; Motif Logo; Background DNA Frequency; A Different Measure; Comparing Both Methods; Online Logo Generation; Finding New Motifs; A Promoter Model; Probability of a Sequence; Parameterizing the Motif Model; Finding Known Motifs; Motif Position Distribution Zij; Calculating the Z Vector; Calculating the Z Vector - Example; Discovering Motifs; Expectation Maximization (EM); EM for Motif Discovery; The M-Step: Calculating the Motif Matrix; MEME; P(Seq|Model) Landscape; Search from Many Different Starts; The ZOOPS Model; E-step in the ZOOPS Model; M-step in the ZOOPS Model; The TCM Model; Likelihood in the TCM Model; E-step in the TCM Model; Gibbs Sampling; Gibbs Sampling and Climbing; AlignACE; Antigen Epitope Prediction; Antigens and Epitopes; Genome to “Immunome”; Modeling MHC Epitopes; Motifs Bound by MHCs

MIT OpenCourseWare http://ocw.mit.edu
6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution
Fall 2008
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Motif Discovery
Computational Biology: Genomes, Networks, Evolution
Lecture 9, October 2, 2008

Regulatory Motifs
Find promoter motifs associated with co-regulated or functionally related genes.

Motifs Are Degenerate
• Protein-DNA interactions
  – Proteins read DNA by “feeling” the chemical properties of the bases
  – Without opening DNA (not by base complementarity)
• Sequence specificity
  – Topology of 3D contact dictates sequence specificity of binding
  – Some positions are fully constrained; other positions are degenerate
  – “Ambiguous / degenerate” positions are loosely contacted by the transcription factor
[Figure: protein-DNA contacts; labels include COOH, NH2, Arg, Ser, Asn, base pair, DNA, and sugar-phosphate backbone. Figure by MIT OpenCourseWare.]

Other “Motifs”
• Splicing Signals
  – Splice junctions
  – Exonic Splicing Enhancers (ESE)
  – Exonic Splicing Suppressors (ESS)
• Protein Domains
  – Glycosylation sites
  – Kinase targets
  – Targeting signals
• Protein Epitopes
  – MHC binding specificities

Essential Tasks
• Modeling Motifs
  – How to computationally represent motifs
• Visualizing Motifs
  – Motif “Information”
• Predicting Motif Instances
  – Using the model to classify new sequences
• Learning Motif Structure
  – Finding new motifs, assessing their quality

Modeling Motifs

Consensus Sequences
• Useful for publication
• IUPAC symbols for degenerate sites
• Not very amenable to computation
Nature Biotechnology 24, 423 - 425 (2006)

Gene     Site
HEM13    CCCATTGTTCTC
HEM13    TTTCTGGTTCTC
HEM13    TCAATTGTTTAG
ANB1     CTCATTGTTGTC
ANB1     TCCATTGTTCTC
ANB1     CCTATTGTTCTC
ANB1     TCCATTGTTCGT
ROX1     CCAATTGTTTTG
Consensus: YCHATTGTTCTC
Figure by MIT OpenCourseWare.

Probabilistic Model
• Count frequencies at each position of the aligned sites (the figure uses the first six positions: CCCATT, TTTCTG, TCAATT, CTCATT, TCCATT, CCTATT, TCCATT, CCAATT)
• Add pseudocounts
• The result is a Position Frequency Matrix (PFM) M with rows A, C, G, T and columns 1…K; column k gives P_k(S|M)
[Figure: aligned sites and the resulting PFM. Figure by MIT OpenCourseWare.]

Scoring a Sequence
To score a sequence, we compare to a null model.

$$\text{Score} = \log\frac{P(S \mid PFM)}{P(S \mid B)} = \log\prod_{i=1}^{N}\frac{P(S_i \mid PFM_i)}{P(S_i \mid B)} = \sum_{i=1}^{N}\log\frac{P(S_i \mid PFM_i)}{P(S_i \mid B)}$$

Background DNA (B): A: 0.25, T: 0.25, G: 0.25, C: 0.25
Dividing each PFM entry by the background frequency and taking the log (base 2) gives the Position Weight Matrix (PWM). In the figure, PFM entries of 0.5, 0.2, and 0.1 become PWM entries of 1, -0.3, and -1.3.
[Figure: PFM and the corresponding PWM, rows A, C, G, T.]

Scoring a Sequence
MacIsaac & Fraenkel (2006) PLoS Comp Bio
Common threshold = 60% of maximum score
Courtesy of Kenzie MacIsaac and Ernest Fraenkel. Used with permission. MacIsaac, Kenzie, and Ernest Fraenkel. "Practical Strategies for Discovering Regulatory DNA Sequence Motifs." PLoS Computational Biology 2, no. 4 (2006): e36.

Visualizing Motifs – Motif Logos
Represent both base frequency and conservation at each position.
• Height of letter proportional to frequency of base at that position
• Height of stack proportional to conservation at that position

Motif Information
The height of a stack is often called the motif information at that position, measured in bits.

$$\text{Motif Position Information} = 2 - \Big(-\sum_{b \in \{A,T,G,C\}} p_b \log_2 p_b\Big)$$

Why is this a measure of information?

Uncertainty and probability
• “The sun will rise tomorrow”: not surprising (p ≈ 1)
• “The sun will not rise tomorrow”: very surprising (p << 1)
Uncertainty is inversely related to the probability of an event.
Uncertainty is related to our surprise at an event.

Average Uncertainty
Two possible outcomes for the sun rising:
A: “The sun will rise tomorrow”, P(A) = p_1
B: “The sun will not rise tomorrow”, P(B) = p_2
What is our average uncertainty about the sun rising?

$$P(A)\,\text{Uncertainty}(A) + P(B)\,\text{Uncertainty}(B) = -p_1 \log p_1 - p_2 \log p_2 = -\sum_i p_i \log p_i = \text{Entropy}$$

Entropy
• Entropy measures average uncertainty
• Entropy measures randomness

$$H(X) = -\sum_i p_i \log_2 p_i$$

If log is base 2, then the units are called bits.

Entropy versus randomness
Entropy is maximum at maximum randomness.
Example: Coin Toss
• P(heads) = 0.1: not very random, H(X) = 0.47 bits
• P(heads) = 0.5: completely random, H(X) = 1 bit
[Plot: entropy versus P(heads).]

Entropy Examples

$$H(X) = -[0.25\log(0.25) + 0.25\log(0.25) + 0.25\log(0.25) + 0.25\log(0.25)] = 2 \text{ bits}$$

$$H(X) = -[0.1\log(0.1) + 0.1\log(0.1) + 0.1\log(0.1) + 0.75\log(0.75)] = 0.63 \text{ bits}$$

[Plots: P(x) over A, T, G, C for each distribution.]

Information Content
Information is a decrease in uncertainty.
Once I tell you the sun will rise, your uncertainty about the event decreases.

$$\text{Information} = H_{\text{before}}(X) - H_{\text{after}}(X)$$

Information is the difference in entropy after receiving information.

Motif Information

$$\text{Motif Position Information} = 2 - \Big(-\sum_{b \in \{A,T,G,C\}} p_b \log_2 p_b\Big) = H_{\text{background}}(X) - H_{\text{motif}_i}(X)$$

• H_background(X): prior uncertainty about the nucleotide
• H_motif_i(X): uncertainty after learning it is position i in a motif
Example: H(X) = 2 bits for the background and H(X) = 0.63 bits at the motif position. Uncertainty at this position has been reduced by 0.37 bits.
[Plots: P(x) over A, T, G, C for the background and for the motif position.]

Motif Logo
• Conserved residue: reduction of uncertainty of 2 bits
• Little conservation: minimal reduction of uncertainty

$$\text{Motif Position Information} = 2 - \Big(-\sum_{b \in \{A,T,G,C\}} p_b \log_2 p_b\Big) = 2 - H_{\text{motif}_i}(X)$$

Background DNA Frequency
The definition of information assumes a uniform background DNA
H(X) = 1.9 bits
Some motifs could have negative information! = -0.2 bits
[Plot: P(x) over A, T, G, C.]
