UMD CMSC 423 - Motif Search

This preview shows page 1-2-21-22 out of 22 pages.


Motif Search
CMSC 423

DNA
● double-stranded, linear molecule
● strands are complements of each other (A ↔ T; C ↔ G)
● each strand is a string over {A,C,G,T}
● substrings encode genes, most of which encode proteins

"Central Dogma" of Biology
DNA -> mRNA -> Protein
The genome is copied into mRNA by transcription (T ➝ U), and mRNA is turned into proteins by translation.

[Figure: a gene and its upstream region. Labels: TSS, Gene, Intron (not translated), Exon (translated), Upstream region, TF binding sites, Transcription factor, DNA polymerase, DNA.]

Finding transcription factor binding sites can tell us about the cell's regulatory network.

RNA Polymerase
● "polymerase" because it makes RNA into a polymer; it is an enzyme
● Discovered in 1960; Nobel prize for its discovery in 1959... oops
● 1959: Nobel awarded to Severo Ochoa and Arthur Kornberg for discovering what was mistakenly believed to be RNA pol.
● 1960: Sam Weiss and Jerard Hurwitz discover the real RNA pol.
● 2006: Nobel awarded to Roger Kornberg (son of Arthur) for the detailed structure of RNA pol.

[Figure: image of transcription occurring. Each "hair" is a piece of RNA that RNA pol is growing off of the DNA.]

Transcription Factor Binding Sites
Source: RegulonDB (Feb 27, 2010)
[Figure: length of E. coli K12 TF binding sites.]
[Figure: plots for Position -10 and Position -44. X-axes of plots are not to scale.]

Transcription Network
● 169 transcription factors (excluding sigmas)
● 3322 edges: 1753 activation, 1369 repression, 185 both, 3 unknown

Sequence Profiles
[Figure: aligned sequences of the CCT domain, often found near one end of plant proteins.]
Suppose we want to search for other examples of this domain. How can we represent the pattern implied by these sequences? One way is a sequence profile.

Sequence Profiles (PSSM)
[Figure: matrix with one row per amino acid (A, C, D, E, ..., T, V, W, Y) and one column per motif position (1 to 19). Each column sums to 1.]
Color ≈ probability that the ith position has the given amino acid = $e_i(x)$.

Sequence Logos
[Figure: sequence logo, one stack of letters per motif position.]
Height of a letter ≈ fraction of the time that letter is observed at that position.
(Total height of the letters in a column ≈ how conserved the column is.)

Scoring a Sequence
x = MRGSAMASINDSKILSLQNKKNALVDTSGYNAEVRVGDNVQLNTIYTNDFKLSSSGDKIIVN
M = [Figure: the profile matrix. Color ≈ probability that the ith position has the given amino acid = $e_i(x)$.]

The score of a string according to profile M is the product of the probabilities of observing the given letters:

$$\mathrm{Score}(x) = \Pr(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$$

Background Frequencies
We are interested in how different this motif position is from what we expect by chance. Correct for "expect by chance" by dividing by the probability of observing x in a random string:

$$\mathrm{ScoreCorrected}(x) = \frac{\Pr(x \mid M)}{\Pr(x \mid \text{background})} = \prod_{i=1}^{L} \frac{e_i(x_i)}{b(x_i)}$$

$b(x_i)$ := probability of observing character $x_i$ at random. Usually computed as (# of $x_i$ in the entire string) / (length of the string).

Often, to avoid multiplying lots of terms, we take the log and then sum:

$$\mathrm{ScoreCorrectedLog}(x) = \log \prod_{i=1}^{L} \frac{e_i(x_i)}{b(x_i)} = \sum_{i=1}^{L} \log\left(\frac{e_i(x_i)}{b(x_i)}\right)$$

Problem: What about gaps?
• The PSSM doesn't handle either:
  • insertions of characters in the string that are not in the profile.
  • deletions of positions in the profile (that don't have a match in the string).
• A solution: use an HMM to model the profile!
[Figure: the string AMASINDSKILSLQ-NKKNALVD aligned to the profile, with one insertion and one deletion marked.]

A Simple HMM
• A profile is equivalent to a simple HMM:
  Begin -> M1 -> M2 -> M3 -> M4 -> M5 -> M6 -> End
• $e_i(x)$ = emission probabilities given by the sequence profile.
• There is no choice about which state to visit.

Handling Insertions
Insertions are characters in the string that are not in the profile.
[Figure: the chain Begin, M1, ..., M6, End with insert states I1, I2, I3 added; each insert state has a self-loop.]
The "I" state allows any number of non-profile characters to be output.
The emission probabilities for "I" states = random probability of observing each character.

Handling Deletions
Deletions are positions in the profile that are not matched in the string.
We could add O(n^2) edges that allow us to skip any number of match states. But this is too many edges. Instead we add some delete states that don't emit any characters:
[Figure: the chain Begin, M1, ..., M6, End with delete states D1 to D6 added.]

Combining Insertions & Deletions
[Figure: profile HMM with match states M1 to M6, insert states I0 to I6, and delete states D1 to D6 between Begin and End.]

Example
[Figure: the same profile HMM, with the path for the string AMASIN-DS.]
Every alignment corresponds to some path in this HMM.
Every path in this HMM corresponds to some alignment.

PROSITE
Database of protein domains. Patterns are specified by these HMMs:

/DEFAULT: M0=-7; D=-50; I=-50; B1=-500; E1=-500; MI=-105; MD=-105; IM=-105; DM=-105;
 A B C D E F G H I K L M N P Q R S T V W Y Z
/I: B1=0; BI=-105; BD=-105;
/M: SY='R'; M=-19, -8,-30, -7, 3,-21,-19, 10,-30, 25,-20,-10, 1,-19, 10, 58, -9,-11,-22,-22, -7, 2;
/M: SY='E'; M= 1, 0,-24, 1, 16,-22,-15, -8,-19, 8,-13,-10, -2,-11, 4, 4, -4, -8,-15,-25,-15, 9;
/M: SY='A'; M= 18, 2,-18, -2, 10,-24, -8, -9,-18, 0,-17,-13, 1, -9, 4, -7, 7, -2,-13,-26,-18, 7;
/M: SY='R'; M= 5, -8,-20,-11, -2,-16,-13,-11,-18, 10,-16, -9, -4,-14, 1, 13, 3, -2,-10,-22,-12, -1;
/M: SY='I'; M= -6,-26,-18,-28,-22, 1,-29,-23, 23,-20, 22, 14,-24,-26,-20,-18,-17, -5, 23,-24, -5,-22;
/M: SY='M'; M= -8, -9,-22,-12, -5,-13,-20, 0, -3, -5, 1, 12, -8,-17, 5, -3, -6, -2, -6,-24, -5, -1;
/M: SY='R'; M=-16, -4,-29, -3, 3,-23,-18, -3,-30, 31,-23,-12, 2,-17, 9, 53, -7, -9,-20,-22,-11, 3;
/M: SY='Y'; M=-19,-18,-28,-19,-18, 31,-29, 8, -1,-13, 0, -1,-17,-27,-12,-12,-18,-10, -8, 18, 57,-18;
/M: SY='R'; M=-15, -9,-28,-10, 0,-16,-21, -7,-23, 27,-16, -7, -3,-17, 4, 42,-10, -7,-16,-19, -7, 0;
/M: SY='E'; M= -9, 5,-28, 11, 33,-26,-19, -4,-23, 8,-17,-14, -1, -2, 12, 2, -3, -8,-21,-29,-17, 22;
/M: SY='K'; M= -9, -3,-28, -4, 5,-22,-16,-11,-26, 36,-24, -9, -2,-12, 4, 21, -8, -9,-17,-20, -9, 5;
/M: SY='R'; M=-15, -9,-28,-10, -1,-16,-21, -6,-23, 29,-18, -5, -4,-18, 5, 44,-11,-10,-14,-19, -7, 0;
/M: SY='K'; M= -9, 0,-26, -2, 6,-24,-14, -7,-22, 23,-21, -9, 3,-14, 8, 18, -5, -7,-17,-24,-12, 6;
/M: SY='T'; M= -3, 1,-20, -3, 1,-17,-10, -8,-14, -2,-11, -6, 4,-15, 1, -1, 2, 5,-13,-28,-13, 1;
/M: SY='R'; M=-19,-10,-29,-10, -2,-17,-21, 5,-27, 24,-18, -9, -1,-20, 8, 58, -9, -7,-19,-18, -4, -2;
/M: SY='R'; M=-11, -2, -5, -7, -5,-17,-18, -4,-18, 7,-12, -7, 4,-21, -3, 8, -7, -7,-15,-29,-12, -5;
/M: SY='F'; M=-19,-22,-23,-31,-25, 57,-26,-13, -2,-24, 5, -2,-13,-29,-30,-17,-18,-10, -5, 13, 29,-25;
/M: SY='D'; M= -7, 9,-25, 14, 12,-26, 1, -6,-26, -4,-20,-17, 5,-13, 0, -5, 2, -7,-21,-30,-19, 5;
/M: SY='K'; M=-10, -5,-22, -6, 4,-25,-22,-12,-23, 37,-22, -7, -4,-14, 5, 23,-12,-10,-15,-21,-10, 4;
/M: SY='K'; M= -8, -3,-25, -6, 3,-23,-20, -6,-21, 21,-18, -7,
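
The profile scores from the "Scoring a Sequence" and "Background Frequencies" slides can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names, the toy DNA alignment, the toy background string, and the pseudocount (added so that no $e_i(x)$ is zero) are all assumptions made for the example.

```python
import math
from collections import Counter

def build_profile(aligned, alphabet, pseudocount=1.0):
    """Estimate e_i(x) for each column i of equal-length aligned sequences.

    The pseudocount is an assumption of this sketch; it keeps every
    probability positive so the log-odds score is always defined.
    """
    length = len(aligned[0])
    total = len(aligned) + pseudocount * len(alphabet)
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        profile.append({x: (counts[x] + pseudocount) / total for x in alphabet})
    return profile

def background(text, alphabet):
    """b(x) = (# of x in the entire string) / (length of the string)."""
    counts = Counter(text)
    return {x: counts[x] / len(text) for x in alphabet}

def log_odds_score(x, profile, b):
    """ScoreCorrectedLog(x) = sum over i of log(e_i(x_i) / b(x_i))."""
    return sum(math.log(profile[i][c] / b[c]) for i, c in enumerate(x))

# Toy data (hypothetical, for illustration only).
ALPHABET = "ACGT"
aligned = ["ACGT", "ACGA", "ACGT", "TCGT"]
profile = build_profile(aligned, ALPHABET)
b = background("ACGTACGTAC", ALPHABET)

consensus_score = log_odds_score("ACGT", profile, b)
other_score = log_odds_score("TGCA", profile, b)
```

Summing logs gives the same ranking as the product form while avoiding very small floating-point products on long motifs. A string that matches the alignment well (here "ACGT") scores higher than one that does not ("TGCA").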

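
The statement that every alignment corresponds to a path through the combined match/insert/delete HMM suggests finding the best-scoring path by dynamic programming. The sketch below is one hedged way to do that, under simplifying assumptions that are not from the slides: every visit to an I or D state costs a fixed `gap` penalty instead of using trained transition probabilities, match scores are given directly as log-odds values, and insert states score 0 because they emit with background probabilities. The name `viterbi_profile` and the toy scores are hypothetical.

```python
NEG_INF = float("-inf")

def viterbi_profile(x, match_scores, gap=-2.0):
    """Score of the best path for string x through a profile HMM.

    match_scores[j][c] is the log-odds score of emitting c from match
    state M(j+1). Assumptions of this sketch: entering any I or D state
    costs `gap`, and all other transitions cost 0.
    """
    n, length = len(x), len(match_scores)
    # M[i][j], I[i][j], D[i][j]: best score after consuming i characters
    # with the path currently in the match/insert/delete state at column j.
    M = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    M[0][0] = 0.0  # the Begin state plays the role of M0

    for i in range(n + 1):
        for j in range(length + 1):
            if i > 0 and j > 0:
                # Match: consume x[i-1] and advance one profile column.
                prev = max(M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1])
                M[i][j] = prev + match_scores[j - 1].get(x[i - 1], NEG_INF)
            if i > 0:
                # Insert: consume x[i-1] but stay at the same column.
                I[i][j] = max(M[i - 1][j], I[i - 1][j], D[i - 1][j]) + gap
            if j > 0:
                # Delete: skip a profile column without consuming a character.
                D[i][j] = max(M[i][j - 1], I[i][j - 1], D[i][j - 1]) + gap
    return max(M[n][length], I[n][length], D[n][length])

# Toy profile for the motif "ACG": +1 for the expected letter, -1 otherwise.
toy_scores = [{c: (1.0 if c == want else -1.0) for c in "ACGT"} for want in "ACG"]

exact = viterbi_profile("ACG", toy_scores)       # three matches
inserted = viterbi_profile("ACTG", toy_scores)   # one extra character
deleted = viterbi_profile("AG", toy_scores)      # one missing position
```

With these toy scores the exact string uses only match states, the string with an extra T takes one insert state, and the string missing C takes one delete state, so each gap lowers the score by the match it replaces plus the penalty.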

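
To make the PROSITE listing more concrete, the following sketch parses a single `/M:` line into a per-residue score table. It assumes that the 22 scores on each line follow the column order of the alphabet header in the listing (A B C D E F G H I K L M N P Q R S T V W Y Z). `parse_match_line` is a hypothetical helper written for this example, not part of any PROSITE tool.

```python
import re

# Column order copied from the alphabet header in the listing.
ALPHABET = "A B C D E F G H I K L M N P Q R S T V W Y Z".split()

def parse_match_line(line):
    """Parse one "/M:" line into (consensus symbol, {residue: score}).

    Assumes one integer score per alphabet symbol, in header order, and
    a terminating ';'. Truncated lines are rejected rather than guessed.
    """
    symbol = re.search(r"SY='(.)'", line)
    scores = re.search(r"M=([^;]*);", line)
    if symbol is None or scores is None:
        raise ValueError("not a complete /M: line")
    values = [int(tok) for tok in scores.group(1).split(",")]
    if len(values) != len(ALPHABET):
        raise ValueError("expected %d scores, got %d" % (len(ALPHABET), len(values)))
    return symbol.group(1), dict(zip(ALPHABET, values))

# First match line of the listing, copied verbatim.
line = ("/M: SY='R'; M=-19, -8,-30, -7, 3,-21,-19, 10,-30, 25,-20,-10, "
        "1,-19, 10, 58, -9,-11,-22,-22, -7, 2;")
symbol, scores = parse_match_line(line)
best = max(scores, key=scores.get)
```

For this line the highest-scoring residue is the same as the consensus symbol given in `SY`, which is consistent with the assumed column order. The final line of the listing is cut off mid-row, so this parser raises `ValueError` on it instead of filling in missing values.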