UMD CMSC 423 - Motif Search

School: University of Maryland, College Park

Course: CMSC 423 - Bioinformatic Algorithms, Databases, and Tools

Pages: 22

Motif Search
CMSC 423

Central Dogma of Biology

DNA is transcribed into mRNA (T becomes U), and mRNA is translated into proteins.

Genome: DNA is a double-stranded, linear molecule. The strands are complements of each other (A pairs with T, C pairs with G). Each strand is a string over {A, C, G, T}. Substrings encode genes, most of which encode proteins.

DNA -> mRNA -> Protein

[Figure: a gene and its upstream region. Labels: DNA polymerase, transcription factor, DNA, TSS, TF binding sites, upstream region, exon (translated), intron (not translated), gene.]

Finding transcription factor binding sites can tell us about the cell's regulatory network.

RNA Polymerase

So named because it is an enzyme that makes RNA into a polymer. Discovered in 1960. Nobel Prize for its discovery awarded in 1959 (oops).

1959: Nobel Prize awarded to Severo Ochoa and Arthur Kornberg for discovering what was mistakenly believed to be RNA pol.
1960: Sam Weiss and Jerard Hurwitz discover the real RNA pol.
2006: Nobel Prize awarded to Roger Kornberg (son of Arthur) for the detailed structure of RNA pol.

[Figure: image of transcription occurring. Each "hair" is a piece of RNA that RNA pol is growing off of the DNA.]

Transcription Factor Binding Sites

[Figure: length of E. coli K12 TF binding sites. Source: RegulonDB, Feb 27, 2010.]

[Figures: two further binding-site plots, labeled Position 44 and Position 10. The x axes of the plots are not to scale. Source: RegulonDB, Feb 27, 2010.]

Transcription Network

169 transcription factors (excluding sigmas). 3322 edges: 1753 activation, 1369 repression, 185 both, 3 unknown.

Sequence Profiles

The CCT domain is often found near one end of plant proteins. Suppose we want to search for other examples of this domain. How can we represent the pattern implied by these sequences? One way is a sequence profile.

Sequence Profiles (PSSM)

[Figure: heat map of the profile. X axis: motif position (1 to 19). Y axis: amino acid (A through Y). Color: e_i(x), the probability that the ith position has the given amino acid.]

Sequence Logos

The height of each letter reflects the fraction of time that letter is observed at that position. The total height of all the letters in a column reflects how conserved the column is. The x axis is the motif position.

Scoring a Sequence

x = MRGSAMASINDSKILSLQNKKNALVDTSGYNAEVRVGDNVQLNTIYTNDFKLSSSGDKIIVN
M = the profile, where e_i(x) is the probability that the ith position has the given amino acid.

\[ \mathrm{Score}(x) = \Pr(x \mid M) = \prod_{i=1}^{L} e_i(x_i) \]

The score of a string according to profile M is the product of the probabilities of observing the given letters at each position.

Background Frequencies

We are interested in how different this motif position is from what we expect by chance. Correct for "expect by chance" by dividing by the probability of observing x in a random string:

\[ \mathrm{ScoreCorrected}(x) = \frac{\Pr(x \mid M)}{\Pr(x \mid \text{background})} = \prod_{i=1}^{L} \frac{e_i(x_i)}{b(x_i)} \]

Here b(x_i) is the probability of observing character x_i at random. It is usually computed as (number of occurrences of x_i in the entire string) / (length of the string).

Often, to avoid multiplying lots of terms, we take the log and then sum:

\[ \mathrm{ScoreCorrectedLog}(x) = \log \prod_{i=1}^{L} \frac{e_i(x_i)}{b(x_i)} = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{b(x_i)} \]

Problem: What about gaps?

The PSSM doesn't handle either of the following:
insertions: characters in the string that are not in the profile;
deletions: positions in the profile that don't have a match in the string.

[Figure: an insertion and a deletion marked on the strings AMASINDSKILSLQ and NKKNALVD.]

A solution: use an HMM to model the profile.

A Simple HMM

A profile is equivalent to a simple HMM in which there is no choice about which state to visit. The emission probabilities e_i(x) are given by the sequence profile.

[Diagram: Begin -> M1 -> M2 -> M3 -> M4 -> M5 -> M6 -> End]

Handling Insertions

Insertions are characters in the string that are not in the profile. An I state allows any number of non-profile characters to be output. The emission probabilities for the I states are the random (background) probabilities of observing each character.

[Diagram: insert states I1, I2, I3 added to the chain Begin, M1, M2, M3, M4, M5, M6, End.]

Handling Deletions

Deletions are positions in the profile that are not matched in the string. We could add O(n^2) edges that allow us to skip any number of match states, but this is too many edges. Instead, we add delete states that don't emit any characters.

[Diagram: delete states D1 to D6 added to the chain Begin, M1, M2, M3, M4, M5, M6, End.]

Combining Insertions and Deletions

[Diagram: delete states D1 to D6, insert states I0 to I6, and the chain Begin, M1, M2, M3, M4, M5, M6, End.]

Example

Every alignment corresponds to some path in this HMM. Every path in this HMM corresponds to some alignment.

[Diagram: the combined HMM with the example string AMASIN DS.]

PROSITE

Database of protein domains. Patterns are specified by these HMMs.

DEFAULT: M0 7; D 50; I 50; B1 500; E1 500; MI 105; MD 105; IM 105; DM 105
First I position: B1 0; BI 105; BD 105
SY symbols of the M positions, in order: R E A R I M R Y R E K R K T R R F D K K I R Y A C R K A L A D Q R P R V K G R F V R N

M position scores (one row per M position, one column per symbol):

SY   A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   Z
R   19   8  30   7   3  21  19  10  30  25  20  10   1  19  10  58   9  11  22  22   7   2
E    1   0  24   1  16  22  15   8  19   8  13  10   2  11   4   4   4   8  15  25  15   9
A   18   2  18   2  10  24   8   9  18   0  17  13   1   9   4   7   7   2  13  26  18   7
R    5   8  20  11   2  16  13  11  18  10  16   9   4  14   1  13   3   2  10  22  12   1
I    6  26  18  28  22   1  29  23  23  20  22  14  24  26  20  18  17   5  23  24   5  22
M    8   9  22  12   5  13  20   0   3   5   1  12   8  17   5   3   6   2   6  24   5   1
R   16   4  29   3   3  23  18   3  30  31  23  12   2  17   9  53   7   9  20  22  11   3
Y   19  18  28  19  18  31  29   8   1  13   0   1  17  27  12  12  18  10   8  18  57  18
R   15   9  28  10   0  16  21   7  23  27  16   7   3  17   4  42  10   7  16  19   7   0
E    9   5  28  11  33  26  19   4  23   8  17  14   1   2  12   2   3   8  21  29  17  22
K    9   3  28   4   5  22  16  11  26  36  24  …
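
The log-odds score from the Background Frequencies slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's implementation: the function names (background_frequencies, log_odds_score, best_window), the three-position DNA profile, and the toy string are all made up for the example, and the background is estimated from the same string that is being searched.

```python
import math
from collections import Counter


def background_frequencies(text):
    """b(c) = (number of occurrences of c in text) / (length of text)."""
    counts = Counter(text)
    return {c: n / len(text) for c, n in counts.items()}


def log_odds_score(x, profile, background):
    """Sum over i of log(e_i(x_i) / b(x_i)).

    profile is a list of dicts, one per motif position, mapping a
    character to its emission probability e_i(character).
    """
    if len(x) != len(profile):
        raise ValueError("x must have the same length as the profile")
    total = 0.0
    for e_i, c in zip(profile, x):
        p = e_i.get(c, 0.0)
        if p == 0.0:
            # A zero-probability letter makes the product zero.
            return float("-inf")
        total += math.log(p / background[c])
    return total


def best_window(text, profile, background):
    """Score every gap-free window of the profile's length; return (score, start)."""
    L = len(profile)
    scored = [
        (log_odds_score(text[s:s + L], profile, background), s)
        for s in range(len(text) - L + 1)
    ]
    return max(scored)


# Hypothetical toy profile over DNA that favors the motif "ACG".
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
text = "TTACGTT"
background = background_frequencies(text)
score, start = best_window(text, profile, background)
print(start, text[start:start + len(profile)], round(score, 3))
```

Because every window has exactly the profile's length, this search cannot align a string that contains an insertion or a deletion, which is the limitation the profile HMM is introduced to address.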
