Berkeley STATISTICS 246 - Introduction to motif representation and detection - D2520772

School: University of California, Berkeley

Course: Statistics 246 - Statistical Genetics

Pages 76

Introduction to motif representation and detection
Statistics 246, Week 13, Spring 2006, Lecture 1

Motivation
We saw in Lecture 1 last week that transcription of the lac gene in E. coli involved several proteins binding DNA at specific sites, with consensus sequences as follows:
- RNA polymerase binding at TATATT (-10) and TTGACA (-35);
- UP binding at AAA(A/T)(A/T)T(A/T)TTTNNAAA;
- CAP binding at AAATGTGATCTAGATCACATTT; and
- Repressor binding at GTGGAATTGTGAGCGGATAACAATTTTC.
These proteins bind at many sequences similar but not identical to the consensus, and in this lecture we explore the representation and utilization of this diversity.

Motifs - Sites - Signals - Domains
For this lecture, I'll use these terms interchangeably to describe recurring elements of interest to us.
In PROTEINS we have: transmembrane domains, coiled-coil domains, EGF-like domains, signal peptides, phosphorylation sites, antigenic determinants and protein families.
In DNA/RNA we have: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres.

Motifs and models
Motifs
- typically represent regions of structural significance with specific biological function;
- are generalisations from known examples.
The models
- can be highly specific;
- multiple models can be used to give higher sensitivity and specificity in their detection;
- can sometimes be generated automatically from examples or multiple alignments.

Why probability models for biomolecular motifs?
- to characterize them;
- to help identify them;
- for incorporation into larger models, e.g. for an entire gene.

Transcription initiation in E. coli
In E. coli, transcription is initiated at the promoter, and the sequence of the promoter is recognised by the sigma factor of RNA polymerase.

Determinism 1: consensus sequences
Factor   Promoter consensus sequence
         -35       -10
σ70      TTGACA    TATAAT
σ28      CTAAA     CCGATAT
Similarly for σ32, σ38 and σ54.
Consensus sequences have the obvious limitation: there is usually some deviation from them.

The human transcription factor Sp1 has 3 Cys-Cys-His-His
zinc finger DNA-binding domains.

Prosite patterns
An early effort at collecting descriptors for functionally important protein motifs. They do not attempt to describe a complete domain or protein, but simply try to identify the most important residue combinations, such as the catalytic site of an enzyme. They use regular expression syntax, and focus on the most highly conserved residues in a protein family.
http://au.expasy.org/

Seeking consensus in sequences
<A-x-[ST](2)-x(0,1)-V-{LI}
This pattern, which must be in the N-terminal of the sequence ("<"), means:
Ala - any - [Ser or Thr] - [Ser or Thr] - (any or none) - Val - (any but Leu, Ile)

Example: C2H2 zinc finger DNA-binding domain
The characteristic motif of a Cys-Cys-His-His zinc finger DNA-binding domain has the regular expression
C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
Here, as in algebra, X is unknown. The sequence of our example domain 1ZNF is as follows, clearly fitting the model:
XYKCGLCERSFVEKSALSRHQRVHKNX

Searching with regular expressions
http://www.isrec.isb-sib.ch/software/PATFND_form.html
Pattern: c-x(2,4)-c-x(3)-[livmfywc]-x(8)-h-x(3,5)-h
PatternFind output (ISREC-Server), Date: Wed Aug 22 13:00:41 MET 2001
gp AF234161 7188808 01AEB01ABAC4F945 nuclear protein NP94b (Homo sapiens)
Occurences: 2
Position 514: CYICKASCSSQQEFQDHMSEPQH
Position 606: CTVCNRYFKTPRKFVEHVKSQGH

Regular expressions can be limiting
The regular expression syntax is too rigid to represent many highly divergent protein motifs. Also, short patterns are sometimes insufficient with today's large databases: even when requiring perfect matches, you might find many false positives. On the other hand, some real sites might not be perfect matches.
We need to go beyond apparently equally likely alternatives, and ranges for gaps. We deal with the former first, having a distribution at each position.

Cys-Cys-His-His profile: sequence logo form
1ZNF: YKCGLCERSFVEKSALSRHQRVHKN
A sequence logo is a scaled position-specific amino acid distribution. Scaling is by a measure of a position's information content. Note that we've lost the option of
variable spacing.

60 human TATA boxes
[Figure: sequence logo of 60 human TATA boxes.]
These two figures courtesy of Anders Krogh.

Weight matrix model (WMM) = stochastic consensus sequence

Counts from 242 known σ70 sites:
Position    1     2     3     4     5     6
A           9   214    63   142   118     8
C          22     7    26    31    52    13
G          18     2    29    38    29     5
T         193    19   124    31    43   216

Relative frequencies $f_{bl}$:
Position    1     2     3     4     5     6
A        0.04  0.88  0.26  0.59  0.49  0.03
C        0.09  0.03  0.11  0.13  0.21  0.05
G        0.07  0.01  0.12  0.16  0.12  0.02
T        0.80  0.08  0.51  0.13  0.18  0.89

Position-specific scoring matrix, $10 \log_2 (f_{bl} / p_b)$:
Position    1     2     3     4     5     6
A         -38    19     1    12    10   -48
C         -15   -38    -8   -10    -3   -32
G         -13   -48    -6    -7   -10   -40
T          17   -32     8    -9    -6    19

Informativeness (plotted for positions 1 to 6): $2 + \sum_b p_{bl} \log_2 p_{bl}$

Derivation of PSSM entries
Suppose that we have aligned sequence data on a number of instances of a given type of site.
candidate sequence:  CTATAATC
aligned position:    123456
Hypotheses: S = site (and independence); R = random (equiprobable, independence).

$$\log_2 \frac{\mathrm{pr}(\mathrm{CTATAA} \mid S)}{\mathrm{pr}(\mathrm{CTATAA} \mid R)} = \log_2 \frac{.09 \times .03 \times .26 \times .13 \times .51 \times .01}{.25 \times .25 \times .25 \times .25 \times .25 \times .25} = (2 + \log_2 .09) + \dots + (2 + \log_2 .01) = \tfrac{1}{10}\,(-15 - 32 + 1 - 9 + 10 - 48)$$

Generally, the PSSM score is $s_{bl} = \log (f_{bl} / p_b)$, where $l$ = position, $b$ = base, and $p_b$ = background frequency.

Use of a PSSM to find sites
Move the matrix along the sequence and score each window. Peaks should occur at the true sites. Of course, in general any threshold will have some false positive and false negative rate.
Sequence: C T A T A A T C
Window   Sum
CTATAA   -93
TATAAT    85
ATAATC   -95

12 examples of 5' splice donor sites
exon  intron
TCG   GTGAGT
TGG   GTGTGT
CCG   GTCCGT
ATG   GTAAGA
TCT   GTAAGT
CAG   GTAGGA
CAG   GTAGGG
AAG   GTAAGG
AGG   GTATGG
TGG   GTAAGG
GAG   GTTAGT
CAT   GTGAGT
There are many thousands of instances of such sites.

Sequence logo for human splice donor sites
Base composition (%) by position:
Base   -3   -2   -1    0    1    2    3    4    5
A      33   61   10    0    0   53   71    7   16
C      37   13    3    0    0    3    8    6   16
G      18   12   80  100    0   42   12   81   22
T      12   14    7    0  100    2    9    6   46

Representation of motifs: the next steps
Missing from the position-specific distribution representation of motifs are good
ways of dealing with:
- length distributions for insertions/deletions;
- non-local association of amino acids.
Profiles, and then Hidden Markov models, help with the first. The second remains a hard, unsolved problem. Local (e.g. neighbour) associations can be dealt with straightforwardly, generalizing PSSMs.

Profiles
Are a variation of the position specific scoring …
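
The sliding-window scoring described under "Use of a PSSM to find sites", together with the informativeness measure from the weight matrix slide, can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's own code. The matrix and frequency values are the ones tabulated in the lecture; the minus signs on the matrix entries are an assumption (restored wherever the relative frequency is below the background 0.25), and the function names `window_score`, `scan` and `informativeness` are hypothetical.

```python
import math

# PSSM entries as tabulated in the lecture (rows: base, columns: positions 1-6).
# Assumption: minus signs are restored where the relative frequency is below 0.25.
PSSM = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -40],
    "T": [17, -32, 8, -9, -6, 19],
}

# Relative frequencies from the 242 known sites, as tabulated in the lecture.
FREQS = {
    "A": [0.04, 0.88, 0.26, 0.59, 0.49, 0.03],
    "C": [0.09, 0.03, 0.11, 0.13, 0.21, 0.05],
    "G": [0.07, 0.01, 0.12, 0.16, 0.12, 0.02],
    "T": [0.80, 0.08, 0.51, 0.13, 0.18, 0.89],
}


def window_score(window, pssm=PSSM):
    """Sum the PSSM entries selected by each base of one window."""
    return sum(pssm[base][pos] for pos, base in enumerate(window))


def scan(sequence, pssm=PSSM):
    """Slide the matrix along the sequence; return (start, window, score) tuples."""
    width = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start, window, window_score(window, pssm)))
    return hits


def informativeness(freqs=FREQS):
    """Per-position information 2 + sum_b p_bl * log2(p_bl), in bits."""
    width = len(next(iter(freqs.values())))
    return [
        2 + sum(row[pos] * math.log2(row[pos]) for row in freqs.values() if row[pos] > 0)
        for pos in range(width)
    ]


if __name__ == "__main__":
    for start, window, score in scan("CTATAATC"):
        print(start + 1, window, score)
    print([round(bits, 2) for bits in informativeness()])
```

Under these assumptions, scanning the candidate sequence CTATAATC gives window scores of -93, 85 and -95, in line with the sums quoted in the lecture, and the peak falls on the window TATAAT. A detection threshold would still have to be chosen separately, with the false positive and false negative trade-off the lecture mentions.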

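The two Prosite-style patterns discussed earlier in the lecture can be tried out with an ordinary regular expression engine. The sketch below is a hand translation into Python's `re` syntax, offered as an illustration under stated assumptions rather than as how PatternFind works internally: `x` becomes `.`, `x(m,n)` becomes `.{m,n}`, `{LI}` becomes `[^LI]`, and the N-terminal anchor becomes `^`. The names `C2H2`, `N_TERMINAL` and `find_sites` are hypothetical, and the short test sequences for the N-terminal pattern are made up for the example.

```python
import re

# Hand translation of C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H.
C2H2 = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

# Hand translation of the N-terminal pattern A-x-[ST](2)-x(0,1)-V-{LI}:
# "^" anchors the match at the start of the sequence, "[^LI]" excludes Leu and Ile.
N_TERMINAL = re.compile(r"^A.[ST]{2}.?V[^LI]")


def find_sites(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # The 1ZNF example domain from the lecture.
    print(find_sites(C2H2, "XYKCGLCERSFVEKSALSRHQRVHKNX"))
    # The two segments reported in the PatternFind output.
    for hit in ("CYICKASCSSQQEFQDHMSEPQH", "CTVCNRYFKTPRKFVEHVKSQGH"):
        print(hit, bool(C2H2.fullmatch(hit)))
    # Made-up sequences illustrating the N-terminal anchor and the {LI} exclusion.
    for seq in ("AGSTVK", "AGSTVL", "MAGSTVK"):
        print(seq, find_sites(N_TERMINAL, seq))
```

Note that `finditer` reports only non-overlapping matches, so a scan of a long protein could miss a site that overlaps an earlier hit; a lookahead-based search would be one way around that, at the cost of a less readable pattern.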