Berkeley STATISTICS 246 - Introduction to motif representation and detection



Introduction to motif representation and detection
Statistics 246, Spring 2006
Week 13, Lecture 1

Motivation

We saw in Lecture 1 last week that transcription of the lac gene in E. coli involved several proteins binding DNA at specific sites, with consensus sequences as follows:

• RNA polymerase binding at TATATT (-10) and TTGACA (-35)
• UP binding at AAA(A/T)(A/T)T(A/T)TTTNNAAA
• CAP binding at AAATGTGATCTAGATCACATTT
• Repressor binding at GTGGAATTGTGAGCGGATAACAATTTTC

These proteins bind at many sequences similar but not identical to the consensus, and in this lecture we explore the representation and utilization of this diversity.

Motifs - Sites - Signals - Domains

For this lecture, I’ll use these terms interchangeably to describe recurring elements of interest to us.

In PROTEINS we have: transmembrane domains, coiled-coil domains, EGF-like domains, signal peptides, phosphorylation sites, antigenic determinants, …, and protein families.

In DNA / RNA we have: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres, ...

Motifs and models

• Motifs typically represent regions of structural significance with specific biological function.
• They are generalisations from known examples.
• The models can be highly specific.
• Multiple models can be used to give higher sensitivity & specificity in their detection.
• They can sometimes be generated automatically from examples or multiple alignments.

Why (probability) models for biomolecular motifs?

• to characterize them
• to help identify them
• for incorporation into larger models, e.g. for an entire gene

Transcription initiation in E. coli

In E. coli, transcription is initiated at the promoter, and the sequence of the promoter is recognised by the sigma factor of RNA polymerase.

Determinism 1: consensus sequences

σ factor   Promoter consensus sequence
           -35       -10
σ70        TTGACA    TATAAT
σ28        CTAAA     CCGATAT

Similarly for σ32, σ38 and σ54.
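One simple way to make a consensus sequence operational is to count the positions at which a candidate site differs from it. The sketch below is only an illustration: the function names `mismatches` and `best_windows` are made up for this example, and the two sequences are the σ70 -10 consensus from the table above and the lac -10 site quoted under Motivation.

```python
def mismatches(site, consensus):
    """Number of positions at which a candidate site differs from the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(a != b for a, b in zip(site, consensus))


def best_windows(sequence, consensus):
    """Return (fewest mismatches, 0-based start positions achieving it).

    Assumes the sequence is at least as long as the consensus.
    """
    width = len(consensus)
    scores = [
        mismatches(sequence[i:i + width], consensus)
        for i in range(len(sequence) - width + 1)
    ]
    best = min(scores)
    return best, [i for i, s in enumerate(scores) if s == best]


SIGMA70_MINUS10 = "TATAAT"  # sigma-70 -10 consensus (from the table)
LAC_MINUS10 = "TATATT"      # lac -10 site (from the Motivation slide)

print(mismatches(LAC_MINUS10, SIGMA70_MINUS10))   # 1
print(best_windows("CTATAATC", SIGMA70_MINUS10))  # (0, [1])
```

A mismatch count treats every position and every substitution as equally important, which is a strong simplification.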
Consensus sequences have the obvious limitation: there is usually some deviation from them.

The human transcription factor Sp1 has 3 Cys-Cys-His-His zinc finger DNA binding domains.

Prosite patterns

An early effort at collecting descriptors for functionally important protein motifs. They do not attempt to describe a complete domain or protein, but simply try to identify the most important residue combinations, such as the catalytic site of an enzyme. They use regular expression syntax, and focus on the most highly conserved residues in a protein family.

http://au.expasy.org

Seeking consensus in sequences

<A - x - [ST](2) - x(0,1) - V - {LI}

This pattern, which must be in the N-terminal of the sequence ('<'), means:

Ala - any - [Ser or Thr] - [Ser or Thr] - (any or none) - Val - (any but Leu, Ile)

Example: C2H2 zinc finger DNA binding domain

The characteristic motif of a Cys-Cys-His-His zinc finger DNA binding domain has regular expression

C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H

Here, as in algebra, X is unknown. The sequence of our example domain 1ZNF is as follows, clearly fitting the model.

XYKCGLCERSFVEKSALSRHQRVHKNX

Searching with regular expressions

http://www.isrec.isb-sib.ch/software/PATFND_form.html

c.{2,4}c...[livmfywc]........h.{3,5}h

PatternFind output

[ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001
...
gp|AF234161|7188808|01AEB01ABAC4F945 nuclear protein NP94b [Homo sapiens] Occurences: 2
Position : 514 CYICKASCSSQQEFQDHMSEPQH
Position : 606 CTVCNRYFKTPRKFVEHVKSQGH
........

Regular expressions can be limiting

The regular expression syntax is too rigid to represent many highly divergent protein motifs.

Also, short patterns are sometimes insufficient with today’s large databases. Even requiring perfect matches you might find many false positives. On the other hand, some real sites might not be perfect matches.

We need to go beyond apparently equally likely alternatives, and ranges for gaps.
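The Prosite-style patterns above map almost mechanically onto ordinary regular expressions. The following is a minimal sketch of that translation, covering only the syntax that appears in this lecture. The helper name `prosite_to_regex` is hypothetical; it is not part of Prosite or PatternFind.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression.

    Handles only the syntax used in the lecture: '-' separators, x/X wildcards,
    [..] alternatives, {..} exclusions, (n) and (n,m) repeats, '<' and '>' anchors.
    """
    pieces = []
    for element in pattern.replace(" ", "").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # must start at the N-terminal
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # must end at the C-terminal
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (3,5).
        parts = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = parts.group(1), parts.group(2)
        if core in ("x", "X"):
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if repeat:
            core += "{" + repeat + "}"
        pieces.append(prefix + core + suffix)
    return "".join(pieces)


C2H2 = "C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H"
ZNF_1 = "XYKCGLCERSFVEKSALSRHQRVHKNX"  # example domain 1ZNF from the lecture

zinc_finger = re.compile(prosite_to_regex(C2H2))
hit = zinc_finger.search(ZNF_1)
print(zinc_finger.pattern)           # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H
print(hit.start() + 1, hit.group())  # 4 CGLCERSFVEKSALSRHQRVH
```

The translated expression is the upper-case equivalent of the pattern submitted to PatternFind, and it also matches the two reported hits at positions 514 and 606.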
We deal first with the equally likely alternatives, by having a distribution at each position.

Cys-Cys-His-His profile: sequence logo form

A sequence logo is a scaled position-specific a.a. distribution. Scaling is by a measure of a position’s information content. (Note that we’ve lost the option of variable spacing.)

1ZNF: ..YKCGLCERSFVEKSALSRHQRVHKN..

60 human TATA boxes

These two figures courtesy of Anders Krogh.

Weight matrix model (WMM) = Stochastic consensus sequence

Counts from 242 known σ70 sites:

Position     1     2     3     4     5     6
A            9   214    63   142   118     8
C           22     7    26    31    52    13
G           18     2    29    38    29     5
T          193    19   124    31    43   216

Relative frequencies f_bl:

Position     1     2     3     4     5     6
A         0.04  0.88  0.26  0.59  0.49  0.03
C         0.09  0.03  0.11  0.13  0.21  0.05
G         0.07  0.01  0.12  0.16  0.12  0.02
T         0.80  0.08  0.51  0.13  0.18  0.89

Position Specific Scoring Matrix, 10 × log2(f_bl / p_b):

Position     1     2     3     4     5     6
A          -38    19     1    12    10   -48
C          -15   -38    -8   -10    -3   -32
G          -13   -48    -6    -7   -10   -40
T           17   -32     8    -9    -6    19

[Figure: informativeness, 2 + Σ_b p_bl log2 p_bl, on a scale of 0 to 2, by position 1 to 6]

Derivation of PSSM entries

Suppose that we have aligned sequence data on a number of instances of a given type of site.

candidate sequence   CTATAATC....
aligned position     123456

Hypotheses:
S = site (and independence)
R = random (equiprobable, independence)

\[
\log_2 \frac{\Pr(\mathrm{CTATAA} \mid S)}{\Pr(\mathrm{CTATAA} \mid R)}
= \log_2 \frac{.09 \times .03 \times .26 \times .13 \times .51 \times .01}{.25 \times .25 \times .25 \times .25 \times .25 \times .25}
= (2 + \log_2 .09) + \cdots + (2 + \log_2 .01)
= \frac{1}{10}\,\{-15 - 32 + 1 - 9 + 10 - 48\}
\]

Generally, the PSSM score is

\[
s_{bl} = \log \frac{f_{bl}}{p_b}
\]

where l = position, b = base, and p_b = background frequency.

Use of a PSSM to find sites

Sequence: C T A T A A T C

Position     1     2     3     4     5     6
A          -38    19     1    12    10   -48
C          -15   -38    -8   -10    -3   -32
G          -13   -48    -6    -7   -10   -48
T           17   -32     8    -9    -6    19

Window     Sum
CTATAA     -93
TATAAT     +85
ATAATC     -95

Move the matrix along the sequence and score each window. Peaks should occur at the true sites.

Of course in general any threshold will have some false positive and false negative rate.

12 examples of 5’ splice (donor) sites

exon  intron
TCG   GTGAGT
TGG   GTGTGT
CCG   GTCCGT
ATG   GTAAGA
TCT   GTAAGT
CAG   GTAGGA
CAG   GTAGGG
AAG   GTAAGG
AGG   GTATGG
TGG   GTAAGG
GAG   GTTAGT
CAT   GTGAGT

There are many thousands of instances of such sites.

Base    -3    -2    -1     0    +1    +2    +3    +4    +5
A       33    61    10     0     0    53    71     7    16
C       37    13     3     0     0     3     8     6    16
G       18    12    80   100     0    42    12    81
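The weight-matrix steps from the earlier slides (counts, relative frequencies, log-odds scores, sliding-window scan) can be sketched in a few lines. This is an illustrative sketch, not the lecture's own code. All function and variable names are made up for the example, and the background is assumed to be equiprobable (p_b = 0.25). The integer matrix is copied from the "Use of a PSSM to find sites" slide, and scores recomputed from the printed counts with `log_odds` need not reproduce those integers.

```python
import math

BASES = "ACGT"

# Integer PSSM as printed on the "Use of a PSSM to find sites" slide
# (rows A, C, G, T; columns are motif positions 1..6).
SLIDE_PSSM = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -48],
    "T": [17, -32, 8, -9, -6, 19],
}

# Counts from 242 known sigma-70 sites, as printed on the weight-matrix slide.
SLIDE_COUNTS = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}


def relative_frequencies(counts):
    """Turn per-base count rows into per-position relative frequencies f_bl."""
    width = len(counts["A"])
    totals = [sum(counts[b][l] for b in BASES) for l in range(width)]
    return {b: [counts[b][l] / totals[l] for l in range(width)] for b in BASES}


def log_odds(freqs, background=0.25, scale=10.0):
    """s_bl = scale * log2(f_bl / p_b); assumes every f_bl is positive."""
    return {b: [scale * math.log2(f / background) for f in freqs[b]] for b in BASES}


def information_content(freqs):
    """Per-position informativeness 2 + sum_b p_bl * log2(p_bl), in bits."""
    width = len(freqs["A"])
    return [
        2.0 + sum(freqs[b][l] * math.log2(freqs[b][l]) for b in BASES if freqs[b][l] > 0)
        for l in range(width)
    ]


def scan(sequence, pssm):
    """Score every window of the sequence; one score per start position."""
    width = len(pssm["A"])
    return [
        sum(pssm[base][l] for l, base in enumerate(sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


if __name__ == "__main__":
    print(scan("CTATAATC", SLIDE_PSSM))  # [-93, 85, -95]
```

The scan of CTATAATC reproduces the three window sums shown on the slide, with the peak at the TATAAT window. In practice a threshold on these scores would be chosen with the false positive and false negative rates in mind.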

