Berkeley STATISTICS 246 - Biological Sequence Analysis



1. Biological Sequence Analysis
Lecture 26, Statistics 246
April 27, 2004

2. Synopsis
Some biological background
A progression of models
Acknowledgements
References

3. The objects of our study
DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units.
DNA: units are the nucleotide residues A, C, G and T.
RNA: units are the nucleotide residues A, C, G and U.
Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

4. The central dogma
DNA       CCTGAGCCAACTATTGATGAA
  (transcription)
mRNA      CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein   PEPTIDE

5. A protein-coding gene
[Figure]

6. Motifs - Sites - Signals - Domains
For this lecture, I'll use these terms interchangeably to describe recurring elements of interest to us.
In PROTEINS we have: transmembrane domains, coiled-coil domains, EGF-like domains, signal peptides, phosphorylation sites, antigenic determinants, ...
In DNA / RNA we have: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres, ...

7. Motifs and models
Motifs typically represent regions of structural significance with specific biological function.
They are generalisations from known examples.
The models can be highly specific.
Multiple models can be used to give higher sensitivity & specificity in their detection.
Models can sometimes be generated automatically from examples or multiple alignments.

8. The use of stochastic models for motifs
Can be descriptive, predictive or everything else in between ... almost business as usual.
Stochastic mechanisms should never be taken literally; nevertheless, they can be amazingly useful.
Care is always needed: a model or method can break down at any time without notice.
Biological confirmation of predictions is almost always necessary.

9. Transcription initiation in E. coli
RNA polymerase-promoter interactions
In E. coli, transcription is initiated at the promoter, whose sequence is recognised by the sigma factor of RNA polymerase.

10. Transcription initiation in E. coli, cont.
..YKFSTYATWWIRQAITR..

11. Determinism 1: consensus sequences
σ factor   Promoter consensus sequence
           -35        -10
σ70        TTGACA     TATAAT
σ28        CTAAA      CCGATAT
Similarly for σ32, σ38 and σ54.
Consensus sequences have the obvious limitation: there is usually some deviation from them.

12. The human transcription factor Sp1
Sp1 has 3 Cys-Cys-His-His zinc finger DNA binding domains.

13. Determinism 2: regular expressions
The characteristic motif of a Cys-Cys-His-His zinc finger DNA binding domain has regular expression
  C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
Here, as in algebra, X is unknown. The 29 a.a. sequence of our example domain 1SP1 is as follows, clearly fitting the model.
  1SP1: KKFACPECPKRFMRSDHLSKHIKTHQNKK

14. Prosite patterns
An early effort at collecting descriptors for functionally important protein motifs. They do not attempt to describe a complete domain or protein, but simply try to identify the most important residue combinations, such as the catalytic site of an enzyme. They use regular expression syntax, and focus on the most highly conserved residues in a protein family.
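
The zinc finger pattern above can be checked against the 1SP1 sequence mechanically. The following is a minimal sketch, not the tool used in the lecture: `prosite_to_regex` is a hypothetical helper written for this illustration, and it handles only a small subset of the pattern syntax (literal residues, x for any residue, [ABC], {ABC}, repeat counts, and the '<' and '>' anchors).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; covers only a subset of the syntax.
    """
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):      # must start at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # must end at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specifier from an optional (n) or (n,m) count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core.lower() == "x":          # x / X: any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {ABC}: any residue except A, B, C
        if lo is None:
            repeat = ""
        elif hi is None:
            repeat = "{" + lo + "}"
        else:
            repeat = "{" + lo + "," + hi + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


zinc_finger = "C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H"
sp1_domain = "KKFACPECPKRFMRSDHLSKHIKTHQNKK"

regex = prosite_to_regex(zinc_finger)
match = re.search(regex, sp1_domain)

print(regex)          # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H
print(match.group())  # CPECPKRFMRSDHLSKHIKTH
```

The match starts at the first cysteine (0-based index 4) and covers 21 of the 29 residues, which is what "clearly fitting the model" amounts to here.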
http://au.expasy.org

15. More on Prosite patterns
This pattern, which must be in the N-terminal of the sequence ('<'):
  <A-x-[ST](2)-x(0,1)-V-{LI}
means:
  Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val-(any but Leu, Ile)

16. Searching with regular expressions
http://www.isrec.isb-sib.ch/software/PATFND_form.html
Pattern: c.{2,4}c...[livmfywc]........h.{3,5}h
PatternFind output:
  [ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001
  ...
  gp|AF234161|7188808|01AEB01ABAC4F945 nuclear protein NP94b [Homo sapiens]
  Occurences: 2
  Position : 514 CYICKASCSSQQEFQDHMSEPQH
  Position : 606 CTVCNRYFKTPRKFVEHVKSQGH
  ........

17. Regular expressions can be limiting
The regular expression syntax is still too rigid to represent many highly divergent protein motifs.
Also, short patterns are sometimes insufficient with today's large databases. Even requiring perfect matches you might find many false positives. On the other hand, some real sites might not be perfect matches.
We need to go beyond apparently equally likely alternatives, and ranges for gaps.
We deal with the former first, having a distribution at each position.

18. Cys-Cys-His-His profile: sequence logo form
A sequence logo is a scaled position-specific a.a. distribution.
Scaling is by a measure of a position's information content.

19. Weight matrix model (WMM) = Stochastic consensus sequence

Counts from 242 known σ70 sites (columns are positions l = 1..6):
       1     2     3     4     5     6
A      9   214    63   142   118     8
C     22     7    26    31    52    13
G     18     2    29    38    29     5
T    193    19   124    31    43   216

Relative frequencies $f_{bl}$:
       1     2     3     4     5     6
A   0.04  0.88  0.26  0.59  0.49  0.03
C   0.09  0.03  0.11  0.13  0.21  0.05
G   0.07  0.01  0.12  0.16  0.12  0.02
T   0.80  0.08  0.51  0.13  0.18  0.89

Scores $10 \log_2 (f_{bl}/p_b)$:
       1     2     3     4     5     6
A    -38    19     1    12    10   -48
C    -15   -38    -8   -10    -3   -32
G    -13   -48    -6    -7   -10   -40
T     17   -32     8    -9    -6    19

Informativeness (plotted per position on a scale from 0 to 2): $2 + \sum_b p_{bl} \log_2 p_{bl}$

20. Interpretation of weight matrix entries
candidate sequence   CTATAATC....
aligned position     123456
Hypotheses:
S = site (and independence)
R = random (equiprobable, independence)

$$\log_2 \frac{\mathrm{pr}(\mathrm{CTATAA} \mid S)}{\mathrm{pr}(\mathrm{CTATAA} \mid R)} = \log_2 \frac{.09 \times .08 \times .26 \times .13 \times .51 \times .01}{.25 \times .25 \times .25 \times .25 \times .25 \times .25}$$
$$= (2 + \log_2 .09) + \dots + (2 + \log_2 .01)$$
$$= \tfrac{1}{10}\{-15 - 32 + 1 - 9 + 10 - 48\}$$

Generally, score $s_{bl} = \log (f_{bl}/p_b)$, where l = position, b = base, $p_b$ = background frequency.

21. Use of the matrix to find sites
Sequence: C T A T A A T C
The score matrix of slide 19 is laid over each 6-base window in turn:
window 1-6   CTATAA   sum -93
window 2-7   TATAAT   sum +85
window 3-8   ATAATC   sum -95
Move the matrix along the sequence and score each window. Peaks should occur at the true sites.
Of course in general any threshold will have some false positive and false negative
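
The sliding-window scoring described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the lecture's own code: the score and count tables are copied from the weight matrix slide, while `scan` and `log_odds` are hypothetical helper names. `log_odds` applies 10 log2(f/p) directly to the raw counts with a uniform background of 0.25 and no pseudocounts, so entries recomputed this way are not guaranteed to reproduce every printed score.

```python
import math

# Score matrix as printed on the slide (rows A, C, G, T; columns = positions 1..6).
SCORES = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -40],
    "T": [17, -32, 8, -9, -6, 19],
}

# Counts from the 242 known sigma-70 sites, same layout.
COUNTS = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}


def log_odds(counts, background=0.25, scale=10):
    """Hypothetical helper: scale * log2(f_bl / p_b) from raw counts.

    Assumes a uniform background and no pseudocounts, so every count
    must be non-zero.
    """
    width = len(next(iter(counts.values())))
    totals = [sum(counts[b][l] for b in counts) for l in range(width)]
    return {
        b: [scale * math.log2(counts[b][l] / totals[l] / background)
            for l in range(width)]
        for b in counts
    }


def scan(sequence, scores):
    """Score every window of matrix width; returns one sum per start position."""
    width = len(next(iter(scores.values())))
    return [
        sum(scores[base][l] for l, base in enumerate(sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]


window_scores = scan("CTATAATC", SCORES)
best = max(range(len(window_scores)), key=window_scores.__getitem__)

print(window_scores)  # [-93, 85, -95]
print(best + 1)       # 2: the window TATAAT, which equals the -10 consensus
```

In practice a threshold would be applied to `window_scores` rather than taking only the maximum, and the choice of threshold trades false positives against false negatives.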

