Berkeley STATISTICS 246 - Biological sequence analysis

This preview shows page 1-2-3-4-25-26-27-51-52-53-54 out of 54 pages.

Biological sequence analysis
Stat 246, Lecture 3

The objects of our study

- DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units.
- DNA: units are the nucleotide residues A, C, G and T.
- RNA: units are the nucleotide residues A, C, G and U.
- Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
- To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

The use of statistics to study linear sequences of biomolecular units

- Can be descriptive, predictive or everything else in between ... almost business as usual.
- Stochastic mechanisms should never be taken literally, but nevertheless can be amazingly useful.
- Care is always needed: a model or method can break down at any time without notice.
- Biological confirmation of predictions is almost always necessary.

The statistics of biological sequences can be global or local

Base composition of genomes:
    E. coli: 25% A, 25% C, 25% G, 25% T
    P. falciparum: 82% A+T

Translation initiation:
    ATG is the near universal motif indicating the start of translation in DNA coding sequence.

From certainty to statistical models: a brief case study

[Figure: 1ZNF, a Cys-Cys-His-His zinc finger DNA binding domain]

Cys-Cys-His-His zinc finger DNA binding domain

- Its characteristic motif has regular expression C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
- 1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX

PatternFind output

Server: http://www.isrec.isb-sib.ch/software/PATFND_form.html
Query: c.{2,4}c...[livmfywc]........h.{3,5}h

    [ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001
    ...
    gp|AF234161|7188808|01AEB01ABAC4F945 nuclear protein NP94b [Homo sapiens] Occurences: 2
    Position : 514 CYICKASCSSQQEFQDHMSEPQH
    Position : 606 CTVCNRYFKTPRKFVEHVKSQGH
    ........
    gp|X67787|1326037|02AF953C84E0AB5A zinc finger protein [Saccharomyces cerevisiae] Occurences: 1
    Position : 3 CSFDGCEKVYNRPSLLQQHQNSH
    200 matches found: output limit reached

This search could have been conducted using a suffix tree representation.

Regular expressions can be limiting

[Figure: consensus sequences at the 5' splice junction in eukaryotes and at the 3' splice junction, the latter including a run of at least 11 T/C residues.]

Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites.

Position-specific distributions came to represent the variability in motif composition.

Cys-Cys-His-His profile: sequence logo form

[Figure: sequence logo for the Cys-Cys-His-His profile]

A sequence logo is a scaled position-specific amino acid distribution. Scaling is by a measure of a position's information content.

Sequence logos (T.D. Schneider)

A visual representation of a position-specific distribution. Easy for nucleotides, but we need colour to depict up to 20 amino acid proportions.

Idea: the overall height at position $l$ is proportional to the information content $(2 - H_l)$; the proportions of each nucleotide (or amino acid) are in relation to their observed frequency at that position, with the most frequent on top, the next most frequent below, etc.

How do we search with position-specific distributions?

Position-specific scoring matrices

PSSM:

         1    2    3    4    5    6
    A  -38   19    1   12   10  -48
    C  -15  -38   -8  -10   -3  -32
    G  -13  -48   -6   -7  -10  -48
    T   17  -32    8   -9   -6   19

Consensus (T A T A A T):

         1    2    3    4    5    6
    A    0    1    0    1    1    0
    C    0    0    0    0    0    0
    G    0    0    0    0    0    0
    T    1    0    1    0    0    1

Use of a PSSM to find sites

Sequence: C T A T A A T C

    Window    Sum
    CTATAA    -93
    TATAAT    +85
    ATAATC    -95

Move the matrix along the sequence and score each "window". Peaks should occur at the "true" sites. Of course, in general any threshold will have some false positive and false negative rate.

Calculation of a PSSM from counts

Counts from 242 known sites:

          1     2     3     4     5     6
    A     9   214    63   142   118     8
    C    22     7    26    31    52    13
    G    18     2    29    38    29     5
    T   193    19   124    31    43   216

Relative frequencies $f_{bl}$:

          1     2     3     4     5     6
    A  0.04  0.88  0.26  0.59  0.49  0.03
    C  0.09  0.03  0.11  0.13  0.22  0.05
    G  0.07  0.01  0.12  0.16  0.12  0.02
    T  0.80  0.08  0.51  0.13  0.18  0.89

PSSM, $\log_2 (f_{bl} / p_b)$:

           1      2      3      4      5      6
    A  -2.76   1.82   0.06   1.23   0.96  -2.92
    C  -1.46  -3.11  -1.22  -1.00  -0.22  -2.21
    G  -1.76  -5.00  -1.06  -0.67  -1.06  -3.58
    T   1.67  -1.66   1.04  -1.00  -0.49   1.84

Informativeness: $2 + \sum_b p_{bl} \log_2 p_{bl}$

[Figure: sequence logo for positions 1 to 6, vertical axis from 0 to 2]

Derivation of PSSM entries

Suppose that we have aligned sequence data on a number of instances of a given type of site.

Candidate sequence: CTATAATC....
Aligned position:   123456

Hypotheses:
    S = site (and independence)
    R = random (equiprobable, independence)

    $$\log_2 \frac{\mathrm{pr}(\mathrm{CTATAA} \mid S)}{\mathrm{pr}(\mathrm{CTATAA} \mid R)} = \log_2 \frac{.09 \times .03 \times .26 \times .13 \times .51 \times .01}{.25 \times .25 \times .25 \times .25 \times .25 \times .25} = (2 + \log_2 .09) + \dots + (2 + \log_2 .01) = \frac{1}{10}\{-15 - 32 + 1 - 9 + 10 - 48\}$$

More generally, the PSSM score is $s_{bl} = \log_2 (f_{bl} / p_b)$, where $l$ = position, $b$ = base and $p_b$ = background frequency.

Representation of motifs: further steps

Missing from the position-specific distribution representation of motifs are good ways of dealing with:

- Length distributions for insertions/deletions
- Cross-position association of amino acids

Hidden Markov models help with the first. The second remains a hard unsolved problem.

Hidden Markov models

Processes $\{(S_t, O_t),\ t = 1, \dots\}$, where $S_t$ is the hidden state and $O_t$ the observation at time $t$, such that

    $$\mathrm{pr}(S_t \mid S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, \dots) = \mathrm{pr}(S_t \mid S_{t-1})$$
    $$\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, \dots) = \mathrm{pr}(O_t \mid S_t, S_{t-1})$$

The basics of HMMs were laid bare in a series of beautiful papers by L E Baum and colleagues around 1970, and their formulation has been used almost unchanged to this day.

Hidden Markov models: extensions

Many variants are now used. For example, the distribution of O may not depend on previous S but on previous O values,

    $$\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, \dots) = \mathrm{pr}(O_t \mid S_t),$$

or

    $$\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, \dots) = \mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}).$$

Most importantly for us, the times of S and O may be decoupled, permitting the Observation corresponding to State time $t$ to be a string whose length and composition depend on $S_t$ (and possibly $S_{t-1}$ and part or all of the previous Observations). This is called a hidden semi-Markov or generalized hidden
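The window-scoring procedure described under "Use of a PSSM to find sites" can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the names `scan` and `log_odds_from_counts` are hypothetical, the two matrices are copied from the slides, and the log-odds conversion assumes a uniform background of 0.25 with no pseudocounts.

```python
import math

BASES = "ACGT"

# Integer PSSM from the slide: one row per base, one column per motif position.
PSSM = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -48],
    "T": [17, -32, 8, -9, -6, 19],
}

# Counts from the 242 known sites, in the same layout.
COUNTS = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}


def scan(sequence, pssm):
    """Slide the matrix along the sequence; return (start, score) per window."""
    width = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[base][pos] for pos, base in enumerate(window))
        hits.append((start, score))
    return hits


def log_odds_from_counts(counts, background=0.25):
    """Compute log2(f_bl / p_b) per base and position.

    Assumption: uniform background and no pseudocounts, so every count
    must be positive (true for the table above).
    """
    width = len(next(iter(counts.values())))
    totals = [sum(counts[b][l] for b in BASES) for l in range(width)]
    return {
        b: [math.log2(counts[b][l] / totals[l] / background) for l in range(width)]
        for b in BASES
    }


window_scores = scan("CTATAATC", PSSM)   # 0-based window starts
log_odds = log_odds_from_counts(COUNTS)
```

Scanning CTATAATC this way gives window sums of -93, 85 and -95, the values shown on the slide, so the peak falls on TATAAT. The log-odds computed from the counts come out close to the slide's PSSM table, though not identical in every entry.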

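The informativeness measure used to scale sequence logos, 2 + sum over b of p_bl log2 p_bl, can be computed from the table of counts at 242 known sites. The sketch below is illustrative only: the function name `information_content` is hypothetical, p_bl is estimated by the observed relative frequency (an assumption), and the constant 2 is log2 of the four-letter nucleotide alphabet.

```python
import math

# Counts from the 242 known sites: one row per base, one column per position.
COUNTS = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}


def information_content(counts):
    """Return 2 - H_l for each position l, where H_l is the entropy in bits."""
    bases = list(counts)
    width = len(counts[bases[0]])
    info = []
    for l in range(width):
        total = sum(counts[b][l] for b in bases)
        entropy = 0.0
        for b in bases:
            f = counts[b][l] / total
            if f > 0:  # treat 0 * log2(0) as 0
                entropy -= f * math.log2(f)
        info.append(2.0 - entropy)  # 2 = log2(4) for nucleotides
    return info


info = information_content(COUNTS)
```

Each value lies between 0 (all four bases equally frequent) and 2 (a single base at that position), which is the overall height a logo would give the column.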

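The zinc finger case study searches with a PROSITE-style pattern rewritten as a regular expression. A small Python sketch of that translation and search is given here under stated assumptions: `prosite_to_regex` and `find_motif` are hypothetical helper names, and the converter handles only the pattern syntax that appears in this lecture (`-` separators, `x` wildcards, `(n)` and `(n,m)` repeats, and bracketed residue classes).

```python
import re


def prosite_to_regex(pattern):
    """Translate the small PROSITE subset used here into a Python regex."""
    regex = pattern.replace("-", "")        # drop element separators
    regex = regex.replace("x", ".")         # x means any residue
    regex = regex.replace("(", "{").replace(")", "}")  # repeat counts
    return regex


ZNF_PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
ZNF_REGEX = re.compile(prosite_to_regex(ZNF_PATTERN))

# 1ZNF sequence as given in the lecture.
SEQ_1ZNF = "XYKCGLCERSFVEKSALSRHQRVHKNX"


def find_motif(sequence):
    """Return (1-based start, matched text) for the first match, or None."""
    match = ZNF_REGEX.search(sequence)
    if match is None:
        return None
    return match.start() + 1, match.group()


hit = find_motif(SEQ_1ZNF)
```

On the 1ZNF sequence the match starts at residue 4 and covers CGLCERSFVEKSALSRHQRVH. The three hit strings quoted in the PatternFind output also match from their first residue.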