Berkeley STATISTICS 246 - Biological sequence analysis

Slide titles in this lecture:
- Biological sequence analysis
- The objects of our study
- The use of statistics to study linear sequences of biomolecular units
- The statistics of biological sequences can be global or local
- Cys-Cys-His-His zinc finger DNA binding domain
- Regular expressions can be limiting
- Cys-Cys-His-His profile: sequence logo form
- Sequence logos (T.D. Schneider)
- Position-specific scoring matrices
- Use of a PSSM to find sites
- Calculation of a PSSM from counts
- Derivation of PSSM entries
- Representation of motifs: further steps
- Hidden Markov models
- Hidden Markov models: extensions
- Profile HMM: m = match state, i = insert state, d = delete state; go from left to right. i and m states output amino acids; d states are "silent".
- Pfam domain-HMMs
- What is a (protein-coding) gene?
- What is a gene, ctd?
- Some facts about human genes
- The idea behind a GHMM genefinder
- Half a model for a genefinder
- Splice sites can be included in the exons
- Beyond position-specific distributions
- Remark
- Challenges in the analysis of sequence data
- Topics not mentioned include
- Acknowledgements
- References
- Coiled-coil domains, schematically
- Designing the HMM, I
- Designing the HMM, 2
- Designing the HMM, 3
- HMM: decoding
- CC-PROBABILITY PROFILE
- Assessing performance: terms
- Assessing performance: study design
- Assessing performance: summaries
- The algorithms

Biological sequence analysis
Stat 246, Lecture 3

The objects of our study
DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units.
DNA: units are the nucleotide residues A, C, G and T.
RNA: units are the nucleotide residues A, C, G and U.
Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

The use of statistics to study linear sequences of biomolecular units
Can be descriptive, predictive or everything else in between ... almost business as usual.
Stochastic mechanisms should never be taken literally, but nevertheless can be amazingly useful.
Care is always needed: a model or method can break down at any time without notice.
Biological confirmation of predictions is almost always necessary.

The statistics of biological sequences can be global or local
Base composition of genomes: E. coli: 25% A, 25% C, 25% G, 25% T; P. falciparum: 82% A+T.
Translation initiation: ATG is the near-universal motif indicating the start of translation in DNA coding sequence.

From certainty to statistical models: a brief case study
[Figure: 1ZNF, Cys-Cys-His-His zinc finger DNA binding domain]

Cys-Cys-His-His zinc finger DNA binding domain
Its characteristic motif has regular expression
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX

http://www.isrec.isb-sib.ch/software/PATFND_form.html
Pattern: c.{2,4}c...[livmfywc]........h.{3,5}h

PatternFind output
[ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001
...
gp|AF234161|7188808|01AEB01ABAC4F945 nuclear protein NP94b [Homo sapiens]
Occurences: 2
Position : 514 CYICKASCSSQQEFQDHMSEPQH
Position : 606 CTVCNRYFKTPRKFVEHVKSQGH
........
gp|X67787|1326037|02AF953C84E0AB5A zinc finger protein [Saccharomyces cerevisiae]
Occurences: 1
Position : 3 CSFDGCEKVYNRPSLLQQHQNSH
200 matches found: output limit reached

This search could have been conducted using a suffix tree representation.

Regular expressions can be limiting
[Diagram: consensus sequences for the 5' splice junction in eukaryotes, (C/A)AGGT(A/G)AGT, and for the 3' splice junction, which includes a run of at least 11 pyrimidines (T or C) followed by N and AG.]
Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites. Position-specific distributions came to represent the variability in motif composition.

Cys-Cys-His-His profile: sequence logo form
A sequence logo is a scaled position-specific amino acid distribution. Scaling is by a measure of a position's information content.

Sequence logos (T.D. Schneider)
A visual representation of a position-specific distribution. Easy for nucleotides, but we need colour to depict up to 20 amino acid proportions.
Idea: the overall height at position $l$ is proportional to the information content ($2 - H_l$ for nucleotides, where $H_l$ is the entropy in bits at position $l$); the proportions of each nucleotide (or amino acid) are in relation to their observed frequency at that position, with the most frequent on top, the next most frequent below, etc.
How do we search with position-specific distributions?

Position-specific scoring matrices
Consensus (indicator matrix), reading T A T A A T:
      1    2    3    4    5    6
A     0    1    0    1    1    0
C     0    0    0    0    0    0
G     0    0    0    0    0    0
T     1    0    1    0    0    1

PSSM:
      1    2    3    4    5    6
A   -38   19    1   12   10  -48
C   -15  -38   -8  -10   -3  -32
G   -13  -48   -6   -7  -10  -48
T    17  -32    8   -9   -6   19

Use of a PSSM to find sites
Sequence: C T A T A A T C
Move the matrix along the sequence and score each "window":
Window CTATAA: sum -93
Window TATAAT: sum +85
Window ATAATC: sum -95
Peaks should occur at the "true" sites. Of course, in general any threshold will have some false positive and false negative rate.

Calculation of a PSSM from counts
Counts from 242 known sites:
      1    2    3    4    5    6
A     9  214   63  142  118    8
C    22    7   26   31   52   13
G    18    2   29   38   29    5
T   193   19  124   31   43  216

Relative frequencies: $f_{bl}$
      1     2     3     4     5     6
A   0.04  0.88  0.26  0.59  0.49  0.03
C   0.09  0.03  0.11  0.13  0.22  0.05
G   0.07  0.01  0.12  0.16  0.12  0.02
T   0.80  0.08  0.51  0.13  0.18  0.89

PSSM: $\log (f_{bl}/p_b)$
       1      2      3      4      5      6
A   -2.76   1.82   0.06   1.23   0.96  -2.92
C   -1.46  -3.11  -1.22  -1.00  -0.22  -2.21
G   -1.76  -5.00  -1.06  -0.67  -1.06  -3.58
T    1.67  -1.66   1.04  -1.00  -0.49   1.84

Informativeness: $2 + \sum_b p_{bl} \log_2 p_{bl}$
[Plot: informativeness, on a scale from 0 to 2, at positions 1 to 6]

Derivation of PSSM entries
Candidate sequence: CTATAATC....
Aligned position:   123456
Hypotheses:
S = site (and independence)
R = random (equiprobable, independence)

\[
\log_2 \frac{\mathrm{pr}(\mathrm{CTATAA} \mid S)}{\mathrm{pr}(\mathrm{CTATAA} \mid R)}
= \log_2 \left( \frac{.09 \times .03 \times .26 \times .13 \times .51 \times .01}{.25 \times .25 \times .25 \times .25 \times .25 \times .25} \right)
= (2 + \log_2 .09) + \dots + (2 + \log_2 .01)
\approx \frac{1}{10} \{ -15 - 32 + 1 - 9 + 10 - 48 \}
\]

More generally, the PSSM score is $s_{bl} = \log (f_{bl}/p_b)$, where $l$ = position, $b$ = base and $p_b$ = background frequency.
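The window sums quoted for the sequence CTATAATC (-93, +85, -95) can be reproduced with a few lines of Python. This is a minimal sketch rather than the lecture's own code: the matrix entries are the integer PSSM from the slides, while the dictionary layout and the function name `scan` are assumptions made for illustration.

```python
# Integer PSSM from the slides: one row per base, one column per motif position.
PSSM = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -48],
    "T": [17, -32, 8, -9, -6, 19],
}


def scan(sequence, pssm):
    """Slide the matrix along the sequence and score every full-width window.

    Returns a list of (start_index, window, score) tuples; `scan` is a
    hypothetical helper name, not something defined in the lecture.
    """
    width = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Add up the entry for the observed base at each aligned position.
        score = sum(pssm[base][pos] for pos, base in enumerate(window))
        hits.append((start, window, score))
    return hits


for start, window, score in scan("CTATAATC", PSSM):
    print(start, window, score)
```

The middle window, TATAAT, is the consensus and gives the peak score; in practice a threshold on the score would decide which windows are reported as candidate sites.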

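The step from counts to relative frequencies, log-odds scores and informativeness can be sketched in the same way. The counts below are the 242-site table from the slides; the function names, the use of base-2 logarithms for the scores, and the equiprobable background of 0.25 are assumptions chosen to match the hypothesis R in the derivation. Zero counts would need pseudocounts, which this sketch does not attempt.

```python
import math

BASES = "ACGT"

# Counts from 242 known sites, rows A, C, G, T, columns = positions 1..6.
COUNTS = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}
# Assumed background: all four bases equally likely.
BACKGROUND = {b: 0.25 for b in BASES}


def relative_frequencies(counts):
    """Divide each count by its column total to get f_bl."""
    width = len(counts["A"])
    totals = [sum(counts[b][l] for b in BASES) for l in range(width)]
    return {b: [counts[b][l] / totals[l] for l in range(width)] for b in BASES}


def log_odds(freqs, background):
    """PSSM entries s_bl = log2(f_bl / p_b); requires every f_bl > 0."""
    return {b: [math.log2(f / background[b]) for f in freqs[b]] for b in BASES}


def informativeness(freqs):
    """Per-position value 2 + sum_b f_bl * log2(f_bl), between 0 and 2 bits."""
    width = len(freqs["A"])
    return [
        2 + sum(freqs[b][l] * math.log2(freqs[b][l]) for b in BASES)
        for l in range(width)
    ]


freqs = relative_frequencies(COUNTS)
scores = log_odds(freqs, BACKGROUND)
info = informativeness(freqs)
for b in BASES:
    print(b, [round(s, 2) for s in scores[b]])
print("informativeness", [round(v, 2) for v in info])
```

The computed scores agree with the tabulated log-odds values to within rounding for most entries; small discrepancies are to be expected because the slides show frequencies to two decimals only.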

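The regular-expression search for the Cys-Cys-His-His motif can also be tried locally. The sketch below translates the PROSITE-style pattern from the slides into a Python regular expression and applies it to the 1ZNF sequence; the helper `prosite_to_regex` is hypothetical and handles only the pattern elements that appear in this example (single residues, bracketed residue sets, and `x` with optional repeat counts).

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Hypothetical helper: supports elements such as C, x, x(3), x(2,4)
    and [LIVMFYWC], joined by hyphens.
    """
    element_re = r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?"
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(element_re, element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, low, high = m.groups()
        atom = "." if residue == "x" else residue  # x means any residue
        if low is None:
            parts.append(atom)
        elif high is None:
            parts.append(f"{atom}{{{low}}}")
        else:
            parts.append(f"{atom}{{{low},{high}}}")
    return "".join(parts)


MOTIF = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
ZNF = "XYKCGLCERSFVEKSALSRHQRVHKNX"  # 1ZNF sequence quoted in the slides

regex = prosite_to_regex(MOTIF)
match = re.search(regex, ZNF)
print(regex)
if match:
    # Report a 1-based start position alongside the matched segment.
    print(match.start() + 1, match.group())
```

A pattern of this kind either matches or it does not, which is the limitation the lecture goes on to address with position-specific distributions.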