Berkeley STATISTICS 246 - Biological Sequence Analysis - D2883961



School name University of California, Berkeley

Course Statistics 246- Statistical Genetics

Pages 66



Biological Sequence Analysis
Lecture 26, Statistics 246, April 27, 2004

Synopsis
- Some biological background
- A progression of models
- Acknowledgements
- References

The objects of our study
DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units.
- DNA: units are the nucleotide residues A, C, G and T
- RNA: units are the nucleotide residues A, C, G and U
- Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y
To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

The central dogma
DNA       CCTGAGCCAACTATTGATGAA
            (transcription)
mRNA      CCUGAGCCAACUAUUGAUGAA
            (translation)
Protein   PEPTIDE

A protein-coding gene
[Figure not preserved in the text preview.]

Motifs, Sites, Signals, Domains
For this lecture, I'll use these terms interchangeably to describe recurring elements of interest to us.
In PROTEINS we have: transmembrane domains, coiled-coil domains, EGF-like domains, signal peptides, phosphorylation sites, antigenic determinants.
In DNA / RNA we have: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres.

Motifs and models
- Motifs typically represent regions of structural significance with specific biological function.
- They are generalisations from known examples.
- The models can be highly specific.
- Multiple models can be used to give higher sensitivity and specificity in their detection.
- Models can sometimes be generated automatically from examples or multiple alignments.

The use of stochastic models for motifs
- Can be descriptive, predictive or everything else in between: almost business as usual.
- However, stochastic mechanisms should never be taken literally, but nevertheless they can be amazingly useful.
- Care is always needed: a model or method can break down at any time without notice.
- Biological confirmation of predictions is almost always necessary.

Transcription initiation in E. coli
RNA polymerase-promoter interactions. In E. coli, transcription is initiated at the promoter, whose sequence is recognised by the sigma factor of RNA polymerase.

Transcription initiation in E. coli, cont.
YKFSTYATWWIRQAITR

Determinism 1: consensus sequences
Factor    Promoter consensus sequence
          -35        -10
σ70       TTGACA     TATAAT
σ28       CTAAA      CCGATAT
Similarly for σ32, σ38 and σ54.
Consensus sequences have the obvious limitation: there is usually some deviation from them.

The human transcription factor Sp1 has 3 Cys-Cys-His-His zinc finger DNA-binding domains.

Determinism 2: regular expressions
The characteristic motif of a Cys-Cys-His-His zinc finger DNA-binding domain has the regular expression
C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
Here, as in algebra, X is unknown. The 29 a.a. sequence of our example domain 1SP1 is as follows, clearly fitting the model:
1SP1   KKFACPECPKRFMRSDHLSKHIKTHQNKK

Prosite patterns
An early effort at collecting descriptors for functionally important protein motifs. They do not attempt to describe a complete domain or protein, but simply try to identify the most important residue combinations, such as the catalytic site of an enzyme. They use regular expression syntax, and focus on the most highly conserved residues in a protein family.
http://au.expasy.org

More on Prosite patterns
This pattern, which must be in the N-terminal of the sequence ('<'), means:
<A-x-[ST](2)-x(0,1)-V-{LI}
Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val-(any but Leu, Ile)

Searching with regular expressions
http://www.isrec.isb-sib.ch/software/PATFND_form.html
Pattern: c.{2,4}c...[livmfywc]........h.{3,5}h
PatternFind output, ISREC Server, Date: Wed Aug 22 13:00:41 MET 2001
gp AF234161 7188808 01AEB01ABAC4F945 nuclear protein NP94b (Homo sapiens)
Occurences: 2
Position 514   CYICKASCSSQQEFQDHMSEPQH
Position 606   CTVCNRYFKTPRKFVEHVKSQGH

Regular expressions can be limiting
- The regular expression syntax is still too rigid to represent many highly divergent protein motifs.
- Also, short patterns are sometimes insufficient with today's large databases. Even requiring perfect matches, you might find many false positives. On the other hand, some real sites might not be perfect matches.
- We need to go beyond apparently equally likely alternatives, and ranges for gaps. We deal with the former first, having a distribution at each position.

Cys-Cys-His-His profile: sequence logo form
[Figure not preserved in the text preview.]
A sequence logo is a scaled position-specific a.a. distribution. Scaling is by a measure of a position's information content.

Weight matrix model (WMM) = stochastic consensus sequence

Counts from 242 known σ70 sites:
Position     1     2     3     4     5     6
A            9   214    63   142   118     8
C           22     7    26    31    52    13
G           18     2    29    38    29     5
T          193    19   124    31    43   216

Relative frequencies $f_{bl}$:
Position     1     2     3     4     5     6
A         0.04  0.88  0.26  0.59  0.49  0.03
C         0.09  0.03  0.11  0.13  0.21  0.05
G         0.07  0.01  0.12  0.16  0.12  0.02
T         0.80  0.08  0.51  0.13  0.18  0.89

Scores $10 \log_2 (f_{bl} / p_b)$:
Position     1     2     3     4     5     6
A          -38    19     1    12    10   -48
C          -15   -38    -8   -10    -3   -32
G          -13   -48    -6    -7   -10   -40
T           17   -32     8    -9    -6    19

Informativeness of position l: $2 + \sum_b f_{bl} \log_2 f_{bl}$

Interpretation of weight matrix entries
Candidate sequence: CTATAATC, with CTATAA aligned to positions 1-6.
Hypotheses:
S = site (and independence)
R = random: equiprobable, independence

$$\log_2 \frac{\mathrm{pr}(\mathrm{CTATAA} \mid S)}{\mathrm{pr}(\mathrm{CTATAA} \mid R)} = \log_2 \frac{.09 \times .08 \times .26 \times .13 \times .51 \times .01}{.25 \times .25 \times .25 \times .25 \times .25 \times .25} = (2 + \log_2 .09) + \dots + (2 + \log_2 .01) = \frac{1}{10}\,(-15 - 32 + 1 - 9 + 10 - 48)$$

Generally, the score is $s_{bl} = \log (f_{bl} / p_b)$, where l = position, b = base and $p_b$ = background frequency.

Use of the matrix to find sites
Move the matrix along the sequence and score each window. Peaks should occur at the true sites. Of course, in general any threshold will have some false positive and false negative rate.
Sliding the score matrix along the candidate sequence CTATAATC gives these window sums:
Window     Sum
CTATAA     -93
TATAAT      85
ATAATC     -95

Profiles
- Are a variation of the position-specific scoring matrix approach just described.
- Profiles are calculated slightly differently, to reflect amino acid substitutions and the possibility of gaps, but are used in the same way.
- In general, a profile entry $M_{la}$ for location l and amino acid a is calculated by $M_{la} = \sum_b w_{lb} S_{ab}$, where b ranges over amino acids, $w_{lb}$ is a weight (e.g. the observed frequency of a.a. b in position l) and $S_{ab}$ is the (a, b) entry of a substitution matrix (e.g. PAM or BLOSUM) calculated as a likelihood ratio.
- Position-specific gap penalties can also be included.

Derivation of a profile for Ig domains
PileUp of home ucsb00 George WAG pileup 8391 8424
Symbol comparison table: GenRunData pileuppep cmp   CompCheck: 1254
GapWeight: 3.000   GapLengthWeight: 0.100
pileup …
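The regular-expression slides can be tried out directly. What follows is a minimal sketch, not the Prosite or PatternFind implementation: the helper name `prosite_to_regex` is hypothetical, and it assumes only the pattern elements that appear in this lecture (single residues, x, [..], {..}, repeat counts and the '<' and '>' anchors).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; covers only the syntax used in the lecture.
    """
    pieces = []
    for element in pattern.split("-"):
        # '<' anchors the match at the N-terminus, '>' at the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split the residue part from an optional repeat count, e.g. x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core.lower() == "x":        # x: any residue
            core = "."
        elif core.startswith("{"):     # {LI}: any residue except L or I
            core = "[^" + core[1:-1] + "]"
        # [ST] and single letters are already valid regex syntax.
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        pieces.append(prefix + core + repeat + suffix)
    return "".join(pieces)

# The Cys-Cys-His-His zinc finger pattern and the 1SP1 domain from the lecture.
ZINC_FINGER = prosite_to_regex("C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H")
SP1 = "KKFACPECPKRFMRSDHLSKHIKTHQNKK"

match = re.search(ZINC_FINGER, SP1)
print(ZINC_FINGER)      # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H
print(match.group(0))   # CPECPKRFMRSDHLSKHIKTH
```

The same regular expression also accepts both hits reported in the PatternFind output, and the N-terminal pattern `<A-x-[ST](2)-x(0,1)-V-{LI}` translates to `^A.[ST]{2}.{0,1}V[^LI]`.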
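The weight matrix slides can be followed numerically with a short sketch. The counts and the score matrix are copied from the slides, with signs on the scores taken to be negative wherever the relative frequency is below the background 0.25. The function names are hypothetical, and the informativeness is computed from the relative frequencies.

```python
import math

BASES = "ACGT"

# Counts from the 242 known sigma-70 sites (positions 1..6 of the -10 region).
COUNTS = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}

# Score matrix as given on the slide, in units of 10 * log2(f_bl / p_b).
SCORES = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -40],
    "T": [17, -32, 8, -9, -6, 19],
}

def relative_frequencies(counts):
    """Divide each count by its column total to get f_bl."""
    n_pos = len(counts["A"])
    totals = [sum(counts[b][l] for b in BASES) for l in range(n_pos)]
    return {b: [counts[b][l] / totals[l] for l in range(n_pos)] for b in BASES}

def informativeness(freqs, l):
    """2 + sum_b f_bl * log2(f_bl) for position index l (0-based)."""
    return 2 + sum(freqs[b][l] * math.log2(freqs[b][l])
                   for b in BASES if freqs[b][l] > 0)

def scan(sequence, scores):
    """Slide the matrix along the sequence; return (position, window, sum)."""
    width = len(scores["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(scores[base][l] for l, base in enumerate(window))
        hits.append((start + 1, window, total))  # 1-based start position
    return hits

FREQS = relative_frequencies(COUNTS)
for position, window, total in scan("CTATAATC", SCORES):
    print(position, window, total)
# 1 CTATAA -93
# 2 TATAAT 85
# 3 ATAATC -95
```

Only the window that lines up with the consensus TATAAT gets a positive score, which is the peak the scanning slide describes.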
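The profile formula $M_{la} = \sum_b w_{lb} S_{ab}$ can be illustrated with a toy calculation. Everything in this sketch is an assumption made for illustration: the three-letter alphabet, the substitution scores in `TOY_S`, the four-sequence alignment and the function names. It takes $w_{lb}$ to be the observed frequency of residue b in column l and ignores gaps.

```python
from collections import Counter

# TOY substitution scores for a three-letter alphabet (illustrative assumption,
# standing in for a full PAM or BLOSUM matrix).
TOY_S = {
    "I": {"I": 4, "L": 2, "K": -3},
    "L": {"I": 2, "L": 4, "K": -2},
    "K": {"I": -3, "L": -2, "K": 5},
}

def column_weights(alignment, l):
    """w_lb: observed frequency of each residue b in column l."""
    column = [seq[l] for seq in alignment]
    counts = Counter(column)
    return {b: n / len(column) for b, n in counts.items()}

def profile(alignment, S):
    """Return a list of rows, one per column l, with M[l][a] = sum_b w_lb * S[a][b]."""
    rows = []
    for l in range(len(alignment[0])):
        w = column_weights(alignment, l)
        rows.append({a: sum(w_lb * S[a][b] for b, w_lb in w.items()) for a in S})
    return rows

# Hypothetical gap-free alignment of four two-residue sequences.
M = profile(["IK", "IK", "LK", "IK"], TOY_S)
print(M[0])  # {'I': 3.5, 'L': 2.5, 'K': -2.75}
print(M[1])  # {'I': -3.0, 'L': -2.0, 'K': 5.0}
```

In column 1 the profile gives L a positive score although it is seen only once, because the substitution scores treat I and L as similar. A plain frequency-based weight matrix would not share information between residues in this way.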
