Stat 246, Lecture 3: Biological sequence analysis

The objects of our study

DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units.
DNA: units are the nucleotide residues A, C, G and T.
RNA: units are the nucleotide residues A, C, G and U.
Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

The use of statistics to study linear sequences of biomolecular units

Can be descriptive, predictive or everything else in between: almost business as usual.
Stochastic mechanisms should never be taken literally, but nevertheless can be amazingly useful.
Care is always needed: a model or method can break down at any time without notice.
Biological confirmation of predictions is almost always necessary.

The statistics of biological sequences can be global or local

Base composition of genomes: E. coli: 25% A, 25% C, 25% G, 25% T. P. falciparum: 82% A+T.
Translation initiation: ATG is the near-universal motif indicating the start of translation in DNA coding sequence.

From certainty to statistical models: a brief case study

1ZNF: Cys-Cys-His-His zinc finger DNA binding domain.
Its characteristic motif has regular expression C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX

http://www.isrec.isb-sib.ch/software/PATFND_form.html
Query pattern: c.{2,4}c...[livmfywc]........h.{3,5}h

PatternFind output [ISREC-Server], Date: Wed Aug 22 13:00:41 MET 2001

gp AF234161 7188808 01AEB01ABAC4F945 nuclear protein NP94b [Homo sapiens] Occurences: 2
  Position: 514  CYICKASCSSQQEFQDHMSEPQH
  Position: 606  CTVCNRYFKTPRKFVEHVKSQGH
gp X67787 1326037 02AF953C84E0AB5A zinc finger protein [Saccharomyces cerevisiae] Occurences: 1
  Position: 3    CSFDGCEKVYNRPSLLQQHQNSH

200 matches found (output limit reached).

This search could have been conducted using a suffix tree representation.

Regular expressions can be limiting

5' splice junction in eukaryotes: (C/A)AGGT(A/G)AGT
3' splice junction: (T/C)11 N (C/T) AG

Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites.
Position-specific distributions came to represent the variability in motif composition.

Cys-Cys-His-His profile: sequence logo form

A sequence logo is a scaled position-specific amino acid (a.a.) distribution. Scaling is by a measure of a position's information content.

Sequence logos (T. D. Schneider)

A visual representation of a position-specific distribution.
Easy for nucleotides, but we need colour to depict up to 20 amino acid proportions.
Idea: the overall height at position $l$ is proportional to the information content $2 - H_l$; the proportions of each nucleotide (or amino acid) are in relation to their observed frequency at that position, with the most frequent on top, the next most frequent below, etc.

How do we search with position-specific distributions?

Position-specific scoring matrices

Consensus TATAAT written as a 0/1 matrix:

     T    A    T    A    A    T
A    0    1    0    1    1    0
C    0    0    0    0    0    0
G    0    0    0    0    0    0
T    1    0    1    0    0    1

PSSM (columns are positions 1 to 6):

     1    2    3    4    5    6
A  -38   19    1   12   10  -48
C  -15  -38   -8  -10   -3  -32
G  -13  -48   -6   -7  -10  -48
T   17  -32    8   -9   -6   19

Use of a PSSM to find sites

Move the matrix along the sequence and score each window.
Peaks should occur at the true sites.
Of course, in general any threshold will have some false positive and false negative rate.

Scoring the three windows of the sequence CTATAATC:

Window    Entries summed                     Sum
CTATAA    -15 - 32 +  1 -  9 + 10 - 48       -93
TATAAT     17 + 19 +  8 + 12 + 10 + 19        85
ATAATC    -38 - 32 +  1 + 12 -  6 - 32       -95

Calculation of a PSSM from counts

Counts from 242 known sites:

     1    2    3    4    5    6
A    9  214   63  142  118    8
C   22    7   26   31   52   13
G   18    2   29   38   29    5
T  193   19  124   31   43  216

Relative frequencies $f_{bl}$:

      1     2     3     4     5     6
A  0.04  0.88  0.26  0.59  0.49  0.03
C  0.09  0.03  0.11  0.13  0.22  0.05
G  0.07  0.01  0.12  0.16  0.12  0.02
T  0.80  0.08  0.51  0.13  0.18  0.89

PSSM, $s_{bl} = \log_2 (f_{bl} / p_b)$ with $p_b = 0.25$:

       1      2      3      4      5      6
A  -2.76   1.82   0.06   1.23   0.96  -2.92
C  -1.46  -3.11  -1.22  -1.00  -0.22  -2.21
G  -1.76  -5.00  -1.06  -0.67  -1.06  -3.58
T   1.67  -1.66   1.04  -1.00  -0.49   1.84

Informativeness at position $l$: $2 + \sum_b p_{bl} \log_2 p_{bl}$

[Figure: informativeness plotted against position, positions 1 to 6.]

Derivation of PSSM entries

Suppose that we have aligned sequence data on a number of instances of a given type of site.

Candidate sequence: CTATAATC
Aligned position:   123456

Hypotheses:
S = site (and independence)
R = random (equiprobable, independence)

$\log_2 \dfrac{\mathrm{pr}(\mathrm{CTATAA} \mid S)}{\mathrm{pr}(\mathrm{CTATAA} \mid R)} = \log_2 \dfrac{.09 \times .03 \times .26 \times .13 \times .51 \times .01}{.25 \times .25 \times .25 \times .25 \times .25 \times .25}$

$= (2 + \log_2 .09) + \cdots + (2 + \log_2 .01)$

$= \tfrac{1}{10}\,(-15 - 32 + 1 - 9 + 10 - 48)$

More generally, the PSSM score is $s_{bl} = \log_2 (f_{bl} / p_b)$, where $l$ = position, $b$ = base and $p_b$ = background frequency.

Representation of motifs: further steps

Missing from the position-specific distribution representation of motifs are good ways of dealing with:
length distributions for insertions/deletions;
cross-position association of amino acids.
Hidden Markov models help with the first; the second remains a hard unsolved problem.

Hidden Markov models

Processes $\{(S_t, O_t),\ t = 1, \ldots\}$, where $S_t$ is the hidden state and $O_t$ the observation at time $t$, such that

$\mathrm{pr}(S_t \mid S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, \ldots) = \mathrm{pr}(S_t \mid S_{t-1})$

$\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, \ldots) = \mathrm{pr}(O_t \mid S_t, S_{t-1})$

The basics of HMMs were laid bare in a series of beautiful papers by L. E. Baum and colleagues around 1970, and their formulation has been used almost unchanged to this day.

Hidden Markov models: extensions

Many variants are now used. For example, the distribution of O may not depend on previous S but on previous O values:

$\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, \ldots) = \mathrm{pr}(O_t \mid S_t)$, or

$\mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1}, \ldots) = \mathrm{pr}(O_t \mid S_t, S_{t-1}, O_{t-1})$.

Most importantly for us, the times of S and O may be decoupled, permitting the Observation corresponding to State time t to be a string whose length and composition depend on $S_t$ (and possibly $S_{t-1}$ and part or all of the previous Observations). This is called a hidden semi-Markov or generalized hidden Markov model. …
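
As a small illustration of the zinc finger case study earlier in the lecture, the motif C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H can be translated into an ordinary regular expression and run against the 1ZNF sequence. This is only a sketch: the translation into Python `re` syntax and the names `ZINC_FINGER` and `find_motifs` are assumptions made for illustration, not the PatternFind implementation.

```python
import re

# Motif C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H written in Python regex
# syntax: "x" becomes "." and a repeat range (m,n) becomes {m,n}.
ZINC_FINGER = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")


def find_motifs(sequence):
    """Return (1-based start position, matched string) for each
    non-overlapping match of the motif in the sequence."""
    return [(m.start() + 1, m.group()) for m in ZINC_FINGER.finditer(sequence)]


seq_1znf = "XYKCGLCERSFVEKSALSRHQRVHKNX"
print(find_motifs(seq_1znf))
# -> [(4, 'CGLCERSFVEKSALSRHQRVH')]
```

A match of this kind is all-or-nothing: a site that differs from the pattern at a single position is not reported at all, which is one way to see why the lecture moves on to position-specific distributions.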
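
The "move the matrix along the sequence and score each window" step from the PSSM slides can be sketched in a few lines. The matrix entries below are the integer PSSM from the lecture; their signs are not legible in this transcript and are assumed here, chosen so that the three window sums for CTATAATC agree in magnitude with the 93, 85 and 95 shown on the slide. The names `PSSM`, `WIDTH` and `window_scores` are illustrative.

```python
# Integer PSSM for a 6-base site: one row per base, one column per position.
# Signs are an assumption (see the lead-in); magnitudes are from the lecture.
PSSM = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -48],
    "T": [17, -32, 8, -9, -6, 19],
}
WIDTH = 6


def window_scores(sequence):
    """Slide the matrix along the sequence and return one score per window.

    The score of a window is the sum, over its positions, of the matrix
    entry for the base observed at that position.
    """
    scores = []
    for start in range(len(sequence) - WIDTH + 1):
        window = sequence[start:start + WIDTH]
        scores.append(sum(PSSM[base][pos] for pos, base in enumerate(window)))
    return scores


print(window_scores("CTATAATC"))
# -> [-93, 85, -95]
```

The single high score marks the window TATAAT; in practice a threshold on the score decides which windows are reported, with the false positive and false negative trade-off noted in the lecture.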
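
The calculation of a PSSM from counts can also be sketched directly from the table of 242 known sites. This is a minimal version under stated assumptions: an equiprobable background of 0.25 per base, no pseudocounts (the counts contain no zeros, so none are needed here), and logarithms to base 2. Function names are illustrative. Values computed from the raw counts may differ in the second decimal place from those on the slide.

```python
import math

BASES = "ACGT"

# Counts of each base at positions 1 to 6, from 242 known sites.
COUNTS = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}
BACKGROUND = {b: 0.25 for b in BASES}  # assumed equiprobable background


def relative_frequencies(counts):
    """Convert counts to per-position relative frequencies f_bl."""
    width = len(counts["A"])
    totals = [sum(counts[b][l] for b in BASES) for l in range(width)]
    return {b: [counts[b][l] / totals[l] for l in range(width)] for b in BASES}


def log_odds(freqs, background):
    """PSSM entries s_bl = log2(f_bl / p_b).

    A zero frequency would make log2 fail; a pseudocount would be needed
    in that case.
    """
    return {b: [math.log2(f / background[b]) for f in freqs[b]] for b in BASES}


def informativeness(freqs):
    """Per-position information 2 + sum_b f_bl * log2(f_bl)."""
    width = len(freqs["A"])
    return [
        2 + sum(freqs[b][l] * math.log2(freqs[b][l]) for b in BASES if freqs[b][l] > 0)
        for l in range(width)
    ]


freqs = relative_frequencies(COUNTS)
pssm = log_odds(freqs, BACKGROUND)
info = informativeness(freqs)
for b in BASES:
    print(b, [round(s, 2) for s in pssm[b]])
print("informativeness", [round(i, 2) for i in info])
```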
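
To make the hidden Markov model definition concrete, the sketch below simulates a hidden state path and the observations it emits. It uses the simplest variant mentioned in the lecture, in which the observation depends only on the current state, pr(O_t | S_t). The two states and every probability in the code are hypothetical numbers chosen for illustration; none of them come from the lecture.

```python
import random

# Hypothetical two-state model; all probabilities are illustrative only.
INITIAL = {"background": 0.9, "site": 0.1}
TRANSITION = {
    "background": {"background": 0.95, "site": 0.05},
    "site": {"background": 0.20, "site": 0.80},
}
EMISSION = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "site": {"A": 0.40, "C": 0.05, "G": 0.05, "T": 0.50},
}


def draw(distribution, rng):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def simulate(length, seed=0):
    """Return (hidden states, observed string) of the given length.

    S_t is drawn given S_{t-1} only, and O_t is drawn given S_t only.
    """
    rng = random.Random(seed)
    states, observations = [], []
    state = draw(INITIAL, rng)
    for _ in range(length):
        states.append(state)
        observations.append(draw(EMISSION[state], rng))
        state = draw(TRANSITION[state], rng)
    return states, "".join(observations)


states, observed = simulate(30)
print(observed)
print("".join("s" if s == "site" else "-" for s in states))
```

In an application only the observed string is available; the state path printed on the second line is what has to be inferred.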