Frederick Lo Biochemistry 218 Final Project Paper Three Recently Developed Algorithms for Aligning Distantly Related Proteins Introduction The rate at which new protein sequences are discovered has long outpaced the rate at which those proteins are experimentally assigned functions To speed the process of function assignment protein sequences with unknown functions can be compared to sequences with known functions If two protein sequences are shown to be highly related then it follows that the protein functions may be closely related as well Domingues et al 2000 Early sequence alignment algorithms such as Smith Waterman and BLAST focused on comparing the residue sequences only in making alignments However function may be conserved even in sequences that appear to have diverged Much recent effort has been devoted to developing algorithms that align distantly related proteins also known as remote homologues The three algorithms discussed in this paper consider three separate approaches to this alignment problem The three algorithms are the Structure Dependent Sequence Alignment a hybrid Iterative Parametric approach to suboptimal alignment and the amino acid property based Proximity Correlation Matrix alignment Structure Dependent Sequence Alignment Previous studies have shown that structural motifs in related proteins often remain highly conserved even in the presence of significant divergences in sequence Jaroszewski et al 2000 Therefore integrating structural data generally leads to more accurate protein sequence alignments than using sequence data alone especially for distantly related proteins Structural alignment is frequently cited as the gold standard for sequence alignment Sunyaev et al 2004 Since protein structure and function are closely related it is no surprise that a protein s amino acid sequence is much easier to determine than the protein s structure such that the number of known protein sequences far outstrips the number of known protein structures Jaroszewski et al 2000 Unless the structures of both the query and template proteins are known structural alignments cannot be conducted The Structure Dependent Sequence Alignment or SDSA algorithm attempts to bridge this gap Yang 2002 SDSA is a sequence structure alignment algorithm that is adapted from the Needleman Wunsch global sequence alignment algorithm The Needleman Wunsch algorithm can be expressed with the following equation H ij Dij max H i 1 j 1 max H k j 1 g i j k j 1 k i 2 Na Eq 1 max H i 1 k g i j i 1 k k j 2 Nb D is the sequence sequence amino acid substitution matrix where Dij is the score for substituting the template residue at position j with the query residue at position i H is the scoring matrix where Hij is the maximum score for all sequences rooted at i j Na and Nb are the number of residues found in the query and template sequences respectively The function g i j k j 1 is the penalty for inserting a gap in the query sequence whereas g i j i 1 k is the penalty for deleting a subsequence in the template The formulae for determining g in SDSA will be discussed later The maximal value in H indicates the best alignment and the entire alignment can be obtained by traversing H diagonally downward from left to right where position 1 1 represents the top left corner of H Yang 2002 At the core of SDSA is the construction of a structure based amino acid substitution matrix D Since the structure of the template sequence is known residue j in the template sequence can be identified as belonging to an helix strand or coil region Thus the following conditional equations were developed for D q B fP w Dij q j Aij fP i w j when residue j belongs to an helix Eq 2a or Dij Eq 2b or Dij C ij fP i c j ij i j when residue j belongs to a strand when residue j belongs to a coil region Eq 2c A B and C are amino acid substitution matrices for residues in helices strands and coil regions respectively Yang used a database of protein structure fragments to obtain 170 673 pairs of local structure alignments which consisted of 503 466 1 130 201 and 516 046 pairs of aligned helices strands and coil regions respectively By processing these pairs with the Protein Informatics System for Modeling PrISM 1 structural alignment procedure also developed by Yang and utilizing formulas used by Henikoff and Henikoff 1992 to derive the BLOSUM substitution matrices the A B and C matrices were constructed Yang 2002 The functions Pi x in Equation 2 where x is or c represent the log odds probability that amino acid i will be found in an helix strand or coil region respectively These probabilities were derived from the pairings generated for the A B and C substitution matrices above The general formula for Pi x is Pi x ln ni x ni n x n where ni x is the number of amino acid type i found the sequences of structure type x ni is the number of amino acid type i found in the total set of residues from all pairings n x is the number of residues found in all pairings of type x and n is the total number of residues found in all pairings The numerator ni x ni can be interpreted as the probability that a given amino acid type i will be found in structure type x while the denominator n x n can be interpreted as the probability that a random amino acid will be found in structure type x Yang 2002 Parameters q f and wj in Equation 2 are user specified Parameter q is a measure of whether residue i is exposed to solvent or buried The residue is considered buried if less than 20 of its surface area is exposed to solvent The weighting parameter f for the log odds probability should ideally have a value of 1 but is present to allow flexibility to improve alignment accuracy Parameter f will be determined by using training sets Parameter wj provides extra weight for helix and strand residues to differentiate them from coil regions Although Yang claims that wj can be varied given information about the template structure no explanation was given on how this would be done Parameter wj is generally set to 1 Yang 2002 With the amino acid substitution matrix D now constructed using Equation 2 it can be placed in Equation 1 However the gap insertion and deletion components of Equation 1 also need to be adjusted to take structural considerations into account For gap insertion g i j k j 1 a b k i 2 c j j 1 Eq 3 where the gap is inserted between positions i 1 and k 1 of the query sequence and residues i and k in the query sequence are matched with residues j and j 1 in the
View Full Document
Unlocking...