Stanford CS 262 - Lecture 8 - Pair HMMs for Sequence Alignment - D2110858

School: Stanford University

Course: CS 262 - Computational Genomics

Pages: 12

This preview shows page 1-2-3-4 out of 12 pages.

CS 262, Winter 2007: Computational Genomics
Lecture 8: Pair HMMs for Sequence Alignment (02/01/07)
Scribe: Bahman Bahmani

Substitution of Amino Acids

Different amino acids have different properties that make them more or less likely to replace each other during evolutionary processes such as mutation. This is mainly due to differences (i.e., different levels of similarity) in their structural and chemical properties. For instance, hydrophobic amino acids are likelier to substitute for each other than for hydrophilic amino acids. We would like to capture these likelihoods more rigorously. One of the best-known ways to do so is through the BLOSUM substitution matrices, which we explain below.

Substitution Matrices: BLOSUM

A substitution matrix is a matrix whose entries represent the likelihood that one amino acid replaces another during evolution. There are a number of ways to construct such matrices, resulting in a variety of substitution matrix families, such as PAM and BLOSUM. Here we briefly explain how BLOSUM matrices are constructed.

We start from the BLOCKS database, a database hand-curated by experts that contains blocks of gap-free alignments between protein sequences, which we fully trust. We then cluster the sequences in each block by their percentage of identical residues: if any two sequences have a similarity percentage above some level X, we put them in the same cluster. Next we calculate the frequencies $A_{ab}$ of observing residue a in one cluster aligned against residue b in another cluster, correcting for the sizes of the clusters by weighting each occurrence by $1/(mn)$, where m and n are the sizes of the two clusters. Finally we estimate the probability of observing a residue a, and the probability of substituting residue a with residue b, as follows:

$$P(a) = \frac{\sum_b A_{ab}}{\sum_{c \le d} A_{cd}}, \qquad P(a, b) = \frac{A_{ab}}{\sum_{c \le d} A_{cd}}$$

This gives us a rigorous measure of how much different amino acids like or dislike substituting for each other during evolution. Note that there is actually a family of BLOSUM matrices, each built from sequences with a different level of similarity. For instance, BLOSUM 50 is built from sequences with 50% similarity. The figure below shows two of the matrices of this family.

[Figure: the BLOSUM 50 and BLOSUM 62 matrices]

In the following we present a probabilistic view of the alignment task, which will allow us to use the above substitution probabilities.

Probabilistic Interpretation of an Alignment

So far, the alignment methods we have considered have used rather arbitrary scoring parameters. Here we show how to give a probabilistic interpretation to an alignment task and to its parameters. We know that we can model an alignment using a finite state automaton (FSA). Below we use the same model, but we label the transitions between states with the corresponding probabilities, and the states themselves with the corresponding emission probabilities. This model is called a Pair HMM.

[Figure: Pair HMM with states M, I, and J]
  Emissions:
    M emits the aligned pair $(x_i, y_j)$ with probability $P(x_i, y_j)$
    I emits $x_i$ with probability $P(x_i)$
    J emits $y_j$ with probability $P(y_j)$
  Transitions:
    M -> M: $1 - 2\delta$
    M -> I and M -> J: $\delta$ each
    I -> I and J -> J: $\varepsilon$ each
    I -> M and J -> M: $1 - \varepsilon$ each

Exactly as with the FSA, each alignment between the two sequences corresponds to a path in this model. But this model assigns a probability to each path through its states, so it also assigns a probability to each alignment between our sequences. Thus this model parameterizes a probability distribution over all alignments of any two sequences.

The emission probabilities in the above model reflect the substitution frequencies between pairs of amino acids (for state M) and the frequencies of each amino acid (for states I and J), so they can be taken from the BLOSUM construction explained above. Also note that in this model the average length of a run of matches/mismatches (state M) is $1/(2\delta)$, and the average length of a gap (states I and J) is $1/(1-\varepsilon)$. So, knowing these average lengths, we can set the corresponding transition probabilities $\delta$ and $\varepsilon$ accordingly. Note also that $\varepsilon$ is often larger than $\delta$, which shows that it is harder to open a gap than to continue it.

It is easy to see that, in terms of the probability distribution generated over the set of alignments between two sequences, the above model is equivalent to the following model:

[Figure: rescaled three-state model]
  Emissions: as in the Pair HMM above
  Transitions:
    M -> M, I -> M, and J -> M: $1 - 2\delta$ each
    M -> I and M -> J: $\delta(1-\varepsilon)/(1-2\delta)$ each
    I -> I and J -> J: $\varepsilon$ each

For example, a gap of length k that returns to M has total transition weight $\delta\,\varepsilon^{k-1}(1-\varepsilon)$ in both models. Note that this latter model is no longer a true HMM, because the transition probabilities out of each state no longer sum to one; but as long as we are only concerned with the probabilities of paths through its states, it is exactly equivalent to the HMM explained previously.

Now we consider another model with which to contrast the alignment model above. In contrast to the alignment model, which assumes that the two sequences evolved from each other and hence align to each other, this model assumes that the two sequences were generated completely independently of each other. Calling this model R, we have

$$P(x, y \mid R) = P(x_1) \cdots P(x_m)\, P(y_1) \cdots P(y_n) = \prod_i P(x_i) \prod_j P(y_j)$$

This model can also be shown graphically, using two disjoint finite state automata:

[Figure: random model R] State I emits $x_i$ with probability $P(x_i)$ and state J emits $y_j$ with probability $P(y_j)$; each state has only a self-transition, with probability 1.

Now, if we divide the probability assigned to each alignment between the two given sequences by the probability of generating those sequences under the random model above, we arrive at another model:

[Figure: three-state model divided by the random model]
  Emissions:
    M: $P(x_i, y_j) / (P(x_i) P(y_j))$
    I: 1
    J: 1
  Transitions:
    M -> M, I -> M, and J -> M: $1 - 2\delta$ each
    M -> I and M -> J: $\delta(1-\varepsilon)/(1-2\delta)$ each
    I -> I and J -> J: $\varepsilon$ each

In this latter model, every match (state M) contributes a factor of $(1-2\delta)\, P(x_i, y_j) / (P(x_i) P(y_j))$ to the score of a path. Every gap-open event (i.e., a transition from M to I or J) contributes $\delta(1-\varepsilon)/(1-2\delta)$, and every gap-extension event (i.e., a transition from I or J to itself) contributes $\varepsilon$. So, taking logarithms of these contributions, we define the substitution score, gap initiation penalty, and gap extension penalty, respectively, as follows:

$$s(x_i, y_j) = \log \frac{P(x_i, y_j)}{P(x_i)\, P(y_j)} + \log(1 - 2\delta)$$

$$d = -\log \frac{\delta (1-\varepsilon)\, P(x_i)}{(1-2\delta)\, P(x_i)} = -\log \frac{\delta (1-\varepsilon)}{1-2\delta}$$

$$e = -\log \frac{\varepsilon\, P(x_i)}{P(x_i)} = -\log \varepsilon$$

With these scores and penalties, the alignment maximizing the resulting Needleman-Wunsch score is exactly the most likely alignment in our original HMM. But to find the most likely alignment, i.e., the path through the states with maximal probability, we can use the Viterbi algorithm. Hence the Viterbi algorithm for Pair HMMs is exactly finding the best global alignment with affine gaps.

Finally, it should be mentioned that the substitution scores defined above are very nearly the same as those obtained from BLOSUM matrices: s(a, b) = log( P(a, b) / P …
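
The definitions of the substitution score s, the gap initiation penalty d, and the gap extension penalty e can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pair_hmm_scores`, the dictionary layout, and the two-letter alphabet with its probabilities are hypothetical, and `delta` and `epsilon` stand for the gap-open and gap-extension transition probabilities of the Pair HMM.

```python
import math


def pair_hmm_scores(delta, epsilon, p_joint, p_single):
    """Convert Pair HMM parameters into log-odds alignment scores.

    delta    : gap-open transition probability (M -> I or M -> J)
    epsilon  : gap-extension transition probability (I -> I or J -> J)
    p_joint  : dict mapping (a, b) -> P(a, b), the emission of state M
    p_single : dict mapping a -> P(a), the emission of states I and J
    """
    if not (0 < delta < 0.5 and 0 < epsilon < 1):
        raise ValueError("need 0 < delta < 0.5 and 0 < epsilon < 1")

    log_enter_match = math.log(1 - 2 * delta)

    # s(a, b) = log[P(a, b) / (P(a) P(b))] + log(1 - 2*delta)
    s = {
        (a, b): math.log(p_ab / (p_single[a] * p_single[b])) + log_enter_match
        for (a, b), p_ab in p_joint.items()
    }
    # d = -log[delta * (1 - epsilon) / (1 - 2*delta)]
    d = -math.log(delta * (1 - epsilon) / (1 - 2 * delta))
    # e = -log(epsilon)
    e = -math.log(epsilon)
    return s, d, e


# Hypothetical two-letter alphabet, for illustration only (not real data).
p_single = {"A": 0.5, "B": 0.5}
p_joint = {("A", "A"): 0.4, ("A", "B"): 0.1,
           ("B", "A"): 0.1, ("B", "B"): 0.4}

s, d, e = pair_hmm_scores(0.1, 0.5, p_joint, p_single)
```

In this toy setting the identical pairs receive positive scores and the mismatched pairs negative ones, and d exceeds e because opening a gap is less probable than extending one.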

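The statement that Viterbi decoding on the Pair HMM amounts to global alignment with affine gaps can be illustrated with a three-matrix dynamic program, one matrix per state. The sketch below is a minimal version under several assumptions: it returns only the best score (no traceback), it allows no direct transitions between I and J, it starts as if in state M, and the names `viterbi_affine` and `match_mismatch` as well as the toy scoring values are hypothetical.

```python
NEG_INF = float("-inf")


def viterbi_affine(x, y, s, d, e):
    """Best global alignment score for the three-state model (M, I, J).

    s(a, b) is the substitution score; d and e are the (positive)
    gap-open and gap-extension penalties. A gap of length k costs
    d + (k - 1) * e.
    """
    m, n = len(x), len(y)
    # VM, VI, VJ[i][j]: best score of a path that has consumed x[:i] and
    # y[:j] and currently sits in state M, I, or J respectively.
    VM = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VJ = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VM[0][0] = 0.0  # assumption: start in M before emitting anything

    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                # M emits the pair (x_i, y_j); any state may precede it.
                VM[i][j] = s(x[i - 1], y[j - 1]) + max(
                    VM[i - 1][j - 1], VI[i - 1][j - 1], VJ[i - 1][j - 1]
                )
            if i > 0:
                # I emits x_i alone: open from M or extend from I.
                VI[i][j] = max(VM[i - 1][j] - d, VI[i - 1][j] - e)
            if j > 0:
                # J emits y_j alone: open from M or extend from J.
                VJ[i][j] = max(VM[i][j - 1] - d, VJ[i][j - 1] - e)

    return max(VM[m][n], VI[m][n], VJ[m][n])


def match_mismatch(a, b):
    """Hypothetical toy scoring function: +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


# One match plus a gap of length 2: 1 - d - e = 1 - 2 - 1
print(viterbi_affine("AAA", "A", match_mismatch, d=2, e=1))  # prints -2.0
```

Working with log-odds scores instead of raw probabilities turns the products along a path into sums, which is why the recurrences look the same as the Needleman-Wunsch recurrences with affine gaps.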