Stanford CS 262 - Pair HMMs and Protein Alignment



CS 262: Computational Genomics
Professor Serafim Batzoglou
Lecture 8: Pair HMMs and Protein Alignment
Scribed on February 2, 2006 by Ben Handy

Finite State Automaton for Alignment

We can create a state model that generates an alignment between two sequences x and y. The model has three states, M, I, and J, and each state adds +1 to the position in x, in y, or in both. There is a one-to-one correspondence between alignments and strings composed of M, I, and J. Each such string is also a path through the automaton, subject to the constraint that the +1 terms for x sum to the length of sequence x and the +1 terms for y sum to the length of sequence y.

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

To simplify our model, we add the restriction that gaps in x cannot immediately follow gaps in y, and vice versa. This is a design decision; the model works just as well if such transitions are allowed.

We have labeled every transition in the model with a score. Transitions into state M indicate letter-to-letter correspondences, so they are labeled with s(xi, yj), the substitution score for replacing xi with yj. We know which i and j to use from the running sums of +1's for x and y. Every transition from M to a gap state (I or J) is labeled with the gap initiation penalty -d, and every transition from a gap state to itself is labeled with the gap extension penalty -e.

Every path through this model corresponds to an alignment. Summing the scores on the transitions of a path gives the same score that global alignment by dynamic programming with affine gap penalties assigns to that alignment.

Dynamic Programming Review

We defined three matrices for our dynamic programming, one for each of our states M, I, and J.
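Before going through the matrices, the claim that a path's transition scores add up to an affine-gap alignment score can be made concrete with a small sketch. The function name `score_path`, the match/mismatch substitution scores (+1/-1), and the default penalties d = 3 and e = 1 are illustrative assumptions, not values from the lecture. The sketch also assumes that I consumes a letter of x and J consumes a letter of y (matching the I and J recurrences in this review), and that the start of the path behaves like state M.

```python
def score_path(path, x, y, d=3.0, e=1.0, match=1.0, mismatch=-1.0):
    """Sum the transition scores along a state path made of 'M', 'I', 'J'.

    Assumptions (hypothetical, for illustration only):
      - s(a, b) is +match if a == b, else +mismatch
      - 'I' consumes one letter of x, 'J' consumes one letter of y
      - the path starts as if the previous state were M
    """
    i = j = 0        # letters of x and y consumed so far (the +1 sums)
    prev = "M"       # start behaves like M, so a leading gap costs d
    total = 0.0
    for state in path:
        if state == "M":
            if i >= len(x) or j >= len(y):
                raise ValueError("path consumes more letters than available")
            total += match if x[i] == y[j] else mismatch
            i += 1
            j += 1
        elif state in ("I", "J"):
            if prev == state:
                total -= e      # gap state to itself: extension penalty
            elif prev == "M":
                total -= d      # M to gap state: initiation penalty
            else:
                # I -> J and J -> I are excluded by the simplified model
                raise ValueError("transition between I and J is not allowed")
            if state == "I":
                i += 1
            else:
                j += 1
        else:
            raise ValueError(f"unknown state {state!r}")
        prev = state
    if i != len(x) or j != len(y):
        raise ValueError("path does not consume both sequences exactly")
    return total


print(score_path("MIMM", "ACGT", "AGT"))  # 1 - 3 + 1 + 1 = 0.0
```

The final length check enforces the constraint that the +1 terms for x and y sum to the lengths of the two sequences.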
M(i, j): score of the optimal alignment of x1…xi to y1…yj ending in M
I(i, j): score of the optimal alignment of x1…xi to y1…yj ending in I
J(i, j): score of the optimal alignment of x1…xi to y1…yj ending in J

And our standard dynamic programming recurrences:

Initialization:
  M(0, 0) = 0
  M(i, 0) = M(0, j) = -infinity, for i, j > 0
  I(i, 0) = -d - (i - 1) * e
  J(0, j) = -d - (j - 1) * e
(Along the borders the alignment is a single run of gaps, which costs d for the first gap character and e for each further one.)

Iteration:
  M(i, j) = s(xi, yj) + max { M(i - 1, j - 1), I(i - 1, j - 1), J(i - 1, j - 1) }
  I(i, j) = max { -e + I(i - 1, j), -d + M(i - 1, j) }
  J(i, j) = max { -e + J(i, j - 1), -d + M(i, j - 1) }

Termination:
  The score of the optimal alignment is given by max [ M(m, n), I(m, n), J(m, n) ], where m and n are the lengths of x and y.

Protein Sequence and Structure

(Consider taking CS273 next quarter for more details.)

Muscles are very important to the biology of vertebrates. Muscle cells are very long: 1-50 millimeters in length and 10-50 micrometers in diameter. They are composed of road-like structures (filaments) made of two proteins, actin and myosin. Actin filaments are like strings or roads; myosin filaments lie in between them and can pull the actin filaments closer together. This allows the muscles to contract to 70% of their original length.

Actin is a very important protein that can be found in almost every cell (even bacteria). It is involved in forming road-like structures inside the cell. We tend to think of cells as bags of different chemicals, proteins, and enzymes, but really cells are very elaborately sculptured machines, and actin plays a major role in keeping the shape and structure of each cell so that it can perform the appropriate functions. A road-like structure of actin is often an alternating pattern of three different types of actin.

Actin is very ancient and very abundant. In fact, it is one of the most abundant proteins in our bodies. There are 1-2 actin genes in bacteria, while humans have 6 actin genes that differ by about 4 amino acids. Humans and amoebas have 80% similarity in actin proteins.
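Returning briefly to the dynamic programming review: the three recurrences can be sketched in Python as follows. This is a minimal sketch rather than the course's implementation. The function name `affine_global_score`, the +1/-1 substitution scores, and the defaults d = 3 and e = 1 are assumptions made for illustration. The border values are likewise an assumption, chosen to agree with the recurrences: a run of i gap characters costs d + (i - 1) * e, and cells that no alignment can reach are set to -infinity.

```python
import math

NEG_INF = -math.inf


def affine_global_score(x, y, d=3.0, e=1.0, match=1.0, mismatch=-1.0):
    """Score of the best global alignment of x and y with affine gaps.

    Uses one matrix per state (M, I, J). Substitution scores and the
    default penalties are hypothetical values chosen for illustration.
    """
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    # Initialization: the empty alignment ends in M with score 0;
    # the borders hold a single run of gap characters.
    M[0][0] = 0.0
    for i in range(1, m + 1):
        I[i][0] = -d - (i - 1) * e
    for j in range(1, n + 1):
        J[0][j] = -d - (j - 1) * e

    # Iteration: one update per state, following the three recurrences.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1])
            I[i][j] = max(-e + I[i - 1][j], -d + M[i - 1][j])
            J[i][j] = max(-e + J[i][j - 1], -d + M[i][j - 1])

    # Termination: the best alignment may end in any of the three states.
    return max(M[m][n], I[m][n], J[m][n])


print(affine_global_score("ACGT", "AGT"))  # 0.0: three matches, one opened gap
```

Because I(i, j) and J(i, j) only look back at M and at themselves, the sketch enforces the same restriction as the automaton: a gap in one sequence never directly follows a gap in the other.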
Actin is generally grouped into the following categories:
  alpha-actin: in muscles
  beta-actin, gamma-actin: in non-muscle cells

Another extremely important protein is FtsZ (or FtsA), which helps in cell division. The FtsZ protein forms a ring around the dividing cell and makes the ring smaller and smaller until the cell divides. If we compare the FtsA protein to actin, we see that they are very similar, sharing 3 main domains, but FtsA has an additional domain (shown in yellow in the figure).

If we examine their sequences and align them, we see very high similarity in the 3 shared domains, while the 4th domain appears as one large gap (in yellow in the figure).

Protein Phylogenies

There are 2 main ways that proteins evolve (and families are formed):
1. Speciation: one ancestor species splits. When you compare the corresponding proteins in the descendant species, they are generally similar but have some differences.
2. Duplication: a region is copied, so one copy is free to take on a new purpose.

A phylogenetic tree can show how proteins are related. Duplication is generally represented by horizontal and vertical lines, while speciation is represented by direct edges with no bends. In the example figure, the two yellow proteins are found in the same species, and the two purple proteins are found in the same species. Actin must be a billion years old, or older.

Protein Structure

In proteins, structure determines function, and the sequence determines the structure. This means that, in principle, we can discover the function of a protein by examining only its sequence. One sequence has exactly one structure, and one structure generally has one function. Some examples of protein functions are: regulation of gene expression, structure of the cell (actin), movement (muscle), catalysis, transport of molecules in and out of cells, signaling between cells, etc. We will not study structure much in this course, but it is interesting to note that the study of sequences, and of sequence alignment, can be connected to the structure and function of proteins.
Proteins are composed of amino acids connected by peptide bonds. Amino acids differ from each other only in their side chains. There are 20 different amino acids, which have very different chemical properties and shapes, and can be divided into the following 3 categories:
1. Hydrophobic: These stay close to each other and away from water (much like oil does). Cytoplasm is mostly water, which pushes these amino acids together and causes the protein to fold. Examples include: Alanine, Valine, Isoleucine, Leucine, Methionine, Phenylalanine, Tyrosine, Tryptophan.
2. Hydrophilic (or polar): These like to stay close to water, forming bonds
