# Segmentation Conditional Random Fields (SCRFs)

**View the full content.**View Full Document

0 0 8 views

**Unformatted text preview:**

Segmentation Conditional Random Fields SCRFs A New Approach for Protein Fold Recognition Yan Liu1 Jaime Carbonell1 Peter Weigele2 and Vanathi Gopalakrishnan3 1 2 School of Computer Science Carnegie Mellon University Pittsburgh PA USA yanliu jgc cs cmu edu Biology Department Massachusetts Institute of Technology Cambridge MA USA pweigele mit edu 3 Center for Biomedical Informatics University of Pittsburgh PA USA vanathi cbmi pitt edu Abstract Protein fold recognition is an important step towards understanding protein three dimensional structures and their functions A conditional graphical model i e segmentation conditional random fields SCRFs is proposed to solve the problem In contrast to traditional graphical models such as hidden markov model HMM SCRFs follow a discriminative approach It has the flexibility to include overlapping or long range interaction features over the whole sequence as well as global optimally solutions for the parameters On the other hand the segmentation setting in SCRFs makes its graphical structures intuitively similar to the protein 3 D structures and more importantly provides a framework to model the long range interactions directly Our model is applied to predict the parallel helix fold an important fold in bacterial infection of plants and binding of antigens The crossfamily validation shows that SCRFs not only can score all known helices higher than non helices in Protein Data Bank but also demonstrate more success in locating each rung in the known helix proteins than BetaWrap a state of the art algorithm for predicting helix fold and HMMER a general motif detection algorithm based on HMM Applying our prediction model to Uniprot database we hypothesize previously unknown helices 1 Introduction It is believed that protein structures reveal important information about the protein functions One key step towards modeling a tertiary structure is to identify how secondary structures as building blocks arrange themselves in space i e the supersecondary structures or protein folds There has been significant work on predicting some well defined types of structural motifs or functional units such as and hairpins 1 4 The task of protein fold recognition is the following given a protein sequence and a particular fold or super secondary structure predict whether the protein contains the structural fold and if so locate its exact positions in the sequence The traditional approach for protein fold prediction is to search the database using PSI BLAST 5 or match against an HMM profile built from sequences with the same fold by HMMER 4 or SAM 3 These approaches work well for short motifs with strong sequence similarities However there exist many important motifs or folds without clear sequence similarity and involving the long range interactions such as folds in class 6 These cases necessitate a more powerful model which can capture the structural characteristics of the protein fold Interestingly the protein fold recognition task parallels an emerging trend in machine learning community i e the structure prediction problem which predict the labels of each node in a graph given the observation with particular structures for example webpage classification using the hyperlink graph or object recognition using grids of image pixels The conditional graphical models prove to be one of the most effective tools for this kind of problem 7 8 In fact several graphical models have been applied to protein structure prediction One of the early approaches is to apply simple hidden markov models HMMs to protein secondary structure prediction and protein motif detection 3 4 9 Delcher et al introduced probabilistic causal networks for protein secondary structure modeling 10 Recently Liu et al applied conditional random fields CRFs a discriminative graphical model based on undirected graph for protein secondary structure prediction 11 Chu et al extended segmental semi Markov model SSMM under the Baysian framework for protein secondary structures 12 The bottleneck for protein fold prediction is the long range interactions which could be either two strands with hydrogen bonds in a parallel sheet or helix pairs in coupled helical motifs Generative models such as HMM or SSMM assume a particular generating process which makes it difficult to consider overlapping features and long range interactions Discriminative graphical models such as CRFs assume a single residue as an observation Thus they fail to capture the features over a whole secondary structure element or the interactions between adjacent elements in 3 D which may be distant in the primary sequence To solve the problem we propose segmentation conditional random fields SCRFs which retain all the advantages of original CRFs and at the same time can handle observations of variable length 2 Conditional Random Fields CRFs Simple graphical chain models such as hidden markov models HMMs have been applied to various problems As a generative model HMMs assume that the data are generated by a particular model and compute the joint distribution of the observation sequence x and state sequence y i e P x y However generative models might perform poorly with inappropriate assumptions In contrast discriminative models such as neural networks and support vector machines SVMs estimate the decision boundary directly without computing the underlying data distribution and thus often achieve better performance Recently several discriminative graphical models have been proposed by the machine learning community such as Maximum Entropy Markov Models MEMMs 13 and Conditional Random fields CRFs 14 Among these models CRFs proposed by Lafferty et al are very effective in many applications including information extraction image processing and so on 8 7 CRFs are undirected graphical models also known as random fields as opposed to directed graphical models such as HMMs to compute the conditional likelihood P y x directly By the Hammersely Clifford theorem 15 the conditional probability P y x is proportional to the product of the potential functions over all the cliques in the graph Y 1 P y x c yc xc Z0 c C y x where c yc xc is the potential function over the clique c and Z0 is the normalization factor over all possible assignments of y see 16 for more detail For a chain structure CRFs define the conditional probability as N X K X 1 P y x exp k fk yi 1 yi x i Z0 i 1 1 k 1 where fk is an arbitrary feature function over x N is the number of observations and