Segmentation Conditional Random Fields (SCRFs)


Segmentation Conditional Random Fields (SCRFs): A New Approach for Protein Fold Recognition

Yan Liu1, Jaime Carbonell1, Peter Weigele2, and Vanathi Gopalakrishnan3

1 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
{yanliu, jgc}@cs.cmu.edu
2 Biology Department, Massachusetts Institute of Technology, Cambridge, MA, USA
{pweigele}@mit.edu
3 Center for Biomedical Informatics, University of Pittsburgh, PA
[email protected]

Abstract. Protein fold recognition is an important step towards understanding protein three-dimensional structures and their functions. A conditional graphical model, segmentation conditional random fields (SCRFs), is proposed to solve the problem. In contrast to traditional graphical models such as the hidden Markov model (HMM), SCRFs follow a discriminative approach. They have the flexibility to include overlapping or long-range interaction features over the whole sequence, and their parameters have globally optimal solutions. In addition, the segmentation setting in SCRFs makes their graphical structures intuitively similar to protein 3-D structures and, more importantly, provides a framework for modeling the long-range interactions directly.

Our model is applied to predict the parallel β-helix fold, an important fold in bacterial infection of plants and binding of antigens. Cross-family validation shows that SCRFs not only score all known β-helices higher than non β-helices in the Protein Data Bank, but also locate each rung in the known β-helix proteins more successfully than BetaWrap, a state-of-the-art algorithm for predicting the β-helix fold, and HMMER, a general motif detection algorithm based on HMMs. Applying our prediction model to the Uniprot database, we hypothesize previously unknown β-helices.

1 Introduction

It is believed that protein structures reveal important information about the protein functions.
One key step towards modeling a tertiary structure is to identify how secondary structures, as building blocks, arrange themselves in space, i.e. the supersecondary structures or protein folds. There has been significant work on predicting some well-defined types of structural motifs or functional units, such as αα- and ββ-hairpins [1–4]. The task of protein fold recognition is the following: given a protein sequence and a particular fold or supersecondary structure, predict whether the protein contains the structural fold and, if so, locate its exact positions in the sequence.

The traditional approach for protein fold prediction is to search the database using PSI-BLAST [5] or to match against an HMM profile built from sequences with the same fold by HMMER [4] or SAM [3]. These approaches work well for short motifs with strong sequence similarities. However, there exist many important motifs or folds without clear sequence similarity and involving long-range interactions, such as folds in the β class [6]. These cases necessitate a more powerful model, which can capture the structural characteristics of the protein fold. Interestingly, the protein fold recognition task parallels an emerging trend in the machine learning community, i.e. the structure prediction problem, which predicts the labels of each node in a graph given observations with particular structures, for example webpage classification using the hyperlink graph or object recognition using grids of image pixels. Conditional graphical models have proved to be among the most effective tools for this kind of problem [7, 8].

In fact, several graphical models have been applied to protein structure prediction. One of the early approaches is to apply simple hidden Markov models (HMMs) to protein secondary structure prediction and protein motif detection [3, 4, 9]; Delcher et al. introduced probabilistic causal networks for protein secondary structure modeling [10]. Recently, Liu et al.
applied conditional random fields (CRFs), a discriminative graphical model based on an undirected graph, to protein secondary structure prediction [11]; Chu et al. extended the segmental semi-Markov model (SSMM) under the Bayesian framework for protein secondary structures [12].

The bottleneck for protein fold prediction is the long-range interactions, which could be either two β-strands with hydrogen bonds in a parallel β-sheet or helix pairs in coupled helical motifs. Generative models, such as HMMs or SSMMs, assume a particular generating process, which makes it difficult to consider overlapping features and long-range interactions. Discriminative graphical models, such as CRFs, assume a single residue as an observation. Thus they fail to capture the features over a whole secondary structure element or the interactions between adjacent elements in 3-D, which may be distant in the primary sequence. To solve the problem, we propose segmentation conditional random fields (SCRFs), which retain all the advantages of the original CRFs and at the same time can handle observations of variable length.

2 Conditional Random Fields (CRFs)

Simple graphical chain models, such as hidden Markov models (HMMs), have been applied to various problems. As “generative” models, HMMs assume that the data are generated by a particular model and compute the joint distribution of the observation sequence x and state sequence y, i.e. P(x, y). However, generative models might perform poorly with inappropriate assumptions. In contrast, discriminative models, such as neural networks and support vector machines (SVMs), estimate the decision boundary directly without computing the underlying data distribution and thus often achieve better performance.

Recently, several discriminative graphical models have been proposed by the machine learning community, such as Maximum Entropy Markov Models (MEMMs) [13] and Conditional Random Fields (CRFs) [14].
Among these models, CRFs, proposed by Lafferty et al., are very effective in many applications, including information extraction, image processing and so on [7, 8].

CRFs are “undirected” graphical models (also known as random fields, as opposed to directed graphical models such as HMMs) that compute the conditional likelihood P(y|x) directly. By the Hammersley-Clifford theorem [15], the conditional probability P(y|x) is proportional to the product of the potential functions over all the cliques in the graph,

$$P(y|x) = \frac{1}{Z_0} \prod_{c \in C(y,x)} \Phi_c(y_c, x_c),$$

where $\Phi_c(y_c, x_c)$ is the potential function over the clique c, and $Z_0$ is the normalization factor over all possible assignments of y (see [16] for more detail).

For a chain
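
As a rough illustration of the product-of-potentials form of P(y|x) given above, the following Python sketch evaluates it by brute force on a toy three-residue sequence. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-label set, the cliques (single nodes and adjacent pairs), and the hand-picked potentials `node_potential` and `edge_potential`. The normalization factor is computed by enumerating all labelings, which is only feasible for very short sequences.

```python
import itertools
import math

LABELS = ("H", "E")  # hypothetical two-state label set, for illustration only


def node_potential(label, residue):
    """Hypothetical single-node potential: rewards 'E' on V or I residues and 'H' elsewhere."""
    score = 1.0 if (label == "E") == (residue in "VI") else 0.0
    return math.exp(score)


def edge_potential(prev_label, label):
    """Hypothetical pairwise potential: rewards neighbouring nodes that share a label."""
    return math.exp(0.5 if prev_label == label else 0.0)


def unnormalized_score(y, x):
    """Product of the potentials over all cliques (nodes and adjacent pairs)."""
    score = 1.0
    for i, (label, residue) in enumerate(zip(y, x)):
        score *= node_potential(label, residue)
        if i > 0:
            score *= edge_potential(y[i - 1], label)
    return score


def conditional_distribution(x):
    """Return P(y|x) for every labeling y, normalizing by brute-force enumeration."""
    labelings = list(itertools.product(LABELS, repeat=len(x)))
    scores = [unnormalized_score(y, x) for y in labelings]
    z0 = sum(scores)  # normalization factor over all possible assignments of y
    return {y: s / z0 for y, s in zip(labelings, scores)}


dist = conditional_distribution("VIA")
best = max(dist, key=dist.get)  # most probable labeling under the toy potentials
```

Because the potentials are strictly positive and the sum runs over every labeling, the values in `dist` form a proper conditional distribution over the eight possible labelings. Practical implementations avoid the exponential enumeration with dynamic programming; the sketch is only meant to make the factorization concrete.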

