Stanford CS 374 - Study Notes - D473197

School name Stanford University

Course CS 374 - Algorithms in Biology

Pages 7

This preview shows page 1-2 out of 7 pages.

BIOINFORMATICS Vol. 19 Suppl. 1 2003, pages i205–i211
DOI: 10.1093/bioinformatics/btg1027

An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins

Pier Luigi Martelli, Piero Fariselli and Rita Casadio∗

Laboratory of Biocomputing, CIRB/Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, Italy

Received on January 6, 2003; accepted on February 20, 2003

ABSTRACT

Motivation: All-alpha membrane proteins constitute a functionally relevant subset of the whole proteome. Their content ranges from about 10 to 30% of the cell proteins, based on sequence comparison and specific predictive methods. Due to the paucity of membrane proteins solved with atomic resolution, the training/testing sets of predictive methods for protein topography and topology routinely include very few well-solved structures mixed with a hundred proteins known with low resolution. Moreover, available predictors fail in predicting recently crystallised membrane proteins (Chen et al., 2002). Presently the number of well-solved membrane proteins comprises some 59 chains of low sequence homology. It is therefore possible to train/test predictors only with the set of proteins known with atomic resolution and evaluate more thoroughly the performance of different methods.

Results: We implement a cascade-neural network (NN), two different hidden Markov models (HMM), and their ensemble (ENSEMBLE) as a new method. We train and test in cross validation the three methods and ENSEMBLE on the 59 well resolved membrane proteins. ENSEMBLE scores with a per-protein accuracy of 90% for topography and 71% for topology, outperforming the best single method by 7 and 5 percentage points, respectively. When tested on a low resolution set of 151 proteins, with no homology with the 59 proteins, the per-protein accuracy of ENSEMBLE is 76% for topography and 68% for topology.
Our results also indicate that the performance of ENSEMBLE is higher than that of the best predictors presently available on the Web.

Contact: [email protected]; http://www.biocomp.unibo.it

∗To whom correspondence should be addressed.

INTRODUCTION

Membrane proteins are involved in almost every cell activity and signal transmission. However their modelling is generally more difficult than that of globular proteins, due to the few examples of membrane proteins known with atomic resolution. For this reason a 2D model of the protein is routinely predicted, highlighting those regions that can interact with the membrane phase. This is done by predicting first the location of transmembrane segments along the protein sequence (topography) and then the location of the N and C terminus with respect to the lipid bilayer (topology). This last step, depending on the predictive method, can be computed using different ‘ad hoc’ rules derived from experiments and/or statistical analysis (von Heijne, 1999) or using hidden Markov models (Tusnady and Simon, 1998; Krogh et al., 2001).

Two types of membrane proteins have been characterised: the first includes all-alpha proteins that, to a different extent, interact with the lipid bilayer of the cytoplasmic membrane of all cells (White and Wimley, 1999); the second group includes the so called beta-barrel membrane proteins, which interact with the outer membrane with antiparallel beta-strands forming barrels, with an even number of segments (Schulz, 2000). Few methods have been described so far for the prediction of the all-beta membrane proteins (Jacoboni et al., 2001; Martelli et al., 2002; Wimley, 2002, and references therein). On the contrary, several methods have been developed to predict the location of transmembrane segments in the all-helical membrane proteins (for detailed reviews see Möller et al., 2001; Chen et al., 2002).

Routinely, different datasets are used to score the predictor performance.
Basically two sets of proteins are considered: the first includes high resolution structures, the second topological models obtained mainly from experimental data (referred to as the low resolution set; Möller et al., 2000). A recent thorough analysis highlights that none of the different advanced methods, based on machine learning and available on the Web (Web predictors), when tested on the high resolution structures of membrane proteins perform consistently best, and that wrong predictions are different for different predictors (Chen et al., 2002).

With the purpose of overcoming the blur introduced by the low resolution training set, we select 59 high-resolution membrane proteins with low sequence identity to train/test our predictors. We implement a neural network and two HMMs, known to be among the best performing predictors for the task at hand (Chen et al., 2002). We also develop their ensemble (ENSEMBLE) and this is new for the prediction of membrane proteins. Our strategy allows a more thorough comparison between different approaches, based on the high resolution set of membrane proteins, and uses as a blind test the low resolution set. This is different from what was done before, since the predictors previously described were trained on mixed sets of proteins, including also the low resolution models and did not compare predictors on the same training/testing set.

Bioinformatics 19(Suppl. 1) © Oxford University Press 2003; all rights reserved.

With our approach, we find that all methods perform similarly; however the performance is maximal only when the ensemble of predictors is used, including the neural network and the two HMMs, all trained on evolutionary information. Furthermore, when predicting both the high resolution and low resolution sets of membrane proteins, ENSEMBLE outperforms the best performing Web predictors.

SYSTEM AND METHODS

Datasets

We use three datasets for different purposes.
The first one (S59) is derived from the database of membrane proteins available at http://blanco.biomol.uci.edu (Jayasinghe et al., 2001). S59 comprises 59 high resolution membrane proteins, which are used for training and scoring the predictive methods (available at http://www.biocomp.unibo.it/gigi/ENSEMBLE). The second (S151) is a subset of Möller's database (Möller et al., 2000) containing only low resolution proteins, whose sequences do not have similarity with those in S59. The third dataset (S1396) is a non redundant set of 1396 globular proteins, whose structures are known and whose sequences are less than 25% similar (http://www.cbrc.jp/papia/papia.html).

Each predictor is trained using evolutionary information in the form of sequence
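
To make the terms used in this excerpt concrete (topography as the location of transmembrane segments, topology as the side of the N terminus, per-protein accuracy, and an ensemble of three predictors), here is a minimal Python sketch. The excerpt does not say how ENSEMBLE combines its methods or how a prediction is judged correct, so the averaging rule, the 0.5 threshold, the overlap criterion, and every function name below are assumptions made for illustration only, not the authors' method.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # half-open residue interval [start, end)


def average_scores(per_method: List[List[float]]) -> List[float]:
    """Average per-residue transmembrane scores across methods.

    Assumption: a plain mean; the excerpt does not state the real rule.
    """
    n_methods = len(per_method)
    return [sum(column) / n_methods for column in zip(*per_method)]


def segments(scores: List[float], threshold: float = 0.5) -> List[Segment]:
    """Return maximal runs of residues whose score is >= threshold."""
    found: List[Segment] = []
    start: Optional[int] = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                      # a new run begins here
        elif score < threshold and start is not None:
            found.append((start, i))       # the run ended at residue i - 1
            start = None
    if start is not None:                  # run extends to the last residue
        found.append((start, len(scores)))
    return found


def topography_correct(pred: List[Segment], true: List[Segment]) -> bool:
    """Hypothetical criterion: same segment count, each pair overlapping."""
    if len(pred) != len(true):
        return False
    return all(p[0] < t[1] and t[0] < p[1] for p, t in zip(pred, true))


def topology_correct(pred: List[Segment], true: List[Segment],
                     pred_n_inside: bool, true_n_inside: bool) -> bool:
    """Topology needs correct topography plus the correct N-terminus side."""
    return topography_correct(pred, true) and pred_n_inside == true_n_inside


def per_protein_accuracy(outcomes: List[bool]) -> float:
    """Fraction of proteins whose whole prediction was judged correct."""
    if not outcomes:
        raise ValueError("no proteins to score")
    return sum(outcomes) / len(outcomes)


# Toy per-residue scores standing in for the NN and the two HMMs.
nn = [0.1, 0.2, 0.9, 0.8, 0.7, 0.2, 0.1]
hmm1 = [0.0, 0.4, 0.8, 0.9, 0.6, 0.3, 0.0]
hmm2 = [0.2, 0.3, 0.7, 0.7, 0.5, 0.4, 0.2]
predicted = segments(average_scores([nn, hmm1, hmm2]))
```

Under these assumptions a protein counts as correct for topology only if its topography is also correct, which is consistent with the topology figures in the abstract being lower than the topography figures.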
