Learning from Diversity: Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines

Rob Patro and Carl Kingsford
Center for Bioinformatics and Computational Biology, University of Maryland
Nov 16, 2010

Overview
- Challenge: epitope-antibody recognition
- Solution: ensemble of support vector machines
- Trained with probabilistic extension
- Variety of feature classes: physicochemical properties, string kernels, structure
- Performance of individual methods and ensemble

Problem Overview
- The Challenge
- The Details
- Measure binding affinity: $\mathrm{aff}(p_i) \in [0, 65536]$
- $C^{+} = \{ p_i \mid \mathrm{aff}(p_i) \in [10000, 65536] \}$: 6,841 binders
- $C^{-} = \{ p_i \mid \mathrm{aff}(p_i) \in [0, 1000] \}$: 20,437 non-binders
- Learn a function to predict binding: $f : P \to [0, 1]$
- Binding with linear epitopes: simpler sequence/affinity relation
- $f(p_i) > 0.5 \Rightarrow p_i \in C^{+}$ and $f(p_i) < 0.5 \Rightarrow p_i \in C^{-}$

System Overview
- Individual classifiers trained on various features, each a function $f$ with values in $[0, 1]$, $f \in M$
- Candidate methods $M$: Decision Trees (Boosted, Bagged), Random Forests, Naive Bayes, Logistic Regression, Maximum Entropy Classification, Balanced Winnow Classifiers, etc., Support Vector Machines (SVM)
- Aggregate scores of classifiers on a scale from 0 ($C^{-}$, unlikely binder) through 0.5 to 1.0 ($C^{+}$, likely binder)
- Produces prediction for binding class

Probabilistic SVMs
- Ideally we want a confidence in each prediction (Platt 1999)
- For each prediction we obtain a posterior probability
- Allows ranking of predictions by posterior
- Aids in classifier combination

Combining Predictions
- Probabilistic SVMs trained on various features, using various kernels, with various parameters
- Combined by weighted voting

Choosing Features
- To train SVMs we translate each peptide $p_i$ into a feature vector $x_i$
- Good features are essential
- Good features should: be discriminative, lead to class separability, be efficient to compute
- Real features: capture partial information, separate data subsets, are often complementary
- Consider many useful features: predictive power

Which Features?
- Sequence Features: k-spectrum kernel, mismatch kernel, substitution kernel, string subsequence kernel, sparse spatial sample kernel
- Physico/Biochemical Features: BLOSUM encoding, AAIndex encoding, local composition
- Structure Features: peptide structure, shape complementarity

Sequence Features: String Kernels
- String kernels assign a similarity to a pair of strings
- k-spectrum kernel: consider all $K$ k-mers that occur in the training set
- Encode each peptide as a vector $v \in \mathbb{R}^{K}$ with $v_j = 1$ if $p$ contains the $j$-th k-mer and $v_j = 0$ otherwise
- Or $v_j$ can encode the frequency of the $j$-th k-mer in the peptide
- Other string kernels: mismatch kernel, substitution kernel, restricted gappy kernel, string subsequence kernel, Sparse Spatial Sample (SSS) kernel

Compositional Features
- Consider physicochemical properties of each peptide sequence: hydropathy, antigenicity, structure preference, etc.

  Amino Acid  Hydropathy    Amino Acid  Hydropathy
  A            1.8          L            3.8
  R           -4.5          K           -3.9
  N           -3.5          M            1.9
  D           -3.5          F            2.8
  C            2.5          P           -0.6
  Q           -3.5          S           -0.8
  E           -3.5          T           -0.7
  G           -0.4          W           -0.9
  H           -3.2          Y           -1.3
  I            4.5          V            4.2

- Average the property over the entire peptide
- Map each peptide to a scalar $v \in \mathbb{R}$ (0.44 in the slide's example)

Local Compositional Features
- Physicochemical features can be useful, but are global
- The epitope is only a subset of the peptide
- Consider a sliding window of a given length $w$
- Move the window along the peptide from left to right
- Average values over the window
- Concatenate the output to represent the peptide

Orthogonal Encoding
- Orthonormal representation proposed by Qian1988
- Map each amino acid $a_j \in p_i$ to a 20-long bit vector $v_j$
- $x_i = (v_0, v_1, \ldots, v_{k-1})$ for a peptide of length $k$

Property Encoding
- Orthogonality is not actually important in our application
- Replace the indicator vector by something more informative, e.g. a row from a BLOSUM or PAM matrix

AAIndex Encoding
- The Amino Acid Index (AAIndex; Kawashima2008) compiles a growing list of different physicochemical and biochemical properties of amino acids (544 to date)
- Is it possible to make use of all this information?
- Use non-linear factor matrix of AAIndex (Nanni2010)

Structural Features
- Consider how well IgG and a peptide fit together

Results

  Features                    AUROC   AUPR    AUROC gap to ensemble   AUPR gap to ensemble
  k-spectrum                  0.85    0.70    0.043                   0.072
  Sparse Spatial Sample       0.87    0.73    0.023                   0.042
  Nonlinear Fisher Mat.       0.86    0.69    0.024                   0.082
  Statistical Analysis Mat.   0.85    0.67    0.025                   0.102
  BLOSUM Encoding             0.86    0.70    0.024                   0.072
  Local Composition           0.88    0.74    0.013                   0.032
  Structure                   0.74    0.53    0.153                   0.242
  ensemble                    0.893   0.772
  2nd Place                   0.892   0.766   0.001                   0.006
  3rd Place                   0.864   0.691   0.029                   0.081
  4th Place                   0.855   0.689   0.038                   0.083

  Table note: using various physicochemical features.

Performance Curves: ROC (figure)

Performance Curves: P-R (figure)

Conclusions
- Many good features exist
- They capture some non-overlapping information
- Ensemble solutions, used properly, are effective
- Structure features are hard to compute
- Much room for improvement here
- Simple features should not be discounted
- The local composition feature was the best single classifier
- We didn't encounter anyone using this in the literature

Thanks
- Funding: NIH grant 1R21AI085376 and NSF grant 0849899 to C.K.
- For many interesting and useful conversations: Geet Duggal, Darya Filippova, Justin Malin, Guillaume Marçais, Saket Navlakha, Emre Sefer
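The slides on probabilistic SVMs and combining predictions describe turning each classifier's output into a posterior and merging the posteriors by weighted voting. A minimal Python sketch of that idea follows. It assumes the sigmoid form commonly associated with Platt (1999), with parameters `a` and `b` already fitted on held-out data, and it assumes the weights are supplied by the caller; the slides do not say how the weights were chosen, and the function names here are hypothetical.

```python
import math


def platt_probability(decision_value, a, b):
    """Map a raw SVM decision value s to P(binder | s) = 1 / (1 + exp(a*s + b)).

    a and b are assumed to have been fitted beforehand (a is typically negative,
    so larger decision values give larger probabilities).
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def weighted_vote(posteriors, weights):
    """Combine per-classifier posteriors into a single score in [0, 1].

    The result is the weight-normalized average of the posteriors.
    """
    if len(posteriors) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, posteriors)) / total


def predict_class(score, threshold=0.5):
    """Apply the 0.5 decision rule from the problem overview."""
    return "binder" if score > threshold else "non-binder"


# Three hypothetical classifiers; the first is trusted twice as much.
score = weighted_vote([0.9, 0.6, 0.2], [2.0, 1.0, 1.0])
label = predict_class(score)
```

Because every classifier reports a probability on the same scale, the combined score can also be used directly to rank peptides, which is the other use of the posteriors named on the slide.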
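The k-spectrum encoding from the string kernel slide (index all k-mers seen in training, then mark presence or frequency per peptide) can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the vocabulary order (sorted k-mers) is an arbitrary choice, and k-mers unseen in training are simply ignored, which the slides do not specify.

```python
from itertools import chain


def kmers(peptide, k):
    """Return the overlapping k-mers of a peptide, left to right."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]


def build_vocabulary(training_peptides, k):
    """Assign an index j to every distinct k-mer in the training set."""
    distinct = sorted(set(chain.from_iterable(kmers(p, k) for p in training_peptides)))
    return {kmer: j for j, kmer in enumerate(distinct)}


def spectrum_vector(peptide, vocab, k, counts=False):
    """Encode a peptide as a length-K vector.

    counts=False gives the 0/1 presence encoding; counts=True gives k-mer frequencies.
    """
    v = [0] * len(vocab)
    for kmer in kmers(peptide, k):
        j = vocab.get(kmer)
        if j is None:
            continue  # assumption: k-mers not seen in training are skipped
        v[j] = v[j] + 1 if counts else 1
    return v


def spectrum_kernel(p, q, vocab, k, counts=False):
    """Similarity of two peptides: inner product of their spectrum vectors."""
    vp = spectrum_vector(p, vocab, k, counts)
    vq = spectrum_vector(q, vocab, k, counts)
    return sum(a * b for a, b in zip(vp, vq))


vocab = build_vocabulary(["ACDA", "CDAC"], k=2)  # toy training set
```

With this toy vocabulary, the two training peptides share all three of their 2-mers, so their binary spectrum kernel value equals 3.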
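The global and local compositional features (average a per-residue property over the whole peptide, or over a sliding window of length w and concatenate the window averages) can be sketched as follows. The scale used here is the standard Kyte-Doolittle hydropathy scale, which is an assumption: the slides do not name their scale, and individual entries in the slide's table may differ. The function names are hypothetical.

```python
# Assumption: Kyte-Doolittle hydropathy values; the slides do not name the scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def global_average(peptide, scale=HYDROPATHY):
    """Global compositional feature: one scalar per peptide."""
    return sum(scale[aa] for aa in peptide) / len(peptide)


def local_composition(peptide, w, scale=HYDROPATHY):
    """Local compositional feature: window averages, concatenated left to right.

    A peptide of length n yields n - w + 1 values.
    """
    if not 1 <= w <= len(peptide):
        raise ValueError("window length must be between 1 and the peptide length")
    values = [scale[aa] for aa in peptide]
    return [sum(values[i:i + w]) / w for i in range(len(values) - w + 1)]


features = local_composition("AILK", w=2)  # three window averages
```

Several property scales can be handled by calling `local_composition` once per scale and concatenating the resulting lists, which keeps positional information that the single global average discards.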
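The orthogonal and property encodings differ only in which per-residue vector gets concatenated: a 20-long indicator vector, or a row of informative values such as a substitution-matrix row. A short Python sketch under stated assumptions: the alphabetical residue order is an arbitrary choice, and the `rows` argument is a hypothetical lookup table the caller supplies (no real BLOSUM or PAM values are reproduced here).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: alphabetical one-letter order
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def orthogonal_encoding(peptide):
    """Concatenate one 20-long indicator vector per residue (length 20 * k)."""
    x = []
    for aa in peptide:
        v = [0] * len(AMINO_ACIDS)
        v[INDEX[aa]] = 1
        x.extend(v)
    return x


def property_encoding(peptide, rows):
    """Concatenate one property row per residue.

    rows maps each residue to a list of numbers, for example that residue's
    row of a substitution matrix loaded by the caller.
    """
    x = []
    for aa in peptide:
        x.extend(rows[aa])
    return x


one_hot = orthogonal_encoding("AC")
toy_rows = {"A": [1.0, -1.0], "C": [0.5, 2.0]}  # made-up values, for shape only
encoded = property_encoding("AC", toy_rows)
```

Both encodings produce fixed-length vectors only when all peptides have the same length, so peptides of differing length would need padding or truncation before training.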


UMD CMSC 423 - Epitope Prediction with Sequence and Structure Features
