UMD CMSC 423 - Learning from Diversity

Learning from Diversity
Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines

Rob Patro and Carl Kingsford
Center for Bioinformatics and Computational Biology, University of Maryland
Nov. 16, 2010

Outline: Background, Learning Method, Features, Results

Overview
- Challenge: epitope-antibody recognition
- Solution: an ensemble of support vector machines, trained with a probabilistic extension, over a variety of feature classes: physicochemical properties, string kernels, structure
- Performance of the individual methods and of the ensemble

Problem Overview: The Challenge
(Figure: antibody structure with its binding sites highlighted; image © http://visualscience.ru, 2010)
- Binding with linear epitopes
- "Simpler" sequence → affinity relation

The Details
- Measure binding affinity aff(p_i) ∈ [0, 65536]
- C+ = { p_i | aff(p_i) ∈ [10000, 65536] }: 6,841 binders
- C− = { p_i | aff(p_i) ∈ [0, 1000] }: 20,437 non-binders
- Learn a function f : P → [0, 1] to predict binding:
  f(p_i) ≥ 0.5 ⇒ p_i ∈ C+
  f(p_i) < 0.5 ⇒ p_i ∈ C−

System Overview
(Figure: a query peptide is scored by individual classifiers f_0, f_1, ..., f_M; the aggregated score falls in [0, 1], with scores below 0.5 marking unlikely binders (C−) and scores above 0.5 likely binders (C+))
- Individual classifiers are trained on various features; we use support vector machines (SVMs), though many alternatives exist: decision trees, boosted/bagged trees, random forests, naive Bayes, logistic regression, maximum entropy classification, (balanced) Winnow classifiers, etc.
- Aggregating the classifiers' scores produces the prediction for the binding class

Probabilistic SVMs
- Ideally we want a confidence in each prediction (Platt 1999)
- For each prediction we obtain a posterior probability
- This allows ranking predictions by posterior and aids in classifier combination

Combining Predictions
- Probabilistic SVMs are trained on various features, using various kernels, with various parameters
- Their predictions are combined by weighted voting:
  f(p_i) = (1/Z) Σ_j w_j P_j(C+ | p_i)
  where the weight w_j of the j-th classifier is based on its cross-validation performance, P_j(C+ | p_i) is the j-th classifier's posterior for the positive class label, and Z = Σ_j w_j is a normalizing constant
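
A minimal sketch of this weighted vote in Python (not the authors' implementation): scikit-learn's SVC(probability=True) stands in for the Platt-scaled probabilistic SVMs, and mean cross-validated AUROC stands in for the otherwise unspecified cross-validation weight.

```python
# Sketch of the weighted-voting ensemble. Assumptions beyond the slides:
# scikit-learn SVC as the probabilistic SVM, AUROC as the weight w_j,
# and labels y encoded as 0 (non-binder) / 1 (binder).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fit_weighted_ensemble(feature_sets, y):
    """feature_sets: one (n_samples, d_j) array per feature class."""
    models, weights = [], []
    for X in feature_sets:
        clf = SVC(kernel="rbf", probability=True)  # posteriors via Platt scaling
        weights.append(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
        models.append(clf.fit(X, y))
    return models, np.asarray(weights)

def predict_binding(models, weights, feature_sets):
    """f(p_i) = sum_j w_j P_j(C+ | p_i) / Z, with Z = sum_j w_j."""
    post = np.stack([m.predict_proba(X)[:, 1]          # P_j(C+ | p_i)
                     for m, X in zip(models, feature_sets)])
    return weights @ post / weights.sum()              # in [0, 1]; >= 0.5 -> C+
```

Because each w_j is estimated on held-out folds, a feature class that generalizes poorly is automatically down-weighted in the vote.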
Choosing Features
- To train the SVMs we translate each peptide p_i into a feature vector x_i
- Good features are essential: they should be discriminative, lead to class separability, and be efficient to compute
- Real features capture partial information, separate subsets of the data, and are often complementary
- Considering many useful features → predictive power

Which Features?
For an example peptide such as ILAMRSHYPF, three feature classes are considered:
- Sequence features: k-spectrum kernel, mismatch kernel, substitution kernel, string subsequence kernel, sparse spatial sample kernel
- Physico/biochemical features: BLOSUM encoding, AAIndex encoding, local composition
- Structure features: peptide/structure shape complementarity

Sequence Features (String Kernels)
- String kernels assign a similarity to a pair of strings
- k-spectrum kernel: consider all K k-mers that occur in the training set and encode each peptide as a vector v ∈ R^K, where v_j = 1 if the peptide contains the j-th k-mer and 0 otherwise; alternatively, v_j can encode the frequency of the j-th k-mer in the peptide
- Other string kernels: mismatch kernel, substitution kernel, restricted gappy kernel, string subsequence kernel, sparse spatial sample (SSS) kernel
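
The k-spectrum feature map is simple enough to spell out. A minimal sketch, with illustrative helper names that are not from the talk:

```python
# Sketch of the k-spectrum feature map described above.
from itertools import chain

def kmer_vocabulary(train_peptides, k):
    """All K distinct k-mers occurring in the training set, in a fixed order."""
    kmers = chain.from_iterable(
        (p[i:i + k] for i in range(len(p) - k + 1)) for p in train_peptides)
    return sorted(set(kmers))

def spectrum_vector(peptide, vocab, k, binary=True):
    """v in R^K: v_j indicates (binary=True) or counts the j-th k-mer."""
    counts = {}
    for i in range(len(peptide) - k + 1):
        kmer = peptide[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    if binary:
        return [1 if m in counts else 0 for m in vocab]
    return [counts.get(m, 0) for m in vocab]

# The k-spectrum kernel of two peptides is the dot product of their vectors.
```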


Compositional Features
- Consider physicochemical properties of each peptide sequence: hydropathy, antigenicity, structure preference, etc.
- Average the property over the entire peptide, mapping each peptide to a scalar v ∈ R

Amino acid hydropathy (values as shown on the slide):
A 1.8   R -4.5   N -3.5   D -3.5   C 2.5   Q -3.5   E -3.5   G -0.4   H -3.2   I 4.5
L 3.8   K -3.9   M 1.9    F 2.8    P -0.6  S -0.8   T -0.7   W -0.9   Y -1.3   V 4.2

Example: the residues of ILAMRSHYPF have hydropathies 4.5, 3.8, 1.8, 1.9, -4.5, -0.8, -3.2, -1.3, -0.6, 2.8, which average to 0.44.

Local Compositional Features
- Physicochemical features can be useful, but they are global, and the epitope is only a subset of the peptide
- Instead, consider a sliding window of a given length w: move the window along the peptide from left to right, average the values over each window, and concatenate the outputs to represent the peptide
- For ILAMRSHYPF this produces window averages 3.36, 2.5, -0.26, ...
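
A minimal sketch of both compositional features, using the hydropathy table above; the function names are illustrative.

```python
# Global and local (sliding-window) composition features.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -0.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def global_composition(peptide, table=HYDROPATHY):
    """Global feature: the property averaged over the entire peptide."""
    return sum(table[aa] for aa in peptide) / len(peptide)

def local_composition(peptide, w, table=HYDROPATHY):
    """Local feature: the average over each length-w window, concatenated."""
    vals = [table[aa] for aa in peptide]
    return [sum(vals[i:i + w]) / w for i in range(len(vals) - w + 1)]

# global_composition("ILAMRSHYPF") -> 0.44, as on the slide. With w = 3 the
# first window averages from local_composition are 3.37, 2.5, -0.27, which
# the slide truncates to 3.36, 2.5, -0.26.
```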

Orthogonal Encoding
- An orthonormal representation proposed by Qian (1988)
- Map each amino acid a_j ∈ p_i to a 20-long bit vector v_j containing a single 1 in the position for that amino acid (the slide illustrates the residues C, E, H, R each selecting a different column of the encoding matrix)
- Concatenate the vectors: x_i = v_0 v_1 ... v_{k-1} for a peptide of length k

Property Encoding
- Orthogonality is not actually important in our application
- Replace the indicator vector with something more informative, e.g. a row from a BLOSUM or PAM matrix (the slide shows C, E, H, R mapped to their BLOSUM62 rows: 9 -1 -1 ..., -4 0 0 ..., -3 -1 0 ..., -3 -1 -1 ...)

AAIndex Encoding
- The Amino Acid Index (AAIndex) (Kawashima 2008) compiles a growing list of different physicochemical and biochemical properties of amino acids ... 544 to date!
- Is it possible to make use of all this information? Use a non-linear factor matrix of the AAIndex (Nanni 2010)
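
A minimal sketch contrasting the two per-residue encodings; the fixed column order is an assumption, and the full 20x20 BLOSUM62 rows (available, e.g., via Biopython) are omitted.

```python
# Orthogonal (one-hot) vs. property (matrix-row) encoding of a peptide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed fixed column order

def orthogonal_encoding(peptide):
    """Qian (1988): one 20-long indicator vector per residue, concatenated."""
    x = []
    for aa in peptide:
        v = [0] * len(AMINO_ACIDS)
        v[AMINO_ACIDS.index(aa)] = 1
        x.extend(v)
    return x

def property_encoding(peptide, rows):
    """Same layout, but each residue contributes an informative row
    (a BLOSUM/PAM row, or a vector of AAIndex properties) instead of bits."""
    x = []
    for aa in peptide:
        x.extend(rows[aa])
    return x
```

Either way, a length-k peptide becomes a fixed-layout vector of k concatenated per-residue blocks; only the content of the blocks changes.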

Structural Features
- Consider how well IgG and a peptide "fit" together
- Start from an experimentally measured IgG conformation
- Approximate the native peptide conformation: choose the most common side-chain positions, then relax the energy
- Compute the 2000 "best" dockings using ZDOCK (Chen, Li, and Weng 2003)
- The feature vector is given by the histogram of docking scores
(Figure: score-frequency histograms; peptides with poor and good shape complementarity give visibly different docking-score distributions)

Results (Δ columns give the difference vs. the ensemble)

Features                    AUROC   AUPR    ΔAUROC   ΔAUPR
k-spectrum                  0.85    0.70    -0.043   -0.072
Sparse Spatial Sample       0.87    0.73    -0.023   -0.042
Nonlinear Fisher Mat.       0.86    0.69    -0.024   -0.082
Statistical Analysis Mat.   0.85    0.67    -0.025   -0.102
BLOSUM Encoding             0.86    0.70    -0.024   -0.072
Local Composition*          0.88    0.74    -0.013   -0.032
Structure                   0.74    0.53    -0.153   -0.242
ensemble                    0.893   0.772
2nd Place                   0.892   0.766   -0.001   -0.006
3rd Place                   0.864   0.691   -0.029   -0.081
4th Place                   0.855   0.689   -0.038   -0.083

* using various physicochemical features

Performance Curves (ROC)
(Figure: ROC curves)

Performance Curves (P/R)
(Figure: precision/recall curves)

Conclusions
- Many good features exist, and they capture some non-overlapping information
- Ensemble solutions, used properly, are effective
- Structure features are hard to compute; there is much room for improvement here
- Simple features should not be discounted: the local composition feature was the best single classifier, and we didn't encounter anyone using it in the literature!

Thanks
Funding: NIH grant