UMD CMSC 423 - Learning from Diversity

Learning from Diversity
Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines

Rob Patro and Carl Kingsford
Center for Bioinformatics and Computational Biology, University of Maryland
Nov. 16, 2010

Outline: Background, Learning Method, Features, Results

Overview
- Challenge: epitope-antibody recognition
- Solution: an ensemble of support vector machines, trained with a probabilistic extension, over a variety of feature classes: physicochemical properties, string kernels, structure
- Performance of the individual methods and of the ensemble

Problem Overview: The Challenge
(Figure: antibody structure with its binding sites highlighted; image © http://visualscience.ru, 2010)
- Binding with linear epitopes
- "Simpler" sequence → affinity relation

The Details
- Measure binding affinity aff(p_i) ∈ [0, 65536]
- C+ = { p_i | aff(p_i) ∈ [10000, 65536] }: 6,841 binders
- C− = { p_i | aff(p_i) ∈ [0, 1000] }: 20,437 non-binders
- Learn a function f : P → [0, 1] to predict binding:
  f(p_i) ≥ 0.5 ⇒ p_i ∈ C+
  f(p_i) < 0.5 ⇒ p_i ∈ C−

System Overview
(Figure: a query peptide is scored by individual classifiers f_0, f_1, ..., f_M; the aggregated score falls in [0, 1], with scores below 0.5 marking unlikely binders (C−) and scores above 0.5 likely binders (C+))
- Individual classifiers are trained on various features; we use support vector machines (SVMs), though many alternatives exist: decision trees, boosted/bagged trees, random forests, naive Bayes, logistic regression, maximum entropy classification, (balanced) Winnow classifiers, etc.
- Aggregating the classifiers' scores produces the prediction for the binding class

Probabilistic SVMs
- Ideally we want a confidence in each prediction (Platt 1999)
- For each prediction we obtain a posterior probability
- This allows ranking predictions by posterior and aids in classifier combination

Combining Predictions
- Probabilistic SVMs are trained on various features, using various kernels, with various parameters
- Their predictions are combined by weighted voting:
  f(p_i) = (1/Z) Σ_j w_j P_j(C+ | p_i)
  where the weight w_j of the j-th classifier is based on its cross-validation performance, P_j(C+ | p_i) is the j-th classifier's posterior for the positive class label, and Z = Σ_j w_j is a normalizing constant
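
A minimal sketch of this weighted vote in Python (not the authors' implementation): scikit-learn's SVC(probability=True) stands in for the Platt-scaled probabilistic SVMs, and mean cross-validated AUROC stands in for the otherwise unspecified cross-validation weight.

```python
# Sketch of the weighted-voting ensemble. Assumptions beyond the slides:
# scikit-learn SVC as the probabilistic SVM, AUROC as the weight w_j,
# and labels y encoded as 0 (non-binder) / 1 (binder).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fit_weighted_ensemble(feature_sets, y):
    """feature_sets: one (n_samples, d_j) array per feature class."""
    models, weights = [], []
    for X in feature_sets:
        clf = SVC(kernel="rbf", probability=True)  # posteriors via Platt scaling
        weights.append(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
        models.append(clf.fit(X, y))
    return models, np.asarray(weights)

def predict_binding(models, weights, feature_sets):
    """f(p_i) = sum_j w_j P_j(C+ | p_i) / Z, with Z = sum_j w_j."""
    post = np.stack([m.predict_proba(X)[:, 1]          # P_j(C+ | p_i)
                     for m, X in zip(models, feature_sets)])
    return weights @ post / weights.sum()              # in [0, 1]; >= 0.5 -> C+
```

Because each w_j is estimated on held-out folds, a feature class that generalizes poorly is automatically down-weighted in the vote.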
Choosing Features
- To train the SVMs we translate each peptide p_i into a feature vector x_i
- Good features are essential: they should be discriminative, lead to class separability, and be efficient to compute
- Real features capture partial information, separate subsets of the data, and are often complementary
- Considering many useful features → predictive power

Which Features?
For an example peptide such as ILAMRSHYPF, three feature classes are considered:
- Sequence features: k-spectrum kernel, mismatch kernel, substitution kernel, string subsequence kernel, sparse spatial sample kernel
- Physico/biochemical features: BLOSUM encoding, AAIndex encoding, local composition
- Structure features: peptide/structure shape complementarity

Sequence Features (String Kernels)
- String kernels assign a similarity to a pair of strings
- k-spectrum kernel: consider all K k-mers that occur in the training set and encode each peptide as a vector v ∈ R^K, where v_j = 1 if the peptide contains the j-th k-mer and 0 otherwise; alternatively, v_j can encode the frequency of the j-th k-mer in the peptide
- Other string kernels: mismatch kernel, substitution kernel, restricted gappy kernel, string subsequence kernel, sparse spatial sample (SSS) kernel
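
The k-spectrum feature map is simple enough to spell out. A minimal sketch, with illustrative helper names that are not from the talk:

```python
# Sketch of the k-spectrum feature map described above.
from itertools import chain

def kmer_vocabulary(train_peptides, k):
    """All K distinct k-mers occurring in the training set, in a fixed order."""
    kmers = chain.from_iterable(
        (p[i:i + k] for i in range(len(p) - k + 1)) for p in train_peptides)
    return sorted(set(kmers))

def spectrum_vector(peptide, vocab, k, binary=True):
    """v in R^K: v_j indicates (binary=True) or counts the j-th k-mer."""
    counts = {}
    for i in range(len(peptide) - k + 1):
        kmer = peptide[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    if binary:
        return [1 if m in counts else 0 for m in vocab]
    return [counts.get(m, 0) for m in vocab]

# The k-spectrum kernel of two peptides is the dot product of their vectors.
```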


Compositional Features
- Consider physicochemical properties of each peptide sequence: hydropathy, antigenicity, structure preference, etc.
- Average the property over the entire peptide, mapping each peptide to a scalar v ∈ R

Amino acid hydropathy (values as shown on the slide):
A 1.8   R -4.5   N -3.5   D -3.5   C 2.5   Q -3.5   E -3.5   G -0.4   H -3.2   I 4.5
L 3.8   K -3.9   M 1.9    F 2.8    P -0.6  S -0.8   T -0.7   W -0.9   Y -1.3   V 4.2

Example: the residues of ILAMRSHYPF have hydropathies 4.5, 3.8, 1.8, 1.9, -4.5, -0.8, -3.2, -1.3, -0.6, 2.8, which average to 0.44.

Local Compositional Features
- Physicochemical features can be useful, but they are global, and the epitope is only a subset of the peptide
- Instead, consider a sliding window of a given length w: move the window along the peptide from left to right, average the values over each window, and concatenate the outputs to represent the peptide
- For ILAMRSHYPF this produces window averages 3.36, 2.5, -0.26, ...
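
A minimal sketch of both compositional features, using the hydropathy table above; the function names are illustrative.

```python
# Global and local (sliding-window) composition features.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -0.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def global_composition(peptide, table=HYDROPATHY):
    """Global feature: the property averaged over the entire peptide."""
    return sum(table[aa] for aa in peptide) / len(peptide)

def local_composition(peptide, w, table=HYDROPATHY):
    """Local feature: the average over each length-w window, concatenated."""
    vals = [table[aa] for aa in peptide]
    return [sum(vals[i:i + w]) / w for i in range(len(vals) - w + 1)]

# global_composition("ILAMRSHYPF") -> 0.44, as on the slide. With w = 3 the
# first window averages from local_composition are 3.37, 2.5, -0.27, which
# the slide truncates to 3.36, 2.5, -0.26.
```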

Orthogonal Encoding
- An orthonormal representation proposed by Qian (1988)
- Map each amino acid a_j ∈ p_i to a 20-long bit vector v_j containing a single 1 in the position for that amino acid (the slide illustrates the residues C, E, H, R each selecting a different column of the encoding matrix)
- Concatenate the vectors: x_i = v_0 v_1 ... v_{k-1} for a peptide of length k

Property Encoding
- Orthogonality is not actually important in our application
- Replace the indicator vector with something more informative, e.g. a row from a BLOSUM or PAM matrix (the slide shows C, E, H, R mapped to their BLOSUM62 rows: 9 -1 -1 ..., -4 0 0 ..., -3 -1 0 ..., -3 -1 -1 ...)

AAIndex Encoding
- The Amino Acid Index (AAIndex) (Kawashima 2008) compiles a growing list of different physicochemical and biochemical properties of amino acids ... 544 to date!
- Is it possible to make use of all this information? Use a non-linear factor matrix of the AAIndex (Nanni 2010)
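
A minimal sketch contrasting the two per-residue encodings; the fixed column order is an assumption, and the full 20x20 BLOSUM62 rows (available, e.g., via Biopython) are omitted.

```python
# Orthogonal (one-hot) vs. property (matrix-row) encoding of a peptide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed fixed column order

def orthogonal_encoding(peptide):
    """Qian (1988): one 20-long indicator vector per residue, concatenated."""
    x = []
    for aa in peptide:
        v = [0] * len(AMINO_ACIDS)
        v[AMINO_ACIDS.index(aa)] = 1
        x.extend(v)
    return x

def property_encoding(peptide, rows):
    """Same layout, but each residue contributes an informative row
    (a BLOSUM/PAM row, or a vector of AAIndex properties) instead of bits."""
    x = []
    for aa in peptide:
        x.extend(rows[aa])
    return x
```

Either way, a length-k peptide becomes a fixed-layout vector of k concatenated per-residue blocks; only the content of the blocks changes.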

Structural Features
- Consider how well IgG and a peptide "fit" together
- Start from an experimentally measured IgG conformation
- Approximate the native peptide conformation: choose the most common side-chain positions, then relax the energy
- Compute the 2000 "best" dockings using ZDOCK (Chen, Li, and Weng 2003)
- The feature vector is given by the histogram of docking scores
(Figure: score-frequency histograms; peptides with poor and good shape complementarity give visibly different docking-score distributions)

Results (Δ columns give the difference vs. the ensemble)

Features                    AUROC   AUPR    ΔAUROC   ΔAUPR
k-spectrum                  0.85    0.70    -0.043   -0.072
Sparse Spatial Sample       0.87    0.73    -0.023   -0.042
Nonlinear Fisher Mat.       0.86    0.69    -0.024   -0.082
Statistical Analysis Mat.   0.85    0.67    -0.025   -0.102
BLOSUM Encoding             0.86    0.70    -0.024   -0.072
Local Composition*          0.88    0.74    -0.013   -0.032
Structure                   0.74    0.53    -0.153   -0.242
ensemble                    0.893   0.772
2nd Place                   0.892   0.766   -0.001   -0.006
3rd Place                   0.864   0.691   -0.029   -0.081
4th Place                   0.855   0.689   -0.038   -0.083

* using various physicochemical features

Performance Curves (ROC)
(Figure: ROC curves)

Performance Curves (P/R)
(Figure: precision/recall curves)

Conclusions
- Many good features exist, and they capture some non-overlapping information
- Ensemble solutions, used properly, are effective
- Structure features are hard to compute; there is much room for improvement here
- Simple features should not be discounted: the local composition feature was the best single classifier, and we didn't encounter anyone using it in the literature!

Thanks
Funding: NIH grant