MACHINE LEARNING FOR PROTEIN CLASSIFICATION: KERNEL METHODS
CS 374
Rajesh Ranganath
4/10/2008

OUTLINE
• Biological motivation and background
• Algorithmic concepts
• Mismatch kernels
• Semi-supervised methods

PROTEINS

THE PROTEIN PROBLEM
• Primary structure can be easily determined
• 3D structure determines function
• Grouping proteins into structural and evolutionary families is difficult
• Use machine learning to group proteins

HOW TO LOOK AT AMINO ACID CHAINS
• Smith-Waterman idea
• Mismatch idea

FAMILIES
• Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
• Families are further subdivided into proteins
• Proteins are divided into species
• The same protein may be found in several species
(Hierarchy figure: Fold → Superfamily → Family → Proteins. Morten Nielsen, CBS, BioCentrum, DTU)

SUPERFAMILIES
• Proteins which are (remotely) evolutionarily related
• Sequence similarity is low
• Share function
• Share special structural features
• Relationships between members of a superfamily may not be readily recognizable from the sequence alone

FOLDS
• Proteins which have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold
• No evolutionary relation between the proteins

PROTEIN CLASSIFICATION
• Given a new protein, can we place it in its "correct" position within an existing protein hierarchy?
• Methods:
  • BLAST / PSI-BLAST
  • Profile HMMs
  • Supervised machine learning methods

MACHINE LEARNING CONCEPTS
• Supervised methods
  • Discriminative vs. generative models
  • Transductive learning
  • Support vector machines
  • Kernel methods
• Semi-supervised methods

DISCRIMINATIVE AND GENERATIVE MODELS
• Generative models learn the joint distribution P(x, y) and classify via Bayes' rule
• Discriminative models learn P(y | x), or the decision boundary, directly

TRANSDUCTIVE LEARNING
• Most learning is inductive: given (x1, y1), ..., (xm, ym), for any test input x* predict the label y*
• Transductive learning: given (x1, y1), ..., (xm, ym) and all the test inputs {x1*, ..., xp*}, predict the labels {y1*, ..., yp*}

SUPPORT VECTOR MACHINES
• Popular discriminative learning algorithm
• Optimal geometric margin classifier
• Can be trained efficiently using the Sequential Minimal Optimization (SMO) algorithm
• If x1, ..., xn are the training examples, sign(Σi αi yi xiᵀx) "decides" where x falls
• Train the αi to achieve the best margin

SUPPORT VECTOR MACHINES (2)
• Kernelizable: the SVM solution can be written entirely in terms of dot products of the inputs
  {sign(Σi αi yi K(xi, x)) determines the class of x}

KERNEL METHODS
• K(x, z) = f(x)ᵀf(z)
  • f is the feature mapping
  • x and z are input vectors
• High-dimensional features do not need to be explicitly calculated
• Think of the kernel function as a similarity measure between x and z

MISMATCH KERNEL
• Regions of similar amino acid sequence yield a similar tertiary structure of proteins
• Used as a kernel for an SVM to identify protein homologies

K-MER BASED SVMS
• For a given word size k and mismatch tolerance l, define
  K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches
• Define the normalized mismatch kernel
  K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
• An SVM can be learned by supplying this kernel function
• Example (let k = 3, l = 1):
  X = A B A C A R D I
  Y = A B R A D A B I
  K(X, Y) = 4
  K'(X, Y) = 4 / sqrt(7 * 7) = 4/7

DISADVANTAGES
• Determining the 3D structure of proteins is practically impossible at scale
• Primary sequences are cheap to determine
• How do we use all this unlabeled data?
• Use semi-supervised learning based on the cluster assumption

SEMI-SUPERVISED METHODS
• Some examples are labeled; assume labels vary smoothly among all examples
• SVMs and other discriminative methods may make significant mistakes due to lack of data
• Attempt to "contract" the distances within each cluster while keeping inter-cluster distances larger

CLUSTER KERNELS
Two semi-supervised methods:
1. Neighborhood kernel
   1. For each X, run PSI-BLAST to get similar sequences Nbd(X)
   2. Define Φnbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} Φoriginal(X')
      ("counts of all k-mers matching with at most 1 difference, over all sequences that are similar to X")
   3. Knbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y')
2. Bagged mismatch kernel (next slide)

BAGGED MISMATCH KERNEL
Final method: bagged mismatch
1. Run k-means clustering n times, giving assignments cp(X) for p = 1, ..., n
2. For every X and Y, count up the fraction of times they are bagged together:
   Kbag(X, Y) = (1/n) Σp 1(cp(X) = cp(Y))
3. Combine the "bag fraction" with the original comparison K(·,·):
   Knew(X, Y) = Kbag(X, Y) K(X, Y)
(O. Jangmin)

WHAT WORKS BEST?
(Results figure comparing the methods in the transductive setting)

REFERENCES
• C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics Advance Access, January 22, 2004.
• J. Weston et al. Semi-supervised protein classification using cluster kernels. 2003.
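APPENDIX: K-MER EXAMPLE SKETCH
The k-mer example on the K-MER BASED SVMS slide can be reproduced with a short sketch. This follows the slide's simplified counting definition (pairs of distinct k-mers within l mismatches; for the self-kernel, unordered pairs including each k-mer with itself, which is what makes K(X, X) = 7 on the slide), not the efficient trie-based mismatch kernel of Leslie et al.; the function names are illustrative.

```python
import math

def kmers(seq, k):
    """Distinct k-long substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def mismatch_K(X, Y, k, l):
    """Count k-mer pairs (one from each sequence) within <= l mismatches.
    For the self-kernel, count unordered pairs, including a k-mer with itself."""
    A, B = sorted(kmers(X, k)), sorted(kmers(Y, k))
    if X == Y:
        return sum(1 for i, a in enumerate(A) for b in A[i:] if hamming(a, b) <= l)
    return sum(1 for a in A for b in B if hamming(a, b) <= l)

def normalized_K(X, Y, k, l):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) * K(Y, Y))."""
    return mismatch_K(X, Y, k, l) / math.sqrt(
        mismatch_K(X, X, k, l) * mismatch_K(Y, Y, k, l))

X, Y = "ABACARDI", "ABRADABI"
print(mismatch_K(X, Y, 3, 1))    # 4, as on the slide
print(normalized_K(X, Y, 3, 1))  # 4/7 ~ 0.571
```

The four matching pairs are ABA–ABR, ABA–ABI, ABA–ADA, and ACA–ADA; each self-kernel is 7 (six exact self-matches plus one within-sequence pair at distance 1).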
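APPENDIX: BAGGED KERNEL SKETCH
The bagged-mismatch recipe on the BAGGED MISMATCH KERNEL slide can likewise be sketched. Here the n clustering runs are represented as precomputed assignment lists (a real pipeline would obtain them from k-means with random restarts, e.g. scikit-learn's KMeans), and the data, function names, and kernel values are hypothetical illustrations.

```python
def k_bag(assignments, i, j):
    """Fraction of clustering runs in which items i and j share a cluster:
    Kbag(X, Y) = (1/n) * sum_p 1(cp(X) == cp(Y))."""
    n = len(assignments)
    return sum(1 for c in assignments if c[i] == c[j]) / n

def k_new(assignments, K, i, j):
    """Combine the bag fraction with the original kernel value:
    Knew(X, Y) = Kbag(X, Y) * K(X, Y)."""
    return k_bag(assignments, i, j) * K[i][j]

# Three hypothetical k-means runs over four sequences (cluster id per item):
runs = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
# Hypothetical original kernel matrix (e.g. mismatch-kernel values):
K = [[7, 4, 2, 1],
     [4, 7, 3, 2],
     [2, 3, 7, 5],
     [1, 2, 5, 7]]

print(k_bag(runs, 0, 1))      # 2/3: items 0 and 1 co-cluster in 2 of 3 runs
print(k_new(runs, K, 0, 1))   # (2/3) * 4
```

Multiplying by the bag fraction shrinks kernel values between points that rarely co-cluster, which is one way to "contract" within-cluster distances relative to inter-cluster ones.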