MACHINE LEARNING FOR PROTEIN CLASSIFICATION: KERNEL METHODS
CS 374
Rajesh Ranganath
4/10/2008

OUTLINE
• Biological motivation and background
• Algorithmic concepts
• Mismatch kernels
• Semi-supervised methods

PROTEINS

THE PROTEIN PROBLEM
• Primary structure can be easily determined
• 3D structure determines function
• Grouping proteins into structural and evolutionary families is difficult
• Use machine learning to group proteins

HOW TO LOOK AT AMINO ACID CHAINS
• Smith-Waterman idea
• Mismatch idea

FAMILIES
• Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
• Families are further subdivided into proteins
• Proteins are divided into species
• The same protein may be found in several species
(Hierarchy figure: Fold → Superfamily → Family → Proteins. Morten Nielsen, CBS, BioCentrum, DTU)

SUPERFAMILIES
• Proteins which are (remotely) evolutionarily related
• Sequence similarity is low
• Share function
• Share special structural features
• Relationships between members of a superfamily may not be readily recognizable from the sequence alone

FOLDS
• Proteins which have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold
• No evolutionary relation between the proteins

PROTEIN CLASSIFICATION
• Given a new protein, can we place it in its "correct" position within an existing protein hierarchy?
• Methods:
  • BLAST / PSI-BLAST
  • Profile HMMs
  • Supervised machine learning methods

MACHINE LEARNING CONCEPTS
• Supervised methods
  • Discriminative vs. generative models
  • Transductive learning
  • Support vector machines
  • Kernel methods
• Semi-supervised methods

DISCRIMINATIVE AND GENERATIVE MODELS
• Generative models learn the joint distribution P(x, y) and classify via Bayes' rule
• Discriminative models learn P(y | x), or the decision boundary, directly

TRANSDUCTIVE LEARNING
• Most learning is inductive: given (x1, y1), ..., (xm, ym), for any test input x* predict the label y*
• Transductive learning: given (x1, y1), ..., (xm, ym) and all the test inputs {x1*, ..., xp*}, predict the labels {y1*, ..., yp*}

SUPPORT VECTOR MACHINES
• Popular discriminative learning algorithm
• Optimal geometric margin classifier
• Can be trained efficiently using the Sequential Minimal Optimization (SMO) algorithm
• If x1, ..., xn are the training examples, sign(Σi αi yi xiᵀx) "decides" where x falls
• Train the αi to achieve the best margin

SUPPORT VECTOR MACHINES (2)
• Kernelizable: the SVM solution can be written entirely in terms of dot products of the inputs
  {sign(Σi αi yi K(xi, x)) determines the class of x}

KERNEL METHODS
• K(x, z) = f(x)ᵀf(z)
  • f is the feature mapping
  • x and z are input vectors
• High-dimensional features do not need to be explicitly calculated
• Think of the kernel function as a similarity measure between x and z

MISMATCH KERNEL
• Regions of similar amino acid sequence yield a similar tertiary structure of proteins
• Used as a kernel for an SVM to identify protein homologies

K-MER BASED SVMS
• For a given word size k and mismatch tolerance l, define
  K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches
• Define the normalized mismatch kernel
  K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
• An SVM can be learned by supplying this kernel function
• Example (let k = 3, l = 1):
  X = A B A C A R D I
  Y = A B R A D A B I
  K(X, Y) = 4
  K'(X, Y) = 4 / sqrt(7 * 7) = 4/7

DISADVANTAGES
• Determining the 3D structure of proteins is practically impossible at scale
• Primary sequences are cheap to determine
• How do we use all this unlabeled data?
• Use semi-supervised learning based on the cluster assumption

SEMI-SUPERVISED METHODS
• Some examples are labeled; assume labels vary smoothly among all examples
• SVMs and other discriminative methods may make significant mistakes due to lack of data
• Attempt to "contract" the distances within each cluster while keeping inter-cluster distances larger

CLUSTER KERNELS
Two semi-supervised methods:
1. Neighborhood kernel
   1. For each X, run PSI-BLAST to get similar sequences Nbd(X)
   2. Define Φnbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} Φoriginal(X')
      ("counts of all k-mers matching with at most 1 difference, over all sequences that are similar to X")
   3. Knbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y')
2. Bagged mismatch kernel (next slide)

BAGGED MISMATCH KERNEL
Final method: bagged mismatch
1. Run k-means clustering n times, giving assignments cp(X) for p = 1, ..., n
2. For every X and Y, count up the fraction of times they are bagged together:
   Kbag(X, Y) = (1/n) Σp 1(cp(X) = cp(Y))
3. Combine the "bag fraction" with the original comparison K(·,·):
   Knew(X, Y) = Kbag(X, Y) K(X, Y)
(O. Jangmin)

WHAT WORKS BEST?
(Results figure comparing the methods in the transductive setting)

REFERENCES
• C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics Advance Access, January 22, 2004.
• J. Weston et al. Semi-supervised protein classification using cluster kernels. 2003.
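APPENDIX: K-MER EXAMPLE SKETCH
The k-mer example on the K-MER BASED SVMS slide can be reproduced with a short sketch. This follows the slide's simplified counting definition (pairs of distinct k-mers within l mismatches; for the self-kernel, unordered pairs including each k-mer with itself, which is what makes K(X, X) = 7 on the slide), not the efficient trie-based mismatch kernel of Leslie et al.; the function names are illustrative.

```python
import math

def kmers(seq, k):
    """Distinct k-long substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def mismatch_K(X, Y, k, l):
    """Count k-mer pairs (one from each sequence) within <= l mismatches.
    For the self-kernel, count unordered pairs, including a k-mer with itself."""
    A, B = sorted(kmers(X, k)), sorted(kmers(Y, k))
    if X == Y:
        return sum(1 for i, a in enumerate(A) for b in A[i:] if hamming(a, b) <= l)
    return sum(1 for a in A for b in B if hamming(a, b) <= l)

def normalized_K(X, Y, k, l):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) * K(Y, Y))."""
    return mismatch_K(X, Y, k, l) / math.sqrt(
        mismatch_K(X, X, k, l) * mismatch_K(Y, Y, k, l))

X, Y = "ABACARDI", "ABRADABI"
print(mismatch_K(X, Y, 3, 1))    # 4, as on the slide
print(normalized_K(X, Y, 3, 1))  # 4/7 ~ 0.571
```

The four matching pairs are ABA–ABR, ABA–ABI, ABA–ADA, and ACA–ADA; each self-kernel is 7 (six exact self-matches plus one within-sequence pair at distance 1).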
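APPENDIX: BAGGED KERNEL SKETCH
The bagged-mismatch recipe on the BAGGED MISMATCH KERNEL slide can likewise be sketched. Here the n clustering runs are represented as precomputed assignment lists (a real pipeline would obtain them from k-means with random restarts, e.g. scikit-learn's KMeans), and the data, function names, and kernel values are hypothetical illustrations.

```python
def k_bag(assignments, i, j):
    """Fraction of clustering runs in which items i and j share a cluster:
    Kbag(X, Y) = (1/n) * sum_p 1(cp(X) == cp(Y))."""
    n = len(assignments)
    return sum(1 for c in assignments if c[i] == c[j]) / n

def k_new(assignments, K, i, j):
    """Combine the bag fraction with the original kernel value:
    Knew(X, Y) = Kbag(X, Y) * K(X, Y)."""
    return k_bag(assignments, i, j) * K[i][j]

# Three hypothetical k-means runs over four sequences (cluster id per item):
runs = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
# Hypothetical original kernel matrix (e.g. mismatch-kernel values):
K = [[7, 4, 2, 1],
     [4, 7, 3, 2],
     [2, 3, 7, 5],
     [1, 2, 5, 7]]

print(k_bag(runs, 0, 1))      # 2/3: items 0 and 1 co-cluster in 2 of 3 runs
print(k_new(runs, K, 0, 1))   # (2/3) * 4
```

Multiplying by the bag fraction shrinks kernel values between points that rarely co-cluster, which is one way to "contract" within-cluster distances relative to inter-cluster ones.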