Stanford CS 374 - Study Notes - D602756

School name Stanford University

Course CS 374 - Algorithms in Biology

Pages 10

BIOINFORMATICS, Vol. 1 no. 1 2003, Pages 1–10

Mismatch string kernels for discriminative protein classification

Christina Leslie 1, Eleazar Eskin 1, Adiel Cohen 1, Jason Weston 2 and William Stafford Noble † 3

1 Department of Computer Science, Columbia University, 1214 Amsterdam Ave, Mail Code 0401, New York, NY, 10027, USA, 2 Max-Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany and 3 Department of Genome Sciences, University of Washington, 1705 NE Pacific Street, Seattle, WA, 98195, USA

ABSTRACT

Motivation: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns.

Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP data sets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available.
Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.

Availability: SVM software is publicly available at http://microarray.cpmc.columbia.edu/gist. Mismatch kernel software is available upon request.

Contact: [email protected]

† Formerly William Noble Grundy, see www.gs.washington.edu/noble/name-change.html

INTRODUCTION

One of the central problems in computational biology is the classification of protein sequences into functional and structural families based on sequence homology. Approaches based on pairwise similarity of sequences (Waterman et al. (1991); Altschul et al. (1990, 1997)), profiles for protein families (Gribskov et al. (1987)), consensus patterns using motifs (Bairoch (1995); Attwood et al. (1998)) and hidden Markov models (Krogh et al. (1994); Eddy (1995); Baldi et al. (1994)) have all been used for this problem. Recent research suggests that the best-performing methods are discriminative: protein sequences are seen as a set of labeled examples (positive if they are in the family and negative otherwise), and a learning algorithm attempts to learn a decision boundary between the different classes. In this category, several successful techniques (Jaakkola et al. (2000); Liao and Noble (2002); Leslie et al. (2002)) use protein sequences to train a support vector machine (SVM) (Vapnik (1998)) classifier.

In this paper, we present a method for using SVMs for remote homology detection, based on a family of kernel functions called mismatch kernels. A kernel function measures the similarity between a pair of inputs, and defines an inner product in an implicit feature space for the SVM optimization problem. The features used by our mismatch kernel are the set of all possible subsequences of amino acids of a fixed length k. If two protein sequences contain many k-length subsequences that differ by at most m mismatches, then their inner product under the mismatch kernel will be large.
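To make the idea of an inner product over k-length subsequences concrete, the following is a minimal brute-force sketch in Python. It assumes the usual formulation in which each k-length subsequence of a protein adds a count of 1 to every k-length feature that lies within m mismatches of it, and the kernel is the inner product of the resulting count vectors. The function names (mismatch_neighborhood, feature_map, mismatch_kernel) and the 20-letter alphabet constant are illustrative choices, not taken from the authors' software, and the explicit enumeration is far slower than a tree-based computation.

```python
from collections import Counter
from itertools import combinations, product

# Assumption: the 20 standard amino acids, one letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """Return every string of len(kmer) differing from kmer in at most m positions."""
    neighbors = {kmer}
    for n_sub in range(1, m + 1):
        # Choose which positions to substitute, then which letters to put there.
        for positions in combinations(range(len(kmer)), n_sub):
            for letters in product(alphabet, repeat=n_sub):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                # The set removes duplicates created when a substituted
                # letter equals the original one.
                neighbors.add("".join(candidate))
    return neighbors


def feature_map(seq, k, m, alphabet=AMINO_ACIDS):
    """Sparse feature vector: feature -> number of k-mers in seq within m mismatches."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in mismatch_neighborhood(seq[i:i + k], m, alphabet):
            features[neighbor] += 1
    return features


def mismatch_kernel(x, y, k, m, alphabet=AMINO_ACIDS):
    """Inner product of the two sparse feature vectors."""
    fx = feature_map(x, k, m, alphabet)
    fy = feature_map(y, k, m, alphabet)
    if len(fx) > len(fy):
        fx, fy = fy, fx  # iterate over the smaller vector
    return sum(count * fy[feature] for feature, count in fx.items())


if __name__ == "__main__":
    print(mismatch_kernel("ACD", "ACE", k=3, m=0))
    print(mismatch_kernel("ACD", "ACE", k=3, m=1))
```

With m = 0 the sketch reduces to counting exactly shared k-length subsequences, so ACD and ACE score 0. With m = 1 they score 20, because the shared features are AC followed by any of the 20 amino acids. Enumerating neighborhoods explicitly grows quickly with k and m, so this form is only suitable for checking small cases.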
More precisely, the mismatch kernel is calculated based on shared occurrences of (k, m)-patterns in the data, where the (k, m)-pattern generated by a fixed k-length subsequence consists of all k-length subsequences differing from it by at most m mismatches. Thus, the mismatch kernel extends the computationally simpler spectrum kernel presented in Leslie et al. (2002), adding the biologically important notion of mismatches.

© Oxford University Press 2000. Bioinformatics Advance Access published January 22, 2004. Copyright (c) 2004 Oxford University Press.

We also describe the mismatch tree data structure that we use to compute the mismatch kernel. We can efficiently compute the contributions of all instances of (k, m)-patterns occurring in the data to all entries of the kernel matrix in one pass while traversing the tree. For (k, m)-parameters that are useful in applications, the compute time to generate the kernel is fast enough for practical use on real data sets. Moreover, when mismatch kernels are used with SVMs, we can implement the classification to make linear-time prediction on test sequences.

We report results for two sets of experiments over the SCOP database (Murzin et al. (1995)). In the first set of experiments, we test our method on the benchmark dataset assembled by Jaakkola et al. (2000), where SCOP sequences are augmented by domain homologs of positive training sequences in order to assist HMM-based methods. We show that our mismatch kernel, in conjunction with an SVM classifier, performs competitively with state-of-the-art methods such as the SVM-Fisher method on this dataset. In the second set of experiments, we perform a similar set of experiments on a newer version of the SCOP database (Liao and Noble (2002)); however, in this second test, we limit the training examples to proteins in the original SCOP dataset.
In the absence of additional domain homologs for training the hidden Markov models, SVM-Fisher and other HMM-based approaches exhibit poorer performance than our mismatch-SVM approach. However, mismatch-SVM performs as well as SVM-pairwise, the best-performing method reported in Liao and Noble (2002) for this benchmark.

The current work is an expanded version of Leslie et al. (2002), which defined the mismatch kernel and presented results on the Jaakkola et al. dataset. Here, in addition to reporting experiments on the second benchmark dataset and comparing to the SVM-pairwise method, we present further results that give biological motivation for use of our method. First, we show that the mismatch kernel captures a biologically
