Protein Classification

PDB Growth
[Figure: New PDB structures]

Only a few folds are found in nature

Protein classification
• Number of protein sequences grows exponentially
• Number of solved structures grows exponentially
• Number of new folds identified is very small (and close to constant)
• Protein classification can
    Generate an overview of structure types
    Detect similarities (evolutionary relationships) between protein sequences
    Help predict the 3D structure of new protein sequences
Morten Nielsen, CBS, BioCentrum, DTU

SCOP release 1.69: classification of 25,973 protein structures in PDB
Class                            # folds   # superfamilies   # families
All alpha proteins                   218               376          608
All beta proteins                    144               290          560
Alpha and beta proteins (a/b)        136               222          629
Alpha and beta proteins (a+b)        279               409          717
Multi-domain proteins                 46                46           61
Membrane & cell surface               47                88           99
Small proteins                        75               108          171
Total                                945              1539         2845

Protein structure classification
[Figure: Protein world, Protein fold, Protein superfamily, Protein family]

Structure Classification Databases
• SCOP
    Manual classification (A. Murzin)
    scop.berkeley.edu
• CATH
    Semi-manual classification (C. Orengo)
    www.biochem.ucl.ac.uk/bsm/cath
• FSSP
    Automatic classification (L. Holm)
    www.ebi.ac.uk/dali/fssp/fssp.html

Major classes in SCOP
• Classes
    All α proteins
    All β proteins
    α and β proteins (α/β)
    α and β proteins (α+β)
    Multi-domain proteins
    Membrane and cell surface proteins
    Small proteins
    Coiled coil proteins

All α: Hemoglobin (1bab)
All β: Immunoglobulin (8fab)
α/β: Triosephosphate isomerase (1hti)
α+β: Lysozyme (1jsf)

Families
• Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
• Families are further subdivided into Proteins
• Proteins are divided into Species
    The same protein may be found in several species
[Figure: Fold, Superfamily, Family, Proteins hierarchy]

Superfamilies
• Proteins which are (remotely) evolutionarily related
    Sequence similarity low
    Share function
    Share special structural features
• Relationships between members of a superfamily may not be readily recognizable from the sequence alone

Folds
• >~50% of secondary structure elements arranged in the same order in sequence and in 3D
• No evolutionary relation

Protein Classification
• Given a new protein sequence, can we place it in its "correct" position within an existing protein hierarchy?
Methods
• BLAST / PSI-BLAST
• Profile HMMs
• Supervised Machine Learning methods
[Figure: Fold, Superfamily, Family, Proteins hierarchy; new protein marked "?"]

BLAST (Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN)
However, orders of magnitude faster than Smith-Waterman
[Figure: query, DB]

PSI-BLAST
Given a sequence query x, and database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position specific matrix M, a profile
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
[Figure: Profile M]

A profile

Profile HMMs
• Each M state has a position-specific pre-computed substitution table
• Each I and D state has position-specific gap penalties
• Profile HMM is a generative model:
    The sequence X that is aligned to H is thought of as "generated by" H
    Therefore, H parametrizes a conditional distribution P(X | H)
[Figure: Protein profile HMM with states BEGIN, M1, M2, Mm, I0, I1, Im-1, Im, D1, D2, Dm-1, Dm, END]

Classification with Profile HMMs
[Figure: profile HMMs alongside the Fold, Family, Superfamily hierarchy; new protein marked "?"]
• How generative models work
    Training examples (sequences known to be members of the family)
    The model assigns a probability to any given protein sequence.
    Sequences from the family yield a higher probability than sequences from outside the family.
• Log-likelihood ratio as score

$$L(X) = \log \frac{P(X \mid H_1)\,P(H_1)}{P(X \mid H_0)\,P(H_0)} = \log \frac{P(H_1 \mid X)\,P(X)}{P(H_0 \mid X)\,P(X)} = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)}$$

Generative Models

Discriminative Methods
Instead of modeling the process that generates data, directly discriminate between classes
• More direct way to the goal
• Better if model is not accurate

Discriminative Models -- SVM
Decision Rule:
red: $v^T x > 0$
If $x_1 \ldots x_n$ are training examples, $\mathrm{sign}\left(\sum_i \lambda_i x_i^T x\right)$ "decides" where x falls
• Train $\lambda_i$ to achieve best margin
[Figure: vector v and margin; panels "Large Margin for |v| < 1" and "Margin of 1 for small |v|"]

Discriminative protein classification
Jaakkola, Diekhans, Haussler, ISMB 1999
• Define the discriminant function to be

$$L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$$

We decide $X \in$ family H whenever $L(X) > 0$
• For now, let's just assume K(.,.) is a similarity function
• Then, we want to train $\lambda_i$ so that this classifier makes as few mistakes as possible in the new data
• Similarly to SVMs, train $\lambda_i$ so that margin is largest for $0 \le \lambda_i \le 1$

Discriminative protein classification
• Ideally, for training examples, $L(X_i) \ge 1$ if $X_i \in H_1$, $L(X_i) \le -1$ otherwise
• This is not
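
The discriminant function L(X) from the slides above can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the slides only say that K(.,.) is "a similarity function", so a toy k-mer count kernel stands in for it (it is not the Fisher kernel of Jaakkola et al.), the per-example weights (named `lam_pos` and `lam_neg` here) are set by hand rather than trained for a large margin, and the sequences are invented.

```python
from collections import Counter


def kmer_counts(seq, k=2):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Toy similarity K(a, b): dot product of the two k-mer count vectors.

    Assumption: a stand-in for the unspecified K(.,.) in the slides.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(count * cb[kmer] for kmer, count in ca.items())


def discriminant(x, positives, negatives, lam_pos, lam_neg, kernel=spectrum_kernel):
    """L(X) = sum over H1 of lam_i K(X, X_i) - sum over H0 of lam_j K(X, X_j)."""
    pos = sum(lam * kernel(x, xi) for lam, xi in zip(lam_pos, positives))
    neg = sum(lam * kernel(x, xj) for lam, xj in zip(lam_neg, negatives))
    return pos - neg


def in_family(x, positives, negatives, lam_pos, lam_neg):
    """Decide that x belongs to the family whenever L(x) > 0."""
    return discriminant(x, positives, negatives, lam_pos, lam_neg) > 0


# Hypothetical toy data: family members (H1) and non-members (H0).
family = ["ACDCAC", "ACACDD"]
others = ["WWYYWW", "YWYWYW"]
# Hand-picked weights inside the 0 <= lambda <= 1 range; not trained.
lam_pos = [0.5, 0.5]
lam_neg = [0.5, 0.5]

print(in_family("ACAC", family, others, lam_pos, lam_neg))
print(in_family("WYWY", family, others, lam_pos, lam_neg))
```

In a real classifier the weights would be fitted on labeled training sequences so that the margin is as large as possible; fixing them by hand only shows how the sign of L(X) produces the decision.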