Stanford CS 374 - Lecture 14 - Protein Classification


Slide titles (untitled slides omitted):
Protein Classification
PDB Growth
Only a few folds are found in nature
Protein classification
Protein structure classification
Structure Classification Databases
Major classes in SCOP
All α: Hemoglobin (1bab)
All β: Immunoglobulin (8fab)
α/β: Triosephosphate isomerase (1hti)
α+β: Lysozyme (1jsf)
Families
Superfamilies
Folds
PSI-BLAST
A profile
Profile HMMs
Classification with Profile HMMs
Generative Models
Discriminative Methods
Discriminative Models -- SVM
Discriminative protein classification
The Fisher Kernel
Benchmarks
Other methods
Results
QUESTION
k-mer based SVMs
SVMs will find a few support vectors
Semi-Supervised Methods
Cluster Kernels

Protein Classification

PDB Growth
[Figure: New PDB structures]

Only a few folds are found in nature

Protein classification
• The number of protein sequences grows exponentially
• The number of solved structures grows exponentially
• The number of new folds identified is very small (and close to constant)
• Protein classification can
    Generate an overview of structure types
    Detect similarities (evolutionary relationships) between protein sequences
    Help predict the 3D structure of new protein sequences
Morten Nielsen, CBS, BioCentrum, DTU

SCOP release 1.69

Class                            # folds   # superfamilies   # families
All alpha proteins                   218               376          608
All beta proteins                    144               290          560
Alpha and beta proteins (a/b)        136               222          629
Alpha and beta proteins (a+b)        279               409          717
Multi-domain proteins                 46                46           61
Membrane & cell surface               47                88           99
Small proteins                        75               108          171
Total                                945              1539         2845

Classification of 25,973 protein structures in PDB

Protein structure classification
[Figure labels: Protein world, Protein fold, Protein superfamily, Protein family]

Structure Classification Databases
• SCOP
    Manual classification (A. Murzin)
    scop.berkeley.edu
• CATH
    Semi-manual classification (C. Orengo)
    www.biochem.ucl.ac.uk/bsm/cath
• FSSP
    Automatic classification (L. Holm)
    www.ebi.ac.uk/dali/fssp/fssp.html

Major classes in SCOP
• Classes
    All α proteins
    All β proteins
    α and β proteins (α/β)
    α and β proteins (α+β)
    Multi-domain proteins
    Membrane and cell surface proteins
    Small proteins
    Coiled coil proteins

All α: Hemoglobin (1bab)

All β: Immunoglobulin (8fab)

α/β: Triosephosphate isomerase (1hti)

α+β: Lysozyme (1jsf)

Families
• Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
• Families are further subdivided into Proteins
• Proteins are divided into Species
    The same protein may be found in several species
[Figure labels: Fold, Superfamily, Family, Proteins]

Superfamilies
• Proteins which are (remotely) evolutionarily related
    Sequence similarity low
    Share function
    Share special structural features
• Relationships between members of a superfamily may not be readily recognizable from the sequence alone

Folds
• >~50% of secondary structure elements arranged in the same order in sequence and in 3D
• No evolutionary relation

Protein Classification
• Given a new protein sequence, can we place it in its “correct” position within an existing protein hierarchy?
Methods
• BLAST / PsiBLAST
• Profile HMMs
• Supervised Machine Learning methods
[Figure: a new protein marked "?" beside the Fold, Superfamily, Family, Proteins hierarchy]

BLAST (Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN)
However, orders of magnitude faster than Smith-Waterman
[Figure labels: query, DB]

PSI-BLAST
Given a sequence query x, and database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position specific matrix M, a profile
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
[Figure: Profile M]

A profile

Profile HMMs
• Each M state has a position-specific pre-computed substitution table
• Each I and D state has position-specific gap penalties
• Profile HMM is a generative model:
    The sequence X that is aligned to H is thought of as “generated by” H
    Therefore, H parametrizes a conditional distribution P(X | H)
[Figure: protein profile HMM with states BEGIN, M1 … Mm, I0 … Im, D1 … Dm, END]

Classification with Profile HMMs
[Figure: one profile HMM per group in the Fold, Superfamily, Family hierarchy; a new protein marked "?" is scored against each]

Classification with Profile HMMs
• How generative models work
    Training examples (sequences known to be members of the family)
    The model assigns a probability to any given protein sequence.
    Sequences from the family yield a higher probability than sequences from outside the family.
• Log-likelihood ratio as score

$$L(X) = \log \frac{P(X \mid H_1)\,P(H_1)}{P(X \mid H_0)\,P(H_0)} = \log \frac{P(H_1 \mid X)\,P(X)}{P(H_0 \mid X)\,P(X)} = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)}$$

Generative Models
[Five figure slides; no text survives in this preview]

Discriminative Methods
Instead of modeling the process that generates data, directly discriminate between classes
• More direct way to the goal
• Better if model is not accurate

Discriminative Models -- SVM
Decision Rule: red: $v^T x > 0$
If $x_1 \dots x_n$ are training examples, $\operatorname{sign}\left(\sum_i \lambda_i x_i^T x\right)$ “decides” where x falls
• Train $\lambda_i$ to achieve best margin
[Figure labels: v; margin; Large Margin for |v| < 1; Margin of 1 for small |v|]

Discriminative protein classification
Jaakkola, Diekhans, Haussler, ISMB 1999
• Define the discriminant function to be

$$L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$$

    We decide X ∈ family H whenever L(X) > 0
• For now, let’s just assume K(.,.) is a similarity function
• Then, we want to train $\lambda_i$ so that this classifier makes as few mistakes as possible in the new data
• Similarly to SVMs, train $\lambda_i$ so that margin is largest for $0 \le \lambda_i \le 1$

Discriminative protein classification
• Ideally, for training examples, $L(X_i) \ge 1$ if $X_i \in H_1$, $L(X_i) \le -1$ otherwise
• This is not
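
The discriminant function L(X) from the Jaakkola, Diekhans, Haussler slide above can be sketched in Python as follows. The slides leave K(.,.) as an unspecified similarity function, so the k-mer count kernel used here (spectrum_kernel) is only an assumption for illustration. All function names are hypothetical, and the weights are taken as already trained rather than learned by this code.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Assumed similarity K(a, b): dot product of k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for k-mers that are absent from b.
    return sum(n * cb[kmer] for kmer, n in ca.items())


def discriminant(x, positives, negatives, k=3):
    """L(X) = sum over H1 of lambda_i K(X, X_i) - sum over H0 of lambda_j K(X, X_j).

    positives and negatives are lists of (sequence, weight) pairs.
    The weights are assumed to be trained elsewhere, with 0 <= weight <= 1.
    """
    pos = sum(lam * spectrum_kernel(x, xi, k) for xi, lam in positives)
    neg = sum(lam * spectrum_kernel(x, xj, k) for xj, lam in negatives)
    return pos - neg


def in_family(x, positives, negatives, k=3):
    """Decide that X belongs to the family whenever L(X) > 0."""
    return discriminant(x, positives, negatives, k) > 0


if __name__ == "__main__":
    # Toy data, not real family members.
    family = [("ACDEFG", 1.0)]
    outside = [("WWWWWW", 1.0)]
    print(in_family("ACDEFH", family, outside))
    print(in_family("WWWWAC", family, outside))
```

Passing the kernel's inputs as plain strings keeps the sketch independent of any alignment step; swapping in a different similarity function only requires replacing spectrum_kernel.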

