Stanford CS 224 - Using the Mesh Hierarchy to Index Bioinformatics articles - D802866

School name Stanford University

Course Cs 224- N Natural Language Processing with Deep Learning

USING THE MESH HIERARCHY TO INDEX BIOINFORMATICS ARTICLES

JEFFREY CHANG
Stanford Biomedical [email protected]

As the number of bioinformatics articles increases, the ability to classify them into a structured hierarchy grows more important. For purposes such as text retrieval or knowledge discovery, it is useful to organize articles based on content. In this study, we evaluate a new method for hierarchical classification of abstracts from bioinformatics articles according to MeSH terms. We compare our method against a flat classifier as a baseline, and find that although performance on exact index assignment is weaker, the hierarchical classifier more often makes predictions in the correct branch of the hierarchy.

Introduction

The MEDLINE database contains references for all journal articles pertaining to the biomedical domain. Its coverage begins in 1966, and it now contains over 10 million citations. In order to facilitate searching of the database, the National Library of Medicine (NLM) defined a controlled vocabulary of index terms called Medical Subject Headings (MeSH). MeSH is organized hierarchically and contains over 19,000 main headings (Figure 1). Articles are indexed under several headings to describe all relevant concepts.
The NLM indexes these articles manually at a cost of about two million dollars per year (Yang, evaluate statistical approaches). Since the index terms, or MeSH headings, are manually assigned, the process suffers from high cost, time latency, and low inter-indexer reliability (Funk & Reid).

All MeSH Categories
    Analytical, Diagnostic and Therapeutic Techniques and Equipment Category +
    Anatomy Category +
    Anthropology, Education, Sociology and Social Phenomena Category +
    Biological Sciences Category +
    Check Tags +
    Chemicals and Drugs Category +
    Diseases Category +
    Geographical Locations Category +
    Health Care Category +
    Humanities Category +
    Information Science Category +
    Organisms Category +
    Persons Category +
    Physical Sciences Category +
    Psychiatry and Psychology Category +
    Technology and Food and Beverages Category +

Figure 1 The top level of the MeSH hierarchy.

In addition to the problems described above, assigning index terms to articles from an inter-disciplinary field such as bioinformatics (defined here broadly as all biology that involves non-trivial computational analysis) is particularly difficult. With bioinformatics articles, indexers need to be familiar with the coding conventions in diverse parts of the tree. For example, the index terms for the Structural Classification of Proteins database range from "Protein Conformation" to "Databases, Factual" (Figure 2).

The general problem of indexing text has been addressed extensively in previous literature. Many algorithms have been applied, including vector space (Salton, 1991), Linear Least Squares Fit (Yang & Chute, 1992), Nearest Neighbors (Yang, 1994; Yavuz and Guvenir), Naïve Bayes (Koller & Sahami, 1997), Linear Discriminant Analysis, Logistic Regression, Neural Networks (Schutze et al.), Maximum Entropy (Nigam et al.), Expectation-Maximization (Cohen, 1998), Rule Induction (Cohen, 1995), and Support Vector Machines and Transductive Support Vector Machines (Joachims).
Although very many methods have been applied, simple methods have been shown to work generally as well as more computationally expensive ones (Yang, 1997).

In addition, there has been work showing that classification results can be improved by taking advantage of hierarchical structure (Weigend et al.; Koller & Sahami, 1997; Mladenic). The general idea is that the features useful for distinguishing articles vary among the levels of the hierarchy. It is therefore more effective to train multiple classifiers, each of which classifies one level of the hierarchy, than to train one classifier for all possible classifications.

Thus, for this study, we investigate methods to exploit the MeSH hierarchy in order to assign index terms to abstracts from bioinformatics articles. Specifically, we view this as a problem of assigning probabilities that each index term applies to a document. Because we view this as a probabilistic classification problem, we apply two algorithms that can be interpreted probabilistically: naïve Bayes (NB) and k-nearest-neighbors (kNN).

The naïve Bayes classifier uses Bayes' Rule to assess the probability of the fit between some classification and the data:

$$P(\mathrm{class} \mid \mathrm{data}) \propto P(\mathrm{data} \mid \mathrm{class})\, P(\mathrm{class})$$

That is, the probability of the class given the data is proportional to the probability of the data given the class multiplied by the prior probability of the class. Since the latter two probabilities can be estimated from the training data, this rule can be used to make predictions of the probability of a class. An NB classifier takes many pieces of data, or features, and multiplies their probabilities together to get the probability for the class given all the data. This assumes independence among the features of the data.
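To make the proportionality above concrete, here is a minimal Python sketch of a naïve Bayes classifier over tokenized abstracts. It is an illustration only, not the implementation used in this study: the multinomial word model, the add-one smoothing, the restriction to one heading per document, the toy data, and the function names `train_nb` and `posterior` are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(class) and log P(word | class) with add-alpha smoothing."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_like


def posterior(doc, log_prior, log_like):
    """Return normalized P(class | doc); words outside the vocabulary are skipped."""
    scores = {}
    for c in log_prior:
        # Summing logs is the same as multiplying the per-word probabilities.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc if w in log_like[c]
        )
    top = max(scores.values())
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


# Toy, made-up training data: tokenized "abstracts" with one heading each.
docs = [
    ["protein", "structure", "fold"],
    ["protein", "fold", "domain"],
    ["database", "query", "retrieval"],
    ["database", "internet", "retrieval"],
]
labels = [
    "Protein Conformation",
    "Protein Conformation",
    "Databases, Factual",
    "Databases, Factual",
]

log_prior, log_like = train_nb(docs, labels)
probs = posterior(["protein", "database", "fold"], log_prior, log_like)
best = max(probs, key=probs.get)
print(best, round(probs[best], 2))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. The sum over words inside `posterior` is where the independence assumption enters the computation.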
Although this independence assumption is almost never true, the classifier still works well in practice.

TI - SCOP: a structural classification of proteins database.
LA - Eng
MH - *Databases, Factual
MH - Evolution, Molecular
MH - Information Storage and Retrieval
MH - Internet
MH - *Protein Conformation
MH - Proteins/chemistry/genetics
MH - Support, Non-U.S. Gov't

Figure 2 MeSH headings for a reference to a paper on SCOP (Lo Conte, et al., 1999).

We also use the k-nearest-neighbors classifier. kNN represents every training example as a specific point in feature space. Then, when given a new observation, it examines the classifications of the K points closest in that space to make its prediction. One interpretation of the results is that kNN assigns maximum likelihood probabilities to classes based on the class assignments of its K neighbors.

For both of these classifiers, we represent documents as vectors of words. The contents of the vectors can be binary (1 or 0 depending on whether the word appears in the document), the actual word count, or the word count binned into discrete categories. Under this representation, feature selection is the task of removing words from consideration. Fortunately, very simple feature selection algorithms, such as a cutoff on document frequency (the number of documents a word appears in), have been shown to work well (Yang).

Method

For the gold standard, we collected the set of articles published in the "Bioinformatics" journal from its start in April 1998 until 1 May 2000. We augmented this set using the "Related Articles" function available on the PubMed
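Returning to the document representation and the kNN classifier described in the Introduction, the following Python sketch shows one way the pieces could fit together: a document-frequency cutoff selects the vocabulary, documents become binary word vectors, and the fraction of the K nearest neighbors in each class serves as the class probability. The cutoff value, the squared Euclidean distance, the toy data, and the function names `select_vocab`, `binary_vector`, and `knn_probs` are assumptions made for illustration and do not come from the study.

```python
from collections import Counter


def select_vocab(docs, min_df=2):
    """Keep words that appear in at least min_df documents."""
    df = Counter(w for doc in docs for w in set(doc))
    return sorted(w for w, n in df.items() if n >= min_df)


def binary_vector(doc, vocab):
    """Encode a document as 1/0 flags: does each vocabulary word appear?"""
    present = set(doc)
    return [1 if w in present else 0 for w in vocab]


def knn_probs(query, train_vecs, train_labels, k=3):
    """Fraction of the k nearest training points that carry each label."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(query, vec)), i)
        for i, vec in enumerate(train_vecs)
    )
    votes = Counter(train_labels[i] for _, i in dists[:k])
    return {label: n / k for label, n in votes.items()}


# Toy, made-up data: tokenized "abstracts" with one heading each.
docs = [
    ["protein", "structure", "fold"],
    ["protein", "fold", "domain"],
    ["protein", "structure", "alignment"],
    ["database", "query", "retrieval"],
    ["database", "internet", "retrieval"],
]
labels = [
    "Protein Conformation",
    "Protein Conformation",
    "Protein Conformation",
    "Databases, Factual",
    "Databases, Factual",
]

vocab = select_vocab(docs, min_df=2)  # drops words seen in only one document
train_vecs = [binary_vector(doc, vocab) for doc in docs]
query = binary_vector(["database", "retrieval", "protein"], vocab)
probs = knn_probs(query, train_vecs, labels, k=3)
print(vocab)
print(probs)
```

In this toy run the two database documents are the closest neighbors and the third neighbor is a protein document, so the query receives probability 2/3 for "Databases, Factual" and 1/3 for "Protein Conformation". Replacing the binary flags with raw or binned word counts would only require changing `binary_vector`.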
