Stanford CS 224 - Using the Mesh Hierarchy to Index Bioinformatics articles

This preview shows page 1-2-3 out of 10 pages.

USING THE MESH HIERARCHY TO INDEX BIOINFORMATICS ARTICLES

JEFFREY CHANG
Stanford Biomedical [email protected]

As the number of bioinformatics articles increases, the ability to classify them into a structured hierarchy grows more important. For purposes such as text retrieval or knowledge discovery, it is useful to organize articles based on content. In this study, we evaluate a new method for hierarchical classification of abstracts from bioinformatics articles according to MeSH terms. We compare our method against a flat classifier as a baseline, and find that although performance on exact index assignment is weaker, the hierarchical classifier more often makes predictions in the correct branch of the hierarchy.

Introduction

The MEDLINE database contains references for all journal articles pertaining to the biomedical domain. Started in 1966, it now contains over 10 million citations. In order to facilitate searching of the database, the National Library of Medicine (NLM) defined a controlled vocabulary of index terms called Medical Subject Headings (MeSH). MeSH is organized hierarchically and contains over 19,000 main headings (Figure 1). Articles are indexed under several headings to describe all relevant concepts.
The NLM indexes these articles manually at a cost of about two million dollars per year (Yang, evaluate statistical approaches). Since the index terms, or MeSH headings, are manually assigned, the process suffers from high cost, time latency, and low inter-indexer reliability (Funk & Reid).

All MeSH Categories
    Analytical, Diagnostic and Therapeutic Techniques and Equipment Category +
    Anatomy Category +
    Anthropology, Education, Sociology and Social Phenomena Category +
    Biological Sciences Category +
    Check Tags +
    Chemicals and Drugs Category +
    Diseases Category +
    Geographical Locations Category +
    Health Care Category +
    Humanities Category +
    Information Science Category +
    Organisms Category +
    Persons Category +
    Physical Sciences Category +
    Psychiatry and Psychology Category +
    Technology and Food and Beverages Category +

Figure 1 The top level of the MeSH hierarchy.

In addition to the problems described above, assigning index terms to articles from an inter-disciplinary field such as bioinformatics (defined here broadly to include all biology that includes non-trivial computational analysis) is particularly difficult. With bioinformatics articles, indexers need to be familiar with the coding conventions in diverse parts of the tree. For example, the index terms for the Structural Classification of Proteins database contain terms from "Protein Conformation" to "Databases, Factual" (Figure 2).

The general problem of indexing text has been addressed extensively in previous literature. Many algorithms have been applied, including vector space (Salton, 1991), Linear Least Squares Fit (Yang & Chute, 1992), Nearest Neighbors (Yang, 1994; Yavuz and Guvenir), Naïve Bayes (Koller & Sahami, 1997), Linear Discriminant Analysis, Logistic Regression, Neural Networks (Schutze et al.), Maximum Entropy (Nigam et al.), Expectation-Maximization (Cohen, 1998), Rule Induction (Cohen, 1995), and Support Vector Machines and Transductive Support Vector Machines (Joachims).
Although very many methods have been applied, simple methods have been shown to work generally as well as more computationally expensive ones (Yang, 1997).

In addition, there has been work showing that classification results can be improved by taking advantage of hierarchical structure (Weigend et al.; Koller & Sahami, 1997; Mladenic). The general idea behind this is that the features useful for distinguishing articles will vary among different levels in the hierarchy. Thus, it is more effective to train multiple classifiers, where each one classifies one level of the hierarchy, rather than to train one classifier for all possible classifications.

For this study, we investigate methods to exploit the MeSH hierarchy in order to assign index terms to abstracts from bioinformatics articles. Specifically, we view this as a problem of assigning probabilities that each index term applies to a document. Because we view this as a probabilistic classification problem, we apply two algorithms that can be interpreted probabilistically: naïve Bayes (NB) and k-nearest-neighbors (kNN).

The naïve Bayes classifier uses Bayes' Rule to assess the probability of the fit between some classification and the data:

$$P(\text{class} \mid \text{data}) \propto P(\text{data} \mid \text{class})\, P(\text{class})$$

That is, the probability of the class given the data is proportional to the probability of the data given the class multiplied by the prior probability of the class. Since the latter two probabilities can be estimated from the training data, this rule can be used to make predictions of the probability of a class. A NB classifier takes many pieces of data, or features, and multiplies the probabilities together to get the probability for the class given all the data. This assumes independence among the features of the data.
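The naïve Bayes rule described above can be sketched in a few lines of code. This is only an illustration, not the implementation used in the study: it assumes binary word-presence features, add-one (Laplace) smoothing, and a handful of toy documents and labels invented for the example. The function names train_nb and predict_nb are hypothetical.

```python
import math
from collections import defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate P(class) and P(word present | class) from tokenized documents."""
    vocab = {w for words in docs for w in words}
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for words, label in zip(docs, labels):
        class_counts[label] += 1
        for w in set(words):  # binary features: count each word once per document
            word_counts[label][w] += 1
    n_docs = len(docs)
    priors = {c: class_counts[c] / n_docs for c in class_counts}
    # Assumption: add-alpha smoothing so an unseen word never gives probability 0 or 1.
    cond = {
        c: {w: (word_counts[c][w] + alpha) / (class_counts[c] + 2 * alpha)
            for w in vocab}
        for c in class_counts
    }
    return priors, cond, vocab


def predict_nb(words, priors, cond, vocab):
    """Return P(class | document), normalized over the known classes."""
    present = set(words) & vocab
    log_scores = {}
    for c, prior in priors.items():
        # Summing logs is equivalent to multiplying the per-feature probabilities.
        score = math.log(prior)
        for w in vocab:
            p = cond[c][w]
            score += math.log(p if w in present else 1.0 - p)
        log_scores[c] = score
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# Toy training data (invented for illustration only).
docs = [
    ["protein", "structure", "fold"],
    ["protein", "fold", "domain"],
    ["database", "retrieval", "internet"],
    ["database", "query", "internet"],
]
labels = [
    "Protein Conformation",
    "Protein Conformation",
    "Databases, Factual",
    "Databases, Factual",
]

priors, cond, vocab = train_nb(docs, labels)
posterior = predict_nb(["protein", "fold"], priors, cond, vocab)
print(max(posterior, key=posterior.get))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and normalizing at the end turns the proportionality in Bayes' Rule into probabilities that sum to one.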
Although this independence assumption is almost never true, the classifier still works well in practice.

TI - SCOP: a structural classification of proteins database.
LA - Eng
MH - *Databases, Factual
MH - Evolution, Molecular
MH - Information Storage and Retrieval
MH - Internet
MH - *Protein Conformation
MH - Proteins/chemistry/genetics
MH - Support, Non-U.S. Gov't

Figure 2 MeSH headings for a reference to a paper on SCOP (Lo Conte, et al., 1999).

We also use the k-nearest-neighbors classifier. kNN represents every training example as a specific point in feature space. Then, when given a new observation, it examines the classifications of the k points closest in space to make its prediction. One interpretation of the results is that kNN is assigning maximum likelihood probabilities for classes based on the class assignments of its k neighbors.

For both these classifiers, we represent documents as vectors of words. The contents of the vectors can be binary (1 or 0 depending on whether the word appears in the document), the actual word count, or the word count binned into discrete categories. Represented this way, feature selection is the task of removing words from consideration. Fortunately, very simple feature selection algorithms, such as a document frequency cutoff (document frequency being the number of documents a word appears in), have been shown to work well (Yang).

Method

For the gold standard, we collected the set of articles published in the "Bioinformatics" journal from its start in April 1998 until 1 May 2000. We augmented this set using the “Related Articles” function available on the PubMed

