
An Integrated System for Building Enterprise Taxonomies




Li Zhang (1), Tao Li (2), ShiXia Liu (1), and Yue Pan (1)
(1) IBM China Research Lab, {lizhang, liusx, panyue}@cn.ibm.com
(2) School of Computer Science, Florida International University, taoli@cs.fiu.edu
May 24, 2007

Abstract

Although considerable research has been conducted in the field of hierarchical text categorization, little has been done on automatically collecting a labeled corpus for building hierarchical taxonomies. In this paper, we propose an automatic method of collecting training samples to build hierarchical taxonomies. In our method, each category node is initially defined by some keywords, a web search engine is then used to construct a small set of labeled documents, and a topic tracking algorithm with keyword-based content normalization is applied to enlarge the training corpus on the basis of the seed documents. We also design a method to check the consistency of the collected corpus. The above steps produce a flat category structure which contains all the categories for building the hierarchical taxonomy. Next, a linear discriminant projection approach is used to construct more meaningful intermediate levels of hierarchy in the generated flat set of categories. Experimental results show that the training corpus is good enough for statistical classification methods.

1 Introduction

In recent years, many enterprises have tended to organize corporate information into taxonomies. At the same time, enterprises also find it important to keep track of marketing and competitor information from the web, according to desired taxonomies, for their business intelligence. This leads to the requirement to design the enterprise taxonomy and to label large amounts of data, manually or automatically.

Generally speaking, manual labeling is costly. It is estimated that, in order to tag a few hundred thousand documents, an enterprise may spend more than $1M [48]. The cost increases dramatically when the taxonomy changes and new facets of the taxonomy are defined. Although manual labeling can achieve high accuracy in some circumstances, it is not a good choice for large data collections.

Text categorization, a fundamental and effective tool that can automatically classify all kinds of documents into predefined categories, has been receiving much attention, and numerous approaches have been developed in the literature [33, 10, 23, 27, 53, 42]. Experimental results show that the classification accuracy achieved by automatic approaches is as good as human performance, which makes text categorization an attractive technique for information organization [44].

In practice, however, the performance of classification methods depends on the number of available training samples. Tests on Reuters-21578 [54] show that classification precision and recall are good for common categories with more than 300 training samples, but poor for rare categories, which sometimes have fewer than 10 training samples. Furthermore, the classification accuracy also depends on the quality of the user-labeled corpus. If the corpus is poorly labeled, the classification accuracy decreases greatly. Thus there is a big gap between the expected accuracy and the real performance in practice.

In order to achieve high accuracy, statistical machine learning methods need a large, high-quality corpus to train the classifier. As manual tagging is very expensive, an enterprise cannot be expected to spend a huge amount of human effort on it. How to prepare a training corpus at low cost is therefore a major problem that the enterprise must resolve.

Several attempts have been made to solve this problem. Among them, the most commonly used method is active learning [14, 34, 45, 16]. In active learning, the machine presents the most informative document for the user to label, and the user interactively marks the categories the document belongs to. Active learning can reduce the number of training samples needed to reach a reasonable classification quality. This technique is well suited to cases where a large unlabeled corpus exists. But enterprise information is extraordinarily complex, and much of it is unrelated to the scope of the desired taxonomy, so a lot of useless data would have to be labeled.

The other type of method generates taxonomies from labeled data as well as unlabeled data. These methods include learning from labeled and unlabeled examples [17, 36], partially supervised classification [21, 32], and supervised clustering [2]. All of these methods require a well-labeled, pre-existing document set, but in many cases the enterprise does not have such a labeled document set.

Unlike the above methods, our approach starts with several seed documents and then incrementally enlarges the corpus from the web or intranet. The starting point is a set of category labels, and the user of the system is interested in finding a hierarchical relationship between the categories. There are two basic components in a taxonomy: a hierarchy of categories and a collection of documents [1]. In order to build enterprise taxonomies efficiently and effectively, the following issues should be carefully studied: how to collect a large corpus from the web efficiently; how to evaluate the quality of the training corpus; and how to organize the collected corpus into hierarchical structures.

In this paper, we present an integrated system to address the above-mentioned issues. Our system can assist the enterprise in preparing a large training corpus and building hierarchical categories from the web. It takes an empty taxonomy as input and generates as output a hierarchy of categories and a document set in which each document is labeled with one or more categories.

Figure 1: System Architecture. (Components: Category Tree Editor, Search Engine, Document Crawler, Seed Selector, Large-Corpus Collector, Consistency Checking, Taxonomy Hierarchy Generator; data passed between them: keywords, categories, seeds.)

In particular, the technicians of the enterprise first define the flat category structure freely and submit some keywords to a search engine. Our system automatically analyzes the retrieved documents and recommends several seed documents for each category with a refined centroid method. Keyword-based content normalization and a topic tracking technique [13] are then used to collect large training samples from the web. To evaluate the quality of the training corpus, a consistency checking method helps the user check the quality of the corpus. Finally, the taxonomy hierarchy is generated via linear discriminant projection.
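The seed-recommendation step described above can be illustrated with a small sketch. The excerpt does not give the details of the refined centroid method, so the code below uses a plain centroid ranking as a stand-in: each retrieved document becomes a normalized term-frequency vector, the vectors are averaged into a centroid, and the documents closest to the centroid by cosine similarity are proposed as seeds. The function names (`tf_vector`, `centroid`, `select_seeds`), the tokenizer, and the parameter `k` are all assumptions made for illustration and do not come from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_vector(text):
    # Term-frequency vector scaled to unit length (sparse dict).
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def centroid(vectors):
    # Component-wise mean of the sparse vectors.
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()}


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_seeds(documents, k=2):
    # Rank retrieved documents by similarity to their centroid and
    # return the indices of the k most central ones as seed candidates.
    vectors = [tf_vector(d) for d in documents]
    center = centroid(vectors)
    ranked = sorted(
        range(len(documents)),
        key=lambda i: cosine(vectors[i], center),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    retrieved = [
        "taxonomy hierarchy categories documents",
        "taxonomy categories training corpus",
        "football match score goal",
    ]
    print(select_seeds(retrieved, k=2))
```

In this toy input the off-topic third document shares no terms with the others, so it falls furthest from the centroid and is not proposed as a seed. A real system would presumably add term weighting and the refinement the authors mention, neither of which is modeled here.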

