An Integrated System for Building Enterprise Taxonomies



Li Zhang (1), Tao Li (2), ShiXia Liu (1), and Yue Pan (1)
1. IBM China Research Lab, {lizhang,liusx,panyue}@cn.ibm.com
2. School of Computer Science, Florida International University, [email protected] 24, 2007

Abstract

Although considerable research has been conducted in the field of hierarchical text categorization, little has been done on automatically collecting a labeled corpus for building hierarchical taxonomies. In this paper, we propose an automatic method of collecting training samples to build hierarchical taxonomies. In our method, the category node is initially defined by some keywords, the web search engine is then used to construct a small set of labeled documents, and a topic tracking algorithm with keyword-based content normalization is applied to enlarge the training corpus on the basis of the seed documents. We also design a method to check the consistency of the collected corpus. The above steps produce a flat category structure which contains all the categories for building the hierarchical taxonomy. Next, a linear discriminant projection approach is utilized to construct more meaningful intermediate levels of hierarchies in the generated flat set of categories. Experimental results show that the training corpus is good enough for statistical classification methods.

1 Introduction

In recent years, many enterprises have tended to organize corporate information into taxonomies. At the same time, enterprises also find it important to keep track of marketing and competitor information from the web according to desired taxonomies for their business intelligence. This leads to the requirement to design the enterprise taxonomy and label large amounts of data manually or automatically.

Generally speaking, manual labeling is costly. It is estimated that, in order to tag a few hundred thousand documents, the enterprise may spend more than $1M [48]. The cost will increase dramatically when the taxonomy changes and new facets of the taxonomy are defined. Although manual labeling can achieve high accuracy (in some circumstances), it is not a good choice for large data collections.

Text categorization, as a fundamental and effective tool that can automatically classify all kinds of documents into predefined categories, has been receiving much attention, and numerous approaches have been developed in the literature [33, 10, 23, 27, 53, 42]. Experimental results show that the classification accuracy achieved by automatic approaches is as good as human performance, which makes text categorization an attractive technique for information organization [44].

However, in practice, the performance of classification methods depends on the number of available training samples. Tests with Reuters-21578 [54] show that the classification precision and recall are good for common categories (with more than 300 training samples), but poor for rare categories (sometimes with fewer than 10 training samples). Furthermore, the classification accuracy also depends on the quality of the user-labeled corpus. If the corpus is poorly labeled, the classification accuracy will decrease greatly. Thus, there exists a big gap between the expected accuracy and the real performance in practice.

In order to achieve high accuracy, statistical machine learning methods need a large, high-quality corpus to train the classifier. As manual tagging is very expensive, an enterprise cannot be expected to spend a huge amount of human effort on it. So, how to prepare a training corpus at low cost is a big problem that the enterprise must resolve.

Several attempts have been made to solve this problem. Among them, the most commonly used method is active learning [14, 34, 45, 16]. In active learning, the machine prompts the user with the most informative document to label, and the human interactively marks the categories the documents belong to. Active learning can reduce the number of training samples and reach a reasonable classification quality. This technique is well suited for cases where a large unlabeled corpus exists. But enterprise information is extraordinarily complex, and much of it is unrelated to the scope of the desired taxonomy. This leads to a lot of non-useful data being labeled. The other type of method generates taxonomies based on labeled data as well as unlabeled data. These methods include learning from labeled and unlabeled examples [17, 36], partially supervised classification [21, 32], and supervised clustering [2]. All of these methods require a well-labeled pre-existing document set, but in many cases the enterprise does not have such a labeled document set. Different from the above methods, our approach focuses on starting with several seed documents and then enlarging the corpus from the web or intranet incrementally. The starting point is a set of category labels, and the user of this system is interested in finding a hierarchical relationship between the categories.

There are two basic components in a taxonomy: a hierarchy of categories and a collection of documents [1]. In order to build enterprise taxonomies efficiently and effectively, the following issues should be carefully studied:
• how to collect a large corpus from the web efficiently;
• how to evaluate the quality of the training corpus;
• how to organize the collected corpus into hierarchical structures.

[Figure 1: System Architecture. Labels shown: Category Tree Editor, Search Engine, Document Crawler, Seed Selector, Large Corpus Collector, Keywords & Categories, Seeds, Consistency Checking, Hierarchy Generator, Taxonomy.]

In this paper, we present an integrated system to address the above-mentioned issues. Our system can assist the enterprise in preparing a large training corpus and building hierarchical categories from the web. It takes an empty taxonomy as input and generates as output a hierarchy of categories and a document set in which each document is labeled with one or more categories. In particular, first, the technicians of the enterprise can define the flat category structure freely and submit some keywords to a search engine. Our system can automatically analyze the retrieved documents and recommend several seed documents for each category with a refined centroid method. A keyword-based content normalization and topic tracking technique [13] are then used to collect large training samples from the web. To evaluate the quality of the training corpus, a consistency checking method is utilized to

