Elimination of Junk Document Surrogate Candidates through Pattern Recognition

Eunyee Koh, Daniel Caruso, Andruid Kerne, Ricardo Gutierrez-Osuna
Interface Ecology Lab
Center for Study of Digital Libraries | Computer Science Department
Texas A&M University, College Station, TX 77843, USA
{eunyee, dcaruso, andruid, rgutier}@cs.tamu.edu

ABSTRACT
A surrogate is an object that stands for a document and enables navigation to that document. Hypermedia is often represented with textual surrogates, even though studies have shown that image and text surrogates facilitate the formation of mental models and overall understanding. Surrogates may be formed by breaking a document down into a set of smaller elements, each of which is a surrogate candidate. While processing these surrogate candidates from an HTML document, relevant information may appear together with less useful junk material, such as navigation bars and advertisements. This paper develops a pattern recognition based approach for eliminating junk while building the set of surrogate candidates. The approach defines features on candidate elements, and uses classification algorithms to make selection decisions based on these features. For the purpose of defining features in surrogate candidates, we introduce the Document Surrogate Model (DSM), a streamlined Document Object Model (DOM)-like representation of semantic structure. Using a quadratic classifier, we were able to eliminate junk surrogate candidates with an average classification rate of 80%.
By using this technique, semi-autonomous agents can be developed to more effectively generate surrogate collections for users. We end by describing a new approach for hypermedia and the semantic web, which uses the DSM to define value-added surrogates for a document.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Selection process
H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia – Navigation.

General Terms
Algorithms, Performance, Design, Human Factors

Keywords
surrogate, document surrogate model, navigation, mixed-initiatives, pattern recognition, quadratic classifier, principal components analysis, semi-autonomous agents

1. INTRODUCTION
Representing large collections of documents to users in ways that facilitate understanding the essential meanings that the documents convey is a hard problem. This is a form of Vannevar Bush’s problem, which frames our field: there is too much information [4]. Surrogates are information elements selected from a specific document, which can be used in place of the original document [3, 25].

Most responses to search queries are represented in the form of lists of textual surrogates [14, 32, 35]. Yet studies have shown that users prefer image and text surrogates and understand them more readily [10, 20]. Further, image and text representations facilitate the formation of mental models [13].

Building good image and text surrogates for a document is not straightforward. One approach to this problem is to explicitly include image and text surrogates among the metadata that is specified for each document, just as abstracts are kept as textual representations. Image and text surrogates function as “boosters” [28] that add value to the process of content aggregation by promoting collection understanding [6]. Alternatively, one may extract surrogates from documents through procedural methods. The nature of this task differs depending on the document format.
Some digital libraries and semantic web repositories include a large number of HTML documents and sites [26, 27]. Extracting good visual surrogates from documents in this type of collection is complicated by the presence of junk, such as site navigational elements, which may not represent the document’s meaning.

In addition to how individual surrogates are represented, another issue is how to represent collections. One approach to representing collections would be to use lists of image and text surrogates instead of pure text in the result sets that search engines return. An alternative approach is taken by combinFormation [18], a tool that facilitates the construction of surrogates and their spatial and visual composition in a mixed-initiative system [23]. Compositions are produced by a generative agent whose actions can be overridden and directed by the user. combinFormation uses surrogates in a variety of ways, such as changing the interest model used by the agent, visually combining surrogates to illustrate an idea or concept, and navigating to the original document from which the surrogate was selected.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DocEng’07, August 28-31, 2007, Winnipeg, Manitoba, Canada.
Copyright 2007 ACM 978-1-59593-776-6/07/0008...$5.00.

Imagine a space where interesting pieces of the most up-to-date information on your favorite topics are continuously discovered and presented to you. Now imagine that this space were full of advertisements, e-mail addresses, copyright notices, website navigation bars, etc.
Sorting through and uncovering the information you are actually interested in becomes a difficult and cognitively expensive process. Unfortunately, in some simple exercises conducted using combinFormation, this is exactly what happened. In Figure 1, we have separated out all of these garbage elements. Having to perform this task every few seconds, as new information elements are presented, distracts the user. It would be better for the user if an application removed these elements automatically, freeing up her/his cognitive abilities for more important tasks, such as processing the real information s/he is interested in. Automatically choosing the most

