Elimination of Junk Document Surrogate Candidates through Pattern Recognition

Eunyee Koh, Daniel Caruso, Andruid Kerne, Ricardo Gutierrez-Osuna
Interface Ecology Lab | Center for Study of Digital Libraries
Computer Science Department, Texas A&M University, College Station, TX 77843, USA
{eunyee, dcaruso, andruid, rgutier}@cs.tamu.edu

ABSTRACT

A surrogate is an object that stands for a document and enables navigation to that document. Hypermedia is often represented with textual surrogates, even though studies have shown that image and text surrogates facilitate the formation of mental models and overall understanding. Surrogates may be formed by breaking a document down into a set of smaller elements, each of which is a surrogate candidate. While processing these surrogate candidates from an HTML document, relevant information may appear together with less useful junk material, such as navigation bars and advertisements.

This paper develops a pattern recognition based approach for eliminating junk while building the set of surrogate candidates. The approach defines features on candidate elements and uses classification algorithms to make selection decisions based on these features. For the purpose of defining features in surrogate candidates, we introduce the Document Surrogate Model (DSM), a streamlined Document Object Model (DOM)-like representation of semantic structure. Using a quadratic classifier, we were able to eliminate junk surrogate candidates with an average classification rate of 80%. By using this technique, semi-autonomous agents can be developed to more effectively generate surrogate collections for users. We end by describing a new approach for hypermedia and the semantic web, which uses the DSM to define value-added surrogates for a document.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Selection process; H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia - Navigation

General Terms
Algorithms, Performance, Design, Human Factors

Keywords
surrogate, document surrogate model, navigation, mixed-initiatives, pattern recognition, quadratic classifier, principal components analysis, semi-autonomous agents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DocEng'07, August 28-31, 2007, Winnipeg, Manitoba, Canada.
Copyright 2007 ACM 978-1-59593-776-6/07/0008 $5.00.

1. INTRODUCTION

Representing large collections of documents to users in ways that facilitate understanding the essential meanings that the documents convey is a hard problem. This is a form of Vannevar Bush's problem, which frames our field: there is too much information [4]. Surrogates are information elements, selected from a specific document, which can be used in place of the original document [3, 25]. Most responses to search queries are represented in the form of lists of textual surrogates [14, 32, 35]. Yet studies have shown that users prefer image and text surrogates, and understand them more readily [10, 20]. Further, image and text representations facilitate the formation of mental models [13].

Building good image and text surrogates for a document is not simple and straightforward. One approach to this problem is to explicitly include image and text surrogates among the metadata that is specified for each document, just as abstracts are kept as textual representations. Image and text surrogates function as boosters [28] that add value to the process of content aggregation by promoting collection understanding [6].

Alternatively, one may extract surrogates from documents through procedural methods. The nature of this task differs depending on the document format. Some digital libraries and semantic web repositories include a large number of HTML documents and sites [26, 27]. Extracting good visual surrogates from documents in this type of collection is complicated by the presence of junk, such as site navigational elements, which may not represent the document's meaning.

In addition to how individual surrogates are represented, another issue is how to represent collections. One approach to representing collections would be to use lists of image and text surrogates, instead of pure text, in the result sets that search engines return. An alternative approach is taken by combinFormation [18], a tool that facilitates the construction of surrogates and their spatial and visual composition in a mixed-initiative system [23]. Compositions are produced by a generative agent whose actions can be overridden and directed by the user. combinFormation uses surrogates in a variety of ways, such as changing the interest model used by the agent, visually combining surrogates to illustrate an idea or concept, and navigating to the original document the surrogate was selected from.

Imagine a space where interesting pieces of the most up-to-date information on your favorite topics are continuously discovered and presented to you. Now imagine if this space was full of advertisements, e-mail addresses, copyright notices, website navigation bars, etc. Sorting through and uncovering the information you are actually interested in becomes a difficult and cognitively expensive process. Unfortunately, in some simple exercises conducted using combinFormation, this is exactly what happened. In Figure 1, we have separated out all of these garbage elements. Having to perform this task every few seconds, as new information elements are presented, is a distracting task for the user. It would be better for the user if an application removed these elements automatically, freeing up her/his cognitive abilities for more important tasks, such as processing the real information s/he is interested in.

Figure 1: These pools of surrogate candidates have been manually separated into junk and non-junk to illustrate what we mean by junk, and how much of it there is.

In this paper, we apply statistical pattern recognition techniques to a set of human judgments, which have been systematized in the form of training data, in order to determine if any subsequently encountered surrogate should be discarded as junk. Our overall goal is to improve surrogate selection by increasing the number of junk surrogates that are correctly discarded.

2. RELATED WORK

Automatically choosing the most informative
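As a rough illustration of the quadratic classifier named in the abstract, the following sketch fits one Gaussian per class to human-labeled training data and assigns a new surrogate candidate to the class with the larger quadratic discriminant score. It is a minimal sketch under stated assumptions, not the authors' implementation: the two features (word count and link ratio), the training values, and all function names are hypothetical, and the principal components analysis step listed in the keywords is omitted.

```python
import math

def fit_gaussian(samples):
    """Estimate the mean and 2x2 covariance of a list of (x, y) feature pairs."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def discriminant(x, mean, cov, prior):
    """Quadratic discriminant: log(prior) - 0.5*log|S| - 0.5*(x-m)^T S^-1 (x-m)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * maha

def train(labeled):
    """labeled maps a class name to its list of feature pairs."""
    total = sum(len(v) for v in labeled.values())
    return {c: (*fit_gaussian(v), len(v) / total) for c, v in labeled.items()}

def classify(model, x):
    """Return the class whose discriminant score is highest for x."""
    return max(model, key=lambda c: discriminant(x, *model[c]))

# Hypothetical training data: (word count, fraction of words inside links).
training = {
    "junk": [(2, 0.9), (3, 1.0), (1, 0.8), (4, 0.7), (2, 1.0), (5, 0.95)],
    "content": [(40, 0.1), (25, 0.0), (60, 0.2), (33, 0.05), (48, 0.15), (52, 0.0)],
}
model = train(training)

print(classify(model, (3, 0.9)))    # short, link-heavy candidate
print(classify(model, (45, 0.1)))   # longer, mostly plain-text candidate
```

Because each class keeps its own covariance matrix, the decision boundary is quadratic rather than linear, which is what distinguishes this classifier from a linear discriminant. Candidates labeled as junk would then be dropped before surrogates are composed.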
