Document Representation

Document Representation and Query Expansion Models for Blog Recommendation

Jaime Arguello and Jonathan L. Elsas and Jamie Callan and Jaime G. Carbonell
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA

Abstract

We explore several different document representation models and two query expansion models for the task of recommending blogs to a user in response to a query. Blog relevance ranking differs from traditional document ranking in ad-hoc information retrieval in several ways: (1) the unit of output (the blog) is composed of a collection of documents (the blog posts) rather than a single document, (2) the query represents an ongoing (and typically multifaceted) interest in the topic rather than a passing ad-hoc information need, and (3) due to the prevalence of spam, splogs, and tangential comments, the blogosphere is particularly challenging to use as a source for high-quality query expansion terms. We address these differences at the document representation level, by comparing retrieval models that view either the blog or its constituent posts as the atomic units of retrieval, and at the query expansion level, by making novel use of the links and anchor text in Wikipedia [1] to expand a user’s initial query. We develop two complementary models of blog retrieval that perform at comparable levels of precision and recall. We also show consistent and significant improvement across all models using our Wikipedia expansion strategy.

Introduction

Blog retrieval is the task of finding blogs with a principal, recurring interest in X, where X is some information need expressed as a query. The input to the system is a short (i.e., 1-5 word) query and the output is a ranked list of blogs a person might want to subscribe to and read on a regular basis. This was the formulation of the TREC 2007 Blog Distillation task (Macdonald, Ounis, & Soboroff 2007).
Feed recommendation systems may also suggest relevant feeds based on the feeds a user already subscribes to (Java et al. 2007) [2]. However, in this work, a short query is assumed to be the only evidence of a user’s interest. The output is a ranked list of feeds expected to satisfy the information need in a persistent manner and not just with a few relevant entries. We interchangeably refer to this query-in/blogs-out approach to blog/feed recommendation as blog/feed retrieval.

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[1] http://en.wikipedia.org
[2] In this work, we will refer to “blogs” and “feeds” as the same entity, as there is a one-to-one relationship between the two. “Entry” and “post” will also be used interchangeably.

Blog retrieval differs from traditional ad-hoc retrieval in several important ways. First, the ultimate unit of output (the blog) corresponds to a collection of documents (its blog posts) rather than a single document. A single relevant post does not imply the relevance of its corresponding blog. Therefore, we must be concerned with how relevance at the post level corresponds to relevance at the overall blog level. Second, the nature of relevance at the blog scale has implications for the expected information needs of the users of a blog retrieval system. If blog authors are expected to have an ongoing interest in a topic, that topic is likely multi-faceted and supports the authors’ desire to write posts on various aspects of the central topic. Thus, users’ information needs appropriate for a blog retrieval system are likewise multi-faceted. A short query is an impoverished representation of a user’s interest in feed recommendation as it does not convey these facets.
Finally, a blog corpus is not a typical document collection, but is susceptible to large amounts of reader-generated commentary of varying quality and topicality, and large amounts of comment spam and spam blogs (splogs) intended only to route traffic to desired commercial sources. Any technique used in blog retrieval must be robust to this “noise” in the collection.

Two dimensions of feed retrieval were investigated to address these unique aspects of blog search.

1. Representation: How do we effectively represent blogs for use in a retrieval system? In this work, we considered two models of representation: the large document model, in which entire blogs are indexed as single documents, and the small document model, where we index at the post level and aggregate a post ranking into a final blog ranking.

2. Query Expansion: Does the nature of this task and the noise in the collection require different techniques for query expansion than traditional ad-hoc retrieval? In typical retrieval systems, query expansion is often intended to overcome a vocabulary mismatch between the query and the document collection. In this task, however, we may be able to view query expansion as bridging the gap between a high-level general topic (expressed by the query) and the more nuanced facets of that topic likely to be written about in the blog posts.

In this work, we develop several representations and retrieval models for blog retrieval and present a novel technique for mining the links and anchor text in Wikipedia for query expansion terms and phrases. The remainder of the paper is organized as follows. First we discuss our models of feed retrieval and query expansion. Our test collection and evaluation setup are discussed next, followed by our experimental results and a brief error analysis.
We conclude with a discussion of related work and future directions for this research.

Feed Representation and Retrieval Models

As stated above, the issue of how to represent feeds for retrieval is critical to the task of effectively ranking in response to a query. In this work we explored two primary models of representation for feed retrieval. The first, the “large document model”, represents each feed as a single document, a virtual concatenation of its respective entries. The second, the “small document model”, represents each entry as an individual document and an entry ranking is aggregated into a feed ranking post-retrieval.

Large document model

The ultimate unit of retrieval is the feed, and for this reason one clear approach is to index feeds as single documents. In this scenario, all posts or entries in the feed are concatenated together to form one large bag of words or phrases. This large document approach is appealing for its simplicity: existing retrieval techniques can be
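
As a rough illustration of the two representations described in this section, the following Python sketch ranks a toy set of feeds both ways. It is a minimal sketch under stated assumptions, not the retrieval model used in the paper: the corpus `FEEDS`, the term-overlap function `score`, and the choice to aggregate post scores by averaging are all hypothetical and introduced here only for illustration.

```python
# Toy corpus: feed id -> list of posts (hypothetical data, for illustration only).
FEEDS = {
    "feed_a": ["machine learning news", "learning to rank blogs", "machine learning tutorial"],
    "feed_b": ["machine learning", "weekend hiking trip", "garden photos"],
}


def score(query, text):
    """Toy relevance score: fraction of tokens in `text` that are query terms.

    This is a stand-in assumption, not the scoring function from the paper.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    query_terms = set(query.lower().split())
    return sum(1 for token in tokens if token in query_terms) / len(tokens)


def rank_large_document(query, feeds):
    """Large document model: concatenate a feed's posts and score the result once."""
    scores = {feed_id: score(query, " ".join(posts)) for feed_id, posts in feeds.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def rank_small_document(query, feeds):
    """Small document model: score each post, then aggregate post scores per feed.

    Averaging is an assumed aggregation rule chosen for simplicity.
    """
    scores = {}
    for feed_id, posts in feeds.items():
        post_scores = [score(query, post) for post in posts]
        scores[feed_id] = sum(post_scores) / len(post_scores) if post_scores else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(rank_large_document("machine learning", FEEDS))
    print(rank_small_document("machine learning", FEEDS))
```

In this toy data, `feed_b` contains one post that matches the query exactly but is otherwise off-topic, while `feed_a` returns to the topic in every post. Both sketches place `feed_a` first, which mirrors the point made earlier that a single relevant post does not imply the relevance of its blog.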

