DOC PREVIEW
ODU CS 791 - Query-Free News Search

This preview shows page 1-2-3-18-19-37-38-39 out of 39 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 39 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Query Free News Search by Monika Henzinger Bay Wei Chang Sergey Brin Google Inc Brian Milch UC Berkeley presented by Martin Klein Santosh Vuppala mklein svuppala cs odu edu ODU Norfolk 03 21 2007 Introduction about news search supporting TV broadcast automatically selecting web pages news articles relevant to the ongoing stream of TV broadcast news presented to user while watching TV Outline Query Generation QG Algorithms Postprocessing Evaluation Conclusion Discussion Examples Examples How is it done extract queries from closed caption query generation issue queries to news search engine on the web determine what news articles to provide to the user post processing of top results Query Generation relevant articles shown at regular rate during news broadcast ergo QG needs to produce a query periodically every S seconds S 7 15 15sec eq 3 sentences 50 words QG is given the text segment T since the last query generation 7 different QG algorithms to create 2 or 3 word queries TF IDF Review tf how often this term appears in text segment T idf in how many text segments does term appear idf log N f 1 with N total number of text segments f number of text segments in which term appears The Baseline Algorithm A1 BASE simple calculates the product of tf and idf larger weights for more frequent terms larger weights for more unusual terms returns the two terms with largest weight as the query The A2 IDF2 Algorithm same as baseline but calculates weight of terms as tf idf idf gives even more importance to rare words The Simple Stemming Algorithm A3 STEM assigns a weight to each stem stem of a word is approximated by taking first 5 letters e g sports sportsman sportive sport for each stem store all terms plus their weight that generated stem c tf idf idf c 1 for nouns or else 0 5 top weighted terms for top 2 stems form query The stemming algorithm with compounds A4 COMP A3 STEM two word compounds list of allowed compounds New York San Francisco e g stem for the compound veterans administration is veter admin query formed as in A3 STEM can consist of 2 3 or 4 words now The History Algorithm A5 HIST A4 COMP history feature uses terms from previous text segments to aid in generating query for the current one data structure called stem vector represents previously seen text stems terms weight SV combines history information with information produced by A4 COMP for current text segment finds the top weighted stems The History Algorithm A5 HIST reset to empty stem vector new SV created for each new T computes similarity score between current T and text in old SV threshold to determine if similar somewhat similar and dissimilar if similar SVs are merged if dissimilar reset old SV and replace with new SV The Query Shortening Algorithm A6 3W identical to A5 HIST issues a multiple term query shortens the query until results are returned from news search engine starts with three terms Algorithm A7 IDF identical to A5 HIST tf idf replaced by tf idf idf Postprocessing issue search queries to news search engine retrieve top 15 results result contains one news article 40 duplicates in returned articles near duplicate backoff strategy detected by comparing titles and summaries if considered as duplicate dismissed next in line chosen if empty chose 1st Postprocessing Boosting use additional high weighted terms to select most relevant articles from search results QG returns boost terms top 5 terms like Q terms boost value their IDF values computes weight for each result re ranking Postprocessing Similarity Re ranking similarity between text segment and returned results is computed tf idf weighted term vectors for text segment and text of article first 500 chars compute normalized cosine similarity score can you see the downside Postprocessing Filtering discard articles very dissimilar to the caption F1 if query too vague top 2 results often dissimilar if found both articles discarded unless very similar to caption cosine similarity score css F2 F1 if css T of page A threshold b discard F2 if css of two pages A B threshold p discard BUT if css T of A or B threshold g retain page F3 EVALUATION by humans 0 if the article is not on the topic 1 if the article is about the topic in general but not the exact story 2 if the article is about the exact news story that is being discussed relevant R if score of 1 very relevant R if score 2 EVALUATION use Precision and relative Recall to compare algorithms measure topic coverage number of topics with 1 relevant articles Data Sets HN Headline News Three 30 minute sessions of CNN headline news each taken from a different day topics comprise 70min CNN One hour of Wolf Blitzer Reports on CNN from one day and 30 minutes from another day topics comprise 64min Evaluation of the QG Algorithms Evaluation of the QG Algorithms idf idf seems to work slightly better than just idf experiments on stemming are inconclusive adding compounds does not improve precision adding history feature is recommended 3 word queries only good without postprocessing Evaluation of Postprocessing filtering gains 20 30 in precision loses 6 in recall filtering sim re rank same precision but better recall Query Overlap and URL Overlap reason for similarity in performance of the different query selection algorithms issue similar queries return similar URLs Query Overlap of queries that have identical terms identical only for similar algorithms URL Overlap of URLs returned by 2 algorithms high only for similar algorithms Topic Coverage 70 of the topics have 1 relevant article Filtering Effectiveness of articles filtered out by what filter rule Filtering Effectiveness how often does a filtering rule make wrong decision Conclusion 7 algorithms and 3 postprocessing techniques to find news articles on the web top algorithms find relevant article every 16 20sec precision 84 91 relevant article found in 70 of topics filtering by similarity to caption and with each other improves precision not restricted to news Background Authors Brian Milch PhD in CS from UC Berkeley now Post Doc at MIT Cambridge MA Monika Henzinger Sergey Brin Bay Wei Chang Who Links to Whom paper picture from http people csail mit edu milch Background Paper published at WWW2003 Budapest Hungary journal article as well 2005 http dx doi org 10 1007 s11280 004 4870 6 last paper Brin has published DBLP Bibliography Server Discussion We have seen that these algorithms work fairly well they return 70 of topics with relevant or super relevant articles


View Full Document

ODU CS 791 - Query-Free News Search

Documents in this Course
Load more
Download Query-Free News Search
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Query-Free News Search and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Query-Free News Search 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?