Lecture 16 Web Search Engines Oct 27 2006 ChengXiang Zhai CS511 Advanced Database Management Systems 1 Characteristics of Web Information Infinite size Surface vs deep Web Surface static HTML pages Deep dynamically generated HTML pages DB Semi structured Structured HTML tags hyperlinks etc Unstructured Text Different format pdf word ps Multi media Textual audio images High variances in quality Many junks Universal coverage can be about any content CS511 Advanced Database Management Systems 2 General Challenges in Web Search Handling the size of the Web How to ensure completeness of coverage Efficiency issues Dealing with or tolerating errors and low quality information Addressing the dynamics of the Web Some pages may disappear permanently New pages are constantly created CS511 Advanced Database Management Systems 3 Tasks in Web Information Management Information Access Search Search engines e g Google Navigation Browsers e g IE Filtering Recommender systems e g Amazon Information Organization Categorization Web directories e g Yahoo Clustering Organize search results e g Vivsimo Information Discovery Mining e g CS511 Advanced Database Management Systems 4 Typology of Web Search Engines Web Browser Yahoo Google Directory Service Directories Search Engine Meta Engine Web Index Engine profiles Spider Robot CS511 Advanced Database Management Systems Vivisimo 5 Examples of Web Search Engines Web Browser Directory Service Meta Engines Autonomous Engines CS511 Advanced Database Management Systems 6 Basic Search Engine Technologies User Web Browser Query Host Info Efficiency Crawler Coverage Freshness Cached pages Indexer Error spam handling CS511 Advanced Database Management Systems Results Retriever Precision Inverted Index 7 Component I Crawler Spider Robot Building a toy crawler is easy Start with a set of seed pages in a priority queue Fetch pages from the web Parse fetched pages for hyperlinks add them to the queue Follow the hyperlinks in the queue A real crawler is much more complicated Robustness server failure trap etc Crawling courtesy server load balance robot exclusion etc Handling file types images PDF files etc URL extensions cgi script internal references etc Recognize redundant pages identical and duplicates Discover hidden URLs e g truncated Crawling strategy is a main research topic i e which page to visit next CS511 Advanced Database Management Systems 8 Major Crawling Strategies Breadth First most common balance server load Parallel crawling Focused crawling Targeting at a subset of pages e g all pages about automobiles Typically given a query Incremental repeated crawling Can learn from the past experience Probabilistic models are possible The Major challenge remains to maintain freshness and good coverage with minimum resource overhead CS511 Advanced Database Management Systems 9 Component II Indexer Standard IR techniques are the basis Lexicons Inverted index with positions General challenges at a much large scale Basic indexing decisions stop words stemming numbers special symbols Indexing efficiency space and time Updating Additional challenges Recognize spams junks How to index anchor text How to support fast summary generation CS511 Advanced Database Management Systems 10 Google s Solutions URL Queue List Cached source pages compressed Inverted index Hypertext structure CS511 Advanced Database Management Systems Use many features e g font layout 11 Component III Retriever Standard IR models apply but aren t sufficient Different information need home page finding vs topic driven Documents have additional information hyperlinks markups URL Information is often redundant and the quality varies a lot Server side feedback is often not feasible Major extensions Exploiting links anchor text link based scoring Exploiting layout markups font title field etc Spelling correction Spam filtering Redundancy elimination CS511 Advanced Database Management Systems 12 Effective Web Retrieval Heuristics High accuracy in home page finding can be achieved by Matching query with the title Matching query with the anchor text Plus URL based or link based scoring e g PageRank Imposing a conjunctive and interpretation of the query is often appropriate Queries are generally very short all words are necessary The size of the Web makes it likely that at least a page would match all the query words CS511 Advanced Database Management Systems 13 Home Entry Page Finding Evaluation Results TREC 2001 MRR 0 774 0 772 top10 88 3 87 6 fail 4 8 4 8 Unigram Query Likelihood Link URL prior i e p Q D p D Kraaij et al SIGIR 2002 Exploiting anchor text structure or links 0 382 62 1 11 7 0 340 60 7 15 9 Query example Haas Business School CS511 Advanced Database Management Systems 14 Named Page Finding Evaluation Results TREC 2002 Okapi BM25 Anchor Text Dirichlet Prior Title Anchor Text Lemur Ogilvie Callan SIGIR 2003 Best content only Query example America s century farms CS511 Advanced Database Management Systems 15 Many heuristics are proposed for Web retrieval but is there a model for Web retrieval How do we extend regular text retrieval models for Web retrieval Mostly an open question CS511 Advanced Database Management Systems 16 Free text vs Structured text So far we ve assumed free text Document word sequence Query word sequence Collection a set of documents Minimal structure But we may have structures on text e g title hyperlinks Can we exploit the structures in retrieval Sometimes structures may be more important than the text moving closer to relational databases CS511 Advanced Database Management Systems 17 Examples of Document Structures Intra doc structures relations of components Natural components title author abstract sections references Annotations named entities subtopics markups Inter doc structures relations between documents Topic hierarchy Hyperlinks citations hypertext CS511 Advanced Database Management Systems 18 Structured Text Collection General question How do we search such a collection A general topic Subtopic k Subtopic 1 CS511 Advanced Database Management Systems 19 Challenges in Structured Text Retrieval STR Challenge 1 Information Doc Unit What is an appropriate information unit Document may no longer be the most natural unit Components in a document or a whole Web site may be more appropriate Challenge 2 Query What is an appropriate query language Keyword free text query is no longer the only choice Constraints on the structures can be posed e g search in a particular
View Full Document