Stanford CS 276B - Text Retrieval and Mining Lecture

CS276B Text Retrieval and Mining, Winter 2005, Lecture 7

Evolution of search engines
- First generation: use only on-page, text data (1995-1997: AV, Excite, Lycos, etc.)
  - Word frequency, language
  - Boolean
- Second generation: use off-page, web-specific data (from 1998; made popular by Google, but everyone does it now)
  - Link (or connectivity) analysis
  - Click-through data
  - Anchor text (how people refer to this page)
- Third generation: answer "the need behind the query" (still evolving)
  - Semantic analysis: what is this about?
  - Focus on the user need rather than on the query
  - Context determination
  - Helping the user: UI, spell checking, query refinement, query suggestion, syntax-driven feedback, context help, context transfer, etc.
  - Integration of search and text analysis

Third generation search engines: answering "the need behind the query"
- Semantic analysis
  - Query language determination
    - Auto filtering
    - Different ranking (if the query is in Japanese, do not return English pages)
  - Hard & soft matches
    - Personalities (triggered on names)
    - Cities (travel info, maps)
    - Medical info (triggered on names and/or results)
    - Stock quotes, news (triggered on stock symbol)
    - Company info

Plan for today
- Review search engine history (slightly more technically than in the first lecture)
- Web crawling / corpus construction
- Connectivity servers
- Distributed crawling

Connectivity analysis
- Idea: mine the hyperlink information of the Web
- Assumptions:
  - Links often connect related pages
  - A link between pages is a recommendation ("people vote with their links")
- Classic IR work on citations = links, a.k.a. bibliometrics [Kess63, Garf72, Smal73]; see also [Lars96]
- Much web-related research builds on this idea [Piro96, Aroc97, Sper97, Carr97, Klei98, Brin98]

Answering "the need behind the query": context determination
- Context determination
  - Spatial (user location / target location)
  - Query stream (previous queries)
  - Personal (user profile)
  - Explicit (vertical search, family-friendly)
- Context use
  - Result restriction
  - Ranking modulation

Spatial context: geo-search
- Geo-coding
  - Geometrical hierarchy (squares)
  - Natural hierarchy (country, state, county, city, zip codes)
- Geo-parsing
  - Pages: infer location from phone numbers, zip codes, etc.
  - Queries: use a dictionary of place names
  - Users:
    - Explicit ("tell me your location")
    - From IP data
    - Mobile phones
- In its infancy; many issues (display size, privacy, etc.)

Helping the user
- UI
- Spell checking
- Query completion

Crawling issues
- How to crawl?
  - Quality: "best" pages first
  - Efficiency: avoid duplication (or near-duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed? How much has really changed? (Why is this a different question?)

Basic crawler operation
- Begin with known "seed" pages
- Fetch and parse them
- Extract the URLs they point to
- Place the extracted URLs on a queue
- Fetch each URL on the queue and repeat
(A minimal code sketch of this loop appears after the robots.txt example below.)

Simple picture: complications
- Web crawling isn't feasible with one machine; all of the above steps must be distributed
- Even non-malicious pages pose challenges
  - Latency and bandwidth to remote servers vary
  - Robots.txt stipulations
  - How deep should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Malicious pages
  - Spam pages (Lecture 1, plus others to be discussed)
  - Spider traps, including dynamically generated ones
- Politeness: don't hit a server too often

Robots.txt
- Protocol for giving spiders ("robots") limited access to a website, originally from 1994 (www.robotstxt.org/wc/norobots.html)
- The website announces its request on what can(not) be crawled
  - For a URL, create a file URL/robots.txt
  - This file specifies access restrictions

Robots.txt example
- No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:
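A crawler can check a policy like the example above with Python's standard urllib.robotparser module. This is just a minimal sketch of mine, not part of the lecture; the host www.example.com and the robot name "otherbot" are made up for illustration.

    # Minimal sketch (not from the slides): checking the example policy with
    # the standard-library robots.txt parser, urllib.robotparser.
    from urllib import robotparser

    rules = [
        "User-agent: *",
        "Disallow: /yoursite/temp/",
        "",
        "User-agent: searchengine",
        "Disallow:",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # "searchengine" is exempt; every other robot is barred from /yoursite/temp/
    print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/x.html"))  # True
    print(rp.can_fetch("otherbot",     "http://www.example.com/yoursite/temp/x.html"))  # False
    print(rp.can_fetch("otherbot",     "http://www.example.com/other/page.html"))       # True

In a real crawler the rules would of course be fetched from the site (e.g. via RobotFileParser.set_url and read) rather than written inline.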
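Putting the "basic crawler operation" steps together with the politeness concern, here is a small standard-library Python sketch of the fetch/parse/extract/enqueue loop. It is an illustration under my own assumptions (seed URL, one-second per-host delay, a very simple link extractor), not the crawler architecture the lecture goes on to discuss.

    # Sketch of the basic crawler loop: seed pages -> fetch -> parse out links
    # -> enqueue unseen URLs -> repeat, with a per-host politeness delay.
    import time
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect href attributes of <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100, delay=1.0):
        frontier = deque(seeds)      # queue of URLs to be fetched (BFS order)
        seen = set(seeds)            # don't enqueue the same URL twice
        last_hit = {}                # per-host fetch time, for politeness
        fetched = 0

        while frontier and fetched < max_pages:
            url = frontier.popleft()
            host = urlparse(url).netloc

            # Politeness: wait before hitting the same server again.
            wait = delay - (time.time() - last_hit.get(host, 0.0))
            if wait > 0:
                time.sleep(wait)
            last_hit[host] = time.time()

            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
            except OSError:
                continue             # unreachable page: skip it

            # Extract outlinks and place unseen absolute URLs on the queue.
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)

            fetched += 1
            yield url

    # Usage (hypothetical seed):
    # for url in crawl(["http://www.example.com/"]):
    #     print(url)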
Crawling and corpus construction
- Where do we spider next? Crawl order
- Distributed crawling
- Filtering duplicates; mirror detection

Where do we spider next?
[Diagram: the Web, with the URLs crawled and parsed so far and the URLs waiting in the queue]

Crawl order
- Want the best pages first
- Potential quality measures:
  - Final in-degree
  - Final PageRank (what's this? A measure of page quality we'll define later in the course)
- Crawl heuristics:
  - Breadth-first search (BFS)
  - Partial in-degree
  - Partial PageRank
  - Random walk
(A small code sketch of the partial in-degree heuristic appears at the end of this section.)

BFS & spam (worst-case scenario)
- Start page; normal average out-degree is 10; assume the spammer can generate dynamic pages with 1000 outlinks each
- BFS depth 2: 100 URLs on the queue, including one spam page
- BFS depth 3: about 2000 URLs on the queue, 50% belonging to the spammer (the 99 normal pages contribute roughly 1000 outlinks; the one spam page contributes 1000 more)
- BFS depth 4: about 1.01 million URLs on the queue, 99% belonging to the spammer (each of the 1000 spam pages links to 1000 further spam pages)

BFS crawling brings in high-quality pages early in the crawl
[Chart: overlap with the best x% of pages by in-degree, as a function of the fraction of the crawl completed; Stanford WebBase crawl, 179K pages, 1998 [Cho98], and a web-wide crawl, 328M pages, 2000 [Najo01]]

Queue of URLs to be fetched
- What constraints dictate which queued URL is fetched next?
  - Politeness: don't hit a server too often, even from different threads of your spider
  - How far into a site you've crawled already (most sites stay within about 5 levels of URL hierarchy)
  - Keep all spiders busy; keep spiders from treading on each other's toes

Where do we spider next?
- This is a graph traversal problem: given a directed graph you've partially visited, where do you visit next?
- Which URLs are most promising for building a high-quality corpus?
  - Avoid fetching duplicates repeatedly
  - Respect politeness / robots.txt
  - Avoid getting stuck in spider traps
  - Detect / minimize spam
  - Get the "best" pages. What's best? Best for answering search queries
- A complex scheduling optimization problem, subject to all the constraints listed, plus operational constraints (e.g., keeping all machines load-balanced)
- Scientific study: limited to specific aspects
  - Which ones? What do we measure? What are the compromises in distributed crawling?
- We follow the treatment of Cho and Garcia-Molina

Parallel crawlers
- Several crawling processes ("c-procs") crawl the web; which c-proc gets a given URL?
  [Diagram: each c-proc with its own URLs crawled and URLs in queues]
- Crawlers may be running in diverse geographies (Europe, Asia, etc.)
  - Also results in dispersed WAN load
- Communication: by URLs passed between c-procs
  - Focus on communication overhead during the crawl
- Periodically update a master index
  - Incremental update, so this is "cheap" (compression, differential update, etc.)

Crawler variations
- c-procs are independent
- Static assignment

Measurements (what do these mean?)
- Coverage = I / U, where I is the number of distinct pages fetched and U is the total number of web pages
- Overlap = (N - I) / I, where N is the number of pages fetched (counting repeats)
- Quality = sum, over downloaded pages, of their importance
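To make the coverage and overlap definitions concrete, here is a tiny sketch of mine (not from the lecture or from Cho and Garcia-Molina) that computes them from per-c-proc fetch logs; the log format and the toy numbers are assumptions.

    # Sketch (assumed inputs, not lecture code): computing the crawl
    # measurements defined above from one URL list per c-proc.

    def crawl_measurements(fetch_logs, total_web_pages, importance=None):
        """fetch_logs: list of URL lists, one per c-proc.
        total_web_pages: U, an estimate of the size of the Web.
        importance: optional dict mapping URL -> importance score."""
        all_fetches = [url for log in fetch_logs for url in log]
        n = len(all_fetches)                 # N: pages fetched, repeats included
        distinct = set(all_fetches)
        i = len(distinct)                    # I: distinct pages fetched
        coverage = i / total_web_pages       # I / U
        overlap = (n - i) / i                # (N - I) / I
        quality = (sum(importance.get(u, 0) for u in distinct)
                   if importance is not None else None)
        return coverage, overlap, quality

    # Toy run: two c-procs that both fetched "c"; pretend the Web has 10 pages.
    print(crawl_measurements([["a", "b", "c"], ["c", "d"]], total_web_pages=10))
    # -> (0.4, 0.25, None)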
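Returning to the crawl-ordering heuristics from the Crawl Order slide, the sketch below illustrates one reading of the "partial in-degree" idea (my own illustration, not the lecture's code): among queued URLs, fetch next the one pointed to most often by the pages crawled so far. The class and method names are mine; a production crawler would use a real priority queue rather than a linear scan.

    # Sketch of the "partial in-degree" crawl-ordering heuristic (illustrative
    # names and structure; linear scan used for brevity).
    from collections import defaultdict

    class PartialIndegreeFrontier:
        def __init__(self, seeds):
            self.partial_indegree = defaultdict(int)  # links seen so far into each URL
            self.queued = set(seeds)

        def record_links(self, outlinks, crawled):
            """Register the outlinks found on a page we just crawled."""
            for url in outlinks:
                self.partial_indegree[url] += 1
                if url not in crawled:
                    self.queued.add(url)

        def next_url(self):
            """Pop the queued URL with the highest partial in-degree so far."""
            best = max(self.queued, key=lambda u: self.partial_indegree[u])
            self.queued.remove(best)
            return best

    # Usage: after fetching and parsing a page, call record_links with its
    # outlinks and the set of already-crawled URLs; next_url() then returns the
    # queued URL with the most known in-links (ties broken arbitrarily).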

