CS276B Text Retrieval and Mining
Winter 2005, Lecture 7

Plan for today
- Review search engine history (slightly more technically than in the first lecture)
- Web crawling / corpus construction
- Distributed crawling
- Connectivity servers

Evolution of search engines
- First generation (1995-1997: AltaVista, Excite, Lycos, etc.) – use only "on page" text data
  - Word frequency, language
  - Boolean
- Second generation (from 1998; made popular by Google, but everyone does it now) – use off-page, web-specific data
  - Link (or connectivity) analysis
  - Click-through data
  - Anchor text (how people refer to this page)
- Third generation (still evolving) – answer "the need behind the query"
  - Semantic analysis – what is this about?
  - Focus on user need, rather than on the query
  - Context determination
  - Helping the user: UI, spell checking, query refinement, query suggestion, syntax-driven feedback, context help, context transfer, etc.
  - Integration of search and text analysis

Connectivity analysis
- Idea: mine the hyperlink information of the Web
- Assumptions:
  - Links often connect related pages
  - A link between pages is a recommendation – "people vote with their links"
- Classic IR work (citations = links), a.k.a. bibliometrics [Kess63, Garf72, Smal73, ...]; see also [Lars96]
- Much Web-related research builds on this idea [Piro96, Aroc97, Sper97, Carr97, Klei98, Brin98, ...]

Third generation search engine: answering "the need behind the query"
- Semantic analysis
- Query language determination
  - Auto filtering
  - Different ranking (if the query is in Japanese, do not return English pages)
- Hard & soft matches
  - Personalities (triggered on names)
  - Cities (travel info, maps)
  - Medical info (triggered on names and/or results)
  - Stock quotes, news (triggered on stock symbols)
  - Company info
  - ...

Answering "the need behind the query"
- Context determination
  - Spatial (user location / target location)
  - Query stream (previous queries)
  - Personal (user profile)
  - Explicit (vertical search, family friendly)
- Context use
  - Result restriction
  - Ranking modulation

Spatial context – geo-search
- Geo-coding
  - Geometrical hierarchy (squares)
  - Natural hierarchy (country, state, county, city, zip codes)
- Geo-parsing
  - Pages (infer from phone numbers, zip codes, etc.)
  - Queries (use a dictionary of place names)
  - Users
    - Explicit (tell me your location)
    - From IP data
    - Mobile phones
- In its infancy; many issues (display size, privacy, etc.)

Helping the user
- UI
- Spell checking
- Query completion
- ...

Crawling

Crawling issues
- How to crawl?
  - Quality: "best" pages first
  - Efficiency: avoid duplication (or near duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed? How much has really changed? (Why is this a different question?)

Basic crawler operation
- Begin with known "seed" pages
- Fetch and parse them
- Extract the URLs they point to
- Place the extracted URLs on a queue
- Fetch each URL on the queue and repeat
(A minimal code sketch of this loop follows.)
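This loop can be sketched in a few lines of Python. The sketch below is a single-machine illustration only: the seed URL, the page limit, and the regex-based link extraction are assumptions made for brevity (a production crawler would use a real HTML parser, honor robots.txt, and distribute the work, as discussed next).

    # Minimal single-machine crawler: seed queue -> fetch -> parse -> enqueue new URLs.
    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin

    HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)   # crude link extractor

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)        # queue of URLs to be fetched
        seen = set(seed_urls)              # URLs already queued, to avoid duplicates
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                   # skip pages that fail to fetch
            fetched += 1
            for link in HREF_RE.findall(html):
                absolute = urljoin(url, link)              # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
            yield url, html                # hand the page to the indexer

    if __name__ == "__main__":
        for page_url, _ in crawl(["http://example.com/"], max_pages=10):
            print("fetched", page_url)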
Simple picture – complications
- Web crawling isn't feasible with one machine
  - All of the above steps must be distributed
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Robots.txt stipulations
  - How "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Malicious pages
  - Spam pages (Lecture 1, plus others to be discussed)
  - Spider traps – including dynamically generated ones
- Politeness – don't hit a server too often

Robots.txt
- Protocol for giving spiders ("robots") limited access to a website, originally from 1994
  - www.robotstxt.org/wc/norobots.html
- A website announces its requests about what can(not) be crawled
  - For a URL, create a file URL/robots.txt
  - This file specifies access restrictions

Robots.txt example
- No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:

(A sketch of checking URLs against these rules follows.)
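As an illustration of how a crawler might honor such rules, Python's standard urllib.robotparser module can parse the example above and answer per-agent fetch queries. The crawler name "mycrawler" and the example.com URLs below are made up for this sketch.

    # Check URLs against the example robots.txt rules using the standard library parser.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /yoursite/temp/",
        "",
        "User-agent: searchengine",
        "Disallow:",
    ]

    rp = RobotFileParser()
    rp.parse(rules)   # in a live crawler: rp.set_url("http://site/robots.txt"); rp.read()

    # The generic agent is blocked from /yoursite/temp/, while "searchengine" is not.
    print(rp.can_fetch("mycrawler", "http://www.example.com/yoursite/temp/page.html"))     # False
    print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))  # True
    print(rp.can_fetch("mycrawler", "http://www.example.com/other/page.html"))             # True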
Crawling and corpus construction
- Crawl order
- Distributed crawling
- Filtering duplicates
- Mirror detection

Where do we spider next?
- (Figure: the Web, divided into URLs already crawled and parsed vs. URLs still in the queue)

Crawl order
- Want the best pages first
- Potential quality measures:
  - Final in-degree
  - Final PageRank (a measure of page quality we'll define later in the course)
- Crawl heuristics:
  - Breadth-first search (BFS)
  - Partial in-degree
  - Partial PageRank
  - Random walk
(A sketch of an in-degree-ordered frontier follows.)
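To make the contrast between plain BFS and the quality-driven heuristics concrete, here is a minimal sketch of a frontier ordered by partial in-degree, i.e. the number of links to a URL seen so far during the crawl. The class and method names are hypothetical, and stale heap entries are handled lazily; this illustrates the idea rather than the implementation used in the studies cited below.

    # Crawl frontier that pops the queued URL with the highest partial in-degree.
    import heapq
    from collections import defaultdict

    class PriorityFrontier:
        def __init__(self, seeds):
            self.indegree = defaultdict(int)          # links observed so far, per URL
            self.queued = set(seeds)                  # URLs waiting to be crawled
            self.heap = [(0, url) for url in seeds]   # (negated partial in-degree, URL)
            heapq.heapify(self.heap)

        def add_link(self, target_url):
            """Record one more observed link to target_url and (re)queue it."""
            # A real crawler would also skip URLs that have already been crawled.
            self.indegree[target_url] += 1
            self.queued.add(target_url)
            # Lazy update: push a fresh entry; stale ones are skipped in pop_best().
            heapq.heappush(self.heap, (-self.indegree[target_url], target_url))

        def pop_best(self):
            """Return the queued URL with the highest partial in-degree, or None."""
            while self.heap:
                neg_deg, url = heapq.heappop(self.heap)
                if url in self.queued and -neg_deg == self.indegree[url]:
                    self.queued.discard(url)
                    return url
            return None

Plain BFS corresponds to popping in insertion order and ignoring these counts; partial PageRank ordering replaces the in-degree counter with PageRank scores recomputed periodically over the pages crawled so far.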
BFS & spam (worst-case scenario)
- Assume a normal average out-degree of 10, and a spammer able to generate dynamic pages with 1000 out-links each
- (Figure: crawl tree rooted at the start page)
- BFS depth = 2: 100 URLs on the queue, including one spam page
- BFS depth = 3: 2,000 URLs on the queue; 50% belong to the spammer
- BFS depth = 4: 1.01 million URLs on the queue; 99% belong to the spammer
- (At depth 3, the ~99 normal pages contribute roughly 1,000 links while the single spam page contributes 1,000; at depth 4, those 1,000 spam pages contribute 1,000,000 further spam URLs.)

Stanford WebBase crawl (179K pages, 1998) [Cho98] and web-wide crawl (328M pages, 2000) [Najo01]
- (Figures: overlap with the best x% of pages by in-degree, plotted against the x% crawled in order O(u))
- BFS crawling brings in high-quality pages early in the crawl

Queue of URLs to be fetched
- What constraints dictate which queued URL is fetched next?
  - Politeness – don't hit a server too often, even from different threads of your spider
  - How far into a site you've crawled already – for most sites, stay at ≤ 5 levels of the URL hierarchy
  - Which URLs are most promising for building a high-quality corpus
- This is a graph traversal
(A politeness sketch follows.)
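One common way to enforce the politeness constraint is per-host bookkeeping: record when each host was last contacted and defer URLs whose host was hit too recently. A minimal single-threaded sketch, with an illustrative 5-second gap (an assumption, not a recommended value); real crawlers typically keep per-host sub-queues, many fetch threads, and adaptive delays.

    # Per-host politeness: defer a fetch if its host was contacted too recently.
    import time
    from collections import deque
    from urllib.parse import urlparse

    MIN_GAP_SECONDS = 5.0               # illustrative per-host delay

    class PoliteFrontier:
        def __init__(self, seeds):
            self.queue = deque(seeds)
            self.last_hit = {}          # host -> monotonic time of its last fetch

        def next_url(self):
            """Pop the next URL whose host was not contacted within MIN_GAP_SECONDS."""
            for _ in range(len(self.queue)):
                url = self.queue.popleft()
                host = urlparse(url).netloc
                if time.monotonic() - self.last_hit.get(host, float("-inf")) >= MIN_GAP_SECONDS:
                    self.last_hit[host] = time.monotonic()
                    return url
                self.queue.append(url)  # this host was hit too recently; rotate to the back
            return None                 # every queued host still needs to cool down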