CS276B Text Retrieval and Mining
Winter 2005, Lecture 7

Plan for today
- Review search engine history (slightly more technically than in the first lecture)
- Web crawling / corpus construction
- Distributed crawling
- Connectivity servers

Evolution of search engines
- First generation (1995-1997: AltaVista, Excite, Lycos, etc.) – use only "on page" text data
  - Word frequency, language
  - Boolean
- Second generation (from 1998; made popular by Google, but everyone does it now) – use off-page, web-specific data
  - Link (or connectivity) analysis
  - Click-through data
  - Anchor text (how people refer to this page)
- Third generation (still evolving) – answer "the need behind the query"
  - Semantic analysis – what is this about?
  - Focus on user need, rather than on the query
  - Context determination
  - Helping the user: UI, spell checking, query refinement, query suggestion, syntax-driven feedback, context help, context transfer, etc.
  - Integration of search and text analysis

Connectivity analysis
- Idea: mine the hyperlink information of the Web
- Assumptions:
  - Links often connect related pages
  - A link between pages is a recommendation – "people vote with their links"
- Classic IR work (citations = links), a.k.a. bibliometrics [Kess63, Garf72, Smal73, ...]; see also [Lars96]
- Much Web-related research builds on this idea [Piro96, Aroc97, Sper97, Carr97, Klei98, Brin98, ...]

Third generation search engine: answering "the need behind the query"
- Semantic analysis
- Query language determination
  - Auto filtering
  - Different ranking (if the query is in Japanese, do not return English pages)
- Hard & soft matches
  - Personalities (triggered on names)
  - Cities (travel info, maps)
  - Medical info (triggered on names and/or results)
  - Stock quotes, news (triggered on stock symbols)
  - Company info
  - ...

Answering "the need behind the query"
- Context determination
  - Spatial (user location / target location)
  - Query stream (previous queries)
  - Personal (user profile)
  - Explicit (vertical search, family friendly)
- Context use
  - Result restriction
  - Ranking modulation

Spatial context – geo-search
- Geo-coding
  - Geometrical hierarchy (squares)
  - Natural hierarchy (country, state, county, city, zip codes)
- Geo-parsing
  - Pages (infer from phone numbers, zip codes, etc.)
  - Queries (use a dictionary of place names)
  - Users
    - Explicit (tell me your location)
    - From IP data
    - Mobile phones
- In its infancy; many issues (display size, privacy, etc.)

Helping the user
- UI
- Spell checking
- Query completion
- ...

Crawling

Crawling issues
- How to crawl?
  - Quality: "best" pages first
  - Efficiency: avoid duplication (or near duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed? How much has really changed? (Why is this a different question?)

Basic crawler operation
- Begin with known "seed" pages
- Fetch and parse them
- Extract the URLs they point to
- Place the extracted URLs on a queue
- Fetch each URL on the queue and repeat
(A minimal code sketch of this loop follows.)
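This loop can be sketched in a few lines of Python. The sketch below is a single-machine illustration only: the seed URL, the page limit, and the regex-based link extraction are assumptions made for brevity (a production crawler would use a real HTML parser, honor robots.txt, and distribute the work, as discussed next).

    # Minimal single-machine crawler: seed queue -> fetch -> parse -> enqueue new URLs.
    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin

    HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)   # crude link extractor

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)        # queue of URLs to be fetched
        seen = set(seed_urls)              # URLs already queued, to avoid duplicates
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                   # skip pages that fail to fetch
            fetched += 1
            for link in HREF_RE.findall(html):
                absolute = urljoin(url, link)              # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
            yield url, html                # hand the page to the indexer

    if __name__ == "__main__":
        for page_url, _ in crawl(["http://example.com/"], max_pages=10):
            print("fetched", page_url)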
Simple picture – complications
- Web crawling isn't feasible with one machine
  - All of the above steps must be distributed
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Robots.txt stipulations
  - How "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Malicious pages
  - Spam pages (Lecture 1, plus others to be discussed)
  - Spider traps – including dynamically generated ones
- Politeness – don't hit a server too often

Robots.txt
- Protocol for giving spiders ("robots") limited access to a website, originally from 1994
  - www.robotstxt.org/wc/norobots.html
- A website announces its requests about what can(not) be crawled
  - For a URL, create a file URL/robots.txt
  - This file specifies access restrictions

Robots.txt example
- No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:

(A sketch of checking URLs against these rules follows.)
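As an illustration of how a crawler might honor such rules, Python's standard urllib.robotparser module can parse the example above and answer per-agent fetch queries. The crawler name "mycrawler" and the example.com URLs below are made up for this sketch.

    # Check URLs against the example robots.txt rules using the standard library parser.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /yoursite/temp/",
        "",
        "User-agent: searchengine",
        "Disallow:",
    ]

    rp = RobotFileParser()
    rp.parse(rules)   # in a live crawler: rp.set_url("http://site/robots.txt"); rp.read()

    # The generic agent is blocked from /yoursite/temp/, while "searchengine" is not.
    print(rp.can_fetch("mycrawler", "http://www.example.com/yoursite/temp/page.html"))     # False
    print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))  # True
    print(rp.can_fetch("mycrawler", "http://www.example.com/other/page.html"))             # True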
Crawling and corpus construction
- Crawl order
- Distributed crawling
- Filtering duplicates
- Mirror detection

Where do we spider next?
- (Figure: the Web, divided into URLs already crawled and parsed vs. URLs still in the queue)

Crawl order
- Want the best pages first
- Potential quality measures:
  - Final in-degree
  - Final PageRank (a measure of page quality we'll define later in the course)
- Crawl heuristics:
  - Breadth-first search (BFS)
  - Partial in-degree
  - Partial PageRank
  - Random walk
(A sketch of an in-degree-ordered frontier follows.)
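To make the contrast between plain BFS and the quality-driven heuristics concrete, here is a minimal sketch of a frontier ordered by partial in-degree, i.e. the number of links to a URL seen so far during the crawl. The class and method names are hypothetical, and stale heap entries are handled lazily; this illustrates the idea rather than the implementation used in the studies cited below.

    # Crawl frontier that pops the queued URL with the highest partial in-degree.
    import heapq
    from collections import defaultdict

    class PriorityFrontier:
        def __init__(self, seeds):
            self.indegree = defaultdict(int)          # links observed so far, per URL
            self.queued = set(seeds)                  # URLs waiting to be crawled
            self.heap = [(0, url) for url in seeds]   # (negated partial in-degree, URL)
            heapq.heapify(self.heap)

        def add_link(self, target_url):
            """Record one more observed link to target_url and (re)queue it."""
            # A real crawler would also skip URLs that have already been crawled.
            self.indegree[target_url] += 1
            self.queued.add(target_url)
            # Lazy update: push a fresh entry; stale ones are skipped in pop_best().
            heapq.heappush(self.heap, (-self.indegree[target_url], target_url))

        def pop_best(self):
            """Return the queued URL with the highest partial in-degree, or None."""
            while self.heap:
                neg_deg, url = heapq.heappop(self.heap)
                if url in self.queued and -neg_deg == self.indegree[url]:
                    self.queued.discard(url)
                    return url
            return None

Plain BFS corresponds to popping in insertion order and ignoring these counts; partial PageRank ordering replaces the in-degree counter with PageRank scores recomputed periodically over the pages crawled so far.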
BFS & spam (worst-case scenario)
- Assume a normal average out-degree of 10, and a spammer able to generate dynamic pages with 1000 out-links each
- (Figure: crawl tree rooted at the start page)
- BFS depth = 2: 100 URLs on the queue, including one spam page
- BFS depth = 3: 2,000 URLs on the queue; 50% belong to the spammer
- BFS depth = 4: 1.01 million URLs on the queue; 99% belong to the spammer
- (At depth 3, the ~99 normal pages contribute roughly 1,000 links while the single spam page contributes 1,000; at depth 4, those 1,000 spam pages contribute 1,000,000 further spam URLs.)

Stanford WebBase crawl (179K pages, 1998) [Cho98] and web-wide crawl (328M pages, 2000) [Najo01]
- (Figures: overlap with the best x% of pages by in-degree, plotted against the x% crawled in order O(u))
- BFS crawling brings in high-quality pages early in the crawl

Queue of URLs to be fetched
- What constraints dictate which queued URL is fetched next?
  - Politeness – don't hit a server too often, even from different threads of your spider
  - How far into a site you've crawled already – for most sites, stay at ≤ 5 levels of the URL hierarchy
  - Which URLs are most promising for building a high-quality corpus
- This is a graph traversal
(A politeness sketch follows.)
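One common way to enforce the politeness constraint is per-host bookkeeping: record when each host was last contacted and defer URLs whose host was hit too recently. A minimal single-threaded sketch, with an illustrative 5-second gap (an assumption, not a recommended value); real crawlers typically keep per-host sub-queues, many fetch threads, and adaptive delays.

    # Per-host politeness: defer a fetch if its host was contacted too recently.
    import time
    from collections import deque
    from urllib.parse import urlparse

    MIN_GAP_SECONDS = 5.0               # illustrative per-host delay

    class PoliteFrontier:
        def __init__(self, seeds):
            self.queue = deque(seeds)
            self.last_hit = {}          # host -> monotonic time of its last fetch

        def next_url(self):
            """Pop the next URL whose host was not contacted within MIN_GAP_SECONDS."""
            for _ in range(len(self.queue)):
                url = self.queue.popleft()
                host = urlparse(url).netloc
                if time.monotonic() - self.last_hit.get(host, float("-inf")) >= MIN_GAP_SECONDS:
                    self.last_hit[host] = time.monotonic()
                    return url
                self.queue.append(url)  # this host was hit too recently; rotate to the back
            return None                 # every queued host still needs to cool down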