CS276B Text Retrieval and Mining
Winter 2005, Lecture 9

Plan for today
- Web size estimation
- Mirror/duplication detection
- Pagerank

Size of the web

What is the size of the web?
- Issues:
  - The web is really infinite
    - Dynamic content, e.g., calendars
    - Soft 404s: www.yahoo.com/anything returns a valid page
  - The static web contains syntactic duplication, mostly due to mirroring (~20-30%)
  - Some servers are seldom connected
- Who cares?
  - The media, and consequently the users
  - Engine design
  - Engine crawl policy, and its impact on recall

What can we attempt to measure?
- The relative size of search engines
  - The notion of a page being indexed is still reasonably well defined
  - But already there are problems:
    - Document extension: e.g., Google indexes pages it has not yet crawled by indexing their anchor text
    - Document restriction: some engines restrict what is indexed (first n words, only relevant words, etc.)
- The coverage of a search engine relative to another particular crawling process

Statistical methods
- Random queries
- Random searches
- Random IP addresses
- Random walks

URL sampling via random queries
- Ideal strategy: generate a random URL and check for containment in each index
- Problem: random URLs are hard to find!

Random queries [Bhar98a]
- Sample URLs randomly from each engine
  - 20,000 random URLs from each engine
  - Issue a random conjunctive query with <200 results
  - Select a random URL from the top 200 results
- Test whether each sampled URL is present in the other engines
  - Query with the URL's 8 rarest words; look for a URL match
- Compute intersection and size ratio
  - If Intersection = x% of E1 = y% of E2, then E1/E2 = y/x
- Issues:
  - Random narrow queries may bias towards long documents (verify with disjunctive queries)
  - Other biases induced by the process

Random searches
- Choose random searches extracted from a local query log [Lawr97], or build "random searches" [Note02]
  - Use only queries with small result sets
  - Count normalized URLs in the result sets
  - Use ratio statistics
- Advantage: might be a good reflection of the human perception of coverage

Random searches [Lawr98, Lawr99]
- 575 and 1050 queries from the NEC RI employee logs
- 6 engines in 1998, 11 in 1999
- Implementation:
  - Restricted to queries with < 600 results in total
  - Counted URLs from each engine after verifying the query match
  - Computed size ratio and overlap for individual queries
  - Estimated index size ratio and overlap by averaging over all queries
- Issues:
  - Samples are correlated with the source of the log
  - Duplicates
  - Technical statistical problems (must have non-zero result counts; averaging ratios: use the harmonic mean?)

Queries from the Lawrence and Giles study
- adaptive access control; neighborhood preservation topographic; hamiltonian structures; right linear grammar; pulse width modulation neural; unbalanced prior probabilities; ranked assignment method; internet explorer favourites importing; karvel thornber; zili liu
- softmax activation function; bose multidimensional system theory; gamma mlp; dvi2pdf; john oliensis; rieke spikes exploring neural; video watermarking; counterpropagation network; fat shattering dimension; abelson amorphous computing

Size of the Web estimation [Lawr98, Bhar98a]
- Capture-recapture technique
  - Assumes engines index independent random subsets of the Web
  - E2 contains x% of E1; assume E2 contains x% of the Web as well
  - Knowing the size of E2, compute the size of the Web: Size of the Web = 100 * E2 / x
- Bharat & Broder: 200M (Nov 97), 275M (Mar 98)
- Lawrence & Giles: 320M (Dec 97)

Random IP addresses [Lawr99]
- Generate random IP addresses
- Find, if possible, a web server at the given address
- Collect all pages from that server
- Advantage: clean statistics, independent of any crawling strategy

Random IP addresses [ONei97, Lawr99]
- HTTP requests to random IP addresses
  - Ignored: empty responses, authorization required, or excluded servers
- [Lawr99] estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers
  - OCLC, using IP sampling, found 8.7M hosts in 2001
  - Netcraft [Netc02] accessed 37.2 million hosts in July 2002
- [Lawr99] exhaustively crawled the 2500 servers and extrapolated
  - Estimated size of the web: 800 million pages
  - Estimated use of metadata descriptors: meta tags (keywords, description) in 34% of home pages; Dublin Core metadata in 0.3%
- Issues:
  - Virtual hosting
  - A server might not accept http://102.93.22.15
  - No guarantee that all pages are linked to the root page
  - The power-law distribution of pages per host generates bias

Random walks [Henz99, BarY00, Rusm01]
- View the Web as a directed graph, built from a given list of seeds
- Build a random walk on this graph
  - Includes various "jump" rules back to visited sites
  - Converges to a stationary distribution
    - Time to convergence not really known
  - Sample from the stationary distribution of the walk
  - Use the "small result set query" method to check coverage by each engine
- A "statistically clean" method, at least in theory!
- Issues:
  - The list of seeds is a problem
  - The practical approximation might not be valid: non-uniform distribution, subject to link spamming
  - Still has all the problems associated with "strong queries"

Conclusions
- No sampling solution is perfect
- Lots of new ideas, but the problem is getting harder
- Quantitative studies are fascinating and a good research problem

Duplicates and mirrors
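The capture-recapture arithmetic from the size-estimation slides above (E1/E2 = y/x, and Web = 100 * E2 / x) can be sketched in a few lines of Python. This is a toy illustration, not the Bharat-Broder implementation: the "web", the two engine indexes, and the samples below are all synthetic, and the membership predicates stand in for the strong-query URL checks described on the slides.

```python
import random

def capture_recapture(sample_e1, in_e2, sample_e2, in_e1, e2_size):
    """Estimate |E1|/|E2| and total web size from overlap of random samples.

    sample_e1/sample_e2: URLs sampled uniformly from each engine's index.
    in_e2/in_e1: membership tests ("is this URL indexed by the other engine?"),
    which in practice would be a strong query on the URL's 8 rarest words.
    e2_size: the known (or claimed) index size of E2.
    Assumes each engine indexes an independent random subset of the web.
    """
    x = sum(in_e2(u) for u in sample_e1) / len(sample_e1)  # E2 covers x of E1
    y = sum(in_e1(u) for u in sample_e2) / len(sample_e2)  # E1 covers y of E2
    # Intersection = x*E1 = y*E2  =>  E1/E2 = y/x;  Web ~ E2/x.
    return y / x, e2_size / x

# Toy check: a 100,000-page "web"; E1 indexes half of it, E2 a quarter.
random.seed(0)
web = range(100_000)
e1 = set(random.sample(web, 50_000))
e2 = set(random.sample(web, 25_000))
s1 = random.sample(sorted(e1), 2_000)
s2 = random.sample(sorted(e2), 2_000)
ratio, web_size = capture_recapture(s1, e2.__contains__, s2, e1.__contains__, len(e2))
print(round(ratio, 2), round(web_size))
```

Because the two synthetic indexes really are independent random subsets, the estimated ratio should come out close to 2 and the web estimate close to 100,000, up to sampling noise; with real engines the independence assumption is exactly what fails.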
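The random-walk method from the slides above can also be sketched. This is a minimal toy, not the actual [Henz99, BarY00, Rusm01] walks: the link graph is invented, and the jump rule (uniform jump back to a previously visited page) is a simplification; the published methods use more careful jump rules and correct for the walk's bias toward high-degree pages.

```python
import random
from collections import Counter

def walk_sample(graph, seeds, steps=50_000, jump_prob=0.15, seed=0):
    """Approximate sampling from a random walk's stationary distribution.

    graph: dict url -> list of out-links discovered so far.
    At each step, with probability jump_prob (or at a dead end) the walk
    "jumps" back to a uniformly chosen previously visited page; otherwise
    it follows a random out-link.  Normalized visit counts approximate the
    stationary distribution; sampled pages would then be checked against a
    search engine with "small result set" queries.
    """
    rng = random.Random(seed)
    visited = list(seeds)          # distinct pages seen so far (jump targets)
    seen = set(visited)
    node = rng.choice(visited)
    counts = Counter()
    for _ in range(steps):
        counts[node] += 1
        out = graph.get(node, [])
        if not out or rng.random() < jump_prob:
            node = rng.choice(visited)   # jump rule: back to a visited site
        else:
            node = rng.choice(out)       # follow a random out-link
        if node not in seen:
            seen.add(node)
            visited.append(node)
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}

# Tiny invented link graph: page "d" is a seed that nothing links to, so
# the walk should spend far less time there than on the a/b/c cycle.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
freqs = walk_sample(toy, seeds=["a", "d"])
print({n: round(f, 3) for n, f in sorted(freqs.items())})
```

The toy also illustrates the slides' caveat about seeds and non-uniformity: "d" is visited only via jumps, so its stationary weight depends entirely on the jump rule rather than on the link structure.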