CS276B Text Retrieval and Mining
Winter 2005, Lecture 9

Plan for today
- Web size estimation
- Mirror/duplication detection
- Pagerank

Size of the web

What is the size of the web?
- Issues:
  - The web is really infinite
    - Dynamic content, e.g., calendars
    - Soft 404s: www.yahoo.com/anything returns a valid page
  - The static web contains syntactic duplication, mostly due to mirroring (~20-30%)
  - Some servers are seldom connected
- Who cares?
  - The media, and consequently the users
  - Engine design
  - Engine crawl policy, and its impact on recall

What can we attempt to measure?
- The relative size of search engines
  - The notion of a page being indexed is still reasonably well defined
  - But already there are problems:
    - Document extension: e.g., Google indexes pages it has not yet crawled by indexing their anchor text
    - Document restriction: some engines restrict what is indexed (first n words, only relevant words, etc.)
- The coverage of a search engine relative to another particular crawling process

Statistical methods
- Random queries
- Random searches
- Random IP addresses
- Random walks

URL sampling via random queries
- Ideal strategy: generate a random URL and check for containment in each index
- Problem: random URLs are hard to find!

Random queries [Bhar98a]
- Sample URLs randomly from each engine
  - 20,000 random URLs from each engine
  - Issue a random conjunctive query with <200 results
  - Select a random URL from the top 200 results
- Test whether each sampled URL is present in the other engines
  - Query with the URL's 8 rarest words; look for a URL match
- Compute intersection and size ratio
  - If Intersection = x% of E1 = y% of E2, then E1/E2 = y/x
- Issues:
  - Random narrow queries may bias towards long documents (verify with disjunctive queries)
  - Other biases induced by the process

Random searches
- Choose random searches extracted from a local query log [Lawr97], or build "random searches" [Note02]
  - Use only queries with small result sets
  - Count normalized URLs in the result sets
  - Use ratio statistics
- Advantage: might be a good reflection of the human perception of coverage

Random searches [Lawr98, Lawr99]
- 575 and 1050 queries from the NEC RI employee logs
- 6 engines in 1998, 11 in 1999
- Implementation:
  - Restricted to queries with < 600 results in total
  - Counted URLs from each engine after verifying the query match
  - Computed size ratio and overlap for individual queries
  - Estimated index size ratio and overlap by averaging over all queries
- Issues:
  - Samples are correlated with the source of the log
  - Duplicates
  - Technical statistical problems (must have non-zero result counts; averaging ratios: use the harmonic mean?)

Queries from the Lawrence and Giles study
- adaptive access control; neighborhood preservation topographic; hamiltonian structures; right linear grammar; pulse width modulation neural; unbalanced prior probabilities; ranked assignment method; internet explorer favourites importing; karvel thornber; zili liu
- softmax activation function; bose multidimensional system theory; gamma mlp; dvi2pdf; john oliensis; rieke spikes exploring neural; video watermarking; counterpropagation network; fat shattering dimension; abelson amorphous computing

Size of the Web estimation [Lawr98, Bhar98a]
- Capture-recapture technique
  - Assumes engines index independent random subsets of the Web
  - E2 contains x% of E1; assume E2 contains x% of the Web as well
  - Knowing the size of E2, compute the size of the Web: Size of the Web = 100 * E2 / x
- Bharat & Broder: 200M (Nov 97), 275M (Mar 98)
- Lawrence & Giles: 320M (Dec 97)

Random IP addresses [Lawr99]
- Generate random IP addresses
- Find, if possible, a web server at the given address
- Collect all pages from that server
- Advantage: clean statistics, independent of any crawling strategy

Random IP addresses [ONei97, Lawr99]
- HTTP requests to random IP addresses
  - Ignored: empty responses, authorization required, or excluded servers
- [Lawr99] estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers
  - OCLC, using IP sampling, found 8.7M hosts in 2001
  - Netcraft [Netc02] accessed 37.2 million hosts in July 2002
- [Lawr99] exhaustively crawled the 2500 servers and extrapolated
  - Estimated size of the web: 800 million pages
  - Estimated use of metadata descriptors: meta tags (keywords, description) in 34% of home pages; Dublin Core metadata in 0.3%
- Issues:
  - Virtual hosting
  - A server might not accept http://102.93.22.15
  - No guarantee that all pages are linked to the root page
  - The power-law distribution of pages per host generates bias

Random walks [Henz99, BarY00, Rusm01]
- View the Web as a directed graph, built from a given list of seeds
- Build a random walk on this graph
  - Includes various "jump" rules back to visited sites
  - Converges to a stationary distribution
    - Time to convergence not really known
  - Sample from the stationary distribution of the walk
  - Use the "small result set query" method to check coverage by each engine
- A "statistically clean" method, at least in theory!
- Issues:
  - The list of seeds is a problem
  - The practical approximation might not be valid: non-uniform distribution, subject to link spamming
  - Still has all the problems associated with "strong queries"

Conclusions
- No sampling solution is perfect
- Lots of new ideas, but the problem is getting harder
- Quantitative studies are fascinating and a good research problem

Duplicates and mirrors
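The capture-recapture arithmetic from the size-estimation slides above (E1/E2 = y/x, and Web = 100 * E2 / x) can be sketched in a few lines of Python. This is a toy illustration, not the Bharat-Broder implementation: the "web", the two engine indexes, and the samples below are all synthetic, and the membership predicates stand in for the strong-query URL checks described on the slides.

```python
import random

def capture_recapture(sample_e1, in_e2, sample_e2, in_e1, e2_size):
    """Estimate |E1|/|E2| and total web size from overlap of random samples.

    sample_e1/sample_e2: URLs sampled uniformly from each engine's index.
    in_e2/in_e1: membership tests ("is this URL indexed by the other engine?"),
    which in practice would be a strong query on the URL's 8 rarest words.
    e2_size: the known (or claimed) index size of E2.
    Assumes each engine indexes an independent random subset of the web.
    """
    x = sum(in_e2(u) for u in sample_e1) / len(sample_e1)  # E2 covers x of E1
    y = sum(in_e1(u) for u in sample_e2) / len(sample_e2)  # E1 covers y of E2
    # Intersection = x*E1 = y*E2  =>  E1/E2 = y/x;  Web ~ E2/x.
    return y / x, e2_size / x

# Toy check: a 100,000-page "web"; E1 indexes half of it, E2 a quarter.
random.seed(0)
web = range(100_000)
e1 = set(random.sample(web, 50_000))
e2 = set(random.sample(web, 25_000))
s1 = random.sample(sorted(e1), 2_000)
s2 = random.sample(sorted(e2), 2_000)
ratio, web_size = capture_recapture(s1, e2.__contains__, s2, e1.__contains__, len(e2))
print(round(ratio, 2), round(web_size))
```

Because the two synthetic indexes really are independent random subsets, the estimated ratio should come out close to 2 and the web estimate close to 100,000, up to sampling noise; with real engines the independence assumption is exactly what fails.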
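The random-walk method from the slides above can also be sketched. This is a minimal toy, not the actual [Henz99, BarY00, Rusm01] walks: the link graph is invented, and the jump rule (uniform jump back to a previously visited page) is a simplification; the published methods use more careful jump rules and correct for the walk's bias toward high-degree pages.

```python
import random
from collections import Counter

def walk_sample(graph, seeds, steps=50_000, jump_prob=0.15, seed=0):
    """Approximate sampling from a random walk's stationary distribution.

    graph: dict url -> list of out-links discovered so far.
    At each step, with probability jump_prob (or at a dead end) the walk
    "jumps" back to a uniformly chosen previously visited page; otherwise
    it follows a random out-link.  Normalized visit counts approximate the
    stationary distribution; sampled pages would then be checked against a
    search engine with "small result set" queries.
    """
    rng = random.Random(seed)
    visited = list(seeds)          # distinct pages seen so far (jump targets)
    seen = set(visited)
    node = rng.choice(visited)
    counts = Counter()
    for _ in range(steps):
        counts[node] += 1
        out = graph.get(node, [])
        if not out or rng.random() < jump_prob:
            node = rng.choice(visited)   # jump rule: back to a visited site
        else:
            node = rng.choice(out)       # follow a random out-link
        if node not in seen:
            seen.add(node)
            visited.append(node)
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}

# Tiny invented link graph: page "d" is a seed that nothing links to, so
# the walk should spend far less time there than on the a/b/c cycle.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
freqs = walk_sample(toy, seeds=["a", "d"])
print({n: round(f, 3) for n, f in sorted(freqs.items())})
```

The toy also illustrates the slides' caveat about seeds and non-uniformity: "d" is visited only via jumps, so its stationary weight depends entirely on the jump rule rather than on the link structure.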