CS276B Text Retrieval and Mining
Winter 2005, Lecture 9

Plan for today
- Web size estimation
- Mirror/duplication detection
- PageRank

Size of the web
What is the size of the web? Issues:
- The web is really infinite: dynamic content (e.g., calendars); soft 404s (www.yahoo.com/anything returns a valid page)
- The static web contains syntactic duplication, mostly due to mirroring (~20-30%)
- Some servers are seldom connected
Who cares?
- Media, and consequently the user
- Engine design
- Engine crawl policy; impact on recall

What can we attempt to measure?
- The relative size of search engines: the notion of a page being indexed is still reasonably well defined.
- Already there are problems:
  - Document extension: e.g., Google indexes pages it has not yet crawled, by indexing their anchor text.
  - Document restriction: some engines restrict what is indexed (first n words, only relevant words, etc.)
- The coverage of a search engine relative to another particular crawling process.
Statistical methods:
- Random queries
- Random searches
- Random IP addresses
- Random walks

URL sampling via random queries
- Ideal strategy: generate a random URL and check for containment in each index.
- Problem: random URLs are hard to find!

Random queries [Bhar98a]
- Sample URLs randomly from each engine (20,000 random URLs per engine):
  - Issue a random conjunctive query with < 200 results
  - Select a random URL from the top 200 results
- Test whether each URL is present in the other engines: query with the 8 rarest words on the page and look for a URL match
- Compute the intersection and size ratio: if the intersection is x% of E1 and y% of E2, then |E1|/|E2| = y/x (see the sketch below)
- Issues:
  - Random narrow queries may bias towards long documents (verify with disjunctive queries)
  - Other biases induced by the process

Random searches
- Choose random searches extracted from a local query log [Lawr97], or build "random searches" [Note02]
- Use only queries with small result sets; count normalized URLs in the result sets; use ratio statistics
- Advantage: might be a good reflection of the human perception of coverage

Random searches [Lawr98, Lawr99]
- 575 and 1050 queries from the NEC RI employee logs; 6 engines in 1998, 11 in 1999
- Implementation:
  - Restricted to queries with < 600 results in total
  - Counted URLs from each engine after verifying the query match
  - Computed size ratio and overlap for individual queries
  - Estimated index size ratio and overlap by averaging over all queries
- Issues:
  - Samples are correlated with the source of the log
  - Duplicates
  - Technical statistical problems (result sets must be non-zero; how should the per-query ratios be averaged, e.g., with a harmonic mean?)
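The size-ratio computation underlying both query-based methods can be made concrete. Below is a minimal Python sketch, not the authors' implementation: sample_e1 and sample_e2 stand for the URL samples drawn as above, and is_indexed is a hypothetical containment check (e.g., the 8-rarest-words query).

    def relative_size(sample_e1, sample_e2, is_indexed):
        """Estimate |E1| / |E2| from random URL samples of each engine."""
        # x: fraction of E1's sample also found in E2
        x = sum(is_indexed(url, "E2") for url in sample_e1) / len(sample_e1)
        # y: fraction of E2's sample also found in E1
        y = sum(is_indexed(url, "E1") for url in sample_e2) / len(sample_e2)
        # Intersection = x% of E1 = y% of E2  =>  |E1| / |E2| = y / x
        return y / x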
Queries from the Lawrence and Giles study:
adaptive access control; neighborhood preservation topographic; hamiltonian structures; right linear grammar; pulse width modulation neural; unbalanced prior probabilities; ranked assignment method; internet explorer favourites importing; karvel thornber; zili liu; softmax activation function; bose multidimensional system theory; gamma mlp; dvi2pdf; john oliensis; rieke spikes exploring neural; video watermarking; counterpropagation network; fat shattering dimension; abelson amorphous computing

Size of the web estimation [Lawr98, Bhar98a]
- Capture-recapture technique: assumes the engines index independent random subsets of the web
- If E2 contains x% of E1's pages, assume E2 contains x% of the web as well
- Knowing the size of E2, compute the size of the web: size of the web = 100 * |E2| / x (for example, if x = 40% and |E2| = 128 M pages, the estimate is 100 * 128 M / 40 = 320 M pages)
- Bharat & Broder: 200 M (Nov 97), 275 M (Mar 98); Lawrence & Giles: 320 M (Dec 97)

Random IP addresses [Lawr99]
- Generate random IP addresses; find, if possible, a web server at each address; collect all pages from that server
- Advantage: clean statistics, independent of any crawling strategy

Random IP addresses [ONei97, Lawr99]
- HTTP requests to random IP addresses; ignored: empty responses, authorization required, or excluded servers
- [Lawr99] estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers
- OCLC, using IP sampling, found 8.7 M hosts in 2001; Netcraft [Netc02] accessed 37.2 million hosts in July 2002
- [Lawr99] exhaustively crawled 2500 servers and extrapolated, estimating the size of the web at 800 million pages
- Estimated use of metadata descriptors: meta tags (keywords, description) in 34% of home pages; Dublin Core metadata in 0.3%
- Issues:
  - Virtual hosting
  - A server might not accept requests addressed as http://102.93.22.15
  - No guarantee that all pages on a server are linked from its root page
  - The power-law distribution of pages per host generates bias

Random walks [Henz99, BarY00, Rusm01]
- View the web as a directed graph reachable from a given list of seeds
- Build a random walk on this graph, with various "jump" rules back to visited sites
  - Converges to a stationary distribution, though the time to convergence is not really known
- Sample from the stationary distribution of the walk
- Use the "small result set query" method to check coverage by a search engine
- "Statistically clean" method, at least in theory! A toy sketch follows below.
- Issues:
  - The list of seeds is a problem
  - The practical approximation might not be valid: non-uniform distribution, subject to link spamming
  - Still has all the problems associated with "strong queries"
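Here is a toy Python sketch of such a walk, assuming a hypothetical out_links(url) crawler hook and a 15% jump probability (both are illustrative assumptions, not from the slides):

    import random

    def walk_sample(seeds, out_links, steps=10000, jump_prob=0.15):
        """One approximate draw from the walk's stationary distribution."""
        visited = list(seeds)
        page = random.choice(visited)
        for _ in range(steps):
            links = out_links(page)
            # Jump rule: at a dead end, or with probability jump_prob,
            # return to a previously visited page.
            if not links or random.random() < jump_prob:
                page = random.choice(visited)
            else:
                page = random.choice(links)
            visited.append(page)
        return page

Repeating independent walks yields a pool of sampled URLs whose presence in a given engine can then be checked with small-result-set queries, as above.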
Conclusions
- No sampling solution is perfect
- Lots of new ideas, but the problem is getting harder
- Quantitative studies are fascinating and a good research problem

Duplicates and mirrors

Duplicate/near-duplicate detection
- Duplication: exact match with fingerprints
- Near-duplication: approximate match
- Overview:
  - Compute syntactic similarity with an edit-distance measure
  - Use a similarity threshold to detect near-duplicates (e.g., similarity > 80% => documents are "near duplicates")
  - Not transitive, though sometimes used transitively

Computing similarity
- Features:
  - Segments of a document (natural or artificial breakpoints) [Brin95]
  - Shingles (word n-grams) [Brin95, Brod98]: "a rose is a rose is a rose" => { a_rose_is_a, rose_is_a_rose, is_a_rose_is }
- Similarity measures:
  - TF-IDF [Shiv95]
  - Set intersection [Brod98]: specifically, size of intersection / size of union, the Jaccard measure

Shingles + set intersection
- Computing the exact set intersection of shingles between all pairs of documents is expensive/intractable
- Approximate it using a cleverly chosen subset of shingles from each document (a sketch)
- Estimate size of intersection / size of union based on a short sketch [Brod97, Brod98]:
  - Create a "sketch vector" (e.g., of size 200) for each document
  - Documents that share more than t (say 80%) of corresponding vector elements are considered similar
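A minimal Python sketch of this pipeline, assuming 4-word shingles and 200 sketch positions; salted hashing stands in here for the random permutations of [Brod97, Brod98]:

    import hashlib

    def shingles(text, k=4):
        """The set of k-word shingles of a document."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        """Exact Jaccard measure: size of intersection / size of union."""
        return len(a & b) / len(a | b)

    def sketch(shingle_set, size=200):
        """Sketch vector: one min-hash per position, via salted MD5."""
        return [
            min(hashlib.md5(f"{i}:{s}".encode()).hexdigest() for s in shingle_set)
            for i in range(size)
        ]

    def sketch_similarity(s1, s2):
        """Fraction of agreeing positions, which approximates Jaccard."""
        return sum(x == y for x, y in zip(s1, s2)) / len(s1)

Two documents are flagged as near-duplicates when sketch_similarity exceeds the threshold t, avoiding the full pairwise shingle-set comparison.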