Unformatted text preview:

1CS276BText Retrieval and MiningWinter 2005Lecture 9Plan for todayWeb size estimationMirror/duplication detectionPagerankSize of the webWhat is the size of the web ?IssuesThe web is really infinite Dynamic content, e.g., calendar Soft 404: www.yahoo.com/anything is a valid pageStatic web contains syntactic duplication, mostly due to mirroring (~20-30%)Some servers are seldom connectedWho cares?Media, and consequently the userEngine designEngine crawl policy. Impact on recallWhat can we attempt to measure?The relative size of search engines The notion of a page being indexed is still reasonably well defined.Already there are problemsDocument extension: e.g. Google indexes pages not yet crawled by indexing anchortext.Document restriction: Some engines restrict what is indexed (first n words, only relevant words, etc.) The coverage of a search engine relative to another particular crawling process.Statistical methodsRandom queriesRandom searchesRandom IP addressesRandom walks2URL sampling via Random QueriesIdeal strategy: Generate a random URL and check for containment in each index.Problem: Random URLs are hard to find!Random queries [Bhar98a]Sample URLs randomly from each engine20,000 random URLs from each engineIssue random conjunctive query with <200 resultsSelect a random URL from the top 200 resultsTest if present in other engines. Query with 8 rarest words. Look for URL matchCompute intersection & size ratioIssuesRandom narrow queries may bias towards long documents(Verify with disjunctive queries)Other biases induced by processIntersection = x% of E1 = y% of E2E1/E2 = y/xE1 E2Random searchesChoose random searches extracted from a local log [Lawr97] or build “random searches” [Note02]Use only queries with small results sets. Count normalized URLs in result sets.Use ratio statisticsAdvantage:Might be a good reflection of the human perception of coverageRandom searches [Lawr98, Lawr99] 575 & 1050 queries from the NEC RI employee logs 6 Engines in 1998, 11 in 1999 Implementation:Restricted to queries with < 600 results in totalCounted URLs from each engine after verifying query matchComputed size ratio & overlap for individual queries Estimated index size ratio & overlap by averaging over all queries IssuesSamples are correlated with source of logDuplicatesTechnical statistical problems (must have non-zero results, ratio average, use harmonic mean? )adaptive access control neighborhood preservation topographic hamiltonian structures right linear grammar pulse width modulation neural unbalanced prior probabilities ranked assignment method internet explorer favourites importing karvel thornberzili liuQueries from Lawrence and Giles studysoftmax activation function bose multidimensional system theory gamma mlpdvi2pdf john oliensisrieke spikes exploring neural video watermarking counterpropagationnetwork fat shattering dimension abelson amorphous computingSize of the Web Estimation[Lawr98, Bhar98a]Capture – Recapture techniqueAssumes engines get independent random subsets of the WebE2 contains x% of E1.Assume, E2 contains x% of the Web as wellKnowing size of E2 compute size of the WebSize of the Web = 100*E2/xE1E2WEBBharat & Broder: 200 M (Nov 97), 275 M (Mar 98) Lawrence & Giles: 320 M (Dec 97)3Random IP addresses [Lawr99]Generate random IP addressesFind, if possible, a web server at the given addressCollect all pages from serverAdvantagesClean statistics, independent of any crawling strategyRandom IP addresses [ONei97, Lawr99]HTTP requests to random IP addresses Ignored: empty or authorization required or excluded[Lawr99] Estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers.OCLC using IP sampling found 8.7 M hosts in 2001Netcraft [Netc02] accessed 37.2 million hosts in July 2002[Lawr99] exhaustively crawled 2500 servers and extrapolatedEstimated size of the web to be 800 millionEstimated use of metadata descriptors:Meta tags (keywords, description) in 34% of home pages, Dublin core metadata in 0.3%IssuesVirtual hostingServer might not accept http://102.93.22.15No guarantee all pages are linked to root pagePower law for # pages/hosts generates biasRandom walks [Henz99, BarY00, Rusm01] View the Web as a directed graph from a given list of seeds. Build a random walk on this graphIncludes various “jump” rules back to visited sitesConverges to a stationary distributionTime to convergence not really knownSample from stationary distribution of walkUse the “small results set query” method to check coverage by SE“Statistically clean” method, at least in theory!IssuesList of seeds is a problem.Practical approximation might not be valid: Non-uniform distribution, subject to link spammingStill has all the problems associated with “strong queries”ConclusionsNo sampling solution is perfect. Lots of new ideas .......but the problem is getting harderQuantitative studies are fascinating and a good research problem4Duplicates and mirrorsDuplicate/Near-Duplicate DetectionDuplication: Exact match with fingerprintsNear-Duplication: Approximate matchOverviewCompute syntactic similarity with an edit-distance measureUse similarity threshold to detect near-duplicatesE.g., Similarity > 80% => Documents are “near duplicates”Not transitive though sometimes used transitivelyComputing SimilarityFeatures:Segments of a document (natural or artificial breakpoints) [Brin95]Shingles (Word N-Grams) [Brin95, Brod98]“a rose is a rose is a rose” => a_rose_is_arose_is_a_roseis_a_rose_is Similarity MeasureTFIDF [Shiv95]Set intersection [Brod98](Specifically, Size_of_Intersection / Size_of_Union )Jaccard measureShingles + Set IntersectionComputing exact set intersection of shingles between all pairs of documents is expensive/intractableApproximate using a cleverly chosen subset of shingles from each (a sketch) Estimate size_of_intersection / size_of_union based on a short sketch ( [Brod97, Brod98] )Create a “sketch vector” (e.g., of size 200) for each documentDocuments which share more than t (say 80%) corresponding vector elements are similarFor doc


View Full Document

Stanford CS 276B - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?