Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
By Monika Henzinger
http://labs.google.com/people/monika/
Presented by Harish Rayapudi and Shiva Prasad Malladi

Overview
•Introduction
•Broder's Algorithm
•Charikar's Algorithm
•Comparing the algorithms
•Combined algorithm
•Conclusion

Duplicate web pages
•Take more space to store in the index
•Slow down performance
How to identify duplicate pages?
•Comparing every pair of pages requires O(n^2) comparisons
•The indexed web contains about 13.92 billion pages

Experimental Data
•1.6B pages from a real Google crawl
•25%-30% of identical pages were removed before the authors received the data
 –it is unknown exactly how many pages were identical
•Of the remainder:
 –Broder's Algorithm (Alg. B) found 1.7% near-duplicates
 –Charikar's Algorithm (Alg. C) found 2.2% near-duplicates

Algorithms
•Broder's and Charikar's algorithms had not previously been evaluated against each other
•Both are used by successful web search engines
•The algorithms are compared on:
 1. Precision on a random subset
 2. The distribution of the number of term differences per near-duplicate pair
 3. The distribution of the number of near-duplicates per page
•The algorithms were evaluated on a set of 1.6B unique pages

Sample HTML Page
<html>
<body bgcolor="cream">
<H3>Harish Rayapudi Website</H3>
<H4><a href="http://www.google.com" target="_blank">Google</a></H4>
<H4>I am a Computer Science graduate student</H4>
</body>
</html>

Remove HTML & Formatting Info
Harish Rayapudi Website
http://www.google.com Google
I am a Computer Science graduate student

Remove "." and "/" from URLs
Harish Rayapudi Website
http www google com Google
I am a Computer Science graduate student

Tokens in the Page
Harish1 Rayapudi2 Website3 http4 www5 google6 com7 Google9 I10 am11 a12 Computer13 Science14 graduate15 student16
•We'll only look at the first 7 tokens
•The token sequence for this page P1 is {1,2,3,4,5,6,7}
•This token sequence is used by both the algorithms

Tokens in a Similar Page
Harish1 Rayapudi2 Website3 http4 www5 yahoo8 com7 Yahoo17 I10 am11 a12 Computer13 Science14 graduate15 student16
•We'll only look at the first 7 tokens
•The token sequence for this page P2 is {1,2,3,4,5,8,7}
•This token sequence is used by both the algorithms

Preprocessing step (contd.)
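The preprocessing steps above (strip formatting, break URLs apart, assign each distinct token string a shared integer id) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names are ours, we start from already tag-stripped text as in the slides, and we also treat ":" as a separator so that "http://www.google.com" splits cleanly. Ids for new tokens depend purely on insertion order here, so they need not match the slide's exact numbering.

```python
import re

def tokenize(text: str) -> list[str]:
    """Replace '.', '/', and ':' with spaces (so URLs break apart), then split on whitespace."""
    return re.sub(r"[./:]", " ", text).split()

def token_ids(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map each distinct token string to a global integer id, shared across pages."""
    out = []
    for t in tokens:
        if t not in vocab:
            vocab[t] = len(vocab) + 1   # next unused id
        out.append(vocab[t])
    return out

vocab: dict[str, int] = {}
page1 = "Harish Rayapudi Website http://www.google.com Google I am a Computer Science graduate student"
page2 = "Harish Rayapudi Website http://www.yahoo.com Yahoo I am a Computer Science graduate student"
p1 = token_ids(tokenize(page1), vocab)
p2 = token_ids(tokenize(page2), vocab)
# p1[:7] == [1, 2, 3, 4, 5, 6, 7]; p2 differs from p1 only at the positions
# where "yahoo"/"Yahoo" replace "google"/"Google"
```

Sharing one `vocab` across pages is what makes the two token sequences comparable: identical words get identical ids on both pages.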
•Let n be the length of the token sequence; for pages P1 and P2, n=7
•Each subsequence of k consecutive tokens is fingerprinted, yielding n-k+1 shingles
•For k=2, the shingles for page P1 {1,2,3,4,5,6,7} and page P2 {1,2,3,4,5,8,7} are:
  P1: {12, 23, 34, 45, 56, 67}
  P2: {12, 23, 34, 45, 58, 87}

Broder's Algorithm
•Shingles are fingerprinted with m different fingerprinting functions
•For m=4 we have fingerprinting functions F1, F2, F3, and F4
•For P1, the result of applying the m different functions:

  Shingle  F1  F2  F3  F4
  12        4   7   5   9
  23        7   4   8   5
  34        1   2   3   6
  45        8   5   9   7
  56        1   8   7   8
  67        6   3   4   5

•The smallest value of each function is taken, and the m-dimensional vector of min-values is stored for each page
•The 4-dimensional vector for page P1 is {1,2,3,5}
•For P2, the result of applying the m functions:

  Shingle  F1  F2  F3  F4
  12        4   7   5   9
  23        7   4   8   5
  34        1   2   3   6
  45        8   5   9   7
  58        9   7   3   1
  87        5   3   6   4

•The 4-dimensional vector for page P2 is {1,2,3,1}
•The m-dimensional vector is reduced to an m'-dimensional vector of supershingles, where m' is chosen so that m is divisible by m'
•Since m=4, we take m'=2
•The non-overlapping sequences are {12,35} for P1's vector {1,2,3,5} and {12,31} for P2's vector {1,2,3,1}
•Fingerprinting each non-overlapping sequence gives the supershingle vector: for P1, SS({12,35}) = {x, y}; for P2, SS({12,31}) = {x, z}
•The B-similarity of two pages is the number of identical entries in their supershingle vectors
•The B-similarity of pages P1 and P2 is 1 (common entry x)

Experimental results of Broder's Algorithm
•The algorithm generated 6 supershingles per page, 10.1B supershingles in total
  (for pages P1 and P2 we had 2 supershingles each: {x, y} and {x, z})
•For each pair of pages sharing an identical supershingle, the B-similarity is determined
  (pages P1 and P2 have B-similarity 1)

B-similarity graph
•Every page is a node in the graph
•There is an edge between two nodes if and only if the pair is B-similar
•The label of an edge is the B-similarity of the pair
•A node is considered a near-duplicate page if and only if it is incident to at least one edge
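The shingling, min-value, and supershingle steps above can be sketched end to end. This is a minimal sketch under stated assumptions: a seeded SHA-256 hash stands in for the Rabin-style fingerprinting functions the algorithm actually uses, and all helper names are ours.

```python
import hashlib

def shingles(tokens, k=2):
    """All overlapping k-token subsequences: n - k + 1 shingles."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def fingerprint(item, seed):
    """Stand-in for a fingerprinting function: a deterministic seeded hash."""
    data = repr((seed, item)).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def min_values(tokens, m=4, k=2):
    """m-dimensional vector: the minimum fingerprint per fingerprinting function."""
    sh = shingles(tokens, k)
    return [min(fingerprint(s, seed) for s in sh) for seed in range(m)]

def supershingles(minvec, m_prime=2):
    """Split the min-value vector into m' non-overlapping runs and fingerprint each run."""
    step = len(minvec) // m_prime
    return [fingerprint(tuple(minvec[i:i + step]), seed=-1)
            for i in range(0, len(minvec), step)]

def b_similarity(p, q):
    """Number of positions where the two supershingle vectors agree."""
    return sum(1 for a, b in zip(p, q) if a == b)

p1 = supershingles(min_values([1, 2, 3, 4, 5, 6, 7]))
p2 = supershingles(min_values([1, 2, 3, 4, 5, 8, 7]))
# P1 and P2 share most shingles, so each min-value (and hence each
# supershingle) agrees with high probability, but not with certainty
```

Note the hedged comment at the end: with real min-hashing the probability that a min-value agrees equals the Jaccard similarity of the two shingle sets, so similar pages usually, but not always, share supershingles.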
(Figure: the example edge between nodes P1 and P2, labeled 1)
•The average degree of the B-similarity graph is about 135

Experimental results (contd.)
•A random sample of 96,556 B-similar pairs was drawn; it was subsampled and 1,910 pairs were chosen for human evaluation
•The overall precision is 0.38
•The precision for pairs on the same site is 0.34, while for pairs on different sites it is 0.84
(Table taken from the paper)

Correctness of a near-duplicate pair
A pair was judged correct if:
•The text differs only by URL, session id, a timestamp, or a visitor count
•The difference is invisible to the visitors
•The difference is a combination of the above items
•The pages are entry pages to the same site
•URL-only differences account for 41% of the correct pairs
(Table taken from the paper)
•92% of them are on the same site. Almost half the cases are pairs that could not be evaluated.
(Table taken from the paper)

> diff google yahoo
1,2c1,2
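The pair-finding step behind the B-similarity graph ("for each pair of pages with an identical supershingle, the B-similarity is determined") can be sketched as an inverted index from supershingle to pages. A minimal sketch; the function and variable names are ours, and we use the slides' symbolic supershingles x, y, z as strings.

```python
from collections import defaultdict
from itertools import combinations

def b_similar_pairs(pages, min_sim=1):
    """pages: {page_id: supershingle_vector}. Returns the edges of the
    B-similarity graph, labeled with each pair's B-similarity."""
    index = defaultdict(set)                 # supershingle -> ids of pages containing it
    for pid, vec in pages.items():
        for ss in vec:
            index[ss].add(pid)
    edges = {}
    for pids in index.values():              # only pairs sharing a supershingle are compared
        for a, b in combinations(sorted(pids), 2):
            if (a, b) not in edges:
                sim = sum(1 for x, y in zip(pages[a], pages[b]) if x == y)
                if sim >= min_sim:
                    edges[(a, b)] = sim      # edge label = B-similarity
    return edges

# From the slides: P1 has supershingles {x, y}, P2 has {x, z};
# P3 is an unrelated page that shares nothing
pages = {"P1": ["x", "y"], "P2": ["x", "z"], "P3": ["w", "v"]}
# b_similar_pairs(pages) -> {("P1", "P2"): 1}
```

Grouping by shared supershingle is what avoids the O(n^2) all-pairs comparison mentioned earlier: only pages that collide in the index are ever compared.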