Finding Replicated web collectionsSlide 2IntroductionSlide 4Slide 5What we will know in this paper?Slide 7Slide 8Similarity MeasuresSlide 10Slide 11Slide 12Slide 13Slide 14Similar collectionsSimilar ClustersSlide 17Slide 18Slide 19Slide 20Slide 21Slide 22Quality of ClustersConcept of fingerprints:Slide 25Slide 26Slide 27Exploiting ClustersSlide 29Slide 30Slide 31Slide 32Slide 33Slide 34Slide 35Finding Replicated web collectionsAuthorsJunghoo Cho, Narayanan Shivakumar, Hector Garcia-MolinaPaper Presentation ByRadhika Malladi and Vijay Reddy Mara•Introduction•Similarity Measures Similarity in Web Pages Similarity in Link Structure•Similar Clusters Computing Implementing•Quality of Clusters•Exploiting Clusters Improving Crawling Improving search engine resultsIntroduction Replication across the web ? Replicated collections constitute several hundreds and thousands of pages. Mirrored Web Pages?Mirrored pages.What we will know in this paper?•Similarity Measures: Compute similarity measures for collections of web pages.•Improved Crawling: use replication information to improve crawling.•Reduce clutter from search engines: use replication information.By automatically identifying mirrored collectionsWe can improve the following: •Crawling: fetches web pages.•Ranking: pages ranked by replication factor.•Archiving: stores subset of the web.•Caching: holds web pages that are frequently accessed.Difficulties detecting replicated collections:•Update frequency: Mirror copies may not be updated regularly.•Mirror partial coverage: differ from primary•Different formats•Partial crawls: snapshots may be differentSimilarity MeasuresDefinition (Web Graph): Graph G=(V,E) having nodes vi for each page pi and a directed edge from vi and vJ ,if there is a hyperlink from pi to pJ .Definition (Collection): Collection Size: No. of pages in the collectionDefinition (Identical Collections): Equisized collections C1 and C2 are identical if there is a one-to-one mapping M that maps C1 pages to C2 pages such that:Identical pages: For each page p C1 ,p M( p).Identical link structure: For each link in C1 from page a to b, we have a link from M(a) to M(b) in C2 .Similarity of link structure:Collection sizes:One-to-One:Break points:Link Similarity:Definition (Similar Collections): Equisized collections C and C are similar if there is a one-to-one mapping M that maps all C pages to all C pages such thatSimilar pages: For each page p C1 ,p M( p).Similar links: Two corresponding pages should have atleast one parent in their corresponding collections that are also similar pages.Similar collectionsSimilar ClustersComputingExample 1 of similar clustersCluster size(cardinality):2 Collection size:5ComputingExample 2 of similar clustersCluster size(cardinality):3 Collection size:3Cluster Algorithm Example:Step 1: Find all trivial clustersStep2: Merge trivial clusters that leads to similar clustersStep 3: OutcomeAnother example of cluster algorithm with two possible clusters.Cont…Quality of ClustersConcept of fingerprints:entire document fingerprintfour line fingerprinttwo line fingerprintExploiting ClustersImproving Crawling:Improving Search engine results:Reduces clutter by using a prototype.Prototype has links to ‘Collections’ and ‘Replica’.ConclusionDiscussion Question: Should Cluster size be more or collection size?Discussion Question:Suppose p is similar to pI and pII is similar pIand p and pII are not similar.Do you think all the three pages are
View Full Document