DOC PREVIEW
ODU CS 791 - Finding Replicated web collections

This preview shows page 1-2-16-17-18-34-35 out of 35 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Finding Replicated web collectionsSlide 2IntroductionSlide 4Slide 5What we will know in this paper?Slide 7Slide 8Similarity MeasuresSlide 10Slide 11Slide 12Slide 13Slide 14Similar collectionsSimilar ClustersSlide 17Slide 18Slide 19Slide 20Slide 21Slide 22Quality of ClustersConcept of fingerprints:Slide 25Slide 26Slide 27Exploiting ClustersSlide 29Slide 30Slide 31Slide 32Slide 33Slide 34Slide 35Finding Replicated web collectionsAuthorsJunghoo Cho, Narayanan Shivakumar, Hector Garcia-MolinaPaper Presentation ByRadhika Malladi and Vijay Reddy Mara•Introduction•Similarity Measures Similarity in Web Pages Similarity in Link Structure•Similar Clusters Computing Implementing•Quality of Clusters•Exploiting Clusters Improving Crawling Improving search engine resultsIntroduction Replication across the web ? Replicated collections constitute several hundreds and thousands of pages. Mirrored Web Pages?Mirrored pages.What we will know in this paper?•Similarity Measures: Compute similarity measures for collections of web pages.•Improved Crawling: use replication information to improve crawling.•Reduce clutter from search engines: use replication information.By automatically identifying mirrored collectionsWe can improve the following: •Crawling: fetches web pages.•Ranking: pages ranked by replication factor.•Archiving: stores subset of the web.•Caching: holds web pages that are frequently accessed.Difficulties detecting replicated collections:•Update frequency: Mirror copies may not be updated regularly.•Mirror partial coverage: differ from primary•Different formats•Partial crawls: snapshots may be differentSimilarity MeasuresDefinition (Web Graph): Graph G=(V,E) having nodes vi for each page pi and a directed edge from vi and vJ ,if there is a hyperlink from pi to pJ .Definition (Collection): Collection Size: No. of pages in the collectionDefinition (Identical Collections): Equisized collections C1 and C2 are identical if there is a one-to-one mapping M that maps C1 pages to C2 pages such that:Identical pages: For each page p C1 ,p M( p).Identical link structure: For each link in C1 from page a to b, we have a link from M(a) to M(b) in C2 .Similarity of link structure:Collection sizes:One-to-One:Break points:Link Similarity:Definition (Similar Collections): Equisized collections C and C are similar if there is a one-to-one mapping M that maps all C pages to all C pages such thatSimilar pages: For each page p C1 ,p M( p).Similar links: Two corresponding pages should have atleast one parent in their corresponding collections that are also similar pages.Similar collectionsSimilar ClustersComputingExample 1 of similar clustersCluster size(cardinality):2 Collection size:5ComputingExample 2 of similar clustersCluster size(cardinality):3 Collection size:3Cluster Algorithm Example:Step 1: Find all trivial clustersStep2: Merge trivial clusters that leads to similar clustersStep 3: OutcomeAnother example of cluster algorithm with two possible clusters.Cont…Quality of ClustersConcept of fingerprints:entire document fingerprintfour line fingerprinttwo line fingerprintExploiting ClustersImproving Crawling:Improving Search engine results:Reduces clutter by using a prototype.Prototype has links to ‘Collections’ and ‘Replica’.ConclusionDiscussion Question: Should Cluster size be more or collection size?Discussion Question:Suppose p is similar to pI and pII is similar pIand p and pII are not similar.Do you think all the three pages are


View Full Document

ODU CS 791 - Finding Replicated web collections

Documents in this Course
Load more
Download Finding Replicated web collections
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Finding Replicated web collections and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Finding Replicated web collections 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?