GT CS 8803 - A Distributed P2P Link Analysis Based Ranking System
School name Georgia Tech
Pages 5

Unformatted text preview:

A Distributed P2P Link Analysis Based Ranking System CS 8803 Advanced Internet Application Development Project Proposal Bhuvan Bamba Sponsors – Prof. Ling Liu, James Caverlee, Mudhakar Srivatsa, Aameek Singh Abstract Link Based approaches are among the most popular ranking approaches employed by search engines. They make use of the inherent linkage based structure of World Wide Web documents assigning each document an importance score. This importance score is based on the incoming links for a document; a document which is pointed to by many high quality documents should have a higher importance score. Googles’ highly popular search technology [1] exemplifies the success of the link based ranking algorithms in identifying important pages. However, such link analysis based algorithms suffer from some drawbacks. Googles’ PageRank algorithm has an update time extending into months which is not feasible for frequent updating of the system. Secondly, the algorithm is susceptible to manipulation by malicious Web Spammers, who manipulate the link based analysis to favor their fake websites. The problem commonly termed as Web Spam can seriously hurts the performance of PageRank algorithm, leading the algorithm into providing unjustified high PageRank to spam web pages. In [2], the authors propose the SourceRank approach for enhancing PageRank through source-based link analysis, which can potentially help combat the problem of web spam. In this project, we propose to implement the SourceRank technique on top a P2P crawler Apoidea [3]. We plan to perform experimental analysis using the SourceRank technique on the www.gatech.edu domain and analyse the results to verify the claims made in [2]. In addition, we analyze the results obtained by the above experimentation to enable us to answer an important question raised in [2] – How can we identify a collection of pages which constitutes a single source? Motivation PageRank-based algorithms suffer from several critical problems, like a very long update time and the problem of web spam. The first problem is a direct consequence of the huge size of the World Wide Web and the inherent expense associated with PageRanks’ eigen vector based iterative algorithm. The World Wide Web comprises of Terabytes of data; search engines are capable of crawling and indexing a few billion pages. A web graph comprising of these billions of pages needs to be analyzed by the PageRank algorithm. The end result of the attempts to perform a link-based analysis results in weeks and months spent in updating the PageRanks of the web pages. As the size of the World Wide Web continues to grow there are no clear cut solutions available to deal with the complexity associated with the PageRank computation. An obvious solution is to somehow reduce the size of the graph. The size of the graph can indeed be reduced by grouping together pages according to some criteria. In [2], the authors propose the idea of using a source- based linkanalysis to reduce the complexity of these operations. A source can be loosely identified as a collection of pages, such as pages belonging to the same domain. Distributed processing of the web graph can also help speed up the processing time for calculating pagerank. A logical division of the web pages into collections is required for such an algorithm to work. SourceRank provides us with a methodology for dividing the web graph into separate collections. The web graph can be split into different sources and a distributed system can be used to calculate the PageRank of pages within the source and the SourceRank of the particular source in relation to other sources. We plan to implement the SourceRank (and PageRank) algorithm on top of Apoidea [3], a decentralized P2P crawler for the World Wide Web. SourceRank can also help deal with the problem of web spam. Spammers manipulate PageRanks’ link based propagation model to promote their own pages. A user can increase the pagerank of a target page by using “link farms” of pages that point to this particular target page. “Hijacking” of other pages to add links to the target page is another way of promoting the pagerank for a page. Such link based manipulation of web pages is intended to fool the search engine into misjudging the quality of a particular page. Recent results suggest that the amount of Web spam is a significant portion of all Web content – estimates range from 8.1% of pages [4] to 18% of sites [5]. An implementation of SourceRank on top of Apoidea will solve multiple problems. Firstly, it provides a distributed setup for calculation of PageRank (as well as SourceRank). Secondly, SourceRank can also be used to rank the nodes in the P2P system in terms of the content they are holding. This will prove to be highly beneficial when searching across a large number of peers. Each file in the P2P system can now be given a SourceRank and a LocalRank and the overall rank of this file will be a combination of the two factors. Related Work Our system builds on top of a web based crawler, Apoidea, developed at the College of Computing at Georgia Institute of Technology. Apoidea is a decentralized P2P crawler for the World Wide Web and provides an ideal setup for implementing a distributed PageRank system. Apoidea is self managing and uses geographical proximity of the web resources to the peers for a better and faster crawl. It uses DHT based protocols to perform URL duplicate and content-duplicate tests. Content-duplication may be associated with the web spam problem which this system attempts to solve. The SourceRank algorithm attempts to solve the problem of web spam by treating the web as a collection of sources rather than as a collection of pages. SourceRank emphasizes the quality of the source a page belongs to, which can be extremely useful to judge instances of manipulated PageRanks. A page which has a high PageRank but a very low SourceRank has a high probability of being associated with a spamming operation. Proposed Work In this project we propose to develop a complete


View Full Document

GT CS 8803 - A Distributed P2P Link Analysis Based Ranking System

Download A Distributed P2P Link Analysis Based Ranking System
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view A Distributed P2P Link Analysis Based Ranking System and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view A Distributed P2P Link Analysis Based Ranking System 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?