DOC PREVIEW
Chico CSCI 693 - DistanceRank: An intelligent ranking algorithm for web pages

This preview shows page 1-2-3-4-5 out of 16 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

DistanceRank: An intelligent ranking algorithm for web pagesIntroductionDistanceRankExperimental resultsDistanceRank convergenceDistanceRank and ranking problemBackground and related workConclusion and future workReferencesDistanceRank: An intelligent ranking algorithmfor web pagesAli Mohammad Zareh Bidoki*, Nasser YazdaniDepartment of Electrical and Computer Engineering, University of Tehran, Tehran, IranReceived 11 January 2007; received in revised form 27 June 2007; accepted 29 June 2007Available online 13 August 2007AbstractA fast and efficient page ranking mechanism for web crawling and retrieval remains as a challenging issue. Recently,several link based ranking algorithms like PageRank, HITS and OPIC have been proposed. In this paper, we proposea novel recursive method based on reinforcement learning which considers distance between pages as punishment, called‘‘DistanceRank’’ to compute ranks of web pages. The distance is defined as the number of ‘‘average clicks’’ between twopages. The objective is to minimize punishment or distance so that a page with less distance to have a higher rank. Exper-imental results indicate that DistanceRank outperforms other ranking algorithms in page ranking and crawling scheduling.Furthermore, the complexity of DistanceRank is low. We have used University of California at Berkeley’s web for ourexperiments.Ó 2007 Elsevier Ltd. All rights reserved.Keywords: Web ranking; Crawling; Web graph; Reinforcement learning1. IntroductionThe past decade has witnessed the birth and exp losive growth of the World Wide Web. Exponential growthin the number of web servers has extended the web from a few dozen in 1981 up to over 400 millions today(ISC, 2007). Web servers can potentially host millions of pages which make the number of web pages extre-mely difficult to track. To track web pages, one has to download them, a process that can take weeks ormonths. Unfortunately, during this process, a considerable fraction of pages might have been modified,deleted or added. New pages are created at the rate of 8% per week. It is expected that only 20% of the currentpages will be accessible after one year (Ntoulas, Cho, & Olston, 2004). As a rough (under) estimate of the totalnumber of pages, Google claims that it has indexed 3 billion documents at the end of 2001, 4.28 billions onMarch 2004 (Google, 2004) and about 8 billions pages currently (Gulli & Signorini, 2005).One of the most important challenging issues in any web search engine is finding high quality web pages.Quality of pages is defined based on the user preferences. Then, the problem of ranking is to sort web pages0306-4573/$ - see front matter Ó 2007 Elsevier Ltd. All rights reserved.doi:10.1016/j.ipm.2007.06.004*Corresponding author. Tel.: +98 21 66946927; fax: +98 21 8497642.E-mail addresses: [email protected] (A.M. Zareh Bidoki), [email protected] (N. Yazdani).Available online at www.sciencedirect.comInformation Processing and Management 44 (2008) 877–892www.elsevier.com/locate/infopromanbased on users’ requests or preferences. Definitely, to make the web more interesting and prod uctive, we needa good and efficient ranking algorithm for crawling and searching. Unfortunately, search engines do not indexthe entire web (Gulli & Signorini, 2005; Lawrence & Giles, 1999). Therefore, we have to focus on the mostvaluable and appealing pages. To do this, a better ranking criterion is required and a more efficient mechanismhas to be devised. This will ena ble search engines to present the best related pages to the user in response to herqueries. However, current ranking algorithms have low precision and high complexity. Meanwhile, they aresensitive to the ‘‘rich-get-richer’’ problem (Cho et al., 2005) meaning that the popular web pages receive evenmore popularity by time passing. Obviously, we need a solution to remedy these problems and make webranking algorithms fairer.In fact, web search is a subset of the general Information Retrieval (IR) problem . In Information Retrieval(Baeza-Yates & Ribeiro-Neto, 1999), the system tries to find documents related to the user query. IR Algo-rithms usually work based on matching words in documents. However, the web consists of huge unstructureddocuments linked together which create a massive graph. This poses new challenges to IR. New algorithms uselinks between web pages beside words relevancy in the documents with respect to a user query. Previous stud-ies show that algorithms using hyperlinks for ranking yield good results (Henzinger, 2001). Their main powerscome from using content of other pages to rank current pages. In other words, links carry information whichcan be used to evaluate relative importance of pages to the user query. Instances of the connectivity-basedranking algorithms are PageRank (Page, Brin, Motwani, & Winograd, 1998), HITS (Kleinberg, 1999) andOPIC (Abiteboul, Preda, & Cobena, 2003).In this paper, we propose a ranking algorithm, called DistanceR ank, based on reinforcement learning (Sut-ton & Barto, 1998) in which the distance between pages are considered as punishment. We define the distanceas the number of ‘‘average clicks’’ between two pages or logarithm of a page’s out degree (number of outputlinks) as defined in Matsuo, Ohsawa, and Ishizuka (2003). In our method the main objective is to minimize thesum of received punishments (distance) by the user agent so that the page with the low distance will have ahigher rank.The distance djof page j is computed as dj=(1 a)*dj+ a*mini(log(O(i)) + di) where i is a member ofpages that point to j and O(i) shows out degree of i and a is the learning rate of the user.One of major contribution of our method is modelling a user surfing the web randomly. Initially, a userbrowsing the web does not have any background about pages and she clicks based on the current status(page’s content) of each page under consideration. As time goes on, the user selects a page (clicks on a link)based on both her background and the current content of each page. As she continues, she accumulates moreknowledge from the environment and other web pages, and improves her link or page selection process.We have used University of California at Berkeley’s web to evaluate our algorithm. Our algorithm outper-forms other ranking algorithms like PageRank and OPIC while removing some restricted conditions andproblems. The time complexity of our solution is about O( p*jEj) where p


View Full Document

Chico CSCI 693 - DistanceRank: An intelligent ranking algorithm for web pages

Download DistanceRank: An intelligent ranking algorithm for web pages
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view DistanceRank: An intelligent ranking algorithm for web pages and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view DistanceRank: An intelligent ranking algorithm for web pages 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?