Chico CSCI 693 - DistanceRank: An intelligent ranking algorithm for web pages - D3053024

Home> Schools> California State University, Chico> Computer Science (CSCI) > CSCI 693> DistanceRank: An intelligent ranking algorithm for web pages

DOC PREVIEW

Chico CSCI 693 - DistanceRank: An intelligent ranking algorithm for web pages

School name California State University, Chico

Course Csci 693- Research Methods in Computer Science

Pages 16

This preview shows page 1-2-3-4-5 out of 16 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

DistanceRank: An intelligent ranking algorithm for web pagesIntroductionDistanceRankExperimental resultsDistanceRank convergenceDistanceRank and ranking problemBackground and related workConclusion and future workReferencesDistanceRank: An intelligent ranking algorithmfor web pagesAli Mohammad Zareh Bidoki*, Nasser YazdaniDepartment of Electrical and Computer Engineering, University of Tehran, Tehran, IranReceived 11 January 2007; received in revised form 27 June 2007; accepted 29 June 2007Available online 13 August 2007AbstractA fast and eﬃcient page ranking mechanism for web crawling and retrieval remains as a challenging issue. Recently,several link based ranking algorithms like PageRank, HITS and OPIC have been proposed. In this paper, we proposea novel recursive method based on reinforcement learning which considers distance between pages as punishment, called‘‘DistanceRank’’ to compute ranks of web pages. The distance is deﬁned as the number of ‘‘average clicks’’ between twopages. The objective is to minimize punishment or distance so that a page with less distance to have a higher rank. Exper-imental results indicate that DistanceRank outperforms other ranking algorithms in page ranking and crawling scheduling.Furthermore, the complexity of DistanceRank is low. We have used University of California at Berkeley’s web for ourexperiments.Ó 2007 Elsevier Ltd. All rights reserved.Keywords: Web ranking; Crawling; Web graph; Reinforcement learning1. IntroductionThe past decade has witnessed the birth and exp losive growth of the World Wide Web. Exponential growthin the number of web servers has extended the web from a few dozen in 1981 up to over 400 millions today(ISC, 2007). Web servers can potentially host millions of pages which make the number of web pages extre-mely diﬃcult to track. To track web pages, one has to download them, a process that can take weeks ormonths. Unfortunately, during this process, a considerable fraction of pages might have been modiﬁed,deleted or added. New pages are created at the rate of 8% per week. It is expected that only 20% of the currentpages will be accessible after one year (Ntoulas, Cho, & Olston, 2004). As a rough (under) estimate of the totalnumber of pages, Google claims that it has indexed 3 billion documents at the end of 2001, 4.28 billions onMarch 2004 (Google, 2004) and about 8 billions pages currently (Gulli & Signorini, 2005).One of the most important challenging issues in any web search engine is ﬁnding high quality web pages.Quality of pages is deﬁned based on the user preferences. Then, the problem of ranking is to sort web pages0306-4573/$ - see front matter Ó 2007 Elsevier Ltd. All rights reserved.doi:10.1016/j.ipm.2007.06.004*Corresponding author. Tel.: +98 21 66946927; fax: +98 21 8497642.E-mail addresses: [email protected] (A.M. Zareh Bidoki), [email protected] (N. Yazdani).Available online at www.sciencedirect.comInformation Processing and Management 44 (2008) 877–892www.elsevier.com/locate/infopromanbased on users’ requests or preferences. Deﬁnitely, to make the web more interesting and prod uctive, we needa good and eﬃcient ranking algorithm for crawling and searching. Unfortunately, search engines do not indexthe entire web (Gulli & Signorini, 2005; Lawrence & Giles, 1999). Therefore, we have to focus on the mostvaluable and appealing pages. To do this, a better ranking criterion is required and a more eﬃcient mechanismhas to be devised. This will ena ble search engines to present the best related pages to the user in response to herqueries. However, current ranking algorithms have low precision and high complexity. Meanwhile, they aresensitive to the ‘‘rich-get-richer’’ problem (Cho et al., 2005) meaning that the popular web pages receive evenmore popularity by time passing. Obviously, we need a solution to remedy these problems and make webranking algorithms fairer.In fact, web search is a subset of the general Information Retrieval (IR) problem . In Information Retrieval(Baeza-Yates & Ribeiro-Neto, 1999), the system tries to ﬁnd documents related to the user query. IR Algo-rithms usually work based on matching words in documents. However, the web consists of huge unstructureddocuments linked together which create a massive graph. This poses new challenges to IR. New algorithms uselinks between web pages beside words relevancy in the documents with respect to a user query. Previous stud-ies show that algorithms using hyperlinks for ranking yield good results (Henzinger, 2001). Their main powerscome from using content of other pages to rank current pages. In other words, links carry information whichcan be used to evaluate relative importance of pages to the user query. Instances of the connectivity-basedranking algorithms are PageRank (Page, Brin, Motwani, & Winograd, 1998), HITS (Kleinberg, 1999) andOPIC (Abiteboul, Preda, & Cobena, 2003).In this paper, we propose a ranking algorithm, called DistanceR ank, based on reinforcement learning (Sut-ton & Barto, 1998) in which the distance between pages are considered as punishment. We deﬁne the distanceas the number of ‘‘average clicks’’ between two pages or logarithm of a page’s out degree (number of outputlinks) as deﬁned in Matsuo, Ohsawa, and Ishizuka (2003). In our method the main objective is to minimize thesum of received punishments (distance) by the user agent so that the page with the low distance will have ahigher rank.The distance djof page j is computed as dj=(1 a)*dj+ a*mini(log(O(i)) + di) where i is a member ofpages that point to j and O(i) shows out degree of i and a is the learning rate of the user.One of major contribution of our method is modelling a user surﬁng the web randomly. Initially, a userbrowsing the web does not have any background about pages and she clicks based on the current status(page’s content) of each page under consideration. As time goes on, the user selects a page (clicks on a link)based on both her background and the current content of each page. As she continues, she accumulates moreknowledge from the environment and other web pages, and improves her link or page selection process.We have used University of California at Berkeley’s web to evaluate our algorithm. Our algorithm outper-forms other ranking algorithms like PageRank and OPIC while removing some restricted conditions andproblems. The time complexity of our solution is about O( p*jEj) where p

View Full Document