Chico CSCI 693 - FICA: A Fast Intelligent Crawling Algorithm


FICA: A Fast Intelligent Crawling Algorithm

Ali Mohammad Zareh Bidoki, University of Tehran, [email protected]
Nasser Yazdani, University of Tehran, [email protected]
Pedram Ghodsnia, University of Tehran, [email protected]

Abstract

Due to the proliferation and highly dynamic nature of the web, an efficient crawling and ranking algorithm for retrieving the most important pages has remained a challenging issue. Several algorithms, such as PageRank [13] and OPIC [1], have been proposed. Unfortunately, they have high time complexity. In this paper, an intelligent crawling algorithm based on reinforcement learning, called FICA, is proposed that models a real surfing user. The priority for crawling pages is based on a concept we name logarithmic distance. FICA is easy to implement, and its time complexity is O(E*logV), where V and E are the numbers of nodes and edges in the web graph, respectively. Comparison of FICA with other proposed algorithms shows that FICA outperforms them in discovering highly important pages. Furthermore, FICA computes the importance (ranking) of each page during the crawling process, so FICA can also be used as a ranking method for computing page importance. We have used the UK's web graph for our experiments.

1. Introduction

One of the most challenging issues for web search engines is finding high-quality web pages, i.e. pages with high popularity among users. To make the web more interesting and productive, we need an efficient ranking algorithm for crawling and searching. In [6], it has been shown that search engines do not index the entire web; therefore, we have to focus on the most valuable and appealing pages. To do this, a better crawling technique is required and a more efficient mechanism has to be applied. This enables search engines to present the most relevant pages to the user in response to her query. Crawling algorithms use a ranking mechanism to discover important pages.
In other words, a ranking algorithm is applied to the current web graph, and pages with higher ranks will have a higher priority for crawling. Ranking methods are usually based on links between web pages, i.e. the graph structure of the pages. Their main strength comes from using the content of other pages to rank the current pages [8]. Examples of connectivity-based ranking algorithms are PageRank [13], HITS [10] and OPIC [1]. Unfortunately, the current ranking algorithms usually have high complexity and low throughput. Obviously, these drawbacks are inherited by the crawling processes that use them.

We propose a crawling algorithm, called FICA (Fast Intelligent Crawling Algorithm), based on reinforcement learning [14], in which "punishment" is calculated using the distance between pages. We define this distance as the number of "average clicks" between two pages, as defined in [11]. In our method, the aim is to minimize the sum of punishments (distances) received by the crawler agent, so that a page with a low distance has the highest priority for crawling. On the currently crawled page p with distance dp, the distance of each of its child nodes q is computed as dq = α*log(O(p)) + (1-α)*dp, where O(p) is the out-degree of p and α is the learning rate of the crawler agent.

A nice property of our method is that it models a user surfing the web randomly. Initially, a user browsing the web does not have any background about the pages and clicks based on the current status of each page. As time goes on, the user selects a page (clicks on a link) based on both her background and the current content of each page. As she continues, she accumulates more knowledge from the environment and other web pages, and improves her selection. A major contribution of this paper is an efficient crawling algorithm that finds hot pages faster (earlier) than former algorithms.
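The distance update and priority-driven crawl described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the graph is a toy in-memory adjacency list, α is fixed at 0.5, log is taken as the natural logarithm (the paper does not state a base), and the name fica_crawl is ours. Pages are popped from a min-heap in order of increasing logarithmic distance; each heap operation costs O(logV), giving the O(E*logV) bound mentioned above.

```python
import heapq
import math

def fica_crawl(graph, seed, alpha=0.5):
    """Sketch of a FICA-style crawl order on a toy adjacency-list graph.

    Pages are fetched in order of increasing logarithmic distance
    d_q = alpha * log(O(p)) + (1 - alpha) * d_p, where O(p) is the
    out-degree of the currently crawled page p.
    """
    dist = {seed: 0.0}          # best known logarithmic distance per page
    frontier = [(0.0, seed)]    # min-heap ordered by distance
    crawl_order = []
    visited = set()
    while frontier:
        d_p, p = heapq.heappop(frontier)
        if p in visited:        # stale heap entry; page already crawled
            continue
        visited.add(p)
        crawl_order.append(p)   # a real crawler would fetch p here
        children = graph.get(p, [])
        if not children:
            continue
        # distance assigned to every child of p (depends only on p)
        d_q = alpha * math.log(len(children)) + (1 - alpha) * d_p
        for q in children:
            if q not in visited and d_q < dist.get(q, math.inf):
                dist[q] = d_q
                heapq.heappush(frontier, (d_q, q))
    return crawl_order
```

On a toy graph such as {"a": ["b", "c"], "b": ["c"], "c": []}, the seed is crawled first and each reachable page exactly once; children of low-distance, low-out-degree parents receive smaller distances and are crawled earlier.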
The method also computes the importance of web pages during the crawling process, with low complexity and fewer resources, while crawling each page only once. In other words, after the crawling process the ranking of the pages will be available too. This means that ranking is performed online, i.e. while crawling, which is why we talk about ranking and crawling simultaneously in this paper.

(2007 IEEE/WIC/ACM International Conference on Web Intelligence, 0-7695-3026-5/07 $25.00 © 2007 IEEE, DOI 10.1109/WI.2007.91635)

The UK's web graph (.uk websites) has been used to evaluate our algorithm. The complexity of our solution is at most O(E*logV), where V and E are the numbers of nodes (vertices) and edges in the web graph, respectively. We have compared our algorithm with other crawling algorithms. Our algorithm outperforms the others in throughput, i.e. it finds important pages faster. Furthermore, we used FICA as a ranking algorithm and compared it with PageRank and OPIC. Interestingly, its ranking similarity to PageRank is 0.61.

The next section reviews background and related work. Section 3 discusses our solution, FICA. Experimental analysis and comparison with some well-known algorithms come in section 4. Section 5 describes using FICA as a ranking algorithm, and finally, our conclusions and future areas of research are presented in section 6.

2. Background and related work

There are two broad categories of crawling algorithms.
Some algorithms are based on ranking algorithms like PageRank [13], and some are based on crawl-scheduling algorithms like OPIC [1] and the breadth-first algorithm [12]. In PageRank-based crawling, the pages with higher ranks have higher priority for retrieval from the web. PageRank has been designed such that the known relationships between web pages are taken into account. For example, if page p1 has a link to page p2, then p2's subject is probably interesting to p1's creator. Therefore, the number of input links to a web page shows the degree of interest of others in that page; clearly, the interest degree of a page increases with the growing number of input links. Furthermore, when a

