Berkeley COMPSCI 294 - Distributed Web Crawling over DHTs

Distributed Web Crawling over DHTs

Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
{boonloo, owenc, sailesh}@cs.berkeley.edu
University of California, Berkeley

Abstract

In this paper, we present the design and implementation of a distributed web crawler. We begin by motivating the need for such a crawler as a basic building block for decentralized web search applications. It harnesses the excess bandwidth and computing resources of clients to crawl the web. Nodes participating in the crawl use a Distributed Hash Table (DHT) to coordinate and distribute work. We study different crawl distribution strategies and investigate the trade-offs in communication overheads, crawl throughput, balancing load on the crawlers as well as on the crawl targets, and the ability to exploit network proximity. We present an implementation of the distributed crawler using PIER, a relational query processor that runs over the Bamboo DHT, and compare different crawl strategies on PlanetLab, querying live web sources.

1 Introduction

Search engines such as Google [3] have become an integral part of the Internet. These systems typically expose a limited search interface to end-users, through which users enter search terms and receive results according to the engine's ranking function.

Beneath this interface, search engines are generally composed of three core components. First, the crawl component trawls the web and downloads web content to be cached locally. Second, the index component precomputes indexes for the cached web content for efficient search. Last, the search component uses the indexes to execute user search queries, returning ranked results.

This interface has served us well thus far, with the web consisting of a relatively small set of fairly static web pages. As the web has grown and evolved towards dynamic content, however, search engines face increasing challenges in maintaining a fresh index over the entire web. As a result, search engines today index only a fraction of the web, and design their crawlers to prioritize updates of some web sites over others. It is estimated that Google indexes around 3 billion web documents, a tiny fraction of the 550 billion documents estimated in 2001 [10], most of which lie within the deep web. Further, search engines are known to censor and skew result rankings [8], mostly in response to external pressures. Examples of Google's censorship abound: DMCA complaints from KaZaA [5], The Church of Scientology's attempts to silence its critics, and the banning of seditious web sites by certain countries. If Google is the guardian of the web, it is reasonable to ask: quis custodiet ipsos custodes? (Who will guard the guards?)

To address the shortcomings of centralized search engines, there have been several proposals [17, 25] to build decentralized search engines over peer-to-peer (P2P) networks. In this paper, we focus on the design issues of distributed crawling, which is an essential substrate for any of the proposed decentralized search applications. The web content harvested by a distributed crawler can be indexed by decentralized search infrastructures, or archived using a persistent storage infrastructure such as OceanStore [16]. Here we focus our discussion on crawling and do not address the orthogonal issues of persistent storage and indexing.

The distributed crawler harnesses the excess bandwidth and computing resources of clients to crawl the web. At a high level, crawls are expressed as recursive queries and executed using PIER [15], a P2P relational query processor over DHTs. A DHT provides useful properties of load balancing, reliable content-based routing in the absence of network failures, and logarithmic (in the number of nodes) lookup costs. This allows us to easily dispatch crawl requests evenly across crawler nodes in a decentralized fashion.
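As a rough illustration (and not part of the paper's actual implementation), the following Python sketch shows how a URL can be mapped onto a crawler node through the DHT key space. The names used here (dht_key, CrawlerRing, node_for) and the example node names are hypothetical; a real deployment relies on Bamboo's own routing rather than an explicit ring.

    import hashlib
    from bisect import bisect_left

    def dht_key(text):
        """Hash arbitrary text (e.g. a URL) into a 160-bit DHT identifier."""
        return int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)

    class CrawlerRing:
        """Toy stand-in for the DHT key space (an assumption, not Bamboo's API).

        A real DHT routes each request to the responsible node in O(log n)
        hops; here we only sketch the key-to-node mapping itself.
        """

        def __init__(self, node_names):
            self.ring = sorted(dht_key(n) for n in node_names)
            self.owners = {dht_key(n): n for n in node_names}

        def node_for(self, key):
            # Each node is responsible for keys from its predecessor's
            # identifier (exclusive) up to its own identifier (inclusive).
            idx = bisect_left(self.ring, key) % len(self.ring)
            return self.owners[self.ring[idx]]

    # Dispatching a crawl request: hash the URL and hand it to whichever
    # crawler node owns that region of the key space.
    ring = CrawlerRing(["crawler-a.example.org", "crawler-b.example.org"])
    print(ring.node_for(dht_key("http://www.berkeley.edu/index.html")))

Because the hash function populates the key space roughly uniformly, URLs spread evenly across the crawler nodes, which is the property the crawl distribution relies on.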
The query is executed in a distributed fashion as follows. Each crawler node runs the PIER query engine and is responsible for crawling a different set of web pages. The query is first sent to all the crawler nodes and set to run according to some exit criterion (crawling time, depth of crawl, etc.). URLs are partitioned amongst the participating crawler nodes by publishing them in the DHT. The partitioning (discussed in a later section) is determined by how URLs are published into the DHT, such as by hashing the entire URL or only its hostname. Each node is responsible for crawling the URLs published in its partition of the DHT. The crawl begins by publishing a set of seed URLs, starting a distributed, recursive computation in which the PIER query continuously scans the DHT for new URLs to be crawled. A web page is downloaded for each URL crawled; the links it contains are refined according to user predicates and then republished into the DHT for further crawling.
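To make the recursive step concrete, here is a minimal sketch of one node's share of the work, under the assumption of a generic DHT handle with hypothetical scan and publish methods and user-supplied fetch_page, extract_links, and matches_predicates helpers. In the actual system this logic is expressed declaratively as a PIER query rather than as hand-written code.

    import hashlib
    import urllib.parse

    def partition_key(url, by_hostname=False):
        """Choose the DHT key under which a URL is published.

        Hashing the full URL spreads work at page granularity; hashing only
        the hostname keeps all pages of a site on the same crawler node.
        """
        text = urllib.parse.urlparse(url).netloc if by_hostname else url
        return hashlib.sha1(text.encode("utf-8")).digest()

    def crawl_step(dht, node, fetch_page, extract_links, matches_predicates,
                   by_hostname=False):
        """One iteration of the recursive crawl on a single crawler node."""
        for url in dht.scan(node):               # new URLs in our partition
            page = fetch_page(url)               # download the web page
            for link in extract_links(page):     # outgoing links on the page
                if matches_predicates(link):     # user-supplied crawl filter
                    # Republish the link; the DHT routes it to whichever
                    # node owns that key, continuing the recursion there.
                    dht.publish(partition_key(link, by_hostname), link)

Seeding the crawl then amounts to publishing each seed URL under its partition key; repeated application of this per-node step carries the recursion forward until the exit criterion is met.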


Note that the distributed query processor naturally coordinates and partitions the work across the participating nodes. Using only data partitioning of the intermediate URLs to be processed, it parallelizes the crawl without explicit code to ensure that multiple sites do not crawl the same web pages redundantly.

The potential of such a distributed crawler is immense. We envision this distributed crawler being used in the following ways:

• Infrastructure for Crawl Personalization: Users can customize their own crawl queries and execute them using this service. Special interest groups can also perform collaborative filtering by executing a query that matches the group's interests.

• High-throughput Crawling: By harnessing the power of a large number of nodes, the crawling service is more scalable than centralized systems.

• Generalized Crawler for P2P Overlays: Although we focus our attention on web crawling, our ideas can be used to build a generalized crawler for querying distributed graph structures over the Internet [18]. Such a generalized crawler can be used to run queries over the link structure or metadata of P2P networks such as Gnutella [2] and OceanStore [16].

The rest of this paper is organized as follows. Section 2 provides a high-level overview of the distributed crawler. In Section 3, we give a detailed description of the crawl query execution of our distributed crawler. Next, in Section 4, we compare different query execution strategies by running our crawling service on PlanetLab [6] nodes, crawling live web sources. We survey centralized and parallel crawlers, as well as other distributed web crawling schemes, in Section 5. Finally, we
