Berkeley COMPSCI 294 - Distributed Web Crawling over DHTs

Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4

Search Today
[Figure: users Search against a central Index; search engines Crawl the web to build it]

What's Wrong?
- Users have a limited search interface.
- Today's web is dynamic and growing:
  - Timely re-crawls are required.
  - Not feasible for all web sites.
- Search engines control your search results:
  - They decide which sites get crawled:
    - 550 billion documents estimated in 2001 (BrightPlanet).
    - Google indexes 3.3 billion documents.
  - They decide which sites get updated more frequently.
  - They may censor or skew result rankings.
- Challenge: user-customizable searches that scale.

Our Solution: A Distributed Crawler
- P2P users donate excess bandwidth and computation resources to crawl the web.
  - Organized using distributed hash tables (DHTs).
  - DHT- and query-processor-agnostic crawler:
    - Designed to work over any DHT.
    - Crawls can be expressed as declarative recursive queries.
  - Easy for users to customize.
  - Queries can be executed over PIER, a DHT-based relational P2P query processor.
- Crawlers: PIER nodes. Crawlees: web servers.

Potential
- Infrastructure for crawl personalization:
  - User-defined focused crawlers.
  - Collaborative crawling/filtering (special interest groups).
- Other possibilities:
  - A bigger, better, faster web crawler.
  - Enables new search and indexing technologies:
    - P2P web search.
    - Web archival and storage (with OceanStore).
- A generalized crawler for querying distributed graph structures:
  - Monitor file-sharing networks, e.g. Gnutella.
  - P2P network maintenance:
    - Routing information.
    - OceanStore metadata.

Challenges that We Investigated
- Scalability and throughput:
  - DHT communication overheads.
  - Balance network load on crawlers:
    - Two components of network load: download bandwidth and DHT bandwidth.
  - Network proximity: exploit the network locality of crawlers.
- Limit download rates on web sites:
  - Prevents denial-of-service attacks.
- Main tradeoff: tension between coordination and communication.
  - Balance load either on crawlers or on crawlees!
  - Exploit network proximity at the cost of communication.

Crawl as a Recursive Query
[Dataflow figure: seed URLs are published as WebPage(url) tuples; each crawler thread takes its input URLs from a DHT Scan of WebPage(url), optionally Redirecting assigned work to another crawler, applies Rate Throttle & Reorder, then the CrawlWrapper's Downloader and Extractor, Filters, and DupElim, and publishes Link(sourceUrl, destUrl); the projection Π: Link.destUrl -> WebPage(url) is de-duplicated and published back into the DHT, closing the recursion. A minimal sketch of this loop follows.]
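A minimal single-process sketch of that recursive loop, assuming Python sets stand in for the DHT-published WebPage and Link tables: the relation names and operator stages come from the slide above, while the function names and the regular-expression link extractor are illustrative assumptions rather than PIER's actual query or API.

```python
# Minimal sketch of the recursive crawl dataflow from the
# "Crawl as a Recursive Query" slide. The DHT is modelled as plain Python
# sets; Downloader/Extractor are stubbed. Names other than WebPage/Link
# are illustrative assumptions, not PIER's actual API.
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

web_page = {"http://www.google.com"}   # WebPage(url): seed URLs / frontier
link = set()                           # Link(sourceUrl, destUrl)
seen = set()                           # DupElim state

def downloader(url):
    """Fetch a page (the CrawlWrapper's Downloader stage)."""
    try:
        return urlopen(url, timeout=5).read().decode("utf-8", "ignore")
    except Exception:
        return ""

def extractor(base_url, html):
    """Extract outgoing links (the CrawlWrapper's Extractor stage)."""
    return {urljoin(base_url, href)
            for href in re.findall(r'href="([^"#]+)"', html)}

def crawl_step(frontier):
    """One iteration of the recursive query: WebPage -> Link -> WebPage."""
    new_pages = set()
    for url in frontier:
        for dest in extractor(url, downloader(url)):
            link.add((url, dest))                   # Publish Link(sourceUrl, destUrl)
            if dest not in seen and urlparse(dest).scheme in ("http", "https"):
                seen.add(dest)
                new_pages.add(dest)                 # Π: Link.destUrl -> WebPage(url)
    return new_pages

# Fixpoint loop: keep re-publishing newly discovered URLs until none remain
# (the real deployment bounds this by time, e.g. the 15-minute crawls below).
frontier = set(web_page)
for _ in range(2):                                  # bounded depth for the sketch
    frontier = crawl_step(frontier)
    web_page |= frontier
```

In the real system each stage runs as a dataflow operator over the DHT, so newly published WebPage tuples are scanned by whichever crawler the partitioning scheme makes responsible for them.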
Crawl Distribution Strategies
- Partition by URL:
  - Ensures an even distribution of crawler workload.
  - High DHT communication traffic.
- Partition by hostname:
  - One crawler per hostname.
  - Creates a "control point" for per-server rate throttling.
  - May lead to uneven crawler load distribution.
  - Single point of failure: a "bad" choice of crawler hurts per-site crawl throughput.
  - Slight variation: X crawlers per hostname.

Redirection
- A simple technique that lets a crawler redirect, or pass on, its assigned work to another crawler (and so on).
- A second-chance distribution mechanism, orthogonal to the partitioning scheme.
- Example (partition by hostname): the node responsible for www.google.com dispatches work, by URL, to other nodes.
  - Load-balancing benefits of partition by URL.
  - Control benefits of partition by hostname.
- When? Policy-based:
  - Crawler load (queue size).
  - Network proximity.
- Why not? The cost of redirection:
  - Increased DHT control traffic.
  - Hence, limit the number of redirections per URL.
  (A code sketch of both partitioning schemes and this redirection policy appears after the backup slides.)

Experiments
- Deployment:
  - Web crawler over PIER and the Bamboo DHT, on up to 80 PlanetLab nodes.
  - 3 crawl threads per crawler; 15-minute crawl duration.
- Distribution (partition) schemes:
  - URL.
  - Hostname.
  - Hostname with 8 crawlers per unique host.
  - Hostname with one level of redirection on overload.
- Crawl workloads:
  - Exhaustive crawl:
    - Seed URL: http://www.google.com
    - 78,244 different web servers.
  - Crawl of a fixed number of sites:
    - Seed URL: http://www.google.com
    - 45 web servers within google.
  - Crawl of a single site within http://groups.google.com

Crawl of Multiple Sites I
- Hostname can exploit at most 45 crawlers.
- Redirect (hybrid hostname/URL) does the best.
- Partition by hostname shows severe imbalance (70% of crawlers idle); throughput improves when more crawlers are busy.
[Figures: CDF of per-crawler downloads (80 nodes); crawl throughput scaleup]

Crawl of Multiple Sites II
- Redirection incurs higher overheads only after the queue size exceeds a threshold.
- Hostname incurs low overheads, since the crawl only looks at google.com, which has many self-links.
- Redirect: the per-URL DHT overheads hit their maximum around 70 nodes.
[Figure: per-URL DHT overheads]

Network Proximity
- Sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
- Partition by hostname approximates random assignment.
- Best-3 random is "close enough" to best-5 random.
- Sanity check: what if a single host crawls all targets?

Summary of Schemes
[Table comparing the URL, Hostname, and Redirect schemes on: load balance of download bandwidth, load balance of DHT bandwidth, DHT communication overheads, network proximity, and rate-limiting of crawlees]

Related Work
- Herodotus, at MIT (Chord-based):
  - Partition by URL.
  - Batching with ring-based forwarding.
  - Experimented on 4 local machines.
- Apoidea, at Georgia Tech (Chord-based):
  - Partition by hostname.
  - Forwards crawl work to the DHT neighbor closest to the web site.
  - Experimented on 12 local machines.

Conclusion
Our main contributions:
- Propose a DHT- and query-processor-agnostic distributed crawler.
- Express the crawl as a query:
  - Permits user-customizable refinement of crawls.
- Discover important trade-offs in distributed crawling:
  - Coordination comes with extra communication costs.
- Deployment and experimentation on PlanetLab:
  - Examine crawl distribution strategies under different workloads on live web sources.
  - Measure the potential benefits of network proximity.

Backup Slides

Existing Crawlers
- Cluster-based crawlers:
  - Google: a centralized dispatcher sends URLs to be crawled.
  - Hash-based parallel crawlers.
- Focused crawlers:
  - BINGO! crawls the web given a basic training set.
- Peer-to-peer:
  - Grub: SETI@Home infrastructure; 23,993 members.

Exhaustive Crawl
- Partition by hostname shows imbalance: some crawlers are over-utilized for downloads.
- Little difference in throughput: most crawler threads are kept busy.

Single Site
- URL is best, followed by redirect and hostname.

Future Work
- Fault tolerance.
- Security.
- Single-node throughput.
- Work-sharing between crawl queries:
  - Essential for overlapping users.
- Crawl global prioritization:
  - A requirement of personalized crawls.
  - Online relevance feedback.
- Deep web.
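The "Crawl Distribution Strategies" and "Redirection" slides describe the two partitioning schemes and the policy-based hand-off but give no code. Below is a minimal sketch, assuming a SHA-1 hash onto a fixed node list stands in for the DHT's lookup, and assuming illustrative values for the queue threshold and per-URL redirection cap (the slides only say redirection is load-triggered and bounded, with one level used in the experiments).

```python
# Sketch of the two partitioning schemes plus queue-based redirection from
# the "Crawl Distribution Strategies" and "Redirection" slides. The DHT
# lookup (node_for), queue-size threshold, and redirect cap are assumptions.
import hashlib
from urllib.parse import urlparse

NODES = [f"node{i}" for i in range(80)]   # e.g. 80 PlanetLab crawlers
QUEUE_THRESHOLD = 100                     # illustrative overload threshold
MAX_REDIRECTS = 1                         # slides use one level of redirection

queues = {n: [] for n in NODES}           # per-crawler work queues

def node_for(key):
    """Map a key onto a crawler, standing in for the DHT's lookup."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def assign(url, scheme="hostname", redirects=0):
    """Choose the crawler responsible for a URL and enqueue the work there."""
    key = urlparse(url).hostname if scheme == "hostname" else url
    owner = node_for(key)
    # Redirection: if the owner is overloaded, pass the work on (by URL),
    # giving hostname partitioning the load balance of URL partitioning.
    if len(queues[owner]) > QUEUE_THRESHOLD and redirects < MAX_REDIRECTS:
        return assign(url, scheme="url", redirects=redirects + 1)
    queues[owner].append(url)
    return owner

# Usage: hostname partitioning sends every google.com URL to one control
# point; redirection kicks in only once that crawler's queue fills up.
print(assign("http://www.google.com/search?q=dht", scheme="hostname"))
```

Partitioning by hostname keeps one control point per web server for rate throttling, while the redirect path recovers the load balance of per-URL partitioning at the cost of extra DHT control traffic.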

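The dataflow's "Rate Throttle & Reorder" stage and the "limit download rates on web sites" challenge are stated without a concrete policy. The sketch below shows one plausible realization, a minimum interval between requests to the same host, where MIN_INTERVAL and the requeue-based reordering are assumptions rather than anything specified on the slides.

```python
# Per-host rate throttling sketch for the "Rate Throttle & Reorder" stage.
# The one-request-per-host-per-interval policy is an assumed placeholder;
# the slides only state that per-site download rates must be limited.
import time
from collections import deque
from urllib.parse import urlparse

MIN_INTERVAL = 2.0            # assumed: seconds between hits to one host
last_fetch = {}               # host -> timestamp of the most recent request

def throttle_and_reorder(pending):
    """Yield URLs whose host is ready; requeue (reorder) the rest."""
    queue = deque(pending)
    while queue:
        url = queue.popleft()
        host = urlparse(url).hostname
        now = time.monotonic()
        if now - last_fetch.get(host, 0.0) >= MIN_INTERVAL:
            last_fetch[host] = now
            yield url                      # ready to hand to the Downloader
        else:
            queue.append(url)              # reorder: defer until the host cools down
            if all(urlparse(u).hostname == host for u in queue):
                time.sleep(MIN_INTERVAL)   # only one host left: wait it out

# Usage: interleaves two hosts while spacing out repeat hits to each.
urls = ["http://groups.google.com/a", "http://www.google.com/b",
        "http://groups.google.com/c"]
for u in throttle_and_reorder(urls):
    print("fetch", u)
```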

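The "Network Proximity" slide reports offline ping measurements (80 PlanetLab hosts against 5,100 crawl targets) and the observation that best-3 random is close to best-5 random. The sketch below illustrates only the best-of-k selection idea itself; the randomly filled RTT table is a stand-in for those measurements, included purely so the example runs.

```python
# Best-of-k candidate selection sketch for the "Network Proximity" slide:
# rather than accepting the single DHT-assigned crawler, pick a few random
# candidates and hand the URL to the closest one.
import random

CRAWLERS = [f"planetlab{i}.example.org" for i in range(80)]   # hypothetical names

# rtt[(crawler, target_host)] -> round-trip time in ms; filled with random
# values here so the sketch runs, where the slide used real ping data.
rtt = {(c, "www.google.com"): random.uniform(10, 400) for c in CRAWLERS}

def best_of_k(target_host, k=3):
    """Pick the closest of k randomly chosen candidate crawlers."""
    candidates = random.sample(CRAWLERS, k)
    return min(candidates, key=lambda c: rtt[(c, target_host)])

# Best-3 already captures most of the benefit of best-5 on the slide's data;
# here the two policies are simply run against the synthetic table.
print("best-3:", best_of_k("www.google.com", k=3))
print("best-5:", best_of_k("www.google.com", k=5))
```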