
Slide 1: Crawling the Web (Princeton COS 435)

Slide 2: Web Crawling
• Retrieve (for indexing, storage, …) Web pages by using the links found on a page to locate more pages.
• Must have some starting point.

Slide 3: Type of crawl
• Web crawl versus crawl of a more limited network (a "web")
  – cs.princeton.edu
  – internal company network
• Complete crawl versus focused crawl by some criteria
  – pages on one topic
• The type of crawl will affect the necessity/usability of various techniques.

Slide 4: Issues?

Slide 5: Main Issues I
• Starting set of pages?
• Can we visit the whole of the Web (or web)?
• How do we determine the order in which to visit links?
  – graph model: breadth-first vs. depth-first
    • what are the pros and cons of each?
    • "black holes"
  – other aspects/considerations
    • how deep do we want to go?
    • associate a priority with links

Slide 6: Breadth-first vs. depth-first
[Figure: example showing the order in which pages are visited under a breadth-first crawl and under a depth-first crawl.]

Slide 7: Main Issues II
• The Web is dynamic
  – time to crawl "once"
  – how to mix crawl and re-crawl
    • priority of pages
• Social behavior
  – robot exclusion protocol
  – do not flood servers

Slide 8: Technical issues
• Maintain one or more queues of URLs to be visited
  – order of URLs in the queues? (see the frontier sketch after the slides)
    • FIFO = breadth-first
    • LIFO = depth-first
    • priority queues
• Bottleneck: resolving hostnames in URLs to actual IP addresses via Domain Name Service servers (DNS lookup)
• To do large crawls, must have multiple crawlers with multiple network connections (sockets) open, and probably multiple queues

Slide 9: DNS lookup
• Don't want temporal locality of reference
  – be nice to servers (or else)
• Cache the DNS map (see the cache sketch after the slides)
  – large, local, in memory
  – hold the most recently used mappings
• Prefetch DNS resolution for the URLs on a page when it is parsed
  – put the results in the cache
  – use them when the URL gets to the head of the queue
• How "large"?
  – Problems?

Slide 10: Duplicate URL removal
Has this URL been visited already?
• Use:
  – canonical, fully specified URLs
  – the canonical hostname provided by DNS
• Visited? hash table
  – hash the canonical URL to an entry
• The Visited? table may be too large for main memory

Slide 11: Caching the Visited? table?
• Not temporal but "spatial" locality:
  – most popular URLs
  – most popular sites
    • some temporal locality within a site
• To exploit site-level locality, need a hash that brings pages on the same site together:
  – two-level hash: (i) hash the hostname and port, (ii) hash the path
  – can use a B+ tree, sorted on (i) then (ii)
  – if there is no entry for the URL in the tree, it has not been visited
  (see the Visited? sketch after the slides)

Slide 12: Duplicate page removal
Has this page been visited already?
• Mirror sites: different URLs, same page
  – bad: duplicate pages in search results
  – worse?: adding links from duplicate pages to the queues
    • also mirrors?
• Table with a fingerprint (hash) of each page
  – does it fit in main memory?
  – if not, it costs a disk access per page retrieved by the crawler
    • bad

Slide 13: "Almost" duplicates
• Mirrored pages may have slight differences
  – indicate which mirror they are on?
• Need something like shingling (see the sketch after the slides)
• When to apply it?
  – while crawling
  vs.
  – for search results
• Cost?

Slide 14: Good and bad behavior
• The crawler must not flood servers
  – a queue for each server of near-term visits (see the politeness sketch after the slides)
• The crawler checks robot exclusion for each server
• Sites may be badly behaved
  – dynamically generated pages can create:
    • infinitely many pages
    • infinitely deep paths
• Need strategies to detect/avoid bad behavior by sites

Slide 15: Re-crawling
• When to re-crawl which pages?
  – finish the crawl and start over
    • finish = have enough?
  – re-crawl high-priority pages in the middle of the crawl
  – how to determine priority?
• How to integrate the re-crawl of high-priority pages?
  – one choice: a separate cycle for the crawl of high-priority pages

Slide 16: Crawling a large number of pages
• Indexing is not dynamic and continuous
  – index all pages collected at a certain time (end of crawl?)
  – provide the search half of the engine with the new index
• Crawling is continuous
  – start over
    • in some …
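The "Technical issues" slide notes that the queue discipline fixes the crawl order: FIFO gives breadth-first, LIFO gives depth-first, and a priority queue gives priority-ordered crawling. The sketch below illustrates that idea in Python (the slides contain no code); fetch_and_extract_links is a hypothetical caller-supplied function that downloads a page and returns the URLs found on it.

```python
from collections import deque

def crawl(seeds, fetch_and_extract_links, order="fifo", max_pages=1000):
    """Crawl from `seeds`; the queue discipline sets the traversal order.

    order="fifo" -> breadth-first crawl; order="lifo" -> depth-first crawl.
    """
    frontier = deque(seeds)
    visited = set(seeds)                  # duplicate URL removal (slides 10-11)
    while frontier and len(visited) < max_pages:
        url = frontier.popleft() if order == "fifo" else frontier.pop()
        for link in fetch_and_extract_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return visited
```

For a priority-ordered crawl, the deque would be replaced by a heap (heapq) keyed on whatever page-priority estimate the crawler uses.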
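The "DNS lookup" slide suggests keeping a large, local, in-memory DNS map that holds the most recently used mappings, and prefetching resolutions for URLs while the page containing them is parsed. Below is a minimal sketch under those assumptions; the capacity is made up, and socket.gethostbyname stands in for the DNS resolution that the slide identifies as the bottleneck.

```python
import socket
from collections import OrderedDict

class DNSCache:
    """In-memory hostname -> IP map that keeps only the most recently used entries."""

    def __init__(self, capacity=100_000):          # capacity is an assumed value
        self.capacity = capacity
        self.map = OrderedDict()

    def resolve(self, hostname):
        if hostname in self.map:
            self.map.move_to_end(hostname)         # mark as most recently used
            return self.map[hostname]
        ip = socket.gethostbyname(hostname)        # the slow DNS lookup
        self.map[hostname] = ip
        if len(self.map) > self.capacity:
            self.map.popitem(last=False)           # evict the least recently used
        return ip

    def prefetch(self, hostnames):
        """Resolve hostnames seen while parsing a page, before their URLs
        reach the head of the queue."""
        for hostname in hostnames:
            try:
                self.resolve(hostname)
            except OSError:
                pass                               # skip hosts that fail to resolve
```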
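The "Duplicate URL removal" and "Caching the Visited? table" slides check each URL against a table keyed by canonical URLs, using a two-level hash (hostname and port, then path) so that URLs from the same site end up together. The sketch below keeps that key structure but uses a plain Python set in place of the B+ tree named on the slide; the canonical_host parameter is a hypothetical hook for the canonical hostname that the slides say comes from DNS.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url, canonical_host=None):
    """Build a canonical, fully specified URL (explicit port, no fragment)."""
    parts = urlsplit(url)
    host = (canonical_host or parts.hostname or "").lower()
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return urlunsplit((parts.scheme, f"{host}:{port}", parts.path or "/",
                       parts.query, ""))

def two_level_key(url):
    """Two-level hash: (i) hostname and port, (ii) path."""
    parts = urlsplit(url)
    site = f"{parts.hostname}:{parts.port}".encode()
    path = parts.path.encode()
    return (hashlib.sha1(site).hexdigest()[:16],   # (i) groups a site's URLs together
            hashlib.sha1(path).hexdigest()[:16])   # (ii) distinguishes pages on the site

visited_urls = set()    # stand-in for the B+ tree sorted on (i) then (ii)

def already_visited(url, canonical_host=None):
    """Return True if the URL was seen before; otherwise record it as visited."""
    key = two_level_key(canonicalize(url, canonical_host))
    if key in visited_urls:
        return True
    visited_urls.add(key)
    return False
```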

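The "Duplicate page removal" slide keeps one fingerprint (hash) per page to catch exact mirrors, and the "almost duplicates" slide says slightly differing mirrors need something like shingling. Here is a small sketch of both ideas; the word-level shingles and the shingle size k=8 are assumptions, not choices made on the slides.

```python
import hashlib

def fingerprint(page_bytes):
    """Fingerprint of a page: exact duplicates collide, anything else does not
    (up to hash collisions)."""
    return hashlib.md5(page_bytes).hexdigest()

def shingles(text, k=8):
    """Set of k-word shingles (overlapping word windows) of a page's text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(text_a, text_b, k=8):
    """Jaccard overlap of shingle sets; values near 1.0 suggest 'almost' duplicates,
    e.g. mirrored pages with slight differences."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```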

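The "Good and bad behavior" slide asks for a queue per server of near-term visits (so no server is flooded) and a robot-exclusion check for each server. Below is a sketch of both pieces using Python's standard urllib.robotparser; the 5-second delay and the user-agent name are made-up values, and a failed robots.txt fetch leaves the parser answering False, which is the conservative choice.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

POLITENESS_DELAY = 5.0            # seconds between requests to one server (assumed)
AGENT = "cos435-crawler"          # made-up user-agent name

per_server = defaultdict(deque)   # server -> queue of near-term visits
next_ok_time = {}                 # server -> earliest time it may be contacted again
robots = {}                       # server -> parsed robots.txt

def allowed(url):
    """Check the robot exclusion protocol for the URL's server."""
    server = urlsplit(url).netloc
    if server not in robots:
        parser = RobotFileParser(f"http://{server}/robots.txt")
        try:
            parser.read()
        except OSError:
            pass                  # fetch failed: can_fetch() will return False
        robots[server] = parser
    return robots[server].can_fetch(AGENT, url)

def enqueue(url):
    per_server[urlsplit(url).netloc].append(url)

def next_url_to_fetch():
    """Return a URL whose server is outside its politeness delay, or None."""
    now = time.time()
    for server, queue in per_server.items():
        if queue and next_ok_time.get(server, 0.0) <= now:
            next_ok_time[server] = now + POLITENESS_DELAY
            return queue.popleft()
    return None
```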