Web Crawling and Automatic Discovery Donna Bergmark March 14 2002 Web Resource Discovery Surfing Serendipity Search Specific Information Inverted keyword list Page lookup Crawler Text for keyword indexing Hence crawlers are needed for discovery of Web resources Definition Spider robot crawler Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web Some History First crawlers appeared in 1994 Why Web growth April 1993 62 registered web servers In 1994 Web http traffic grew 15 X faster than the Internet itself Lycos was announced in 1994 as a search engine So why not write a robot You d think a crawler would be easy to write Pick up the next URL Connect to the server GET the URL When the page arrives get its links optionally do other stuff REPEAT Crawler Issues The URL itself Politeness Visit Order Robot Traps The hidden web System Considerations Standard for Robot Exclusion Martin Koster 1994 http any server 80 robots txt Maintained by the webmaster Forbid access to pages directories Commonly excluded cgi bin Adherence is voluntary for the crawler The Four Laws of Web Robotics A Crawler must show identifications A Crawler must obey the robots txt A Crawler must not hog resources A Crawler must report errors Visit Order The frontier Breadth first FIFO queue Depth first LIFO queue Best first Priority queue Random Refresh rate Robot Traps Cycles in the Web graph Infinite links on a page Traps set out by the Webmaster The Hidden Web Dynamic pages increasing Subscription pages Username and password pages Research in progress on how crawlers can get into the hidden web System Issues Crawlers are complicated systems Efficiency is of utmost importance Crawlers are demanding of system and network resources Mercator 1 Written in Java One file configures a crawl How many threads What analyzers to use What filters to use How to place links on the frontier How long to run Mercator 2 Tell it what seed URL s to start with Can add your own code Extend one or more of M s base classes Add totally new classes called by your own Is very efficient at memory usage URLs are hashed Documents are finger printed Mercator 3 Industrial strength crawler Multi threaded for parallel crawls Polite one thread for one server Mercator implements own host lookup Mercator uses its own DNS The Web as a Graph Crawling is meant to traverse the web Remove some edges to create a tree I e do not revisit URLs You can only crawl forwards I e need explicit back links Page rank The Web is a BIG Graph Diameter of the Web Cannot crawl even the static part completely New technology the focused crawl Conclusion Clearly crawling is not simple Hot topic of the late 90 s research Good technologies as a result Focused crawling is where crawling is going next hot topic of early 2000 s
View Full Document
Unlocking...