CU-Boulder CSCI 5417 - Lecture 19



CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 19, 11/1/2011

Today
- Crawling
- Start on link-based ranking

Updated crawling picture
(Diagram: seed pages feed the URL frontier; crawling threads fetch URLs, which move into "crawled and parsed," with the rest of the unseen Web beyond)

URL frontier
- The URL frontier contains URLs that have been discovered but not yet explored (retrieved and analyzed for content and for more URLs)
- It can include multiple URLs from the same host
- We must avoid trying to fetch them all at the same time, even from different crawling threads
- We must also try to keep all crawling threads busy

Robots.txt
- Protocol for giving spiders ("robots") limited access to a website, originally from 1994
- The website announces its requests about what can (and cannot) be crawled
- The site places a file named robots.txt at its root; this file specifies the access restrictions

Robots.txt example
- No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:

Processing steps in crawling
- Pick a URL from the frontier (which one? This is a policy question, addressed by the frontier design later in the lecture)
- Fetch the document at that URL
- Parse the document and extract links from it to other docs (URLs)
- Check whether the document's content has already been seen; if not, add it to the indexes
- For each extracted URL:
  - Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  - Check whether it is already in the frontier (duplicate URL elimination)
- A minimal end-to-end code sketch of these steps appears below, after the URL-normalization notes

Basic crawl architecture
(Diagram: WWW -> Fetch (backed by DNS) -> Parse -> Content seen? (doc fingerprints) -> URL filter (robots filters) -> Dup URL elim (URL set) -> URL frontier, which feeds back into Fetch)

DNS (Domain Name System)
- A lookup service on the internet: given a host name from a URL, retrieve its IP address
- The service is provided by a distributed set of servers, so lookup latencies can be high (even seconds)
- Common implementations of DNS lookup are blocking: only one outstanding request at a time
- Solutions:
  - DNS caching
  - A batch DNS resolver that collects requests and sends them out together

Parsing: URL normalization
- When a fetched document is parsed, some of the extracted links are relative URLs
- For example, en.wikipedia.org/wiki/Main_Page has a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL en.wikipedia.org/wiki/Wikipedia:General_disclaimer
- Such relative URLs must be expanded
- URL shorteners (bit.ly, etc.) are a new problem
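The processing steps above, together with the robots.txt, content-seen, and URL-normalization points, can be tied together in a short single-threaded sketch. This is a minimal illustration using only the Python standard library, not the architecture from the slides: the crawl() function name, the regex-based link extraction, the fixed politeness delay, the scheme-only URL filter, and the SHA-1 body fingerprint are all simplifying assumptions.

    import hashlib
    import re
    import time
    import urllib.error
    import urllib.request
    import urllib.robotparser
    from collections import deque
    from urllib.parse import urljoin, urldefrag, urlparse

    HREF_RE = re.compile(rb'href="([^"#]+)"', re.IGNORECASE)

    def crawl(seed_urls, max_pages=50, delay=1.0):
        frontier = deque(seed_urls)    # URLs discovered but not yet fetched
        seen_urls = set(seed_urls)     # for duplicate URL elimination
        seen_fps = set()               # "content seen?" via document fingerprints
        robots = {}                    # cached robots.txt parser per host
        indexed = []

        while frontier and len(indexed) < max_pages:
            url = frontier.popleft()                 # pick a URL from the frontier
            parts = urlparse(url)
            host = parts.scheme + "://" + parts.netloc

            # Explicit politeness: fetch and cache robots.txt once per host.
            # If robots.txt cannot be fetched at all, can_fetch() returns False
            # and the host is skipped (a conservative simplification).
            if host not in robots:
                rp = urllib.robotparser.RobotFileParser(host + "/robots.txt")
                try:
                    rp.read()
                except (urllib.error.URLError, OSError):
                    pass
                robots[host] = rp
            if not robots[host].can_fetch("*", url):
                continue

            try:                                     # fetch the document
                with urllib.request.urlopen(url, timeout=10) as resp:
                    body = resp.read()
            except (urllib.error.URLError, OSError):
                continue

            # Content seen? Skip exact duplicates using a fingerprint of the body.
            fp = hashlib.sha1(body).hexdigest()
            if fp in seen_fps:
                continue
            seen_fps.add(fp)
            indexed.append(url)                      # stand-in for "add to indexes"

            # Parse: extract links, normalize relative URLs, filter, de-duplicate.
            for m in HREF_RE.finditer(body):
                link = m.group(1).decode("utf-8", "ignore")
                absolute, _frag = urldefrag(urljoin(url, link))  # URL normalization
                if urlparse(absolute).scheme not in ("http", "https"):
                    continue                         # URL filter test (scheme only here)
                if absolute not in seen_urls:        # duplicate URL elimination
                    seen_urls.add(absolute)
                    frontier.append(absolute)

            time.sleep(delay)   # implicit politeness: pause between fetches

        return indexed

The sketch deliberately uses a plain FIFO deque as the frontier and a single global delay; the Mercator frontier discussed later replaces both with per-host politeness and priority-based freshness.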
Content seen?
- Duplication is widespread on the web
- If the page just fetched is already in the index, do not process it further
- This is verified using document fingerprints or shingles

Filters and robots.txt
- Filters: regular expressions specifying which URLs should (or should not) be crawled
- Once a robots.txt file has been fetched from a site, there is no need to fetch it repeatedly
  - Doing so burns bandwidth and hits the web server
  - Cache robots.txt files

Duplicate URL elimination
- Check whether an extracted and filtered URL has already been put into the URL frontier
- This may or may not be needed, depending on the crawling architecture

Distributing the crawler
- Run multiple crawl threads, under different processes, potentially at different nodes
- Nodes can be geographically distributed
- Partition the hosts being crawled among the nodes

Overview: Frontier
(Diagram: the URL frontier holds discovered but not yet crawled URLs; crawl threads request URLs to crawl from it, and crawlers feed newly discovered URLs back into it, subject to filtering)

URL frontier: two main considerations
- Politeness: do not hit a web server too frequently
- Freshness: crawl some sites/pages more often than others
  - E.g., pages (such as news sites) whose content changes often
- These goals may conflict with each other

Politeness: challenges
- Even if we restrict only one thread to fetch from a host, it can still hit that host repeatedly
- Common heuristic: insert a time gap between successive requests to a host that is much greater than the time taken by the most recent fetch from that host

URL frontier: Mercator scheme (IIR Sec. 20.2.3)
(Diagram: incoming URLs pass through a prioritizer into K front queues; a biased front queue selector and back queue router move them into B back queues, each holding unexplored URLs from a single host; a back queue selector hands URLs to crawl threads requesting work)

Mercator URL frontier
- URLs flow in from the top into the frontier
- Front queues manage prioritization
- Back queues enforce politeness
- Each queue is FIFO

Explicit and implicit politeness
- Explicit politeness: specifications from webmasters on what portions of a site can be crawled (robots.txt)
- Implicit politeness: even with no specification, avoid hitting any site too often

Front queues
- The prioritizer assigns each URL an integer priority between 1 and K and appends the URL to the corresponding front queue
- Heuristics for assigning priority:
  - Refresh rate sampled from previous crawls
  - Application-specific rules (e.g., "crawl news sites more often")
(Diagram: the prioritizer feeds front queues 1 through K, which are drained by the biased front queue selector / back queue router)

Biased front queue selector
- When a back queue requests URLs (in a sequence described below), the selector picks a front queue from which to pull a URL
- This choice can be round robin biased toward queues of higher priority, or some more sophisticated variant

Back queues
(Diagram: the biased front queue selector / back queue router feeds back queues 1 through B, which are drained by a back queue selector driven by a heap)

Back queue invariants
- Each back queue is kept non-empty while the crawl is in progress
- Each back queue contains URLs from only a single host
- Maintain a table mapping hosts to back queues

Back queue heap
- One entry for each back queue
- The entry is the earliest time t_e at which the host corresponding to that back queue can be hit again
- A minimal code sketch of this back-queue mechanism is given below
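To make the back-queue slides concrete, here is a minimal single-threaded sketch of that side of the Mercator frontier: one queue per host, a host-to-queue table (the dict keys), and a heap keyed by the earliest time t_e at which each host may be hit again. The class name, the fixed politeness gap, and the "retire empty hosts" behavior are simplifying assumptions; in the full scheme (IIR Sec. 20.2.3) empty back queues are refilled from the front queues, and t_e is set after a fetch completes using a gap much larger than the fetch time.

    import heapq
    import time
    from collections import deque
    from urllib.parse import urlparse


    class BackQueues:
        """Per-host queues plus a heap of earliest-allowed-hit times t_e."""

        def __init__(self, politeness_gap=2.0):
            self.queues = {}   # host -> deque of URLs (one host per queue)
            self.heap = []     # entries (t_e, host): earliest time the host may be hit again
            self.gap = politeness_gap

        def add(self, url):
            # Back queue router: place the URL on its host's queue, creating the
            # queue (and its heap entry) the first time the host is seen.
            host = urlparse(url).netloc
            if host not in self.queues:
                self.queues[host] = deque()
                heapq.heappush(self.heap, (time.monotonic(), host))
            self.queues[host].append(url)

        def next_url(self):
            # Back queue selector: serve a crawl thread the URL from the host whose
            # earliest-hit time t_e is smallest, waiting if t_e is still in the future.
            while self.heap:
                t_e, host = heapq.heappop(self.heap)
                if not self.queues[host]:
                    # In full Mercator this queue would be refilled from the front
                    # queues to keep it non-empty; here we simply retire the host.
                    del self.queues[host]
                    continue
                wait = t_e - time.monotonic()
                if wait > 0:
                    time.sleep(wait)          # politeness: never hit the host early
                url = self.queues[host].popleft()
                # Reschedule the host. Mercator sets t_e after the fetch completes,
                # using a gap much larger than the fetch time; a fixed gap is used here.
                heapq.heappush(self.heap, (time.monotonic() + self.gap, host))
                return url
            return None                       # frontier exhausted


    # Example usage with hypothetical URLs:
    bq = BackQueues()
    for u in ["https://example.com/a", "https://example.org/b", "https://example.com/c"]:
        bq.add(u)
    print(bq.next_url(), bq.next_url())   # URLs from different hosts come back first

Note how the heap, not the queues themselves, decides which host is served next: a host with many queued URLs still cannot be hit again until its t_e entry comes due.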

