ODU CS 791 - Efficient Crawling Through URL Ordering

Efficient Crawling Through URL Ordering
By: Junghoo Cho, Hector Garcia-Molina, and Lawrence Page
Presenters: Omkar S. Kasinadhuni and Simerjeet Kaur

Overview
1) Introduction
2) Importance Metrics
3) Problem Definition
4) Ordering Metrics
5) Experiments
   - BackLink-based Crawler
   - Small-scale Crawler
   - Similarity-based Crawler
6) Conclusion

How does a web crawler work?
- It starts scanning with an initial page p0.
- It collects the URLs from page p0 and stores them in a queue.
- The scanned pages are handed to the client (for indexing, summarizing, or analysis).
- It repeats the same process with each URL in the queue.
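The crawl loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; `fetch_page` and `extract_urls` are hypothetical helpers standing in for a real HTTP fetcher and link extractor.

```python
from collections import deque

def crawl(seed_url, fetch_page, extract_urls, max_pages):
    """Minimal sketch of the basic crawl loop.

    `fetch_page` downloads a page; `extract_urls` returns the URLs it
    links to. Both are assumed helpers, not part of the original slides.
    """
    frontier = deque([seed_url])   # queue of URLs waiting to be visited
    visited = set()
    pages = []                     # scanned pages handed to the client

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()   # FIFO here; ordering metrics refine this
        if url in visited:
            continue
        visited.add(url)
        page = fetch_page(url)
        pages.append(page)
        for link in extract_urls(page):
            if link not in visited:
                frontier.append(link)
    return pages
```

With a FIFO frontier this is breadth-first crawling; the rest of the paper is about replacing `popleft()` with a priority order based on estimated page importance.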
Introduction
Web Crawler: an automated program that accesses a web site and traverses it by following the links present on its pages.
Designing a web crawler is challenging. Why?
- Externally, the crawler should avoid overloading web sites or network links.
- Internally, it has to deal with huge volumes of data, with limited storage capacity on the client. (The authors estimate the entire Web at about 1.5 TB; see http://www.worldwidewebsize.com/ for current estimates.)
- It must revisit previously scanned pages to pick up the latest changes.
"The crawler cannot visit the entire web."

Since the crawler can visit only a limited number of web pages, we want it to visit the "important" pages first. Given a list of URLs, the crawler must choose an efficient order so that important pages come first. But how do we decide that a page is "important"?

Importance Metrics
The importance of a web page p is represented by I(p) and is defined by the following metrics:
- Similarity to a driving query Q: IS(p)
- BackLink count: IB(p)
- PageRank: IR(p)
- Forward link count: IF(p)
- Location metric: IL(p)

Similarity to a Driving Query Q: IS(p)
Importance is defined as the textual similarity between page p and query Q.
- Visualize each document (p or Q) as an n-dimensional vector (w1, w2, ..., wn), where wi corresponds to the i-th word in the vocabulary.
- If the i-th word does not appear in the document, wi is zero; otherwise wi is set to represent the significance of the word.
Two ways to compute significance:
- Significance = (number of times the word appears in the document) * IDF, where the inverse document frequency (IDF) is 1 / (number of times the word appears in the entire collection).
- Significance based on where the word appears in the page (e.g., in the page title vs. the body).
Two ways to calculate the similarity between p and Q:
- Compute the inner product of the p and Q vectors.
- Use the cosine similarity measure, i.e., the inner product of the normalized vectors.
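The TF-IDF weighting and cosine similarity described above can be sketched directly. This is an illustrative implementation under the slides' definitions, not code from the paper; the word lists and IDF table in the usage are made-up examples.

```python
import math
from collections import Counter

def tf_idf_vector(words, idf):
    """Weight each word by (count in document) * IDF, per the slides."""
    counts = Counter(words)
    return {w: counts[w] * idf.get(w, 0.0) for w in counts}

def cosine_similarity(vec_p, vec_q):
    """Inner product of the two normalized vectors."""
    dot = sum(vec_p.get(w, 0.0) * wq for w, wq in vec_q.items())
    norm_p = math.sqrt(sum(v * v for v in vec_p.values()))
    norm_q = math.sqrt(sum(v * v for v in vec_q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)
```

A page identical to the query scores 1.0; a page sharing no vocabulary with it scores 0.0, matching the intent of IS(p).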
Since the crawler has not seen the whole web, IDF is estimated from the pages crawled so far, not from the entire web. The similarity importance metric computed with this estimated IDF is referred to as the estimated importance of the page, IS'(p).

BackLink Count IB(p)
The backlink count of page p, IB(p), is the number of links to page p that appear over the entire web. Instead of the exact backlink count we use an estimated backlink count IB'(p). Why? Because the crawler only knows about the links on pages it has already crawled.
Note: links act as endorsements (discussed in the previous papers).

PageRank IR(p)
PageRank recursively assigns the importance of a page as the weighted sum of the importance of all the pages that have backlinks to it:
    IR(p) = (1 - d) + d * [ IR(t1)/c1 + ... + IR(tn)/cn ]
- Page p is pointed to by pages t1, ..., tn.
- ci is the number of links going out of page ti.
- d is the damping factor: the probability that the surfer jumps to a completely random page with no relation to the current page.
- IR(p) gives the probability that the random surfer is at page p at any given time.

Forward Link Count IF(p)
IF(p) counts the number of links that emanate from page p (e.g., a web directory). Here IF'(p) = IF(p). Why? Because once the crawler fetches p, it sees all of p's outgoing links, so no estimation is needed.

Location Metric IL(p)
In this metric, the importance of a page is based on its location in the Web rather than on its contents. Examples:
- URLs ending with .com may be more important.
- URLs containing the string "home" may be more important than others.
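The recursive PageRank formula above is typically computed by iterating until the values settle. The sketch below is an assumed implementation of the slides' un-normalized form IR(p) = (1-d) + d*sum(IR(t)/c_t); the link graphs in the usage are made-up examples.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply IR(p) = (1-d) + d * sum over backlinks t of IR(t)/c_t.

    `links` maps each page to the list of pages it links to; c_t is the
    out-degree of t. Un-normalized form, as written in the slides.
    """
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    out_degree = {p: len(links[p]) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum contributions from every page t that links to p.
            backlink_sum = sum(
                rank[t] / out_degree[t]
                for t in pages
                if p in links[t] and out_degree[t] > 0
            )
            new_rank[p] = (1 - d) + d * backlink_sum
        rank = new_rank
    return rank
```

A page pointed to by many well-ranked pages ends up with a higher IR(p) than one with few backlinks, which is exactly the "weighted endorsement" intuition.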
- URLs with fewer slashes may be more important.
All of these metrics can be evaluated simply by looking at the URL (hence "location" metrics).
Our importance metric could be a combination of any or all of these metrics.

Problem Definition
Our task is to find a scheme for selecting pages in decreasing order of importance. Depending on how we want the crawler to operate, there are three models:
- Crawl & Stop
- Crawl & Stop with Threshold
- Limited Buffer Crawl

Crawl & Stop Model
- The crawler C starts at its initial page p0 and stops after visiting K pages: r1, r2, ..., rK.
- For a perfect crawler, I(r1) >= I(r2) >= ... >= I(rK).
- The K pages with the highest importance are called the "hot" pages.
- A real crawler would actually visit only M pages with importance at least I(rK).
- The performance of crawler C is Pcs(C) = M/K. (Pcs(C) = 1 for a perfect crawler.)
- A random crawler has performance K/T, where T is the total number of pages on the web: the expected number of desired pages when it stops is K^2/T.

Crawl & Stop with Threshold Model
- Similar to the Crawl & Stop model, but with an importance target G: a page is hot if I(p) >= G.
- With H total hot pages: if K < H, the performance of an ideal crawler is K/H; if K >= H, its performance is 1.
- A purely random crawler visits (H/T)*K hot pages when it stops, so its performance is K/T.

Limited Buffer Crawl Model
- The crawler has limited storage: it can hold at most B pages.
- When the buffer fills up, the crawler must flush some of its pages.
- An ideal crawler flushes the pages with the least importance in the buffer.
- A real crawler must guess which pages in the buffer will eventually have low I(p).
- Performance of the crawler
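The Crawl & Stop performance measure Pcs(C) = M/K can be computed directly once a crawl order and the true importance values are known. This is an illustrative sketch of that evaluation, with a made-up importance table in the usage; it is not from the paper.

```python
def crawl_and_stop_performance(crawl_order, importance, k):
    """Compute Pcs(C) = M/K for the Crawl & Stop model.

    The K-th most important page overall sets the hotness threshold
    I(r_K); M counts visited pages at least that important.
    """
    visited = crawl_order[:k]
    # Importance of the K-th best page a perfect crawler would have found.
    threshold = sorted(importance.values(), reverse=True)[k - 1]
    m = sum(1 for p in visited if importance[p] >= threshold)
    return m / k
```

A perfect crawler scores 1.0 because all K visited pages clear the threshold; a crawler that wastes half its budget on unimportant pages scores 0.5.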

