COS 521: Advanced Algorithm Design
Hashing / Content Distribution
02/09/12
Scribe: Shilpa Nadimpalli
Professor: Sanjeev Arora

1 Introduction to Internet Content Distribution

1.1 The "Hot-Spot" Problem

(These notes are modified from old scribe notes by Alina Ene in 2005.)

Certain web sites (e.g., the New York Times) are accessed thousands of times per day by unique users from multiple locations, which causes severe network congestion around such "hot spots" on the web. A simple solution would be to copy the hot-spot site, cache it at multiple other locations around the web, and redirect users to these new URLs. However, web sites have dynamic content, so we need a scheme that efficiently keeps track of and updates all copies of a site.

1.2 Characteristics of an Ideal Content Delivery Service

• has no centralized control
• is robust to errors – caches can be added and removed from the network at any time
• reduces traffic on the Internet backbone (repetitive long-distance data retrieval)
• prevents swamping of "hot spots" (overloaded servers)
• updates automatically (dynamic)
• is transparent to browsers (different browsers have different views of which caches are accessible at a given time)

2 Possible Solutions Using Hashing

• Let URL denote the set of all possible webpage URLs.
• Let P ⊆ URL denote the set of all webpages served around the world at any given point in time. Notice that P is a huge set, which changes dynamically.
• Let C denote a set of caches (run by a company such as Akamai) that will store and serve this content.

Idea 1: Use a single hash function to map each URL to a specific cache:

    h : URL → C                                            (1)

Problem with Idea 1: This does not solve our congestion problem; the New York Times site is always mapped to the same cache. (Such a mapping is still useful for a task like web hosting.)

Idea 2: Cache each URL k times using k different hash functions, and redirect each URL request to one of the k copies:

    h1, h2, h3, ..., hk : URL → C                          (2)

Problem with Idea 2: Adding or removing caches changes the range of the hash functions, forcing us to remap all data if one cache goes down.

Idea 3: Use Consistent Hashing, an algorithm deployed by the Akamai Company. Let circle denote the unit circle in ℝ². Caches and URLs are hashed to circle using two hash functions:

    hA : C → circle                                        (3)
    hB : URL → circle                                      (4)

Each URL is stored in the cache whose hash is closest to it in the clockwise direction on circle. This way, if one cache crashes, only a 1/(#caches) fraction of the URLs needs to be remapped (assuming caches are roughly uniformly distributed around circle).

The domain name server computes the two hash values and asks the corresponding cache for the page. If the requested page is not in the cache, the cache is responsible for looking up the page and serving it. This is a first cut of the consistent hashing idea; we extend it below.

Figure 1: Caches and URLs are hashed to the unit circle, circle. For all webpages w ∈ URL, if hB(w) is in the region of circle marked by A, then w is stored in cache B.
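To make the clockwise-successor rule concrete, here is a minimal Python sketch of the basic scheme above. It is an illustration under simplifying assumptions, not Akamai's implementation: SHA-256 scaled into [0, 1) stands in for the random hash functions hA and hB, and the names ConsistentHashRing, point_on_circle, add_cache, remove_cache, and lookup are invented for this sketch.

    import bisect
    import hashlib

    def point_on_circle(key: str) -> float:
        """Hash an arbitrary string to a point in [0, 1); stands in for hA / hB."""
        digest = hashlib.sha256(key.encode()).hexdigest()
        return int(digest, 16) / 16 ** len(digest)

    class ConsistentHashRing:
        """Caches and URLs live on the unit circle; a URL is served by the first
        cache encountered clockwise from hB(url)."""

        def __init__(self):
            self._points = []   # sorted cache positions on the circle
            self._caches = {}   # position -> cache name

        def add_cache(self, cache: str) -> None:
            p = point_on_circle(cache)
            bisect.insort(self._points, p)
            self._caches[p] = cache

        def remove_cache(self, cache: str) -> None:
            p = point_on_circle(cache)
            self._points.remove(p)
            del self._caches[p]

        def lookup(self, url: str) -> str:
            """Return the cache responsible for url: its clockwise successor on the circle."""
            p = point_on_circle(url)
            i = bisect.bisect_right(self._points, p) % len(self._points)
            return self._caches[self._points[i]]

    # Demo: removing a cache only remaps the URLs that cache was serving.
    ring = ConsistentHashRing()
    for c in ["cache-A", "cache-B", "cache-C", "cache-D"]:
        ring.add_cache(c)

    urls = [f"http://example.com/page{i}" for i in range(1000)]
    before = {u: ring.lookup(u) for u in urls}
    ring.remove_cache("cache-C")
    moved = sum(1 for u in urls if ring.lookup(u) != before[u])
    print(f"{moved} of {len(urls)} URLs remapped")  # only URLs that were on cache-C move

With only a handful of caches the arcs between cache points can be quite uneven; Claim 3.2 below quantifies how uneven they can get with high probability.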
3 Analysis of the Above Idea

Let us assume that hA and hB are random hash functions. A summary and analysis of random hash functions can be found in Section 6.

Claim 3.1 The number of URLs that fall in any interval on circle between two caches is proportional to the length of that interval.

Argument: We take this as given here. Assuming the hash hB : URL → circle is random, a complete proof is possible using the Strong Law of Large Numbers or Chernoff bounds.

Claim 3.2 The load on each cache is approximately well-balanced. Specifically, the portion of circle assigned to each cache is ≤ O(lg(C)/C) with high probability, where C is the number of caches.

Proof:

• Define the bad event B: some single cache controls more than 8·lg(C)/C of circle.
• Consider the arc of this length. If we split this arc into two pieces, one of those pieces must have length > 4·lg(C)/C. Define a little interval L, contained in this larger piece, of length 2·lg(C)/C. If B occurs, then such an interval L exists, and no cache lands in L.

Figure 2: The arc on circle in the bad event that a single cache controls a proportion of circle greater than 8·lg(C)/C.

• Pr[no cache lands in L] = (1 − 2·lg(C)/C)^C. This value has C in the exponent because each of the C caches is mapped to circle independently at random.
• Recall that (1 + x/n)^n ≈ e^x when x does not grow as quickly as n. Using this, Pr[no cache lands in L] ≈ e^(−2·lg(C)), which is < 1/C².
• Therefore, taking a union bound over the candidate positions of the little interval L (they can be taken from a fixed grid of intervals of width 2·lg(C)/C, so there are at most C of them), Pr[B] ≤ |{L}| · (1/C²) ≪ 1 as C → ∞. ✷

Thus the hashing scheme has the following property:

Definition 1 (Monotonicity) If we add or remove a cache, the only web page URLs that need to migrate are those that are assigned to the new cache or were assigned to the deleted cache. No other web pages are reassigned, so changes in the number of caches cause minimal disruption.

Problem with Consistent Hashing: This method still does not solve the congestion problem; the New York Times website is still mapped to only one cache.

4 Random Abstract Trees

Each URL needs to be stored in multiple caches to prevent swamping of hot spots. Additionally, the lookup of a cache given a URL must be robust, because caches may go down at times. An extension of the original consistent hashing algorithm, described below, addresses these issues.

• Represent each URL by (d^(k+1) − 1)/(d − 1) ≈ d^k pseudo-URLs, for some small d and k. These pseudo-URLs refer to the nodes of a tree with height k and branching degree d.
• Each tree node is assigned to a cache using a consistent hash function. The cache at the root node of the tree keeps a copy of the page pointed to by the original URL.
• Upon looking up a URL, the user is directed to one of the leaves of the tree at random. If the cache referred to by that leaf does not contain a copy of the page, the request is redirected up the path until a node is reached whose cache holds a copy of the page.
• There is a threshold q such that if some tree node is accessed at least q times in the above process, then the corresponding cache acquires the URL's content and caches it.

Figure 3: Assignment of a page URL to a tree, before any copies of the page have been stored at the caches referred to by internal and leaf nodes. Each node of the tree is assigned to a cache using a consistent hash function.

Notice that if a page is popular, copies of the page will quickly migrate from the root cache to the caches at all the internal nodes and leaves. We can see that the latency of this algorithm is O(log_d(C)), and hot spots are avoided.
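To make the tree scheme concrete, here is a toy Python sketch. It reuses the ConsistentHashRing class from the earlier sketch, and it makes several illustrative assumptions that are not in the notes: pseudo-URLs are formed as "url#node-index", tree nodes are numbered heap-style, and the class and method names (TreeOfCaches, publish, request) as well as the parameters d, k, and q are invented for the example.

    import random
    from collections import defaultdict

    class TreeOfCaches:
        """Toy sketch of the tree extension: each URL gets ~d^k pseudo-URLs,
        one per node of a d-ary tree of height k; each node is mapped to a
        cache with the ConsistentHashRing from the previous sketch."""

        def __init__(self, ring, d=2, k=3, q=5):
            self.ring = ring                 # ConsistentHashRing instance
            self.d, self.k, self.q = d, k, q
            self.stored = defaultdict(set)   # cache name -> set of URLs it holds
            self.hits = defaultdict(int)     # (url, node) -> access count

        def _cache_of(self, url, node):
            # the pseudo-URL "url#node" is what actually gets hashed to the circle
            return self.ring.lookup(f"{url}#{node}")

        def publish(self, url):
            # the root node's cache always keeps a copy of the page
            self.stored[self._cache_of(url, 0)].add(url)

        def request(self, url):
            """Serve one request: start at a random leaf, walk up toward the root."""
            # nodes are numbered heap-style: node i has parent (i - 1) // d
            first_leaf = (self.d ** self.k - 1) // (self.d - 1)
            n_nodes = (self.d ** (self.k + 1) - 1) // (self.d - 1)
            node = random.randrange(first_leaf, n_nodes)

            path = []
            while True:
                cache = self._cache_of(url, node)
                path.append((node, cache))
                if url in self.stored[cache] or node == 0:
                    break
                node = (node - 1) // self.d   # redirect up to the parent

            # nodes visited at least q times acquire a copy at their cache
            for n, c in path:
                self.hits[(url, n)] += 1
                if self.hits[(url, n)] >= self.q:
                    self.stored[c].add(url)
            return path[-1][1]   # cache that actually served the page

A short hypothetical usage run, showing copies of a popular page spreading from the root cache down the tree as requests accumulate:

    ring = ConsistentHashRing()
    for c in ["cache-A", "cache-B", "cache-C", "cache-D"]:
        ring.add_cache(c)

    tree = TreeOfCaches(ring, d=2, k=3, q=5)
    tree.publish("http://example.com/hot-page")
    for _ in range(200):
        tree.request("http://example.com/hot-page")
    print(sum("http://example.com/hot-page" in s for s in tree.stored.values()),
          "caches now hold a copy")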

