COS 521: Advanced Algorithm Design
Internet Content Distribution
12/12/05
Scribe: Alina Ene
Professor: Sanjeev Arora

1 Internet Content Distribution

Problem:
• Websites get swamped. (Important news event → the New York Times homepage is requested millions of times.)
• Websites have dynamic content.

Of course, the first problem would be easy to solve in isolation: just make many copies of the webpage and cache them all over the Internet. But given the second problem, this scheme doesn't work by itself. We need a scheme to efficiently keep track of, and update, all these copies dynamically.

Desiderata for a content delivery service:
• transparent to browsers
• dynamic (updates should be automatic)
• reduce traffic on the Internet backbone
• lower latency for browsers
• avoid "hot spots" (overloaded servers)
• no centralized control
• robust to errors/crashes
• scale gracefully

Let's see how to use hashing to solve these problems. Let URL denote the set of all possible webpage URLs, and at any point in time let P ⊆ URL denote the set of all webpages that have to be served around the world. Then P is a huge set, and it changes dynamically. Let C denote a set of caches (run by a company such as Akamai) that will store and serve this content.

Try 1: Map each URL to a single cache via a hash function h : URL → C.

Problems:
• How do we add/drop caches? (It changes the range of the hash function!)
• Which caches are visible? (Different browsers have different views of which caches are accessible at any time.)

Figure 1: Caches and content are hashed to points on a circle. Each piece of content is stored in the cache that is hashed closest to it.

1.1 Consistent Hashing

Akamai's solution was based upon their consistent hashing algorithm. Let circle denote the unit circle in R^2. We define two hash functions:

    h_A : URL → circle
    h_B : {1, ..., C} → circle

Here C is an upper bound on the number of caches at any time. Each webpage p is assigned to the cache whose hash point is closest to h_A(p) in the clockwise direction (see Figure 1).

The domain nameserver computes the two hash values and asks that cache for the page. If the requested page is not in the cache, the cache is responsible for looking up the page and serving it.

Analysis: Assume for now that both hash functions are random functions. The first claim is that the load on the caches is approximately well-balanced.

Claim 1.1 With high probability, the amount of circle assigned to each cache is O((log C)/C).

Proof: Consider any particular cache, and measure arc length as a fraction of the circumference. Then

    Pr[no other cache is mapped within distance s of our cache] = (1 − s)^{C−1} ≈ e^{−s·C}.

By the union bound,

    Pr[some cache is assigned an arc of length ≥ s] ≤ C · e^{−s·C}.

Setting the right-hand side to δ and solving, log C − s·C = log δ, i.e. s = (log C − log δ)/C = O((log C)/C) for δ = 1/poly(C).

Monotonicity: If we add a new cache, the only pages that need to migrate among caches are the ones that get assigned to the new cache. No other pages get reassigned, so adding a cache causes minimal disruption. Similarly for dropping a cache.

In practice one cannot use a truly random hash function, of course. Instead a k-wise independent hash function suffices, where k = O(log |URL|). This is left as an exercise.
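To make the assignment rule concrete, here is a minimal sketch in Python. It is not Akamai's implementation: SHA-1 is used as a stand-in for the random (or k-wise independent) hash functions h_A and h_B, the circle is represented as the half-open interval [0, 1), and the ConsistentHashRing class, the cache names, and the URLs are all made up for illustration. The last few lines demonstrate the monotonicity property: adding an 11th cache reassigns only roughly a 1/11 fraction of the pages.

    # A minimal sketch of consistent hashing (illustrative, not Akamai's code).
    import bisect
    import hashlib

    def point(key: str) -> float:
        """Hash a string to a point on the circle, represented as [0, 1)."""
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    class ConsistentHashRing:
        def __init__(self, caches=()):
            self._points = []      # sorted hash points of the caches
            self._cache_at = {}    # hash point -> cache name
            for c in caches:
                self.add_cache(c)

        def add_cache(self, cache: str) -> None:
            p = point("cache:" + cache)
            bisect.insort(self._points, p)
            self._cache_at[p] = cache

        def remove_cache(self, cache: str) -> None:
            p = point("cache:" + cache)
            self._points.remove(p)
            del self._cache_at[p]

        def cache_for(self, url: str) -> str:
            """Assign the URL to the first cache clockwise from h_A(url)."""
            p = point("url:" + url)
            i = bisect.bisect_right(self._points, p) % len(self._points)
            return self._cache_at[self._points[i]]

    # Monotonicity: adding an 11th cache moves only the pages now assigned to it,
    # roughly a 1/11 fraction of all pages.
    ring = ConsistentHashRing(["cache%d" % i for i in range(10)])
    urls = ["http://example.org/page%d" % i for i in range(1000)]
    before = {u: ring.cache_for(u) for u in urls}
    ring.add_cache("cache10")
    moved = sum(1 for u in urls if ring.cache_for(u) != before[u])
    print("pages moved after adding one cache:", moved)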
Figure 2: Tree assigned to a page. Each node of the tree is assigned to a cache using a consistent hash function.

1.2 Random Abstract Trees

The above scheme still doesn't solve the hot-spot problem: the New York Times webpage is mapped to a single cache, and that cache will get overloaded with requests during an important world event. Thus each URL needs to be stored in multiple caches. Another issue is that caches themselves may go down at times, so the lookup needs to be robust. The following modification of the above idea handles both issues.

Abstractly speaking, associate a d-regular tree with each page, with number of nodes = C. Each node of the tree is assigned to a cache using a consistent hash function (see Figure 2). When we look up a page, we follow a random path up the tree until we find a cache that has it. (Note that if a page is popular, it will quickly migrate from the root to all the leaves.) A code sketch of this lookup appears after Section 1.3.

Note: latency = O(log_d C), the depth of the tree.

Advantage: avoids hot spots.

Robustness:
Assumption: the adversary does not get to see the random coins used by the algorithm. The adversary can make a ρ fraction of the machines go down (possibly a different set for different users). Since the adversary cannot see our coins, each of the log_d C nodes on the random path lands on a down cache with probability ρ, so

    Pr[the entire path survives] ≈ (1 − ρ)^{log_d C}.

1.3 Practical Details

Akamai caches mainly pictures and other embedded files. There are two reasons for this:
1. these files are usually large and are responsible for most of the latency in retrieving a web page;
2. by caching only pictures, the content provider is still able to monitor the traffic at its site.

Akamai uses distributed caches that employ consistent hashing, along with a nameserver hack that allows files to be retrieved from a nearby cache. The hashing is done in two steps:
1. the content provider hashes each URL to a serial number; when the document is requested, the local nameserver first asks the Akamai high-level nameserver for the IP address of a low-level nameserver;
2. the low-level nameserver evaluates the consistent hash function, for its current view of the caches, at the given serial number and returns the IP addresses of the caches that the document is mapped to.

Akamai provides the content provider with a program that maps URLs to serial numbers.
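The random-tree lookup of Section 1.2 can be sketched on top of the same circle hash. In the sketch below, purely as an illustrative assumption, the tree is binary (d = 2) and heap-indexed 1, ..., C so that node i's parent is node i // 2; each (page, tree-node) pair is hashed to a cache, so every page gets its own abstract tree over the caches. The CacheCluster class, the origin-server stub, and the constants are assumptions of this sketch, not part of the notes.

    # A sketch of the random-tree lookup (Section 1.2); not a real system.
    import hashlib
    import random

    C = 31   # number of abstract tree nodes (playing the role of the cache bound C)
    D = 2    # degree of the abstract tree

    def point(key: str) -> float:
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    class CacheCluster:
        def __init__(self, cache_names):
            self.ring = sorted((point("cache:" + c), c) for c in cache_names)
            self.store = {c: {} for c in cache_names}   # contents of each cache

        def cache_for(self, key: str) -> str:
            """First cache clockwise from the hash point of `key`."""
            p = point(key)
            for q, c in self.ring:
                if q >= p:
                    return c
            return self.ring[0][1]   # wrap around the circle

        def lookup(self, url: str) -> str:
            # Pick a random leaf (heap indices C//D + 1 .. C) and walk up to the root.
            node = random.randint(C // D + 1, C)
            path = []
            while node >= 1:
                path.append(node)
                node //= D
            # Query the caches on the path, leaf first; the root's cache is the
            # last resort and fetches from the origin server on a miss.
            for node in path:
                cache = self.cache_for("%s#node%d" % (url, node))
                if url in self.store[cache]:
                    body = self.store[cache][url]
                    break
            else:
                body = "<%s fetched from origin server>" % url
            # Copy the page onto the path, so popular pages migrate toward the leaves.
            for node in path:
                cache = self.cache_for("%s#node%d" % (url, node))
                self.store[cache][url] = body
            return body

    cluster = CacheCluster(["cache%d" % i for i in range(10)])
    cluster.lookup("http://nytimes.com/")   # first request reaches the origin server
    cluster.lookup("http://nytimes.com/")   # later requests are served from the caches

Finally, a rough sketch of the two-step mapping described in Section 1.3. The number of serial numbers, the view of live caches, and the IP addresses are hypothetical, and the DNS machinery (high-level vs. low-level nameservers, locating a nearby cache) is not modeled here.

    # A sketch of the two-step hashing of Section 1.3 (illustrative values only).
    import hashlib

    NUM_SERIALS = 1000   # assumed number of serial numbers

    def point(key: str) -> float:
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def url_to_serial(url: str) -> int:
        """Step 1: the content provider maps each URL to a serial number."""
        return int(hashlib.sha1(url.encode()).hexdigest(), 16) % NUM_SERIALS

    def serial_to_cache(serial: int, view: dict) -> str:
        """Step 2: a low-level nameserver consistently hashes the serial number
        over its current view (a map from live cache names to IP addresses)."""
        ring = sorted((point("cache:" + name), name) for name in view)
        p = point("serial:%d" % serial)
        for q, name in ring:
            if q >= p:
                return view[name]
        return view[ring[0][1]]

    view = {"cache%d" % i: "10.0.0.%d" % i for i in range(1, 11)}   # hypothetical view
    print(serial_to_cache(url_to_serial("http://example.org/logo.gif"), view))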

