Contents: Slide 1 (title); P2P In General; P2P Apps; Slide 4 (CoDNS title); What is DNS?; Environment and Workload; Observed Performance; Traditional DNS Failures; What is not working?; Time Spent on DNS Lookups; Suspected Failure Classification; CoDNS Ideas; CoDNS Counter-thoughts; CoDNS Implementation; Peer Management & Communication; Results; Results: One Day of Traffic; Observations; Overhead; Questions; Slide 21 (PAST title); PAST Introduction; Pastry Review; Pastry Review, continued; PAST – Insert; PAST – Lookup; PAST – Reclaim; Is this good enough?; The Problem; The Solution: Storage Management; Replica Diversion (five slides); File Diversion; Replica Management (two slides); Caching; Security; Evaluation; Evaluation (1); Evaluation (2); Evaluation (3); Discussion; Slide 46 (UsenetDHT title); Background: Usenet; UsenetDHT; Discussion

Slide 1: P2P Apps
Presented by Kevin Larson & Will Dietz

Slide 2: P2P In General
- Distributed systems in which workloads are partitioned between peers
- Peer: an equally privileged member of the system
- In contrast to client-server models, peers both provide and consume resources
- Classic examples: Napster, Gnutella

Slide 3: P2P Apps
- CoDNS: distribute DNS load to other clients in order to greatly reduce latency in the case of local failures
- PAST: distribute files and replicas across many peers, using diversion and hashing to increase utilization and insertion success
- UsenetDHT: use peers to distribute the storage and costs of the Usenet service

Slide 4: CoDNS
KyoungSoo Park, Zhe Wang, Vivek Pai, Larry Peterson (Princeton). OSDI 2004. Presented by Kevin Larson.

Slide 5: What is DNS?
- Domain Name System: a remote server plus a local resolver
- Translates hostnames into IP addresses, e.g. www.illinois.edu -> 128.174.4.87
- Ubiquitous and long-standing: the average user is not aware of its existence
[Figure: desired performance, as observed on PlanetLab nodes at Rice and the University of Utah]

Slide 6: Environment and Workload
- PlanetLab: an Internet-scale test-bed; very large scale; geographically distributed
- CoDeeN: a latency-sensitive content delivery network (CDN)
  - Uses a network of caching Web proxy servers
  - Complex distribution of node accesses plus external accesses
  - Built on top of PlanetLab
  - Widely used (4 million plus accesses/day)

Slide 7: Observed Performance
[Figure: observed DNS lookup behavior at Cornell, the University of Oregon, the University of Michigan, and the University of Tennessee]

Slide 8: Traditional DNS Failures
- Comcast DNS failure, Cyber Monday 2010
- A complete failure, not just high latency
- Caused by massive overloading

Slide 9: What is not working?
- DNS lookups have high reliability but come with no latency guarantees
  - The reliability comes from redundancy, which drives up latency
  - Failures significantly skew average lookup times
- A failure is defined as either:
  - 5+ seconds of latency (the point at which the system contacts a secondary local nameserver), or
  - no answer at all

Slide 10: Time Spent on DNS Lookups
- Three classes of lookup times: low (<10ms), regular (10ms to 100ms), and high (>100ms)
- High-latency lookups account for only 0.5% to 12.9% of accesses,
- yet 71% to 99.2% of total lookup time is spent on them
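The definitions on the last two slides are concrete enough to state in code. A minimal Python sketch, assuming per-lookup latencies measured in milliseconds; the function and constant names are illustrative, not from the paper:

    # Illustrative: the lookup-time classes from "Time Spent on DNS
    # Lookups" and the failure definition from "What is not working?".
    FAILURE_MS = 5_000   # 5+ seconds: the resolver gives up and tries a
                         # secondary local nameserver

    def classify_lookup(latency_ms, answered=True):
        """Return 'failure', 'low', 'regular', or 'high' for one lookup."""
        if not answered or latency_ms >= FAILURE_MS:
            return "failure"      # no answer, or 5+ second latency
        if latency_ms < 10:
            return "low"          # <10ms
        if latency_ms <= 100:
            return "regular"      # 10ms to 100ms
        return "high"             # >100ms

    def share_of_time_in_high(latencies_ms):
        """Fraction of total lookup time spent on high-latency lookups.
        The slide reports 71%-99.2% here, even though high-latency
        lookups are only 0.5%-12.9% of accesses."""
        total = sum(latencies_ms)
        high = sum(t for t in latencies_ms if classify_lookup(t) == "high")
        return high / total if total else 0.0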
Slide 11: Suspected Failure Classification
[Figure: failure timelines at Cornell, the University of Oregon, the University of Michigan, and the University of Tennessee]
- Long-lasting, continuous failures: result from nameserver failures and/or extended overloading
- Short, sporadic failures: result from temporary overloading
- Periodic failures: caused by cron jobs and other scheduled tasks

Slide 12: CoDNS Ideas
- Attempt to resolve locally, then request data from peers if the local lookup is too slow
- Distributed DNS cache: a peer may already have the hostname in its cache
- Design questions: How important is locality? How soon should you attempt to contact a peer? How many peers should you contact?

Slide 13: CoDNS Counter-thoughts
- This seems unnecessarily complex: why not just go to another local or root nameserver?
  - Many failures are overload-related; contacting nameservers more aggressively would only aggravate the problem
- Is this worth the increased load on peers' DNS servers and the bandwidth spent on duplicated requests?
  - Failure times were not correlated between peers, so the negative effect is likely minimal

Slide 14: CoDNS Implementation
- Stand-alone daemon on each node
- Master and slave processes handle resolution; the master reissues a request to peers if the slaves are too slow, doubling the delay after the first retry (sketched after the next slide)
- How soon before you contact peers? It depends:
  - Good local performance: increase the reissue delay, up to 200ms
  - Frequently relying on remote lookups: reduce the reissue delay, to as low as 0ms

Slide 15: Peer Management & Communication
- Peers maintain a set of neighbors
  - Built by contacting a list of all peers
  - Periodic heartbeats determine liveness
  - Dead nodes are replaced via additional scanning of the node list
- Peer selection uses Highest Random Weight (HRW) hashing (sketched below)
  - Generates an ordered list of nodes for a given hostname, sorted by a hash of the hostname and each peer's address
  - Provides request locality
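A minimal, simulated sketch of the reissue logic from the "CoDNS Implementation" slide above. Only the doubling rule and the 0-200ms delay bounds come from the slide; resolve_local and resolve_via_peer are invented stand-ins (the real daemon talks to the local resolver and to peer CoDNS daemons), and the simulated latencies are arbitrary:

    import asyncio, random

    # The slide's bounds for the adaptive initial delay; the adaptation
    # between them, driven by recent local performance, is not shown here.
    MIN_DELAY_S, MAX_DELAY_S = 0.000, 0.200

    async def resolve_local(hostname):
        """Stand-in for the local resolver; latency is simulated."""
        await asyncio.sleep(random.uniform(0.005, 0.6))
        return ("192.0.2.1", "local")            # illustrative address

    async def resolve_via_peer(peer, hostname):
        """Stand-in for asking a CoDNS peer to resolve on our behalf."""
        await asyncio.sleep(random.uniform(0.020, 0.100))
        return ("192.0.2.2", f"remote:{peer}")

    async def codns_lookup(hostname, peers, delay=0.050):
        """Resolve locally; if slower than `delay`, start asking peers,
        doubling the wait after each reissue (per the slide)."""
        tasks = {asyncio.ensure_future(resolve_local(hostname))}
        wait, next_peer = delay, 0
        while True:
            done, pending = await asyncio.wait(
                tasks, timeout=wait, return_when=asyncio.FIRST_COMPLETED)
            if done:
                for t in pending:
                    t.cancel()                   # first answer wins
                return next(iter(done)).result()
            if next_peer < len(peers):           # reissue to the next peer
                tasks.add(asyncio.ensure_future(
                    resolve_via_peer(peers[next_peer], hostname)))
                next_peer += 1
            wait = max(2 * wait, 0.010)          # double the delay per retry

    print(asyncio.run(codns_lookup("www.illinois.edu", ["peerA", "peerB"])))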
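The HRW scheme from the "Peer Management & Communication" slide is compact enough to show whole. A minimal sketch: hashing the hostname together with each peer's address is what the slide describes, but the concrete hash function (SHA-1) and input encoding are illustrative choices:

    import hashlib

    def hrw_order(hostname, peers):
        """Order peers by a hash of (hostname, peer address), highest
        weight first. Every node computes the same order for a given
        hostname, so lookups for one name converge on the same few
        peers -- the request locality the slide mentions."""
        def weight(peer):
            digest = hashlib.sha1(f"{hostname}|{peer}".encode()).digest()
            return int.from_bytes(digest, "big")
        return sorted(peers, key=weight, reverse=True)

    neighbors = ["128.112.139.71", "128.174.236.10", "155.98.38.70"]
    print(hrw_order("www.illinois.edu", neighbors))  # same order on every node

Because each hostname's ranking is independent, a dead neighbor only reassigns the names for which it ranked first; every other name keeps its preferred peer, so locality survives churn in the neighbor set.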
Slide 16: Results
- Overall, average response times improved 16% to 75%
  - Internal lookups: 37ms down to 7ms
  - Real traffic: 237ms down to 84ms
- At Cornell, the worst-performing node, average response times were massively reduced:
  - Internal lookups: 554ms down to 21ms
  - Real traffic: 1095ms down to 79ms

Slide 17: Results: One Day of Traffic
[Figure: response times over one day of traffic, local DNS vs. CoDNS]

Slide 18: Observations
- Three observed cases where CoDNS doesn't provide benefit:
  - The name does not exist
  - Initialization problems result in a bad neighbor set
  - The network prevents CoDNS from contacting peers
- CoDNS uses peers for 18.9% of lookups
- 34.6% of remote queries return faster than the local lookup

Slide 19: Overhead
- Extra DNS lookups: controllable via the variable initial delay time
  - A naive fixed 500ms delay adds about 10% overhead
  - The dynamic delay adds only 18.9%
- Extra network traffic:
  - Remote queries and heartbeats account for only about 520MB/day across all nodes
  - Only about 0.3% overhead

Slide 20: Questions
- The CoDeeN workload has a very diverse lookup set; would you expect different behavior from a less diverse set of lookups?
- CoDNS proved to work remarkably well in the PlanetLab environment; where else could the architecture prove useful?
- The authors took a black-box approach to observing and working with the DNS servers; could a more integrated method further improve the observations or results?
- A surprising number of failures result from cron jobs; should this have been a task for policy, or for policy enforcement?

Slide 21: "Storage Management and Caching in PAST, a Large-Scale Persistent Peer-to-Peer Storage Utility"
Antony Rowstron and Peter Druschel. SOSP 2001.