Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
Antony Rowstron and Peter Druschel
SOSP '01
Presented by: Virajith Jalaparti and Giang Nguyen
March 4th, 2010

PAST
• Peer-to-peer storage utility
• (Tries to) guarantee, with high probability:
▫ Strong persistence: exploits diversity of nodes
▫ High availability: replicates data
▫ Scalability: state logarithmic in the number of nodes
▫ Security: PKI

PAST
• Routing and content location
▫ PASTRY (is this required?)
• Not a general-purpose file system
▫ Cannot search for files (an interesting problem!)
▫ No directories
▫ Need the file_id (distributed offline)
▫ Files are immutable

PASTRY (review)
• Incremental prefix matching

PAST Operations
(Figure: user i, holding key pair (pk_i, sk_i), interacting with nodes in the PASTRY overlay across the Internet)
• File insert: 100100 = Insert(i, pk_i, 2, file.txt)
▫ file_id = hash(file_name, pk_i, salt)
▫ Storage debited; certificate signed by the user and sent along
▫ Routed to the nodes with the closest ids (e.g., 1000, 1010); ack and store receipts returned
• File lookup: Lookup(100100)
▫ Routed to the correct node, which replies with file.txt
• File reclaim: reclaim(100100, credentials)
▫ Not the same as delete
▫ Does not guarantee consistency

Storage Management
• Ensure storage utilization as the system approaches its maximum capacity
▫ Balance free space among nodes
▫ k copies of a file maintained at the nodes closest to its file_id
• Reasons imbalance arises:
▫ Statistical variation
▫ Varying file sizes
▫ Varying node storage capacities

An Assumption
• Storage capacity across nodes differs by no more than 2 orders of magnitude
• What about large clusters?
▫ Run multiple PASTRY nodes
▫ Should not significantly affect balance
• Is this correct/justified?

Replica Diversion
• What happens if a node's free space approaches 0%?
• Node A cannot hold a file locally when
▫ file_size > t * free_space_available (thresholds t_pri and t_div)
• A diverts the replica to a node B in its leaf set that
▫ Is not one of the k closest node-ids
▫ Does not already have a copy
• A stores a pointer to B

Replica Diversion
• If no such B exists:
▫ File diversion: create a new id for the file so it is stored in another part of the ring
• Handling failures
▫ If A fails: another node C (the one with the (k+1)-st closest id) points to B
▫ What if B fails?

Maintaining Replicas
• Invariant: k copies at all times
• PASTRY helps detect failures/additions
▫ A new node becomes one of the k closest nodes to a file
• Copy the file to the newly added node
▫ Optimization: the new node points to the old one, if it still exists
▫ The file is eventually copied to the new node
• At very high storage utilization
▫ <k copies of a file may be maintained

Caching
• Popular files are cached (>k copies)
▫ A node that participates in routing a file caches it
• Unused space at a node is used for caching
• Cached files are evicted if new files need to be stored
▫ GreedyDual-Size policy used: the file with the lowest (access_count / size) ratio is evicted
▫ Unfair for large files

PAST Security
• Public/private key pair
▫ Smart cards
▫ PKI
• Owner file/reclaim certificates
▫ Prevent masquerading
• Store/reclaim receipts
▫ Ensure k copies are stored/reclaimed
• Randomized routing in Pastry to prevent continuous interception
▫ How would this affect the guaranteed bounds?

Eval (1)
• Web proxy log trace: 1.86 million objects, totaling 18.7 GB
• With no diversions and using distribution d1: 51% of insertions fail; global utilization 60%

Eval (2)
(Figure: evaluation plots)

Eval (3)
(Figure: evaluation plots)

Discussion
• Can I always get back my file?
• Churn
• Storing of large files
▫ CFS is block-oriented; tradeoff?
• Why would anyone use P2P storage?
▫ Economics
▫ User quotas!
▫ OceanStore: uses Tapestry, money involved, hierarchical

UsenetDHT: A low-overhead design for Usenet
Emil Sit, Robert Morris, and M. Frans Kaashoek
Presented by Giang Nguyen
March 04, 2010

Background
• Usenet: a news system operating since '81
• Similar to online forums/bulletin boards
• Overview
o Newsgroups such as "class.sp10.cs525" or "comp.lang.python"
o A Usenet server stores articles for newsgroups of its choosing
o Paying users post/download articles to/from their "local" (e.g., ISP) Usenet server
o Usenet servers flood articles to interested peers
• Retention policy: how long a server retains articles

Background
(Figure: servers, newsgroups, and clients. Source: Benjamin D. Esham / Wikipedia)

Motivation
• Usenet still popular:
o Sept.
'07, usenetserver.com saw 40,000 concurrent clients, 20 Gb/s
• Growing content volume, largely binary articles
• Top servers receive 3 TB/day from peers
• Each server stores articles for its interested newsgroups --> lots of duplication

Chord ring review
(Figure. Source: I. Stoica et al., "Chord: a scalable peer-to-peer lookup service for Internet applications")

UsenetDHT
• Mutually trusting peer sites use a (DHash) DHT to store articles (content, but not metadata)
• "Front-ends" preserve the client- and external-peer-facing interface (the NNTP protocol)
o Translate clients' post/download into put/get requests to the DHT storage
o Key == SHA-1 of the article's content
o Cache article content to reduce load on the DHT storage
o Flood an article's metadata to peer front-ends
o Store all articles' metadata

UsenetDHT
(Figure: system overview)

UsenetDHT replication
• Goal: store replicas at the k (= 2) successor nodes
• Requires keeping track of the current number of replicas
• Challenges:
o No efficient data structure supports both insertion/lookup of millions of entries and synchronization among nodes
o High write rates --> local state quickly becomes stale
• Solution: the Passing Tone algorithm
o Local: each node ensures it has replicas of the articles for which it is responsible
o Global: it ensures articles for which it is no longer responsible are given to the new responsible node

Local maintenance
• Each node ensures it has replicas of the articles for which it is responsible
• Each node keeps a Merkle tree of the keys it holds
• Replicates articles from its immediate predecessor and successor
• Algorithm at node n:

    a, b = n.pred_k, n            # "a" is the k-th predecessor
    for partner in n.succ, n.pred:
        diffs = partner.synchronize(a, b)
        for o in diffs:
            data = partner.fetch(o)
            n.db.insert(o, data)

Local maintenance (k = 3)
(Figure: identifier-ring segment around node n, showing regions A and B)
• When n fails, s1 becomes responsible for storing replicas of the articles in region A and will obtain them (most likely) from p1
• When n joins/returns from failure, it obtains the articles in region B from s1

Global maintenance
• Each node ensures articles for which it is no longer responsible are given to the new responsible node
• Algorithm at node n:

    a, b = n.pred_k, n            # "a" is the k-th predecessor
    key = n.db.first_succ(b)      # first key after b
    while not between(a, b, key):
        s = n.find_successor(key)
        diffs = s.reverse_sync(key, s)
        for o in diffs:
            data = n.db.read(o)
            s.store(o, data)
        key = n.db.first_succ(s)  # advance past the keys handed to s
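The local-maintenance loop above can be sketched as runnable code. This is a minimal toy model, not the paper's implementation: a sorted list of nodes stands in for the Chord ring, per-node dictionaries stand in for the article databases, and a neighbour's database is scanned directly instead of being synchronized via Merkle trees. The names between, Node, and local_maintenance are illustrative.

```python
def between(a, b, key):
    """True if key lies in the half-open ring interval (a, b]."""
    if a < b:
        return a < key <= b
    return key > a or key <= b  # interval wraps past zero on the ring

class Node:
    def __init__(self, ident):
        self.id = ident
        self.db = {}  # article key -> article data held locally

def local_maintenance(nodes, i, k):
    """One round of local maintenance for nodes[i] (nodes sorted by id):
    pull any key in its replica range (pred_k, n] that an immediate
    neighbour holds but it does not."""
    n = nodes[i]
    pred_k = nodes[(i - k) % len(nodes)]    # k-th predecessor on the ring
    succ = nodes[(i + 1) % len(nodes)]
    pred = nodes[(i - 1) % len(nodes)]
    for partner in (succ, pred):            # synchronize with both neighbours
        for key, data in partner.db.items():
            if between(pred_k.id, n.id, key) and key not in n.db:
                n.db[key] = data            # fetch the missing replica

# Toy ring: four nodes, k = 2 replicas per article.
ring = [Node(10), Node(20), Node(30), Node(40)]
ring[1].db[15] = b"article"   # key 15 was inserted at its successor, node 20
for i in range(len(ring)):
    local_maintenance(ring, i, k=2)
holders = [n.id for n in ring if 15 in n.db]
print(holders)  # -> [20, 30]
```

After one pass, key 15 is held by its successor (node 20) and the next node (node 30), i.e., the k = 2 nodes responsible for it, while nodes outside the replica range decline to fetch it.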