MTU CS 6461 - The Price of Validity in Dynamic Networks - D1072979

Home> Schools> Michigan Tech> Computer Science (CS) > CS 6461> The Price of Validity in Dynamic Networks

DOC PREVIEW

MTU CS 6461 - The Price of Validity in Dynamic Networks

School name Michigan Tech

Course Cs 6461- Advanced Computer Networks f 3

Pages 12

This preview shows page 1-2-3-4 out of 12 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

The Price of Validity in Dynamic NetworksMayank Bawa, Aristides Gionis, Hector Garcia-Molina, Rajeev MotwaniComputer Science DepartmentStanford UniversityStanford, CA 94305{bawa,gionis,hector,rajeev}@db.stanford.eduABSTRACTMassive-scale self-administered networks like Peer-to-Peerand Sensor Networks have data distributed across thou-sands of participant hosts. These networks are highly dy-namic with short-lived hosts being the norm rather thanan exception. In recent years, researchers have investigatedbest-effort algorithms to efficiently process aggregate queries(e.g., sum, count, average, minimum and maximum) [6, 13,21, 34, 35, 37] on these networks. Unfortunately, query se-mantics for best-effort algorithms are ill-defined, making ithard to reason about guarantees associated with the resultreturned. In this paper, we specify a correctness condition,single-site validity, with respect to which the above algo-rithms are best-effort. We present a class of algorithms thatguarantee validity in dynamic networks. Experiments onreal-life and synthetic network topologies validate perfor-mance of our algorithms, revealing the hitherto unknownprice of validity.1. INTRODUCTIONConsider an aggregate query that has to be computedover data hosted in a network of n hosts. A simple ap-proach is to ship data from the n hosts and store it in acentral database where query processing can take place (thewarehousing approach [5]). The database can take steps(e.g., concurrency control) to ensure valid semantics for thequery: the aggregate reflects data for some snapshot of thenetwork. Although simple, such data shipping from all hostsin the network incurs a high communication cost both in thenetwork and at the central database host.The alternative approach is to leverage the computationalcapabilities of hosts in the network by shipping the queryand processing it using a distributed query plan (the in-network processing approach [5]). An efficient query planenables only relevant data to be shipped thereby reducingcommunication cost.The emergence of large-scale self-administered networksover the last decade has forced us to think about scenariosPermission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage, and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.SIGMOD 2004 June 13-18, 2004, Paris, France.Copyright 2004 ACM 1-58113-859-8/04/06 . .. $5.00.Q Q QABFigure 1: Ill-Defined Semantics (a) Sensor Networkwith 16 sensors, (b) Broadcast, and (c) Converge-cast. Failure of sensors A and B after Broadcastleads to counts of 15 and 6 respectively. Which ofthese is correct and why?where n is of the order of thousands or millions of short-lived hosts. Peer-to-Peer (P2P) and Sensor Networks areprime examples of such massively distributed data-centricnetworks. P2P Networks like Gnutella, KaZaa and Freenethave been used for file-sharing by millions of hosts thathave short lifetimes.1A sensor network [11, 17, 29] con-sists of thousands of sensors that monitor their environmentin real-time, communicate over a wireless network and havelifetimes dictated by their internal battery unit.In-network processing has been the approach of choice forlarge-scale networks in several projects [6, 13, 21, 34, 35,37]. A distributed query plan however requires us to dealwith dynamism in the network, and the semantics of thefinal answer returned. In traditional distributed databasesystems, a host failure leads to an abort of the query. How-ever, in large-scale networks, we want query processing tocontinue even if a “few” hosts fail. Unfortunately, the al-gorithms proposed have been best-effort and the semanticsof the aggregate thus computed are rather ill-defined in thepresence of host failures.EXAMPLE 1.1. Consider the Sensor Network shown inFigure 1-a on which a user wishes to count the number ofactive sensors. Proposed algorithms [21, 35, 37] processsuch aggregate queries in two phases. In the first (Broad-cast) phase, the query floods the network and a spanningtree is built on the sensors (Figure 1-b). In the second (Con-vergecast) phase, sub-tree counts are propagated up from theleaves to the root (Figure 1-c). Each interior host adds to-gether its sub-tree counts with a 1 for itself, and propagatesthe result to its parent.A failure-free execution of Convergecast returns count=16as the answer. However, even if there is a single failure in1The median session duration for a Gnutella host was mea-sured as 60 minutes in 2001 [32].the network, the answer returned could vary from count=16to count=6 depending on which sensor failed and when. Theuser is thus unable to associate a meaning to the count valuereturned: What is the correct answer and how does it relateto the value returned?In large-scale self-administered networks, host failures arethe norm rather than being an exception, and it is necessaryto give a meaning to possible interleaving of host failureswith query processing. We propose defining query semanticsfor dynamic networks, and by extension revisiting the queryprocessing algorithm itself. In this paper we present moti-vation, methodology and performance results on designingalgorithms that ensure valid semantics.EXAMPLE 1.2. What is the meaning of the answer to aquery? The desire to associate semantics with a query resultis certainly not new and this question has been answered ina variety of domains by researchers.For instance, Hellerstein et. al. [14] suggest performingaggregation online to allow users to observe the progress oftheir queries. A running aggregate is displayed to the useras an estimate of the final result based on the tuples retrievedso far. By itself, the running aggregate is meaningless to theuser. The semantics of the running aggregate are fixed by as-sociating properties of “Confidence” and “Interval Bounds”.The twin properties together give a probabilistic estimate ofthe running aggregate’s proximity to the final result.Our Contribution (Validity Semantics): In this pa-per, we develop a simple and intuitively appealing correct-ness condition for queries on dynamic networks. We saythat a query result is single-site valid if it is “equivalent”, ina sense formally defined in Section 4, to a legal

View Full Document