UVA CS 662 - Distributed Operator Placement and Data Caching in Large-Scale Sensor Networks - D421977

Home> Schools> University Of Virginia> Computer Science (CS) > CS 662> Distributed Operator Placement and Data Caching in Large-Scale Sensor Networks

UVA CS 662 - Distributed Operator Placement and Data Caching in Large-Scale Sensor Networks

School name University Of Virginia

Course Cs 662- Database Systems

Pages 9

Download Save

Unformatted text preview:

Distributed Operator Placement and Data Cachingin Large-Scale Sensor NetworksLei Ying∗,ZhenLiu†,DonTowsley‡,CathyH.Xia†∗Department of Electrical and Computer Engineering, UIUCEmail: [email protected]†IBM T.J. Watson Research Center, Hawthorne, NY 10532Email: {zhenl,cathyx}@us.ibm.com‡Department of Computer Science, UMASS-AmherstEmail: [email protected]—Recent advances in computer technology and wire-less communications have enabled the emergence of stream-basedsensor networks. In such sensor networks, real-time data aregenerated by a large number of distributed sources. Queries aremade that may require sophisticated processing and filtering ofthe data. A query is represented by a query graph. In orderto reduce the data transmission and to better utilize resources,it is desirable to place operators of the query graph inside thenetwork, and thus to perform in-network processing. Moreover,given that various queries occur with different frequencies andthat only a subset of sensor data may actually be queried, cachingintermediate data objects inside the network can help improvequery efficiency. In this paper, we consider the problem of placingboth operators and intermediate data objects inside the networkfor a set of queries so as to minimize the total cost of storage,computation, and data transmission. We propose distributedalgorithms that achieve optimal solutions for tree-structuredquery graph topologies and general network topologies. Thealgorithms converge in Lmax(HQ+1) iterations, where Lmaxis the order of the diameter of the sensor network, and HQrepresents the depth of the query graph, defined as the maximumnumber of operations needed for a raw data to become a finaldata. For a regular grid network and complete binary tree querygraph, the complexity is O(√N log2M ),whereN is the numberof nodes in the sensor network and M is the number of dataobjects in a query graph. The most attractive features of thesealgorithms are that they require only information exchangesbetween neighbors, can be executed asynchronously, are adaptiveto cost change and topology change, and are resilient to node orlink failures.I. INTRODUCTIONRecent advances in computer technology and wireless com-munications have enabled the emergence of stream-basedsensor networks. Broad applications include network trafficmonitoring, real-time financial data analysis, environmentalsensing, large-scale reconnaissance, surveillance, etc. In theseapplications, real-time data are generated by a large number ofdistributed sources, such as sensors, and must be processed,filtered, interpreted or aggregated in order to provide usefulservices to users. The sensors, together with many othershared resources, such as Internet hosts and edge servers,are organized into a network that collectively provides arich set of query processing services. Resource-efficient datamanagement is a key challenge in such stream-based sensornetworks.Two different approaches are commonly used for managingand processing data in such networks. The first approachinvolves setting up a centralized data warehouse (e.g. fusioncenter) to which sensors push potentially useful sensor datafor storage and processing [1]. Users then make a variety ofsophisticated queries on the data s tored at this central database.Clearly, this approach does not efficiently use resources be-cause all data must be transmitted to the central warehousewhether or not they are of interest. Moreover, this may requirea large investment in processing resources at the warehousewhile at the same time processing resources within the networkmay go idle. The second approach involves pushing queries allthe way to the remote sensors, see e.g. [2]. Such querying ofremote sensor nodes is generally more bandwidth efficient.However, it is greatly limited by the low capability, lowavailability, and low reliability of the edge devices such assensors.A third approach that pushes query processing into thenetwork is thus desirable in order to reduce data transmissionand better utilize available shared resources in the network.Furthermore, given the fact that various queries are generatedat different rates and only a subset of sensor data may beactually queried, caching some intermediate data objects insidethe network may help increase query efficiency. In this paperwe address the question of where to place specific queryoperators and when and where to store intermediate dataobjects. Intuitively, one would like to place the operatorsas close as possible to the edge devices (sensors) so as toreduce transmission costs. However, devices close to the edgeare likely to have limited processing and storage capabilities.Thus they may not be capable of handling sophisticatedqueries. Efficient placement algorithms are therefore neededto achieve the minimum overall cost in computation, storageand communication.In-network query processing has received increasing atten-tion in past several years. Previous work on in-network queryprocessing has mostly focused on the operator placementproblem [3], [4], [5], [6], [7], [8], [9]. Most studies assumethat queries are generated at the same rate as the data andare applied to all data. They do not exploit the fact that somequeries may be generated at such low rates that they need onlyThis full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE INFOCOM 2008 proceedings.978-1-4244-2026-1/08/$25.00 © 2008 IEEE 1651be applied to a fraction of the data generated.In this paper, besides considering the placement of variousquerying operators, we also account for the randomness of thequerying process. We assume that different queries occur atdifferent frequencies and that only a subset of sensor datamay actually be queried. Each query is associated with agiven query graph, consisting of operators and data objects,that describes how a response to the query is obtained logi-cally. We explore the advantages of caching intermediate dataobjects inside the network relative to conventional push andpull mechanisms. We integrate the operator placement andcaching problem together by considering node assignment forboth operators and intermediate data objects that yields theminimum total communication, computation, and storage cost.We consider large-scale sensor networks whose conditionscan change dynamically. Thus centralized solutions becomeexpensive as they require global knowledge of the

View Full Document