Int. J. Business Intelligence and Data Mining, Vol. x, No. x, 200x
Copyright © 2008 Inderscience Enterprises Ltd.

Improving the accuracy of continuous aggregates and mining queries on data streams under load shedding

Yan-Nei Law
Bioinformatics Institute,
30 Biopolis Street, #07-01, Matrix, 138671, Singapore
Fax: +65 6478 9048
E-mail: [email protected]

Carlo Zaniolo*
UCLA Computer Science Department,
4732 Boelter Hall, Los Angeles, CA 90095, USA
Fax: +1 310 825 2273
E-mail: [email protected]
*Corresponding author

Abstract: Random samples are common in data stream applications, owing to limitations in data sources and transmission lines, or to load-shedding policies. Here we introduce a formal error model and show that, besides providing accurate estimates, it improves query answer accuracy by exploiting past statistics. The method is general, robust in the presence of concept drift, and minimises uncertainties due to sampling with negligible time and space overhead. We describe the application of the method, and the results obtained, for SQL window aggregates, statistical aggregates such as quantiles, and data mining functions such as k-means clustering and naive Bayesian classifiers.

Keywords: load shedding; data stream; query processing; sampling.

Reference to this paper should be made as follows: Law, Y.N. and Zaniolo, C. (2008) 'Improving the accuracy of continuous aggregates and mining queries on data streams under load shedding', Int. J. Business Intelligence and Data Mining, Vol. x, No. x, pp.xxx–xxx.

Biographical notes: Yan-Nei Law received her PhD in Computer Science from the University of California, Los Angeles. She is a postdoctoral research fellow at the Bioinformatics Institute, Singapore. Her research interests include issues related to bioimaging, with emphasis on developing image processing and data mining techniques for high-throughput images.
Carlo Zaniolo is a Professor of Computer Science at the University of California, Los Angeles, where he occupies the N.E. Friedmann Chair in Knowledge Science. His current research interests include mining databases and data streams, data stream management systems, databases and decision support systems, archival information systems, temporal databases, schema evolution, databases and knowledge bases, and nonmonotonic reasoning.

1 Introduction

Many data stream applications require complex queries involving aggregates, analysis, and mining functions (Babcock et al., 2002). Examples include financial analysis (Chen et al., 2000), network monitoring (Sullivan, 1996), sensor networks (Chandrasekaran et al., 2003), web logs and click-streams, telecommunication data management (Cortes et al., 2000), and many others. These applications are often characterised by high data rates and a demand for real-time (or near real-time) responses.

Traditional database management systems were designed to manage persistent databases and transient queries, and they are not suitable for stream applications, which instead require support for persistent queries and transient data. Data Stream Management Systems (DSMS) are therefore being developed to satisfy the unique requirements and technical challenges posed by such applications. Foremost among these are the problems created by the bursty nature and unpredictably high arrival rates of data streams, which often make it impossible to provide exact real-time answers to all inputs and continuous queries. To deal with such overload situations, a DSMS must incorporate intelligent load-shedding policies that minimise the loss of quality of the query answers in the presence of such overloads. This problem has motivated interesting research seeking techniques to minimise the loss of information caused by load shedding (Kang et al., 2003; Das et al., 2003; Tatbul et al., 2003; Babcock et al., 2004).
Effective solutions are often made possible by the fact that many data stream applications and DSMS tasks, such as aggregate queries and stream mining methods, only require approximate answers, provided that these satisfy certain statistical accuracy requirements. Therefore, an interesting problem arises: designing policies that optimise answer accuracy when a certain amount of load shedding is needed on the input. The problem of designing load-shedding policies for aggregate queries was treated in Babcock et al. (2004), where an optimum policy based on random sampling was proposed. The approach of Babcock et al. (2004) is based on statistical properties of the incoming data streams. In this paper, we show that the same statistical properties also make it possible to improve the quality of the answers themselves, since more accurate estimates can then be derived from the random samples. We first present the theory that makes these improvements possible, and then introduce a method that realises them and yields significant benefits in many queries of practical interest.

We first consider the traditional aggregates, such as SUM, COUNT, and AVG, studied in Babcock et al. (2004), and then extend our approach to more complex aggregate-like queries, such as quantiles, and to mining queries. It should be noted that the benefits of the proposed theory and method are even more important for statistical functions than for simple aggregates, where a 'naive' uniform sampling of the content of the current sliding window tends to be reasonably effective. For complex mining and analytic functions, however, the errors in the approximate answers can be very large unless sophisticated estimation techniques are used. In this paper, therefore, we propose a technique to improve the accuracy of these queries given a sample of the window.
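To make the baseline concrete, the 'naive' uniform-sampling estimates for simple aggregates can be sketched as follows. This is a minimal illustration, not the authors' method: the sampling rate `p`, the `sample_window` and `estimate_aggregates` helpers, and the synthetic numeric window are all assumptions made for this sketch.

```python
import random

def sample_window(window, p):
    """Bernoulli-sample each tuple of a sliding window with probability p,
    emulating a load shedder that drops tuples uniformly at random."""
    return [x for x in window if random.random() < p]

def estimate_aggregates(sample, p):
    """Scale the sample's count and sum by 1/p to estimate COUNT, SUM and
    AVG over the full window (Horvitz-Thompson style scaling)."""
    est_count = len(sample) / p
    est_sum = sum(sample) / p
    # AVG is a ratio estimate: the 1/p factors cancel.
    est_avg = est_sum / est_count if est_count else 0.0
    return est_count, est_sum, est_avg

window = list(range(1, 1001))          # known COUNT = 1000, SUM = 500500, AVG = 500.5
random.seed(42)                        # fixed seed for reproducibility
sample = sample_window(window, p=0.2)  # shed roughly 80% of the load
count_hat, sum_hat, avg_hat = estimate_aggregates(sample, p=0.2)
```

Because each tuple survives with probability p, dividing the sample's count and sum by p gives unbiased estimates; their variance is what more sophisticated techniques, such as the exploitation of past statistics discussed in this paper, aim to reduce.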
Our method is not limited to load-shedding scenarios, but is applicable to all situations where a random sample of the data is available. For instance, random sampling might be caused by faulty data sources or transmission lines.

Related work. Most recent work has focused on computing approximate query answers. A first line of research concentrates on computing sliding window joins (Das et al., 2003; Srivastava and Widom, 2004) and aggregates (Dobra et al., 2002). Another interesting study, presented in Arasu and Manku (2004), focused on

