Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences

Flip Korn*                      H. V. Jagadish               Christos Faloutsos†
Dept. of Computer Science       AT&T Laboratories            Dept. of Computer Science and
University of Maryland          Florham Park, NJ 07932       Inst. for Systems Research
College Park, MD 20742          jag@research.att.com         University of Maryland
flip@cs.umd.edu                                              College Park, MD 20742
                                                             christos@cs.umd.edu

Abstract

Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access.

In this paper we consider a very large dataset comprising multiple distinct time sequences. Each point in the sequence is a numerical value. We show how to compress such a dataset into a format that supports ad hoc querying, provided that a small error can be tolerated when the data is uncompressed. Experiments on large, real-world datasets (AT&T customer calling patterns) show that the proposed method achieves an average of less than 5% error in any data value after compressing to a mere 2.5% of the original space (i.e., a 40:1 compression ratio), with these numbers not very sensitive to dataset size. Experiments on aggregate queries achieved a 0.5% reconstruction error with a space requirement under 2%.

1 Introduction

The bulk of the data in most data warehouses has a time component (e.g., sales per week, transactions per minute, phone calls per day, etc.). More formally, these datasets consist of N time sequences, each of duration M, organized in an N x M matrix (N row vectors of dimensionality M). In such databases, decision support (i.e., statistical analysis) requires the ability to perform ad hoc queries. What one would like is a way to compress data in such a way that ad hoc queries are still supported efficiently. In this paper, we

* Work performed while visiting AT&T.
† Work performed while visiting AT&T.
Partially supported by NSF grants EEC-94-02384, IRI-9205273, IRI-9625428.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD '97 AZ, USA
© 1997 ACM 0-89791-911-4/97/0005...$3.50

introduce a way to do this, for numerical (time sequence) data, at the cost of a small loss in numerical accuracy.

When the dataset is very large, accessing specific data values is a difficult problem. For instance, if the data is on tape, such access is next to impossible. When the data is all on disk, the cost of disk storage, even with today's falling disk prices, is typically a major concern, and anything one can do to decrease the amount of disk storage required is of value. We, the authors, ourselves have experience with more than one dataset that ran into hundreds of gigabytes, making storage of the data on disk prohibitively expensive. Unfortunately, most data compression techniques require large blocks of data to be effective, so that random access to arbitrary pieces of the data is no longer conveniently possible. This makes it difficult to issue ad hoc queries, and therefore most techniques do not support the sort of random ad hoc access desired for data mining and for many forms of decision support. Instead, the query style is forced to be one of careful planning for a "processing run" in which large chunks of data are temporarily uncompressed, examined as needed, and then compressed back immediately.

The goal of this paper is to develop techniques that will permit the compression of such large datasets in a manner that continues to permit random access to the cells of the matrix.
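One concrete way to picture "compression that still permits random access" is a low-rank factorization of the N x M matrix: keep only the top k singular triplets, so storage drops from N*M values to k(N+M+1), and any single cell is reconstructed as a k-term dot product, at a cost independent of N and M. The sketch below is only an illustration of this idea (a truncated SVD in numpy), not a statement of the specific method this paper develops; all names in it are ours.

```python
import numpy as np

def compress(X, k):
    """Keep the top-k singular triplets of X (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def cell(U, s, Vt, i, j):
    """Reconstruct X[i, j] from the compressed form: a k-term dot
    product, so the cost is O(k), independent of N and M."""
    return float((U[i] * s) @ Vt[:, j])

rng = np.random.default_rng(0)
# a synthetic rank-5 "customer x day" matrix, N = 1000, M = 365
X = rng.random((1000, 5)) @ rng.random((5, 365))
U, s, Vt = compress(X, 5)
# storage: 5 * (1000 + 365 + 1) values instead of 1000 * 365
err = abs(cell(U, s, Vt, 42, 100) - X[42, 100])
print(err)  # tiny, since X really has rank 5
```

With k fixed, reconstructing a cell touches one row of U and one column of Vt, which is what makes 1-2 disk accesses per cell plausible even when the factors are paged.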
By the term "random access" we mean that the time to reconstruct the value of any single cell is constant with respect to the number of rows N and columns M, with a small proportionality constant. Ideally, it should require 1 or 2 disk accesses (versus the 1 disk access that the uncompressed file would require if the whole file could fit on the disk). This is what is required to support ad hoc queries efficiently.

Table 1 provides an example of the kind of matrix that is typical in warehousing applications, where rows are customers, columns are days, and the values are the dollar amounts spent on phone calls each day. Alternatively, rows could correspond to patients, with hourly recordings of their temperature for the past 48 hours, or companies, with stock closing prices over the past 365 days. Such a setting also appears in other contexts. In information retrieval systems rows could be text documents, columns could be vocabulary terms, with the (i, j) entry showing the importance of the j-th term for the i-th document.

customer     We     Th     Fr     Sa     Su
             7/10   7/11   7/12   7/13   7/14
ABC Inc.     1      1      1      0      0
DEF Ltd.     2      2      2      0      0
GHI Inc.     1      1      1      0      0
KLM Co.      5      5      5      0      0
Smith        0      0      0      2      2
Johnson      0      0      0      3      3
Thompson     0      0      0      1      1

Table 1: Example of a (customer-day) matrix

To make our discussion more concrete, we will refer to rows as "customers" and to columns as "days". The mathematical machinery is applicable to many different applications, such as those mentioned in the preceding paragraph, including ones where there is no notion of a customer or a day, as long as the problem involves a set of vectors or, equivalently, an N x M matrix X.

Decision support and data mining on large datasets often involves, at the lowest level, obtaining answers to queries, both exploratory queries as well as queries to verify hypotheses. These queries may require access to data records, either individually or in the aggregate: for one, some, or all customers; for one, some, or all days.
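On Table 1's small matrix, both kinds of access just described (an individual cell, or an aggregate over chosen customers and days) are simple index and sum operations; the snippet below spells this out on the uncompressed data, with the cell values transcribed from Table 1 (the weekend columns Sa/Su are reconstructed from the garbled original and may differ slightly from the published table).

```python
import numpy as np

# Table 1: rows are customers, columns are days 7/10 (We) .. 7/14 (Su)
customers = ["ABC Inc.", "DEF Ltd.", "GHI Inc.", "KLM Co.",
             "Smith", "Johnson", "Thompson"]
days = ["7/10", "7/11", "7/12", "7/13", "7/14"]
X = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]])

# a query on a specific cell: sales to GHI Inc. on July 11
print(X[customers.index("GHI Inc."), days.index("7/11")])  # -> 1

# an aggregate query: total sales to the business customers over
# the days shown through Friday 7/12 (columns 7/10..7/12)
rows = [customers.index(c)
        for c in ["ABC Inc.", "DEF Ltd.", "GHI Inc.", "KLM Co."]]
print(X[rows, :][:, :3].sum())  # -> 27
```

The point of the paper is to answer exactly these queries without materializing X, reading only the compressed representation.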
Two typical queries are:

● Queries on specific cells of the data matrix: 'what was the amount of sales to GHI Inc. on July 11, 1996?'

● Aggregate queries on selected rows and columns: 'find the total sales to business customers (ABC, DEF, GHI, and KLM) for the week ending July 12, 1996.'

We study these two main classes of queries in this paper. There are three underlying assumptions/motivations behind the present work:

● The data matrix is huge, of the order of several gigabytes. For example, in large corporations like AT&T, there are millions of customers (= rows);

● The number of rows N is much larger than the number of columns M.