USC CSCI 599 - p289-korn
Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences

Flip Korn*                      H. V. Jagadish              Christos Faloutsos†
Dept. of Computer Science       AT&T Laboratories           Dept. of Computer Science and
University of Maryland          Florham Park, NJ 07932      Inst. for Systems Research
College Park, MD 20742          jag@research.att.com        University of Maryland
flip@cs.umd.edu                                             College Park, MD 20742
                                                            christos@cs.umd.edu

Abstract

Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access.

In this paper we consider a very large dataset comprising multiple distinct time sequences. Each point in the sequence is a numerical value. We show how to compress such a dataset into a format that supports ad hoc querying, provided that a small error can be tolerated when the data is uncompressed. Experiments on large, real-world datasets (AT&T customer calling patterns) show that the proposed method achieves an average of less than 5% error in any data value after compressing to a mere 2.5% of the original space (i.e., a 40:1 compression ratio), with these numbers not very sensitive to dataset size. Experiments on aggregate queries achieved a 0.5% reconstruction error with a space requirement under 2%.

1 Introduction

The bulk of the data in most data warehouses has a time component (e.g., sales per week, transactions per minute, phone calls per day, etc.). More formally, these datasets are of N time sequences, each of duration M, organized in an N x M matrix (N row vectors of dimensionality M). In such databases, decision support (i.e., statistical analysis) requires the ability to perform ad hoc queries. What one would like is a way to compress data in such a way that ad hoc queries are still supported efficiently. In this paper, we

* Work performed while visiting AT&T.
† Work performed while visiting AT&T.
Partially supported by NSF grants EEC-94-02384, IRI-9205273, and IRI-9625428.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD '97 AZ, USA
© 1997 ACM 0-89791-911-4/97/0005...$3.50

introduce a way to do this, for numerical (time sequence) data, at the cost of a small loss in numerical accuracy.

When the dataset is very large, accessing specific data values is a difficult problem. For instance, if the data is on tape, such access is next to impossible. When the data is all on disk, the cost of disk storage, even with today's falling disk prices, is typically a major concern, and anything one can do to decrease the amount of disk storage required is of value. We, the authors, ourselves have experience with more than one dataset that ran into hundreds of gigabytes, making storage of the data on disk prohibitively expensive. Unfortunately, most data compression techniques require large blocks of data to be effective, so that random access to arbitrary pieces of the data is no longer conveniently possible. This makes it difficult to issue ad hoc queries, and therefore most techniques do not support the sort of random ad hoc access desired for data mining and for many forms of decision support. Instead, the query style is forced to be one of careful planning for a "processing run" in which large chunks of data are temporarily uncompressed, examined as needed, and then compressed back immediately.

The goal of this paper is to develop techniques that will permit the compression of such large datasets in a manner that continues to permit random access to the cells of the matrix.
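The limitation of block-oriented compressors can be made concrete with a small sketch (illustrative only; any block compressor behaves similarly). In an uncompressed file, value i lives at a fixed byte offset and can be read with one seek; in a zlib-compressed block, there is no such offset, and the whole block must be decompressed to read a single value:

```python
import struct
import zlib

# Store 100,000 float64 values as one zlib-compressed block
# (a stand-in for any block-oriented compressor).
values = [float(v) for v in range(100_000)]
raw = struct.pack(f"{len(values)}d", *values)
block = zlib.compress(raw)

# Uncompressed: value i sits at byte offset 8*i -- one seek, O(1).
i = 12_345
direct = struct.unpack_from("d", raw, 8 * i)[0]

# Compressed: no byte offset corresponds to value i; the entire block
# must be decompressed before any single value can be read.
restored = zlib.decompress(block)
via_block = struct.unpack_from("d", restored, 8 * i)[0]

assert direct == via_block == 12345.0
```

This is exactly the "processing run" style the paragraph above describes: large chunks decompressed wholesale, examined, and discarded.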
By the term "random access" we mean that the time to reconstruct the value of any single cell is constant with respect to the number of rows N and columns M, with a small proportionality constant. Ideally, it should require 1 or 2 disk accesses (versus the 1 disk access that the uncompressed file would require if the whole file could fit on the disk). This is what is required to support ad hoc queries efficiently.

Table 1 provides an example of the kind of matrix that is typical in warehousing applications, where rows are customers, columns are days, and the values are the dollar amounts spent on phone calls each day. Alternatively, rows could correspond to patients, with hourly recordings of their temperature for the past 48 hours, or companies, with stock closing prices over the past 365 days. Such a setting also appears in other contexts. In information retrieval systems rows could be text documents, columns could be vocabulary terms, with the (i, j) entry showing the importance of the j-th term for the i-th document.

customer     We 7/10   Th 7/11   Fr 7/12   Sa 7/13   Su 7/14
ABC Inc.        1         1         1         0         0
DEF Ltd.        2         2         2         0         0
GHI Inc.        1         1         1         0         0
KLM Co.         5         5         5         0         0
Smith           0         0         0         2         2
Johnson         0         0         0         3         3
Thompson        0         0         0         1         1

Table 1: Example of a (customer-day) matrix

To make our discussion more concrete, we will refer to rows as "customers" and to columns as "days". The mathematical machinery is applicable to many different applications, such as those mentioned in the preceding paragraph, including ones where there is no notion of a customer or a day, as long as the problem involves a set of vectors or, equivalently, an N x M matrix X.

Decision support and data mining on large datasets often involves, at the lowest level, obtaining answers to queries, both exploratory queries as well as queries to verify hypotheses. These queries may require access to data records, either individually or in the aggregate: for one, some, or all customers; for one, some, or all days.
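One standard way to realize lossy compression of an N x M matrix while keeping constant-time cell access is a rank-k truncated SVD: store N*k + k + k*M numbers instead of N*M, and rebuild any cell (i, j) from k terms. This sketch is purely illustrative of the random-access property defined above, not a description of the method this paper develops; the toy matrix mimics the weekday/weekend pattern of Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "customer x day" matrix: business-like rows spend on weekdays,
# residential-like rows on weekends (cf. Table 1), plus small noise.
weekday = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
weekend = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
X = np.vstack([3 * weekday, 2 * weekday, 5 * weekend, 1 * weekend])
X = X + rng.normal(scale=0.01, size=X.shape)

# Compress: keep only the top-k singular triplets.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]  # N*k + k + k*M stored numbers

def cell(i, j):
    """Reconstruct X[i, j] from the compressed factors in O(k) time,
    independent of N and M -- "random access" in the sense above."""
    return float(Uk[i, :] @ (sk * Vtk[:, j]))

approx = cell(1, 0)
assert abs(approx - X[1, 0]) < 0.1
```

Because `cell` touches only one row of `Uk` and one column of `Vtk`, reconstructing a value needs a constant number of small reads regardless of how many customers or days the matrix holds.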
Two typical queries are:

● Queries on specific cells of the data matrix: 'What was the amount of sales to GHI Inc. on July 11, 1996?'

● Aggregate queries on selected rows and columns: 'Find the total sales to business customers (ABC, DEF, GHI, and KLM) for the week ending July 12, 1996.'

We study these two main classes of queries in this paper. There are three underlying assumptions/motivations behind the present work:

● The data matrix is huge, of the order of several GigaBytes. For example, in large corporations like AT&T, there are millions of customers (= rows);

● The number of rows N is much larger than the
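Aggregate queries of the second kind can be answered from low-rank factors without reconstructing every selected cell, because the sum distributes over the factorization: a sum over a row set R and column set C equals a k-term dot product of per-factor row sums and column sums. Again a hedged sketch using a truncated SVD as a stand-in compressed representation, not the paper's own technique:

```python
import numpy as np

# Toy 6-customer x 7-day matrix with exactly rank-2 structure.
rng = np.random.default_rng(1)
A = rng.random((6, 2)) @ rng.random((2, 7))

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

rows = [0, 1, 2, 3]  # e.g., the business customers
cols = [0, 1, 2]     # e.g., three days of the target week

# Naive answer: reconstruct and add every selected cell,
# O(k * |rows| * |cols|) work.
naive = sum(float(Uk[i] @ (sk * Vtk[:, j])) for i in rows for j in cols)

# Factored answer: O(k * (|rows| + |cols|)) work, since
# sum_{i,j} sum_l U[i,l] s[l] V[l,j]
#   = sum_l s[l] * (sum_i U[i,l]) * (sum_j V[l,j]).
row_part = Uk[rows].sum(axis=0) * sk  # length-k vector
col_part = Vtk[:, cols].sum(axis=1)   # length-k vector
fast = float(row_part @ col_part)

assert abs(fast - naive) < 1e-9
assert abs(fast - A[np.ix_(rows, cols)].sum()) < 1e-9
```

The design point: the aggregate over an entire row or column block costs no more than a handful of cell reconstructions, which is what makes such representations attractive for the decision-support queries listed above.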

