DOC PREVIEW
Duke CPS 296.1 - Mining and Monitoring Evolving Data

This preview shows page 1-2-3 out of 10 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

DEMON: Mining and Monitoring Evolving DataVenkatesh Ganti∗Johannes Gehrke†Raghu Ramakrishnan‡UW-Madison Cornell University [email protected] [email protected] [email protected] mining algorithms have been the focus of much re-search recently. In practice, the input data to a data min-ing process resides in a large data warehouse whose data iskept up-to-date through periodic or occasional addition anddeletion of blocks of data. Most data mining algorithmshaveeither assumed that the input data is static, or have been de-signed for arbitrary insertions and deletions of data records.In this paper, we consider a dynamic environment thatevolves through systematic addition or deletion of blocks ofdata. We introduce a new dimension called the data span di-mension, which allows user-defined selections of a temporalsubset of the database. Taking this new degree of freedominto account, we describe efficient model maintenance algo-rithms for frequent itemsets and clusters. We then describea generic algorithm that takes any traditional incrementalmodel maintenance algorithm and transforms it into an al-gorithm that allows restrictions on the data span dimension.In a detailed experimental study, we examine the validity andperformance of our ideas.1. IntroductionOrganizations have realized that the large amounts of datathey accumulate in their daily business operations can yielduseful “business intelligence,” or strategic insights, based onobserved patterns of activity. There is an increasing focuson data mining, which has been defined as the applicationof data analysis and discovery algorithms to large databaseswith the goal of discovering (predictive) models [9]. Severalalgorithms have been proposed for computing novel models,for more efficient model construction, to deal with new datatypes, and to quantify differences between datasets.Most data mining algorithms so far have assumed that theinput data is static and do not take into account that dataevolves over time. Recently, the problem of mining evolv-ing data has received some attention and incremental modelmaintenance algorithms for several data mining models have∗Supported by a Microsoft Research Fellowship†Work done when the author was at UW-Madison‡This research was supported by Grant 2053 from the IBM corporation.been developed [5, 10, 16, 7, 11]. These algorithms are de-signed to incrementally maintain a data mining model underarbitrary insertions and deletions of records to the database.But real-life data often does not evolve in an arbitraryway. Consider a data warehouse, a large collection of datafrom multiple sources consolidated into a common reposi-tory, to enable complex data analysis [4]. The data ware-house is updated with new batches of records at regular timeintervals, e.g., every day at midnight. Thus data in the datawarehouse evolves through addition and deletion of batchesof records at a time. We refer to data that changes throughad-dition and deletion of sets of records as systematically evolv-ing data. The main difference between arbitrary and system-atic evolution is that in the former an individual record canbe updated at any time, whereas in the latter sets of recordsare added together.In this paper, we assume a dynamic environment of sys-tematically evolving data and introduce the problem of min-ing systematically evolving data. The main contributions ofour work are:1. We present a DEMONic1view of the world by explor-ing the problem space of mining systematically evolv-ing data (Section 2). We introduce a new dimensioncalled the data span dimension, which takes the tempo-ral aspect of the data evolution into account and allowsan analyst to “mine” relevant subsets of the data.2. We describe new model maintenance algorithms withrespect to the selection constraints on the data span di-mension for two popular classes of data mining models:frequent itemsets and clustering (Section 3.1). These al-gorithms exploit the systematic block evolution to im-prove the state-of-the-art incremental algorithms. Wealso introduce a generic algorithm that takes any tra-ditional incremental model maintenance algorithm andderives an incremental algorithm that allows restrictionson the data span dimension (Section 3.2). In particular,the generic algorithm can be instantiated with our incre-mental algorithms in Section 3.1.1Data Evolution and MONitoring3. In an extensive experimental study, we evaluate our al-gorithms and compare them with previous work wher-ever possible (Section 4).2. DEMONIn this section, we introduce the problem of mining sys-tematically evolving data. We describe our model of system-atic data evolution in Section 2.1. In Section 2.2, we enumer-ate the problem space of mining systematically evolving databy introducing the data span dimension, which allows tem-poral restrictions on the data being mined. Then we refinethe type of restrictions by introducing the notion of a blockselection sequence in Section 2.3.2.1. Systematic data evolutionWe now describe our model of evolving data. We use theterm tuple generically to stand for the basic unit of infor-mation in the data, e.g., a customer transaction, a databaserecord, or an n-dimensional point. A block is a set of tu-ples. We assume that the database D consists of a (con-ceptually infinite) sequence of blocks D1,...,Dk,...whereeach block Dkis associated with an identifier k. We assumewithout loss of generality that all identifiers are natural num-bers and that they increase in the order of their arrival. Un-less otherwise mentioned, we use t to denote the identifierof the “latest” block Dt. We call the sequence of all blocksD1,...,Dtcurrently in the database the current databasesnapshot.Note that we do not assume that block evolution followsa regular period; different blocks may span different timeunits. For example, the first two blocks of data may be addedto the database on Saturday and Sunday, respectively, andthe third block on the following Friday. The framework cannaturally handle this type of irregular block evolution. Thelack of constraints on the time spanned by any block alsoallows us to incorporate hierarchies on the time dimension.(We just merge all blocks that fall under the same parent.)2.2. Data span dimensionWhen mining systematically evolving data, some applica-tions are interested in mining all the data accumulated thusfar, whereas some other applications are interested in miningonly a recently


View Full Document

Duke CPS 296.1 - Mining and Monitoring Evolving Data

Documents in this Course
Lecture

Lecture

18 pages

Lecture

Lecture

6 pages

Lecture

Lecture

13 pages

Lecture

Lecture

5 pages

Load more
Download Mining and Monitoring Evolving Data
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Mining and Monitoring Evolving Data and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Mining and Monitoring Evolving Data 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?