ICDMIEEE08

A Data Stream Mining System

Hetal Thakkar, Barzan Mozafari, Carlo Zaniolo
University of California at Los Angeles
{hthakkar, barzan, zaniolo}@cs.ucla.edu

Abstract

On-line data stream mining has attracted much research interest, but systems that can be used as a workbench for online mining have not been researched, since they pose many difficult research challenges. The proposed system addresses these challenges by an architecture based on three main technical advances: (i) introduction of new constructs and synoptic data structures whereby complex KDD queries can be easily expressed and efficiently supported, (ii) an integrated library of mining algorithms that are fast & light enough to be effective on data streams, and (iii) support for Mining Model Definition Language (MMDL), which allows users to define new mining algorithms as a set of tasks and flows. Thus, the proposed system provides an extensible workbench for online mining, which is beyond the existing proposals for even static mining.

1 Introduction

On-line data stream mining plays a key role in a growing number of real-world applications, including network traffic monitoring, intrusion detection, web click-stream analysis, and credit card fraud detection. Thus, many research projects have recently focused on designing fast mining algorithms, whereby massive data streams can be mined with real-time response [10, 4, 13, 7]. Similarly, many research projects have also focused on managing the data streams generated from these applications [9, 1, 6]. However, the problem of supporting mining algorithms in such systems has, so far, not received much research attention [12]. This situation seems unusual, since the need for a mining system for static data mining was immediately recognized [8] and has led to systems such as Weka [5] and OLE DB for DM [11]. Furthermore, static mining algorithms can also be written in a procedural language using a cache mining approach that makes little use of DBMS essentials.
However, online mining tasks cannot be deployed as stand-alone algorithms, since they require many DSMS essentials, such as I/O buffering, windows, synopses, load shedding, etc. Clearly, KDD researchers and practitioners would rather concentrate on the complexities of data mining tasks and avoid the complexities of managing data streams, by letting the mining system handle them. In short, while mining systems are a matter of convenience for stored data, they are a matter of critical necessity for data streams. Thus, this demo presents the SMM system, namely Stream Mill Miner, which is specifically designed to address this critical necessity. Building such a system raises difficult research issues, which SMM solves through an architecture based on three main technical advances, as follows.

• Extending recently developed DSMSs, which are currently designed to only support simple queries, to express complex mining queries,
• Integrating a library of mining algorithms that are fast & light enough to be effective on data streams, and
• Supporting a higher level mining language, namely Mining Model Definition Language (MMDL), which allows definition of mining models that encapsulate related mining tasks and mining flows for ease-of-use and extensibility.

Thus, SMM extends an existing DSMS, namely Stream Mill, with user-friendly, high-level mining models that are implemented with a powerful SQL-based continuous query language, namely Expressive Stream Language (ESL). ESL is an extension of SQL based on User Defined Aggregates (UDAs). Therefore, this demo presents the following key features and methods of SMM.

• Mining models and their use for online classification, clustering, and association rule mining,
• Generic support for advanced meta concepts to improve accuracy of classifiers, e.g. ensembles, and
• Definition of mining algorithms consisting of multiple processing steps as mining flows in MMDL.

2 High-Level Mining Models

An on-line data stream mining system should allow the user to (i) define new mining models and (ii) uniformly invoke a diverse set of built-in and user-defined mining algorithms. Existing solutions for static data mining do not allow (i) and simply focus on (ii), which results in a closed system. However, online mining systems must provide an open framework, since new on-line mining algorithms are constantly being proposed [10]. SMM achieves both of these goals by supporting MMDL, as we discuss next.

SMM allows the user to define new mining models by specifying the tasks that are associated with the model. For instance, most classifiers will consist of two tasks, learning and predicting, whereas association rule mining consists of finding frequent patterns and deriving rules from them, and so on. Furthermore, data cleaning and post-analysis steps can also be specified as tasks. Finally, the analyst can specify mining flows that connect these tasks to implement complex mining processes, such as ensemble based methods [13, 4, 7]. The model definition specifies the tables that are shared by different tasks of the model. Thus, different instances of the model will work on separate instances of these tables, but the tasks of the same model instance share these tables. Additionally, the model definition associates a UDA with each individual task of the model, as discussed in Section 3.

Example 1 defines a simple Naive Bayesian Classifier (NBC) and creates an instance of this model type in MMDL. In Example 1, the UDAs associated with the Learn and Classify tasks are LearnNaiveBayesian (omitted due to space constraints) and ClassifyNaiveBayesian (Example 3), respectively. Thus, MMDL allows users to create arbitrary mining models and instantiate them uniformly.
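
As a rough illustration of the relationship between a model type, its tasks, and the per-instance shared tables described above, the following Python sketch mimics the idea outside of SMM. It is not SMM's implementation: the classes ModelType and ModelInstance, the run method, and the count-based learn and classify functions are hypothetical stand-ins for the MMDL constructs and for the LearnNaiveBayesian and ClassifyNaiveBayesian UDAs. Windows, PARTABLES, and PARAMETERS are ignored, and the add-one smoothing is an assumption of this sketch.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Tables = Dict[str, dict]                      # table name -> table contents
Task = Callable[[Tables, Iterable], list]     # a task reads/writes the shared tables


@dataclass
class ModelType:
    """Hypothetical analogue of CREATE MODEL TYPE: named tasks plus shared tables."""
    name: str
    shared_tables: List[str]
    tasks: Dict[str, Task]

    def create_instance(self) -> "ModelInstance":
        # Each instance gets its own private copy of every shared table.
        return ModelInstance(self, {t: {} for t in self.shared_tables})


@dataclass
class ModelInstance:
    """Hypothetical analogue of CREATE MODEL INSTANCE."""
    model_type: ModelType
    tables: Tables

    def run(self, task_name: str, stream: Iterable) -> list:
        # Uniform invocation: look the task up by name and hand it this
        # instance's tables together with the input tuples.
        return self.model_type.tasks[task_name](self.tables, stream)


def learn_naive_bayesian(tables: Tables, stream: Iterable) -> list:
    """Stand-in for the Learn task: accumulate class and attribute counts."""
    tbl = tables["DescriptorTbl"]
    class_counts = tbl.setdefault("class", {})
    feat_counts = tbl.setdefault("feat", {})
    for features, label in stream:
        class_counts[label] = class_counts.get(label, 0) + 1
        for i, value in enumerate(features):
            key = (i, value, label)
            feat_counts[key] = feat_counts.get(key, 0) + 1
    return []


def classify_naive_bayesian(tables: Tables, stream: Iterable) -> list:
    """Stand-in for the Classify task: pick the class with the highest log score."""
    tbl = tables["DescriptorTbl"]
    class_counts = tbl.get("class", {})
    feat_counts = tbl.get("feat", {})
    total = sum(class_counts.values())
    predictions = []
    for features in stream:
        best_label, best_score = None, -math.inf
        for label, n in class_counts.items():
            score = math.log(n / total)
            for i, value in enumerate(features):
                # Add-one smoothing over the attribute values seen so far.
                distinct = len({v for (j, v, _) in feat_counts if j == i})
                count = feat_counts.get((i, value, label), 0)
                score += math.log((count + 1) / (n + distinct))
            if score > best_score:
                best_label, best_score = label, score
        predictions.append(best_label)
    return predictions


nbc_type = ModelType(
    "NaiveBayesianClassifier",
    ["DescriptorTbl"],
    {"Learn": learn_naive_bayesian, "Classify": classify_naive_bayesian},
)
```

Two instances created with nbc_type.create_instance() can then be trained on different inputs through run("Learn", ...): the Learn and Classify tasks of one instance see the same DescriptorTbl, while the other instance's table is unaffected.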
Once a mining model instance is created, the user can then invoke different tasks of the model with a consistent syntax. For instance, Example 2 invokes the Learn task of the NBC instance created in Example 1. Note that we omit the discussion of the formal syntax here, due to space constraints.

Example 1 Defining a ModelType for an NBC

CREATE MODEL TYPE NaiveBayesianClassifier {
    SHAREDTABLES (DescriptorTbl),
    Learn (UDA LearnNaiveBayesian,
           WINDOW TRUE, PARTABLES(), PARAMETERS()),
    Classify (UDA ClassifyNaiveBayesian,
              WINDOW TRUE, PARTABLES(), PARAMETERS())
};
CREATE MODEL INSTANCE NaiveBayesianInstance
    AS NaiveBayesianClassifier;

Example 2 Invoking the Learn task of the NBC Instance

RUN NaiveBayesianInstance.Learn WITH TrainingSet;

In Example 2, the TrainingSet is assumed to have the same schema as expected by the UDA

