A Data Stream Mining System

Hetal Thakkar, Barzan Mozafari, Carlo Zaniolo
University of California at Los Angeles
{hthakkar, barzan, zaniolo}@cs.ucla.edu

Abstract

On-line data stream mining has attracted much research interest, but systems that can be used as a workbench for online mining have not been researched, since they pose many difficult research challenges. The proposed system addresses these challenges by an architecture based on three main technical advances: (i) introduction of new constructs and synoptic data structures whereby complex KDD queries can be easily expressed and efficiently supported, (ii) an integrated library of mining algorithms that are fast and light enough to be effective on data streams, and (iii) support for Mining Model Definition Language (MMDL), which allows users to define new mining algorithms as a set of tasks and flows. Thus, the proposed system provides an extensible workbench for online mining, which is beyond the existing proposals for even static mining.

1 Introduction

On-line data stream mining plays a key role in a growing number of real-world applications, including network traffic monitoring, intrusion detection, web click-stream analysis, and credit card fraud detection. Thus, many research projects have recently focused on designing fast mining algorithms, whereby massive data streams can be mined with real-time response [10, 4, 13, 7]. Similarly, many research projects have also focused on managing the data streams generated from these applications [9, 1, 6]. However, the problem of supporting mining algorithms in such systems has so far not received much research attention [12].

This situation seems unusual, since the need for a mining system for static data mining was immediately recognized [8] and has led to systems such as Weka [5] and OLE DB for DM [11]. Furthermore, static mining algorithms can also be written in a procedural language using a cache-mining approach that makes little use of DBMS essentials. However, online mining tasks cannot be deployed as stand-alone algorithms, since they require many DSMS essentials, such as I/O buffering, windows, synopses, load shedding, etc. Clearly, KDD researchers and practitioners would rather concentrate on the complexities of data mining tasks and avoid the complexities of managing data streams, by letting the mining system handle them. In short, while mining systems are a matter of convenience for stored data, they are a matter of critical necessity for data streams.

Thus, this demo presents the SMM system, namely Stream Mill Miner, which is specifically designed to address this critical necessity. Building such a system raises difficult research issues, which SMM solves through an architecture based on three main technical advances, as follows:

• Extending recently developed DSMSs, which are currently designed to only support simple queries, to express complex mining queries,
• Integrating a library of mining algorithms that are fast and light enough to be effective on data streams, and
• Supporting a higher-level mining language, namely Mining Model Definition Language (MMDL), which allows definition of mining models that encapsulate related mining tasks and mining flows for ease of use and extensibility.

Thus, SMM extends an existing DSMS, namely Stream Mill, with user-friendly, high-level mining models that are implemented with a powerful SQL-based continuous query language, namely Expressive Stream Language (ESL). ESL is an extension of SQL based on User Defined Aggregates (UDAs). Therefore, this demo presents the following key features and methods of SMM:

• Mining models and their use for online classification, clustering, and association rule mining,
• Generic support for advanced meta-concepts to improve the accuracy of classifiers, e.g., ensembles, and
• Definition of mining algorithms consisting of multiple processing steps as mining flows in MMDL.

2 High-Level Mining Models

An on-line data stream mining system should allow the user to (i) define new mining models and (ii) uniformly invoke a diverse set of built-in and user-defined mining algorithms. Existing solutions for static data mining do not allow (i) and simply focus on (ii), which results in a closed system. However, online mining systems must provide an open framework, since new on-line mining algorithms are constantly being proposed [10]. SMM achieves both of these goals by supporting MMDL, as we discuss next.

SMM allows the user to define new mining models by specifying the tasks that are associated with the model. For instance, most classifiers consist of two tasks, learning and predicting, whereas association rule mining consists of finding frequent patterns and deriving rules from them, and so on. Furthermore, data cleaning and post-analysis steps can also be specified as tasks. Finally, the analyst can specify mining flows that connect these tasks to implement complex mining processes, such as ensemble-based methods [13, 4, 7].

The model definition specifies the tables that are shared by different tasks of the model. Thus, different instances of the model work on separate instances of these tables, but the tasks of the same model instance share these tables. Additionally, the model definition associates a UDA with each individual task of the model, as discussed in Section 3. Example 1 defines a simple Naive Bayesian Classifier (NBC) and creates an instance of this model type in MMDL. In Example 1, the UDAs associated with the Learn and Classify tasks are LearnNaiveBayesian (omitted due to space constraints) and ClassifyNaiveBayesian (Example 3), respectively. Thus, MMDL allows users to create arbitrary mining models and instantiate them uniformly.

Once a mining model instance is created, the user can invoke different tasks of the model with a consistent syntax. For instance, Example 2 invokes the Learn task of the NBC instance created in Example 1. Note that we omit the discussion of the formal syntax here due to space constraints.

Example 1: Defining a ModelType for an NBC

    CREATE MODEL TYPE NaiveBayesianClassifier
      SHAREDTABLES DescriptorTbl
      Learn
        UDA LearnNaiveBayesian
        WINDOW TRUE
        PARTABLES
        PARAMETERS
      Classify
        UDA ClassifyNaiveBayesian
        WINDOW TRUE
        PARTABLES
        PARAMETERS
    CREATE MODEL INSTANCE NaiveBayesianInstance AS NaiveBayesianClassifier

Example 2: Invoking the Learn task of the NBC Instance

    RUN NaiveBayesianInstance Learn WITH TrainingSet

In Example 2, the TrainingSet is assumed to have the same schema as expected by the UDA associated with the Learn task; the system checks this automatically. Furthermore, the RUN statement allows an additional USING clause to specify the parameters.
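To make the relationship between model types, shared tables, tasks, and instances more concrete, the following is a minimal Python sketch of the same idea. It is not SMM code and does not reflect MMDL or ESL internals: the names ModelType, ModelInstance, learn_naive_bayes, and classify_naive_bayes are hypothetical, the two functions merely stand in for the UDAs, windows are ignored, and the add-one (Laplace) smoothing is an assumption of this sketch rather than something stated in the text.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class ModelInstance:
    """Hypothetical counterpart of CREATE MODEL INSTANCE."""
    name: str
    tasks: Dict[str, Callable]
    tables: Dict[str, dict]  # private copy of the model's shared tables

    def run(self, task: str, tuples: Iterable):
        # Uniform invocation, loosely mirroring RUN <instance> <task> WITH <input>.
        return self.tasks[task](self.tables, tuples)


@dataclass
class ModelType:
    """Hypothetical counterpart of CREATE MODEL TYPE."""
    name: str
    shared_tables: List[str]
    tasks: Dict[str, Callable]  # task name -> UDA-like function

    def instantiate(self, instance_name: str) -> ModelInstance:
        # Every instance gets separate tables; tasks of one instance share them.
        tables = {t: defaultdict(int) for t in self.shared_tables}
        return ModelInstance(instance_name, self.tasks, tables)


def learn_naive_bayes(tables, tuples):
    """Stand-in for the Learn UDA: accumulate class and value counts."""
    counts = tables["DescriptorTbl"]
    for features, label in tuples:
        counts[("class", label)] += 1
        for col, value in enumerate(features):
            counts[(col, value, label)] += 1


def classify_naive_bayes(tables, tuples):
    """Stand-in for the Classify UDA: pick the most probable class."""
    counts = tables["DescriptorTbl"]
    labels = [k[1] for k in counts if k[0] == "class"]
    values_per_col = defaultdict(set)
    for key in counts:
        if key[0] != "class":
            values_per_col[key[0]].add(key[1])
    total = sum(counts[("class", c)] for c in labels)
    predictions = []
    for features in tuples:
        best_label, best_score = None, float("-inf")
        for c in labels:
            n_c = counts[("class", c)]
            score = math.log(n_c / total)  # log prior
            for col, value in enumerate(features):
                seen = counts.get((col, value, c), 0)
                # Add-one smoothing (an assumption of this sketch).
                score += math.log((seen + 1) / (n_c + len(values_per_col[col])))
            if score > best_score:
                best_label, best_score = c, score
        predictions.append(best_label)
    return predictions


nbc_type = ModelType(
    name="NaiveBayesianClassifier",
    shared_tables=["DescriptorTbl"],
    tasks={"Learn": learn_naive_bayes, "Classify": classify_naive_bayes},
)
instance = nbc_type.instantiate("NaiveBayesianInstance")
other_instance = nbc_type.instantiate("AnotherInstance")

training_set = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
]
instance.run("Learn", training_set)
predictions = instance.run("Classify", [("sunny", "hot"), ("rainy", "cool")])
```

In this sketch, the Learn and Classify tasks of one instance communicate only through the instance's DescriptorTbl, while other_instance keeps its own empty table, which is the sharing behavior the model definition describes.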
