New version page

Evolving Data Streams

Upgrade to remove ads

This preview shows page 1-2-3-4 out of 13 pages.

Save
View Full Document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience

Upgrade to remove ads
Unformatted text preview:

A Framework for On-Demand Classificationof Evolving Data StreamsCharu C. Aggarwal, Senior Member, IEEE, Jiawei Han, Senior Member, IEEE,Jianyong Wang, Member, IEEE, and Philip S. Yu, Fellow, IEEEAbstract—Current models of the classification problem do not effectively handle bursts of particular classes coming in at differenttimes. In fact, the current model of the classification problem simply concentrates on methods for one-pass classification modeling ofvery large data sets. Our model for data stream classification views the data stream classification problem from the point of view of adynamic approach in which simultaneous training and test streams are used for dynamic classification of data sets. This model reflectsreal-life situations effectively, since it is desirable to classify test streams in real time over an evolving training and test stream. The aimhere is to create a classification system in which the training model can adapt quickly to the changes of the underlying data stream. Inorder to achieve this goal, we propose an on-demand classification process which can dynamically select the appropriate window ofpast training data to build the classifier. The empirical results indicate that the system maintains a high classification accuracy in anevolving data stream, while providing an efficient solution to the classification task.Index Terms—Stream classification, geometric time frame, microclustering, nearest neighbor.æ1INTRODUCTIONIN recent years, advances in data storage technology haveled to the ability to store the data for real-timetransactions. Such processes lead to data which often growwithout limit and are referred to as data streams. Discus-sions on recent advances in data stream mining may befound in [4], [5], [6], [8], [9], [15], [17], [19], [21].One important data mining problem which has beenstudied in the context of data streams is that of classification[10], [11], [12]. The main thrust on data stream mining in thecontext of classification has been that of one-pass mining[8], [13], [16], [18], [23]. In reality, the nature of theunderlying changes in the data stream can impose con-siderable challenges. Previous work on stream classificationsimply treats it as a one-pass mining problem. This does nottreat the classification problem in the context o f theunderlying changes which have occurred in the stream. Ingeneral, the use of one-pass mining does not recognize thechanges which have occurred in the model since thebeginning of the stream construction process. This isgenerally quite important since data streams may evolveconsiderably over time [3]. While the work in [13] works ontime changing data streams, the focus is on providingeffective methods for incremental updating of the classifica-tion model. We note that the accuracy of such a modelcannot be greater than the best sliding window model on adata stream. This is because such a model implicitly uses asliding window which is equal to the history of the entiredata stream. As our empirical results will show, the truebehavior of the data stream is captured in a temporal modelwhich is sensitive to the level of evolution of the datastream. Therefore, a more temporally adaptive philosophyis desirable to improve the effectiveness of the underlyingalgorithms.The classification process may require simultaneousmodel construction and testing in an environment whichconstantly evolves over time. We assume that the testingprocess is performed concurrently with the trainingprocess. This is often the case in many practical applica-tions, in which only a portion of the data is labeled, whereasthe remaining is not. Therefore, such data can be separatedout into the (labeled) training stream and the (unlabeled)testing stream. The most effective classification model to beused does not stay constant over time, but varies withprogression of the data stream. If a static classificationmodel were used for an evolving test stream, the accuracyof the underlying classification process is likely to dropsuddenly when there is a sudden burst of records belongingto a particular class. In such a case, a classification modelwhich is constructed using a smaller history of data is likelyto provide better accuracy. On the other hand, if the streamhas been relatively stable over time, then using a longerhistory for training makes greater sense.In the classification process of an evolving data stream,either the short-term or long-term behavior of the streammay be more important, and it often cannot be knowna priori as to which one is more important. How do wedecide the window or horizon of the training data to use soas to obtain the best classification accuracy? Whiletechniques such as decision trees are useful for one-passmining of data streams [8], [13], [23], these cannot be easilyused in the context of an on-demand classifier in an evolvingIEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 18, NO. 5, MAY 2006 577. C.C. Aggarwal and P.S. Yu are with the IBM T.J. Watson Research Center,19 Skyline Drive, Hawthorne, NY 10532.E-mail: {charu, psyu}@us.ibm.com.. J. Han is with the University of Illinois at Urbana-Champaign, Urbana, IL61801. E-mail: [email protected] J. Wang is with Tsinghua University, Beijing, China.E-mail: [email protected] received 28 Mar. 2005; revised 28 Oct. 2005; accepted 19 Jan.2006; published online 17 Mar. 2006.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TKDE-0114-0305.1041-4347/06/$20.00 ß 2006 IEEE Published by the IEEE Computer SocietyAuthorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 13:12 from IEEE Xplore. Restrictions apply.environment. This is because such a classifier requires rapidvariation in the horizon selection process due to data streamevolution. In this respect, nearest-neighbor classifiers tendto be more amenable to quick horizon adjustments becauseof their simplicity. However, it is still too expensive to keeptrack of the entire history of the data in its original finegranularity. Therefore, the on-demand classification processstill requires the appropriate machinery for efficient andadjustable statistical data collection in order to perform thecorresponding operations in an efficient way.In this paper, we will develop such an on-demandclassifier. The on-demand classifier is designed by adaptingthe (unsupervised) microclustering model [2] to


Download Evolving Data Streams
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Evolving Data Streams and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Evolving Data Streams 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?