Contents: Introduction; ANNCAD; Notation; Quantization of the Feature Space; Building a Classifier and Classifying Test Points; Incremental Updates of Classifiers; Building Several Classifiers Using Different Grids; Properties of ANNCAD; Performance Evaluation; Synthetic Data Sets; Real Life Data Sets; Conclusion and Future Work.

An Adaptive Nearest Neighbor Classification Algorithm for Data Streams

Yan-Nei Law and Carlo Zaniolo
Computer Science Dept., UCLA, Los Angeles, CA 90095, USA
{ynlaw, zaniolo}@cs.ucla.edu

Abstract. In this paper, we propose an incremental classification algorithm which uses a multi-resolution data representation to find adaptive nearest neighbors of a test point. The algorithm achieves excellent performance by using small classifier ensembles where approximation error bounds are guaranteed for each ensemble size. The very low update cost of our incremental classifier makes it highly suitable for data stream applications. Tests performed on both synthetic and real-life data indicate that our new classifier outperforms existing algorithms for data streams in terms of accuracy and computational costs.

1 Introduction

A significant amount of recent research has focused on mining data streams for applications such as financial data analysis, network monitoring, security, sensor networks, and many others [3,8]. Algorithms for mining data streams have to address challenges not encountered in traditional mining of stored data: at the physical level, these include fast input rates and unending data sets, while, at the logical level, there is the need to cope with concept drift [18]. Therefore, classical classification algorithms must be replaced by, or modified into, incremental algorithms that are fast and light and that gracefully adapt to changes in data statistics [17,18,5].

Related Works. Because of their good performance and intuitive appeal, decision tree classifiers and nearest neighborhood classifiers have been widely used in traditional data mining tasks [9].
For data streams, several decision tree classifiers have been proposed, either as single decision trees or as ensembles of such trees. In particular, VFDT [7] and CVFDT [10] are well-known algorithms for building single decision tree classifiers on stationary and time-changing data streams, respectively. These algorithms employ a criterion based on Hoeffding bounds to decide when a further level of the current decision tree should be created. While this approach assures interesting theoretical properties, the time required for updating the decision tree can be significant, and a large number of samples is needed to build a classifier with reasonable accuracy. When the size of the training set is small, the performance of this approach can be unsatisfactory.

Another approach to data stream classification uses ensemble methods. These construct a set of classifiers with a base learner and then combine the predictions of these base models by voting techniques. Previous research [17,18,5] has shown that ensembles can often outperform single classifiers and are also suitable for coping with concept drift. On the other hand, ensemble methods suffer from the drawback that they often fail to provide a simple model and understanding of the problem at hand [9].

In this paper, we focus on building nearest neighbor (NN) classifiers for data streams. This technique works well in traditional data mining applications, is supported by a strong intuitive appeal, and is rather simple to implement. However, the time spent finding the exact NN can be expensive, and therefore a significant amount of previous research has focused on this problem. A well-known method for accelerating the nearest neighbor lookup is to use k-d trees [4].

(Published in: A. Jorge et al. (Eds.): PKDD 2005, LNAI 3721, pp. 108–120. © Springer-Verlag Berlin Heidelberg 2005.)
A k-d tree is a balanced binary tree that recursively splits a d-dimensional space into smaller subregions. However, the tree can become seriously unbalanced by massive new arrivals in the data stream, and thus lose its ability to expedite the search. Another approach to NN classifiers attempts to provide approximate answers with error-bound guarantees. There are many novel algorithms [11,12,13,14] for finding approximate K-NN on stored data. However, to find the (1 + ε)-approximate nearest neighbors, these algorithms must perform multiple scans of the data. Also, the update cost of the dynamic algorithms [11,13,14] depends on the size of the data set, since the entire data set is needed for the update process. Therefore, they are not suitable for mining data streams.

Our ANNCAD Algorithm. In this paper, we introduce ANNCAD, an Adaptive NN Classification Algorithm for Data streams. It is well known that when data is non-uniform, it is difficult to predetermine K in KNN classification [6,20]. So, instead of fixing a specific number of neighbors, as in the usual KNN algorithm, we adaptively expand the nearby area of a test point until a satisfactory classification is obtained. To save computation time when finding adaptive NN, we first preassign a class to every subregion (cell). To achieve this, we decompose the feature space of a training set and obtain a multi-resolution data representation. There are many decomposition techniques for multi-resolution data representations; the averaging technique used in this paper can be thought of as a Haar wavelet transformation [16]. Information from different resolution levels can then be used to adaptively preassign a class to every cell. We then determine to which cell the test point belongs, in order to predict its class. Moreover, because of the compact support property inherited from wavelets, the time spent updating a classifier when a new tuple arrives is a small constant, independent of the size of the data set.
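To make the multi-resolution idea concrete, the following is a minimal sketch, not the authors' implementation: the 2-D unit-square setup, grid size, and all function names are illustrative assumptions. A fine grid of per-cell class counts is coarsened by summing 2×2 blocks (Haar-style averaging), a test point is classified from the finest level whose covering cell is non-empty, and inserting a new tuple touches one cell per level, so the update cost is a small constant independent of the data size.

```python
def build_levels(points, labels, n_classes, cells=8):
    """Quantize [0,1)^2 into a cells x cells grid of class counts,
    then build coarser levels by summing 2x2 blocks (Haar-style).
    `cells` is assumed to be a power of two."""
    counts = [[[0] * n_classes for _ in range(cells)] for _ in range(cells)]
    for (x, y), c in zip(points, labels):
        i = min(int(x * cells), cells - 1)
        j = min(int(y * cells), cells - 1)
        counts[i][j][c] += 1
    levels = [counts]          # levels[0] is the finest grid
    size = cells
    while size > 1:
        fine, size = levels[-1], size // 2
        coarse = [[[sum(fine[2 * i + a][2 * j + b][c]
                        for a in (0, 1) for b in (0, 1))
                    for c in range(n_classes)]
                   for j in range(size)]
                  for i in range(size)]
        levels.append(coarse)
    return levels

def classify(levels, point):
    """Use the finest level whose cell containing `point` has data,
    falling back to coarser levels (an 'expanding neighborhood')."""
    for lvl in levels:
        cells = len(lvl)
        i = min(int(point[0] * cells), cells - 1)
        j = min(int(point[1] * cells), cells - 1)
        cell = lvl[i][j]
        if sum(cell) > 0:
            return cell.index(max(cell))
    return 0  # empty training set: arbitrary default

def insert(levels, point, label):
    """Incremental update: one cell per level, independent of data size."""
    for lvl in levels:
        cells = len(lvl)
        i = min(int(point[0] * cells), cells - 1)
        j = min(int(point[1] * cells), cells - 1)
        lvl[i][j][label] += 1
```

For example, a test point whose fine-grid cell is empty is classified from the first coarser level whose enclosing cell contains training data, which mimics expanding the neighborhood until a satisfactory classification is obtained.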
Unlike VFDT, which requires a large data set to decide whether to expand the tree by one more level, ANNCAD does not have this restriction.

In this paper, we use a grid-based approach for classification. The main characteristic of this approach is its fast processing time and small memory usage, which are independent of the number of data points: they depend only on the number of cells in each dimension of the discretized space, which is easy to adjust in order to satisfy system constraints. Therefore, this approach has been widely employed in clustering problems. Some …