Contents: Introduction; ANNCAD; Notation; Quantization of the Feature Space; Building a Classifier and Classifying Test Points; Incremental Updates of Classifiers; Building Several Classifiers Using Different Grids; Properties of ANNCAD; Performance Evaluation; Synthetic Data Sets; Real Life Data Sets; Conclusion and Future Work.

An Adaptive Nearest Neighbor Classification Algorithm for Data Streams

Yan-Nei Law and Carlo Zaniolo
Computer Science Dept., UCLA, Los Angeles, CA 90095, USA
{ynlaw, zaniolo}@cs.ucla.edu

Abstract. In this paper, we propose an incremental classification algorithm which uses a multi-resolution data representation to find adaptive nearest neighbors of a test point. The algorithm achieves excellent performance by using small classifier ensembles where approximation error bounds are guaranteed for each ensemble size. The very low update cost of our incremental classifier makes it highly suitable for data stream applications. Tests performed on both synthetic and real-life data indicate that our new classifier outperforms existing algorithms for data streams in terms of accuracy and computational costs.

1 Introduction

A significant amount of recent research has focused on mining data streams for applications such as financial data analysis, network monitoring, security, sensor networks, and many others [3,8]. Algorithms for mining data streams have to address challenges not encountered in traditional mining of stored data: at the physical level, these include fast input rates and unending data sets, while, at the logical level, there is the need to cope with concept drift [18]. Therefore, classical classification algorithms must be replaced by, or modified into, incremental algorithms that are fast and light and that gracefully adapt to changes in data statistics [17,18,5].

Related Works. Because of their good performance and intuitive appeal, decision tree classifiers and nearest neighborhood classifiers have been widely used in traditional data mining tasks [9].
For data streams, several decision tree classifiers have been proposed, either as single decision trees or as ensembles of such trees. In particular, VFDT [7] and CVFDT [10] are well-known algorithms for building single decision tree classifiers on stationary and time-changing data streams, respectively. These algorithms employ a criterion based on Hoeffding bounds to decide when a further level of the current decision tree should be created. While this approach assures interesting theoretical properties, the time required for updating the decision tree can be significant, and a large number of samples is needed to build a classifier with reasonable accuracy. When the size of the training set is small, the performance of this approach can be unsatisfactory.

Another approach to data stream classification uses ensemble methods. These construct a set of classifiers with a base learner and then combine the predictions of these base models by voting techniques. Previous research [17,18,5] has shown that ensembles can often outperform single classifiers and are also suitable for coping with concept drift. On the other hand, ensemble methods suffer from the drawback that they often fail to provide a simple model and understanding of the problem at hand [9].

In this paper, we focus on building nearest neighbor (NN) classifiers for data streams. This technique works well in traditional data mining applications, is supported by a strong intuitive appeal, and is rather simple to implement. However, the time spent finding the exact NN can be expensive, and therefore a significant amount of previous research has focused on this problem. A well-known method for accelerating the nearest neighbor lookup is to use k-d trees [4].

(Published in: A. Jorge et al. (Eds.): PKDD 2005, LNAI 3721, pp. 108–120. © Springer-Verlag Berlin Heidelberg 2005.)
A k-d tree is a balanced binary tree that recursively splits a d-dimensional space into smaller subregions. However, the tree can become seriously unbalanced by massive new arrivals in the data stream, and thus lose its ability to expedite the search. Another approach to NN classifiers attempts to provide approximate answers with error-bound guarantees. There are many novel algorithms [11,12,13,14] for finding approximate K-NN on stored data. However, to find the (1 + ε)-approximate nearest neighbors, these algorithms must perform multiple scans of the data. Also, the update cost of the dynamic algorithms [11,13,14] depends on the size of the data set, since the entire data set is needed for the update process. Therefore, they are not suitable for mining data streams.

Our ANNCAD Algorithm. In this paper, we introduce ANNCAD, an Adaptive NN Classification Algorithm for Data streams. It is well known that when data is non-uniform, it is difficult to predetermine K in KNN classification [6,20]. So, instead of fixing a specific number of neighbors, as in the usual KNN algorithm, we adaptively expand the nearby area of a test point until a satisfactory classification is obtained. To save computation time when finding adaptive NN, we first preassign a class to every subregion (cell). To achieve this, we decompose the feature space of a training set and obtain a multi-resolution data representation. There are many decomposition techniques for multi-resolution data representations; the averaging technique used in this paper can be thought of as a Haar wavelet transformation [16]. Information from different resolution levels can then be used to adaptively preassign a class to every cell. We then determine to which cell the test point belongs, in order to predict its class. Moreover, because of the compact support property inherited from wavelets, the time spent updating a classifier when a new tuple arrives is a small constant, independent of the size of the data set.
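To make the multi-resolution idea concrete, the following is a minimal sketch, not the authors' implementation: the 2-D unit-square setup, grid size, and all function names are illustrative assumptions. A fine grid of per-cell class counts is coarsened by summing 2×2 blocks (Haar-style averaging), a test point is classified from the finest level whose covering cell is non-empty, and inserting a new tuple touches one cell per level, so the update cost is a small constant independent of the data size.

```python
def build_levels(points, labels, n_classes, cells=8):
    """Quantize [0,1)^2 into a cells x cells grid of class counts,
    then build coarser levels by summing 2x2 blocks (Haar-style).
    `cells` is assumed to be a power of two."""
    counts = [[[0] * n_classes for _ in range(cells)] for _ in range(cells)]
    for (x, y), c in zip(points, labels):
        i = min(int(x * cells), cells - 1)
        j = min(int(y * cells), cells - 1)
        counts[i][j][c] += 1
    levels = [counts]          # levels[0] is the finest grid
    size = cells
    while size > 1:
        fine, size = levels[-1], size // 2
        coarse = [[[sum(fine[2 * i + a][2 * j + b][c]
                        for a in (0, 1) for b in (0, 1))
                    for c in range(n_classes)]
                   for j in range(size)]
                  for i in range(size)]
        levels.append(coarse)
    return levels

def classify(levels, point):
    """Use the finest level whose cell containing `point` has data,
    falling back to coarser levels (an 'expanding neighborhood')."""
    for lvl in levels:
        cells = len(lvl)
        i = min(int(point[0] * cells), cells - 1)
        j = min(int(point[1] * cells), cells - 1)
        cell = lvl[i][j]
        if sum(cell) > 0:
            return cell.index(max(cell))
    return 0  # empty training set: arbitrary default

def insert(levels, point, label):
    """Incremental update: one cell per level, independent of data size."""
    for lvl in levels:
        cells = len(lvl)
        i = min(int(point[0] * cells), cells - 1)
        j = min(int(point[1] * cells), cells - 1)
        lvl[i][j][label] += 1
```

For example, a test point whose fine-grid cell is empty is classified from the first coarser level whose enclosing cell contains training data, which mimics expanding the neighborhood until a satisfactory classification is obtained.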
Unlike VFDT, which requires a large data set to decide whether to expand the tree by one more level, ANNCAD does not have this restriction.

In this paper, we use a grid-based approach for classification. The main characteristic of this approach is its fast processing time and small memory usage, which are independent of the number of data points: they depend only on the number of cells in each dimension of the discretized space, which is easy to adjust in order to satisfy system constraints. Therefore, this approach has been widely employed in clustering problems. Some …