Chapter 6. Classification and Prediction
(Data Mining: Concepts and Techniques, 01/15/19)

Outline
- What is classification?
- What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Classification by back propagation
- Support Vector Machines (SVM)
- Associative classification
- Lazy learners (or learning from your neighbors)
- Other classification methods
- Prediction
- Accuracy and error measures
- Ensemble methods
- Model selection
- Summary

Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: given a set of training records and a test record, compute the distance from the test record to each training record and choose the k "nearest" records.]

Nearest neighbor method
Take a majority vote within the k nearest neighbors.
[Figure: for a new point, k = 1 predicts brown; k = 3 predicts green.]

Nearest-Neighbor Classifiers
Requires three things:
- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
- Compute its distance to the training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
[Figure: the 1-, 2-, and 3-nearest neighbors of a test point x.]

1 nearest-neighbor
[Figure: Voronoi diagram induced by the training points; the 1-NN decision boundary follows the cell edges.]

Nearest Neighbor Classification
Compute the distance between two points, e.g., Euclidean distance:
    d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
Determine the class from the nearest-neighbor list:
- Take the majority vote of the class labels among the k nearest neighbors
- Optionally weigh each vote according to distance, e.g., weight factor w = 1/d^2

Nearest Neighbor Classification (cont.)
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise
points
- If k is too large, the neighborhood may include points from other classes

Nearest Neighbor Classification (cont.)
Scaling issues
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- Example: the height of a person may vary from 1.5 m to 1.8 m, the weight from 90 lb to 300 lb, and the income from $10K to $1M

Nearest Neighbor Classification (cont.)
Problems with the Euclidean measure:
- High-dimensional data: the curse of dimensionality
- It can produce counter-intuitive results, e.g., for binary vectors:
      1 1 1 1 1 1 1 1 1 1 1 0        1 0 0 0 0 0 0 0 0 0 0 0
      0 1 1 1 1 1 1 1 1 1 1 1   vs   0 0 0 0 0 0 0 0 0 0 0 1
         d = 1.4142                      d = 1.4142
- Solution: normalize the vectors to unit length

Nearest Neighbor Classification (cont.)
k-NN classifiers are lazy learners
- They do not build models explicitly
- Unlike eager learners such as decision tree induction and rule-based systems
- Classifying unknown records is relatively expensive

Discussion on the k-NN Algorithm
The k-NN algorithm for continuous-valued target functions:
- Calculate the mean of the values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm:
- Weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors:
      w = 1 / d(x_q, x_i)^2
- Similarly for real-valued target functions
Robust to noisy data by averaging over the k nearest neighbors.
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.

Lazy vs. Eager Learning
Lazy vs.
eager learning:
- Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
- Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
- Lazy learners spend less time in training but more time in predicting
Accuracy
- A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
- An eager method must commit to a single hypothesis that covers the entire instance space

Lazy Learner: Instance-Based Methods
Instance-based learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
Typical approaches:
- k-nearest neighbor approach: instances are represented as points in a Euclidean space
- Locally weighted regression: constructs a local approximation
- Case-based reasoning: uses symbolic representations and knowledge-based inference

Case-Based Reasoning (CBR)
CBR: uses a database of
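The k-NN classification procedure described above (compute distances, keep the k nearest records, take a majority vote, optionally weighting votes by w = 1/d^2) can be sketched in Python. This is a minimal illustrative sketch, not code from the slides; the names `euclidean` and `knn_classify` are my own.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    # d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, x, k=3, weighted=False):
    """train: list of (point, label) pairs; x: query point.
    Majority vote among the k nearest neighbors; if weighted,
    each vote counts w = 1/d^2, so closer neighbors count more."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        if weighted:
            d = max(euclidean(point, x), 1e-12)  # guard against d == 0
            votes[label] += 1.0 / (d * d)
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)

# Hypothetical data mirroring the k = 1 brown / k = 3 green slide example:
train = [((2.7, 3.0), 'brown'), ((1.0, 1.0), 'brown'),
         ((3.0, 3.0), 'green'), ((2.9, 3.2), 'green'), ((3.4, 3.1), 'green')]
print(knn_classify(train, (2.8, 3.0), k=1))  # brown: the single nearest point
print(knn_classify(train, (2.8, 3.0), k=3))  # green: two of the three nearest
```

Note that there is no training phase at all: the "model" is just the stored records, which is exactly what makes k-NN a lazy learner.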
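The continuous-valued variant from the "Discussion on the k-NN Algorithm" slide (predict the mean of the k nearest values, or a distance-weighted mean with w_i = 1/d(x_q, x_i)^2) can be sketched the same way; `knn_predict` and the sample data are hypothetical names of mine, not from the slides.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, x, k=3, weighted=False):
    """train: list of (point, real_value) pairs; x: query point.
    Plain k-NN prediction is the mean of the k nearest values; the
    distance-weighted variant uses w_i = 1/d(x, x_i)^2, pulling the
    prediction toward the values of closer neighbors."""
    nearest = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    if not weighted:
        return sum(value for _, value in nearest) / k
    num = den = 0.0
    for point, value in nearest:
        w = 1.0 / max(euclidean(point, x), 1e-12) ** 2
        num += w * value
        den += w
    return num / den

train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0)]
print(knn_predict(train, (1.1,), k=2))                 # 1.5: plain mean of 1.0 and 2.0
print(knn_predict(train, (1.1,), k=2, weighted=True))  # close to 1.0: nearest point dominates
```

The weighted prediction illustrates the slide's point: with w = 1/d^2, the neighbor at distance 0.1 outweighs the one at distance 0.9 by a factor of 81.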
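The counter-intuitive binary-vector example from the curse-of-dimensionality slide, and the suggested fix of normalizing to unit length, can be checked directly; this small sketch reproduces the slide's d = 1.4142 figures.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def unit(v):
    # Normalize a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# The two pairs of 12-bit vectors from the slide.
a = [1] * 11 + [0]   # a and b are nearly identical: they share eleven 1s
b = [0] + [1] * 11
c = [1] + [0] * 11   # c and d are nearly opposite: they share no 1s
d = [0] * 11 + [1]

# Raw Euclidean distance cannot tell the pairs apart:
print(round(euclidean(a, b), 4), round(euclidean(c, d), 4))   # 1.4142 1.4142

# After unit-length normalization, the similar pair becomes much closer:
print(round(euclidean(unit(a), unit(b)), 4))  # 0.4264
print(round(euclidean(unit(c), unit(d)), 4))  # 1.4142
```

Each raw pair differs in exactly two coordinates, so both distances are sqrt(2) = 1.4142 even though one pair is nearly identical and the other nearly disjoint; normalization restores the intuitive ordering.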