K-Nearest Neighbors (k-NN)
Dr. İbrahim Çapar, Assistant Professor
DATA MINING

Characteristics
- Data-driven, not model-driven
- Makes no assumptions about the data

Basic Idea
- For a given record to be classified, identify nearby records
- "Near" means records with similar predictor values X1, X2, ..., Xp
- Classify the record as the predominant class among those nearby records (the "neighbors")

How to Measure "Nearby"?
- The most popular distance measure is Euclidean distance. For two records (x1, ..., xp) and (u1, ..., up):

    d = sqrt((x1 - u1)^2 + (x2 - u2)^2 + ... + (xp - up)^2)

- The data must be standardized first, so that predictors measured on larger scales do not dominate the distance

Choosing k
- k is the number of nearby neighbors used to classify the new record
- k = 1 means use the single nearest record
- k = 5 means use the 5 nearest records
- Typically, choose the value of k that has the lowest error rate on the validation data

Low k vs. High k
- Low values of k (1, 3, ...) capture local structure in the data (but also noise)
- High values of k provide more smoothing and less noise, but may miss local structure
- The extreme case k = n assigns every new record the overall majority class, ignoring local structure entirely

Using k-NN for Classification
- The majority vote among the k neighbors determines the class
- In case of a tie, either choose a class at random or use the single closest record to break the tie

Using k-NN for Prediction
- Instead of "majority vote determines class," use the average of the neighbors' response values
- This may be a weighted average, with weights decreasing with distance

Advantages
- Simple
- No assumptions required about normal distributions, etc.
- Effective at capturing complex interactions among variables without having to define a statistical model

Shortcomings
- The required size of the training set increases exponentially with the number of predictors, p
- This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up "far away" from each other)
- In a large training set, it takes a long time to compute distances to all the records and then identify the nearest one(s)
- Together, these constitute the "curse of dimensionality"

Dealing with the Curse
- Reduce the dimension of the predictors (e.g., with PCA)
- Use computational shortcuts that settle for "almost nearest" neighbors

Example: Riding Mowers
- Data: 24 households classified as owning or not owning riding mowers
- Predictors: Income (in $1000s), Lot Size (in 1000s of sq. ft.)
- Decision boundaries when k = 1
- Decision boundaries when k = 15
- Which k is best?
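The classification procedure described in the slides (standardize, compute Euclidean distances, take a majority vote, break ties with the single closest record) can be sketched in a few lines of Python. The six-household income/lot-size sample below is made up for illustration; it is not the deck's 24-household riding-mower data.

```python
import math
from collections import Counter

def standardize(rows):
    """Scale each column to zero mean and unit variance (population sd)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def euclidean(a, b):
    return math.sqrt(sum((x - u) ** 2 for x, u in zip(a, b)))

def knn_classify(X_train, y_train, x_new, k):
    """Majority vote among the k nearest training records.
    Ties are broken by the class of the single closest record."""
    ranked = sorted(zip((euclidean(x, x_new) for x in X_train), y_train))
    nearest = ranked[:k]
    votes = Counter(label for _, label in nearest).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return nearest[0][1]  # tie: fall back to the single closest record
    return votes[0][0]

# Made-up income ($1000s) / lot-size (1000s sq ft) sample -- NOT the deck's data.
X = [[60.0, 18.4], [85.5, 16.8], [64.8, 21.6],   # owners
     [47.4, 16.4], [33.0, 18.8], [51.0, 14.0]]   # nonowners
y = ["owner", "owner", "owner", "nonowner", "nonowner", "nonowner"]

# Standardize train + new record together (a sketch; in practice, standardize
# the new record using the training set's means and sds only).
scaled = standardize(X + [[60.0, 20.0]])
pred = knn_classify(scaled[:-1], y, scaled[-1], k=3)
```

With k = 3, the new household's two closest neighbors are owners and one is a nonowner, so the vote classifies it as an owner.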
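The prediction variant ("average of response values," possibly weighted) can be sketched the same way. Inverse-distance weighting below is one common choice of "weight decreasing with distance," not necessarily the one the slides intend; the one-predictor data is a toy example.

```python
import math

def knn_predict(X_train, y_train, x_new, k, weighted=True):
    """k-NN for a numeric response: plain or inverse-distance-weighted
    average of the k nearest neighbors' response values."""
    ranked = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_new))), y)
        for x, y in zip(X_train, y_train)
    )[:k]
    if not weighted:
        return sum(y for _, y in ranked) / k
    # Weight each neighbor by 1/distance so closer records count more;
    # the small eps guards against division by zero on exact matches.
    eps = 1e-9
    weights = [1.0 / (d + eps) for d, _ in ranked]
    return sum(w * y for w, (_, y) in zip(weights, ranked)) / sum(weights)

# Toy one-predictor data: response grows linearly with the predictor.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 10.0, 20.0, 30.0]

plain = knn_predict(X, y, [0.9], k=2, weighted=False)  # (10 + 0) / 2 = 5.0
wavg = knn_predict(X, y, [0.9], k=2)                   # pulled toward y = 10
```

The weighted estimate sits closer to the nearest neighbor's response than the plain average does, which is the intended effect of down-weighting distant records.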