# BGSU STAT 402 - K-Nearest-Neighbor - 1 slide per page (15 pages)


Pages: 15
School: Bowling Green State University - Main Campus
Course: Stat 402 - Regression Analysis

## Data Mining: K-Nearest Neighbor

Dr. İbrahim Çapar, Assistant Professor

## Characteristics

- Data-driven, not model-driven
- Makes no assumptions about the data

## Basic Idea

- For a given record to be classified, identify nearby records
- "Near" means records with similar predictor values X1, X2, ..., Xp
- Classify the record as whatever the predominant class is among the nearby records (the "neighbors")

## How to Measure "Nearby"

- The most popular distance measure is Euclidean distance
- The data must be standardized

## Choosing k

- k is the number of nearby neighbors used to classify the new record
- k = 1 means use the single nearest record; k = 5 means use the 5 nearest records
- Typically, choose the value of k that has the lowest error rate on the validation data

## Low k vs. High k

- Low values of k (1, 3, ...) capture local structure in the data, but also noise
- High values of k provide more smoothing and less noise, but may miss local structure
- The extreme case is k = n

## Using k-NN for Classification

- The majority vote determines the class
- In case of a tie, either pick one class at random or use the closest data point to make the determination

## Using k-NN for Prediction

- Instead of a majority vote determining a class, use the average of the neighbors' response values
- This may be a weighted average, with weight decreasing with distance

## Advantages

- Simple
- No assumptions required (Normal distribution, etc.)
- Effective at capturing complex interactions among variables without having to define a statistical model

## Shortcomings

- The required size of the training set increases exponentially with the number of predictors p; this is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up far away from each other)
- In a large training set, it takes a long time to find the distances to all the neighbors and then identify the nearest one(s)
- These constitute the "curse of dimensionality"

## Dealing with the Curse

- Reduce the dimension of the predictors (e.g., with PCA)
- Use computational shortcuts that settle for "almost nearest" neighbors

## Example: Riding Mowers

- Data: 24 households classified as owning or not owning riding mowers
- Predictors: Income (in $1000s), Lot Size (in 1000s of sq. feet)
- Decision boundaries when k = 1; decision boundaries when k = 15
- Which k is the best?
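The classification procedure the slides describe (standardize the predictors, compute Euclidean distances, take a majority vote among the k nearest records) can be sketched in a few lines of Python. This is a minimal illustration, not code from the course; the `knn_classify` function and the toy four-record data are invented for demonstration and are not the actual 24-household riding-mowers dataset.

```python
import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training records,
    using Euclidean distance on standardized predictors."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    Z = (X_train - mu) / sigma            # standardized training records
    z = (x_new - mu) / sigma              # standardize the new record the same way
    dist = np.sqrt(((Z - z) ** 2).sum(axis=1))   # Euclidean distances to x_new
    nearest = np.argsort(dist)[:k]               # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]      # majority class (ties: np.unique's sorted order)

# Toy data: Income ($1000s) and Lot Size (1000s sq. ft.), two classes
X = np.array([[60.0, 18.0], [85.0, 20.0], [65.0, 17.0], [90.0, 22.0]])
y = np.array(["nonowner", "owner", "nonowner", "owner"])
print(knn_classify(X, y, np.array([88.0, 21.0]), k=3))   # → owner
```

Note that the new record is standardized with the *training* mean and standard deviation; computing the neighbors on raw, unstandardized predictors would let the variable with the largest scale (here, Income) dominate the distance.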

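The slides recommend choosing k by its error rate on validation data. A hedged sketch of that selection loop is below; the candidate grid of odd k values and the synthetic two-cluster data are illustrative assumptions, not the course's data or chosen grid.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict each query row by majority vote among its k nearest
    training neighbors (standardized predictors, Euclidean distance)."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    Z, Q = (X_train - mu) / sigma, (X_query - mu) / sigma
    preds = []
    for q in Q:
        nearest = np.argsort(np.sqrt(((Z - q) ** 2).sum(axis=1)))[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Synthetic two-class data: cluster around (0, 0) vs. cluster around (3, 3)
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y_train = np.array([0] * 30 + [1] * 30)
X_val = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
y_val = np.array([0] * 10 + [1] * 10)

# Error rate on the held-out validation set for each candidate k
errors = {k: np.mean(knn_predict(X_train, y_train, X_val, k) != y_val)
          for k in [1, 3, 5, 7, 9]}
best_k = min(errors, key=errors.get)   # the k with the lowest validation error
print(best_k, errors[best_k])
```

Using odd k values sidesteps most ties in a two-class problem, and evaluating on a held-out set (rather than the training set, where k = 1 is always perfect) is exactly why the slides say to pick k from the validation data.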