BGSU STAT 402 - K-Nearest-Neighbor - 1 slide per page (15 pages)

Pages: 15
School: Bowling Green State University - Main Campus
Course: STAT 402 - Regression Analysis


DATA MINING: K-Nearest Neighbor
Dr. İbrahim Çapar, Assistant Professor

Characteristics
- Data-driven, not model-driven.
- Makes no assumptions about the data.

Basic Idea
- For a given record to be classified, identify nearby records.
- "Near" means records with similar predictor values X1, X2, ..., Xp.
- Classify the record as whatever the predominant class is among the nearby records (the "neighbors").

How to Measure "Nearby"?
- The most popular distance measure is Euclidean distance: d(x, y) = sqrt((x1 - y1)^2 + ... + (xp - yp)^2).
- The data must be standardized, so that no single predictor dominates the distance.

Choosing k
- k is the number of nearby neighbors used to classify the new record.
- k = 1 means use the single nearest record; k = 5 means use the 5 nearest records.
- Typically, choose the value of k with the lowest error rate on the validation data (see the sketch after this list).

Low k vs. High k
- Low values of k (1, 3, ...) capture local structure in the data, but also noise.
- High values of k provide more smoothing and less noise, but may miss local structure.
- The extreme case is k = n, which ignores local structure entirely and assigns every record to the overall majority class.
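The following is a minimal NumPy sketch of these steps: standardize the predictors, compute Euclidean distances, classify by majority vote, and pick k by validation error. The function and variable names (knn_classify, best_k, X_raw, and so on) are illustrative, not from the slides.

```python
import numpy as np

def knn_classify(X_train, y_train, x_new, k):
    """Classify one record by majority vote among its k nearest neighbors."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # k closest training records
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    winners = labels[counts == counts.max()]
    if len(winners) > 1:               # tie: fall back to the single closest record
        return y_train[nearest[0]]
    return winners[0]

def best_k(X_train, y_train, X_valid, y_valid, k_values):
    """Return the k with the lowest misclassification rate on validation data."""
    errors = {k: np.mean([knn_classify(X_train, y_train, x, k) != yv
                          for x, yv in zip(X_valid, y_valid)])
              for k in k_values}
    return min(errors, key=errors.get)

# Demo on synthetic two-class data; real data would be partitioned
# and standardized the same way.
rng = np.random.default_rng(1)
X_raw = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)
idx = rng.permutation(100)
train, valid = idx[:60], idx[60:]

# Standardize with the TRAINING partition's means and standard deviations,
# applied to both partitions, so every predictor is on the same scale.
mu, sigma = X_raw[train].mean(axis=0), X_raw[train].std(axis=0)
X_train, X_valid = (X_raw[train] - mu) / sigma, (X_raw[valid] - mu) / sigma

print("best k:", best_k(X_train, y[train], X_valid, y[valid], range(1, 16)))
```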
Using K-NN for Classification
- The majority vote among the k neighbors determines the class.
- In case of a tie, pick a class at random, or use the class of the closest data point to make the determination.

Using K-NN for Prediction
- Instead of a majority vote determining a class, use the average of the neighbors' response values.
- This may be a weighted average, with weight decreasing with distance (sketched below).
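A minimal sketch of the prediction variant, reusing the same distance computation as above. The inverse-distance weights 1/d are one common choice of decreasing weight, an assumption here rather than something the slides prescribe:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k, weighted=True, eps=1e-8):
    """Predict a numeric response as the average (optionally distance-weighted)
    of the k nearest neighbors' response values."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    if not weighted:
        return y_train[nearest].mean()              # plain average of the neighbors
    w = 1.0 / (dists[nearest] + eps)                # weight decreases with distance;
    return np.sum(w * y_train[nearest]) / w.sum()   # eps guards against zero distance
```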
Advantages
- Simple.
- No assumptions required (about normal distributions, etc.).
- Effective at capturing complex interactions among variables without having to define a statistical model.

Shortcomings
- The required size of the training set increases exponentially with the number of predictors p. This is because the expected distance to the nearest neighbor increases with p; with a large vector of predictors, all records end up far away from each other.
- In a large training set, it takes a long time to find the distances to all the neighbors and then identify the nearest one(s).
- These problems constitute the "curse of dimensionality."

Dealing with the Curse
- Reduce the dimension of the predictors, e.g., with principal components analysis (PCA), as sketched below.
- Use computational shortcuts that settle for "almost nearest" neighbors.
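One way to implement the PCA remedy, shown here as a scikit-learn pipeline; the tooling is an assumption, since the slides name only the idea. The values n_components=2 and n_neighbors=5 are placeholders to be tuned on validation data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Synthetic wide data standing in for a real predictor matrix.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Standardize, project onto a few principal components, then run K-NN
# in the reduced space, so distances are computed over fewer dimensions.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
print("validation accuracy:", pipe.score(X_valid, y_valid))
```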
Example: Riding Mowers
- Data: 24 households classified as owning or not owning riding mowers.
- Predictors: Income (in $1000s) and Lot Size (in 1000s of square feet).
- The slides show decision boundaries for k = 1 and for k = 15, and ask: which k is the best?
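The riding-mower figures are not reproduced here, but a sketch like the following can recreate the contrast between a jagged k = 1 boundary and a smoother k = 15 boundary. The data below are random stand-ins, not the actual 24-household data set:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 2))                 # stand-ins for (income, lot size), standardized
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # 1 = owner, 0 = non-owner (synthetic rule)

# Evaluate each classifier on a dense grid to trace its decision boundary.
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, (1, 15)):
    z = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(grid)
    ax.contourf(xx, yy, z.reshape(xx.shape), alpha=0.3)   # shaded class regions
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
    ax.set_title(f"Decision boundary, k = {k}")
plt.show()
```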