Artificial Intelligence: Representation and Problem Solving
15-381, April 26, 2007

Clustering
(including k-nearest neighbor classification, k-means clustering, cross-validation, and EM, with a brief foray into dimensionality reduction with PCA)

Michael S. Lewicki, Carnegie Mellon


A different approach to classification

• Nearby points are likely to be members of the same class.
• What if we used the points themselves to classify? Classify x in C_k if x is "similar" to a point we already know is in C_k.
• E.g., the unclassified point x is more similar to Class 2 than to Class 1.
• Issue: how do we define "similar"? The simplest choice is Euclidean distance:

  d(x, y) = √( Σ_i (x_i − y_i)² )

• We could define other metrics depending on the application, e.g. for text documents, images, etc.

[Figure: nearest neighbor classification on the iris dataset — scatter of x1 vs. x2 showing Classes 1–3 and an unclassified point x]

Potential advantages:
• don't need an explicit model
• the more examples the better
• might handle more complex classes
• easy to implement
• "no brain on the part of the designer"


A complex, non-parametric decision boundary

• How do we control the complexity of this model? Difficult.
• How many parameters? Every data point is a parameter!
• This is an example of a non-parametric model, i.e. one where the model is defined by the data itself. (Also called instance-based.)
• Such classifiers can produce very complex decision boundaries.

[Figure: example decision boundary, from Martial Hebert]
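The nearest-neighbor rule above can be sketched in a few lines. This is a minimal illustration, not the lecture's code; the training points below are hypothetical (petal length, petal width) pairs standing in for the iris data.

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def nearest_neighbor(x, examples):
    # examples is a list of (point, label) pairs;
    # x gets the label of the closest stored point
    return min(examples, key=lambda e: euclidean(x, e[0]))[1]

# hypothetical labeled measurements, one per class
train = [((1.4, 0.2), "Class 1"), ((4.5, 1.5), "Class 2"), ((6.0, 2.2), "Class 3")]
print(nearest_neighbor((4.7, 1.4), train))  # -> Class 2
```

Note that "training" is just storing the examples; all the work happens at query time, which is why the model's parameters are the data points themselves.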
Example: handwritten digits

• Use Euclidean distance to find which known digit each test image is closest to.
• But not all neighbors are the same.
• "k-nearest neighbors": look at the k nearest neighbors and choose the most frequent class among them.
• Caution: it can get expensive to find the neighbors.

[Figure: digit images from LeCun et al., 1998; digit data available at http://yann.lecun.com/exdb/mnist/]

Error bounds for NN

• Amazing fact: asymptotically, err(1-NN) < 2 err(Bayes):

  e_B ≤ e_1NN ≤ 2 e_B − (M / (M − 1)) e_B²

  where e_B is the Bayes error and M is the number of classes. This is a tight upper bound, achieved in the "zero-information" case when the classes have identical densities.
• For K-NN there are also bounds, e.g. for two classes and odd K:

  e_B ≤ e_KNN ≤ Σ_{i=0}^{(K−1)/2} (K choose i) [ e_B^{i+1} (1 − e_B)^{K−i} + e_B^{K−i} (1 − e_B)^{i+1} ]

• For more on these bounds, see the book A Probabilistic Theory of Pattern Recognition, by L. Devroye, L. Györfi & G. Lugosi (1996).

Example: USPS digits

• Take 16×16 grayscale images (8-bit) of handwritten digits.
• Use Euclidean distance in raw pixel space (dumb!) and 7-NN.
• Classification error (leave-one-out): 4.85%.

[Figure: example 7 nearest neighbours]

Nonparametric (instance-based) models

• Q: What are the parameters in K-NN? What is the complexity?
  A: the scalar K and the entire training set.
• Models which need the entire training set at test time but (hopefully) have very few other parameters are known as nonparametric, instance-based, or case-based.
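The k-nearest-neighbor rule described above ("look at the k nearest neighbors and choose the most frequent") can be sketched as follows; the two-class training set is made up for illustration.

```python
import math
from collections import Counter

def knn_classify(x, examples, k=3):
    """Classify x by majority vote among its k nearest labeled examples."""
    dist = lambda p: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))
    # sort the labeled examples by Euclidean distance to x, keep the k closest
    neighbors = sorted(examples, key=lambda e: dist(e[0]))[:k]
    # return the most frequent label among those k neighbors
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# hypothetical two-class data
train = [((0.0, 0.0), "a"), ((0.1, 0.1), "a"), ((1.0, 1.0), "b"),
         ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]
print(knn_classify((1.0, 0.95), train, k=3))  # -> b
```

The sort makes this O(n log n) per query, which illustrates the "expensive to find neighbors" caution; practical systems use spatial data structures (e.g. k-d trees) instead of a full sort.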
• What if we want a classifier that uses only a small number of parameters at test time (e.g. for speed or memory reasons)?
  Idea 1: a single linear boundary, of arbitrary orientation.
  Idea 2: many boundaries, but axis-parallel and tree-structured.

[Figure: a linear boundary in the (x1, x2) plane, and an axis-parallel decision tree with thresholds t1–t5 and leaves A–F]

Linear classification for binary output

• Goal: find the line (or hyperplane) which best separates two classes:

  c(x) = sign[ xᵀw − w_0 ]

  where w is the weight vector and w_0 the threshold.
• w is a vector perpendicular to the decision boundary.
• This is the opposite of non-parametric: only d + 1 parameters!
• Typically we augment x with a constant term ±1 (a "bias unit") and then absorb w_0 into w, so we don't have to treat it specially.

[Figure: example nearest neighbors, from Sam Roweis — digits are just represented as vectors]

The problem with using templates (i.e. Euclidean distance)

• Which of these is more like the example? A or B?
• Euclidean distance only cares about how many pixels overlap.
• We could try to define a distance metric that is insensitive to small deviations in position, scale, rotation, etc.

[Figure: example with candidates A and B, from Simard et al., 1998]

• Digit example:
  - 60,000 training images, 10,000 test images
  - no "preprocessing"

Performance results of various classifiers (from http://yann.lecun.com/exdb/mnist/):

  Classifier                                       Error rate on test data (%)
  linear                                           12.0
  k=3 nearest neighbor (Euclidean distance)         5.0
  2-layer neural network (300 hidden units)         4.7
  nearest neighbor (Euclidean distance)             3.1
  k-nearest neighbor (improved distance metric)     1.1
  convolutional neural net                          0.95
  best (the conv. net with elastic distortions)     0.4
  humans                                            0.2 – 2.5

Clustering

Clustering: classification without labels

• In many situations we don't have labeled training data, only unlabeled data.
• E.g., in the iris data set, what if we were just starting out and didn't know any classes?

[Figure: iris data, petal length (cm) vs. petal width (cm), unlabeled]
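The lecture title lists k-means clustering as one answer to the "no labels" problem. A minimal sketch of Lloyd's algorithm is below; for simplicity it initializes the centroids with the first k points (real implementations use random restarts or k-means++), and the data are made-up 2-D points.

```python
def kmeans(points, k, iters=20):
    # Lloyd's algorithm: alternate between assigning each point to its
    # nearest centroid and moving each centroid to its cluster's mean.
    centroids = [tuple(p) for p in points[:k]]  # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the closest centroid (squared Euclidean distance)
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # recompute centroids; keep the old one if a cluster went empty
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                     else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters

# two well-separated hypothetical blobs
pts = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(pts, k=2)
```

Each iteration can only decrease the total within-cluster squared distance, so the procedure converges, though possibly to a local optimum; that is why the initialization matters.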
Types of learning

• Supervised: the model's parameters {θ_1, ..., θ_n} are fit to the world (or data) together with desired outputs {y_1, ..., y_n}.
• Unsupervised: the model's parameters are fit to the world (or data) alone.
• Reinforcement: the model's output is shaped by a reinforcement signal (next week).

A real example: clustering electrical signals from neurons

• An application of PCA: spike sorting.
• Recording chain: electrode → filters/amplifier → A/D → oscilloscope and software analysis.
(from Principal Component Analysis, Apr 23, 2001 / Michael S. Lewicki, CMU)

An extracellular waveform with many different neural "spikes"

[Figure: extracellular waveform, 0–25 msec]

• How do we sort the different spikes?
• Basic problem: the only information is the signal itself. The true classes are always unknown.

Sorting with level detection

[Figure: spike waveforms aligned at a level crossing, −0.5 to 1.5 msec]

• Level detection doesn't
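The level-detection step above can be sketched as a simple upward threshold crossing on a sampled trace. This is an illustrative sketch, not the lecture's implementation; the trace and the threshold value are hypothetical.

```python
def detect_level_crossings(signal, level):
    # Return the sample indices where the trace first crosses the level
    # upward: previous sample below the level, current sample at or above it.
    # Requiring the upward edge avoids counting one spike multiple times.
    return [i for i in range(1, len(signal))
            if signal[i - 1] < level <= signal[i]]

# hypothetical voltage trace containing two spikes
trace = [0.1, 0.2, 1.5, 0.3, 0.1, 2.0, 1.8, 0.2]
print(detect_level_crossings(trace, level=1.0))  # -> [2, 5]
```

Each detected index would then anchor a short waveform window (e.g. −0.5 to 1.5 msec around the crossing) for later clustering; level detection alone cannot distinguish different spike shapes of similar amplitude.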