UW-Madison ECE 738 - Pattern Classification - An Introduction


Pattern Classification: An Introduction
(C) 2001-2005 by Yu Hen Hu

Contents:
Outline
An Example
Statistical Pattern Classification
Pattern Classification Problem
Decision Function and Prob. of Mis-Classification
MAP: Maximum A Posteriori Classifier
Maximum Likelihood Classifier
Nearest-Neighbor Classifier
What is "Clustering"?
Clustering Problem Statement
k-means Clustering Algorithm
A Numerical Example
K-means Algorithm Demonstration
Scatter Matrices
Scattering Criteria
Classifier Design Issues
Feature Representation
Symbolic Feature Encoding
Feature Transformation: Why?
Feature Transformation: How?
Feature Dimension Reduction
Irrelevant Feature Reduction
Redundant Feature Reduction
Higher Dimension Features
Data Sampling

Outline
• Statistical Pattern Recognition
  – Maximum Posterior Probability (MAP) Classifier
  – Maximum Likelihood (ML) Classifier
  – K-Nearest Neighbor Classifier
  – MLP Classifier

An Example
• Consider classifying eggs into 3 categories with labels: medium, large, or jumbo.
• The classification is based on the weight W and length L of each egg.
• Decision rules:
  1. If W < 10 g and L < 3 cm, then the egg is medium.
  2. If W > 20 g and L > 5 cm, then the egg is jumbo.
  3. Otherwise, the egg is large.
• Three components in a pattern classifier:
  – Category (target) label
  – Features
  – Decision rule
[Figure: eggs plotted in the weight (W) versus length (L) plane.]

Statistical Pattern Classification
• The objective of statistical pattern classification is to derive an optimal decision rule from a given set of training samples.
• The decision rule is optimal in the sense that it minimizes a cost function, called the expected risk, incurred in making classification decisions.
• This is a learning problem!
Assumptions:
1. Features are given.
  • The feature selection problem must be solved separately.
  • Training samples are randomly chosen from a population.
2. Target labels are given.
  • Each sample is assumed to be assigned a specific, unique label by nature.
  • The labels of the training samples are assumed to be known.

Pattern Classification Problem
Let X be the feature space and C = {c(i), 1 ≤ i ≤ M} be a set of M class labels. For each x ∈ X, it is assumed that nature assigns a class label t(x) ∈ C according to some probabilistic rule. For a feature vector x drawn randomly from X:
• P(c(i)) = P(x ∈ c(i)) is the a priori probability that t(x) = c(i), without referring to x.
• P(c(i)|x) = P(x ∈ c(i) | x) is the a posteriori probability that t(x) = c(i), given the value of x.
• P(x|c(i)) = P(x | x ∈ c(i)) is the conditional probability (a.k.a. the likelihood function) that x assumes its value, given that it is drawn from class c(i).
• P(x) is the marginal probability that x assumes its value, without referring to which class it belongs to.
Using Bayes' rule, we have
  P(x|c(i)) P(c(i)) = P(c(i)|x) P(x),
and therefore
  P(c(i)|x) = P(x|c(i)) P(c(i)) / P(x) = P(x|c(i)) P(c(i)) / [ Σ_{j=1}^{M} P(x|c(j)) P(c(j)) ].

Decision Function and Prob. of Mis-Classification
• Given a sample x ∈ X, the objective of statistical pattern classification is to design a decision rule g(x) ∈ C that assigns a label to x.
• If g(x) = t(x), the naturally assigned class label, it is a correct classification. Otherwise, it is a mis-classification.
• Define a 0-1 loss function:
  L(x | g(x)) = 0 if g(x) = t(x), and L(x | g(x)) = 1 if g(x) ≠ t(x).
• Given that g(x) = c(i*), then
  P(L(x | g(x) = c(i*)) = 0 | x) = P(t(x) = c(i*) | x) = P(c(i*)|x), and
  P(L(x | g(x) = c(i*)) = 1 | x) = 1 − P(c(i*)|x).
  Hence the probability of mis-classification for a specific decision g(x) = c(i*) is 1 − P(c(i*)|x).
• Clearly, to minimize the probability of mis-classification for a given x, the best choice is g(x) = c(i*) such that P(c(i*)|x) > P(c(i)|x) for all i ≠ i*.

MAP: Maximum A Posteriori Classifier
The MAP classifier stipulates that the classifier minimizing the probability of mis-classification should choose g(x) = c(i*) if
  P(c(i*)|x) > P(c(i)|x), i ≠ i*.
This is an optimal decision rule. Unfortunately, in real-world applications it is often difficult to estimate P(c(i)|x) directly.
Fortunately, to realize the optimal MAP decision rule, one can instead estimate a discriminant function Gi(x) such that for any x ∈ X and i ≠ i*,
  Gi*(x) > Gi(x) iff P(c(i*)|x) > P(c(i)|x).
Gi(x) can be an approximation of P(c(i)|x) or any function satisfying the above relationship.

Maximum Likelihood Classifier
By Bayes' rule, p(c(i)|x) = p(x|c(i)) p(c(i)) / p(x). Hence the MAP decision rule can be expressed as: choose g(x) = c(i*) if
  p(c(i*)) p(x|c(i*)) > p(c(i)) p(x|c(i)), i ≠ i*.
When the a priori probabilities are unknown, we may assume p(c(i)) = 1/M. In that case, maximizing p(x|c(i)) is equivalent to maximizing p(c(i)|x); this is the maximum likelihood (ML) classifier.
• The likelihood function p(x|c(i)) may be modeled with a uni-variate Gaussian, p(x|c(i)) ~ N(μi, σi), where μi and σi are estimated from the samples {x | t(x) = c(i)}.
• The a priori probability can be estimated as
  P(c(i)) ≈ #{x s.t. t(x) = c(i)} / |X|,
  i.e., the fraction of training samples that are labeled c(i).
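To make the Gaussian ML/MAP recipe above concrete, here is a minimal sketch (my own illustration, not from the slides) of a uni-variate Gaussian classifier: it estimates P(c(i)), μi, and σi from labeled 1-D training samples, then assigns a new x to the class maximizing P(c(i)) p(x|c(i)); passing equal_priors=True drops the estimated priors and gives the ML rule. All function and variable names are hypothetical.

```python
# Sketch of a uni-variate Gaussian MAP/ML classifier (illustration only).
import math
from collections import defaultdict

def fit_gaussian_classes(samples, labels):
    """Estimate prior P(c(i)), mean, and std dev of each class from 1-D samples."""
    by_class = defaultdict(list)
    for x, c in zip(samples, labels):
        by_class[c].append(x)
    model = {}
    for c, xs in by_class.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        prior = len(xs) / len(samples)            # P(c(i)) = #{t(x)=c(i)} / |X|
        model[c] = (prior, mu, math.sqrt(var) + 1e-12)
    return model

def gaussian_pdf(x, mu, sigma):
    """Likelihood p(x | c(i)) under N(mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x, model, equal_priors=False):
    """MAP rule: choose the class maximizing P(c(i)) * p(x | c(i)).
    With equal_priors=True, the priors are dropped and this is the ML rule."""
    best_c, best_score = None, -1.0
    for c, (prior, mu, sigma) in model.items():
        score = gaussian_pdf(x, mu, sigma) * (1.0 if equal_priors else prior)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Toy example in the spirit of the egg classification: weights in grams.
weights = [8, 9, 11, 14, 16, 22, 25]
labels  = ["medium", "medium", "large", "large", "large", "jumbo", "jumbo"]
model = fit_gaussian_classes(weights, labels)
print(classify(12.0, model))    # -> "large" for this toy data
```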
Nearest-Neighbor Classifier
• Let {y(1), ..., y(n)} ⊂ X be n samples that have already been classified. Given a new sample x, the NN decision rule chooses g(x) = c(i) if the nearest stored sample,
  y(i*) where i* = arg min_{1 ≤ j ≤ n} ||y(j) − x||,
  is labeled with c(i).
• As n → ∞, the probability of mis-classification of the NN classifier is at most twice that of the optimal (MAP) classifier.
• The k-nearest-neighbor classifier examines the k nearest classified samples and assigns x to the majority label among them.
• Implementation problem: it requires large storage to keep ALL of the training samples.

What is "Clustering"?
What can we learn from these "unlabeled" data samples?
– Structure: some samples are closer to each other than to other samples.
– The closeness between samples is determined using a "similarity measure".
– The number of samples per unit volume is related to the concepts of "density" and "distribution".
[Figure: two plots of unlabeled data samples.]

Clustering Problem Statement
• Given a set of vectors {x_k; 1 ≤ k ≤ K}, find a set of c cluster centers {w(i); 1 ≤ i ≤ c} such that each x_k is assigned to a cluster, say w(i*), according to a distance (distortion, similarity) measure d(x_k, w(i)), so that the average distortion
  D = (1/K) Σ_{i=1}^{c} Σ_{k=1}^{K} I(x_k, i) d(x_k, w(i))
  is minimized.
• I(x_k, i) is an indicator function: I(x_k, i) = 1 if x_k is assigned to cluster i with cluster center w(i), and I(x_k, i) = 0 otherwise.

k-means Clustering Algorithm
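The slide titled "k-means Clustering Algorithm" is not included in this preview. As a hedged illustration of the standard Lloyd-style k-means iteration that the clustering problem statement above leads to, the sketch below alternates between assigning each x_k to its nearest center (setting the indicator I(x_k, i)) and recomputing each w(i) as the mean of its assigned samples, with squared Euclidean distance as the distortion measure d. The code and its names are my own and are not taken from the slides.

```python
# Sketch of k-means clustering (illustration only, not the author's code).
import random

def squared_distance(x, w):
    """Distortion measure d(x, w): squared Euclidean distance."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def kmeans(samples, c, n_iter=100, seed=0):
    """Cluster `samples` (a list of equal-length tuples) into c clusters.
    Returns the cluster centers w(i) and the cluster index of each sample."""
    rng = random.Random(seed)
    centers = list(rng.sample(samples, c))        # initialize w(i) from the data
    assignments = None
    for _ in range(n_iter):
        # Assignment step: I(x_k, i) = 1 for the nearest center, 0 otherwise.
        new_assignments = [min(range(c), key=lambda i: squared_distance(x, centers[i]))
                           for x in samples]
        if new_assignments == assignments:
            break                                 # converged: assignments unchanged
        assignments = new_assignments
        # Update step: each center becomes the mean of its assigned samples,
        # which minimizes the average distortion D for the current assignment.
        for i in range(c):
            members = [x for x, a in zip(samples, assignments) if a == i]
            if members:
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments

# Toy example: two well-separated groups of 2-D points.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(data, c=2)
print(centers)    # two centers, near (0.1, 0.1) and (5.0, 5.0)
print(labels)     # cluster index for each sample
```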

