Machine Learning 10-701/15-781, Spring 2008
Theory of Classification and Nonparametric Classifiers
Eric Xing
Lecture 2, January 16, 2006
Reading: Chap. 2, 5 CB and handouts

Outline
- What is theoretically the best classifier
- Bayesian decision rule for minimum error
- Nonparametric classifiers (instance-based learning)
  - Nonparametric density estimation
  - K-nearest-neighbor classifier
  - Optimality of kNN

Classification
- Representing data
- Hypothesis (classifier)
- Decision-making as dividing a high-dimensional space
- (Figure: distributions of samples from a normal and an abnormal machine)

Basic Probability Concepts
- A sample space $S$ is the set of all possible outcomes of a conceptual or physical, repeatable experiment. ($S$ can be finite or infinite.)
  - E.g., $S$ may be the set of all possible outcomes of a dice roll: $S \equiv \{1, 2, 3, 4, 5, 6\}$
  - E.g., $S$ may be the set of all possible nucleotides of a DNA site: $S \equiv \{A, C, G, T\}$
  - E.g., $S$ may be the set of all possible time-space positions of an aircraft on a radar screen: $S \equiv [0, R_{\max}] \times [0^\circ, 360^\circ) \times [0, +\infty)$

Random Variable
- A random variable is a function $X(\omega)$ that associates a unique numerical value (a token) with every outcome $\omega \in S$ of an experiment. (The value of the r.v. will vary from trial to trial as the experiment is repeated.)
- Discrete r.v.:
  - The outcome of a dice roll
  - The outcome of reading a nucleotide at site $i$: $X_i$
- Binary event and indicator variable:
  - Seeing an "A" at a site $\Rightarrow X = 1$; otherwise $X = 0$.
  - This describes the true-or-false outcome of a random event.
  - Can we describe richer outcomes in the same way?
    (i.e., $X = 1, 2, 3, 4$ for being A, C, G, T) --- think about what would happen if we take the expectation of $X$.
- Unit-base random vector: $X_i = [X_{iA}, X_{iT}, X_{iG}, X_{iC}]^T$; e.g., $X_i = [0, 0, 1, 0]^T$ $\Rightarrow$ seeing a "G" at site $i$
- Continuous r.v.: $X(\omega) = [x_1, x_2, \ldots, x_n]^T$
  - The outcome of recording the true location of an aircraft: $X^{\mathrm{true}}$
  - The outcome of observing the measured location of an aircraft: $X^{\mathrm{obs}}$

Discrete Prob. Distribution
- (In the discrete case) a probability distribution $P$ on $S$ (and hence on the domain of $X$) is an assignment of a non-negative real number $P(s)$ to each $s \in S$ (or each valid value of $x$) such that $\sum_{s \in S} P(s) = 1$ ($0 \le P(s) \le 1$)
  - Intuitively, $P(s)$ corresponds to the frequency (or the likelihood) of getting $s$ in the experiments, if repeated many times
  - Call $\theta_s = P(s)$ the parameters of a discrete probability distribution
- A probability distribution on a sample space is sometimes called a probability model, in particular if several different distributions are under consideration
  - Write models as $M_1$, $M_2$, probabilities as $P(X|M_1)$, $P(X|M_2)$
  - E.g., $M_1$ may be the appropriate prob. dist. if $X$ is from a "fair dice", $M_2$ is for the "loaded dice"
  - $M$ is usually a two-tuple of {dist. family, dist. parameters}

Discrete Distributions
- Bernoulli distribution: $\mathrm{Ber}(p)$
  $$P(x) = \begin{cases} 1 - p & \text{for } x = 0 \\ p & \text{for } x = 1 \end{cases} \quad\Rightarrow\quad P(x) = p^x (1 - p)^{1 - x}$$
- Multinomial distribution: $\mathrm{Mult}(1, \theta)$
  - Multinomial (indicator) variable: $X = [X_1, \ldots, X_6]^T$, where $X_j \in \{0, 1\}$, $\sum_{j} X_j = 1$, and $X_j = 1$ w.p. $\theta_j$ with $\sum_j \theta_j = 1$.
  $$p(x) = P(\{X_j = 1\},\ \text{where } j \text{ indexes the dice face}) = \theta_j = \theta_A^{x_A} \times \theta_C^{x_C} \times \theta_G^{x_G} \times \theta_T^{x_T} = \prod_k \theta_k^{x_k}$$
- Multinomial distribution: $\mathrm{Mult}(n, \theta)$
  - Count variable: $X = [x_1, \ldots, x_K]^T$, where $\sum_j x_j = n$
  $$p(x) = \frac{n!}{x_1!\, x_2! \cdots x_K!}\, \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_K^{x_K}$$

Continuous Prob. Distribution
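The indicator encoding and the two multinomial pmfs above can be sketched in code (a minimal illustration; the function names, the frequency vector `theta`, and the example data are mine, not from the slides):

```python
import math

# Unit-base (indicator) encoding: exactly one entry of x is 1, the rest 0.
ALPHABET = ["A", "C", "G", "T"]

def one_hot(nt):
    """Indicator vector for nucleotide nt, e.g. 'G' -> [0, 0, 1, 0]."""
    return [1 if a == nt else 0 for a in ALPHABET]

def mult1_pmf(x, theta):
    """Mult(1, theta): p(x) = prod_k theta_k^{x_k}, i.e. theta_j for the j with x_j = 1."""
    return math.prod(t**xi for t, xi in zip(theta, x))

def multn_pmf(counts, theta):
    """Mult(n, theta): n!/(x_1! ... x_K!) * prod_k theta_k^{x_k}."""
    n = sum(counts)
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    return coef * math.prod(t**c for t, c in zip(theta, counts))

theta = [0.1, 0.2, 0.3, 0.4]   # made-up nucleotide frequencies
x = one_hot("G")
print(x)                        # [0, 0, 1, 0]
print(mult1_pmf(x, theta))      # 0.3, i.e. theta_G

# Note the expectation of the indicator vector is theta itself --- unlike
# E[X] under the coding X = 1,2,3,4, which mixes unrelated categories.

# Count variable over n = 3 draws from a fair six-sided die:
print(multn_pmf([1, 1, 1, 0, 0, 0], [1/6] * 6))   # 3!/(1!1!1!) * (1/6)^3 = 1/36
```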
- A continuous random variable $X$ can assume any value in an interval on the real line or in a region in a high-dimensional space
- A random vector $X = [x_1, x_2, \ldots, x_n]^T$ usually corresponds to real-valued measurements of some property, e.g., length, position, ...
- It is not possible to talk about the probability of the random variable assuming a particular value --- $P(x) = 0$
- Instead, we talk about the probability of the random variable assuming a value within a given interval or half-interval:
  - $P(X \in [x_1, x_2])$
  - $P(X < x) = P(X \in (-\infty, x])$
  - Arbitrary Boolean combinations of basic propositions

Continuous Prob. Distribution
- The probability of the random variable assuming a value within some given interval from $x_1$ to $x_2$ is defined to be the area under the graph of the probability density function between $x_1$ and $x_2$.
  - Probability mass: $P(X \in [x_1, x_2]) = \int_{x_1}^{x_2} p(x)\,dx$; note that $\int_{-\infty}^{+\infty} p(x)\,dx = 1$
  - Cumulative distribution function (CDF): $P(x) = P(X < x) = \int_{-\infty}^{x} p(x')\,dx'$
  - Probability density function (PDF): $p(x) = \frac{d}{dx} P(x)$; $\int_{-\infty}^{+\infty} p(x)\,dx = 1$; $p(x) \ge 0,\ \forall x$
  - (Figure: car flow on Liberty Bridge --- cooked up!)

Continuous Distributions
- Uniform probability density function:
  $$p(x) = \begin{cases} 1/(b - a) & \text{for } a \le x \le b \\ 0 & \text{elsewhere} \end{cases}$$
- Normal (Gaussian) probability density function:
  $$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$$
  - The distribution is symmetric, and is often illustrated as a bell-shaped curve.
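The pdf/CDF relations above can be checked numerically (a sketch under my own choices of integrator and step count, not part of the lecture):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: exp(-(x-mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def prob_mass(x1, x2, mu=0.0, sigma=1.0, steps=10000):
    """P(X in [x1, x2]) as the area under the pdf, via the trapezoid rule."""
    h = (x2 - x1) / steps
    total = 0.5 * (normal_pdf(x1, mu, sigma) + normal_pdf(x2, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(x1 + i * h, mu, sigma)
    return total * h

# About 68% of the probability mass lies within one standard deviation
# of the mean, and (numerically) almost all of it within ten:
print(round(prob_mass(-1.0, 1.0), 4))    # 0.6827
print(round(prob_mass(-10.0, 10.0), 4))  # 1.0
```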
  - Two parameters, $\mu$ (mean) and $\sigma$ (standard deviation), determine the location and shape of the distribution.
  - The highest point on the normal curve is at the mean, which is also the median and mode.
  - The mean can be any numerical value: negative, zero, or positive.
- Multivariate Gaussian:
  $$p(X; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \right\}$$

Class-Conditional Probability
- Classification-specific dist.: $P(X|Y)$
  - $p(X|Y=1) = p_1(X; \mu_1, \Sigma_1)$
  - $p(X|Y=2) = p_2(X; \mu_2, \Sigma_2)$
- Class prior (i.e., "weight"): $P(Y)$

The Bayes Rule
- What we just did leads to the following general expression:
  $$P(Y|X) = \frac{P(X|Y)\,p(Y)}{P(X)}$$
  This is Bayes rule.

The Bayes Decision Rule for Minimum Error
- The a posteriori probability of a sample:
  $$P(Y=i|X) = \frac{p(X|Y=i)\,P(Y=i)}{p(X)} = \frac{\pi_i\, p_i(X)}{\sum_i \pi_i\, p_i(X)} \equiv q_i(X)$$
- Bayes test: decide the class with the larger posterior, i.e., $Y = 1$ iff $\pi_1 p_1(X) > \pi_2 p_2(X)$
- Likelihood ratio: $\ell(X) = \dfrac{p_1(X)}{p_2(X)}$, compared against the threshold $\pi_2 / \pi_1$
- Discriminant function: $h(X) = -\log \ell(X)$

Example of Decision Rules
- When each class is a normal ...
- We can write the decision boundary analytically in some cases ... homework!!

Bayes Error
- We must calculate the probability of error --- the probability that a sample is assigned to the wrong class
- Given a datum $X$, what is the risk?
- The Bayes error (the expected risk):

More on Bayes Error
- The Bayes error is the lower bound on the probability of classification error
- The Bayes classifier is the theoretically best classifier, the one that minimizes the probability of classification error
- Computing the Bayes error is in general a very complex problem. Why?
  - Density estimation
  - Integrating the density function

Learning Classifier
- The decision rule:
- Learning strategies:
  - Generative learning
    - Parametric
    - Nonparametric
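The Bayes decision rule for minimum error can be sketched for two 1-D Gaussian classes (the priors, means, and variances below are made up for illustration; they are not from the lecture):

```python
import math

def normal_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def posteriors(x, priors, mus, sigmas):
    """q_i(x) = pi_i p_i(x) / sum_j pi_j p_j(x)."""
    joint = [pi * normal_pdf(x, mu, s) for pi, mu, s in zip(priors, mus, sigmas)]
    z = sum(joint)
    return [j / z for j in joint]

priors = [0.5, 0.5]   # pi_1, pi_2
mus    = [0.0, 4.0]   # mu_1, mu_2
sigmas = [1.0, 1.0]

# Bayes test: assign x to the class with the larger posterior.
x = 1.0
q = posteriors(x, priors, mus, sigmas)
print("decide class", 1 + q.index(max(q)))   # decide class 1

# Equivalent likelihood-ratio form: decide class 1 iff
# l(x) = p_1(x)/p_2(x) exceeds the threshold pi_2/pi_1.
l = normal_pdf(x, mus[0], sigmas[0]) / normal_pdf(x, mus[1], sigmas[1])
print(l > priors[1] / priors[0])             # True
```

Note that with equal priors and equal variances the two forms agree and the decision boundary is simply the midpoint $(\mu_1 + \mu_2)/2$, one of the cases where the boundary can be written analytically.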