CS 559: Machine Learning Fundamentals and Applications
3rd Set of Notes

Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: [email protected]
Office: Lieb 215

Overview
• DHS Ch. 2 (cont.): Discriminant Functions for the Normal Density
• DHS Ch. 3: Maximum-Likelihood & Bayesian Parameter Estimation

DHS Chapter 2: Bayesian Decision Theory (Sections 2-6, 2-9)
• Discriminant Functions for the Normal Density
• Bayes Decision Theory – Discrete Features

(Material follows Pattern Classification, Chapter 2.)

Discriminant Functions for the Normal Density
• We saw that minimum error-rate classification can be achieved by the discriminant function
    g_i(x) = ln p(x | ω_i) + ln P(ω_i)
• Case of multivariate normal:
    g_i(x) = −(1/2)(x − μ_i)^t Σ_i^{−1}(x − μ_i) − (d/2) ln 2π − (1/2) ln|Σ_i| + ln P(ω_i)

• Case Σ_i = σ²I (I stands for the identity matrix):
    g_i(x) = w_i^t x + w_{i0}   (linear discriminant function)
  where:
    w_i = μ_i / σ²;   w_{i0} = −(1/(2σ²)) μ_i^t μ_i + ln P(ω_i)
  (w_{i0} is called the threshold for the i-th category.) Prove it!

Linear Machines
• A classifier that uses linear discriminant functions is called "a linear machine"
• The decision surfaces for a linear machine are pieces of hyperplanes defined by
    g_i(x) = g_j(x)

• The hyperplane separating R_i and R_j
  Decision boundary: g_i(x) = g_j(x), with g_i(x) = w_i^t x + w_{i0} and g_j(x) = w_j^t x + w_{j0}, i.e.
    w^t (x − x_0) = 0
  where:
    w = μ_i − μ_j
    x_0 = (1/2)(μ_i + μ_j) − (σ² / ||μ_i − μ_j||²) ln[P(ω_i) / P(ω_j)] (μ_i − μ_j)
  The hyperplane is always orthogonal to the line linking the means.
  If P(ω_i) = P(ω_j), then x_0 = (1/2)(μ_i + μ_j).
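The Σ_i = σ²I case above can be tried numerically. The snippet below is a minimal sketch, not from DHS: the function name `linear_discriminants` and the example means are my own. It builds w_i = μ_i/σ² and w_{i0} = −μ_i^t μ_i/(2σ²) + ln P(ω_i) and decides by the largest g_i(x):

```python
import numpy as np

def linear_discriminants(means, priors, sigma2):
    """Classifier for the case Sigma_i = sigma^2 I:
    g_i(x) = w_i^t x + w_i0, with w_i = mu_i / sigma^2 and
    w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(omega_i)."""
    ws = [mu / sigma2 for mu in means]
    w0s = [-(mu @ mu) / (2 * sigma2) + np.log(p)
           for mu, p in zip(means, priors)]

    def classify(x):
        # Decide the category with the largest discriminant value.
        scores = [w @ x + w0 for w, w0 in zip(ws, w0s)]
        return int(np.argmax(scores))

    return classify

# Two equal-prior classes: the decision boundary is the
# perpendicular bisector of the segment joining the two means.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
clf = linear_discriminants(means, priors=[0.5, 0.5], sigma2=1.0)
print(clf(np.array([1.0, 0.0])))  # 0 (closer to mu_1)
print(clf(np.array([3.0, 0.0])))  # 1 (closer to mu_2)
```

With equal priors the ln P(ω_i) terms cancel and this reduces to a minimum-distance-to-mean classifier, matching the orthogonal-hyperplane picture above.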
• Case Σ_i = Σ (covariances of all classes are identical but arbitrary!)
  Hyperplane separating R_i and R_j:
    w = Σ^{−1}(μ_i − μ_j)
    x_0 = (1/2)(μ_i + μ_j) − ( ln[P(ω_i)/P(ω_j)] / ((μ_i − μ_j)^t Σ^{−1} (μ_i − μ_j)) ) (μ_i − μ_j)
  (The hyperplane separating R_i and R_j is generally not orthogonal to the line between the means.)

• Case Σ_i = arbitrary
  The covariance matrices are different for each category:
    g_i(x) = x^t W_i x + w_i^t x + w_{i0}
  where:
    W_i = −(1/2) Σ_i^{−1}
    w_i = Σ_i^{−1} μ_i
    w_{i0} = −(1/2) μ_i^t Σ_i^{−1} μ_i − (1/2) ln|Σ_i| + ln P(ω_i)
  (The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids.)

Bayes Decision Theory – Discrete Features
• Components of x are binary or integer valued; x can take only one of m discrete values v_1, v_2, …, v_m
• Case of independent binary features in a 2-category problem:
  Let x = [x_1, x_2, …, x_d]^t, where each x_i is either 0 or 1, with probabilities
    p_i = P(x_i = 1 | ω_1)
    q_i = P(x_i = 1 | ω_2)
  Then
    P(x | ω_1) = ∏_{i=1}^{d} p_i^{x_i} (1 − p_i)^{1 − x_i}
    P(x | ω_2) = ∏_{i=1}^{d} q_i^{x_i} (1 − q_i)^{1 − x_i}
  Likelihood ratio:
    P(x | ω_1) / P(x | ω_2) = ∏_{i=1}^{d} (p_i / q_i)^{x_i} ((1 − p_i) / (1 − q_i))^{1 − x_i}
  Discriminant function:
    g(x) = Σ_{i=1}^{d} [ x_i ln(p_i/q_i) + (1 − x_i) ln((1 − p_i)/(1 − q_i)) ] + ln[P(ω_1)/P(ω_2)]
  Equivalently, g(x) = Σ_{i=1}^{d} w_i x_i + w_0, where:
    w_i = ln[ p_i(1 − q_i) / (q_i(1 − p_i)) ],  i = 1, …, d
    w_0 = Σ_{i=1}^{d} ln[(1 − p_i)/(1 − q_i)] + ln[P(ω_1)/P(ω_2)]
  Decide ω_1 if g(x) > 0 and ω_2 if g(x) ≤ 0.

Exercise: DHS Problem 2.12
Let ω_max(x) be the state of nature for which P(ω_max | x) ≥ P(ω_i | x) for all i = 1, …, c.
• Show that P(ω_max | x) ≥ 1/c
• Show that for the minimum-error-rate decision rule the average probability of error is given by
    P(error) = 1 − ∫ P(ω_max | x) p(x) dx
• Use these two results to show that P(error) ≤ (c − 1)/c
• Describe a situation for which P(error) = (c − 1)/c

Discriminant Functions for the Normal Density
• From slide #5, Case Σ_i = σ²I:
    g_i(x) = w_i^t x + w_{i0}
  where:
    w_i = μ_i / σ²;   w_{i0} = −(1/(2σ²)) μ_i^t μ_i + ln P(ω_i)

Discriminant Function Examples
[Four worked-example slides credited to O. Veksler; figures not reproduced here.]

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

Introduction
• Data availability in a Bayesian framework
  – We could design an optimal classifier if we knew:
    • P(ω_i) (priors)
    • p(x | ω_i) (class-conditional densities)
  Unfortunately, we rarely have this complete information!
• Design a classifier from a training sample
  – No problem with prior estimation
  – Samples are often too small for class-conditional estimation (large dimension of feature space)

Parameter Estimation
• Use a priori information about the problem
  – E.g., normality of p(x | ω_i):  p(x | ω_i) ~ N(μ_i, Σ_i)
• Simplify the problem
  – From estimating an unknown distribution function
  – To estimating parameters

Estimation Techniques
• Maximum-Likelihood (ML) and Bayesian estimation
• Results are often identical, but the approaches are fundamentally different
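As a preview of the maximum-likelihood approach, here is a minimal sketch with synthetic data (variable names are mine, not from DHS): for a Gaussian class-conditional density p(x | ω_i) ~ N(μ, Σ), the ML estimates are the sample mean and the biased (1/n) sample covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training sample from an assumed class-conditional
# density p(x | omega_i) ~ N(mu, Sigma).
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[1.0, 0.3],
                       [0.3, 0.5]])
samples = rng.multivariate_normal(true_mu, true_sigma, size=5000)

# Maximum-likelihood estimates: sample mean and biased (1/n)
# sample covariance.
mu_hat = samples.mean(axis=0)
centered = samples - mu_hat
sigma_hat = centered.T @ centered / len(samples)

print(mu_hat)     # approaches true_mu as n grows
print(sigma_hat)  # approaches true_sigma as n grows
```

Note that the ML covariance estimate divides by n rather than the unbiased n − 1; DHS Chapter 3 discusses this bias.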