Gaussian density/distribution

The Gaussian density function of n-dimensional vectors is:

g(x; \mu, C) = \frac{1}{(\sqrt{2\pi})^n \, |C|^{1/2}} \, e^{-\frac{1}{2}(x-\mu)^T C^{-1}(x-\mu)}

Here µ is the distribution mean and C is the covariance matrix. The value |C| is the determinant of the matrix C. The parameters µ, C can be estimated from the data by:

\mu = \frac{\sum_{i=1}^{m} x_i}{m}, \qquad C = \frac{\sum_i (x_i-\mu)(x_i-\mu)^T}{m} \quad \text{or} \quad C = \frac{\sum_i (x_i-\mu)(x_i-\mu)^T}{m-1}

The special n = 1 case is:

g(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

The parameters µ, σ² can be estimated from the data by:

\mu = \frac{\sum_{i=1}^{m} x_i}{m}, \qquad \sigma^2 = \frac{\sum_i (x_i-\mu)^2}{m} \quad \text{or} \quad \sigma^2 = \frac{\sum_i (x_i-\mu)^2}{m-1}

Discriminant functions for Gaussian density

Suppose the positive examples are distributed with parameters µ1, C1, and the negative examples are distributed with parameters µ2, C2. When should we decide that x is positive and not negative? Call h1 the hypothesis that x is positive, and h2 the hypothesis that x is negative. We should decide that x is positive if:

P(h_1 \mid x) > P(h_2 \mid x)

Observe that P(x | h1) = g(x; µ1, C1), and similarly P(x | h2) = g(x; µ2, C2). Therefore:

ML classifier:  g(x; \mu_1, C_1) > g(x; \mu_2, C_2)
MAP classifier: g(x; \mu_1, C_1)\,P(h_1) > g(x; \mu_2, C_2)\,P(h_2)

Taking the logarithm, the MAP classifier is:

-\frac{1}{2}(x-\mu_1)^T C_1^{-1}(x-\mu_1) - \frac{n \ln 2\pi}{2} - \frac{\ln|C_1|}{2} + \ln P(h_1)
+ \frac{1}{2}(x-\mu_2)^T C_2^{-1}(x-\mu_2) + \frac{n \ln 2\pi}{2} + \frac{\ln|C_2|}{2} - \ln P(h_2) > 0

This simplifies to:

(x-\mu_2)^T C_2^{-1}(x-\mu_2) - (x-\mu_1)^T C_1^{-1}(x-\mu_1) + \ln|C_2| - \ln|C_1| + 2(\ln P(h_1) - \ln P(h_2)) > 0

This always gives a quadratic expression. The main problem with this approach is that the covariance matrices have O(n²) entries that need to be estimated, and reliable estimation may require a lot of training data. Instead, this model is typically used with additional simplifying assumptions that reduce the quadratic expression to a linear expression.

Case 1: C1 = C2 = σ²I

Here the discriminant function (after multiplying through by σ²) is:

|x-\mu_2|^2 - |x-\mu_1|^2 + 2\sigma^2(\ln P(h_1) - \ln P(h_2)) > 0

Expanding the squares, |x-\mu_2|^2 - |x-\mu_1|^2 = 2(\mu_1-\mu_2)^T x + |\mu_2|^2 - |\mu_1|^2, so after dividing by 2 this can be further simplified to:

d(x) = w^T x + b > 0, where:
w = \mu_1 - \mu_2
b = \frac{1}{2}\left(|\mu_2|^2 - |\mu_1|^2\right) + \sigma^2(\ln P(h_1) - \ln P(h_2))

Typically only the value of w is computed, the values of w^T x_i are calculated for all i, and the value of b is determined using ad-hoc techniques for computing a threshold.

Case 2: C1 = C2 = C

In this case it can be shown that the discriminant function is:

d(x) = w^T x + b > 0, where:
w = C^{-1}(\mu_1 - \mu_2)
b = \frac{1}{2}\left(\mu_2^T C^{-1}\mu_2 - \mu_1^T C^{-1}\mu_1\right) + \ln P(h_1) - \ln P(h_2)

Typically only the value of w is computed, the values of w^T x_i are calculated for all i, and the value of b is determined using ad-hoc techniques for computing a threshold.

Case 3: arbitrary C1, C2

Here we can get an arbitrary quadratic function as the discriminant:

d(x) = (x-\mu_2)^T C_2^{-1}(x-\mu_2) - (x-\mu_1)^T C_1^{-1}(x-\mu_1) + \ln|C_2| - \ln|C_1| + 2(\ln P(h_1) - \ln P(h_2)) > 0

Example

x1   x2   y
 2    6   +
 3    8   +
 4    6   +
 3    4   +
 1   -2   -
 3    0   -
 5   -2   -
 3   -4   -

Assume P(h1) = P(h2). We have:

\mu_1 = (3, 6)^T, \qquad \mu_2 = (3, -2)^T

Case 1:

w = (0, 8)^T, \qquad b = -16

The value of b was computed by observing the sorted products w^T x. They are:

-4·8   -2·8   -2·8   0·8   4·8   6·8   6·8   8·8
  -      -      -     -     +     +     +     +

Select b to maximize the margin (here −b = 16, halfway between 0·8 = 0 and 4·8 = 32).

Case 2:

C = \begin{pmatrix} 10 & 0 \\ 0 & 16 \end{pmatrix}, \qquad
w = \begin{pmatrix} 1/10 & 0 \\ 0 & 1/16 \end{pmatrix}\begin{pmatrix} 0 \\ 8 \end{pmatrix} = \begin{pmatrix} 0 \\ 1/2 \end{pmatrix}, \qquad b = -1

Case 1 gave d(x) = 8x_2 - 16; here d(x) = \frac{1}{2}x_2 - 1. Both define the same decision boundary x_2 = 2.

The value of b was computed by observing the sorted products w^T x. They are:

-4·(1/2)   -2·(1/2)   -2·(1/2)   0·(1/2)   4·(1/2)   6·(1/2)   6·(1/2)   8·(1/2)
    -          -          -         -         +         +         +         +

Select b to maximize the margin (here −b = 1, halfway between 0 and 2).

Case 3:

C_1 = \begin{pmatrix} 1/2 & 0 \\ 0 & 2 \end{pmatrix}, \quad C_1^{-1} = \begin{pmatrix} 2 & 0 \\ 0 & 1/2 \end{pmatrix}, \qquad
C_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}, \quad C_2^{-1} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}

Substituting in d(x) for Case 3 we get:

d(x) = -x_2 + 0.1875\,x_1^2 - 1.125\,x_1 + \ldots
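The general (Case 3) discriminant is easy to implement directly. Below is a minimal sketch, not from the original notes, in Python/NumPy; it assumes the maximum-likelihood (divide-by-m) covariance estimate, and the names estimate_params and discriminant are illustrative choices.

```python
import numpy as np

def estimate_params(X):
    """Mean and maximum-likelihood (divide-by-m) covariance of the rows of X."""
    mu = X.mean(axis=0)
    D = X - mu
    C = D.T @ D / len(X)   # use len(X) - 1 instead for the unbiased estimate
    return mu, C

def discriminant(x, mu1, C1, mu2, C2, p1=0.5, p2=0.5):
    """Quadratic MAP discriminant:
    d(x) = (x-mu2)^T C2^{-1}(x-mu2) - (x-mu1)^T C1^{-1}(x-mu1)
           + ln|C2| - ln|C1| + 2(ln p1 - ln p2);  d(x) > 0 means 'positive'."""
    q2 = (x - mu2) @ np.linalg.inv(C2) @ (x - mu2)
    q1 = (x - mu1) @ np.linalg.inv(C1) @ (x - mu1)
    return (q2 - q1
            + np.log(np.linalg.det(C2)) - np.log(np.linalg.det(C1))
            + 2 * (np.log(p1) - np.log(p2)))
```

For Case 1 and Case 2 one would pass the same σ²I or shared C for both classes, in which case the quadratic terms in x cancel and the rule reduces to the linear d(x) = wᵀx + b derived above.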
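As a further illustration, a short script (again a sketch for checking the arithmetic, not part of the notes) reproduces the numbers of the worked example under equal priors; for Case 2 it uses the shared matrix C = diag(10, 16) exactly as given in the notes.

```python
import numpy as np

pos = np.array([[2, 6], [3, 8], [4, 6], [3, 4]], dtype=float)
neg = np.array([[1, -2], [3, 0], [5, -2], [3, -4]], dtype=float)

mu1, mu2 = pos.mean(axis=0), neg.mean(axis=0)      # (3, 6) and (3, -2)
C1 = (pos - mu1).T @ (pos - mu1) / len(pos)        # diag(1/2, 2)
C2 = (neg - mu2).T @ (neg - mu2) / len(neg)        # diag(2, 2)

# Case 1: w = mu1 - mu2 = (0, 8); b chosen between the sorted values of w^T x
w1 = mu1 - mu2
print(sorted(np.r_[pos, neg] @ w1))                # -32 -16 -16 0 | 32 48 48 64 -> b = -16

# Case 2: with the shared C = diag(10, 16) from the notes,
# w = C^{-1}(mu1 - mu2) = (0, 1/2) and b = -1 (equal priors)
C = np.diag([10.0, 16.0])
w2 = np.linalg.inv(C) @ (mu1 - mu2)
b2 = 0.5 * (mu2 @ np.linalg.inv(C) @ mu2 - mu1 @ np.linalg.inv(C) @ mu1)
print(w2, b2)                                      # [0. 0.5] -1.0
```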