Note on Fisher Linear Discriminants

Pietro Perona
California Institute of Technology
[email protected]
8, 2000

1 Introduction

We have $N$ data points $X_i \in \mathbb{R}^D$. The points belong to $G$ groups. Consider the problem of finding a linear transformation $a$ to one-dimensional space such that the points $X_i a = Z_i \in \mathbb{R}$ are easy to classify. For simplicity we will assume that the $N$ data points $X_i$ have zero mean.

Some notation: given a matrix $A$, indicate with $A_i$ and $A^j$ the $i$-th row and the $j$-th column of $A$, and with $A^j_i$ the $(i,j)$-th element of $A$. Moreover:

\[
[i] = g \qquad \text{$g$ is the group of point $i$} \tag{1}
\]
\[
X = \begin{bmatrix} X_1 \\ \vdots \\ X_N \end{bmatrix} \in \mathbb{R}^{N \times D} \qquad \text{the data} \tag{2}
\]
\[
n = (n_1, \ldots, n_G) \qquad \text{number of data points in each group} \tag{3}
\]
\[
N = \mathrm{diag}(n) = G^T G = \begin{bmatrix} n_1 & 0 & \cdots & 0 \\ 0 & n_2 & \cdots & 0 \\ & & \ddots & \\ 0 & 0 & \cdots & n_G \end{bmatrix} \tag{4}
\]
\[
G^j_i = \delta([i], j), \quad \text{i.e.} \quad G = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ & \ddots & & \\ 0 & \cdots & 0 & 1 \end{bmatrix} \in \mathbb{R}^{N \times G} \tag{5}
\]
\[
M^j_g = \frac{1}{n_g} \sum_{[i]=g} X^j_i \qquad \text{mean of the $j$-th coordinate in group $g$} \tag{6}
\]
\[
M = N^{-1} G^T X = \begin{bmatrix} M_1 \\ \vdots \\ M_G \end{bmatrix} \in \mathbb{R}^{G \times D} \qquad \text{matrix collecting the means of each group} \tag{7}
\]
\[
A = U_A L_A V_A^T \qquad \text{the singular value decomposition of $A$} \tag{8}
\]
\[
D = I - G N^{-1} G^T \tag{9}
\]
\[
X_0 = X - GM = DX \qquad \text{the data, each point referred to its group's mean} \tag{10}
\]

2 Optimization problem

In order for the points $Z_i$ to be easy to classify one would like to simultaneously maximize the between-clusters distance and minimize the within-clusters distance. These quantities may be defined as:

Between-clusters distance -- Consider the means $M_g$ of each group $g$. One would like to maximize their spread around the overall mean (the origin, since $X$ is zero-mean):
\[
B = (GM)^T (GM) = X^T G N^{-1} G^T X \tag{11}
\]
Notice that each mean $M_g$ is counted $n_g$ times in order to reflect the frequency of group $g$.

Within-clusters distance -- Consider the spread of the points around each group's center.
One would like to minimize:
\[
W = X_0^T X_0 = X^T D^T D X \tag{12}
\]

In order to optimize both quantities simultaneously Fisher proposed to maximize their ratio with respect to the transformation $a$:
\[
J(a) = \frac{a^T B a}{a^T W a} \tag{13}
\]
Taking the derivative with respect to $a$ and equating it to zero:
\[
DJ(a) = \frac{2 B a \,(a^T W a) - 2 W a \,(a^T B a)}{(a^T W a)^2} = 0 \tag{14}
\]
\[
\text{define } \lambda \doteq \frac{a^T B a}{a^T W a} \tag{15}
\]
\[
\Rightarrow \; B a = \lambda W a, \qquad a^T W a \neq 0 \tag{16}
\]
Therefore in order to find the value of $a$ we need to solve the generalized eigenvector problem $B a = \lambda W a$ subject to $a^T W a \neq 0$.

2.1 Eigenvector problem

Call $W^\dagger$ the generalized inverse of $W$, i.e. the inverse restricted to the subspace where $W$ is nonsingular. Then:
\[
B a = \lambda W a, \qquad a^T W a \neq 0 \tag{18}
\]
\[
W^\dagger = V_{X_0} \bigl(L_{X_0}^\dagger\bigr)^2 V_{X_0}^T \tag{19}
\]
\[
\Rightarrow \; W^\dagger B a = \lambda a \tag{20}
\]
\[
\text{define } (U_{WB}, L_{WB}, V_{WB}) \doteq \mathrm{SVD}(W^\dagger B) \tag{21}
\]
\[
\Rightarrow \; a = V^1_{WB} \tag{22}
\]

2.2 Alternative approach

An equivalent approach consists in calculating a coordinate transformation $a = Sb$ such that $a^T W a = \|b\|^2$. In this case one may calculate the $b$ that maximizes the numerator, subject to $\|b\| = 1$. One must, however, pay attention to the fact that the solution $b$ must not be in the null space of $S$.

From the definitions of $W$ and $B$:
\[
W = X_0^T X_0 = V_{X_0} L_{X_0}^2 V_{X_0}^T \tag{23}
\]
\[
S = V_{X_0} L_{X_0}^{-1} \tag{24}
\]
\[
a = S b \tag{25}
\]
\[
b = S^{-1} a = L_{X_0} V_{X_0}^T a \tag{26}
\]
\[
a^T W a = a^T V_{X_0} L_{X_0}^2 V_{X_0}^T a = b^T b \tag{27}
\]
\[
a^T B a = b^T S^T B S b \tag{28}
\]
\[
(U, L, V) \doteq \mathrm{SVD}(S^T B S) \tag{29}
\]
\[
\Rightarrow \; b = V^1 \tag{30}
\]
\[
\Rightarrow \; a = S V^1 = V_{X_0} L_{X_0}^{-1} V^1 \tag{31}
\]

3 Code and References

Check out the Matlab function fisherLD.m written by Markus Weber. A prize to whoever figures out how the code works.

You will find Fisher linear discriminants discussed in B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996 (SFL library, call n. QA 76.87 R56).
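As a complement, here is a minimal NumPy sketch of the whitening recipe of Section 2.2 (compute $S = V_{X_0} L_{X_0}^{-1}$ from the SVD of $X_0$, take the top singular vector of $S^T B S$, and map back via $a = Sb$). This is not Weber's fisherLD.m; the function name, tolerance, and example data below are my own choices.

```python
import numpy as np

def fisher_ld(X, labels):
    """Fisher discriminant direction via the two-SVD approach of Sec. 2.2.

    X      : (N, D) data matrix.
    labels : (N,) integer group labels.
    Returns the projection vector a, normalized so that a^T W a = 1.
    """
    X = X - X.mean(axis=0)                 # enforce the zero-mean assumption on X
    groups = np.unique(labels)
    # Group means M (G x D); X0 = X - G M refers each point to its group's mean
    M = np.stack([X[labels == g].mean(axis=0) for g in groups])
    X0 = X - M[np.searchsorted(groups, labels)]
    # Between-clusters scatter B = (G M)^T (G M): each mean M_g counted n_g times
    n = np.array([(labels == g).sum() for g in groups])
    B = (M * n[:, None]).T @ M
    # SVD of X0 gives W = X0^T X0 = V L^2 V^T; drop (near-)zero singular values
    _, L, Vt = np.linalg.svd(X0, full_matrices=False)
    keep = L > 1e-10 * L.max()
    S = Vt[keep].T / L[keep]               # S = V L^{-1}, so a = S b whitens W
    # Maximize b^T (S^T B S) b over unit b: top singular vector of S^T B S
    _, _, Vt2 = np.linalg.svd(S.T @ B @ S)
    return S @ Vt2[0]

# Tiny usage example: two well-separated 2-D clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.1, (50, 2)),
               rng.normal([3.0, 1.0], 0.1, (50, 2))])
y = np.repeat([0, 1], 50)
a = fisher_ld(X, y)
Z = X @ a   # projections Z_i = X_i a; the two groups separate along this axis
```

Note that restricting $S$ to the nonzero singular values of $X_0$ is what keeps the solution $b$ out of the null space mentioned in Section 2.2.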