Nonextensive Entropic Kernels

André F. T. Martins†‡  [email protected]
Mário A. T. Figueiredo‡  [email protected]
Pedro M. Q. Aguiar♮  [email protected]
Noah A. Smith†  [email protected]
Eric P. Xing†  [email protected]

† School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
‡ Instituto de Telecomunicações / ♮ Instituto de Sistemas e Robótica, Instituto Superior Técnico, Lisboa, Portugal

Abstract

Positive definite kernels on probability measures have recently been applied in structured data classification problems. Some of these kernels are related to classic information-theoretic quantities, such as mutual information and the Jensen-Shannon divergence. Meanwhile, driven by recent advances in Tsallis statistics, nonextensive generalizations of Shannon's information theory have been proposed.

This paper bridges these two trends. We introduce the Jensen-Tsallis q-difference, a generalization of the Jensen-Shannon divergence. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, Jensen-Shannon, and linear kernels as particular cases. We illustrate the performance of these kernels on text categorization tasks.

1. Introduction

There has been recent interest in kernels on probability distributions as a way to tackle several classification problems (Moreno et al., 2003; Jebara et al., 2004; Hein & Bousquet, 2005; Lafferty & Lebanon, 2005; Cuturi et al., 2005). By mapping data points to fitted distributions in a parametric family on which a kernel is defined, a kernel is automatically induced on the original input space. In text categorization, this appears as an alternative to the Euclidean geometry inherent in the usual bag-of-words vector representations. In fact, approaches that map data to a statistical manifold, where well-motivated non-Euclidean metrics may be defined (Lafferty & Lebanon, 2005), outperform SVM classifiers with linear kernels (Joachims, 1997). Some of these kernels have a natural information-theoretic interpretation, creating a bridge between kernel methods and information theory (Cuturi et al., 2005; Hein & Bousquet, 2005).

We reinforce that bridge by introducing a new class of kernels rooted in nonextensive (NE) information theory. The Shannon and Rényi entropies (Rényi, 1961) share the extensivity property: the joint entropy of a pair of independent random variables equals the sum of the individual entropies. Abandoning this property yields the so-called NE entropies (Havrda & Charvát, 1967; Tsallis, 1988), which have raised great interest among physicists in modeling certain phenomena (e.g., long-range interactions and multifractals) and as generalizations of Boltzmann-Gibbs statistical mechanics (Abe, 2006). NE entropies have also recently been used in signal/image processing (Li et al., 2006) and other areas (Gell-Mann & Tsallis, 2004).

The main contributions of this paper are:

• Based on the new concept of q-convexity and a related q-Jensen inequality, we introduce the Jensen-Tsallis q-difference, a NE generalization of the Jensen-Shannon (JS) divergence.

• We propose a broad family of positive definite (pd) kernels, which are interpretable as NE mutual information (MI) kernels.
  This family ranges from the Boolean to the linear kernels, and also includes the JS kernel (Hein & Bousquet, 2005).

• We extend results of Hein and Bousquet (2005) by proving positive definiteness of kernels based on the unbalanced JS divergence. As a side note, we show that the parametrix approximation of the multinomial diffusion kernel introduced by Lafferty and Lebanon (2005) is not pd in general.

Our main purpose is to present new theoretical insights about kernels on measures by unifying some well-known instances into a common parametrized family. This family allows reinterpreting these kernels in light of NE information theory, a connection that, to our knowledge, had not been presented before. The fact that some members of this family are novel pd kernels leads us to include a set of text categorization experiments that illustrates their effectiveness.

The paper is organized as follows. Sec. 2 reviews NE entropies, while Jensen differences and divergences are discussed in Sec. 3. In Sec. 4, the concepts of q-differences and q-convexity are introduced and used to define the Jensen-Tsallis q-difference. Sec. 5 presents the new family of entropic kernels. Sec. 6 reports experiments on text categorization, and Sec. 7 presents concluding remarks and future research directions. Although, for simplicity, we focus on discrete distributions on finite sets, most results are valid in arbitrary measure spaces, as shown by Martins et al. (2008).

2. Nonextensive Information Theory

Let X denote a random variable (rv) taking values in a finite set X = {x_1, ..., x_n} according to a probability distribution P_X. An entropy function is said to be extensive if it is additive over independent variables. For example, the Shannon entropy (Cover & Thomas, 1991), H(X) \triangleq -E[\ln P_X], is extensive: if X and Y are independent, then H(X, Y) = H(X) + H(Y). Another example is the family of Rényi entropies (Rényi, 1961), parameterized by q ≥ 0,

    R_q(X) \triangleq \frac{1}{1-q} \ln \sum_{i=1}^{n} P_X(x_i)^q,    (1)

which includes Shannon's entropy as a special case when q → 1.

In classic information theory, extensivity is considered desirable, and is enforced axiomatically (Khinchin, 1957), to express the idea, borrowed from thermodynamics, that "independent systems add their entropies." In contrast, the Tsallis entropies abandon the extensivity requirement (Tsallis, 1988). These NE entropies, denoted S_q(X), are defined as follows:

    S_q(X) \triangleq -E_q(\ln_q P_X) = \frac{1}{q-1} \left( 1 - \sum_{i=1}^{n} P_X(x_i)^q \right),    (2)

where E_q(f) \triangleq \sum_{i=1}^{n} P(x_i)^q f(x_i) is the unnormalized q-expectation and \ln_q(y) \triangleq (y^{1-q} - 1)/(1-q) is the so-called q-logarithm. It is noteworthy that when q → 1, we get E_q → E, \ln_q → \ln, and S_q → H; i.e., the family of Tsallis entropies also includes Shannon's entropy. For the Tsallis family, when X and Y are independent, extensivity no longer holds; instead, we have

    S_q(X, Y) = S_q(X) + S_q(Y) - (q - 1) S_q(X) S_q(Y),    (3)

where the parameter q ≥ 0 is called the entropic index. While statistical physics has been the main application of Tsallis entropies, some attempts have been made to
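To make definitions (1) and (2) concrete, here is a minimal Python sketch (an illustrative aside of ours, not code from the paper; the helper names q_log, tsallis_entropy, and renyi_entropy are hypothetical). It implements the q-logarithm and the Rényi and Tsallis entropies for a discrete distribution given as a probability vector, and shows that both families approach the Shannon entropy as q → 1.

```python
import numpy as np

def q_log(y, q):
    """q-logarithm: ln_q(y) = (y**(1 - q) - 1) / (1 - q); recovers ln(y) as q -> 1."""
    if q == 1.0:
        return np.log(y)
    return (y ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy S_q of eq. (2), written as the q-expectation -E_q[ln_q P_X]."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability outcomes contribute nothing for q > 0
    return -np.sum(p ** q * q_log(p, q))

def renyi_entropy(p, q):
    """Rényi entropy R_q of eq. (1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if q == 1.0:  # Shannon limit
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** q)) / (1.0 - q)

px = np.array([0.5, 0.3, 0.2])
shannon = -np.sum(px * np.log(px))  # H(X) in nats
for q in (0.999, 1.0, 1.001):
    # Both entropies are close to (and, at q = 1, equal to) the Shannon entropy.
    print(q, tsallis_entropy(px, q), renyi_entropy(px, q), shannon)
```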
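The pseudo-additivity property (3) can likewise be checked numerically: for independent X and Y, the joint distribution is the outer product of the marginals, and at q = 1 the cross term vanishes, recovering extensivity. Below is a self-contained sketch under the same assumptions as above (the helper name is again ours).

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q via the closed form in eq. (2)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if q == 1.0:  # Shannon limit
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

px = np.array([0.5, 0.3, 0.2])
py = np.array([0.6, 0.4])
pxy = np.outer(px, py).ravel()  # joint distribution of independent X and Y

for q in (0.5, 1.0, 2.0):
    lhs = tsallis_entropy(pxy, q)
    rhs = (tsallis_entropy(px, q) + tsallis_entropy(py, q)
           - (q - 1.0) * tsallis_entropy(px, q) * tsallis_entropy(py, q))
    assert abs(lhs - rhs) < 1e-10  # pseudo-additivity, eq. (3)
```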