Nonextensive Entropic Kernels

André F. T. Martins†‡  [email protected]
Mário A. T. Figueiredo‡  [email protected]
Pedro M. Q. Aguiar♮  [email protected]
Noah A. Smith†  [email protected]
Eric P. Xing†  [email protected]

† School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
‡ Instituto de Telecomunicações / ♮ Instituto de Sistemas e Robótica, Instituto Superior Técnico, Lisboa, Portugal

Abstract

Positive definite kernels on probability measures have recently been applied in structured data classification problems. Some of these kernels are related to classic information-theoretic quantities, such as mutual information and the Jensen-Shannon divergence. Meanwhile, driven by recent advances in Tsallis statistics, nonextensive generalizations of Shannon's information theory have been proposed.

This paper bridges these two trends. We introduce the Jensen-Tsallis q-difference, a generalization of the Jensen-Shannon divergence. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, Jensen-Shannon, and linear kernels as particular cases. We illustrate the performance of these kernels on text categorization tasks.

1. Introduction

There has been recent interest in kernels on probability distributions as a way to tackle several classification problems (Moreno et al., 2003; Jebara et al., 2004; Hein & Bousquet, 2005; Lafferty & Lebanon, 2005; Cuturi et al., 2005). By mapping data points to fitted distributions in a parametric family on which a kernel is defined, a kernel is automatically induced on the original input space. In text categorization, this appears as an alternative to the Euclidean geometry inherent in the usual bag-of-words vector representations. In fact, approaches that map data to a statistical manifold, where well-motivated non-Euclidean metrics may be defined (Lafferty & Lebanon, 2005), outperform SVM classifiers with linear kernels (Joachims, 1997). Some of these kernels have a natural information-theoretic interpretation, creating a bridge between kernel methods and information theory (Cuturi et al., 2005; Hein & Bousquet, 2005).

We reinforce that bridge by introducing a new class of kernels rooted in nonextensive (NE) information theory. The Shannon and Rényi entropies (Rényi, 1961) share the extensivity property: the joint entropy of a pair of independent random variables equals the sum of the individual entropies. Abandoning this property yields the so-called NE entropies (Havrda & Charvát, 1967; Tsallis, 1988), which have raised great interest among physicists in modeling certain phenomena (e.g., long-range interactions and multifractals) and as generalizations of Boltzmann-Gibbs statistical mechanics (Abe, 2006). NE entropies have also recently been used in signal/image processing (Li et al., 2006) and other areas (Gell-Mann & Tsallis, 2004).

The main contributions of this paper are:

• Based on the new concept of q-convexity and a related q-Jensen inequality, we introduce the Jensen-Tsallis q-difference, a NE generalization of the Jensen-Shannon (JS) divergence.

• We propose a broad family of positive definite (pd) kernels, which are interpretable as NE mutual information (MI) kernels.
  This family ranges from the Boolean to the linear kernels, and also includes the JS kernel (Hein & Bousquet, 2005).

• We extend results of Hein and Bousquet (2005) by proving positive definiteness of kernels based on the unbalanced JS divergence. As a side note, we show that the parametrix approximation of the multinomial diffusion kernel introduced by Lafferty and Lebanon (2005) is not pd in general.

Our main purpose is to present new theoretical insights about kernels on measures by unifying some well-known instances into a common parametrized family. This family allows reinterpreting these kernels in light of NE information theory, a connection that, to our knowledge, had not been presented before. The fact that some members of this family are novel pd kernels leads us to include a set of text categorization experiments that illustrates their effectiveness.

The paper is organized as follows. Sec. 2 reviews NE entropies, while Jensen differences and divergences are discussed in Sec. 3. In Sec. 4, the concepts of q-differences and q-convexity are introduced and used to define the Jensen-Tsallis q-difference. Sec. 5 presents the new family of entropic kernels. Sec. 6 reports experiments on text categorization, and Sec. 7 presents concluding remarks and future research directions. Although, for simplicity, we focus on discrete distributions on finite sets, most results are valid in arbitrary measure spaces, as shown by Martins et al. (2008).

2. Nonextensive Information Theory

Let X denote a random variable (rv) taking values in a finite set X = {x_1, ..., x_n} according to a probability distribution P_X. An entropy function is said to be extensive if it is additive over independent variables. For example, the Shannon entropy (Cover & Thomas, 1991), H(X) \triangleq -E[\ln P_X], is extensive: if X and Y are independent, then H(X, Y) = H(X) + H(Y). Another example is the family of Rényi entropies (Rényi, 1961), parameterized by q ≥ 0,

    R_q(X) \triangleq \frac{1}{1-q} \ln \sum_{i=1}^{n} P_X(x_i)^q,    (1)

which includes Shannon's entropy as a special case when q → 1.

In classic information theory, extensivity is considered desirable, and is enforced axiomatically (Khinchin, 1957), to express the idea, borrowed from thermodynamics, that "independent systems add their entropies." In contrast, the Tsallis entropies abandon the extensivity requirement (Tsallis, 1988). These NE entropies, denoted S_q(X), are defined as follows:

    S_q(X) \triangleq -E_q(\ln_q P_X) = \frac{1}{q-1} \left( 1 - \sum_{i=1}^{n} P_X(x_i)^q \right),    (2)

where E_q(f) \triangleq \sum_{i=1}^{n} P(x_i)^q f(x_i) is the unnormalized q-expectation and \ln_q(y) \triangleq (y^{1-q} - 1)/(1-q) is the so-called q-logarithm. It is noteworthy that when q → 1, we get E_q → E, \ln_q → \ln, and S_q → H; i.e., the family of Tsallis entropies also includes Shannon's entropy. For the Tsallis family, when X and Y are independent, extensivity no longer holds; instead, we have

    S_q(X, Y) = S_q(X) + S_q(Y) - (q - 1) S_q(X) S_q(Y),    (3)

where the parameter q ≥ 0 is called the entropic index. While statistical physics has been the main application of Tsallis entropies, some attempts have been made to
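To make definitions (1) and (2) concrete, here is a minimal Python sketch (an illustrative aside of ours, not code from the paper; the helper names q_log, tsallis_entropy, and renyi_entropy are hypothetical). It implements the q-logarithm and the Rényi and Tsallis entropies for a discrete distribution given as a probability vector, and shows that both families approach the Shannon entropy as q → 1.

```python
import numpy as np

def q_log(y, q):
    """q-logarithm: ln_q(y) = (y**(1 - q) - 1) / (1 - q); recovers ln(y) as q -> 1."""
    if q == 1.0:
        return np.log(y)
    return (y ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy S_q of eq. (2), written as the q-expectation -E_q[ln_q P_X]."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability outcomes contribute nothing for q > 0
    return -np.sum(p ** q * q_log(p, q))

def renyi_entropy(p, q):
    """Rényi entropy R_q of eq. (1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if q == 1.0:  # Shannon limit
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** q)) / (1.0 - q)

px = np.array([0.5, 0.3, 0.2])
shannon = -np.sum(px * np.log(px))  # H(X) in nats
for q in (0.999, 1.0, 1.001):
    # Both entropies are close to (and, at q = 1, equal to) the Shannon entropy.
    print(q, tsallis_entropy(px, q), renyi_entropy(px, q), shannon)
```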
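The pseudo-additivity property (3) can likewise be checked numerically: for independent X and Y, the joint distribution is the outer product of the marginals, and at q = 1 the cross term vanishes, recovering extensivity. Below is a self-contained sketch under the same assumptions as above (the helper name is again ours).

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q via the closed form in eq. (2)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if q == 1.0:  # Shannon limit
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

px = np.array([0.5, 0.3, 0.2])
py = np.array([0.6, 0.4])
pxy = np.outer(px, py).ravel()  # joint distribution of independent X and Y

for q in (0.5, 1.0, 2.0):
    lhs = tsallis_entropy(pxy, q)
    rhs = (tsallis_entropy(px, q) + tsallis_entropy(py, q)
           - (q - 1.0) * tsallis_entropy(px, q) * tsallis_entropy(py, q))
    assert abs(lhs - rhs) < 1e-10  # pseudo-additivity, eq. (3)
```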