Modeling Hidden Topics on Document Manifold

Deng Cai, Qiaozhu Mei, Jiawei Han, Chengxiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
{dengcai2, qmei2, hanj, czhai}@cs.uiuc.edu

ABSTRACT

Topic modeling has been a key problem for document analysis. One of the canonical approaches for topic modeling is Probabilistic Latent Semantic Indexing (PLSI), which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates the probability distribution of each document over the hidden topics independently, and the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting. Latent Dirichlet Allocation (LDA) was proposed to overcome this problem by treating the probability distribution of each document over topics as a hidden random variable. Both of these methods discover the hidden topics in Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. Therefore, it is more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on an intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) for topic modeling. LapPLSI models the document space as a submanifold embedded in the ambient space and directly performs topic modeling on this document manifold. We compare the proposed LapPLSI approach with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides a better representation in the sense of semantic structure.

Categories and Subject Descriptors

G.3 [Mathematics of Computing]: Probability and Statistics—Statistical computing; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods

General Terms

Algorithms, Performance, Theory

Keywords

Probabilistic Latent Semantic Indexing, Manifold Regularization, Document Representation, Generative Model

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM'08, October 26–30, 2008, Napa Valley, California, USA.
Copyright 2008 ACM 978-1-59593-991-3/08/10 ...$5.00.

1. INTRODUCTION

Document representation has been a key problem for document analysis and processing [8][10][11]. The Vector Space Model (VSM) might be one of the most popular models for document representation. In VSM, each document is represented as a bag of words. Correspondingly, the inner product (or cosine similarity) is used as the standard similarity measure between documents, or between documents and queries. Unfortunately, it is well known that VSM has severe drawbacks, mainly due to the ambiguity of words (polysemy) and the personal style and individual differences in word usage (synonymy). To deal with these problems, IR researchers have proposed several dimensionality reduction techniques, most notably Latent Semantic Indexing (LSI) [8]. LSI uses a Singular Value Decomposition (SVD) of the term-document matrix X to identify a linear subspace (the so-called latent semantic space) that captures most of the variance in the data set.
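As a concrete illustration of this step, the following is a minimal sketch of rank-k LSI via a truncated SVD; the terms-by-documents layout, the variable names, and the use of NumPy are assumptions made for the sketch, not details taken from the paper.

import numpy as np

def lsi_document_vectors(X, k):
    """Project documents into a k-dimensional latent semantic space.

    X: term-document matrix (terms x documents), e.g. raw counts or tf-idf.
    Returns a (k x n_docs) matrix whose columns are the reduced documents.
    """
    # Truncated SVD: X is approximated by U_k diag(s_k) Vt_k.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_k, Vt_k = s[:k], Vt[:k, :]
    # Each column of diag(s_k) Vt_k is one document in the latent space.
    return np.diag(s_k) @ Vt_k

Similarities between documents are then computed with the inner product or cosine measure on these reduced column vectors rather than on the original term vectors.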
The general claim is that similarities between documents, or between documents and queries, can be more reliably estimated in the reduced latent space representation than in the original representation. LSI has received a lot of attention over the years and many variants of LSI have been proposed [1][20]. Despite its remarkable success in different domains, LSI has a number of deficits, mainly due to its unsatisfactory statistical formulation [12]. To address this issue, Hofmann [11] proposed a generative probabilistic model named Probabilistic Latent Semantic Indexing (PLSI). PLSI models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics." Each document is represented as a list of mixing proportions for these mixture components and is thereby reduced to a probability distribution over a fixed set of topics. This distribution is the "reduced representation" associated with the document. The major disadvantage of PLSI is that it estimates the probability distribution of each document over the hidden topics independently, and the number of parameters in the model grows linearly with the size of the corpus. This leads to serious problems with overfitting [16][5][19]. Latent Dirichlet Allocation (LDA) was then proposed to overcome this problem by treating the probability distribution of each document over topics as a K-parameter hidden random variable rather than a large set of individual parameters, where K is the number of hidden topics.

Both of the above topic modeling approaches discover the hidden topics in Euclidean space. However, there is no convincing evidence that the documents are actually sampled from a Euclidean space. Recent studies suggest that documents are usually sampled from a nonlinear low-dimensional manifold which is embedded in the high-dimensional ambient space [10][23]. Thus, the local geometric structure is essential for revealing the hidden semantics in the corpora.

In this paper, we propose a new algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI). LapPLSI considers topic modeling on the document manifold. It models the document space as a submanifold embedded in the ambient space and directly performs topic modeling on this document manifold. By discovering the local neighborhood structure, our algorithm can have more discriminating power than PLSI and LDA. Specifically, LapPLSI first builds a nearest neighbor graph to model the local structure of the document manifold. It is natural to assume that two sufficiently close documents have similar probability distributions over the topics. The nearest neighbor graph structure is then incorporated into the log-likelihood maximization as a regularization term for LapPLSI. In this way, the topic model estimated by LapPLSI maximizes the joint probability over the corpus and
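To make this construction concrete, the following is a sketch of what such a graph-regularized objective can look like; the exact form, the 0/1 nearest-neighbor weights W, and the trade-off parameter \lambda are assumptions for illustration rather than the paper's precise formulation.

\mathcal{L}_{\mathrm{reg}}
  = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)
  \;-\; \frac{\lambda}{2} \sum_{i,s=1}^{N} W_{is} \sum_{k=1}^{K} \bigl( P(z_k \mid d_i) - P(z_k \mid d_s) \bigr)^2

Here n(d_i, w_j) is the count of term w_j in document d_i, W_{is} is nonzero only when d_i and d_s are nearest neighbors on the graph, and \lambda \geq 0 controls the trade-off between the PLSI data likelihood (the first term) and the smoothness of P(z | d) across neighboring documents (the second term).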


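For further illustration, here is a minimal sketch of the nearest neighbor graph construction and of the regularizer above; the cosine similarity measure, the 0/1 weights, and all names are assumptions made for the sketch rather than the paper's implementation.

import numpy as np

def knn_graph(X, k=5):
    """Build a symmetric 0/1 k-nearest-neighbor graph over documents.

    X: (n_docs x n_terms) count or tf-idf matrix, one row per document.
    Returns an (n_docs x n_docs) weight matrix W.
    """
    normed = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = normed @ normed.T                  # cosine similarity between documents
    np.fill_diagonal(S, -np.inf)           # exclude self-neighbors
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(S[i])[-k:]] = 1.0  # connect each doc to its k nearest
    return np.maximum(W, W.T)              # symmetrize the graph

def graph_regularizer(P_zd, W):
    """(1/2) * sum_{i,s} W_is * ||P(z|d_i) - P(z|d_s)||^2 via the graph Laplacian.

    P_zd: (n_docs x K) matrix whose row i is the topic distribution P(z | d_i).
    """
    L = np.diag(W.sum(axis=1)) - W         # graph Laplacian L = D - W
    return np.trace(P_zd.T @ L @ P_zd)

During estimation, a term of this form, scaled by \lambda, would be subtracted from the PLSI log-likelihood, pushing neighboring documents toward similar topic distributions.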