Stanford CS 224 - Study Notes

A Labeled LDA Approach to Understanding the Dynamics of Collaboration

Nikhil Johri
Department of Computer Science
Stanford University
Stanford, CA
[email protected]

Abstract

We propose a topic modeling approach to understanding the nature of academic collaborations between individuals. Specifically, we use Labeled LDA (Ramage et al., 2009), a variation of the popular topic model Latent Dirichlet Allocation (Blei et al., 2003), to train a set of author-specific topics over the ACL corpus. The ACL corpus ranges from 1965 to 2009, and we train a separate topic model for each year over the papers published up to the given year. Once we have trained these models, we examine the influence present in a publication from each of its authors. As our metric, we use a function of the cosine similarity score between the document's term vector and each author's topic signature in the year preceding the document's publication. We suggest a theory of the different types of academic collaboration prevalent and discuss how our system performs at classifying high-impact papers into these types. Finally, we demonstrate how our system can be applied to answer several questions regarding the nature of collaborations in Computational Linguistics research.

1 Introduction

The understanding of academic collaboration has attracted much interest in the social sciences, and is recently gaining traction in computer science, particularly from the viewpoint of social network analysis.

We propose a theoretical framework for determining the types of collaboration present in a document, based on factors such as the number of established authors, the presence of unestablished authors, and the similarity of the established authors' past work to the document's term vector. These collaboration types attempt to describe the nature of co-authorships between students and advisors, as well as those solely between established authors in the field.
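The influence metric described in the abstract, a function of the cosine similarity between a document's term vector and an author's topic signature from the preceding year, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `author_influence` is a hypothetical name, the vectors are represented as sparse term-to-weight dictionaries, and the paper's exact weighting function around the raw cosine score is not specified in this preview.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse term vectors (dicts of term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def author_influence(doc_terms, author_signature):
    """Influence of one author on a document: similarity between the
    document's term vector and the author's topic signature learned
    from papers up to the year before the document's publication."""
    return cosine_similarity(doc_terms, author_signature)
```

For example, a paper whose term vector is dominated by the same terms that dominate an author's signature would score near 1, suggesting strong influence from that author's prior work.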
We present a decision diagram for classifying papers into these types, as well as a description of the intuition behind each collaboration class.

Once we have a proposed theory, we attempt to create a system that can automatically categorize collaborative works into their collaboration types. Labeled Latent Dirichlet Allocation (LLDA) (Ramage et al., 2009) is used to realize this task. LLDA has found success in a number of natural language processing topic modeling tasks, such as the credit attribution problem. For our system, we use LLDA to train topic models over the ACL corpus for every year. Each topic, in this case, pertains to a single author. Using the author signatures so obtained, we come up with a series of metrics to determine how we might classify each document.

We qualitatively analyze our results by examining the categorization of several high-impact papers. In consultation with prominent researchers and textbook writers in the field, we verify how accurate these results are. However, given the subjective nature of the categorization and collaboration types, it is hard to come up with a reasonable quantitative evaluation.

2 Related Work

In recent years, popular topic models such as LDA (Blei et al., 2003) have been increasingly used to study the history of science by observing the changing trends in term-based topics (Hall et al., 2008; Gerrish and Blei, 2010). In the case of Hall et al., regular LDA topic models were trained over the ACL Anthology on a per-year basis, and the changing trends in topics were studied from year to year. Gerrish and Blei's work computed a measure of influence by using Dynamic Topic Models (Blei and Lafferty, 2006) and studying the change of statistics of the language used in a corpus.

These models propose interesting ideas for utilizing topic modeling to understand scientific history.
However, our primary interest in this paper is the study of academic collaboration between different authors; we therefore look to learn models for authors instead of only documents. Popular topic models for authors include the Author-Topic Model (Rosen-Zvi et al., 2004), a simple extension of regular LDA that adds an additional author variable over the topics. The Author-Topic Model learns a distribution over words for each topic, as in regular LDA, as well as a distribution over topics for each author. Alternatively, Labeled LDA (Ramage et al., 2009), another LDA variation, offers us the ability to directly model authors as topics by considering them to be the topic labels for the documents they author.

In this work, we use Labeled LDA to directly model probabilistic term 'signatures' for authors. As in (Hall et al., 2008) and (Gerrish and Blei, 2010), we learn a new topic model for each year in the corpus, allowing us to account for changing author interests over time.

3 Methodology

3.1 Labeled Latent Dirichlet Allocation

Latent Dirichlet Allocation, or LDA (Blei et al., 2003), is a widely popular technique of probabilistic topic modeling in which each document in a corpus is modeled as a mixture of 'topics', which themselves are probability distributions over the words in the vocabulary of the corpus. LDA is completely unsupervised; it assumes that a latent topic layer exists and that each word is generated from one underlying topic in this set of latent topics.

Labeled Latent Dirichlet Allocation, or LLDA (Ramage et al., 2009), is a variation on the regular LDA topic model whereby a one-to-one correspondence is defined between topics and tags (or, in our domain, authors). This constrains each topic to correspond to exactly one author when learning the model. As a result, we retrieve for each author a direct distribution over terms. This distribution serves as a 'signature' for an author, dominated by the terms frequently used by the author.
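The per-year signature construction can be sketched as follows. Since this preview does not include the authors' LLDA implementation, normalized term counts over each author's papers stand in for the learned LLDA topic-word distribution; `author_signatures_by_year` and the `(year, authors, tokens)` record shape are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def author_signatures_by_year(papers, years):
    """For each year, build a term distribution ('signature') per author
    from all papers published up to and including that year.

    `papers` is an iterable of (year, author_list, token_list) tuples.
    Normalized term counts stand in for learned LLDA topic-word weights.
    """
    signatures = {}
    for year in years:
        counts = defaultdict(Counter)
        # Accumulate term counts for each author over papers up to `year`.
        for pub_year, authors, tokens in papers:
            if pub_year <= year:
                for author in authors:
                    counts[author].update(tokens)
        # Normalize each author's counts into a probability distribution.
        signatures[year] = {}
        for author, counter in counts.items():
            total = sum(counter.values())
            signatures[year][author] = {t: c / total for t, c in counter.items()}
    return signatures
```

This mirrors the cumulative training regime described above: an author's signature for a given year reflects only work up to and including that year, so comparing a document against the previous year's signatures avoids letting the document influence its own score.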
It is useful to note that the advantage of using Labeled LDA over another topic model comes from the fact that topics are constrained to pre-assigned labels. This assures us that each topic created will correspond to a single author, unlike regular LDA, where terms cluster based on semantic topics.

We train a separate LLDA model for each year in the corpus, training on only those papers written before and during the given year. Thus, we have separate 'signatures' for each author for each year, and each signature only contains information for the specific author's work up to and including the given year.

3.2 Types of Collaboration

There are several ways one can envision to differentiate between types of academic collaborations. We focus on three factors when