Beyond Keyword Search: Discovering Relevant Scientific Literature


Khalid El-Arini
Computer Science Department
Carnegie Mellon University
[email protected]

Carlos Guestrin
Machine Learning Department
Carnegie Mellon University
[email protected]

ABSTRACT
In scientific research, it is often difficult to express information needs as simple keyword queries. We present a more natural way of searching for relevant scientific literature. Rather than a string of keywords, we define a query as a small set of papers deemed relevant to the research task at hand. By optimizing an objective function based on a fine-grained notion of influence between documents, our approach efficiently selects a set of highly relevant articles. Moreover, as scientists trust some authors more than others, results are personalized to individual preferences. In a user study, researchers found the papers recommended by our method to be more useful, trustworthy and diverse than those selected by popular alternatives, such as Google Scholar and a state-of-the-art topic modeling approach.

Categories and Subject Descriptors
G.3 [Mathematics of Computing]: Probability and Statistics; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - query formulation, relevance feedback, retrieval models

General Terms
Algorithms, Experimentation

Keywords
personalization, citation analysis

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD’11, August 21–24, 2011, San Diego, California, USA.
Copyright 2011 ACM 978-1-4503-0813-7/11/08 ...$10.00.

1. INTRODUCTION
For generations, scientists have built upon the published work of their predecessors and contemporaries in order to make new discoveries. However, as the number of publications has grown, it has become increasingly difficult for scientists to find relevant prior work for their particular research. In fact, as early as 1755, the French philosopher Denis Diderot presciently forewarned that there would come a day when “it will be almost as convenient to search for some bit of truth concealed in nature as it will be to find it hidden away in an immense multitude of bound volumes,” [13]. Today, we can quantify this “immense multitude” to include tens of millions of articles published in tens of thousands of journals and conferences [43].

Currently, researchers primarily rely on keyword search of online indices such as Google Scholar and PubMed to help them combat this overload of information. While these tools are indispensable, there are many instances where a researcher’s information need cannot be easily specified as a simple string of keywords. Often, such a keyword query is either overly broad, returning many articles that are at best loosely related to the researcher’s specific need, or too narrow, potentially returning no articles at all. In these occasions, it may be more natural for the scientist to specify his query as a small set of papers rather than as a set of words. In particular, having already read some articles that are related to the specific task at hand, the scientist can ask, “given that these papers represent my immediate research focus, what else should I read?”

Here, we present an algorithm for discovering relevant scientific literature by responding to queries of this form.
More formally, given a small set of papers Q that we refer to as the query set, we seek to return a set A of additional papers that are related to the concept defined by the query. Intuitively, a paper that cites all of the articles in Q is likely to represent related research. Likewise, a paper that is cited by every article in Q might contain relevant background information. However, it is restrictive to require the papers in A to have a direct citation to or from every article in the query set, as such papers are not guaranteed to exist. Instead, we wish to select a set A that maximizes a more general notion of influence to and from the papers in Q.

2. MODELING SCIENTIFIC INFLUENCE
To define a notion of influence in scientific literature, we observe that the content of a publication is an amalgam of several sources, combining cited prior work with the authors’ novel insights and background experience. For a given collection of articles, ideas travel from cited papers to citing papers, and from earlier to subsequent papers by the same author (Figure 1A). Our notion of influence should capture this transfer of ideas, modeling both the extent to which ideas travel between documents, as well as their topical matter. To achieve such fine-grained detail, we define influence with respect to the individual concepts found in a document collection, which could be, e.g., technical terms or informative phrases. For example, we might say that the ideas transferred from one paper to another involve the concepts “energy” or “nitric oxide.”

Figure 2: An example from the PNAS data set, illustrating the edge weight computation for a node in G_oxygen. Solid black edges indicate citations, while dotted black edges indicate common authorship. The dotted gray edge refers to a paper sharing an author with 9467, but not containing the concept “oxygen.” Edge weights are assigned proportional to the bar chart, indicating the prevalence of “oxygen” in each parent node. The bars over 1829 and 7657 are shortened to one third of their original height (indicated in light gray), such that the contribution due to common authorship is equivalent to that of a single paper. The novelty node is only used to normalize the edge weights, and in this case is dominated in influence by the other articles.

For each concept c in our vocabulary of concepts C, we define a directed, acyclic graph G_c, where the nodes represent papers that contain c and the edges represent citations and common authorship. Figures 1B and 1C show two such graphs for a subset of articles from the Proceedings of the National Academy of Sciences (PNAS), for the concepts “plant” and “stress.” While a path between two nodes in such a graph may indicate influence with respect to a particular concept, mere existence of a path
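
The edge weight computation described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Paper` class, the `incoming_edge_weights` function, the `novelty_mass` constant, and all paper IDs and prevalence values are hypothetical. The sketch also assumes that same-author bars are divided by the number of earlier, uncited papers sharing an author with the target (including those that lack the concept), which is one reading of the "one third" scaling in the caption.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    paper_id: str
    year: int
    authors: frozenset
    cites: frozenset = frozenset()          # ids of papers this paper cites
    concept_prevalence: dict = field(default_factory=dict)  # concept -> prevalence


def incoming_edge_weights(target, papers, concept, novelty_mass=0.02):
    """Normalized weights on the edges entering `target` in the graph for `concept`.

    Assumption: `novelty_mass` is a made-up constant standing in for the
    novelty node; the excerpt does not say how it is set.
    """
    by_id = {p.paper_id: p for p in papers}
    raw = {}

    # Citation parents: weight proportional to the concept's prevalence in the parent.
    for pid in target.cites:
        parent = by_id.get(pid)
        if parent is not None and parent.concept_prevalence.get(concept, 0.0) > 0.0:
            raw[pid] = parent.concept_prevalence[concept]

    # Common-authorship parents: earlier, uncited papers sharing an author.
    same_author = [
        p for p in papers
        if p.paper_id != target.paper_id
        and p.year < target.year
        and p.authors & target.authors
        and p.paper_id not in target.cites
    ]
    for p in same_author:
        prevalence = p.concept_prevalence.get(concept, 0.0)
        if prevalence > 0.0:
            # Shrink each bar so all same-author papers together count as one paper.
            raw[p.paper_id] = prevalence / len(same_author)

    # The novelty node only takes part in the normalization.
    raw["novelty"] = novelty_mass
    total = sum(raw.values())
    return {node: weight / total for node, weight in raw.items()}


# Made-up collection: the target cites "a" and "b" and shares author "Z" with "c", "d", "e".
papers = [
    Paper("a", 2000, frozenset({"X"}), concept_prevalence={"oxygen": 0.10}),
    Paper("b", 2001, frozenset({"Y"}), concept_prevalence={"oxygen": 0.05}),
    Paper("c", 2002, frozenset({"Z"}), concept_prevalence={"oxygen": 0.06}),
    Paper("d", 2003, frozenset({"Z"}), concept_prevalence={"oxygen": 0.03}),
    Paper("e", 2004, frozenset({"Z"})),  # shares an author but lacks the concept
]
target = Paper("t", 2005, frozenset({"Z"}), cites=frozenset({"a", "b"}),
               concept_prevalence={"oxygen": 0.04})
papers.append(target)

weights = incoming_edge_weights(target, papers, "oxygen")
for node, weight in sorted(weights.items()):
    print(node, round(weight, 3))
```

In this toy collection, paper "e" plays the role of the dotted gray edge: it increases the count of same-author papers to three but receives no edge of its own because it does not contain the concept.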

