Stanford CS 224 - Learning Paraphrase Models from Google News Headlines

Learning Paraphrase Models from Google News Headlines

Dave Kale
(with thanks to Teg Grenager and Bill MacCartney)
CS 224N Final Project, Spring
[email protected]

Abstract

Data sources like the clusters of news headlines at Google News present an exciting opportunity to learn paraphrase models from data automatically. We present both a novel dataset and a novel approach to automatic, unsupervised learning of paraphrase models from that dataset. Leveraging existing NLP tools such as the Stanford Parser and lexical resources such as WordNet and Infomap, we constructed a system that first aligns the typed dependency graphs of large numbers of parallel headlines (on the order of hundreds) and then uses aligned paths between corresponding nodes as candidates for a paraphrase extraction system. We present some preliminary results in the form of actual learned paraphrase models. This project serves as a proof of concept for this approach and sheds some light on likely next steps.

1 Introduction

Paraphrases can prove especially troublesome for the task of recognizing textual entailment (RTE). Two multiword phrases that possess virtually the same meaning can vary in length, syntactic units and structure, individual words, and more. Many textual entailment systems include an alignment step during which corresponding words and structures are matched up or nodes and edges in typed dependency graphs are aligned (MacCartney et al., 2006). Such multiword expressions are particularly challenging because they span multiple nodes and edges and can greatly affect the performance of an alignment algorithm. In many RTE systems this bad performance is tolerated and compensated for in later steps. In other cases RTE systems attempt to collapse common multiword expressions into single nodes; this is done predominantly for named entities, compound nouns, and prepositional phrases.
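The path-between-nodes idea behind the system can be illustrated with a toy typed-dependency graph. The graph, headline, and relation labels below are invented for illustration, not output of the Stanford Parser; a minimal sketch of extracting the labelled path between two nodes might look like:

```python
from collections import deque

# Toy typed-dependency graph for the headline "Microsoft buys Skype".
# Edges map a governor to (relation, dependent) pairs; the words and
# relations here are illustrative, not real parser output.
GRAPH = {
    "buys": [("nsubj", "Microsoft"), ("dobj", "Skype")],
    "Microsoft": [],
    "Skype": [],
}

def dependency_path(graph, start, goal):
    """BFS over the dependency graph treated as undirected, returning
    the relation-labelled path between two nodes, or None."""
    # Build an undirected adjacency list with relation labels.
    adj = {}
    for gov, deps in graph.items():
        for rel, dep in deps:
            adj.setdefault(gov, []).append((rel, dep))
            adj.setdefault(dep, []).append((rel, gov))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel, nxt]))
    return None

print(dependency_path(GRAPH, "Microsoft", "Skype"))
# prints ['Microsoft', 'nsubj', 'buys', 'dobj', 'Skype']
```

Paths extracted this way between corresponding nodes of two aligned headline graphs would then serve as paraphrase candidates.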
Nevertheless, no RTE system has presented a comprehensively robust and broad method for handling paraphrasing.

One approach is the creation and use of a phrasal resource similar in practice to many more common lexical resources (WordNet, InfoMap, etc.). Perhaps the most famous is the Discovery of Inference Rules from Text (DIRT) database first presented in (Lin and Pantel, 2001), which was constructed automatically and is used most commonly for question answering. (Lin and Pantel, 2001) extracted their paraphrases by matching up typed dependency tree paths with high distributional similarity (i.e., statistically similar contexts). Other work has made similar use of distributional similarity and contexts (Hasegawa et al., 2004). More recent work has leveraged the common occurrence of Named Entities (NEs) in texts like newspaper articles, detecting, clustering, and utilizing them as anchors for paraphrase candidates (Shinyama and Sekine, 2003). Such systems extract paraphrases almost exclusively from phrases or contexts involving NEs. Both approaches deal with the task of finding (or creating) an appropriate dataset, which is time-consuming and challenging. Paraphrases, by definition, express some sort of common semantic content, and so paraphrases should be extracted from passages with similar, if not identical, meaning; however, identifying and aligning such passages requires some strategy for recognizing this kind of entailment or relation. While both of the approaches above find ways around this problem, they do so by trading off the scope and generality of the paraphrases they are able to extract (Sekine, 2005).

Thus, paraphrase modeling, particularly when done automatically on unlabeled texts, is an open problem and remains quite challenging.
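The distributional-similarity idea behind DIRT can be sketched as comparing the slot fillers observed with two dependency paths. DIRT's actual measure combines mutual information over governor and dependent slots; the cosine over raw counts below is a deliberate simplification, and the paths and counts are invented:

```python
import math
from collections import Counter

# Hypothetical slot-filler counts for two paths, e.g. "X solves Y"
# and "X finds a solution to Y" (all counts are invented).
PATH_A = Counter({("gov", "committee"): 4, ("gov", "team"): 2, ("dep", "problem"): 5})
PATH_B = Counter({("gov", "committee"): 3, ("gov", "team"): 1, ("dep", "problem"): 4})

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sim = cosine(PATH_A, PATH_B)  # close to 1.0: the paths share contexts
```

Path pairs whose context vectors score above some threshold would be proposed as mutual paraphrases.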
We present an alternative approach inspired by the availability of a novel data source and recent advances in the areas of textual entailment and graph alignment. The implemented system acts as a proof of concept for both the dataset and the approach, demonstrates promising results, and opens the door to promising future work.

1.1 Google News Dataset

The main problem with the task of learning paraphrase models from data automatically is the very nature of the data itself. By nature, the selection and preparation of appropriate data "begs the question." To discover and extract paraphrases requires the correspondence of semantically similar passages and phrases; however, detection of semantically similar passages requires some understanding of notions of paraphrase and entailment. Many systems get around this question with a variety of approximations (like bag-of-words keyword matching, distributional similarity, restricting data to named entities, etc.), but this places restrictions on the eventual results. Another possibility would be to compile and annotate a large dataset by hand (rather, by the hands of unpaid undergraduate research assistants), but this is tedious and annoying.

The strategy we have adopted is to allow large corporations with nearly unlimited resources and generally public products to do this on our behalf. We are, of course, referring to Google and, in particular, to its Google News service. Google News (in its own words) is "a computer-generated news site that aggregates headlines from more than 4,500 English-language news sources worldwide, groups similar stories together and displays them according to each reader's personalized interests" (Google News website, 2007). They cluster articles based not only on text-based features of the headlines and articles themselves but also on characteristics of their respective publications, publication time, web statistics, etc.
The end result is a website that is updated more than once per day and that at any one time has article headlines numbering in the hundreds of thousands. The articles are partitioned into one of seven topics (World, U.S., Business, Sci/Tech, Sports, Entertainment, and Health) and, within each topic, then more precisely assigned to clusters of very similar articles. Within each topic, the top 20 most salient (according to some Google measure) clusters are displayed, each containing between 300 and 1500 articles and headlines.

The end result is a large, publicly available database of news headline clusters. Within each cluster the headlines' similarity is consistent and somewhat uncanny: syntactic structure, word choice, and style vary, but it is clear that the large majority of headlines refer to the same event and contain very similar semantic content. These clusters are a fertile field for the harvesting of large classes of paraphrases grounded
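A headline cluster of this kind yields candidate paraphrase pairs almost for free. The cluster and the token-overlap filter below are a toy sketch (real clusters contain hundreds of headlines, and the paper's system uses graph alignment rather than token overlap); it only illustrates how pairs could be enumerated cheaply before more expensive processing:

```python
from itertools import combinations

# A toy headline cluster of the kind Google News displays; the
# headlines are invented for illustration.
CLUSTER = [
    "Microsoft to acquire Skype for $8.5 billion",
    "Microsoft buys Skype",
    "Skype sold to Microsoft in $8.5bn deal",
]

def candidate_pairs(cluster, min_overlap=2):
    """Yield headline pairs sharing at least `min_overlap` tokens,
    a cheap filter before a more expensive alignment step."""
    token_sets = [set(h.lower().split()) for h in cluster]
    for (i, a), (j, b) in combinations(enumerate(cluster), 2):
        if len(token_sets[i] & token_sets[j]) >= min_overlap:
            yield a, b

pairs = list(candidate_pairs(CLUSTER))  # all 3 pairs survive the filter
```

Each surviving pair would then be parsed and its typed dependency graphs aligned to extract candidate paraphrase paths.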

