Stanford CS 224 - Study Notes


Quote Clustering in Online News

Orr Keshet and Evan Rosen
{okeshet,erosen}@cs.stanford.edu

Abstract

The notion that information moves through social networks has been widely discussed [3]; however, with the growing availability of large digital corpora, the ability to quantitatively model this phenomenon is new. To this end we explore a large corpus of online news quotations, looking for cases of noisy reproduction and the factors which influence such noise. An essential step in this process is distinguishing mutational variants from entirely independent but similar quotations. Given that the question of whether two quotes really were derived from the same original utterance cannot be answered with complete certainty, it is not immediately apparent how to make progress on such a task. Our approach is twofold: on the one hand, we frame the problem in terms of supervised learning, annotating data using Mechanical Turk; on the other hand, we approach the problem from the perspective of unsupervised clustering, projecting the data into a variety of metrics, which allows us to test and extend our linguistic intuitions about the dataset.

1 Introduction

For the sake of this project, we restrict our attention to quote distance metrics. Such distance metrics can serve as the basis for arbitrarily complex clustering algorithms, as in [5, 4], and they also provide a relatively direct way to test linguistic intuitions. In the spirit of the formal definition of a metric, we primarily test our distance metrics on pairs of quotes, ignoring more global implications like transitivity for the moment.

The first component of this project involved sorting this large set of quote pairs by various distance metrics and hand-inspecting the results. In this way we were able to confirm or reject several plausible hypotheses regarding mutations of online news quotes. In general, our results are mostly negative, suggesting that noisy reproductions in the news are less frequent than originally expected.
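The pair-sorting step described above can be sketched as follows. The quote strings and the plain character-level Levenshtein metric here are illustrative stand-ins, not the paper's exact metrics or data:

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ranked_pairs(quotes):
    """Sort all quote pairs by distance, closest (likeliest variants) first."""
    return sorted(combinations(quotes, 2),
                  key=lambda p: edit_distance(p[0], p[1]))

# Invented example quotes: hand-inspection would start from the top of this list.
quotes = ["we will not rest until this is done",
          "we won't rest until this is done",
          "the economy is fundamentally sound"]
closest = ranked_pairs(quotes)[0]
```

In practice any metric (token overlap, weighted edits, etc.) can be dropped in for `edit_distance` without changing the ranking machinery.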
While the rejection of such sociological and linguistic assumptions is useful, we wanted a systematic way to ensure that we were in fact looking at the right dimensions of variation, or at least to distinguish between the usefulness of the metrics we proposed.

The design of a sensible evaluation scheme constitutes the second significant component of this work. This task presents two main obstacles. First, while a human reader consecutively reading two news articles on the same subject might easily tell when two different quotations are in fact variants of one another, it is much less obvious how to confidently reject two quotes as being variants of one another. The second issue arises more as a feature of the dataset: we only have access to the quotes themselves and thus cannot use any of the contextual information present in the article and webpage. Were we able to view the original news article in its entirety, it is likely that a human could make much more confident estimates about common quote origins.

Despite these challenges, we decided that human evaluation was still our best option, and it provides a reasonable baseline against which to compare our automated distance metrics. For example, a pair of quotes with edit distance one, differing in only a single semantically distinct word, could easily be mistaken for variants of the same quote by an edit distance metric, but would be immediately picked out by a human evaluator as semantically incompatible. We annotate quote pairs as plausible variants of a single original utterance using Amazon's Mechanical Turk and then compare these annotations with the results of a rudimentary clustering algorithm based on each metric on its own.

2 Dataset

The dataset for this work is taken from a large corpus of mainstream news websites and blogs, collected as part of the MemeTracker project [5].
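The failure mode just described, where a surface metric cannot see that a single substituted word flips the meaning, can be made concrete with a word-level diff. The quote strings below are invented examples, and this helper is an illustration of the problem, not part of the paper's pipeline:

```python
def differing_tokens(q1, q2):
    """If two quotes differ in exactly one aligned token, return that token
    pair; otherwise return None (word-level analogue of edit distance one)."""
    t1, t2 = q1.split(), q2.split()
    if len(t1) != len(t2):
        return None
    diffs = [(a, b) for a, b in zip(t1, t2) if a != b]
    return diffs[0] if len(diffs) == 1 else None

# A distance metric sees two near-identical strings; a human annotator
# immediately sees that "raise" vs "cut" makes them incompatible.
pair = differing_tokens("we will raise taxes next year",
                        "we will cut taxes next year")
# pair == ("raise", "cut")
```

Surfacing the offending token pair like this is one way such pairs could be routed to human judges rather than trusted to the metric.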
This amounts to a large set of text files which record, for each webpage, the URL, date of publication, any quotations, and any outgoing URL links, where the date of publication corresponds to the time at which the page was pushed to an RSS feed. While most of the pages are in English, a significant proportion are in Spanish and French, and even several less common languages like Bahasa Indonesia.

Interestingly, these rarer languages seemed to show up disproportionately in the set of close but non-identical quote pairs. For example, the following Indonesian phrases, which literally translate to "a technique of buying a home without capital," all showed up many times in the corpus.

"teknik membeli rumah tanpa modal"
"ada teknik membeli rumah tanpa modal"
"mau tahu teknik membeli rumah tanpa modal"
"mau tahuteknik membeli rumah tanpa modal"

We suspect that this result might be due to a different use of quotation marks in other languages, as this quote, taken literally, seems unlikely to have been a real person's statement. Another plausible explanation is that these quote variations are the result of a less standardized orthography. This situation seems to arise in places where the national written language is learned as a second language by many speakers. Another source of complication when dealing with foreign languages results from the fact that non-ASCII characters have been replaced with spaces. All of this is to say that the problem of disambiguating quotes is very language-specific, and we therefore focus only on English quotes, though non-English quotes remain in the data and make up edges in the link graph.

It is also worth noting that the dataset is extremely large. One month's worth of compressed quote data is approximately 1GB and consists of approximately 10 million webpages.
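Orthographic variants of the kind shown above, where the only differences are casing and missing or extra spaces, can be collapsed with a simple normalization pass. This is a minimal sketch of that idea, not a step the paper itself takes:

```python
import re

def normalize(quote):
    """Collapse case and all whitespace so spacing/casing variants
    ('mau tahuteknik ...' vs 'mau tahu teknik ...') map to one key."""
    return re.sub(r"\s+", "", quote.lower())

variants = [
    "mau tahu teknik membeli rumah tanpa modal",
    "mau tahuteknik membeli rumah tanpa modal",
]
keys = {normalize(v) for v in variants}
# both variants collapse to a single key
```

Such a key would not merge the prefix-extended variants ("ada ...", "mau tahu ..."), which still require a substring- or distance-based comparison.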
We had access to data from 2008 to the present, but never constructed a single graph from the entirety of the quotes, due to time and memory constraints.

3 Methodology

3.1 Graph Generation

To better focus on requotation and reference in online news, we wanted to bias our dataset towards well-connected webpages. That is, we looked for a collection of webpages each of which had at least one link to another webpage in the corpus. By filtering out any pages which did not contain any quotes or links to other pages in the corpus, we were able to significantly reduce the pages under consideration, to approximately 300,000 per month. This significantly reduced the computational costs of our algorithm. Moreover, this filtering has the effect of reducing the number of connected components, making for more informative
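A minimal sketch of this filtering pass, assuming a hypothetical in-memory representation where each page record carries its quotes and outgoing links (the corpus files would in practice be parsed and streamed):

```python
def filter_connected(pages):
    """Keep only pages with at least one quote or at least one outgoing
    link to another page in the corpus; drop purely external links.
    `pages` maps url -> {"quotes": [...], "links": [...]}."""
    in_corpus = set(pages)
    kept = {}
    for url, rec in pages.items():
        internal = [l for l in rec["links"] if l in in_corpus]
        if rec["quotes"] or internal:
            kept[url] = {"quotes": rec["quotes"], "links": internal}
    return kept

# Tiny invented corpus: "c" has no quotes and only an external link, so it
# is dropped; "a" and "b" survive the filter.
pages = {
    "a": {"quotes": [], "links": ["b"]},
    "b": {"quotes": ["we will not rest"], "links": []},
    "c": {"quotes": [], "links": ["http://external.example"]},
}
kept = filter_connected(pages)
```

Dropping isolated pages up front keeps the per-month graph small enough to build in memory.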

