CS 224N Final Project: Unsupervised Clustering of People, Places, and Organizations in Wikileaks Cables with NLP Cues
Xuwen Cao, Beyang Liu
March 11, 2011

1 Purpose and Overview

Our goal is to extract the names of key entities from written U.S. diplomatic communications and then to apply natural-language- and sentiment-based clustering to these entities using contextual features extracted from the data. We extract entities using an off-the-shelf statistical NLP package, and then seek to generate meaningful clusters of entities in an unsupervised fashion. To do so, we experiment with different models, feature sets, and clustering algorithms.

For the purposes of this project, we define key entities to be people, nations, and organizations that occur at least k times in the dataset. We determined k through empirical evaluation of the entities extracted from the data, as well as the needs of different usage scenarios.

2 Data

Our data is the set of American diplomatic cables published by Wikileaks (http://mirror.wikileaks.info/) as of 10 February 2011 as part of its Cablegate release. It is available for public download as a torrent on the Cablegate website (http://213.251.145.96/cablegate.html). The data encompasses 3891 cables, mostly sent to and from American embassies abroad by U.S. State Department officials. The cables, which were initially intended only for internal State Department use, span 7 levels of classification: "CONFIDENTIAL," "CONFIDENTIAL/NOFORN," "SECRET," "SECRET/NOFORN," "UNCLASSIFIED," "UNCLASSIFIED/FORN," and "OFFICIAL USE ONLY." (The designation "NOFORN" indicates that the cable should not be viewed by foreign eyes.) Each cable includes a header indicating the identification number of the cable, the date of its transmission, its classification level, the sender (usually an embassy or U.S. government office in Washington D.C.), the recipient (usually a set of embassies and/or government offices in Washington D.C.), tags indicating subject matter, a subject line, and finally the body. The contents of the cables include many frank assessments of foreign governments, individuals, and organizations by State Department officials. They also shed light on American diplomatic tactics and present accounts of events previously unreported in the media. An overview of the revelations provided by the release of the cables can be found at: http://www.nytimes.com/2010/11/29/world/29cables.html.

3 Stanford NLP Package Components

The Stanford NLP group offers an integrated suite of core Natural Language Processing tools (http://nlp.stanford.edu/software/corenlp.shtml), which we used for this project. We used the Named Entity Recognition (NER) and Part-of-Speech (POS) tagging components of the package.

3.1 Named Entity Recognition

The NER component utilizes a Conditional Random Field (CRF) classifier [3] and identifies the following categories of named entities: PERSON, LOCATION, ORGANIZATION, and MISC. It uses a combination of three CRF sequence taggers trained on various corpora. Our program finds 21584 lexically unique people, 7754 locations, and 24891 organizations in the leaked cables.

3.2 Part-of-Speech Tagging

We utilize a maximum entropy POS tagger [5] to assign each word in the data one of the POS labels from the Penn Treebank tag set [4].
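To illustrate how the annotated output of these two components can be consumed downstream, the following is a minimal sketch (not the project's actual script) that tallies lexically unique PERSON, LOCATION, and ORGANIZATION mentions. It assumes the standard CoreNLP XML layout, in which each <token> element carries <word>, <POS>, and <NER> children; the directory path "annotated_cables/" is a placeholder, lowercased surface strings stand in for "lexically unique" mentions, and adjacent tokens sharing the same non-O label are merged into one mention as a simple heuristic.

# Sketch: count lexically unique PERSON, LOCATION, and ORGANIZATION mentions
# in CoreNLP XML output. Adjacent tokens with the same non-O NER label are
# merged into a single entity mention before counting.
import glob
import xml.etree.ElementTree as ET
from collections import defaultdict

def collect_entities(xml_paths):
    entities = defaultdict(set)  # NER label -> set of lexically unique mentions
    for path in xml_paths:
        root = ET.parse(path).getroot()
        for sentence in root.iter("sentence"):
            label, words = None, []
            for token in sentence.iter("token"):
                ner = token.findtext("NER", default="O")
                word = token.findtext("word", default="")
                if ner == label and ner != "O":
                    words.append(word)
                else:
                    if label not in (None, "O"):
                        entities[label].add(" ".join(words).lower())
                    label, words = ner, [word]
            if label not in (None, "O"):
                entities[label].add(" ".join(words).lower())
    return entities

if __name__ == "__main__":
    # "annotated_cables/" is a placeholder for wherever the CoreNLP XML output lives.
    counts = collect_entities(glob.glob("annotated_cables/*.xml"))
    for category in ("PERSON", "LOCATION", "ORGANIZATION"):
        print(category, len(counts[category]))

A script along these lines produces per-category counts comparable to those reported in Section 3.1, though the exact numbers depend on how mentions are normalized.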
4 Model: K-means Clustering in Sentiment/Frequency Space

4.1 Hypothesis

We choose sentiment scores and entity frequency as our two major features for K-means. In our data analysis, we assume two weak correlations: first, that the sentiment of the adjectives surrounding an entity reflects the U.S.'s attitude toward that entity; second, that the frequency with which an entity appears in the cables correlates with its importance to the U.S. These two characteristics should therefore allow us to make interesting discoveries about U.S. diplomacy. Moreover, this choice allows us to evaluate the clustering results extrinsically.

4.2 Data Preparation

After we process the cables with the Stanford NLP package, we obtain XML files of the original cables annotated with named entity labels and POS tags. We compress these files into a more compact format for easier text processing, for example:

The[the,DT,O]Political[Political,NNP,O]Director[Director,NNP,O]emphasized[emphasize,VBD,O],[,,,,O]however[however,RB,O],[,,,,O]that[that,IN,O]Abbas[Abbas,NNP,PERSON]’[’,POS,O]remark[remark,NN,O]was[be,VBD,O]meant[mean,VBN,O]only[only,RB,O]as[as,IN,O]an[a,DT,O]example[example,NN,O],[,,,,O]and[and,CC,O]not[not,RB,O]as[as,IN,O]an[a,DT,O]explicit[explicit,JJ,O]suggestion[suggestion,NN,O].[.,.,O]

For simplicity, we focus on LOCATION entities (countries, regions, and cities). We extract every word tagged both as LOCATION and as a noun (NN, NNS, NNP, NNPS) and record the adjectives (POS-tagged JJ, JJR, or JJS) in the sentences containing the LOCATION word. After building a dictionary with LOCATIONs as keys and their associated adjectives as values, we compute a sentiment score for every collected adjective based on SentiWordNet (http://sentiwordnet.isti.cnr.it/). SentiWordNet annotates each word with three scores in the interval [0, 1]: positivity, negativity, and neutrality [1]. In our case, we compute the sentiment score as the difference between positivity and negativity.

For each dictionary key (LOCATION), we take a weighted average over all adjectives with non-zero sentiment scores. Only non-zero scores are included because the vocabulary of SentiWordNet is limited, and LOCATION words that appear more frequently would otherwise be penalized more heavily when most of their adjectives receive zero scores. Since all the cables relate to the U.S. to some extent, we calibrate the mean of the dataset against the score attributed to the U.S. (the location entities 'u.s.', 'us', 'united states', and 'usa'), which is around -0.02. In the last step, we normalize the scores by the maximum absolute score so that the data points spread more widely across the interval [-1, 1].

For the entity frequency feature, we simply count how many times a given LOCATION appears across all cables. Because of the power-law distribution, the majority of frequencies cluster near the lower end, so we take the logarithm and normalize the log frequencies to the interval [0, 1]. Moreover, to include only sentiment scores that are statistically sound, we apply a threshold on the appearance frequency, empirically determined to be 200.

4.3 K-means

After the data preparation step, we obtain a 2D space of sentiment scores and entity frequencies. We use K-means for clustering the location entities by
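As a concrete illustration of the feature preparation described in Section 4.2, the following is a minimal sketch rather than the project's actual code. It assumes the compressed word[lemma,POS,NER] format shown above, supplied one sentence per input string; sentiwordnet_score() is a placeholder for a real SentiWordNet lookup returning (positivity, negativity); the weighted average is simplified to a plain mean, and the U.S. calibration is read as subtracting the mean U.S. score.

# Sketch of the Section 4.2 feature preparation. Cables are assumed to be in the
# compressed word[lemma,POS,NER] format, one sentence per string.
import math
import re
from collections import Counter, defaultdict

# Matches one compressed token, whether or not tokens are whitespace-separated.
TOKEN_RE = re.compile(r"(\S+?)\[([^\[\]]*)\]")
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
US_ALIASES = {"u.s.", "us", "united states", "usa"}
MIN_FREQ = 200  # empirically chosen appearance threshold

def sentiwordnet_score(adjective):
    """Placeholder: return (positivity, negativity) for the adjective from SentiWordNet."""
    raise NotImplementedError

def parse_tokens(sentence):
    """Yield (word, lemma, pos, ner) tuples from a compressed-format sentence."""
    for word, annotation in TOKEN_RE.findall(sentence):
        # Split into lemma, POS, NER; punctuation tokens may split oddly, which is harmless here.
        fields = annotation.rsplit(",", 2)
        if len(fields) == 3:
            yield word, fields[0], fields[1], fields[2]

def build_features(sentences):
    adj_scores = defaultdict(list)  # location -> list of (positivity - negativity) scores
    freq = Counter()                # location -> number of appearances
    for sentence in sentences:
        tokens = list(parse_tokens(sentence))
        locations = [w.lower() for w, _, pos, ner in tokens
                     if ner == "LOCATION" and pos in NOUN_TAGS]
        adjectives = [w.lower() for w, _, pos, _ in tokens if pos in ADJ_TAGS]
        for loc in locations:
            freq[loc] += 1
            for adj in adjectives:
                positivity, negativity = sentiwordnet_score(adj)
                score = positivity - negativity
                if score != 0.0:  # keep only non-zero sentiment scores
                    adj_scores[loc].append(score)

    # Plain mean per location (the report uses a weighted average, not reproduced here),
    # restricted to locations above the frequency threshold.
    raw = {loc: sum(s) / len(s) for loc, s in adj_scores.items() if freq[loc] >= MIN_FREQ}
    # Calibrate against the U.S. score, then normalize by the maximum absolute score.
    us_scores = [raw[alias] for alias in US_ALIASES if alias in raw]
    us_mean = sum(us_scores) / len(us_scores) if us_scores else 0.0
    centered = {loc: s - us_mean for loc, s in raw.items()}
    max_abs = max((abs(s) for s in centered.values()), default=1.0) or 1.0
    sentiment = {loc: s / max_abs for loc, s in centered.items()}

    # Log frequency normalized to [0, 1].
    max_log = max((math.log(freq[loc]) for loc in sentiment), default=1.0) or 1.0
    frequency = {loc: math.log(freq[loc]) / max_log for loc in sentiment}
    return sentiment, frequency

The 200-occurrence threshold, the positivity-minus-negativity score, and the log-frequency normalization follow the choices described above; the regex and the handling of punctuation tokens are incidental details of the sketch.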


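Given the two normalized features, the clustering step of Section 4.3 reduces to running K-means over a set of 2D points. The sketch below uses scikit-learn's KMeans as one possible implementation; the library choice, the helper name cluster_locations, and the value of k are not fixed by the text above and are shown here only for illustration.

# Sketch of the Section 4.3 clustering step: K-means on the 2D
# sentiment/frequency space produced by build_features() above.
import numpy as np
from sklearn.cluster import KMeans

def cluster_locations(sentiment, frequency, k):
    """sentiment, frequency: dicts mapping a location string to its feature value."""
    locations = sorted(sentiment)
    X = np.array([[sentiment[loc], frequency[loc]] for loc in locations])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    clusters = {}
    for loc, label in zip(locations, labels):
        clusters.setdefault(label, []).append(loc)
    return clusters

# Example usage (k is a free parameter here):
# clusters = cluster_locations(sentiment, frequency, k=5)
# for label, locs in sorted(clusters.items()):
#     print(label, locs)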