Appeared in: Proceedings of the Class of 2005 Senior Conference, pages 34-41
Computer Science Department, Swarthmore College

Report on Political Leaning Classification

Ben Mitchell & Zach Pezzementi
Computer Science Department
Swarthmore College
Swarthmore, PA 19081
{mitchell,zap}@cs.swarthmore.edu

Abstract

The tasks of document classification and sentiment classification have been explored in the literature, but to our knowledge the task of political classification has not. We use a modified form of a document classification algorithm (Hu and Liu, 2004) to classify newspapers as liberal, conservative, or neutral based on their text. By using a cosine similarity metric in our feature space, we were able to achieve distances that separated openly liberal from openly conservative papers. According to the same metric, we found Time and Newsweek to be fairly centrist, as their distances from liberal and conservative papers were about the same, while the Chicago Tribune displayed a distinct liberal bias. This feature space shows promise for further sentiment or document classification work.

1 Introduction

Document classification is the task of grouping a set of documents based on their content, usually into a fixed number of predefined categories. Document classification schemes have been developed for use in specific domains, such as classifying news stories (Yang et al., 1999) or grouping web-posted job openings (Cohen and Hirsh, 1998), as well as more generic algorithms designed to work across many domains (Schohn and Cohn, 2000). The classes are usually broad topics, picked in advance (for example, classifying sports articles as being about baseball, football, or basketball). A fairly simple Bayesian bag-of-words model has been shown to be successful in document classification tasks (Baker and McCallum, 1998).

Sentiment classification attempts to group documents according to the sentiment of the author with respect to the subject. Most previous studies have defined the sentiment classification task as integrating aspects of document classification and text summarization (Fei et al., 2004; Hu and Liu, 2004; Pang et al., 2002). The goal is typically to classify each document (often a product review) as being a member of one of two classes, either positive or negative, though attempts at more complex classification schemes have been made (Yi et al., 2003).

We expected the problem of political sentiment classification to require somewhat different techniques from those used in document classification or standard sentiment classification. Firstly, in typical sentiment classification tasks, the text used as input is written specifically to communicate the information the algorithm is trying to extract. A product review, for example, is written with the intention of expressing the sentiment of the reviewer with respect to the product being reviewed. The sentiment we are trying to detect, on the other hand, is not necessarily stated explicitly within the text. Similarly, most document classifiers need only identify the main topic of a passage in order to make their classification decision, whereas we specifically want to avoid distinguishing between articles based primarily on their main topic. To help avoid classifying based on content, we limited our data to articles on a single topic; we chose the United States' war in Iraq as a topic since it was frequently in the news and was also a subject of political contention.

Preliminary tests suggested that unigram probabilities are insufficient for our classification task (see Table 1). The results in this table represent the top ten words in each of two corpora, where words are ranked by number of occurrences, and words that appeared in both corpora were eliminated. To a human observer, there does not appear to be a strong signal of political leaning in these data. For this reason, we used a more complex feature space to do our classification.

Table 1: Top Ten Most Frequent Words Which Occur in Only One Corpus

    Nation      Freq    National Review    Freq
    dlc         61      guevara            33
    un          60      u.n.               29
    durbin      42      gannon             26
    henry       40      official           26
    trotsky     39      chavez             23
    falluja     35      pollack            22
    guernica    32      kim                20
    deutscher   31      ortega             19
    nevada      28      mithal             15
    women's     25      post-war           14
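The following is a minimal Python sketch of the kind of unigram comparison behind Table 1. It is not the implementation used for the paper: the file names and tokenizer are illustrative assumptions, and only the general procedure (rank words by raw count, discard words shared by both corpora, keep the top ten on each side) follows the description above.

    from collections import Counter
    import re

    def top_exclusive_unigrams(corpus_a, corpus_b, n=10):
        """Rank words by raw frequency in each corpus, drop any word that
        appears in both corpora, and return the top-n survivors per side."""
        def tokenize(text):
            # Simple illustrative tokenizer; keeps internal apostrophes, periods,
            # and hyphens so tokens like "u.n.", "women's", "post-war" survive.
            return re.findall(r"[a-z][a-z'.\-]*", text.lower())
        counts_a = Counter(tokenize(corpus_a))
        counts_b = Counter(tokenize(corpus_b))
        only_a = [(w, c) for w, c in counts_a.most_common() if w not in counts_b]
        only_b = [(w, c) for w, c in counts_b.most_common() if w not in counts_a]
        return only_a[:n], only_b[:n]

    # Hypothetical file names standing in for the two single-publication corpora.
    with open("nation.txt") as f_a, open("national_review.txt") as f_b:
        left, right = top_exclusive_unigrams(f_a.read(), f_b.read())
    for (wa, ca), (wb, cb) in zip(left, right):
        print(f"{wa:<12}{ca:<8}{wb:<16}{cb}")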
2 Procedure

The features we use for classifying documents are distributions of association rule confidences, as described in (Hu and Liu, 2004). For a given document consisting of a set of words W divided into a set of sentences S, an association rule expresses the likelihood that two separate word phrases X and Y will occur in the same sentence, with an implication that the presence of X causes Y to appear. The rule is written X → Y, where X ⊂ W, Y ⊂ W, and X ∩ Y = ∅; that is, X and Y are word phrases that do not overlap. For our purposes, X and Y are always single words. Two statistics, support s and confidence c, are calculated for each possible word association (every pair of words which occur together in at least one sentence). Support measures the number of places in the text where the two words could be associated, and is defined as the fraction of sentences in the corpus that contain either X or Y,

    s = occ(X ∪ Y) / |S|,

where occ(w) is the number of sentences containing w. Confidence measures how strongly we believe the presence of X causes the presence of Y, and is defined as the fraction of sentences containing X which also contain Y,

    c = occ(X ∧ Y) / occ(X).

By imposing thresholds on both c and s (c-thresh and s-thresh), we select for a given document a number of association rules which both occur somewhat frequently (high support) and have fairly strong causality (high confidence). We further filter these rules by requiring that the "term-sentence frequency" of the second term in the rule, Y, be smaller than a third threshold, t-thresh. The term-sentence frequency of a word is defined as the number of sentences containing that word divided by the total number of sentences in the document. This restriction eliminates unimportant rules on very common words like "the" and "of," which would otherwise have very high confidence. The particular values used for these thresholds were s-thresh = 0.01, c-thresh = 0.1, and t-thresh = 0.2. The number of association rules which pass this final threshold defines the length of our feature vector for a given document. Some sample rules are given in Table
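To make the rule-selection step concrete, here is a minimal Python sketch, under the assumption that the document has already been split into sentences of lowercased word tokens. The function name and data layout are illustrative rather than the authors' code; the support, confidence, and term-sentence-frequency tests follow the definitions given in this section, with the same default thresholds.

    from collections import Counter
    from itertools import combinations

    def association_rules(sentences, s_thresh=0.01, c_thresh=0.1, t_thresh=0.2):
        """Confidences of single-word rules X -> Y that pass all three
        thresholds; `sentences` is a list of sets of word tokens."""
        n = len(sentences)
        # occ(w): number of sentences containing word w
        occ = Counter(w for sent in sentences for w in sent)
        # occ(X ∧ Y): number of sentences containing both words of a pair
        pair_occ = Counter()
        for sent in sentences:
            pair_occ.update(combinations(sorted(sent), 2))

        rules = {}
        for (x, y), both in pair_occ.items():
            for a, b in ((x, y), (y, x)):  # each unordered pair gives two candidate rules
                support = (occ[a] + occ[b] - both) / n  # occ(X ∪ Y) / |S| by inclusion-exclusion
                confidence = both / occ[a]              # occ(X ∧ Y) / occ(X)
                tsf_y = occ[b] / n                      # term-sentence frequency of the consequent Y
                if support >= s_thresh and confidence >= c_thresh and tsf_y < t_thresh:
                    rules[(a, b)] = confidence
        return rules

The confidences of the surviving rules would then populate the document's feature vector, whose length varies from document to document as described above.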

