Correlating Language Model Clusters of News Agencies and Political Campaigns with Political Biases
CS224N Final Project - Winter 2011
Yaron Friedman, Issao Fujiwara

Abstract

Our objective is to understand how high-level features such as political affiliation and orientation are expressed in low-level statistical features such as 1-, 2-, and 3-gram language models. We crawl and scrape English text from articles on eight major news agencies' sites and nine politicians' campaign sites. We train a language model on each corpus and then evaluate every other corpus against each of these models. We apply clustering techniques to the resulting data, analyzing how the clusters correlate with political affiliation and orientation. We observed interesting features in this data, such as the consistent clustering of the Obama and Biden corpora, as well as a significant split of the candidate cluster by party affiliation. Finally, we define a method for extracting the words most relevant to distinguishing the different corpora, which gives some intuition about the observed results.

Data Collection

Our goal here was to collect text data from different news sources and from politicians' campaign material. We were unable to locate good stock sources for these, so we decided to fetch the data ourselves by extracting English text from the relevant websites. This process turned out to be significantly harder than we originally expected, but with a lot of heuristics we were able to get a decent amount of high-quality English text from most of the sources we looked at. It's worth noting that we weren't necessarily able to gather complete texts, but we have high confidence that the text we gathered is primarily article content.

For the crawling component of obtaining the data, we investigated many of the free web crawlers that are publicly available. Among the crawlers we investigated, we spent the most time with Heritrix [1], the crawler used by the Internet Archive, and WebSPHINX [2], developed at CMU. While both are powerful and sophisticated crawlers, it was actually quite hard to get them to achieve our goal of reliably obtaining as many HTML pages served from a given domain as possible. Surprisingly to us, we ended up having the best luck with the recursive download functionality of wget [3], a widely available UNIX command-line tool:

wget -r -l inf -D <domain> <target> -R w -o log-<source>

With that strategy, we were able to reliably download large HTML corpora for each of the news agency and presidential candidate websites we were interested in.

Once we obtained the raw HTML data for each corpus, we set out to extract the relevant English text from it. Our strategy involved parsing the HTML of each page using the BeautifulSoup [4] library and iterating over several heuristics in order to remove as much non-English text as possible. We first stripped out any text that was inside a <script>, <style>, <head>, <meta>, or <option> tag, as well as HTML comments. From there we extracted all text that was a descendant of a <p> tag, flattening all the text nodes in its subtree. By itself, that strategy was enough to get us most of the way towards having only English text in our data, but there was still a lot of extraneous content that made building language models from it hard.
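The report does not reproduce the extraction code itself, so the following is only a minimal sketch of the per-page step described above, assuming BeautifulSoup 4; the function name extract_article_text and the page.html input path are illustrative and not taken from the project.

# Minimal sketch of the per-page extraction step described above.
# Assumes BeautifulSoup 4 (bs4); names and paths are illustrative.
from bs4 import BeautifulSoup, Comment


def extract_article_text(html):
    """Return the text segments found under <p> tags, boilerplate tags removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Strip tags whose contents are never article text.
    for tag in soup(["script", "style", "head", "meta", "option"]):
        tag.decompose()

    # Strip HTML comments as well.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Keep only text that is a descendant of a <p> tag, flattening each subtree.
    segments = []
    for p in soup.find_all("p"):
        text = " ".join(p.get_text(separator=" ").split())
        if text:
            segments.append(text)
    return segments


if __name__ == "__main__":
    with open("page.html", encoding="utf-8") as f:  # hypothetical input file
        for segment in extract_article_text(f.read()):
            print(segment)

Even with only <p> text kept, each segment can still carry extraneous content, which the heuristics described next try to remove.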
Examples of this extraneous content included lots of navigational text, copyright notices, other boilerplate English text such as browser-compatibility notices and error pages, and finally ads, comments, and related tweets and Facebook messages. We applied a few more heuristics in order to produce a corpus clean enough to use for building language models. We started with a few rules, evaluated the resulting output, and iteratively refined them until we had text that appeared to be strictly article text. We ended up with the following heuristics (a rough code sketch of these steps appears below):

● Convert all text to lowercase.
● Collapse runs of whitespace into single spaces.
● Strip punctuation that isn't useful (: ; “ - *) and split the remaining punctuation marks from the surrounding text. We tried to follow the same conventions here as the training data provided to us in PA1.
● Remove text blocks which contain any of the following phrases:
○ “copyright”
○ “current browser”
○ “all rights reserved”
○ “privacy statement”
○ “paid for by”
○ “terms of service”
○ “we are not liable”
○ “we experienced an error”
● Replace URLs with a “[url]” token. The intent is to reduce sparseness, since URLs are unlikely to be repeated, while still preserving sentence structure and the signal of whether particular sources contain embedded citations.
● Remove standalone single sentences (treating an ellipsis as a period) as well as sentences that contain fewer than 10 words. These heuristics were very useful for stripping out navigational text and comments embedded in the pages.
● Remove short text segments, where a text segment is the unit of text defined by one <p> subtree. Note that for some forms of NLP analysis this heuristic (as well as the one above) would bias the language of the news sources and could modify the perceived writing style (e.g. one news source may use lots of short sentences), but since we're focused on n-gram language models, we don't believe this bias towards larger blocks of text significantly affects our language models' behavior.

Our scraping and post-processing resulted in the following corpora for evaluation:

Source                                  Raw Data   Parsed and Processed Data
www.latimes.com                         612MB      2.4MB
www.cnn.com                             475MB      12.9MB
www.huffingtonpost.com                  711MB      6.8MB
www.foxnews.com                         126MB      1.7MB
www.nytimes.com                         210MB      2.7MB
www.bbc.co.uk                           165MB      3.2MB
www.msnbc.msn.com                       545MB      14.8MB
www.washingtonpost.com                  230MB      2.3MB
www.4biden.com                          11MB       3MB
www.barackobama.com                     20MB       1.6MB
www.ronpaulforcongress.com              5MB        152KB
www.johnmccain.com                      5MB        137KB
www.jerrybrown.org                      14MB       914KB
kucinich.house.gov                      102MB      11.9MB
chrisdodd.com                           5MB        85KB
freestrongamerica.com (Mitt Romney)     62MB       3.2MB
hillary4president.org                   35MB       504KB

Future Work for Data Extraction Component

Even with our heuristics, which aggressively strip short text (comments, tweets, etc.), we observed that there was
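As a concrete illustration of the normalization heuristics listed above, here is a rough sketch of how they might be implemented. The names clean_segment and BLOCKLIST_PHRASES, the exact punctuation sets, and the sentence-splitting details are assumptions made for illustration; the report does not spell out the PA1 tokenization conventions it followed, so this approximates rather than reproduces the project's actual rules.

import re

# Rough sketch of the cleaning heuristics listed earlier. The constant and
# function names, punctuation sets, and sentence-splitting details are
# assumptions; the report does not give the full rule set.
BLOCKLIST_PHRASES = [
    "copyright", "current browser", "all rights reserved", "privacy statement",
    "paid for by", "terms of service", "we are not liable",
    "we experienced an error",
]

URL_RE = re.compile(r"(https?://|www\.)\S+")


def clean_segment(segment, min_words=10):
    """Normalize one <p>-level text segment; return the sentences kept from it."""
    text = segment.lower()                       # convert all text to lowercase
    if any(phrase in text for phrase in BLOCKLIST_PHRASES):
        return []                                # drop boilerplate blocks outright
    text = URL_RE.sub("[url]", text)             # reduce sparsity from one-off URLs
    text = text.replace("...", ".")              # treat an ellipsis as a period
    text = re.sub(r'[:;"*-]', " ", text)         # strip punctuation deemed not useful
    text = re.sub(r"([.,!?])", r" \1 ", text)    # split remaining punctuation off words
    text = " ".join(text.split())                # collapse runs of whitespace

    # Split on the now-isolated period tokens, drop standalone single sentences,
    # and drop sentences with fewer than min_words tokens.
    sentences = [s.strip() for s in text.split(" .") if s.strip()]
    if len(sentences) < 2:
        return []
    return [s for s in sentences if len(s.split()) >= min_words]

Applied to every segment produced by the extraction step, a routine along these lines would yield the lowercase, whitespace-normalized article sentences on which the n-gram language models are then trained.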

