Stanford CS 224 - Political Party, Gender, and Age Classification Based on Political Blogs

This preview shows page 1-2-3-4 out of 11 pages.

Michelle Hewlett, Elizabeth Lingg

Political Party, Gender, and Age Classification Based on Political Blogs

Introduction

Motivation

The ability to classify or identify individuals based on their writing is an important problem in machine learning and natural language processing. Is there a difference in writing style based on gender? Do individuals under 25 use different punctuation than those 25 or older? Is it possible to determine someone's political ideologies simply based on keywords? There are many potential applications in targeted advertising, search, author information, and identification. We examine the problem of identifying bloggers based on features in their blog posts. Our goal is to identify bloggers' age, gender, and political party.

Data

Data collection was a challenge for this project. There are no known public corpora for blogs. Also, we were interested in recent blog data about the upcoming election, and there were no public corpora available for this specific task. We found 500 blogs online with 10 entries each, or fewer if the blogger had written fewer than 10 entries. We used a variety of different media: the authors' websites, Blogger.com, LiveJournal, Myspace, etc. We collected blogs with recent entries. We also hand-labeled information that the blogger provided, such as age, gender, and political party. We confirmed that the self-identified political party was correct by reading the blog.

Experimental Method

We used two primary methods of classification for political party, gender, and age. First, we did classification based on salient features. We separated our data into a training set and a test set using hold-out cross-validation. We generated a feature vector based on the training data and tested it with the held-out test data. Second, we used k-means clustering on the features over the entire data set.

Classifier Testing and Results: Political Party

In order to find features based on political party, we generated a list of the most common unigrams, bigrams, and trigrams used in the data. We then weeded out noninformative n-grams, such as "the", "a", or "else". To find good features, we computed the probability of each n-gram. This was determined by calculating the relative frequency of the n-gram by party. For example, if Republicans used the word "freedom" three times as frequently as Democrats used the word "freedom", the probability of the writer who uses the word "freedom" being Republican was computed to be 75%. For simplicity, we only considered the probability of the writer being a member of the majority parties, Republican and Democrat.

The following is a list of some of the probabilities generated. We list the probability of the writer being a member of the Republican Party. The probability that the writer is a member of the Democratic Party = 1 - (probability that the writer is Republican).

"Hussein": Probability Republican = 79%
"Bush": Probability Republican = 33%
"Clinton": Probability Republican = 29%
"McCain": Probability Republican = 48%
"Obama": Probability Republican = 52%
"Cheney": Probability Republican = 16%
"Muslims": Probability Republican = 84%
"Jesus": Probability Republican = 68%
"God": Probability Republican = 73%
"liberals": Probability Republican = 78%
"Liberals": Probability Republican = 85%
"Republicans": Probability Republican = 34%
"Saddam Hussein": Probability Republican = 50%
"President Bush": Probability Republican = 52%
"President Obama": Probability Republican = 70%
"President McCain": Probability Republican = 94%
"in Iraq": Probability Republican = 58%
"God bless": Probability Republican = 72%
"God Bless": Probability Republican = 54%
"President Barack Obama": Probability Republican = 83%
"Barack Hussein Obama": Probability Republican = 93%
"troops in Iraq": Probability Republican = 23%

We found that there was a significant difference in the words and phrases that Republicans and Democrats used.

For testing, we used hold-out cross-validation. We separated the data into a randomly generated training set and test set, with the training set consisting of 80% of the data and the test set consisting of 20% of the data. We recomputed the feature vector each time with the new probabilities given the training data and tested it on the held-out data set.

We created a feature vector using some of the more frequently used and informative features. Features that had about a 50% probability for Republicans and Democrats were left out, as they were not very informative. Also, because bigrams and trigrams were infrequent, they were not used in the feature vector. Features f_i were set to the probabilities calculated from the training data, in the same manner as given above. Weights w_i were set to be equal for all features except the unigram "liberals", which was given three times the weight of the other features because of its high frequency of occurrence. We then summed, over all features, the weight of each feature multiplied by the feature probability to get the probability used by the classifier:

$\sum_i w_i f_i$

Using the test data, we classified writers with a high probability of being a member of the Republican Party (above 49%) as Republican and those with a low probability of being a member of the Republican Party (below 29%) as Democrat. Those with probabilities in the middle were not classified, or were classified as Unknown. Using this heuristic, we were able to classify 30-60% of the test set, with the remaining 40-70% being classified as Unknown. We were able to achieve fairly high accuracy: 94% in the best case and 80% on average. The following graph shows the results using hold-out cross-validation on five randomly generated test and training sets.

Classifier Testing and Results: Gender

In order to find features based on gender, we conducted a literature review. In "Gender, Genre, and Writing Style in Formal Written Texts" by Argamon, Koppel, Fine, and Shimoni, it was found that women use more pronouns and fewer proper nouns than men. We decided to investigate this, as well as other features such as word and sentence length. We calculated the relative frequency of the various pronouns for men and women. For example, if women used the word "myself" three times more frequently than men used the word "myself", the probability of the writer who uses the word "myself" being female was computed to be 75%. We also computed probabilities for average sentence length, average word length, and percentage of proper nouns. The percentage of proper nouns was calculated by dividing the average number of proper nouns for writers of a given gender by the average number of proper nouns overall. The following is a list of some of the probabilities generated. We list the probability of the writer being male. The probability that
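
The relative-frequency feature probabilities and the weighted-sum classifier described in the political party section can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and three details are assumptions the text does not state. First, only features that occur in a writer's text contribute to the sum. Second, the weights are normalized so the score stays between 0 and 1. Third, the 49% and 29% cut-offs are treated as strict thresholds.

```python
def republican_probability(rep_freq, dem_freq):
    """Relative-frequency estimate that a writer using an n-gram is Republican.

    rep_freq and dem_freq are the n-gram's relative frequencies among
    Republican and Democratic writers in the training data.
    """
    total = rep_freq + dem_freq
    if total == 0:
        return 0.5  # assumption: an unseen n-gram is treated as uninformative
    return rep_freq / total


def classifier_score(tokens, feature_probs, weights):
    """Weighted sum of feature probabilities, sum_i w_i * f_i.

    Assumption: only features present in `tokens` are used, and their
    weights are normalized to sum to 1. Features default to weight 1.0.
    """
    present = [f for f in feature_probs if f in tokens]
    if not present:
        return None  # no usable evidence for this writer
    total_weight = sum(weights.get(f, 1.0) for f in present)
    weighted = sum(weights.get(f, 1.0) * feature_probs[f] for f in present)
    return weighted / total_weight


def classify(score, high=0.49, low=0.29):
    """Map a score to Republican, Democrat, or Unknown using two thresholds."""
    if score is None:
        return "Unknown"
    if score > high:
        return "Republican"
    if score < low:
        return "Democrat"
    return "Unknown"


# Feature probabilities taken from the list above; "liberals" gets triple weight.
feature_probs = {"liberals": 0.78, "Cheney": 0.16, "Clinton": 0.29, "Muslims": 0.84}
weights = {"liberals": 3.0}

print(republican_probability(3, 1))  # the "freedom" example: three times as frequent
print(classify(classifier_score({"liberals", "Cheney"}, feature_probs, weights)))
print(classify(classifier_score({"Cheney", "Clinton"}, feature_probs, weights)))
```

The same relative-frequency function applies unchanged to the gender features, with the two frequencies taken from female and male writers instead of the two parties.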

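The hold-out procedure described above (a random 80%/20% split, repeated on several randomly generated sets) could be sketched along these lines. The helper names are hypothetical, and one detail is an assumption: accuracy is computed only over writers who received a party label, with Unknown predictions counted toward coverage instead. The text does not say how accuracy was computed.

```python
import random


def holdout_split(items, train_fraction=0.8, rng=None):
    """Randomly split items into a training set and a held-out test set."""
    rng = rng or random.Random()
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def evaluate(predictions, gold):
    """Return (coverage, accuracy) for a list of predicted and true labels.

    Coverage is the fraction of writers given a party label. Accuracy is
    measured over those labeled writers only (an assumption), and is None
    when every prediction is Unknown.
    """
    classified = [(p, g) for p, g in zip(predictions, gold) if p != "Unknown"]
    coverage = len(classified) / len(gold) if gold else 0.0
    if not classified:
        return coverage, None
    correct = sum(1 for p, g in classified if p == g)
    return coverage, correct / len(classified)


# Five random 80/20 splits over 500 blog identifiers, as in the experiments.
for seed in range(5):
    train, test = holdout_split(range(500), rng=random.Random(seed))
    print(seed, len(train), len(test))
```

In a full run, the feature probabilities would be recomputed from each training set before scoring the matching test set, as the text describes.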
