Stanford CS 224 - Lecture Notes

This preview shows page 1-2 out of 5 pages.



Author Identification for LiveJournal
Alyssa Liang

The problem
• LiveJournal – a blogging website
• Given a document (an entry), identify the author
• Hierarchical classification
    • first classify by gender
    • then classify author based on gender

[Diagram: Document branches into Male and Female; Male branches into Male 1 and Male 2; Female branches into Female 1, Female 2, and Female 3]

Features
• Unigrams & bigrams
• Average sentence and word length
• Number of words and distinct words
• Number of sentences in paragraph
• Number of UPPERCASE characters
• Number of words not in the dictionary
• Number of words with length <= 4
• Number of characters in italics, bold, and strikethrough

The 3 Classifiers
• Naïve Bayes – generative model
    • Assumes the features in a document are independent of each other given the class
    • Implemented the multi-variate Bernoulli model
    • Records only whether a feature appeared in the document, not the number of times it appears
• Decision Trees
    • An internal node is a test of a feature, and each branch from the node represents one of the values the feature can take
    • A leaf node represents a classification
    • Build a smallish tree from the training data using minimum average entropy
• Maximum Entropy – conditional model
    • “model all that is known and assume nothing about what is unknown”
    • Tries to find the most uniform model that satisfies the constraints, i.e. maximize the entropy

Results
• Hierarchical classification has no benefits
• Need to improve gender classification – could use different features

                       Training Set                           Test Set
                       decision tree  Naïve Bayes  maxent     decision tree  Naïve Bayes  maxent
gender                 1.000          0.628        0.796      0.639          0.659        0.752
female                 1.000          0.606        0.855      0.590          0.637        0.702
male                   1.000          0.566        0.919      0.552          0.545        0.628
all                    1.000          0.427        0.700      0.350          0.459        0.550
author                 0.770          0.618        0.728      0.559          0.634        0.644
hierarchical overall   0.770          0.388        0.580      0.357          0.418        0.484

[Figure: Feature Reduction. Accuracy (axis from 0.350 to 0.750) against # features (all, 512, 256, 128, 64, 32, 16, 8, 4), plotted for the training set and the test set]

• Feature Reduction (on gender classification)
    • took the 512 most important features and reran maxent training; then took the 256 most important features, etc.
    • Proved to be very stable
    • Best features consisted mostly of bigrams (many of which contained punctuation).
    • Also chose features where there was a large difference between male and female (number of distinct words, UPPERCASE letters, short
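
The multi-variate Bernoulli Naïve Bayes model named under "The 3 Classifiers" can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function names, the add-one smoothing (alpha), and the toy documents and author labels are all assumptions made for the example. Each document is reduced to the set of features it contains, so counts are discarded, and absent features also contribute to the score.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels, vocab, alpha=1.0):
    """docs: list of sets of features present; labels: parallel class names.
    alpha is an assumed Laplace smoothing constant (not from the slides)."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for features, label in zip(docs, labels):
        class_counts[label] += 1
        for f in features & vocab:
            feature_counts[label][f] += 1
    model = {}
    for label, n_c in class_counts.items():
        log_prior = math.log(n_c / len(docs))
        # P(feature present | class), smoothed over the two outcomes
        p_present = {f: (feature_counts[label][f] + alpha) / (n_c + 2 * alpha)
                     for f in vocab}
        model[label] = (log_prior, p_present)
    return model

def predict(model, features, vocab):
    """Return the class with the highest log posterior score."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, p_present) in model.items():
        score = log_prior
        for f in vocab:
            p = p_present[f]
            # Present features contribute p, absent ones contribute 1 - p
            score += math.log(p if f in features else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: two authors, four tiny "entries"
docs = [{"lol", "!!"}, {"lol", "omg"}, {"the", "code"}, {"code", "bug"}]
labels = ["author_a", "author_a", "author_b", "author_b"]
vocab = set().union(*docs)
model = train_bernoulli_nb(docs, labels, vocab)
print(predict(model, {"lol"}, vocab))
print(predict(model, {"code"}, vocab))
```

Because the model only tracks presence, a feature that occurs ten times in an entry is treated the same as one that occurs once.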

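
Choosing a decision-tree test by "minimum average entropy" can be illustrated with the short sketch below. It assumes categorical features and uses the size-weighted average of the label entropy in each branch; the slides do not specify these details, and the feature names and class labels here are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def average_entropy(rows, labels, feature):
    """Size-weighted average entropy of the subsets produced by splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def best_split(rows, labels, features):
    """Pick the feature whose split leaves the lowest average entropy."""
    return min(features, key=lambda f: average_entropy(rows, labels, f))

# Hypothetical toy data: two boolean features, two classes
rows = [
    {"many_caps": True, "short_sentences": True},
    {"many_caps": True, "short_sentences": False},
    {"many_caps": False, "short_sentences": True},
    {"many_caps": False, "short_sentences": False},
]
labels = ["class_1", "class_1", "class_2", "class_2"]
chosen = best_split(rows, labels, ["many_caps", "short_sentences"])
print(chosen)
```

In this toy set, splitting on many_caps separates the classes perfectly, while short_sentences leaves each branch evenly mixed, so the first feature is preferred. A full tree builder would apply the same choice recursively to each branch.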

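
The slides do not say how the "hierarchical overall" row was computed. One plausible reading, offered here only as an assumption, is that the "author" row is accuracy given the gender stage, so the overall figure is the product of the "gender" and "author" rows. The check below uses the values from the results table and shows that the reported numbers agree with that product to within rounding.

```python
# (gender accuracy, author accuracy, reported hierarchical overall),
# copied from the results table
reported = {
    ("training", "decision tree"): (1.000, 0.770, 0.770),
    ("training", "Naive Bayes"): (0.628, 0.618, 0.388),
    ("training", "maxent"): (0.796, 0.728, 0.580),
    ("test", "decision tree"): (0.639, 0.559, 0.357),
    ("test", "Naive Bayes"): (0.659, 0.634, 0.418),
    ("test", "maxent"): (0.752, 0.644, 0.484),
}

# Largest gap between gender * author and the reported overall value
max_gap = 0.0
for (split, clf), (gender, author, overall) in reported.items():
    product = gender * author
    max_gap = max(max_gap, abs(product - overall))
    print(f"{split:8s} {clf:13s} product={product:.3f} reported={overall:.3f}")

print(f"largest gap: {max_gap:.4f}")
```

Under this reading, an error at the gender stage cannot be recovered at the author stage, which fits the slide's remark that gender classification needs to improve before the hierarchy can help.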