Author Identification for LiveJournal
Alyssa Liang

The problem
• LiveJournal: a blogging website
• Given a document (an entry), identify the author
• Hierarchical classification
  • first classify by gender
  • then classify the author based on gender

[Diagram: classification hierarchy]
Document
  Male: Male 1, Male 2
  Female: Female 1, Female 2, Female 3

Features
• Unigrams and bigrams
• Average sentence length and average word length
• Number of words and number of distinct words
• Number of sentences per paragraph
• Number of UPPERCASE characters
• Number of words not in the dictionary
• Number of words with length <= 4
• Number of characters in italics, bold, and strikethrough

The 3 Classifiers
• Naïve Bayes (generative model)
  • Assumes the features in a document are independent of each other given the class
  • Implemented the multivariate Bernoulli model
  • Represents only whether a feature appears in the document, not how many times it appears
• Decision Trees
  • An internal node is a test of a feature, and each branch from the node represents one of the values that feature can take
  • A leaf node represents a classification
  • Build a smallish tree from the training data using minimum average entropy
• Maximum Entropy (conditional model)
  • “model all that is known and assume nothing about that which is unknown”
  • Tries to find the most uniform model that satisfies the constraints, i.e. maximize the entropy

Results
• Hierarchical classification has no benefits
• Need to improve gender classification; could use different features

                        Training Set                           Test Set
                        decision tree  Naïve Bayes  maxent     decision tree  Naïve Bayes  maxent
gender                  1.000          0.628        0.796      0.639          0.659        0.752
female                  1.000          0.606        0.855      0.590          0.637        0.702
male                    1.000          0.566        0.919      0.552          0.545        0.628
all                     1.000          0.427        0.700      0.350          0.459        0.550
author                  0.770          0.618        0.728      0.559          0.634        0.644
hierarchical overall    0.770          0.388        0.580      0.357          0.418        0.484

[Figure: Feature Reduction. Accuracy (y-axis, 0.350 to 0.750) versus # features (x-axis: all, 512, 256, 128, 64, 32, 16, 8, 4), with one series for the training set and one for the test set.]

• Feature Reduction (on gender classification)
  • took the 512 most important features and reran maxent training; then took the 256 most important features, etc.
  • Results proved to be very stable
  • Best features consisted mostly of bigrams (many of which contained punctuation).
  • Also chose features where there was a large difference between male and female (number of distinct words, UPPERCASE letters, short
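
The slides do not state how the "hierarchical overall" row of the results table is computed. The reported numbers are, however, consistent with multiplying the "gender" accuracy by the "author" accuracy, which is what one would expect if the "author" row is the accuracy of the second stage given a correct gender prediction. That reading is an assumption, and the small Python check below only confirms that the arithmetic matches the table to within rounding. The function name `hierarchical_accuracy` is made up for illustration.

```python
# Accuracies copied from the results table: (gender, author, hierarchical overall).
reported = {
    ("training", "decision tree"): (1.000, 0.770, 0.770),
    ("training", "naive bayes"): (0.628, 0.618, 0.388),
    ("training", "maxent"): (0.796, 0.728, 0.580),
    ("test", "decision tree"): (0.639, 0.559, 0.357),
    ("test", "naive bayes"): (0.659, 0.634, 0.418),
    ("test", "maxent"): (0.752, 0.644, 0.484),
}


def hierarchical_accuracy(gender_acc, author_acc):
    """Probability that both stages are right, ASSUMING author_acc is
    measured on entries whose gender was classified correctly."""
    return gender_acc * author_acc


for (split, classifier), (gender, author, overall) in reported.items():
    product = hierarchical_accuracy(gender, author)
    # Print the product next to the value reported on the slide.
    print(f"{split:8s} {classifier:13s} product={product:.3f} reported={overall:.3f}")
```

Under this reading, an error in the gender stage cannot be recovered by the author stage, which fits the slide's remark that gender classification needs to improve.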
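
The multivariate Bernoulli Naïve Bayes model described under "The 3 Classifiers" can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the add-one smoothing, the function names `train_bernoulli_nb` and `predict`, and the toy documents and author labels are all assumptions introduced here. It covers only presence/absence features such as unigrams and bigrams; the numeric features listed on the slides (for example average sentence length) would first need to be turned into binary features.

```python
import math


def train_bernoulli_nb(docs, labels, vocab):
    """docs is a list of sets, each holding the features present in one entry.
    Returns log class priors and per-class P(feature present)."""
    log_prior = {}
    p_feature = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        # Add-one smoothing (an assumption, not stated in the slides) keeps
        # every probability strictly between 0 and 1.
        p_feature[c] = {
            f: (sum(f in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for f in vocab
        }
    return log_prior, p_feature


def predict(doc, log_prior, p_feature, vocab):
    """Return the class with the highest log posterior for one entry."""
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for f in vocab:
            p = p_feature[c][f]
            # Bernoulli model: only presence matters, and an absent
            # feature contributes (1 - p) rather than being ignored.
            score += math.log(p if f in doc else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: two authors, two entries each.
docs = [{"lol", "!!"}, {"lol"}, {"indeed"}, {"indeed", "whilst"}]
labels = ["author_a", "author_a", "author_b", "author_b"]
vocab = sorted(set().union(*docs))

log_prior, p_feature = train_bernoulli_nb(docs, labels, vocab)
print(predict({"lol"}, log_prior, p_feature, vocab))
print(predict({"whilst"}, log_prior, p_feature, vocab))
```

Representing each entry as a set makes the "appeared or not" assumption explicit: repeating a word in an entry changes nothing, which is the difference between this model and a count-based (multinomial) one.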