Gender Classification of Japanese Authors
David Edwards & Cybelle Smith

Gendered Speech in Japanese
Gender of speaker may be overtly marked:
• Gender-specific first-person pronouns: 僕 (boku, male); 俺 (ore, male); 私 (watashi, female or neutral)
Question: Does gender have less-overt effects on Japanese texts as well?
Can word choice, morphology, writing style indicate gender, even in noisy environments like fiction writing?

Corpora
“Peace” Corpus
• 29 personal essays by middle school students
• Topic: “Peace”
• 29 authors:
  – 22 female
  – 7 male
“Bookstudio” Corpus
• 485 installments of online novels
• Genre: Fantasy
• 40 authors:
  – 20 female
  – 20 male
• Also collected ~181 installments from authors of unknown gender (for future research)

Our Baseline - The “Boku” Test
Corpus      Male Accuracy  Female Accuracy  Overall Accuracy
Peace       .71            1.0              .93
Bookstudio  .91            .43              .67

Classifiers Used
Naïve Bayes:
• Build conditional probabilities of features given gender
• Calculate probability of test data given a particular gender
• Select highest-probability gender
SVM:
• Used the LIBSVM free classifying tool
• Find dividing hyperplane in num-feature dimensional space
  – Requires problem-specific parameters chosen via cross-validation
• Apply hyperplane to test data
Also attempted Logistic Regression

Chasen: Segmenter and POS-tagger
Output columns: Stem, Pronunciation, Lemma, Part of Speech
[Example Chasen output table; the Japanese text was not recoverable.]

Features
Stem    Pron    Lemma  POS
KURAki  kuraki  KURAi  adjective - independent
[The Japanese-script row of this example was not recoverable.]

Features
[Example text with each character labeled by script; the Japanese text was not recoverable.]
• Kanji (Chinese character)
• Hiragana (phonetic)
• Katakana (phonetic, like italics)

Single-feature performance on Naïve Bayes:
Feature           Indic  Stem  Lem  Pron  POS  Quot  WS   SPDWS1  SPDWS2
Male Accuracy     .29    .67   .68  .70   .80  .23   .66  .49     .87
Female Accuracy   .51    .77   .78  .74   .45  .33   .85  .81     .68
Overall Accuracy  .40    .72   .73  .72   .63  .28   .76  .66     .77

Multi-feature performance on Naïve Bayes:
Feature columns: Stem, Lem, Pron, POS, Quot, WS, SPDWS1, SPDWS2
[The column positions of the X marks were lost; only the number of features marked per trial survives.]
Trial  Features marked  Male Acc.  Female Acc.  Overall Acc.
1      2                .63        .73          .68
2      2                .81        .73          .77
3      2                .70        .76          .73
4      2                .68        .76          .72
5      2                .68        .78          .73
6      8 (all)          .70        .70          .70
7      7                .70        .73          .71

SVM Performance
• Optimizations:
  – Scaling counts to avoid swamping low-frequency features
  – Selecting optimal error rate and kernel parameters
Accuracy
Features                           No Scaling  Scaling  Cross Validation (Training Set)  Cross Validation (Test Set)
All features (except quotations)   50.6%       48.5%    79.7%                            50.0%
Part of Speech                     50.9%       53.0%    68.0%                            47.3%
Wordshape                          50.6%       63.3%    75.2%                            50.6%
Pronunciation                      50.6%       64%      77.8%                            51.8%

Conclusion
• Without considering gendered pronouns, we achieved similar performance
• Most-indicative feature: wordshape (use of kanji vs. hiragana vs. katakana etc.), especially where multiple options exist
• Point of interest: male and female Japanese authors differ not just in the words they use, but how they choose to write those
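
Sketch: a wordshape feature
The slides name wordshape (kanji vs. hiragana vs. katakana) as the most indicative feature but do not say how it was computed. The Python sketch below is one plausible reading, not the authors' implementation. The names char_script, wordshape and shape_counts, the one-letter script codes, and the choice to collapse runs of the same script are all assumptions made for illustration. Scripts are detected from the main Unicode blocks only, so rarer characters fall into "other".

```python
from collections import Counter

def char_script(ch: str) -> str:
    """Map one character to a script code (codes are an assumption of this sketch)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "H"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "K"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (kanji)
        return "C"
    return "O"                     # anything else: Latin, digits, punctuation

def wordshape(token: str) -> str:
    """Collapse a token into its sequence of scripts, merging adjacent repeats."""
    shape = []
    for ch in token:
        script = char_script(ch)
        if not shape or shape[-1] != script:
            shape.append(script)
    return "".join(shape)

def shape_counts(tokens) -> Counter:
    """Count wordshapes over a token list, e.g. as input counts for a classifier."""
    return Counter(wordshape(t) for t in tokens)
```

With this sketch the same word boku gets a different shape depending on how the author writes it: wordshape("僕") gives "C", wordshape("ぼく") gives "H", and wordshape("ボク") gives "K". That is the kind of "multiple options" case the conclusion points to.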