Stanford CS 224n - Literary Style Classification with Deep Linguistic Analysis Features - D1685466



School name Stanford University

Course Cs 224n- Natural Language Processing with Deep Learning

Pages 7


Literary Style Classification with Deep Linguistic Analysis Features

Hyung Jin Kim
Minjong Chung
Wonhong Lee

Based on the assumption that people in the same professional area have a similar literary style, we inferred that literary styles can be differentiated by the unique terms or expressions used within each class. In this project, we concentrated on utilizing as many features of various characteristics as possible, and on extracting the most meaningful ones from them. As classifiers, we used two different types of machine learning algorithms: Support Vector Machine and Naive Bayes Classifier. Our best model classifies the literary style of authors with 84% accuracy, far better than random guessing among three classes (33.3%).

1. Introduction

In this project, we demonstrate that selecting deep linguistic analysis features such as semantic relationship frequencies reduces classification error significantly compared with more commonly used "shallow" features such as frequencies of articles (the, a), pronouns (he, him), and pro-sentences (yes, okay).

Our model is built and tested on Twitter tweets. We categorized tweets into three different groups: a politician group, a celebrity group, and a technician group. The list of users in each group was obtained from "wefollow.com", which already classifies Twitter users by their profession or interest.

From the data sets, we extract features using Java classes that we call feature extractors. Using the output of feature extractors such as the Information Gain Feature Extractor and the TF-IDF (Term Frequency-Inverse Document Frequency) extractor, we train two machine learning algorithms, SVM and NB. Our best result, the NB classifier, reaches an accuracy of 84% on classification, far better than random guessing (33.33%).

Copyright (c) by Wonhong Lee, Minjong Chung and Hyung Jin Kim. All rights reserved.

2. Prior Work

Before starting our project, we surveyed related papers. We drew on the following three.

The first was "Linguistic correlates of style: authorship classification with deep linguistic analysis features" by Michael Gamon (2004). Although authorship identification has been an interesting topic in the field of natural language processing, professional identification has not been researched actively. However, with the development of various social network websites, professional identification has become an important issue for e-commerce companies, because it lets them categorize their customers as well as their products effectively. Therefore, we implement professional identification based on the methods used in this first reference paper. The methods of style categorization used in that paper are frequencies of function words (Mosteller et al. 1964), word length and sentence length (dating back to 1851 according to Holmes 1998), word tags (Argamon et al. 1998), and stability features (Koppel et al. 2003). To these we added several methods of our own: POS tagging, TF-IDF, and a manual word selection approach. Details of each method are explained below.

The second was "Short Text Classification in Twitter to Improve Information Filtering" by Bharath Sriram and Dave Fuhry (2010). This paper classifies incoming tweets into categories such as News (N), Events (E), Opinions (O), and Private Messages (PM), while we classify tweets into a different set of categories: a celebrity group, a politician group, and a technician group.
Therefore, our classification model can estimate which of the above three groups the writing style of a Twitter user belongs to. We also use more features, such as Term Frequency-Inverse Document Frequency (TF-IDF) and POS tagging, and we analyze and compare two different machine learning techniques, Support Vector Machines and the Naive Bayes Classifier. Because we work in the same domain, "Twitter.com", comparing our results with those of the reference paper is meaningful.

Department of Computer Science, Stanford University

The last was "Entity Based Sentiment Analysis on Twitter" by Siddharth Batra and Deepak Rao (2010), one of the final projects from the previous year's CS224N class. In this paper, the authors extract word clouds for entity words by classifying opinions as positive, negative, or neutral. The main difference between our project and their approach is that they only consider opinion sentences or phrases. However, sentences or phrases based on fact usually contain more important information, and we try to utilize fact-based tweets in our project.

3. Approach

Our work on this project largely concentrated on utilizing as many features of sentences as possible, and on reducing the dimensionality of the features to avoid overfitting and to improve the accuracy of the classifiers we built.

3.1. Features

3.1.1. Basic features

As a baseline, we used the occurrence of each word as a binary feature by parsing every sentence in the training dataset. Since Twitter data contains many irrelevant tokens such as URLs, Twitter user IDs, and typos, we used a stemming algorithm to find the root form of each word and filtered out all inappropriate features.
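
As a minimal sketch of the binary word-occurrence features described above, the following Python code drops URLs and Twitter user IDs and then marks which vocabulary words occur in a tweet. It is an illustration only, not the authors' Java feature extractor: the regular expressions, function names, and example tweets are assumptions, and the stemming step is omitted.

```python
import re

# Hypothetical patterns for tokens the text calls irrelevant: URLs and Twitter user IDs.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
TOKEN_PATTERN = re.compile(r"[a-z']+")


def tokenize(tweet):
    """Lowercase a tweet, drop URLs and @mentions, and return word tokens."""
    text = tweet.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return TOKEN_PATTERN.findall(text)


def build_vocabulary(tweets):
    """Map every distinct token in the training tweets to a feature index."""
    vocabulary = {}
    for tweet in tweets:
        for token in tokenize(tweet):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def binary_features(tweet, vocabulary):
    """Return a 0/1 vector with 1 where the vocabulary word occurs in the tweet."""
    vector = [0] * len(vocabulary)
    for token in tokenize(tweet):
        index = vocabulary.get(token)
        if index is not None:
            vector[index] = 1
    return vector


# Made-up example tweets, used only to exercise the functions.
training_tweets = [
    "Voting on the budget bill today http://example.com",
    "Thanks @fan for coming to the show!",
]
vocabulary = build_vocabulary(training_tweets)
example_vector = binary_features("The show today was great", vocabulary)
```

Words that never appeared in the training tweets ("was" and "great" in the example) are ignored, so every tweet maps to a vector of the same length.
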
Additionally, we ignored stopwords as feature words, since we believed they do not play an important role in the classification of literary style.

Even though removing seemingly irrelevant words makes sense at first, we found that using stopwords or grammatically incorrect words as features can sometimes lead to better performance. This happens because, although a word such as "the" occurs in many sentences, its frequency or position within a sentence may still be meaningful for classification.

Instead of using binary values, we also tried giving a weight to each feature dimension. We used TF-IDF weighting to encode the importance of each feature.

3.1.2. Syntactic features

To incorporate the syntactic information of sentences, we used a well-established POS tagger. We found that concatenating the POS tag to the word and using the result as a feature can improve the performance of some classifiers. Equipped with the POS tagger, we were also able to select specific classes of words as features, such as nouns, verbs, or adjectives, some of which give a slight improvement when used alone.

3.1.3. Semantic features

Synonyms of a word
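
Returning to the TF-IDF weighting of Section 3.1.1 and the word/POS concatenation of Section 3.1.2, the two ideas can be sketched together in Python as follows. This is an illustration under stated assumptions, not the authors' Java implementation: the source does not say which TF-IDF variant was used, so the code takes one common form (relative term frequency times log of inverse document frequency), and the tagged tweets are hand-written stand-ins for real POS tagger output.

```python
import math
from collections import Counter


def word_tag_features(tagged_tokens):
    """Join each (word, POS tag) pair into one feature string such as 'vote/NN'."""
    return [f"{word.lower()}/{tag}" for word, tag in tagged_tokens]


def tf_idf(documents):
    """Compute TF-IDF weights for a list of documents (each a list of features).

    Uses tf = count / document length and idf = log(N / document frequency).
    This is one common variant and an assumption, not the source's definition.
    """
    document_count = len(documents)
    document_frequency = Counter()
    for features in documents:
        # Count each feature once per document.
        document_frequency.update(set(features))

    weighted = []
    for features in documents:
        counts = Counter(features)
        total = len(features)
        weighted.append({
            feature: (count / total)
            * math.log(document_count / document_frequency[feature])
            for feature, count in counts.items()
        })
    return weighted


# Hypothetical, hand-tagged tweets standing in for real POS tagger output.
tagged_tweets = [
    [("We", "PRP"), ("vote", "VBP"), ("today", "NN")],
    [("The", "DT"), ("vote", "NN"), ("passed", "VBD")],
    [("We", "PRP"), ("ship", "VBP"), ("code", "NN"), ("today", "NN")],
]
documents = [word_tag_features(tweet) for tweet in tagged_tweets]
weights = tf_idf(documents)
```

Because the tag is part of the feature string, "vote" as a verb and "vote" as a noun become separate features. With this variant, a feature that occurs in every document receives a weight of zero.
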
